Sinno and Qiang have suggested the approach of transfer learning. Sometimes we have to solve a classification problem in one domain but have an adequate amount of training data only in another domain. In such cases, we can transfer knowledge from one domain to the other. This approach can accelerate development and reduce the effort required to create a model from scratch. In many real-world applications, it is not feasible to collect the data needed for training, and knowledge transfer is then highly advantageous. Sinno and Qiang explained that, given a source domain Ds with learning task Ts and a target domain Dt with learning task Tt, transfer learning tries to improve the learning of the target predictive function ft(·) in Dt using the knowledge in Ds and Ts, where Ds ≠ Dt or Ts ≠ Tt.
The authors categorized transfer learning into three categories based on the source and target domains and tasks.
- • Inductive transfer learning: Here, the target task is different from the source task, and labeled data in the target domain are required to induce a predictive model ft(·).
- • Transductive transfer learning: Here, the source and target domains are different while the tasks are the same; for example, the feature space is shared between the two domains but the probability distribution over it differs.
- • Unsupervised transfer learning: Here, no labeled data are available in either the source or the target domain. It is used for tasks such as dimensionality reduction and clustering.
Within inductive transfer learning, if there are no labeled data in the source domain, the setting is self-taught learning; if labeled data are available in the source domain, it belongs to multitask learning, in which there are both source and target tasks. Negative transfer occurs when the source domain data and task reduce performance in the target domain. Clustering techniques are used to avoid the problem of negative transfer.
Guo et al. observed that the output of the standard sigmoid is not sparse: to produce sparse data, it needs penalty factors that drive a large number of redundant values close to 0, and unsupervised pretraining is also needed. The authors explain that a CNN includes convolutional layers, pooling layers, and a fully connected layer.
The convolutional layer extracts features from the input. The input is convolved with learned kernels, and the result is passed through a nonlinear activation function.
The pooling layer performs dimensionality reduction; the operation may be average or maximum pooling, depending on the feature space. The network has a simple structure and a low computational cost, although its recognition rate is not high. Among the networks compared, the shallow network gave the best performance. The system was tested on the Modified National Institute of Standards and Technology (MNIST) dataset of gray-scale images of handwritten digits from 0 to 9, which contains 60,000 training images and 10,000 test images.
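The convolution, activation, and pooling steps described above can be sketched in plain Python. This is a minimal illustration, not the network of Guo et al.: the 5 × 5 input, the 2 × 2 kernel, and the choice of ReLU as the nonlinearity are all assumptions made for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Nonlinear activation: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; incomplete windows at the border are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical input and kernel, chosen only to keep the arithmetic small.
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [1, 3, 2, 0, 1],
]
kernel = [[1, 0], [0, -1]]

features = relu(conv2d_valid(image, kernel))  # 4 x 4 feature map
pooled = max_pool(features, size=2)           # 2 x 2 after pooling
```

Pooling halves each spatial dimension here, which is the dimensionality reduction the text refers to; replacing `max` with an average gives average pooling.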
Nisha et al. have proposed an expert system for diagnosing children’s skin diseases. The expert system is rule-based and uses a forward chaining technique to retrieve inferences from the information store. The rules have an IF-THEN structure, in which the condition in the IF block is related to the information in the THEN block. The system also asks a set of questions about the disease symptoms and generates a diagnosis accordingly. For image input, the algorithm converts each RGB image into a color space; the K-means clustering algorithm receives the features extracted from the image as input, and neural networks are finally used to recognize the disease.
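Forward chaining over IF-THEN rules can be sketched as follows. The rules and symptom names are hypothetical placeholders invented for the example (they are neither the rule base of Nisha et al. nor medical guidance); only the inference loop is the point.

```python
def forward_chain(rules, facts):
    """Fire IF-THEN rules repeatedly until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when every IF condition is already a known fact.
            if conclusion not in known and conditions <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rules: (set of IF conditions, THEN conclusion).
RULES = [
    (frozenset({"itching", "red_patches"}), "possible_eczema"),
    (frozenset({"possible_eczema", "dry_skin"}), "suggest_eczema_followup"),
    (frozenset({"blisters", "fever"}), "possible_chickenpox"),
]

# Answers collected from the question-and-answer step (also hypothetical).
derived = forward_chain(RULES, {"itching", "red_patches", "dry_skin"})
```

The second rule fires only after the first has added its conclusion, which is what distinguishes forward chaining from a single pass over the rules.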
Vinay et al. described a system with two stages. Stage 1 uses computer vision to identify skin diseases from features extracted from the images with various image processing techniques. This stage comprises two substages: the first filters the input data and extracts the required features, and the second identifies the disease by using the maximum entropy model and artificial neural networks (ANNs).
For the identification of the disease, the system creates a model. The extracted features were supplied to a maximum entropy model, which gave better results than other models because of the number of feature functions and its pertinence to this problem. As per the principle of maximum entropy stated by Jaynes, in making inferences based on incomplete information, we must use the probability distribution that maximizes entropy subject to whatever is known
where x = (a, b), a ∈ A, b ∈ B, and E = A × B
where v is the model parameter. The conditional log-likelihood is used to train the model parameters as described. Theta, as described by Kevin and Noha, is the model parameter (vector).
The following equation computes the output by selecting the disease using the highest probability:
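In conventional notation, the quantities referred to above have the following standard forms. This is a sketch of the textbook maximum entropy formulation, not necessarily the authors' exact equations: the feature functions f_j, the normalizer Z_θ(b), the training pairs (a_i, b_i), the use of θ for the parameter vector, and the reading of a as the disease and b as the observed features are assumptions.

```latex
% Entropy of a distribution p over events x = (a, b), with E = A \times B
H(p) = -\sum_{x \in E} p(x)\,\log p(x)

% Conditional maximum entropy (log-linear) model with parameter vector \theta
p_\theta(a \mid b) = \frac{\exp\!\Big(\sum_j \theta_j f_j(a, b)\Big)}{Z_\theta(b)},
\qquad
Z_\theta(b) = \sum_{a' \in A} \exp\!\Big(\sum_j \theta_j f_j(a', b)\Big)

% Conditional log-likelihood maximized during training
L(\theta) = \sum_i \log p_\theta(a_i \mid b_i)

% Prediction: the disease with the highest probability
\hat{a} = \arg\max_{a \in A} \; p_\theta(a \mid b)
```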
For the classification part, the algorithms used were K-nearest neighbors and decision trees. The system extracts features from the image. The first feature is the infected area or region of the skin; the system uses the Sobel operator, together with image segmentation algorithms, to detect the shape of the infected region. Another feature is the absence of bumps around hair follicles, which is detected algorithmically.
These features are given a probability distribution based on maximum entropy.
The ANN contains one input layer, two hidden layers, and an output layer. The input layer takes the extracted features, the first hidden layer comprises sigmoid neurons, the second hidden layer uses the tanh function, and the output layer gives the probability distribution among diseases.
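A forward pass through a network of this shape can be sketched as below. The layer sizes, the weights, and the use of softmax to produce the output distribution are assumptions made for the example; the source states only that the output layer gives a probability distribution among diseases.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Normalize scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, params):
    h1 = [sigmoid(v) for v in dense(features, params["W1"], params["b1"])]
    h2 = [math.tanh(v) for v in dense(h1, params["W2"], params["b2"])]
    return softmax(dense(h2, params["W3"], params["b3"]))


# Hypothetical, untrained parameters: 3 features -> 2 -> 2 -> 3 classes.
params = {
    "W1": [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],
    "b1": [0.0, 0.1],
    "W2": [[1.0, -1.0], [0.5, 0.5]],
    "b2": [0.0, 0.0],
    "W3": [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]],
    "b3": [0.0, 0.0, 0.0],
}

probabilities = forward([0.2, 0.7, 0.1], params)
```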
The system was tested on six dermatological diseases. Because of underfitting, the authors concluded that ANNs are not suitable for that dataset; an ANN requires a substantial training dataset and has a high computation cost. The system predicts the disease from features extracted from the image and uses various if conditions for classification. It first converts the images to gray scale, and several methods are then applied to obtain the features of the infected region.
Pravin and Shirsat have proposed a system for early detection of skin cancer, psoriasis, and dermatophytosis. The system uses a support vector machine with a radial basis function kernel for image classification, and a wavelet tool was used to remove noise from the dataset. Classification is done by a two-level classifier: the first level detects and excludes noise, extracts the features, and classifies the input as normal or abnormal, and the second level classifies the category of disease. The classification relies on statistical analysis: the parameters of each image are calculated and used to classify it, and the number of statistical parameters can be increased for better performance. The database contains 130 images of each disease.
The system requires gray-scale images and converts any other input to gray scale; the image is then smoothed and its noise removed using a median filter. The system calculates the mean, standard deviation, and entropy of each image, and these features are fed into the classifier. The system uses various algorithms to improve the quality of the image and uses the AdaBoost classifier to correlate the mean, standard deviation, and entropy. After these operations, the system can classify some diseases with reasonable accuracy.
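The preprocessing and feature step described above (median filtering followed by mean, standard deviation, and entropy) can be sketched in plain Python. The 3 × 3 window, the clamped border handling, the use of the population standard deviation, and the tiny test image are assumptions for illustration; the source does not specify them.

```python
import math
from statistics import mean, median, pstdev


def median_filter(image, size=3):
    """Replace each pixel by the median of its neighbourhood (borders are clamped)."""
    h, w = len(image), len(image[0])
    r = size // 2
    filtered = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            row.append(median(window))
        filtered.append(row)
    return filtered


def shannon_entropy(pixels):
    """Entropy (in bits) of the gray-level histogram."""
    counts = {}
    for p in pixels:
        counts[p] = counts.get(p, 0) + 1
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def image_features(image):
    """Return (mean, standard deviation, entropy) of the denoised image."""
    pixels = [p for row in median_filter(image) for p in row]
    return mean(pixels), pstdev(pixels), shannon_entropy(pixels)


# Hypothetical gray-scale patch with one impulse-noise pixel in the centre.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
features = image_features(noisy)
```

The resulting three-element feature vector is what would be passed to the classifier; the classifier itself is outside the scope of this sketch.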
Kevin and Noha used task-specific cost functions for training the model. The system uses a softmax margin for the training data. The softmax margin focuses on high-cost outputs and offers a probabilistic interpretation of the training data. It is easy to implement and gave the best results compared to other models. Hyperparameters are applied to data and used to label the data. The Jensen risk bound performs similarly to risk but takes less time to train; it uses stochastic gradient ascent with a fixed step size.
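The idea of a softmax margin can be illustrated for a single multiclass example: a cost is added to the score of each candidate output inside the log-sum-exp, so high-cost outputs are penalized more. This is a simplified sketch for a flat label set under assumed names; the scores and costs are hypothetical, and the original method targets structured prediction models.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax_margin_loss(scores, gold, cost):
    """Per-example loss: -score(gold) + log sum_y exp(score(y) + cost(y)).

    scores[y] is the model score of label y; cost[y] is the cost of
    predicting y when `gold` is correct (cost[gold] is expected to be 0).
    """
    augmented = [s + c for s, c in zip(scores, cost)]
    return -scores[gold] + log_sum_exp(augmented)


scores = [2.0, 1.0, 0.5]      # hypothetical model scores for three labels
gold = 0
zero_cost = [0.0, 0.0, 0.0]   # no cost: ordinary negative conditional log-likelihood
task_cost = [0.0, 1.0, 3.0]   # hypothetical task costs

plain_loss = softmax_margin_loss(scores, gold, zero_cost)
margin_loss = softmax_margin_loss(scores, gold, task_cost)
```

With all costs set to zero the loss reduces to the usual negative conditional log-likelihood, which is why the method keeps a probabilistic interpretation.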
Howard has used the MobileNets model for mobile applications. The MobileNet structure is built from depthwise separable filters, which are also used in inception models for performing computation in layers. Different techniques can be applied to a network structure to make it smaller, such as hashing, quantization, factoring, or compressing the network. A depthwise separable convolution splits into two layers: the first layer is used for filtering, and the purpose of the other layer is to combine the filtered inputs into a new set of outputs. The system uses these factored convolutions to reduce the computational cost. Downsampling is handled in the depthwise separable layers, and before the last layer, pooling reduces the spatial resolution to one. Nearly all of the computation is performed in 1 × 1 convolutions, called pointwise convolutions.
For training the MobileNet model, the authors used TensorFlow. In addition, the system uses inception training for limiting the size of inputs, which improves performance. Since the application requires fast performance, the model should be properly compressed and perform computation faster. The authors applied hyperparameters to compress and quantize the model so that it can be easily integrated into mobile applications. A batch norm and a Rectified Linear Unit (ReLU) follow all layers in the MobileNet model except the final fully connected layer. ReLU is an activation function used in deep learning models; models that use ReLU are easier to train and often achieve better performance. The rectified linear activation function is a piecewise linear function that outputs the input directly if it is positive and outputs zero otherwise.
To convert the model into a compressed model, a width multiplier is used; its role is to thin each layer uniformly. To reduce the overall computation cost further, the system applies a resolution multiplier to the input image, which in turn reduces the internal representation of every layer.
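The effect of the depthwise separable factorization and of the two multipliers can be checked with a short cost calculation. The sketch below counts multiply-accumulate operations for one layer, using H × W for the output feature map, N input channels, M output channels, and a K × K kernel; the function names, the parameter names `alpha` (width multiplier) and `rho` (resolution multiplier), and the layer sizes are assumptions chosen for the example.

```python
def relu(x):
    """Rectified linear unit: pass positive inputs through, output zero otherwise."""
    return x if x > 0 else 0.0


def standard_cost(h, w, n, k, m):
    """Multiply-accumulates of a standard K x K convolution."""
    return h * w * n * k * k * m


def separable_cost(h, w, n, k, m, alpha=1.0, rho=1.0):
    """Multiply-accumulates of a depthwise separable convolution.

    alpha thins the channel counts; rho shrinks the spatial resolution.
    """
    h_r, w_r = rho * h, rho * w
    n_a, m_a = alpha * n, alpha * m
    depthwise = h_r * w_r * n_a * k * k   # one K x K filter per input channel
    pointwise = h_r * w_r * n_a * m_a     # 1 x 1 convolution mixing channels
    return depthwise + pointwise


# Hypothetical layer: 14 x 14 feature map, 512 -> 512 channels, 3 x 3 kernel.
full = separable_cost(14, 14, 512, 3, 512)
ratio = full / standard_cost(14, 14, 512, 3, 512)   # equals 1/M + 1/K**2
thin = separable_cost(14, 14, 512, 3, 512, alpha=0.5)
small = separable_cost(14, 14, 512, 3, 512, rho=0.5)
```

For a 3 × 3 kernel the ratio is a little over one ninth, and the pointwise term dominates the separable cost, which is consistent with most of the computation sitting in the 1 × 1 convolutions.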
The MobileNet model gives the best performance compared with several models. It was tested on recognition with the Stanford Dogs dataset.
Sourav et al. used pretrained image recognizers for the identification of disease. They used the transfer learning concept: the feature extraction part is reused and the classification part is retrained with the dataset. In transfer learning, the last layer is retrained for the dataset so that the model can be used in the target application. The system uses the Inception V3 and Inception Resnet V2 networks for feature extraction, and learning algorithms are applied to the training data. The MobileNet model is lightweight and performs computation faster; hence, it can work efficiently in mobile applications. Depthwise convolution is the basis of the MobileNet architecture. Features are extracted using CNNs, and the classification is done by a fully connected layer. The pretrained Inception V3 model gives good accuracy and can recognize around 1000 classes. It extracts the features from the image and then classifies the image based on those features. The learning algorithms can predict some diseases with good accuracy. Inception V3 gives better results than Inception Resnet V2 and MobileNet.
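Retraining only the last layer on top of a fixed feature extractor can be sketched without any deep learning library. In the toy example below, `frozen_features` is a hypothetical stand-in for a pretrained network such as Inception V3, the data are made up, and the last layer is a single logistic unit trained by stochastic gradient descent; none of this is the authors' actual pipeline.

```python
import math


def frozen_features(x):
    """Stand-in for a pretrained feature extractor; it is never updated."""
    return [x[0] + x[1], x[0] - x[1]]


def train_last_layer(samples, labels, epochs=200, lr=0.5):
    """Fit only the final logistic layer on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = frozen_features(x)
            z = sum(w * v for w, v in zip(weights, f)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y                       # gradient of the log loss w.r.t. z
            weights = [w - lr * err * v for w, v in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(x, weights, bias):
    f = frozen_features(x)
    return 1 if sum(w * v for w, v in zip(weights, f)) + bias > 0 else 0


# Hypothetical target-domain data with binary labels.
samples = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.9), (0.1, 0.7)]
labels = [1, 1, 0, 0]

weights, bias = train_last_layer(samples, labels)
predictions = [predict(x, weights, bias) for x in samples]
```

Only `weights` and `bias` change during training, which mirrors the reuse of the feature part and the retraining of the classification part described above.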
Rabat et al. used computer vision techniques for the identification of dermatological skin diseases. They used a feedforward, backpropagation ANN for training and various image processing algorithms for feature extraction. Two types of features are extracted:
- • Features extracted from the image (color, area, and shape).
- • Features obtained from the user (elevation, feelings, gender, age, and liquid type).
The system uses an algorithm to find the color code of the infected area and a Sobel operator to detect the edge of the infected area. The neural network consists of an input layer, one hidden layer, and an output layer. The input layer receives the features extracted from the image, and the output layer gives the predicted disease. The features are validated and tested using a 10-fold cross-validation process. The system examines infected human skin and detects the disease with reasonable accuracy. The system works on nine diseases.
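Edge detection with the Sobel operator can be sketched as follows. The two 3 × 3 kernels are the standard Sobel kernels; the test image, the function name, and the decision to compute the gradient only where the kernel fits entirely inside the image are assumptions made for the example.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to horizontal edges


def sobel_magnitude(image):
    """Gradient magnitude at every interior pixel of a gray-scale image."""
    h, w = len(image), len(image[0])
    out = [[0.0] * (w - 2) for _ in range(h - 2)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * image[y - 1 + i][x - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * image[y - 1 + i][x - 1 + j]
                     for i in range(3) for j in range(3))
            out[y - 1][x - 1] = math.hypot(gx, gy)
    return out


# Hypothetical image: dark left half, bright right half (a vertical edge).
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
edges = sobel_magnitude(image)
```

Thresholding the magnitude map would give a binary outline of a region, from which area and shape features could then be measured.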
Christian et al. [10] explored different ways to scale up convolutional networks with the goal of faster computation, using factorized convolutions and aggressive regularization. The VGGNet model has the appealing feature of architectural simplicity, but this comes at a relatively high cost: evaluating the network is computationally expensive. The computational cost of the inception classifier is much lower than that of VGGNet or other similar models. This lower cost has made it possible to use the inception model in big-data scenarios, where the data are enormous. It performs computation faster and is more efficient on mobile devices. However, the inception architecture is complex, which makes it relatively difficult to make changes to the network.
General design principles are as follows.
- • Avoiding representational bottlenecks. A feedforward network can be represented by a graph whose connections do not form a cycle, which defines a clear flow of information in the system. The amount of information passing through any partition between the inputs and the outputs can be assessed. One should avoid bottlenecks with extreme compression.
- • Higher dimensional representations are easier to process locally. Increasing the number of activations per tile in a convolutional network improves performance, and the resulting networks train faster.
- • Spatial aggregation is feasible without loss in representational power. For example, we can reduce the input dimension before the spatial aggregation without adverse effects. The authors hypothesize that the reason is the strong correlation between adjacent units, which results in less loss of information during dimensionality reduction.
- • Balancing the parameters of the network. Balancing the number of filters in every stage and the depth of the network optimizes performance. Increasing both the width and the depth can yield a higher quality network, but the optimal improvement for a constant amount of computation is reached only if both are increased in parallel. Hence, the computational budget should be distributed in a balanced way between the depth and the width of the network.
The authors have proposed a technique to regularize the model's classifier layer by estimating the effect of label dropout during the training.
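Let x be a training example. The following gives the probability of each label k ∈ {1, ..., K}:

$$p(k \mid x) = \frac{\exp(z_k)}{\sum_{i=1}^{K} \exp(z_i)}$$

where z_i are the logits.
Cross-entropy is the loss for the example:

$$\ell = -\sum_{k=1}^{K} q(k \mid x)\,\log p(k \mid x)$$

where q(k | x) is the ground-truth distribution over the labels for this example.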
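The regularization of the classifier layer mentioned above is commonly known as label smoothing: the one-hot target is mixed with a uniform distribution before the cross-entropy is computed. The sketch below assumes a uniform prior over the labels and a smoothing weight `epsilon` of 0.1; the logits and function names are hypothetical.

```python
import math


def softmax(logits):
    """Probability of each label from the logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_labels(true_label, num_classes, epsilon=0.1):
    """Mix the one-hot target with a uniform distribution over the labels."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) + uniform if k == true_label else uniform
            for k in range(num_classes)]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(q * math.log(p) for q, p in zip(target, probs) if q > 0)


probs = softmax([4.0, 1.0, 0.0])     # hypothetical logits, confident in label 0
hard = [1.0, 0.0, 0.0]               # one-hot ground truth
soft = smooth_labels(0, 3, epsilon=0.1)

hard_loss = cross_entropy(hard, probs)
soft_loss = cross_entropy(soft, probs)
```

With smoothed targets, a very confident prediction is penalized slightly even when it is correct, which discourages the logits from growing without bound.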
5.2.1 BUILDING BLOCKS IN THE CNN
Let H × W denote the spatial size of the output feature map, N the number of input channels, K × K the size of the convolutional kernel, and M the number of output channels; the computational cost of a standard convolution then evaluates to H·W·N·K²·M. The computational cost of the standard convolution depends on the following:
- • the spatial size of the output feature map, H × W;
- • the size of the convolution kernel, K²;
- • the numbers of input and output channels, N × M.
This computational cost is incurred when the convolution is performed over both the spatial and channel domains.
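A worked instance of the cost expression makes the three dependencies concrete. The layer dimensions below are hypothetical and chosen only for the arithmetic; the function name is likewise an assumption.

```python
def standard_conv_cost(h, w, n, k, m):
    """Multiply-accumulate count H * W * N * K^2 * M of a standard convolution."""
    return h * w * n * k * k * m


# Hypothetical layer: 56 x 56 output map, 64 input channels,
# 3 x 3 kernel, 128 output channels.
base = standard_conv_cost(h=56, w=56, n=64, k=3, m=128)

# Each factor scales the cost multiplicatively.
double_channels = standard_conv_cost(h=56, w=56, n=64, k=3, m=256)
larger_kernel = standard_conv_cost(h=56, w=56, n=64, k=5, m=128)
half_resolution = standard_conv_cost(h=28, w=28, n=64, k=3, m=128)
```

Doubling the output channels doubles the cost, moving from a 3 × 3 to a 5 × 5 kernel multiplies it by 25/9, and halving the spatial resolution divides it by four.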