In this section, the most common distinguishers used in SCA for correlation analysis and template-matching are presented. Machine learning techniques for classification and clustering are broadly used in SCA, in order to distinguish between traces with high noise ratios.
According to , unsupervised clustering is generally useful in side-channel analysis when profiling information is not available and an exhaustive partitioning is computationally infeasible. The authors presented an attack on an FPGA-based elliptic curve scalar multiplication using the k-means method. In , Perin et al. used unsupervised learning to attack randomized exponentiations.
Lerman et al. showed in  that machine learning techniques give better classification results when there is limited ability of the adversary to perform profiling of the device and in a high dimensionality context, where many parameters affect the leakage of the device. Indeed, combining three side-channel leakages and a clustering-based approach for non-profiled attacks, gives higher success rates than traditional template attacks, as shown by Specht et al. in .
The success results of online template attacks (presented in the next section) are significantly improved in  by using the k-nearest neighbor approach, naive Bayes classification and the support vector machine method for template classification. In order to explain these techniques, we first give the definition of the Euclidean distance and the Pearson correlation coefficient.
Euclidean Distance The Euclidean distance between two points is defined as the square root of the sum of the squares of the differences between the corresponding point values:
In the SCA setting, a realization of a random variable X corresponds to x .A sample of n observations or traces of X is denoted by (xi)1<i<n, where the index i denotes the different observations or the different time when an observation occurs in the same trace.
Pearson correlation The Pearson correlation coefficient measures the linear independence between two observations X and Y:
In SCA, the Pearson correlation coefficient is used to describe the difference in the Hamming weight of the observations.
Naive Bayes Classification The naive Bayes classification method is based on probability concepts, and more precisely on the Bayes theorem for conditional probabilities of independent events . According to the conditional probability model, let x = (x1,... ,xn) be the vector of problem instances (independent variables) to be classified, each one having a feature n. Each instance is assigned a probability p(ck |xi,..., xn), for k possible classes. The set of classes c, c2,..., ck is mutually exclusive and exhaustive.
Using Bayes’ theorem, the posterior conditional probability is p(ck|x) = p(ck)ppX)X|ck). Assuming that each event and posterior probability is independent on each previous event, the conditional distribution over the class variable c is p(ck lx1, ...,xn) = Z p(ck )П n=1 p(xi |ck) where the evidence Z = p(x) is a scaling factor dependent only on x1,...,xn, that is, a constant if the values of the feature variables are known.
The naive Bayes classifier is a function that combines the naive Bayes probability model with a decision rule. One common rule is to pick the hypothesis that is most probable; that is the maximum value of the a posteriori probability. For each class ci, the class index that gives the maximum value for an event is chosen as classifier. K-Nearest Neighbor The k-Nearest Neighbor Classification (kNN) is a classification method based on the closest instances of the training set to the unlabeled data. Basically, according to , it consists of the following two steps:
- 1. Choose the number of k closest instances (from the training set) to the sample.
- 2. The majority label (class) for the chosen closest instances will be class for the unlabeled data.
The distance metric plays an important role, in order to determine the closest instance. In kNN, the Euclidean distance or the Manhattan distance can be used. The value k indicates the number of the already-classified closest instances that are chosen in order to classify the next unlabeled data. The default value is 1, but with larger value for k it is possible to obtain higher success rate with less template traces. Figure3.2 shows an example with k = 2 and k = 4 close instances; if the new sample is closer to A, it will be classified in the “A-class.”
Fig. 3.2 kNN method for k = 2 and k = 4
Fig. 3.3 SVM: distance to the hyperplane for two sets of training data
SVM A Support Vector Machine (SVM) is a supervised learning model that produces a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data, the algorithm outputs an optimal hyperplane which categorizes new examples. Figure3.3 shows such a hyperplane. An optimal hyperplane, as defined in , is the one that gives the largest minimum distance to the training points, because in this way noisy data will still be classified correctly. Therefore, the optimal separating hyperplane maximizes the margin of the training data. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. The classifier in an SVM can be nonlinear for data sets that are not easily separable, but in the following analysis a linear classifier gives very good results.