Analysis with Classification Algorithms
To further analyze the Cyber Insurance dataset, we now turn to another supervised learning approach: classification. Based on examples whose class membership is known, new instances whose class membership is unknown are assigned to a class. Classification methods are application-oriented, which is why many different methods exist (Alpaydin, 2010).
Logistic Regression models perform very well on linearly separable classes, which makes Logistic Regression one of the most widely used classification algorithms. Despite its name, it is a binary classifier rather than a regression method. As mentioned before, categorical features must be One Hot Encoded before the ML algorithm can process them. In principle, any regression approach, linear or nonlinear, can be used for classification. Logistic Regression estimates the probability that an instance belongs to a class: if the probability is greater than 50%, the instance is assigned to the class (1), otherwise it is not (0). To obtain this probability, the output of a linear model is passed through a logistic function. Logistic Regression also uses L2 regularization, as Ridge does for regression. In Logistic Regression, however, the regularization parameter is called C, and higher values of C mean less regularization.
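The idea of a linear model inside a logistic function can be sketched in a few lines of plain Python. This is a minimal illustration, not the model used in this chapter: the weights, the bias, and the feature values are hypothetical numbers chosen only to show the 50% decision rule.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of class 1: the linear model w.x + b passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict(weights, bias, features, threshold=0.5):
    """Assign class 1 if the probability exceeds the threshold, otherwise class 0."""
    return 1 if predict_proba(weights, bias, features) > threshold else 0


# Hypothetical coefficients and one hypothetical instance (assumptions for illustration).
weights = [1.5, -2.0]
bias = 0.25
instance = [1.0, 0.5]

p = predict_proba(weights, bias, instance)
label = predict(weights, bias, instance)
print(f"probability of class 1: {p:.3f}, predicted class: {label}")
```

A linear score of exactly 0 corresponds to a probability of exactly 50%, so the decision boundary of the classifier is the set of points where the linear model evaluates to 0.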
To get a better understanding of what happens inside the Logistic Regression model, the visualization in Figure 8.3f shows how the model uses the different features and which of them have the greatest effect. The feature engineering process involves selecting the minimum set of features required to produce a valid model: the more features a model contains, the more complex it is, and the more sensitive it becomes to errors due to variance (Domingos, 2012). A common practice for eliminating features is to describe their relative importance to a model, eliminate weak features or combinations of features, and then re-evaluate to see whether the model fares better during cross-validation.
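The elimination procedure described above can be sketched as follows. The example uses a synthetic stand-in dataset from Scikit-Learn, not the Cyber Insurance data, and it takes the absolute size of the coefficients on standardized inputs as a simple measure of importance. Both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, only 3 of them informative.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)


def cv_accuracy(features):
    """Mean 5-fold cross-validated accuracy of a scaled logistic regression."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, features, y, cv=5).mean()


# Rank the features by the absolute size of their coefficients on scaled inputs.
scaler = StandardScaler().fit(X)
ranking_model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)
importance = np.abs(ranking_model.coef_[0])
order = np.argsort(importance)  # weakest feature first

# Drop the three weakest features and re-evaluate with cross-validation.
kept = np.sort(order[3:])
score_all = cv_accuracy(X)
score_reduced = cv_accuracy(X[:, kept])
print(f"all features: {score_all:.3f}, reduced set: {score_reduced:.3f}")
```

For simplicity the ranking here is computed once on the full dataset. A stricter evaluation would repeat the ranking inside each cross-validation fold so that the held-out data never influences the feature selection.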
Regarding feature importance, the following conclusions can be drawn. First, CC/PII and KRITIS have a significant influence on the model's decisions. This matches what is reported in the news about data breaches and the exposure of personal and/or cardholder data: the results of the ML algorithm point in the same direction. Second, Cyber Investments has a negative influence on the prediction. For instance, higher investments in cyber security are correlated with customers not making insurance claims.
Next, we look at the practical implementation of the Logistic Regression algorithm with the Python Scikit-Learn library. In the dataset, the label and the features have been separated for both the training and the test set, and the inputs have been normalized. Normalization puts the coefficients on a comparable scale, which helps in understanding feature importance. After creating an instance and training the model with the fit function, the accuracy can be computed on the test data. Adjusting C generally produces different results. At high C values, the algorithm tries to fit the training data as closely as possible. Low C values, in contrast, push the model toward a coefficient vector that is close to 0 (Muller, 2017). Our model produced its best result after C was set to 10,000: the best training accuracy achieved with the Logistic Regression algorithm was 76%.
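The steps described above (separating the data, normalizing the inputs, fitting the model, and varying C) might look roughly like this in Scikit-Learn. The sketch runs on a synthetic stand-in dataset, so its accuracies say nothing about the Cyber Insurance data. The three C values are chosen only to contrast strong and weak regularization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), split into training and test sets.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

results = {}
for C in (0.01, 1.0, 10000.0):
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X_train_s, y_train)
    results[C] = {
        "train_acc": model.score(X_train_s, y_train),
        "test_acc": model.score(X_test_s, y_test),
        "coef_norm": float(np.linalg.norm(model.coef_)),
    }

for C, r in results.items():
    print(f"C={C:>8}: train={r['train_acc']:.3f} "
          f"test={r['test_acc']:.3f} |w|={r['coef_norm']:.3f}")
```

The coefficient norm grows with C, which reflects the statement that low C values keep the coefficient vector close to 0 while high C values let the model adapt more closely to the training data.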
To get an even better insight into the Cyber Insurance dataset, let us now introduce the receiver operating characteristic curve (ROC curve). The ROC curve plots the true positive rate against the false positive rate. The false positive rate is the ratio of negative instances that were wrongly classified as positive (Fawcett, 2006). In Figure 8.4a, the dotted line represents a completely random classifier. A good model should be as far away as possible from this line and as close as possible to the upper left corner. Measured against this criterion, Logistic Regression is not considered further for the Cyber Insurance dataset, because this model is also underfitted.
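How the points of a ROC curve arise can be illustrated without any library. The following sketch sweeps the decision threshold over a handful of predicted scores and records the false positive rate and true positive rate at each step. The labels and scores are made-up toy values, and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs for all thresholds."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy labels and classifier scores (assumptions for illustration).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(f"AUC = {auc:.3f}")
```

A completely random classifier yields points along the diagonal and an area of about 0.5, whereas a curve that hugs the upper left corner approaches an area of 1.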
Support Vector Machines
Support Vector Machines (SVMs) are powerful tools for classification. SVMs divide a set of objects into classes in such a way that the widest possible area around the class boundaries remains free of objects (Cortes and Vapnik, 1995). The starting point for building an SVM is the available set of training objects, for each of which the class is known. Each object is represented by a vector in a vector space. The task of the SVM is to fit a hyperplane into this space, which acts as a dividing surface and divides the training dataset into two classes. The distance between the hyperplane and the vectors closest to it is maximized (Raschka and Mirjalili, 2019). This wide, empty border later ensures that even objects that do not correspond exactly to the training objects are classified as reliably as possible.
Figure 8.4 ROC for different approaches and relevance.
A hyperplane is described by a normal vector and a bias. The normal vector defines a straight line through the coordinate origin, and the hyperplanes run perpendicular to this line. Each hyperplane intersects the line at a certain distance b from the origin, measured in the opposite direction to the vector. This distance is called the bias (Raschka and Mirjalili, 2019). Together, the vector and the bias uniquely determine a hyperplane.
For points that lie on the hyperplane, the scalar product of the point with the normal vector, plus the bias, equals 0. For points that do not lie on the hyperplane, this value is not 0 but positive (on the side to which the vector points) or negative (on the opposite side). If the hyperplane separates the two classes, the sign is +1 for all points of one class and -1 for all points of the other class. For binary classification, the classification function therefore splits the objects into two parts: class +1, containing the elements whose value is greater than 0, and class -1, containing the elements whose value is lower than 0.
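The sign rule can be made concrete with a small sketch. It assumes the common convention that the decision value of a point is its scalar product with the normal vector plus the bias. The vector, the bias, and the points are hypothetical values chosen for illustration.

```python
def decision_value(normal, bias, point):
    """Scalar product of the point with the normal vector, plus the bias."""
    return sum(n * p for n, p in zip(normal, point)) + bias


def classify(normal, bias, point):
    """Return +1 for a positive decision value and -1 otherwise.

    Points exactly on the hyperplane (value 0) are assigned to class -1 here;
    this tie-breaking choice is an assumption of the sketch.
    """
    return 1 if decision_value(normal, bias, point) > 0 else -1


# Hypothetical hyperplane: x1 + x2 - 1 = 0.
normal = [1.0, 1.0]
bias = -1.0

above = [2.0, 2.0]    # on the side the normal vector points to
below = [0.0, 0.0]    # on the opposite side
on_plane = [0.5, 0.5]

for point in (above, below, on_plane):
    print(point, decision_value(normal, bias, point), classify(normal, bias, point))
```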
Input vectors that can be separated by such a hyperplane are called linearly separable (Boyd and Vandenberghe, 2004). A hard margin is possible only when the data are clearly linearly separable. However, because not every dataset is linearly separable and outliers are common, the ML algorithm must be adjusted. In the Scikit-Learn SVM classes, the margin is controlled with the C hyperparameter, which works in the same way as described for Logistic Regression. Here, a lower C leads to a wider margin but tolerates many outliers. A higher C leads to a narrower margin but fewer outliers (Raschka and Mirjalili, 2019). It is important to select the correct kernel for the problem, because the vector space contains little relevant data if the kernel function is chosen badly (Cortes and Vapnik, 1995). Through parameter optimization, the kernels can be refined further to find the optimal spaces. The different kernel methods and their parameters are described in more detail below.
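The effect of C and of the kernel choice can be explored with the Scikit-Learn SVC class. The sketch below uses a synthetic, deliberately noisy stand-in dataset rather than the Cyber Insurance data, and the C values and kernels are arbitrary choices for comparison. The number of support vectors is reported as a rough indicator of how many training points lie on or inside the margin.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); flip_y adds label noise that acts like outliers.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

summary = []
for kernel in ("linear", "rbf"):
    for C in (0.01, 1.0, 100.0):
        clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=C))
        clf.fit(X_train, y_train)
        # n_support_ holds the number of support vectors per class.
        n_sv = int(clf.named_steps["svc"].n_support_.sum())
        summary.append((kernel, C, n_sv, clf.score(X_test, y_test)))

for kernel, C, n_sv, acc in summary:
    print(f"kernel={kernel:<6} C={C:>6}: support vectors={n_sv:>3} test acc={acc:.3f}")
```

With a low C the margin is wide, so many training points typically end up as support vectors. With a high C the model accepts a narrower margin in order to misclassify fewer training points.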