K-Nearest Neighbors Classifier
K-Nearest Neighbors is a supervised learning algorithm in which the model is “trained” on data points labeled with their classification. When a new point is to be predicted, the algorithm considers the K points nearest to it to determine its class. A typical dataset for this type of algorithm consists of several descriptive attributes and a single target attribute (also called the class).
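The voting procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function names and the toy data are hypothetical and are not the chapter's dataset.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, point, k):
    """Return the majority class among the k training points nearest to `point`."""
    # Rank training points by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test points whose predicted class matches the true class."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


# Toy data (hypothetical): two well-separated groups of points.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["low", "low", "low", "high", "high", "high"]
test_X = [(1.1, 0.9), (5.1, 5.0)]
test_y = ["low", "high"]

# Evaluate several values of K and keep the one with the best test accuracy.
scores = {k: accuracy(train_X, train_y, test_X, test_y, k) for k in range(1, 6)}
best_k = max(scores, key=scores.get)
```

Scanning a range of K values and comparing the resulting accuracies is one common way to choose K; on real data the scores usually differ between values of K, unlike in this deliberately easy example.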
In our problem, we show how it is applied to the Turnover, whose probability distribution function was depicted in Figure 9.1a. The goal is to build a classifier that can predict the class of unknown cases. To do this, we select a specific type of classifier, K-Nearest Neighbors. We arrange the available data as follows:
- • Train set: 1,036 pairs of values, with K = 6.
- • Test set: 260 pairs of values, with K = 6.
- • Train set accuracy: 0.28667953667953666.
- • Test set accuracy: 0.05384615384615385.
The classification results show that the best accuracy is 0.07307692307692308, which is obtained with K = 2.
Decision Trees
Decision trees are a non-parametric supervised learning method used for classification and regression. The objective is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the features of the data.
For example, a decision tree can learn from data to approximate a sinusoidal curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the more closely the model fits the data.
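The sinusoid example can be reproduced with a small regression tree written from scratch. The sketch below is illustrative only: it assumes one input variable, already sorted, and greedy splits that minimize the squared error; the function names are hypothetical.

```python
import math


def build_tree(xs, ys, max_depth, depth=0):
    """Greedily fit a 1-D regression tree; xs must be sorted in ascending order."""
    mean = sum(ys) / len(ys)
    if depth == max_depth or len(xs) < 2:
        return mean  # Leaf: predict the mean target of the samples that reach it.
    best_sse, best_i = None, None
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if best_sse is None or sse < best_sse:
            best_sse, best_i = sse, i
    return {
        # Split halfway between the two neighbouring samples.
        "threshold": (xs[best_i - 1] + xs[best_i]) / 2,
        "low": build_tree(xs[:best_i], ys[:best_i], max_depth, depth + 1),
        "high": build_tree(xs[best_i:], ys[best_i:], max_depth, depth + 1),
    }


def predict(tree, x):
    """Follow the if-then-else rules from the root down to a leaf value."""
    while isinstance(tree, dict):
        tree = tree["low"] if x <= tree["threshold"] else tree["high"]
    return tree


def mean_squared_error(tree, xs, ys):
    return sum((predict(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# One period of a sine curve sampled at 80 points.
xs = [2 * math.pi * i / 79 for i in range(80)]
ys = [math.sin(x) for x in xs]

shallow = build_tree(xs, ys, max_depth=2)  # at most 4 leaves
deep = build_tree(xs, ys, max_depth=5)     # at most 32 leaves
```

Comparing `mean_squared_error(shallow, xs, ys)` with `mean_squared_error(deep, xs, ys)` shows the deeper tree following the training curve more closely, at the price of many more rules.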
A decision tree has a first node called the root; the input attributes are then split into two branches (there could be more, but we will not get into that here) by posing a condition that may be true or false. Each node forks in two, and the branches are subdivided again until they reach the leaves, which are the final nodes and correspond to the answers: Yes/No, Buy/Sell, or whatever we are classifying.
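The root, branch, and leaf structure can be made concrete with a tiny hand-built tree. The feature names, thresholds, and Buy/Sell answers below are purely hypothetical and serve only to show how a sample travels from the root to a leaf.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    answer: str  # Final answer, e.g. "Buy" or "Sell".


@dataclass
class Node:
    feature: str                    # Attribute tested at this node.
    threshold: float                # Condition: sample[feature] <= threshold.
    if_true: Union["Node", Leaf]    # Branch followed when the condition holds.
    if_false: Union["Node", Leaf]   # Branch followed otherwise.


def classify(node, sample):
    """Walk from the root to a leaf, taking one branch per condition."""
    while isinstance(node, Node):
        node = node.if_true if sample[node.feature] <= node.threshold else node.if_false
    return node.answer


# Hypothetical two-level tree: the root tests one attribute, its child another.
root = Node(
    feature="price_change",
    threshold=0.0,
    if_true=Leaf("Sell"),
    if_false=Node(feature="volume", threshold=1000.0, if_true=Leaf("Sell"), if_false=Leaf("Buy")),
)

decision = classify(root, {"price_change": 2.0, "volume": 5000.0})
```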
Some of the advantages of decision trees are as follows:
- • Easy to understand and interpret. Trees can be visualized.
- • They require little data preparation. Other techniques often require data normalization, the creation of dummy variables, and the removal of blank values. Note, however, that this module does not support missing values.
- • The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
- • Able to handle numerical and categorical data. Other techniques are usually specialized in the analysis of datasets that only have one type of variable.
- • Able to handle multi-output problems.
- • They use a white box model. If a given situation is observable in a model, the condition is easily explained by Boolean logic. By contrast, in a black box model (e.g., an artificial neural network), the results may be more difficult to interpret.
- • A model can be validated through statistical tests, which makes it possible to account for the reliability of the model.
- • They perform well even if their assumptions are somewhat violated by the true model from which the data were generated.
Disadvantages of decision trees include the following:
- • Decision tree learners can create overly complex trees that do not generalize well from the data. This is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.
- • Decision trees can be unstable because small variations in the data can result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
- • The problem of learning an optimal decision tree is known to be NP-complete (nondeterministic polynomial-time complete) under several aspects of optimality and even for simple concepts. Consequently, practical decision tree learning algorithms are based on heuristics such as the greedy algorithm, in which locally optimal decisions are made at each node. Such algorithms cannot guarantee that they return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
- • Some concepts are hard to learn because decision trees do not express them easily, such as exclusive OR (XOR), parity, or multiplexer problems.
- • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting the decision tree.
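The XOR case mentioned in the list can be checked by hand. The sketch below, which uses only the four XOR points and hypothetical helper names, shows that no single split on either input does better than chance, while two nested conditions reproduce the labels exactly. Because every first split looks equally useless, a greedy learner gets no guidance on where to start.

```python
from collections import Counter

# The four XOR points: the label is 1 when exactly one input is 1.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in points]


def stump_accuracy(feature):
    """Accuracy of one split on `feature`, predicting the majority class on each side."""
    correct = 0
    for side in (0, 1):
        side_labels = [y for p, y in zip(points, labels) if p[feature] == side]
        # Count how many samples on this side agree with the side's majority class.
        correct += Counter(side_labels).most_common(1)[0][1]
    return correct / len(points)


def depth_two_tree(p):
    """Two nested conditions are enough to express XOR exactly."""
    if p[0] <= 0.5:
        return 1 if p[1] > 0.5 else 0
    return 0 if p[1] > 0.5 else 1


single_split_scores = [stump_accuracy(0), stump_accuracy(1)]
nested_predictions = [depth_two_tree(p) for p in points]
```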
In this case, we focus on the assessment result, the Rating, based on the following input parameters: Turnover, which gives an idea of how exposed a company is to a cyber-attack; Other IT insurances that the company has contracted; CC/PII data hosted by the company; the Cyber Investment that the company has made to improve its information and communications technology (ICT) infrastructure and security in order to reduce cyber risk; KRITIS; and the Insurance claims the company has faced due to successful cyber-attacks. The Train set is composed of 907 input-target sets, while the Test set is composed of 389 input-target sets.
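The set sizes quoted above (907 and 389, that is, 1,296 records in total) are consistent with holding out about 30% of the data for testing. A minimal sketch of such a split follows. The function name, the seed, the random shuffle, and the rule of rounding the test size up are illustrative assumptions, not details taken from the study.

```python
import math
import random


def split_train_test(records, test_fraction, seed=0):
    """Shuffle a copy of `records` and hold out `test_fraction` of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the test-set size is rounded up to the next whole record.
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 1,296 input-target sets; only the count matters here.
records = list(range(1296))
train_set, test_set = split_train_test(records, test_fraction=0.30)
```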
In the approach shown in this work, we used the given Rating parameter as the target. This ultimately leads to the contract/do-not-contract decision based on the achieved Rating. In future investigations and developments, we will therefore first categorize and map the Rating score into a YES/NO decision, which is the first step toward automating the process.
To optimize the process, further analysis of the parameters used to build the decision tree is needed to identify the features that lead to better solutions.
Figure 9.4 shows the resulting decision tree. Some remarks are needed to fully understand the result. When analyzing the problem and implementing a solution, we faced two possibilities: using either a Decision Tree Classifier or a Decision Tree Regressor. The first approach requires the target to be clearly organized into categories or, for example, binary decisions: YES/NO. Since the intention of this work is to use the raw data provided by companies, that idea was left for further investigation, and we use the raw Rating data. We therefore focus on the second approach, that is, using the Decision Tree Regressor, to avoid preprocessing the data and to obtain an open solution.
This decision involves some advantages and drawbacks, as follows:
- • The main advantage is obtaining the decision as a score in the range [0,4]. This makes it possible to review the scores applied so far in order to make better and more accurate decisions.
- • The main drawback is that the score may not lead to the best decision.
Figure 9.4 Decision tree resulting from training and test.
Support Vector Machines
Support vector machines (SVMs) have their origin in work on statistical learning theory and were introduced in the 1990s by Vapnik and his collaborators (Boser et al., 1992; Cortes and Vapnik, 1995). Although SVMs were originally intended to solve binary classification problems, they are currently used to solve other types of problems (regression, clustering, and multiclass classification). They have also been used successfully in diverse fields, such as computer vision, character recognition, hypertext categorization, protein classification, natural language processing, and time series analysis. In fact, since their introduction, they have earned well-deserved recognition thanks to their solid theoretical foundations.
An SVM is a machine learning technique that finds the best possible separation between classes. In two dimensions, it is easy to see what it is doing: it looks for the optimal separating line. Machine learning problems, however, normally have many dimensions, so instead of finding the optimal line, the SVM finds the hyperplane that maximizes the margin of separation between classes.
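For linearly separable data, margin maximization is usually written as the standard hard-margin problem below. The symbols are introduced here for illustration and do not appear in the original text: \mathbf{x}_i are the feature vectors, y_i \in \{-1, +1\} the class labels, \mathbf{w} the normal vector of the hyperplane, and b its offset.

```latex
\begin{aligned}
\text{separating hyperplane:}\quad & \mathbf{w}^{\top}\mathbf{x} + b = 0 \\
\text{margin width:}\quad & \frac{2}{\lVert \mathbf{w} \rVert} \\
\min_{\mathbf{w},\, b}\quad & \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to}\quad y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1,
\quad i = 1, \dots, n
\end{aligned}
```

Minimizing \lVert \mathbf{w} \rVert under these constraints is equivalent to maximizing the margin width, since the two are inversely proportional.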
The dataset used in this investigation consists of sample records concerning several hundred cyber risk assessment reports, each of which contains the values of a set of features related to cyber risk. The fields in each record are the ones described above, that is, Turnover, Other IT Insurance, CC/PII, Rating, KRITIS, Cyber Invest.
In the data analytics applied to this dataset, we first analyze how Cyber Investment and CC/PII affect the Insurance claims registered by companies that were affected in some way.
Figure 9.5a shows how the insurance claims for damage caused by successful cyber-attacks relate to the company's investment in cyber security and to the risk of holding credit card and personal data, which are a common target of cybercrime. As expected, the greater the investment in cyber security, the lower the probability that an attack succeeds.
Continuing the analysis, Figure 9.5b shows the scores that the algorithm obtains.
Finally, Figure 9.5c shows the distribution of the confusion matrix based on the values of the company's Cyber Investment and CC/PII. It is worth noting that the similarity score is 0.5423076923076923 when the estimated values are compared with the test dataset.
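A confusion matrix of the kind mentioned above simply counts how often each actual class is predicted as each class. The sketch below uses made-up labels, not the study's data. Because the text does not define its similarity score, plain accuracy is computed here only as a stand-in for comparing estimated values with the test set.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


# Hypothetical test labels and model estimates for a two-class problem.
actual = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 1, 0, 0, 1, 0]
classes = [0, 1]

matrix = confusion_matrix(actual, predicted, classes)
# Correct predictions lie on the diagonal of the matrix.
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(actual)
```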