# Data Splitting

The success of a computationally sound ML model with high prediction accuracy depends on its ability to generalize and be applied to different scenarios.

FIGURE 2.4 Manually reconstructed basic Relief algorithm with sample data.

Therefore, in ML, the available labeled datasets are partitioned into training and testing datasets. The training data subset is used to train the ML model, which recognizes the patterns in the dataset and learns the relationship between the features and their respective class labels. The testing data subset, which serves as a proxy for new data arriving in the future, is used to estimate the generalization quality of the ML model (prediction accuracy) beyond the training dataset (Das et al., 2020c).

There are different data splitting methods, such as simple random sampling, trial-and-error, cross-validation, systematic sampling, convenience sampling, and stratified sampling. Among these methods, simple random sampling is commonly used as it is easy and efficient to implement; however, it works more efficiently for datasets that have a uniform distribution (Reitermanova, 2010). The Kolmogorov-Smirnov test, conducted using the R function ks.test(), revealed that most of the features selected from the soybean aphid, weed species, and iris datasets were uniformly distributed (*p* > 0.01), hence making data subsetting valid for analyses. Therefore, in this study the simple random sampling method was used to select the data for training and testing from the original dataset.
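The text does not show how ks.test() was called. One plausible form, given here only as a sketch, compares a single iris feature against a uniform distribution whose bounds are assumed to be the sample minimum and maximum (the choice of feature and of bounds is an assumption, not the authors' stated procedure):

```r
# One-sample Kolmogorov-Smirnov test of a feature against a uniform distribution.
# Assumption: the uniform bounds are taken from the sample itself.
x <- iris$Petal.Length

# iris measurements are rounded, so tied values may trigger a warning;
# it is suppressed here only to keep the output clean.
ks <- suppressWarnings(
  ks.test(x, "punif", min = min(x), max = max(x))
)

ks$statistic  # maximum distance between empirical and uniform CDFs
ks$p.value    # compare against the chosen significance level (0.01 in the text)
```

Because the bounds are estimated from the same data being tested, the resulting p-value is only approximate.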

Random sampling was performed using the R function createDataPartition() from the caret package. The split ratio of the dataset for training and testing is usually in the range of 60-90% and 40-10%, respectively. The split ratio is selected based on the prediction accuracy of the ML classifier model on the validation data, which is a random subsample of the training data. The iris dataset was used to demonstrate how createDataPartition() splits the original dataset into training and testing subsets. A ratio of 70% (training subset) and 30% (testing subset) was considered for data splitting. The R code snippet that loads the package, partitions the data, and splits it into training and testing data is given below:

    library(caret)  # R package for Classification And REgression Training

    data <- iris    # iris dataset shipped with R (assumed source of `data`)

    # Data partitioning with 70% splitting
    splitDat <- createDataPartition(y = data$Species, p = 0.7, list = FALSE)

    trainDat <- data[splitDat, ]   # the 70% split assigned as training data
    testDat  <- data[-splitDat, ]  # the remaining 30% assigned as testing data
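For readers without caret installed, a 70/30 simple random split can be sketched in base R. This is an illustration rather than the chapter's own code: the seed and the variable names `trainIdx`, `trainSimple`, and `testSimple` are arbitrary choices. Note that createDataPartition() samples within each level of a factor outcome, so its class proportions are balanced by construction, whereas the plain sample() call below does not guarantee that.

```r
set.seed(123)  # arbitrary seed, used only to make the split reproducible

n <- nrow(iris)

# Simple random sampling: draw 70% of the row indices without replacement
trainIdx <- sample(n, size = round(0.7 * n))

trainSimple <- iris[trainIdx, ]   # 105 rows for training
testSimple  <- iris[-trainIdx, ]  # remaining 45 rows for testing

# Class counts per subset; with simple random sampling they need not be equal
table(trainSimple$Species)
table(testSimple$Species)
```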

# The ML Methods

The two major categories of ML classification are unsupervised and supervised algorithms. Unsupervised ML performs classification on unlabeled data (only input variables, and no output variables with identified labels, are passed to the algorithm). The unsupervised algorithm learns the underlying structure and distribution in the data and clusters them into groups (e.g., clustering and association). Supervised ML performs classification and regression tasks on labeled data (input variables and labeled output variables are passed to the algorithm), which helps the models to understand the relation between the input and output variables (Das et al., 2015, 2018).

Classification models are employed if the outcome variable is categorical, and regression models are used if it is continuous. Practical machine learning applications predominantly use supervised classification. The agricultural application datasets used in this chapter (soybean aphids and weed species) are labeled and have a categorical output variable. Therefore, common supervised classification ML algorithms, such as LDA and kNN, were considered, decoded, implemented as methods based on their algorithms, and discussed.

## Linear Discriminant Analysis

LDA is a popular technique for dimensionality reduction and supervised classification and was originally developed to address the two-class problem (Fisher, 1936). It divides the feature space into two regions separated by a line, which aids in the classification of the data; the separation line is defined by the linear discriminant score function. This method was later generalized for multiclass problems (Rao, 1948). The LDA method employs the linear discriminant score function to predict the class label of a test data instance. The discriminant score function is derived from Bayes' theorem, which determines the class labels based on the probability of each class and the probability of the test data instance belonging to the class (Equation (2.14)).

$$P(\pi_i \mid x) = \frac{P(\pi_i \cap x)}{P(x)} \tag{2.14}$$

where $P(\pi_i \mid x)$ is the probability of test data instance $x$ belonging to class $\pi_i$, $P(\pi_i \cap x)$ is the joint probability of $\pi_i$ and $x$, and $P(x)$ is the prior probability of $x$.

The linear discriminant score function of Bayes' theorem is expressed in the form of $y = ax + b$, where $a$ and $b$ are the parameters obtained from the training data and their respective classes, while $x$ is the test data instance. A detailed derivation of the linear score function from Bayes' theorem can be found elsewhere (Konishi, 2014; Naik and Kiran, 2019). The linear discriminant score function for the class $i$ ($L_i$) is given as:

$$L_i = \mu_i^{T} \Sigma^{-1} x - \frac{1}{2}\, \mu_i^{T} \Sigma^{-1} \mu_i + \ln(p_i) \tag{2.15}$$

where $\mu_i$ is the mean vector of each class in the training dataset, $\Sigma$ is the covariance matrix of each class in the training dataset (assumed equal across classes), $x$ is the test data instance, and $p_i$ is the prior probability of class $i$. Some of the major assumptions of performing LDA are: (i) the training and test data are normally distributed, and (ii) the covariance matrices of the classes are equal. However, in reality, the covariance matrices of the classes are not equal; therefore, to account for the variances in the dataset, a pooled covariance matrix ($S_p$) is considered instead.

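$$S_p = \frac{\sum_{i=1}^{m} (n_i - 1)\, \Sigma_i}{\sum_{i=1}^{m} n_i - m} \tag{2.16}$$

where $S_p$ is the pooled covariance matrix for all the classes, $\Sigma_i$ is the covariance matrix of class $i$, $m$ is the number of classes in the dataset, and $n_i$ is the number of observations in each class. Substituting Equation (2.16) in Equation (2.15), the $L_i$ function equation for a class is:

$$L_i = \mu_i^{T} S_p^{-1} x - \frac{1}{2}\, \mu_i^{T} S_p^{-1} \mu_i + \ln(p_i) \tag{2.17}$$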

$L_i$ is evaluated for all the $m$ classes present in the dataset to predict the class of the incoming test data instance ($x$). The class that yields the maximum $L_i$ score is assigned as the class of the test data instance. The relevant processes involved in the LDA algorithm are presented as pseudocode (Algorithm 2.2).

**Algorithm 2.2** The pseudocode of the LDA algorithm

- 1: **procedure** LDA(training dataset, test data instance)
- 2: let $x_t$ = test data instance > data from the test dataset
- 3: **loop** > using training dataset
- 4: estimate the mean ($\mu_i$) vectors for classes > for 1 to m
- 5: determine the covariance matrix ($\Sigma$) > for all m classes
- 6: **end loop**
- 7: **for** (1 to m) **do** > m - total number of classes
- 8: find linear discriminant scores ($L_1, \ldots, L_m$)
- 9: estimate linear discriminant function for $x_t$ > using Eq. (2.17); test data
- 10: class of $x_t$ = max($L_1, \ldots, L_m$) > class identified; test data
- 11: **end for**
- 12: **end procedure**

The developed R program demonstrates the working of LDA in classifying the iris dataset (Figure 2.5). The features, namely petal length, petal width, and sepal length, were selected based on the result of employing the ReliefF algorithm on the iris dataset. The selected influential features were all normally distributed and were used to perform the LDA.
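The scoring steps described above can be sketched in base R as follows. This is a minimal illustration and not the program shown in Figure 2.5: it assumes the standard pooled-covariance form of the linear score, a 70/30 split sampled within each class, an arbitrary seed, and hypothetical helper names (`lda_score`, `predict_lda`).

```r
set.seed(42)  # arbitrary seed for a reproducible split
feats <- c("Petal.Length", "Petal.Width", "Sepal.Length")

# 70/30 split, sampling 70% of the rows within each class
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(rows) sample(rows, size = round(0.7 * length(rows)))))
trainDat <- iris[idx, ]
testDat  <- iris[-idx, ]

classes <- levels(trainDat$Species)
n_i <- sapply(classes, function(cl) sum(trainDat$Species == cl))  # observations per class
p_i <- n_i / sum(n_i)                                             # prior probabilities

# Mean vector of each class
mu <- lapply(classes, function(cl) colMeans(trainDat[trainDat$Species == cl, feats]))
names(mu) <- classes

# Pooled covariance matrix: per-class covariances weighted by (n_i - 1)
S_p <- Reduce(`+`, lapply(classes, function(cl) {
  (n_i[[cl]] - 1) * cov(trainDat[trainDat$Species == cl, feats])
})) / (sum(n_i) - length(classes))
S_inv <- solve(S_p)

# Linear discriminant score of instance x for one class
lda_score <- function(x, mu_i, prior_i) {
  drop(t(mu_i) %*% S_inv %*% x - 0.5 * t(mu_i) %*% S_inv %*% mu_i + log(prior_i))
}

# Predicted class = class with the maximum score
predict_lda <- function(x) {
  scores <- sapply(classes, function(cl) lda_score(x, mu[[cl]], p_i[[cl]]))
  classes[which.max(scores)]
}

pred <- factor(apply(as.matrix(testDat[, feats]), 1, predict_lda), levels = classes)
conf_mat <- table(Actual = testDat$Species, Predicted = pred)
conf_mat
```

The result can be cross-checked against a library implementation such as MASS::lda() fitted to the same training rows.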

**Advantages of LDA:**

- Simple, easy to implement, and provides fast classification.
- Provides a linear decision boundary.
- Feature scaling is not required.

**Disadvantages of LDA:**

- The data should be normally distributed.
- The LDA assumes equal covariance for the classes.

## k-Nearest Neighbor

The kNN is one of the simplest methods employed for classification applications. This method is also referred to as a lazy or instance-based learner since, unlike most ML methods, it does not undergo a training phase before classification.

FIGURE 2.5 Manually reconstructed linear discriminant analysis (LDA decoded) for the iris dataset using the linear discriminant score function with evaluation using the confusion matrix.

It delays the modeling of the training dataset until it is needed to classify the incoming test data instance. The method estimates the similarity between the features of the test instance and those of the training instances: the smaller the distance between two instances, the greater the chance that they belong to the same class. The similarity between the test data instance and every instance in the training dataset is calculated using the Euclidean distance, which is the straight-line distance between two points and which can be calculated using the following equation:

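$$D = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{2.18}$$

where $D$ is the Euclidean distance value between the test and training data instances, $n$ denotes the number of feature attributes in the training dataset, $x_i$ is the value of the $i$th feature of the training data instance, and $y_i$ is the value of the $i$th feature of the testing data instance.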

The calculated Euclidean distance values are sorted in ascending order, and the $k$ closest data points are selected from the sorted results. The class that occurs most frequently among the $k$ selected points is assigned as the class of the test data instance. The $k$ value plays a crucial role in determining the predictive capability of the kNN algorithm. A rule-of-thumb approach for fixing the $k$ value is to take the square root of the total number of training data observations. It is important to note that the predictive performance of the kNN model also depends on the scale of the features. Feature scaling procedures such as normalization or standardization should be performed before employing the kNN algorithm for features existing at different scales (units); this step, however, can be disregarded if the features already have the same units. The pseudocode decoding the kNN algorithm is presented (Algorithm 2.3).
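The two scaling procedures mentioned above can be sketched in base R as follows. The three iris features are used only for illustration, and `min_max` is a hypothetical helper name:

```r
X <- iris[, c("Petal.Length", "Petal.Width", "Sepal.Length")]

# Standardization: subtract the column mean and divide by the column
# standard deviation, giving each feature mean 0 and sd 1
X_std <- scale(X)

# Normalization (min-max): rescale each feature to the range [0, 1]
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
X_norm <- as.data.frame(lapply(X, min_max))

round(colMeans(X_std), 10)  # all approximately 0
sapply(X_norm, range)       # each column spans 0 to 1
```

In a train/test workflow, the scaling parameters (means, standard deviations, minima, maxima) would normally be computed on the training subset and then applied unchanged to the testing subset.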

**Algorithm 2.3** The pseudocode of the kNN algorithm

- 1: **procedure** KNN(training dataset, test data instance)
- 2: let $x_j$ = test data instance > from the test dataset
- 3: **for** (1 to t) **do** > t - number of observations in testing dataset
- 4: **for** (1 to n) **do** > n - number of observations in training dataset
- 5: find Euclidean dist. between test & train instances > using Eq. (2.18)
- 6: for every test data against 'n' in training dataset
- 7: **end for**
- 8: sort the Euclidean distances in ascending order
- 9: select $k$ value > thumb rule: $k = \sqrt{n}$
- 10: pick the $k$ lowest distance (nearest neighbor) > from data points
- 11: find respective classes of the $k$ closest data points
- 12: class of $x_j$ = maximum occurring class for $k$ closest data points
- 13: **end for**
- 14: **end procedure**

The user-coded algorithm in R demonstrates the working of the kNN method for classification, using the features selected by the ReliefF algorithm from the iris dataset (Figure 2.6). The selected prominent features, petal width, petal length, and sepal length, had the same units (cm); therefore, feature scaling was not performed for this demonstration.
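A compact base-R sketch of these steps is given below. It is not the program shown in Figure 2.6: the seed, the within-class 70/30 split, the helper name `knn_predict`, and the tie-breaking rule (first class in factor-level order) are all assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed for a reproducible split
feats <- c("Petal.Length", "Petal.Width", "Sepal.Length")

# 70/30 split, sampling 70% of the rows within each class
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(rows) sample(rows, size = round(0.7 * length(rows)))))
trainX <- as.matrix(iris[idx, feats]);  trainY <- iris$Species[idx]
testX  <- as.matrix(iris[-idx, feats]); testY  <- iris$Species[-idx]

# Rule of thumb: k = square root of the number of training observations
k <- round(sqrt(nrow(trainX)))

knn_predict <- function(x, trainX, trainY, k) {
  # Euclidean distance from x to every training instance
  d <- sqrt(rowSums(sweep(trainX, 2, x)^2))
  # Indices of the k smallest distances (nearest neighbors)
  nearest <- order(d)[seq_len(k)]
  # Majority vote; ties go to the first class in level order
  votes <- table(trainY[nearest])
  names(votes)[which.max(votes)]
}

pred <- factor(apply(testX, 1, knn_predict,
                     trainX = trainX, trainY = trainY, k = k),
               levels = levels(trainY))
conf_mat <- table(Actual = testY, Predicted = pred)
conf_mat
```

Every prediction scans all training rows, which is why the cost grows with the size of the training dataset.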

FIGURE 2.6 Manually reconstructed linear k-nearest neighbor (kNN decoded) algorithm for the iris dataset with evaluation using a confusion matrix.

**Advantages of kNN:**

- Simple and easy to employ for classification applications, since it relies only on the $k$ value and the Euclidean distance function.
- The kNN algorithm does not require any training phase because it is an instance-based learner, and it therefore avoids the training time of methods such as LDA and naive Bayes.
- New data can be added to the existing dataset seamlessly, because there is no trained model to update.

**Disadvantages of kNN:**

- The kNN does not perform well with large datasets, since it is computationally expensive to calculate the distance between the new test data instance and every instance in the training dataset.
- Similarly, for datasets with high dimensions (many feature attributes), the algorithm becomes more complicated when calculating the distance in each dimension.
- It is sensitive to noisy data; therefore, data preprocessing measures should be taken to eliminate any missing values or outliers before employing kNN.

# Evaluation of ML Methods

Evaluating the efficiency of the developed model is an essential part of ML. Different metrics are available to evaluate the efficiency of the model, among which confusion matrix, accuracy, precision, recall, and F-score are often used for classification applications.

## Confusion Matrix

The confusion matrix conveys the correctness of the model in the form of a matrix. It is mostly used for determining the effectiveness of the classification of ML models with two or more classes. The matrix contains the actual values in rows and predicted values in columns (Figure 2.7).

The following are the four essential terms associated with the confusion matrix.

- True positive (TP): cases where the actual data belonged to class 1 and the model correctly classified them as class 1.
- False negative (FN): cases where the actual data belonged to class 1 but the model wrongly classified them as class 2.
- False positive (FP): cases where the actual data belonged to class 2 but the model wrongly classified them as class 1.
- True negative (TN): cases where the actual data belonged to class 2 and the model correctly classified them as class 2.

The following performance parameters, namely accuracy, precision, recall, F-score, macro-average (based on number of classes), and weighted average (weighting each class by number of samples in each class) are calculated based on the TP, FN, FP, and TN values present in the confusion matrix (Figure 2.7).
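As an illustration of the layout described above (actual values in rows, predicted values in columns), the four counts can be read from a table() cross-tabulation. The labels and counts below are made up for the example and do not come from the chapter's datasets:

```r
lv <- c("class1", "class2")

# Hypothetical labels: 50 actual cases per class
actual <- factor(rep(lv, times = c(50, 50)), levels = lv)
predicted <- factor(c(rep(lv, times = c(40, 10)),   # predictions for actual class1
                      rep(lv, times = c(5, 45))),   # predictions for actual class2
                    levels = lv)

# Confusion matrix: rows = actual, columns = predicted
conf_mat <- table(Actual = actual, Predicted = predicted)

TP <- conf_mat["class1", "class1"]  # actual class1, predicted class1
FN <- conf_mat["class1", "class2"]  # actual class1, predicted class2
FP <- conf_mat["class2", "class1"]  # actual class2, predicted class1
TN <- conf_mat["class2", "class2"]  # actual class2, predicted class2

c(TP = TP, FN = FN, FP = FP, TN = TN)
```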

## Accuracy

Accuracy in classification problems is the ratio of the total number of correct predictions by the model (TP and TN) to the total number of input data. It is a good measure for evaluating the model if the classes in the data are nearly balanced.

FIGURE 2.7 Confusion matrix for a binary class of data.

## Precision

Precision is a metric that quantifies the number of cases correctly classified as positive out of the total cases classified as positive. It is the ratio of TP to the sum of TP and FP. The precision result is a value in the range 0.0 to 1.0, where 0.0 represents no precision and 1.0 represents full or perfect precision.

## Recall

Recall is the proportion of cases correctly classified as positive in the class of interest. It is the ratio of TP to the total instances (TP and FN) of a class. It is a value between 0.0 and 1.0.

## F-score

F-score is the balance between precision and recall and is determined using their harmonic mean. This measure is more reliable than accuracy for validating a model if the data has an uneven class distribution. However, with an even class distribution (balanced dataset), the F-score is typically close to the accuracy value.
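The metrics defined in this section can be computed from a confusion matrix as sketched below. The helper name `class_metrics` and the imbalanced counts are hypothetical, chosen so that the gap between accuracy and the minority-class F-score is visible; the sketch assumes actual values in rows and predicted values in columns.

```r
class_metrics <- function(conf_mat) {
  tp        <- diag(conf_mat)
  precision <- tp / colSums(conf_mat)  # TP / (TP + FP), per class
  recall    <- tp / rowSums(conf_mat)  # TP / (TP + FN), per class
  f_score   <- 2 * precision * recall / (precision + recall)
  support   <- rowSums(conf_mat)       # number of actual cases per class
  per_class <- cbind(precision, recall, f_score)
  list(
    accuracy  = sum(tp) / sum(conf_mat),
    per_class = per_class,
    macro     = colMeans(per_class),                          # equal weight per class
    weighted  = colSums(per_class * support) / sum(support)   # weight by class size
  )
}

# Hypothetical imbalanced data: 10 cases of class1, 90 cases of class2
conf_mat <- matrix(c(8,  2,
                     5, 85),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Actual    = c("class1", "class2"),
                                   Predicted = c("class1", "class2")))

m <- class_metrics(conf_mat)
m$accuracy    # 93 / 100
m$per_class   # F-score of class1 is 16/23, well below the accuracy
m$macro
m$weighted
```

A class that is never predicted gives a zero column sum, so its precision (and F-score) comes out as NaN in this sketch and would need separate handling.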