# Supervised Learning Classification Methods

The classification problem has a long history in the statistical literature; it has reappeared in the machine learning field as "supervised learning," but the goal is the same: create "rules" by which to categorize new observations into groups. We provide a brief overview of some common classification methods as potential models for eyewitness identification accuracy. The algorithms yield predicted decisions that can be compared to the underlying truth, and they identify the predictors that most influence those decisions. The goal is to minimize all types of errors. The resulting model can be adjusted by changing the error thresholds, depending on which error is considered more grievous. Once the model has been trained properly under the supervised learning framework and validated with representative test data, it can be applied to real-world data.
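As a minimal sketch of how moving a decision threshold trades one type of error for another, consider the following. The scores and truth labels are hypothetical values invented for illustration; they do not come from any eyewitness data set.

```python
# Hypothetical predicted probabilities that an identification is correct,
# paired with the ground truth (1 = correct ID, 0 = incorrect ID).
scores = [0.95, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]
truth = [1, 1, 0, 1, 0, 1, 0, 0]


def error_counts(scores, truth, threshold):
    """Return (false_positives, false_negatives) when scores at or above
    the threshold are classified as correct identifications."""
    fp = sum(1 for s, y in zip(scores, truth) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, truth) if s < threshold and y == 1)
    return fp, fn


# A lower threshold accepts more identifications (more false positives);
# a higher threshold rejects more (more false negatives).
for threshold in (0.35, 0.50, 0.75):
    fp, fn = error_counts(scores, truth, threshold)
    print(f"threshold={threshold:.2f}  false positives={fp}  false negatives={fn}")
```

If a false positive (accepting an incorrect identification) is judged the more grievous error, the higher threshold would be preferred in this toy example, at the cost of rejecting some correct identifications.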

The methods in this section differ from those in the previous sections in that they lack a meta-analytic framework; at this time, the described methods cannot accommodate one. Some researchers are exploring ways to integrate machine learning algorithms to aid in study selection and data extraction for systematic reviews and meta-analyses. Comparable methods have not yet been developed for the computational side of meta-analysis.

Classification methods yield point estimates for each data set; the variability of these estimates can be characterized using simulation and/or repetition. The true value of classification methods lies in how easily they are applied, which could be helpful for law enforcement agents, lawyers, and jurors. How well these models work in practice is as yet unknown, but it can be determined through simulation or application to real data sets.
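One common way to attach a variability measure to a point estimate by repetition is the bootstrap. The sketch below assumes a hypothetical test set of 100 classified cases, 70 of them classified correctly, and estimates the standard error of the accuracy; the function name and the data are illustrative assumptions only.

```python
import random
import statistics


def bootstrap_accuracy_se(correct, n_resamples=2000, seed=0):
    """Estimate the standard error of classification accuracy by
    resampling the per-case outcomes (1 = classified correctly)."""
    rng = random.Random(seed)
    n = len(correct)
    estimates = []
    for _ in range(n_resamples):
        # Draw n cases with replacement and recompute the accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(resample) / n)
    return statistics.stdev(estimates)


# Hypothetical outcomes: 70 correct and 30 incorrect classifications.
correct = [1] * 70 + [0] * 30
point_estimate = sum(correct) / len(correct)
standard_error = bootstrap_accuracy_se(correct)
print(f"accuracy = {point_estimate:.2f}, bootstrap SE = {standard_error:.3f}")
```

For a proportion, the bootstrap standard error should land close to the familiar binomial value, the square root of 0.7 × 0.3 / 100, which is roughly 0.046.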

## Machine Learning Classification Models

Common classification methods include linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), boosted logistic regression (in addition to standard logistic regression), decision trees, random forests, graphical models via Bayesian networks, support vector machines (SVMs), and neural networks. Brief descriptions of these methods, as well as graphical approaches, are provided in the following sections; see also *The Elements of Statistical Learning, 2nd Edition* by Hastie et al. (2016) for in-depth discussions of all methods. Some of these methods (SVMs, random forests, and neural networks) suffer from "black box" syndrome, where the results are not necessarily interpretable due to injected randomness, etc. Machine learning researchers have developed methodologies to mitigate this issue; a full treatment is beyond the scope of this chapter. These methodologies include the partial dependence plot (PDP) from Friedman (2001), local interpretable model-agnostic explanations (LIME) from Ribeiro et al. (2016), and Shapley additive explanations (SHAP) from Lundberg and Lee (2017).

*Discriminant Analysis.* LDA and QDA are conventional classification methods proposed by Fisher (1936) that use linear and quadratic decision boundaries, respectively, in the space spanned by the covariates that influence the outcome. In the framework of EWID, the outcome is "accuracy" (hit rate or 1 - false alarm rate), and vectors of covariates are used to predict eyewitnesses' decisions. The choice between LDA and QDA depends heavily on the structure and amount of data and on the assumption of normally distributed covariates; QDA for an underlying linear model results in highly biased predictions.
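The difference between the two boundaries can be sketched with a single covariate. The example assumes normally distributed covariate values within each class; LDA pools the variance across classes, while QDA estimates a separate variance per class. The data, labels, and function names are hypothetical.

```python
import math
import statistics


def fit(groups, pooled):
    """Fit one-covariate Gaussian discriminant analysis.
    groups maps a class label to its covariate values.
    pooled=True shares one variance across classes (LDA);
    pooled=False keeps one variance per class (QDA)."""
    n_total = sum(len(v) for v in groups.values())
    means = {k: statistics.mean(v) for k, v in groups.items()}
    variances = {k: statistics.variance(v) for k, v in groups.items()}
    if pooled:
        pooled_var = sum(
            (len(v) - 1) * variances[k] for k, v in groups.items()
        ) / (n_total - len(groups))
        variances = {k: pooled_var for k in groups}
    priors = {k: len(v) / n_total for k, v in groups.items()}
    return means, variances, priors


def classify(x, model):
    """Assign x to the class with the largest Gaussian discriminant score."""
    means, variances, priors = model

    def score(k):
        return (math.log(priors[k]) - 0.5 * math.log(variances[k])
                - (x - means[k]) ** 2 / (2 * variances[k]))

    return max(means, key=score)


# Hypothetical covariate values for correct and incorrect identifications.
groups = {
    "correct": [70, 80, 90, 85, 75],
    "incorrect": [20, 50, 80, 35, 65],
}
lda_model = fit(groups, pooled=True)
qda_model = fit(groups, pooled=False)
for x in (40, 70, 110):
    print(x, classify(x, lda_model), classify(x, qda_model))
```

With these made-up numbers the LDA rule is a single cut point midway between the class means, whereas the QDA rule, because the "incorrect" class is far more variable, assigns very large covariate values back to that class.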

*(Boosted) Logistic Regression.* Logistic regression (Section 21.3.4 in 21.2.1) can be made more powerful by "boosting," which was originally proposed by Schapire (1990). The idea was further adapted to gradient boosting machines by Friedman et al. (2000). Boosting combines the performance of many "weak" classifiers to produce a more powerful "committee." For EWID analysis, covariates are added to account for differences in probability for a correct or incorrect identification. For more on boosting, see Hastie et al. (2016).
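The "committee" idea can be sketched with AdaBoost over one-covariate decision stumps. Note that this is AdaBoost, chosen here because it is compact, and not the boosted logistic regression procedure itself; the data are hypothetical, with labels coded as +1 and -1.

```python
import math


def stump_predict(x, threshold, polarity):
    """A weak classifier: predict `polarity` at or above the threshold."""
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Build a weighted committee of stumps, reweighting the data each
    round so that misclassified cases receive more attention."""
    n = len(xs)
    weights = [1.0 / n] * n
    committee = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        committee.append((alpha, threshold, polarity))
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return committee


def predict(committee, x):
    """Classify by the sign of the weighted committee vote."""
    vote = sum(a * stump_predict(x, t, p) for a, t, p in committee)
    return 1 if vote >= 0 else -1


# Hypothetical data that no single stump can classify perfectly.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
one_round = adaboost(xs, ys, rounds=1)
committee = adaboost(xs, ys, rounds=3)
print(sum(predict(one_round, x) == y for x, y in zip(xs, ys)), "of 10 correct after 1 round")
print(sum(predict(committee, x) == y for x, y in zip(xs, ys)), "of 10 correct after 3 rounds")
```

On this toy data a single stump gets 7 of 10 cases right, while a committee of three stumps classifies all 10 training cases correctly.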

*Decision Trees and Random Forests.* Decision (or classification) trees provide the foundation for random forests; they are the individual units from which a forest is built. The goal of a decision tree is to create a model that predicts the value of a target variable based on several covariates. Nodes on the tree are the decision points that determine the path followed by a particular datum. Decision trees are simple to understand and easy to interpret. Given the data, the covariates are used as splitting variables to branch the data into sorted clusters. Splits are chosen based on the homogeneity of the observations in the child nodes that result from splitting a parent node. The terminal nodes give the decisions, as determined by the classification and regression tree (CART) algorithm (Breiman et al., 1984).
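To make "homogeneity of the child nodes" concrete, the sketch below scores every candidate split of a single covariate by the weighted Gini impurity of the two children, which is the usual CART criterion for classification. The covariate values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node with binary labels (0 = pure, 0.5 = evenly mixed)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_child_impurity) for the split
    `value < threshold` versus `value >= threshold` with the purest children."""
    n = len(values)
    best_threshold, best_impurity = None, float("inf")
    for threshold in sorted(set(values))[1:]:
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Hypothetical covariate (e.g., a confidence rating) and outcome (1 = correct ID).
confidence = [10, 20, 30, 40, 60, 70, 80, 90]
outcome = [0, 0, 1, 0, 1, 1, 1, 1]
threshold, impurity = best_split(confidence, outcome)
print(f"best split: confidence >= {threshold} (weighted Gini = {impurity:.4f})")
```

A full tree repeats this search within each child node, over all covariates, until a stopping rule is met.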

Random forests are ensemble classifiers based on decision trees. Votes arise from groups of decision trees. Tree bagging (bootstrap aggregating) draws repeated samples from the original data. Each sample is drawn randomly with replacement and is used to grow a classification tree. One generates *B* such trees. When one wants to classify a new observation, one uses each of the *B* trees in the "forest" (collection of de-correlated trees) and uses majority (or plurality) rule to assign the classification. This decreases the variance in the model. Random forests are also generated using feature bagging, where random samples of covariates are used for each tree rather than the entire set of covariates. For each candidate (observation), a random subset of features is obtained. An observation is classified by majority vote from all the trees. Explaining the concept of a random forest can be done using visualizations.
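The bagging and voting steps can be sketched as follows. To stay short, each "tree" here is a one-split stump on a single covariate, so feature bagging is not shown; the data and function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Choose the threshold with the fewest misclassifications, where
    the stump predicts 1 for values at or above the threshold."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_stumps(data, n_trees, seed=0):
    """Grow one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]
        forest.append(fit_stump(sample))
    return forest


def majority_vote(forest, x):
    """Classify x by the most common prediction across all stumps."""
    votes = Counter(int(x >= threshold) for threshold in forest)
    return votes.most_common(1)[0][0]


# Hypothetical (covariate, outcome) pairs; outcome 1 = correct identification.
data = [(10, 0), (20, 0), (30, 1), (40, 0), (60, 1), (70, 1), (80, 1), (90, 1)]
forest = bagged_stumps(data, n_trees=25)
print(majority_vote(forest, 85), majority_vote(forest, 15))
```

Individual stumps vary from one bootstrap sample to the next, but the aggregated vote is much more stable, which is the variance reduction described above.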

*Support Vector Machines (SVMs).* Like other supervised learning algorithms, SVMs take the EWID covariates as input and build the model from training data. SVMs construct a hyperplane (i.e., some sort of separator) that is used to separate the data; this high-dimensional divider classifies the data into groups based on the interaction of several covariates, and its dimension depends on the number of included covariates. Conveying this concept of high dimensionality to laypeople may be difficult, which may affect the use of SVMs in EWID and law enforcement settings. While SVMs can be effective and accurate in prediction in some circumstances, both the SVM algorithm and the output are difficult to interpret, making SVMs possibly problematic for a court setting.
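Once an SVM has been trained, classification with a linear kernel reduces to checking which side of the hyperplane a point falls on. The sketch below shows only that decision rule, not the training step; the weight vector `w` and offset `b` are hypothetical values standing in for a fitted model.

```python
import math


def decision_value(w, b, x):
    """Signed value of w . x + b; zero means x lies on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from x to the separating hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


# Hypothetical fitted hyperplane in a two-covariate space.
w, b = (3.0, 4.0), -10.0
for point in ((2.0, 3.0), (1.0, 1.0)):
    print(point, classify(w, b, point), round(distance_to_hyperplane(w, b, point), 2))
```

With more covariates the same rule applies unchanged; only the length of `w` grows, which is why the separator is hard to picture beyond two or three dimensions.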

*Neural Networks.* A neural network is a black box method built from layers of neurons. Each neuron $j$ receives an input $p_j(t)$, changes its internal state (activation) $a_j(t)$ based on that input, and produces an output. A threshold $\theta_j$ determines activation and enters the activation function, $a_j(t+1) = f(a_j(t), p_j(t), \theta_j)$. The output function is expressed as $o_j(t) = f_{out}(a_j(t))$. The network is formed by connecting several of these neurons. Neural networks are flexible and can model a variety of functional forms, making them useful for complex and/or abstract problems. Like other machine learning algorithms, neural networks require training and computational resources. The covariates in an EWID experiment are used to determine the hidden units of the neural network, which are processed by the output function, resulting in a decision for each person. The decision from the algorithm for each person can then be compared to the person's actual outcome.
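A forward pass through a very small network can be sketched as follows. The sketch assumes a logistic (sigmoid) activation, treats each neuron's threshold as an offset subtracted from its weighted input, and ignores any dependence on the neuron's previous state; all weights and thresholds are hypothetical rather than trained values.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real input to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, threshold):
    """Weighted net input minus the threshold, passed through the activation."""
    net_input = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(net_input - threshold)


def forward(covariates, hidden_layer, output_neuron):
    """Covariates -> hidden units -> single output in (0, 1)."""
    hidden = [neuron_output(covariates, w, t) for w, t in hidden_layer]
    out_weights, out_threshold = output_neuron
    return neuron_output(hidden, out_weights, out_threshold)


# Hypothetical network: two covariates, two hidden units, one output unit.
hidden_layer = [([0.8, -0.4], 0.1), ([-0.3, 0.9], 0.2)]
output_neuron = ([1.2, -0.7], 0.05)
decision_probability = forward([1.0, 0.5], hidden_layer, output_neuron)
decision = int(decision_probability >= 0.5)
print(round(decision_probability, 3), decision)
```

Training would adjust the weights and thresholds so that the resulting decisions agree with the known outcomes as often as possible.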

## Graphical Models

Graphical models, used in other forensic analyses, are also useful for the EWID paradigm (Dawid and Mortera, 2017). Luby explored this approach with log-linear analysis (Luby, 2016). In this model, the data are in the form of a multi-way table with Target Absence/Presence (2 levels) × Eyewitness Decision (2 levels) × ECL (11 levels) × Witness Instructions (2+ levels); additional variables can be included without changing the theoretical foundation for the analysis. The model is fit iteratively to find the expected counts for each cell using a training set of data. Based on the experiment and corresponding data, we generate different graphical models as follows. Let $\alpha$ represent the main effects, $\beta$ represent the two-way interactions, subscript $wc$ represent witness choice, subscript $t$ represent target absence or presence, $i$ represent witness instructions, and $c$ represent ECL. Equation 21.19 shows an example of a fitted model (Figure 21.8). The model can include system and estimator variables, previously discussed in Section 21.2 of part 21.1.
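Iterative proportional fitting (IPF) is one standard way to obtain expected cell counts for a log-linear model; the text does not say which algorithm was used, so the following is a sketch under that assumption. The counts are hypothetical, and the example fits the independence model to a 2 × 2 table of target presence by witness decision.

```python
def ipf(observed, margins, n_iter=50):
    """Fit expected counts by iterative proportional fitting.
    observed maps a cell (tuple of level indices) to its count.
    margins lists tuples of axis positions whose observed margins
    the fitted table must reproduce."""
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(n_iter):
        for axes in margins:
            obs_margin, fit_margin = {}, {}
            for cell, count in observed.items():
                key = tuple(cell[a] for a in axes)
                obs_margin[key] = obs_margin.get(key, 0.0) + count
                fit_margin[key] = fit_margin.get(key, 0.0) + fitted[cell]
            # Rescale each cell so the fitted margin matches the observed one.
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                fitted[cell] *= obs_margin[key] / fit_margin[key]
    return fitted


# Hypothetical counts: (target present?, witness made an identification?).
observed = {(0, 0): 30, (0, 1): 10, (1, 0): 20, (1, 1): 40}

# Independence model: match each one-way margin, with no interaction term.
expected = ipf(observed, margins=[(0,), (1,)])
for cell in sorted(expected):
    print(cell, round(expected[cell], 2))
```

For higher-way tables the same function applies with longer cell tuples; passing two-axis margins such as `(0, 1)` corresponds to including that two-way interaction in the model.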


Garbolino discusses the use of Bayesian networks for evaluating testimony (Garbolino, 2016); Garbolino's model is actually very general, and applies to testimony of any kind, not just from an eyewitness. The proposed model assumes that the witness is: (1) accurate, (2) objective, and (3) truthful. Each of these characteristics corresponds to an inference about the witness' personality:

1. Senses give evidence of what is seen;
2. Belief in the evidence from the senses;
3. Belief in what is said.

**FIGURE 21.8**

This is the graphical model corresponding to the log-linear model in Equation 21.19.

In the end, Garbolino proposes an object-oriented Bayesian network class for the analysis of the reliability of human witnesses. D'Agostini notes that Bayesian networks are a technical tool, but their true value is as a very powerful conceptual tool that can handle complex problems with variables related by both probabilistic and causal links (D'Agostini, 2016). Even with subjective probability (i.e., eyewitness testimony), the intuitive idea of probability can be recovered.
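As a rough illustration of how probabilistic links in such a network combine, the sketch below chains four binary variables (event, senses, belief, report) and computes the probability that the event occurred given that the witness reports it. This simple chain and every probability in it are assumptions made for illustration; it is not Garbolino's network.

```python
from itertools import product

# Hypothetical conditional probabilities, chosen only for illustration.
P_EVENT = 0.5                    # prior probability that the event occurred
P_SENSED = {1: 0.9, 0: 0.2}      # P(senses indicate the event | event)
P_BELIEVED = {1: 0.95, 0: 0.1}   # P(witness believes it | senses)
P_REPORTED = {1: 0.9, 0: 0.05}   # P(witness reports it | belief)


def bernoulli(p, value):
    """Probability of a binary value when P(value = 1) is p."""
    return p if value == 1 else 1 - p


def joint(event, sensed, believed, reported):
    """Joint probability, factored along the chain of links."""
    return (bernoulli(P_EVENT, event)
            * bernoulli(P_SENSED[event], sensed)
            * bernoulli(P_BELIEVED[sensed], believed)
            * bernoulli(P_REPORTED[believed], reported))


def posterior_event_given_report():
    """P(event = 1 | report = 1), summing out senses and belief."""
    numerator = sum(joint(1, s, b, 1) for s, b in product((0, 1), repeat=2))
    denominator = sum(joint(e, s, b, 1) for e, s, b in product((0, 1), repeat=3))
    return numerator / denominator


posterior = posterior_event_given_report()
print(f"P(event | witness reports it) = {posterior:.4f}")
```

The three links loosely mirror the three characteristics listed above: how accurate the senses are, how objectively the witness forms a belief, and how truthfully that belief is reported.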

# Tools Based on ROC Methods

The popularity of the ECL-based ROC curve to compare lineup procedures, together with its limitations (see Section 1.3.1), leads us to consider other methods that augment and improve upon ROC curves for a more complete comparison between methods. We discuss the PROC curve (which uses PPV and NPV in much the same way that HR and FAR are used in ROC curves), multivariate ROC curves, and AUC estimation for these curves. We also discuss the inclusion of variability measures for ROC curves that could also be adapted for the PROC curve and multivariate ROC curves.
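The quantities named above are related by simple arithmetic, sketched below. The example assumes that HR and FAR are conditional on target presence and absence, respectively, that the base rate of target presence is known, and that AUC is approximated by the trapezoidal rule over the full curve; the numerical values are hypothetical.

```python
def ppv_npv(hit_rate, false_alarm_rate, base_rate):
    """Convert HR and FAR into PPV and NPV for a given base rate
    (the proportion of lineups in which the target is present)."""
    tp = hit_rate * base_rate
    fp = false_alarm_rate * (1 - base_rate)
    fn = (1 - hit_rate) * base_rate
    tn = (1 - false_alarm_rate) * (1 - base_rate)
    return tp / (tp + fp), tn / (tn + fn)


def trapezoid_auc(points):
    """Area under a curve given as (x, y) points, by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical operating point and base rate.
ppv, npv = ppv_npv(hit_rate=0.6, false_alarm_rate=0.1, base_rate=0.5)

# Hypothetical ROC curve through (FAR, HR) = (0.1, 0.6), with both endpoints.
auc = trapezoid_auc([(0.0, 0.0), (0.1, 0.6), (1.0, 1.0)])
print(round(ppv, 3), round(npv, 3), round(auc, 3))
```

Unlike HR and FAR, PPV and NPV change with the base rate, so the same lineup procedure can yield different values under different assumed rates of target presence.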