Performance Characteristics for Automatic LR Methods

Currently, accuracy, discrimination, calibration, robustness, monotonicity[1] and generalization have been identified as the performance characteristics relevant to the validation of automatic likelihood ratio (LR) methods (Meuwly et al., 2017). Performance metrics and graphical representations are associated with each performance characteristic for the measurement and representation of the method's performance.

Accuracy, discrimination and calibration have been defined as primary performance characteristics, as they relate directly to performance metrics and focus on desirable properties of the LR methods. They address the required behavior of the automatic LR method if it is intended to be fit for purpose. In Meuwly et al. (2017), their selection is based on the statistics literature on the evaluation of Bayesian probabilities, and in particular on the use of proper scoring rules.

Robustness, monotonicity and generalization have been identified as secondary performance characteristics. They describe how the primary characteristics behave in different conditions representing the extreme variability of forensic casework. Factors of variability are usually degrading, e.g., data sparsity, quality of the specimens or mismatch in the conditions between training data and operational data.

Empirical Validation

Empirical validation is strictly necessary before making use of a new method in practice, because the variability and often low quality of the operational data analyzed may cause otherwise sound LR models to behave undesirably. Among the most common degrading factors are data sparsity, high variability in the quality of specimens, and a shift between the conditions of the data used for LR model training and those of the data captured in the different forensic scenarios.

As a central procedure of the validation process, performance measurement requires careful definition. In particular, the performance characteristics must guarantee that the likelihood ratios are fit for purpose, and that they have desirable properties under operational conditions.

Some definitions are given here for better understanding of the rest of the chapter[2]:

  • A performance characteristic represents the answer to the question "What to measure?" It is a characteristic of an LR method that is thought to have an influence on the desired or undesired behavior of a given interpretation method. For example, we want LR values that help the trier of fact to reach better decisions, and in that sense the LR values should possess the performance characteristic defined as accuracy[3].
  • A performance metric represents the answer to the question "How to measure?" It gives a quantitative measure of a performance characteristic, usually as a scalar. For the performance characteristic defined above as accuracy, the performance metric can be implemented by the use of proper scoring rules (DeGroot and Fienberg, 1982; Gneiting and Raftery, 2007) on an empirical set of likelihood ratios (see Section 1.4.1). Thus, this performance metric will yield a single number that measures accuracy: the lower this number, the better the accuracy[4], and vice versa.
  • A validation criterion represents the answer to the question "What performance is needed to regard a method as valid?" It is defined as the decision rule to determine when a method is acceptable and fit for purpose according to a given performance characteristic. For the performance metric for accuracy defined above (empirical average of a proper scoring rule), a possible validation criterion is a scalar threshold over the performance metric. When the metric is above the threshold, the method is not validated with respect to accuracy, and vice versa.
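The relationship between a performance metric and a validation criterion can be illustrated with a minimal sketch. It assumes that the proper scoring rule is the logarithmic one, whose empirical average over same-source and different-source LR values is the log-likelihood-ratio cost (Cllr), and that all LR values are strictly positive and finite. The function names and the threshold value are hypothetical and chosen only for illustration.

```python
import math


def cllr(lrs_same_source, lrs_different_source):
    """Empirical average of the logarithmic scoring rule (Cllr).

    Assumes every LR is strictly positive and finite.
    Lower values indicate better accuracy.
    """
    # Penalty for same-source comparisons: grows as the LR falls below 1.
    penalty_ss = sum(math.log2(1 + 1 / lr) for lr in lrs_same_source)
    penalty_ss /= len(lrs_same_source)
    # Penalty for different-source comparisons: grows as the LR rises above 1.
    penalty_ds = sum(math.log2(1 + lr) for lr in lrs_different_source)
    penalty_ds /= len(lrs_different_source)
    return 0.5 * (penalty_ss + penalty_ds)


def passes_accuracy_criterion(metric_value, threshold):
    """Validation criterion: the method passes when the metric does not exceed the threshold."""
    return metric_value <= threshold


# Hypothetical empirical LR set with known ground truth.
same_source = [100.0, 50.0, 8.0]
different_source = [0.01, 0.2, 0.05]
value = cllr(same_source, different_source)
print(value, passes_accuracy_criterion(value, threshold=0.5))  # threshold is illustrative
```

A method that always reports LR = 1 obtains a Cllr of exactly 1, which is why values well below 1 are usually taken to indicate an informative method.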

Validation Protocol

The validation protocol begins with a validation plan describing the experiments. This plan lists the performance characteristics considered for validation of the method and the performance metrics and graphical representations used to assess those performance characteristics. It also describes the aim of the experiments, the data used and the validation criteria applicable. In order to get more insight into the expected performance of the method, a comparison with either the current state of the art or with a baseline method can be performed, which provides an initial set of validation criteria.

Experiments are performed in two stages: the first entails the development and validation of the method, and the second its validation under varying conditions. The first stage uses a training dataset (with a known ground truth) to select the automatic LR method, and to refine the parameters of this method and of the statistical models involved in it. The aim is to measure the primary performance characteristics of the method and to obtain the best performance with the most representative dataset for the widest possible range of conditions.

FIGURE 7.1

Diagram describing the development and validation stages of the validation process.

The validation of the developed method for varying conditions consists in measuring its performance on a previously unseen set of data captured under forensic conditions (with a known ground truth), using both the primary and secondary performance characteristics. The aim is to test the automatic LR method under conditions that are as similar as possible to conditions in forensic casework, and to arrive at the validation decision. If a dataset is used to assign the value of some hyperparameter, which is often the case in the method development stage, then the same dataset should not be used to estimate the performance in the validation stage. The reason is to avoid a possible inadequate generalization to new data in casework (overfitting). The validation experiments in two stages are summarized in the flowchart shown in Figure 7.1.
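The requirement that development data are not reused for performance estimation can be enforced with a simple guard, sketched below. It assumes that every item (for example, a specimen or a comparison) carries a unique string identifier; the function name and the identifiers are hypothetical.

```python
def check_disjoint(development_ids, validation_ids):
    """Raise ValueError if any item identifier appears in both datasets."""
    overlap = set(development_ids) & set(validation_ids)
    if overlap:
        raise ValueError(
            f"{len(overlap)} item(s) shared between development and "
            f"validation data: {sorted(overlap)}"
        )
    return True


# Hypothetical identifiers: the two sets share nothing, so the check succeeds.
check_disjoint(["dev_001", "dev_002"], ["case_001", "case_002"])
```

Such a check only detects literal reuse of items. It does not establish that the validation data are representative of casework conditions, which remains a matter of experimental design.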

Finally, the results of the validation experiments are summarized in a validation report, recording the decision of acceptance or rejection, depending on whether the experimental results meet the validation criteria or not. A validation decision should always be linked to a specific set of experimental conditions determining the scope of validity of the method.

The protocol for the validation of an automatic LR method is summarized in the validation matrix shown in Table 7.1. Note that all the validation processes, seen as the columns of the validation matrix in Table 7.1, apply to each of the performance characteristics (i.e., all the rows in Table 7.2). A validation process could therefore end with a "pass" validation decision for some characteristics and with a "fail" validation decision for others. Whether to apply the method in casework is the decision of the forensic science institute, but the validation report should be transparent and made public.
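One row of the validation matrix can be sketched as a small data structure. The sketch assumes a metric for which lower values are better (as with a proper scoring rule penalty) and a scalar threshold as the validation criterion; the class name, field names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ValidationRow:
    """One performance characteristic in a validation matrix (illustrative)."""
    characteristic: str
    metric_name: str
    metric_value: float    # result obtained by the method under validation
    baseline_value: float  # result obtained by the baseline method
    threshold: float       # validation criterion (lower-is-better metric)

    def relative_change_percent(self):
        """Result expressed as +/- [%] compared to the baseline."""
        return 100.0 * (self.metric_value - self.baseline_value) / self.baseline_value

    def decision(self):
        """Pass when the metric does not exceed the threshold, Fail otherwise."""
        return "Pass" if self.metric_value <= self.threshold else "Fail"


# Hypothetical results: each characteristic receives its own decision.
rows = [
    ValidationRow("Accuracy", "Cllr", 0.30, 0.40, 0.35),
    ValidationRow("Discrimination", "EER", 0.12, 0.10, 0.10),
]
for row in rows:
    print(row.characteristic, f"{row.relative_change_percent():+.1f}%", row.decision())
```

Because the decision is taken per row, the same method can pass for one characteristic and fail for another, as in this illustrative data.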

The guideline for validation proposed in Meuwly et al. (2017) is the first initiative in a long-term effort. It will be improved in the future, considering suggestions from others (see, for example, Alberink et al., 2017).

An example of a validation report using development and forensic data can be found in Ramos et al. (2017). It is linked to the necessary data used to reproduce the results, in the form of empirical sets of likelihood ratios with corresponding ground-truth labels. Interested researchers can access the data and follow the set of steps presented in this report, which can help them to proceed with the empirical validation of their own methods.

Moreover, a toolbox for performance assessment is available with the main tools necessary to generate the performance metrics and graphical representations needed to validate an LR method from an empirical set of LR values. This toolbox is freely available online (https://sites.google.com/site/perfevtoolbox/).

TABLE 7.1

Validation Matrix for Automatic Likelihood Ratio Methods

| Performance Characteristic | Performance Metrics | Graphical Representation | Validation Criteria | Experiments | Data | Results | Validation Decision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| For each listed characteristic | As appropriate for characteristic | As appropriate for characteristic | According to the definition | Description of the experimental settings | Data used | +/- [%] compared to the baseline | Pass/Fail |

TABLE 7.2

Performance Characteristics and Examples of Performance Metrics and Graphical Representations

| Performance Characteristic | Performance Metrics Examples | Graphical Representation Examples |
| --- | --- | --- |
| Accuracy | Empirical average of a proper scoring rule for a given prior probability, such as Cllr. | Prior-dependent representation of a proper scoring rule, such as an ECE plot. |
| Discrimination | Discrimination component of the empirical average of a proper scoring rule for a given prior probability, such as Cllr^min or EER. | Discrimination component of a prior-dependent representation of a proper scoring rule, such as an ECE^min plot or a DET plot. |
| Calibration | Calibration component of the empirical average of a proper scoring rule for a given prior probability, such as Cllr^cal. | Calibration component of a prior-dependent representation of a proper scoring rule, such as an ECE^cal plot. Also visible in the symmetry of a Tippett plot (i.e., cumulative histograms). |
| Robustness, Monotonicity, Generalization | Variation of primary metrics such as Cllr or EER, range of LR values. | Variation of primary representations such as ECE, Tippett or DET plots. |
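As an illustration of a discrimination metric, the equal error rate (EER) can be approximated from an empirical set of LR values by sweeping a decision threshold, as in the sketch below. This is a simple threshold sweep without interpolation, not the procedure of the toolbox mentioned above; the function name and the example values are hypothetical.

```python
import math


def equal_error_rate(lrs_same_source, lrs_different_source):
    """Approximate the EER by sweeping a threshold over the observed LR values.

    Returns the mean of the two error rates at the threshold where the
    miss rate and the false-alarm rate are closest to each other.
    """
    thresholds = sorted(set(lrs_same_source) | set(lrs_different_source))
    thresholds.append(math.inf)  # threshold above every observed LR
    best_gap, best_eer = math.inf, None
    for t in thresholds:
        # Same-source comparisons falling below the threshold are misses.
        miss = sum(lr < t for lr in lrs_same_source) / len(lrs_same_source)
        # Different-source comparisons at or above the threshold are false alarms.
        false_alarm = sum(lr >= t for lr in lrs_different_source) / len(lrs_different_source)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, best_eer = gap, (miss + false_alarm) / 2
    return best_eer


# Hypothetical, perfectly separated LR sets give an EER of zero.
print(equal_error_rate([10.0, 20.0, 30.0], [0.1, 0.2, 0.3]))
```

The EER depends only on the ordering of the LR values, not on their magnitudes, which is why it measures discrimination and carries no information about calibration.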

  • [1] This was previously referred to as coherence (Haraksim et al., 2015), but the name was changed for the sake of clarity, and in order to avoid confusion with statistical coherence.
  • [2] As explained in Meuwly et al. (2017), these terms have been defined to be, as much as possible, in accordance with relevant ISO standards.
  • [3] Here, it can be seen that we define accuracy in terms of proper scoring rules, in contrast to its usual definition. See Section 1.4.1.
  • [4] As we will see, the average of a proper scoring rule yields a penalty, which is lower when the accuracy is better.
 