Discussion and Recommendations

This chapter has reviewed the three main approaches to assessing auto-contouring: quantitative assessment, subjective studies, and the evaluation of clinical impact. All of these approaches have limitations and challenges, the greatest of which is inter-observer variation. If experts cannot agree on, or be relied on to delineate, contours in the same location, then auto-contouring methods cannot be expected to do so either.

Recommendations as to how clinical commissioning and validation of auto-contouring should be performed have been proposed [41]. While many studies have been performed since these recommendations were made, they are still valid. However, these recommendations do not distinguish between the various purposes of evidence gathering or the nature of the evaluation that each requires. Therefore, Table 15.4 adds to the recommendations of Valentini et al. [41] by proposing the type of evaluation that should be performed depending on the context.

Noting the challenge inter-observer variation brings, the following statement from the previous recommendation is highlighted: “...must still be considered along with what we refer to as the ‘benchmark trap’: are we confident that the daily ‘human’ inter-observer variability could show better performances, in terms of dose and volumes, when considering a comparison with the same manual benchmark?”

Any evaluation must consider the variability of the benchmark. Inter-observer variation should be considered in all studies to provide context for the results presented.
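One way to put the benchmark's variability alongside an auto-contouring result is to report the same measure for every pair of observers and for the auto-contour against each observer. The following is a minimal sketch of that idea, not a method prescribed by the recommendations. It assumes each contour is available as a set of voxel indices and uses the volumetric Dice coefficient purely for illustration; the names `dice`, `benchmark_context`, and `square` and the toy square contours are hypothetical.

```python
from itertools import combinations
from statistics import mean


def dice(a, b):
    """Dice similarity coefficient between two sets of voxel indices."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty contours agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


def benchmark_context(observers, auto):
    """Return (mean inter-observer Dice, mean auto-vs-observer Dice)."""
    inter = [dice(a, b) for a, b in combinations(observers, 2)]
    auto_vs = [dice(auto, obs) for obs in observers]
    return mean(inter), mean(auto_vs)


def square(x0, y0, size):
    """Toy 2D 'contour': all pixels in a size-by-size square (illustrative only)."""
    return {(x, y) for x in range(x0, x0 + size) for y in range(y0, y0 + size)}


# Three hypothetical observers who disagree by a one-pixel shift,
# and an auto-contour that is shifted by a similar amount.
observers = [square(0, 0, 10), square(1, 0, 10), square(0, 1, 10)]
auto = square(1, 1, 10)

inter, auto_vs = benchmark_context(observers, auto)
print(f"inter-observer Dice: {inter:.2f}, auto vs observers: {auto_vs:.2f}")
```

In this toy case both values are 0.87, so the auto-contour differs from the observers no more than they differ from each other. Quoted on its own, the same score for the auto-contour would give no indication of this.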

TABLE 15.4. Recommended Approaches for Validation of Auto-Contouring for the Different Purposes of Evaluation

Purpose of Study | Validation Approach(es) Suggested

Development: The validation of a newly proposed method of auto-contouring

Quantitative measures are the primary method, since these do not require clinical time. Current state-of-the-art measures, such as added path length or surface Dice, should be used to place a clinical interpretation on results. Further measures should be reported for context.

Inter-observer variation should be reported using the same measures.

If expert opinion is available, blinded subjective evaluation to assess potential clinical performance is highly beneficial. Clinical contours should be assessed alongside auto-contours as a benchmark.

Comparison: The validation of two or more auto-contouring methods for the purpose of determining their relative strengths

For challenge-type evaluation, only quantitative assessment can be performed effectively for multiple methods. Inter-observer variation must be considered before drawing strong conclusions as to a “best” method. Methods using tolerances, such as added path length or surface Dice, should be considered.

Clinical commissioning: The validation of an auto-contouring method for the purpose of evaluating its suitability for clinical use

Methods assessing clinical impact should be the primary mode of validation. Care should be taken in the implementation of such evaluations to avoid bias. Consideration should be given to blinding to measure the impact that using an alternative clinical contour would have in the same setting. Evaluation should be performed against current clinical practice rather than artificial contouring scenarios.

Blinded subjective assessment is additionally recommended, since it enables larger datasets and multiple observers to be included in validation.

Quantitative evaluation could be performed to demonstrate that performance in the clinic is similar to that observed in development studies. However, quantitative measures cannot demonstrate clinical acceptability.
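The tolerance-based measures recommended in the table can be illustrated with a simplified surface Dice calculation. The sketch below is an assumption-laden illustration rather than a reference implementation. It treats each contour as a list of sampled 2D boundary points, uses a brute-force nearest-point search, and counts a point as agreeing if it lies within a tolerance `tol` of the other contour. The names `surface_dice`, `count_within`, and `square_boundary` and the example tolerances are hypothetical.

```python
import math


def count_within(points, reference, tol):
    """Count the points that lie within `tol` of their nearest reference point."""
    return sum(
        1 for p in points
        if min(math.dist(p, q) for q in reference) <= tol
    )


def surface_dice(contour_a, contour_b, tol):
    """Simplified surface Dice on sampled boundary points.

    Fraction of all boundary points (from both contours) that lie within
    `tol` of the other contour. Assumes both contours are non-empty and
    sampled at a similar density.
    """
    close_a = count_within(contour_a, contour_b, tol)
    close_b = count_within(contour_b, contour_a, tol)
    return (close_a + close_b) / (len(contour_a) + len(contour_b))


def square_boundary(x0, y0, side):
    """Integer boundary points of an axis-aligned square (toy contour)."""
    pts = set()
    for i in range(side + 1):
        pts.update({
            (x0 + i, y0), (x0 + i, y0 + side),  # bottom and top edges
            (x0, y0 + i), (x0 + side, y0 + i),  # left and right edges
        })
    return sorted(pts)


# Two hypothetical contours that differ by a one-unit shift in x.
manual = square_boundary(0, 0, 10)
auto = square_boundary(1, 0, 10)

strict = surface_dice(manual, auto, tol=0.5)   # tolerance below the shift
lenient = surface_dice(manual, auto, tol=1.0)  # tolerance equal to the shift
print(f"surface Dice at 0.5: {strict:.2f}, at 1.0: {lenient:.2f}")
```

With a tolerance smaller than the shift, only the overlapping edges count as agreeing (0.50 here). Once the tolerance reaches the size of the shift, the two contours are scored as equivalent (1.00). This is why the tolerance is a natural place to encode the level of inter-observer variation considered acceptable.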

