On the Evaluation of Auto-Contouring in Radiotherapy
Mark J. Gooding
At first glance, the evaluation of auto-contouring performance for radiotherapy seems straightforward: there is a wide range of quantitative measures that can be used to assess the correctness of an automatically generated contour against the ground truth. However, scratching the surface of this topic reveals numerous challenges that lead to a range of assessments being performed in practice. Even where essentially the same assessment has been performed, the results may not be directly comparable because of implementation differences.
Table 15.1 reviews publications considering the assessment of auto-contouring in the context of radiotherapy. This is not intended as a comprehensive review, and it should be noted that auto-contouring is used in other contexts, such as neurology, where alternative assessments may have been performed. Table 15.1 therefore reviews what has been applied in the domain of radiotherapy, not what could be applied, with a view to illustrating current practice in this field. Even at the high level of the information provided in the table, it can be observed that there is significant variation in the assessment of auto-contouring, both in the approaches taken and in the scale of the evaluation performed. Nevertheless, there are commonalities between the assessments, allowing broad categorization both by the purpose of the evaluation and by the type of evaluation being performed.
The purpose of the evaluation can be considered as falling into three classes: evaluation/demonstration of the benefits of a technical method being developed (Development), comparison of two or more methods to ascertain relative performance (Comparison), or the assessment of auto-contouring performance with the intent of clinical use (Commissioning). Some evaluations fall into more than one of these categories; for example, it is common to show the benefits of a new method (Development) with respect to an existing benchmark method (Comparison). Although the purpose of evaluation may vary, there are common approaches to assessment used in each.
As noted in Gooding et al., the types of evaluation can be grouped into quantitative, subjective, and clinical. While subjective and clinical assessments are often quantified in some way (e.g. time saving in minutes is a quantity), quantitative assessment can be defined as the calculation of the similarity or difference of a test contour with respect to a defined “ground truth”. Subjective assessment can be defined as the evaluation made by an observer expressing their opinion of a contour’s quality, notwithstanding that the “ground truth” used for quantitative assessment may itself be a person’s subjective opinion as to the correct contour for an organ or region. Finally, clinical assessment can be defined as the evaluation or measurement of the impact that the use of auto-contouring has on the clinical workflow, although such clinical assessment may be quantified, dependent on subjective opinion, or both.
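To make the definition of quantitative assessment concrete, the following is a minimal sketch of one similarity calculation between a test contour and a "ground truth" contour. It assumes that both contours have already been rasterized to binary masks on the same grid, and it uses the Dice similarity coefficient purely as an illustrative choice of measure; the function name `dice_similarity` and the example masks are hypothetical and are not taken from any publication reviewed here.

```python
from typing import Sequence

# Assumption: a contour is represented as a 2D binary mask,
# with 1 inside the contour and 0 outside.
Mask = Sequence[Sequence[int]]


def dice_similarity(test: Mask, reference: Mask) -> float:
    """Return the Dice similarity coefficient of two equally sized binary masks."""
    if len(test) != len(reference) or any(
        len(t_row) != len(r_row) for t_row, r_row in zip(test, reference)
    ):
        raise ValueError("masks must have the same shape")

    overlap = test_area = reference_area = 0
    for t_row, r_row in zip(test, reference):
        for t, r in zip(t_row, r_row):
            test_area += bool(t)
            reference_area += bool(r)
            overlap += bool(t) and bool(r)

    if test_area + reference_area == 0:
        # Both masks are empty. Returning 1.0 is a convention chosen for this
        # sketch; other implementations may return 0.0 or an undefined value.
        return 1.0
    return 2.0 * overlap / (test_area + reference_area)


# Hypothetical example: a 4-pixel reference contour and a test contour
# that shares 3 of those pixels and adds 1 pixel outside the reference.
reference_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
test_mask = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
score = dice_similarity(test_mask, reference_mask)  # 2 * 3 / (4 + 4) = 0.75
```

Even in this small sketch, the handling of two empty masks is a choice rather than a given, which is one example of how implementation differences can make nominally identical assessments difficult to compare.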
In this chapter, each of these categories of evaluation method will be explored in more detail. The strengths and limitations of each method will be considered, and the implementation challenges and their impact on the results will be discussed.