Strengths and Limitations
Prior to considering specific methods or implementation details, the strengths and limitations of quantitative assessment are first considered, given the general scope definition of quantitative assessment as the calculation of the similarity or difference of a test contour with respect to a defined “ground truth”.
The strengths of quantitative assessment come primarily from the calculation of the similarity or difference. In performing a calculation against a fixed “ground truth”, the derived score is both objective and deterministic. As noted already, these properties mean that quantitative assessment approaches lend themselves to the development framework, whereby one auto-contouring method can be compared against another to demonstrate improvement resulting from research. To be able to measure this improvement, an assessment method is required that does not depend on the person running the evaluation (objective) and gives the same answer for the same experiment every time (deterministic). Without these properties, a measure could suggest improvement where there has been no change. Furthermore, once the “ground truth” against which evaluation is performed is available, the calculation itself requires minimal human effort, an additional property that makes quantitative assessment highly suited to the development framework. This is reflected in the extensive use of quantitative scoring in “grand challenges” [2-4].
In contrast, the limitations of quantitative assessment stem from the similarity or difference ... with respect to a defined “ground truth”. First, there is a challenge as to what “ground truth” is. While a region of an image is either part of a particular object or not - at least in an ontological sense, even if its anatomical boundary is hard to define - determining the boundary on the image is a challenge not only for the computer but also for the expert human observer. Consequently, a single observer may draw a different boundary on a different occasion for the same image, whether from random fluctuations in the accuracy/precision of their drawing or from a change in perception as to what the image shows. This is known as intra-observer variation. Furthermore, different observers may also draw regions differently, again as a result of random variation in contouring or differences in perception, or additionally resulting from differences in their definitions of the region. This is known as inter-observer variation. Therefore, the notion of “ground truth” is misleading, as it suggests that the contour being used as the reference is in some way more correct than any other expert contour, and it does not allow for the possibility that the test contour may be more correct than the reference in some places.
Second, the similarity or difference is defined in a way that makes it calculable and expresses something about the agreement of the test contour with the reference; however, such a calculation does not necessarily relate to suitability for clinical use. For example, the volume of an auto-contoured region may be compared to that of the reference; a difference shows that the auto-contoured region has some error in size, yet the absence of a difference does not demonstrate that the auto-contoured region is accurate, as it may not be in the correct location or of the correct shape.
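This pitfall is easy to demonstrate. The sketch below (illustrative toy masks, not a clinical example) constructs two binary regions of identical volume but with no overlap at all: a volume-difference measure reports perfect agreement even though the test contour is entirely misplaced.

```python
import numpy as np

# Two binary masks of equal size but zero overlap: a volume-difference
# measure reports "no error", yet there is no spatial agreement at all.
reference = np.zeros((20, 20), dtype=bool)
reference[2:8, 2:8] = True        # 36 pixels

test = np.zeros((20, 20), dtype=bool)
test[12:18, 12:18] = True         # also 36 pixels, different location

volume_difference = int(test.sum()) - int(reference.sum())
overlap = int(np.logical_and(reference, test).sum())

print(volume_difference)  # 0 -> suggests a perfect size match
print(overlap)            # 0 -> but the regions do not coincide anywhere
```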
These limitations of a reference that is the subject of inter-observer variation and the clinical meaning of the similarity must be borne in mind as the implementation of quantitative assessment is discussed.
There are many choices to make in the implementation of quantitative evaluation measures. Some of these are dependent on the measure being used and some will be dependent on the biological region being segmented, while other choices are independent of both.
The first choice to be made is the region representation to use for the assessment. A binary 3D voxel-based representation is most commonplace in research, as this representation lends itself to simple implementation. However, regions are normally saved in the Digital Imaging and Communications in Medicine (DICOM) format as radiotherapy structure sets (RTSS) in radiotherapy clinical practice. Typically, radiotherapy imaging is stored in the DICOM image format as individual 2D slice instances, and the RTSS format reflects this. The segmentation is stored as a series of ordered points per 2D image in real-world units. The neighboring points in the list are assumed to form a line segment. Thus, the RTSS implies a region representation that is a 2D irregular polygon. While other DICOM objects exist that represent segmentations as voxel masks, e.g. a DICOM segmentation object, these are not the standard format for radiotherapy.
In many implementations, including the 2017 AAPM Thoracic Auto-segmentation Challenge, the 2D polygon representation is converted to a voxel array for ease of implementation. This conversion process itself can be subject to implementation variations between programs and therefore introduces uncertainty in comparing results from different implementations. Thus, it would be better to define any measures in real-world space using a 2D polygon region representation to reduce the potential for discrepancies.
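The kind of discrepancy such a conversion can introduce is sketched below for the simplest possible case: an axis-aligned square contour, so that the point-in-polygon test reduces to coordinate comparisons. Whether a pixel whose centre lies exactly on the contour counts as inside is a convention that differs between rasterization implementations, and here the choice changes the pixel count substantially.

```python
import numpy as np

# A square contour from an RTSS, in mm; pixel spacing 1 mm,
# pixel centres at integer coordinates.
x_min, y_min, x_max, y_max = 2.0, 2.0, 7.0, 7.0

ys, xs = np.meshgrid(np.arange(10.0), np.arange(10.0), indexing="ij")

# Convention A: a pixel belongs to the region only if its centre
# is strictly inside the polygon.
mask_strict = (xs > x_min) & (xs < x_max) & (ys > y_min) & (ys < y_max)

# Convention B: centres lying exactly on the contour also count.
mask_inclusive = (xs >= x_min) & (xs <= x_max) & (ys >= y_min) & (ys <= y_max)

print(int(mask_strict.sum()))     # 16 pixels
print(int(mask_inclusive.sum()))  # 36 pixels - same contour, same grid
```

Real conversion code uses general point-in-polygon tests rather than coordinate comparisons, but the boundary-handling ambiguity is the same, which is why defining measures in real-world space on the polygon itself avoids this source of disagreement.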
A second choice required is whether measures are implemented in 2D or 3D. While the RTSS contour representation is 2D, the biological objects are defined as 3D objects, therefore it is of interest to know the discrepancies in 3D. For example, volume differences are likely to be more meaningful than cross-sectional area ones. However, this choice depends both on the measure being used and on the biological structure being contoured. Some objects are well-defined in 3D, e.g. a lung has a clear superior/inferior extent in addition to anterior/posterior and left/right, while others, particularly tubular structures such as the spinal cord, are defined over the treatment region but poorly defined in superior/inferior extent. Therefore, surface distance measures may be meaningful for the lung in 3D, but less meaningful for the spinal cord, where a difference in contoured extent may be treated differently to inaccuracy in an individual slice. A 3D assessment of volume difference, however, remains meaningful regardless of the organ.
While not able to provide a simple number, a visual slice-by-slice assessment, as shown in Figure 15.1, can be used to present multiple 2D measures in an easy-to-interpret form. This facilitates the assessment of extent differences, as can be seen in the figure where the area for the comparison contour exceeds the zero area of the reference at the top middle plot, showing that the comparison contour over-segments the base of the heart.
How measurements can be combined should also be considered. Assessment is more helpful if it is performed for multiple patient cases, to evaluate the performance over a range of variations in
FIGURE 15.1 Example contour evaluation report, showing contour point density, area, and banded 2D Hausdorff distances over the range of the contours. It can easily be observed that, while slice-by-slice area agreement is good over the majority of the organ, the comparison contour over-segments at the base of the heart. Example courtesy of Akos Gulyban.
FIGURE 15.2 Example contours used to illustrate contour evaluation measures. The dark contour indicates the reference contour. The light contour indicates the test “auto-contour”.
patient appearance. While a mean and standard deviation can convey the average performance and its variation, for some measures it may be appropriate to combine results in a different way; for example, the Hausdorff distance reports a maximum distance error, so it may be appropriate to report the maximum Hausdorff distance across cases rather than the mean. It may also be desirable to implement quantitative measurements to allow comparison of auto-contouring methods for multiple organs, accounting for a range of measures. For example, for the 2017 AAPM Thoracic Auto-segmentation Challenge a single score per method was required for all cases. The approach taken is discussed later.
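This distinction between aggregation strategies can be sketched with hypothetical per-case values (the numbers below are illustrative only):

```python
import numpy as np

# Hypothetical per-case results for one auto-contouring method.
dice = np.array([0.91, 0.88, 0.93, 0.90, 0.86])
hausdorff_mm = np.array([4.2, 6.8, 3.9, 5.1, 12.4])

# Dice is an overlap ratio: mean +/- standard deviation summarises
# typical performance and its spread across patients.
print(f"Dice: {dice.mean():.3f} +/- {dice.std(ddof=1):.3f}")

# The Hausdorff distance already reports a worst-case error per case,
# so the maximum over cases may be the more informative summary.
print(f"Worst-case Hausdorff: {hausdorff_mm.max():.1f} mm")
```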
A range of quantitative assessment methods is now described, considering their clinical meaning and method- or structure-specific implementation details, together with how the assessment approach can be adapted to account for inter-observer variation. The measures selected are by no means comprehensive but are those that have been popular in the past or have recently been proposed and offer new potential.
To work through each of the measures, a toy example will be considered. This is shown in Figure 15.2 - while not anatomically realistic, it is much easier to draw!
In earlier assessments of auto-contouring, measures relating to classification accuracy were popular. Such measures historically had been used in computer science where pixel-wise classification was considered, and clinically in computer-aided detection where the correct detection of lesions is important. Classification accuracy is normally evaluated by counting the proportion of correctly labeled pixels, true positive (TP) and true negative (TN), and the incorrectly labeled pixels, false positive (FP) and false negative (FN). In Figure 15.3, regions have been labeled for TP, FN, and FP. However, the evaluation of true negative pixels becomes a challenge with auto-contouring, as the extent of the background region outside a structure (i.e. the negative classification) is unclear. Assuming the background is the rest of the image, the true negative proportion is always (assuming
FIGURE 15.3 Various measures are based on pixel/voxel classification accuracy. True positive, False negative, and False positive are well defined for anatomical structures. True negative cannot easily be defined as the space outside the anatomical structure has no bounds.
a competent system worthy of testing) high and will be dependent on the image size. Thus, true negative is not a helpful measure when evaluating auto-contouring unless a region is defined over which it will be measured (as in Isambert et al.). However, the choice of definition of such a region makes the comparison of systems by different researchers/clinics potentially inconsistent.
Within a clinical context, such as Computer-Aided Detection (CAD), the performance of a system is often cast into sensitivity and specificity. Sensitivity, defined as TP/(TP + FN), measures the number of correct foreground labels as a proportion of the possible number of correct foreground labels. Thus, sensitivity measures the correctly identified area of a structure as a proportion of the total expected structure area. Specificity, on the other hand, defined as TN/(TN + FP), measures the correctly identified area of background as a proportion of the total expected background. However, given the limitations of measuring TN in the context of auto-contouring, specificity also becomes a difficult measure to compare between implementations. Other alternatives that have been proposed are the percentage of FP, defined as FP/(TP + FN), and the inclusiveness index, defined as TP/(TP + FP).
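These count-based measures are straightforward once TP, FN, and FP are known. The sketch below (the function name and counts are illustrative, not from the text) follows the definitions above, treating specificity as optional because TN requires a bounded background region to be defined.

```python
def classification_measures(tp, fn, fp, tn=None):
    """Classification-based contour measures from pixel/area counts.

    tn is optional: true negatives are ill-defined for auto-contouring
    unless a bounded background region has been agreed.
    """
    measures = {
        "sensitivity": tp / (tp + fn),              # TP/(TP + FN)
        "percent_false_positive": fp / (tp + fn),   # FP/(TP + FN)
        "inclusiveness": tp / (tp + fp),            # TP/(TP + FP)
    }
    if tn is not None:
        measures["specificity"] = tn / (tn + fp)    # TN/(TN + FP)
    return measures

# Hypothetical counts: 80 pixels correctly labelled foreground,
# 20 missed, 10 falsely included.
print(classification_measures(tp=80, fn=20, fp=10))
```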
True positive, false positive, etc. are defined as numbers of pixels/voxels; thus, a direct implementation would be to convert the RTSS to a voxel grid. However, the absolute values of the TP counts will depend on the resolution at which this is done (which can be assumed to be the image resolution), and therefore may vary between patients and clinical centers.
A more consistent and natural implementation may be to compute these measures in terms of volume or area. These values can be computed directly from the RTSS using simple Boolean operations: the true positive region is the intersection of the reference and test contours, and the FP and FN regions are the differences of the test contour and the reference contour from that intersection, respectively. A 2D implementation may give information about the performance of an auto-contouring system over a range of slices and indicate slices with better or worse classification performance. However, given the inherent assumption of spatial independence of the observations, a volumetric implementation is more appropriate. A simple interpretation for the RTSS is to convert area to volume by multiplication by the slice thickness. While sensitivity is a ratio and is not dependent on this calculation, it is necessary if reporting TP, FP, or FN volume.
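A minimal sketch of this volumetric approach, assuming the contours have already been converted to binary voxel masks (the function name and toy masks are illustrative):

```python
import numpy as np

def overlap_volumes(reference, test, voxel_volume_mm3):
    """TP/FP/FN volumes from binary masks via Boolean operations,
    scaled by the voxel volume (in-plane area x slice thickness)."""
    tp = np.logical_and(reference, test)     # intersection
    fp = np.logical_and(test, ~reference)    # test minus intersection
    fn = np.logical_and(reference, ~test)    # reference minus intersection
    return {
        "tp_mm3": tp.sum() * voxel_volume_mm3,
        "fp_mm3": fp.sum() * voxel_volume_mm3,
        "fn_mm3": fn.sum() * voxel_volume_mm3,
        # Sensitivity is a ratio, so the voxel volume cancels out.
        "sensitivity": tp.sum() / reference.sum(),
    }

# Toy 3D masks: two slices of a 6x6 square, shifted by one pixel.
ref = np.zeros((4, 10, 10), dtype=bool)
ref[1:3, 2:8, 2:8] = True
test = np.zeros_like(ref)
test[1:3, 3:9, 3:9] = True

# 1 x 1 mm pixels, 3 mm slice thickness -> 3 mm^3 per voxel.
print(overlap_volumes(ref, test, voxel_volume_mm3=3.0))
```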
Advantages and Limitations
Such classification approaches have the benefit that measures such as sensitivity and specificity are reasonably well understood both within the clinical domain and within the computer science community. However, the approach has limited usefulness in the evaluation of auto-contouring since each pixel is treated as an independent observation. The importance of spatial location and the concept of the structure are not considered by these measures; thus, they do not provide any information about the utility of the contours - for example, the same score can result from a random scattering of incorrectly classified pixels as from the same number of incorrectly classified pixels located systematically in one place.