Auto-Segmentation Clinical Validation Studies – Current State-of-the-Art

Although there are several studies for clinical validation, explored in depth in Chapter 15, a few representative studies are outlined here to describe the range of assessment metrics used to evaluate auto-segmentation methods (atlas-based or deep learning). Most of the evaluated methods reported here are atlas-based, because deep learning methods are relatively new in radiation oncology. In general, the most common metric used for evaluating auto-segmentation methods is the Dice similarity coefficient (DSC) [15, 16, 31-33]. DSC compares a segmentation against a clinical “gold standard” manual delineation using a form of overlap ratio. DSC is simple to compute and is used pervasively in the literature, which allows for easy comparison to the prior state of the art.
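As a minimal sketch of the overlap ratio behind the DSC, the snippet below computes DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks. The function name, the toy masks, and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not part of any cited study.

```python
import numpy as np

def dice_coefficient(auto_mask, manual_mask):
    """Dice similarity coefficient between two binary masks."""
    auto = np.asarray(auto_mask, dtype=bool)
    manual = np.asarray(manual_mask, dtype=bool)
    total = auto.sum() + manual.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(auto, manual).sum() / total

# Toy example: a 6x6 "manual" square and an "auto" square shifted by one row.
manual = np.zeros((10, 10), dtype=bool)
manual[2:8, 2:8] = True
auto = np.zeros((10, 10), dtype=bool)
auto[3:9, 2:8] = True

dsc = dice_coefficient(auto, manual)
print(f"DSC = {dsc:.3f}")  # 30 shared pixels out of 36 + 36
```

A one-row shift of a small structure already lowers the DSC noticeably, which is one reason the score is hard to interpret without knowing the structure size.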

In addition, clinical usefulness has commonly been measured by evaluating the time savings in contour editing [12, 14, 16, 21, 31]. This is a slightly more useful metric as it takes into account the efficiency of using auto-segmentation in clinical practice. Another approach has been to employ visual grading of the segmentations to assess how useful they are likely to be for clinical use [16].

But these measurements do not consider the clinical endpoint and do not provide any information on the impact of the auto-segmentation on improving treatment accuracy or outcomes. A different approach has been to compute whether the auto-segmentations improve the robustness with respect to multiple raters. For instance, editing variations between radiation oncologists were reported to be reduced when using auto-segmented contours as a starting point for segmentation [42, 43]. This is important because large field margins are often the result of increased uncertainties. Finally, clinical evaluation studies that examined the relation between the geometric metrics used for evaluating auto-segmentation, such as the DSC and Hausdorff distance, and the clinically relevant dosimetric metrics [13, 44] found that there is little correlation between these two types of assessment. For example, imperfect DSC scores did not necessarily translate to inferior normal organ dosimetry [44], especially if the OAR was in the low dose region, while even small deviations in the geometric accuracy could have a large impact if the organ in question was in the high dose region [13]. Novel metrics of geometric accuracy such as the surface DSC metric [45] and quantitative measures of time savings like the added path length [21] have been shown to be more clinically representative surrogates for the wall-clock time required to edit incorrect segmentations in the clinic. Table 15.1 in Chapter 15 summarizes representative studies with the metrics used for algorithm validation.

Data Curation Guidelines for Radiation Oncology

A basic requirement for the application of auto-segmentation tools is that the volumes produced using these tools adhere to the clinical contouring guidelines for the anatomical region. However, adherence to such guidelines requires that the training set delineations were done following well-defined guidelines, either defined institutionally or, better, through community-wide agreed-upon guidelines. Methods developed on one set of training data from one institution may be difficult to apply to a dataset from a different institution. Difficulties arise due to variability in the image acquisition protocols applied across institutions, but also due to differences in clinician preferences for accepting contours for treatment. Finally, patient privacy concerns make it difficult to share datasets across institutions. As a result, algorithm evaluations and comparisons are done using methods developed and tested on widely different datasets, which makes the performance of the various methods hard to compare. Even if the source code for performing evaluations is available, as the number of models grows, comparative evaluations themselves become quite laborious.

“Grand challenges” offer an unbiased approach to reduce some of the aforementioned issues. Multiple grand challenges in the area of segmentation are becoming increasingly common, through which limited datasets with common scoring and editing guidelines are available to compare the various algorithms. Some grand challenges publish the delineation guidelines, which could be used in creating the institutional and larger training datasets to allow for improved training of the deep learning methods as well as the comparison against multiple methods. Some issues to consider when developing datasets both for training and testing of auto-segmentation methods include:

a. Adherence to well-defined clinical guidelines of target and normal OAR structures: There is inherent variability between users when performing segmentations. One approach to reduce variability is to use clearly defined guidelines such as the Radiation Therapy Oncology Group (RTOG) guidelines, for example RTOG 1106 in the 2017 AAPM Thoracic Auto-segmentation Challenge [29], RTOG 0920 in the Medical Image Computing and Computer-Assisted Intervention (MICCAI) Head and Neck Segmentation Challenge [39], as well as the AAPM RT MRI auto-contouring (RTMAC) 2019 grand challenge [40]. Another approach is to combine these descriptive guidelines with visual examples as done in web-based tools like the e-contour [46].

b. Use of multiple expert delineations: While it is difficult to obtain even a single expert delineation on a reasonably sized dataset, one clinically useful measure to assess auto-segmentation is whether or not it reduces the variability with respect to multiple experts [45]. Another approach is to use multiple algorithm delineations with manual editing as a benchmark for comparison against new methods [47]. While not a perfect substitute for modeling inter-rater variability, such datasets are useful to estimate how well a new algorithm compares to others.

c. Multi-institutional datasets for algorithm selection: While obtaining multi-institutional datasets is very challenging for individual researchers, grand challenges offer an easier option for a group of researchers to put together multi-institutional datasets. Multi-institutional datasets are invaluable for evaluating the algorithms under a wider range of realistic clinical conditions used in these institutions, as opposed to a specific imaging protocol used for training and testing the algorithms. The algorithms, especially those using deep learning methods, have matured to a point that it is no longer useful to restrict their training and testing to homogenized imaging protocols without any corner cases. In particular, if these methods are to be used in the clinic, it is imperative that the datasets cover the wider range of imaging variations.

d. Creating an internal benchmarking dataset for internal acceptance: Multi-institutional datasets are useful for selecting the appropriate method among other methods. However, once selected, it is necessary that this method satisfies the requirements of clinicians and clinical preferences in segmentation. For example, the delineation requirements in an institution may require that the organs are never under-segmented, so that the radiographer or dosimetrist does not have to spend additional time including the parts left out by the algorithm. In the author’s own institutional experience in commissioning a head and neck auto-segmentation method, it was found that the method trained on the well-known external-institution Public Domain Database for Computational Anatomy (PDDCA) dataset produced segmentations closer to the inside boundary of the organs, but the institutional preference was to include the external boundary of the organs (the mandible, for instance). This is a small difference but can lead to excess additional work for individual cases. A more practical solution for this problem was to retrain the algorithm on the internally curated datasets.

Other considerations for internal commissioning should include the range of cases seen in the institution. For instance, in a tertiary cancer center, the number of patients with abnormal anatomies is often higher than is encountered in multi-institutional challenge datasets. Ignoring abnormal anatomies can create problems, as the algorithms may fail to segment or even produce false positives. Thus, it is necessary to include difficult and abnormal conditions that are more commonly seen in the clinic as detailed below.

e. Inclusion of difficult clinical conditions commonly seen in-clinic: As the goal of training deep learning methods is typically to develop an algorithm that achieves reasonable performance, datasets often ignore or even remove difficult conditions like images with artifacts, images with large tumors, or anatomical conditions like collapsed lungs and the absence of structures like submandibular glands due to surgery. However, for clinical use, the algorithms need to be robust to these clinical conditions. A common reason why the methods are not used in the clinic is that they fail to generalize even slightly outside the most homogeneous conditions under which they were trained. Data curation, either institutional or for a grand challenge, would benefit from combining datasets that incorporate some more challenging situations to “stress test” the algorithms.

Evaluation Metrics Guidelines for Clinical Commissioning

Auto-segmentation methods are usually evaluated with respect to contour similarity metrics and the potential for time savings in contouring when using these methods [35]. The commonly used metrics for evaluating contour similarity are the Dice similarity coefficient, Jaccard index (which is related to the Dice similarity coefficient), and spatial distance metrics like the Hausdorff distances, mean surface distance, and average distances. However, there is no clear consensus on the best method for assessing performance. Importantly, these metrics may not be effective in distinguishing random from systematic errors or in clearly separating false positive from false negative segmentations [35, 48]. Also, these metrics operate “out of context”, in the sense that they are independent of the impact that the errors have on treatment accuracy and outcomes.
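The relationship between the contour similarity metrics named above can be illustrated with a short sketch. It computes the Jaccard index (related to the DSC by J = DSC / (2 - DSC)), a brute-force symmetric Hausdorff distance between the two voxel sets, and separate false positive and false negative counts, which a single overlap score does not distinguish. The function names, the toy masks, and the 1 mm spacing are hypothetical; the masks are assumed to be non-empty, and the brute-force distance computation is only practical for small structures.

```python
import numpy as np

def jaccard_index(auto_mask, manual_mask):
    """Intersection over union of two binary masks."""
    auto = np.asarray(auto_mask, dtype=bool)
    manual = np.asarray(manual_mask, dtype=bool)
    union = np.logical_or(auto, manual).sum()
    if union == 0:
        return 1.0  # assumed convention for two empty masks
    return np.logical_and(auto, manual).sum() / union

def hausdorff_distance(auto_mask, manual_mask, spacing_mm):
    """Symmetric Hausdorff distance (mm) between two non-empty voxel sets."""
    scale = np.asarray(spacing_mm, dtype=float)
    pts_auto = np.argwhere(auto_mask) * scale
    pts_manual = np.argwhere(manual_mask) * scale
    diff = pts_auto[:, None, :] - pts_manual[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # all pairwise distances
    # Largest nearest-neighbour distance, taken in both directions.
    return max(dist.min(axis=1).max(), dist.min(axis=0).max())

manual = np.zeros((10, 10), dtype=bool)
manual[2:8, 2:8] = True
auto = np.zeros((10, 10), dtype=bool)
auto[3:9, 2:8] = True  # same shape, shifted by one row

jaccard = jaccard_index(auto, manual)
dice_from_jaccard = 2 * jaccard / (1 + jaccard)
hd_mm = hausdorff_distance(auto, manual, spacing_mm=(1.0, 1.0))
false_positive = np.count_nonzero(auto & ~manual)  # voxels to delete
false_negative = np.count_nonzero(manual & ~auto)  # voxels to add
print(jaccard, dice_from_jaccard, hd_mm, false_positive, false_negative)
```

Reporting the false positive and false negative counts alongside the overlap score gives an indication of whether an algorithm tends to over- or under-segment.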

Importantly, volume overlap-based metrics like DSC are not well suited for measuring geometric deviations as they weight all misplaced segmentations, both internal and external, equally [45]. Similarly, volume comparison measures like volume ratio only give a rough idea of whether the segmentations are over or under the expected volumes as produced by an expert. However, these are not indicative of errors on the surface of the organs and target, which are more relevant for ensuring appropriate shaping of the treatment beams. Surface distance-based measures, such as the surface DSC, which measures the extent to which the algorithm contour deviates from the manual delineation at the surface, may be more suitable for measuring the absolute deviations (in mm) from the organ surface by the algorithm [45]. These measures have also been shown to be more correlated with clinical usability, measured as the time needed to manually edit the auto-segmentations [21, 36].
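A simplified, voxel-counting sketch of the surface DSC idea is shown below: it reports the fraction of surface voxels of both contours that lie within a tolerance (in mm) of the other contour's surface. This is an illustrative approximation, not the implementation of [45], which works with surface area rather than voxel counts; the function names, toy masks, and tolerance values are assumptions.

```python
import numpy as np

def surface_voxels(mask):
    """Foreground voxels with at least one face-neighbour in the background."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = padded.copy()
    for axis in range(mask.ndim):
        interior &= np.roll(padded, 1, axis) & np.roll(padded, -1, axis)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    return mask & ~interior[core]

def surface_dice(auto_mask, manual_mask, spacing_mm, tolerance_mm):
    """Fraction of both surfaces lying within tolerance_mm of the other one."""
    scale = np.asarray(spacing_mm, dtype=float)
    pts_auto = np.argwhere(surface_voxels(auto_mask)) * scale
    pts_manual = np.argwhere(surface_voxels(manual_mask)) * scale
    diff = pts_auto[:, None, :] - pts_manual[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    auto_ok = np.count_nonzero(dist.min(axis=1) <= tolerance_mm)
    manual_ok = np.count_nonzero(dist.min(axis=0) <= tolerance_mm)
    return (auto_ok + manual_ok) / (len(pts_auto) + len(pts_manual))

manual = np.zeros((10, 10), dtype=bool)
manual[2:8, 2:8] = True
auto = np.zeros((10, 10), dtype=bool)
auto[3:9, 2:8] = True  # shifted by one pixel (1 mm here)

strict = surface_dice(auto, manual, (1.0, 1.0), tolerance_mm=0.0)
tolerant = surface_dice(auto, manual, (1.0, 1.0), tolerance_mm=1.0)
print(strict, tolerant)
```

The tolerance makes the clinical question explicit: a 1 mm shift is penalized with zero tolerance but counts as fully acceptable once deviations of up to 1 mm are considered not worth editing.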

The ultimate goal for clinical commissioning is to determine whether or not the auto-segmentation method improves clinical efficiency and whether it improves treatment accuracy. In this regard, the time savings-based metrics used by several works on clinical validation [12, 13, 16, 21, 31] are more useful measures of clinical efficiency improvements. However, measuring wall-clock editing times is difficult, and requiring clinicians to time themselves may be problematic. A non-invasive (in terms of additional work for clinicians) and more robust metric was recently introduced [36], which measures the added path length of the edited contours. This metric can be measured directly after clinician editing by comparing the edited contours against the algorithm contours. Also, this measure together with the surface DSC metric was shown to be highly correlated with the time savings measurements.
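The added path length can be approximated, slice by slice, as the length of edited contour that was not already present in the algorithm contour. The sketch below counts such contour pixels and multiplies by the pixel size; it is a simplified illustration under the assumptions of isotropic in-plane pixels and exact pixel matching, not the implementation used in [21, 36]. All names are hypothetical.

```python
import numpy as np

def contour_pixels(slice_mask):
    """Boundary pixels of a 2D binary mask (4-connected neighbourhood)."""
    mask = np.asarray(slice_mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[1:-1, 1:-1]
                & padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior

def added_path_length(auto_volume, edited_volume, pixel_mm):
    """Length (mm) of edited contour not already drawn by the algorithm."""
    added_pixels = 0
    for auto_slice, edited_slice in zip(auto_volume, edited_volume):
        auto_contour = contour_pixels(auto_slice)
        edited_contour = contour_pixels(edited_slice)
        added_pixels += np.count_nonzero(edited_contour & ~auto_contour)
    return added_pixels * pixel_mm

# Two slices: the first is accepted unchanged, the second is shifted by the editor.
auto_volume = np.zeros((2, 10, 10), dtype=bool)
auto_volume[:, 3:9, 2:8] = True
edited_volume = auto_volume.copy()
edited_volume[1] = False
edited_volume[1, 2:8, 2:8] = True

apl_mm = added_path_length(auto_volume, edited_volume, pixel_mm=1.0)
print(apl_mm)
```

Because the measure only needs the algorithm output and the final approved contour, it can be collected retrospectively without asking clinicians to time their editing.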

Ultimately, the real impact of the segmentations is in the dose calculations. In this regard, measuring the dose impact using dose-volume histograms (DVH) from new treatment plans computed from the auto-segmentations, compared with plans based on the “gold standard” segmentations, would be most useful. However, such an evaluation is difficult to do as it requires access to fast treatment plan optimization, because plans must be computed twice (once for the “gold standard” and once for the algorithm delineations). Also, average comparisons of the dosimetric measures derived from the DVH may not be sufficiently informative, as the real impact of these metrics is for organs that lie in the high dose regions.
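To make the DVH comparison concrete, the sketch below computes cumulative DVHs for a manual and an automatic contour. As a simplification, both contours are evaluated on one fixed, synthetic dose grid rather than on two separately optimized plans as the text describes; the dose gradient, masks, and function name are assumptions for illustration only.

```python
import numpy as np

def cumulative_dvh(dose, mask, dose_levels):
    """Fraction of the structure volume receiving at least each dose level."""
    structure_dose = dose[mask]
    return np.array([(structure_dose >= level).mean() for level in dose_levels])

# Synthetic dose grid (Gy): dose rises by 5 Gy per row, from 0 to 45 Gy.
dose = 5.0 * np.arange(10)[:, None] * np.ones((1, 10))

manual = np.zeros((10, 10), dtype=bool)
manual[2:8, 2:8] = True
auto = np.zeros((10, 10), dtype=bool)
auto[3:9, 2:8] = True  # shifted one row towards the high dose region

dose_levels = np.arange(0.0, 50.0, 5.0)
dvh_manual = cumulative_dvh(dose, manual, dose_levels)
dvh_auto = cumulative_dvh(dose, auto, dose_levels)

mean_manual, mean_auto = dose[manual].mean(), dose[auto].mean()
v30_manual, v30_auto = dvh_manual[6], dvh_auto[6]  # dose_levels[6] is 30 Gy
print(mean_manual, mean_auto, v30_manual, v30_auto)
```

In this toy case a one-row shift in a steep dose gradient changes the mean dose by 5 Gy, while the same shift in a flat low dose region would change almost nothing, which mirrors the weak link between geometric and dosimetric metrics discussed in the text.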

Importantly, the metrics used to evaluate the algorithms are often uncorrelated with each other. For instance, a high DSC score achieved by an algorithm does not necessarily imply a low Hausdorff distance [34]. Similarly, high DSC accuracy does not necessarily correspond to the improved dosimetric accuracies, because dosimetric accuracies are impacted by the location of the organ in the high-dose region [13]. In this regard, a better strategy might be to come up with new and comprehensive measures that combine the various metrics. Grand challenges themselves typically use DSC and Hausdorff, and additional metrics based on one or more expert delineations [49]. These metrics could be treated separately or as a combination to create a combined score, as was used in the 2017 AAPM Thoracic Auto-segmentation Challenge, also incorporating the inter-rater differences as a baseline in the accuracy calculation [29]. An advantage of using such combined metrics is that it provides one comprehensive score to compare multiple algorithms and provides an easier way to rank the performance of these methods.
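One hypothetical way to fold several metrics into a single score, using inter-rater agreement as the baseline, is sketched below. This is not the scoring formula of the 2017 AAPM challenge or of any other cited work; the normalization, the equal weighting, and all numbers are assumptions chosen only to show how a combined score can rank algorithms differently from the DSC alone.

```python
def normalized_scores(dsc, hausdorff_mm, rater_dsc, rater_hausdorff_mm):
    """Scale each metric so that matching inter-rater agreement scores 1.0."""
    overlap_score = min(1.0, dsc / rater_dsc)
    if hausdorff_mm == 0:
        distance_score = 1.0
    else:
        distance_score = min(1.0, rater_hausdorff_mm / hausdorff_mm)
    return overlap_score, distance_score

def combined_score(dsc, hausdorff_mm, rater_dsc, rater_hausdorff_mm):
    """Equal-weight average of the normalized overlap and distance scores."""
    scores = normalized_scores(dsc, hausdorff_mm, rater_dsc, rater_hausdorff_mm)
    return sum(scores) / len(scores)

# Hypothetical inter-rater baseline and two hypothetical algorithms.
RATER_DSC, RATER_HD_MM = 0.90, 5.0
score_a = combined_score(0.90, 8.0, RATER_DSC, RATER_HD_MM)  # high DSC, large outlier
score_b = combined_score(0.85, 4.0, RATER_DSC, RATER_HD_MM)  # lower DSC, tighter surface
print(score_a, score_b)
```

Here algorithm B ranks above algorithm A despite its lower DSC, because A's Hausdorff distance is well outside the assumed inter-rater range.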

However, the combination of metrics is still only useful to assess how good the algorithms are with respect to each other. In the case of clinical commissioning and for clinical use, this may not necessarily be useful as a metric. This is because in day-to-day clinical use a more useful measure is the degree of confidence in different parts of the segmentation. Such a measure could be, for example, computed using a multi-atlas method comparing multi-atlas segmentations to the clinical contour (if available) to indicate areas that potentially need a second inspection to reduce inter-rater variability. Alternatively, segmentation uncertainties on the voxel-by-voxel basis could also inform users where corrections are necessary. These uncertainty metrics could be tied to the algorithm alone or could also consider how the algorithm contouring variability (due to different levels of uncertainty) impacts treatment plans (assuming fast treatment planning methods are available to automatically compute multiple plans). Figure 13.1 shows an example of a multi-atlas registration-based segmentation uncertainty map [50] visualized for a selected organ (the left parotid gland). As seen, the voxel-level visualization of uncertainties, which can also easily be obtained using deep learning methods, can potentially help ascertain where segmentation corrections may need to be done for QA in clinical workflows. The geometric uncertainties could also be combined with the dose volume histogram calculations to evaluate the extent of deviations between the various contours as a measure of contour stability.

FIGURE 13.1 Voxel-wise segmentation confidences (red indicates higher certainty) visualized for a representative case computed using multi-atlas registration-based segmentation. The geometric accuracy metrics are also shown by evaluating against clinical delineations.
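A voxel-wise confidence map of the kind described above can be sketched from a set of candidate segmentations, for example the labels propagated from several atlases. The snippet uses simple vote agreement as the confidence measure; this is an illustrative choice and not the method of [50], and the function name, the review threshold, and the toy votes are assumptions.

```python
import numpy as np

def vote_confidence(candidate_masks):
    """Consensus label and agreement-based confidence for each voxel.

    candidate_masks has shape (n_candidates, ...), one binary mask per atlas.
    """
    votes = np.asarray(candidate_masks, dtype=float)
    foreground_fraction = votes.mean(axis=0)  # share of atlases voting "organ"
    consensus = foreground_fraction >= 0.5    # majority vote (ties count as organ)
    # 1.0 when all atlases agree, 0.0 when they are split evenly.
    confidence = np.abs(2.0 * foreground_fraction - 1.0)
    return consensus, confidence

# Four hypothetical atlas segmentations of a 5-voxel profile across an organ edge.
candidates = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
])
consensus, confidence = vote_confidence(candidates)
needs_review = confidence < 0.75  # assumed threshold for flagging voxels
print(consensus, confidence, needs_review)
```

The same agreement measure could in principle be applied to an ensemble of deep learning predictions, with the flagged voxels shown to the reviewer as the regions most likely to need correction.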

In summary, the best metric or set of metrics needed for clinical commissioning is still an open research question. However, depending on the clinical requirements (accuracy vs time efficiency), a combination of the aforementioned metrics could be used in a hierarchical manner. For instance, geometric metrics could be used on a common benchmark dataset to ascertain the best method. Next, clinical commissioning and ongoing evaluations during various algorithm upgrades or in the event of imaging protocol changes could be evaluated using time-based, dosimetric, or a combination of these metrics.
