III Clinical Implementation Concerns

Clinical Commissioning Guidelines

Harini Veeraraghavan

Introduction

Rapid advances in image-guided radiation therapy have brought forth a range of treatments, including high-dose radiation therapy, whereby high doses are delivered over a few fractions with high conformity to the target tumors. Clinical studies have shown excellent local control in multiple cancers and, with it, the potential to improve progression-free and overall survival for some patients [1-4].

An important requirement for achieving highly conformal doses is very precise and accurate segmentation of the target and the nearby normal organ at risk (OAR) structures [5, 6]. In current clinical practice, targets and OARs are manually delineated by clinicians on CT images. Manual delineation is difficult because of low soft-tissue contrast, especially on non-contrast CT images [7, 8]. It is also subject to inter-rater variability [9] that can adversely impact tumor control probability [10]. The advent of new image-guided treatments, including magnetic resonance imaging (MRI)-guided radiotherapy, has made manual delineation more accurate owing to the improved soft-tissue contrast of MRI [11]. Nevertheless, the problem of variable segmentations persists, and segmentation remains the most time-consuming step in radiotherapy [12].

The commonly used atlas-based methods have been shown to both reduce inter-rater variability [13, 14] and decrease the manual editing times [12, 14]. Clinical validation studies have shown that atlas-based methods produce excellent agreement with manual delineations and reduce user effort [15, 16]. However, these methods can have systematic biases in the segmentations, which may necessitate extensive user editing of some organs [15, 17]. For example, small organs may need consistent manual editing [15], or certain organs like parotid glands may require re-segmentation [17].

More importantly, the large geometric uncertainties resulting from manual segmentation constrain the dose that can be safely delivered to target tumors. This is because the large treatment field margins [18] needed to account for these geometric uncertainties invariably lead to higher doses to the nearby normal OARs. In-treatment-room X-ray-based cone beam computed tomography (CBCT) imaging is now available as part of standard equipment and has been used for positional and setup corrections during treatments. Incorporating geometric corrections has been shown to improve the accuracy of conformal treatment [4, 19], and with it the potential to improve outcomes. However, a key obstacle for these treatments is the lack of robust, fast, and accurate segmentation methods. Atlas-based methods are computationally expensive and sensitive to changing anatomy; because anatomical changes are common during radiation therapy and imaging appearance may also change, the accuracy of atlas-based deformation techniques is reduced.

More recent deep learning methods are computationally fast (typically on the order of seconds to minutes, compared with minutes to hours for atlas-based methods) and are robust to inter-rater variability [20]. Importantly, deep learning methods have been shown to reduce manual editing times for multiple OARs [21] more than atlas-based methods do. As a result, deep learning has been applied extensively to numerous image segmentation problems in radiation oncology [20, 22-28].

However, the use of either deep learning or atlas-based methods in routine clinical care remains highly limited. This is due in part to the difficulty of establishing the reliability of these methods for commissioning [29], as well as a lack of tools to identify when and where an algorithm fails, leading to manual override of the delineations. More importantly, while some of these methods may show phenomenal performance on limited testing sets, they fail to generalize to clinical datasets, as discussed in Chapter 12. Discrepancies in performance may stem from large differences between the training/testing sets and actual clinical use. For example, it is not uncommon to remove difficult conditions, such as images with large artifacts, large tumors, or abnormal anatomy like collapsed lungs, from limited training/testing cohorts in order to assess the basic performance of the developed methods. However, methods developed under these conditions fail to scale to actual clinical scenarios, in which images may have large artifacts (such as dental artifacts on head and neck CT images) or abnormal anatomy (e.g., collapsed lungs, missing structures due to surgery, mass effect due to the presence of large tumors).

All this motivates the need for clear guidelines and metrics for evaluating auto-segmentation methods prior to clinical commissioning. The rest of this chapter briefly discusses the approaches to clinical validation outlined in prior studies and the challenges involved in clinical commissioning, and presents some solutions for mitigating these issues through better data curation and evaluation metrics.

Stages in Clinical Commissioning

A phased approach consisting of technique identification and verification, testing before clinical deployment, and ongoing clinical quality assurance (QA) is generally recommended when introducing a new technology into a clinic [30]. At a minimum, clinical commissioning of an AI method should include testing of the technique on the institution's own dataset prior to clinical implementation, followed by routine QA of selected clinical cases with a group of multi-disciplinary experts. Before assessing the auto-segmentation system, it is also important to obtain details of how the technique was evaluated, including its quantitative performance on the metrics to be used. Evaluation of the system should include, where possible, testing with multi-institutional datasets, comparison against multiple methods, and evaluation on established metrics. To ensure that the most suitable methods are used in the clinic, both robust and reliable evaluation metrics and well-curated datasets, drawn from multi-institutional as well as internal sources, are needed. Finally, after introduction into clinical use, frequent QA of selected cases should be performed to confirm that the system continues to perform at the desired level. Performance problems should be logged, and AI techniques may need to be retrained if the imaging technology or imaging protocols change.
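The routine-QA logging described above can be sketched in code. The following is a minimal, hypothetical illustration (the `QARecord` fields, the `flag_for_review` helper, and the 0.8 threshold are all assumptions, not part of any published commissioning protocol): each selected clinical case is scored against the expert-edited contour, and cases falling below an action threshold are flagged and logged for multi-disciplinary review.

```python
from dataclasses import dataclass


@dataclass
class QARecord:
    """One routine-QA entry for an auto-segmented structure (fields are illustrative)."""
    case_id: str
    structure: str
    dice_vs_expert: float  # agreement of the auto-contour with the expert-edited contour


def flag_for_review(records, action_threshold=0.8):
    """Return the QA records whose agreement falls below an action threshold.

    The 0.8 default is purely illustrative; in practice, clinics would set
    per-structure thresholds from their own commissioning data.
    """
    return [r for r in records if r.dice_vs_expert < action_threshold]


qa_log = [
    QARecord("case001", "parotid_L", 0.72),
    QARecord("case002", "heart", 0.93),
]
flagged = flag_for_review(qa_log)
for record in flagged:
    print(f"Review needed: {record.case_id} / {record.structure}")
```

Logging flagged cases over time also provides the trend data needed to decide when retraining is warranted, for example after a change in imaging protocol.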

Need for Robust and Clinically Useful Metrics

The most commonly used metrics for evaluating the accuracy of segmentation methods are borrowed from computer vision, whereby the spatial and geometric overlap of the algorithm's output is measured against manual delineation by an expert. One such metric is the Dice similarity coefficient, which measures the overlap in the number of voxels that match between the algorithm and the manual delineation [15, 16, 31-33]. Because of the need to guarantee spatial accuracy in terms of metric distances, Hausdorff distances have also been commonly used to assess medical image segmentation applied to radiation therapy. While these metrics are reasonable when analyzing objects in real-world images that have well-defined boundaries and can be easily identified and delineated by people, medical image analysis and delineation require significant domain expertise. Moreover, even delineations by experts are subject to inter-rater variability, so there is no absolute "gold standard" ground truth segmentation in medical images [34, 35]. More importantly, these metrics are neither clear indicators of clinical efficiency, such as the reduction in editing times [36], nor are they indicative of improvement in target coverage [37]. This motivates the need for practical metrics that can be used for clinical commissioning of auto-segmentation methods. Chapter 15 further considers quantitative methods for evaluation of auto-segmentation in clinical commissioning.
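To make the two metrics concrete, the following is a minimal NumPy sketch, not a clinical implementation. Dice is twice the intersection over the sum of the mask sizes; Hausdorff distance is computed here over full point sets with brute-force pairwise distances, whereas clinical tools typically use surface points, efficient spatial indices, and often the 95th-percentile variant (HD95) to reduce outlier sensitivity.

```python
import numpy as np


def dice_coefficient(pred, ref):
    """Dice similarity coefficient between two binary masks: 2|A∩B| / (|A|+|B|)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:  # both masks empty: define as perfect agreement
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom


def hausdorff_distance(pred, ref, spacing=(1.0, 1.0)):
    """Symmetric point-set Hausdorff distance, with voxel coordinates scaled by spacing.

    Brute-force pairwise distances; fine for small masks, but use surface
    extraction and a KD-tree for full 3D volumes.
    """
    pred_pts = np.argwhere(pred) * np.asarray(spacing)
    ref_pts = np.argwhere(ref) * np.asarray(spacing)
    d = np.linalg.norm(pred_pts[:, None, :] - ref_pts[None, :, :], axis=-1)
    # max over each mask's points of the distance to the nearest point of the other
    return max(d.min(axis=1).max(), d.min(axis=0).max())


# Two overlapping 4x4 squares, offset by one voxel in each axis
pred = np.zeros((10, 10), dtype=bool)
pred[2:6, 2:6] = True
ref = np.zeros((10, 10), dtype=bool)
ref[3:7, 3:7] = True
print(dice_coefficient(pred, ref))    # 0.5625 (intersection 9, masks 16 each)
print(hausdorff_distance(pred, ref))  # sqrt(2), the diagonal one-voxel offset
```

Note that both functions score purely geometric agreement; neither says anything about editing time or dosimetric impact, which is precisely the limitation discussed above.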

Need for Curated Datasets for Clinical Commissioning

A related problem is the lack of benchmark datasets with which to assess and compare different algorithms. Using a common reference dataset to establish performance allows direct comparison of the various methods. Benchmark datasets that encompass the variability expected in the clinic, including imaging variations, imaging artifacts, and large deformations in patient anatomy, are necessary to evaluate the utility of auto-segmentation methods for clinical use. Such datasets, if available, are also useful for evaluating upgrades to auto-segmentation software already used in the clinic, to ensure safe use for treatment planning. The recent push towards increased reproducibility in research, initiated by multiple top-tier machine learning conferences such as Neural Information Processing Systems and by some medical journals, has accelerated both the implementation of more advanced methods by biomedical scientists and the evaluation of new methods against the established state of the art. However, there is still a need for a community-wide testing framework with common datasets and common metrics to evaluate the various methods. Grand challenges [29, 38-40], which provide limited-size datasets with clearly defined evaluation metrics and delineations performed using clearly defined criteria, represent a successful first step in this direction.

More importantly, the lack of large, well-curated datasets is a fundamental obstacle to the successful application of data-hungry methods like deep learning. Variability in the delineation guidelines used across institutions may make these methods less portable across institutions; their accuracy and robustness depend heavily on the size and quality of the training datasets [41]. Hence, well-curated datasets of reasonable size, delineated according to published contouring guidelines and drawn from different institutions to capture imaging variability, are crucial to the development of clinically useful deep learning methods. Chapter 14 addresses the challenges of data curation for auto-segmentation in more detail.

Conversely, internal commissioning within an institution requires a well-curated dataset in which the contouring adheres to that institution's clinical preferences and the needs of its treated patients. More details on the considerations for clinical commissioning data curation are presented in the subsequent subsection on data curation guidelines for radiation oncology.
