Identifying Possible Scenarios Where a Deep Learning Auto-Segmentation Model Could Fail

Carlos E. Cardenas


Site-Specific Models

A majority of radiation oncology auto-segmentation studies focus on developing algorithms for single anatomical sites. Several factors drive this, including limited scan field-of-view (FOV), the availability of expert contours, and the added complexity that a multi-site solution must account for.

In radiotherapy, the image acquisition FOV of treatment simulation scans is set by preset imaging protocols for each disease site. These protocols are defined based on the disease site, the tumor location, and the extent of the disease, because imaging the patient outside of these regions would deliver an unnecessary radiation dose to the patient. Limiting the FOV can reduce unnecessary radiation dose in a computed tomography (CT) scan and can drastically reduce the scanning time in magnetic resonance imaging (MRI).

Generating manual contours for normal tissues outside of the treatment region offers little benefit in the treatment planning process. For example, most commercially available treatment planning systems do not calculate doses outside of the treatment region, so contouring the lungs for a prostate cancer patient’s plan offers no benefit as a dose to the lungs (if visible in the simulation scan) would not be calculated and, therefore, not reflect accurate dose estimates for that organ. Additionally, it is well established that manual contouring is time-consuming, and it would be cost prohibitive to contour all normal tissues outside of the treatment region in routine clinical practice.

The location and size of a tumor play a critical role in determining patient setup during simulation and the course of treatment. Patient setup can be standard for certain sites (e.g. head and neck cancers are treated supine with a thermoplastic mask), whereas many options are available for others. For example, rectal cancers can be treated in either the supine or prone position, depending on the treatment delivery technique (3D conformal radiotherapy (3DRT) or Volumetric Modulated Arc Therapy (VMAT)) and on individual treatment centers’ preferences for clinical practice. Furthermore, a wide range of immobilization devices are used in radiotherapy. These devices play a role in treatment simulation setup, and some are designed to pull or compress organs into or out of the treatment field. The wide variability in patient setup and the effects of some immobilization devices (e.g. use of a belly board in prone setup for treatment of rectal cancers) on patients’ anatomy increase the complexity of developing a multi-site auto-segmentation solution.

Limitations of Training Data

Despite the success and superior performance of deep learning-based auto-segmentation algorithms [1], individual model performance and generalizability on unseen data are often overlooked and unreported. As detailed in the previous subsection, several factors such as a scan’s craniocaudal FOV, patient setup, and anatomical variability due to immobilization devices could have a significant impact on a model’s ability to produce clinically acceptable segmentations. Other factors, such as image quality, the presence of medical devices, implants, or hardware, and anatomical changes due to prior patient history (e.g. collapsed lung) or surgical procedures, have not been widely investigated.

At the time of writing, a single study has reported results highlighting the limitations caused by the lack of diversity in publicly available data when used for thoracic organ segmentation [2]. Feng et al. used the 2017 AAPM Thoracic Auto-segmentation Challenge data to train a two-stage 3D convolutional network model [3] using a variant of the 3D U-net [4]. This model (trained on the challenge data) was then evaluated on institutional data, where some discrepancies between clinical and auto-segmented volumes were noticed, with significant differences reported for heart contours [2]. The authors found that differences in the motion management technique between the challenge data and institutional patient scans resulted in unacceptable auto-segmentations for the heart. Figure 12.1 illustrates differences between challenge and institutional data from the study by Feng et al. The authors explain that most of their thoracic cancer patients are treated using an abdominal compression immobilization technique, which pushes the heart (shown as manually contoured in the figure) cranially into the thoracic cavity. For these patients, the model trained only on the challenge data over-contours the heart, as shown in Figure 12.2a.

It remains unknown how deep learning auto-segmentation models trained using limited datasets (n < 50) behave when used for a wide variety of clinical applications. In this chapter, the 2017 AAPM Thoracic Auto-segmentation Challenge data is used as a case study to train a commonly used deep learning auto-segmentation method and evaluate its performance on a variety of treatment sites (e.g. non-thorax) and clinical presentations (e.g. atelectasis).

FIGURE 12.1 Figure from Feng et al. [2] illustrating differences in motion management techniques between challenge data (left) and institutional data (right) using sagittal views from thoracic CT scans. Most patients at their institution were treated using an abdominal compression technique, for which the hardware can be seen in the bottom right corner of the right panel. To highlight these differences, the heart contour is provided (purple) for both patient images.

FIGURE 12.2 Adapted figure from a study by Feng et al. [2]. Both panels show clinical (dark) and auto-segmented (light) heart contours. (a) highlights disagreement between the clinical and auto-segmented contours when the model is trained using only the challenge dataset, whereas (b) shows improved agreement between clinical and auto-segmented heart contours after the model has been re-trained using both the challenge data and an additional 30 thoracic cancer patient scans with abdominal compression immobilization techniques.

Deep Learning Architecture

Two-Stage U-Net Model

In this chapter, a two-stage U-net model similar to the approach previously introduced by Feng et al. [3] is used. Each of the two stages employs a 3D U-net architecture: the first localizes the normal tissues using a multi-class (background and five normal tissues) model, and the second segments individual normal tissues around the regions identified in the localization stage (Figure 12.3). This approach of combining neural networks that focus on localization first and then perform the segmentation is widely used in the literature [5-9]. The following subsections describe in more detail the selected image preprocessing steps and the two stages (localization and fine-detail segmentation neural networks) used for this case study.

Image Preprocessing

Image preprocessing is a critical step in any image-related task. To account for the variability in pixel spacing and slice thickness, the voxel size is standardized to 1.25 mm x 1.25 mm x 2.75 mm (in the x-, y-, and z-directions) for both CT images and their corresponding region of interest (ROI) masks, which are shaped to have size nz x ny x nx; this convention of ordering images in the z-, y-, and x-directions is used for all inputs to the model. These values were chosen because none of the provided images used this pixel spacing or slice thickness, which ensures that every input image goes through the standardization process. To reduce the Hounsfield unit (HU) range of the CT images, all voxels below -500 HU or above +500 HU were set to -500 and +500, respectively. The thresholded voxel values were then linearly transformed to the range [0, 1].
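The intensity step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's implementation: the function names are hypothetical, and rounding the resampled grid size to the nearest integer is an assumption.

```python
import numpy as np

HU_MIN, HU_MAX = -500.0, 500.0          # clipping window stated in the text
TARGET_SPACING_ZYX = (2.75, 1.25, 1.25)  # standardized voxel size in mm, ordered z, y, x


def normalize_ct(volume_hu):
    """Clip a CT volume (in HU) to [-500, 500] and rescale linearly to [0, 1]."""
    clipped = np.clip(np.asarray(volume_hu, dtype=np.float32), HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)


def resampled_shape(shape_zyx, spacing_zyx):
    """Grid size after resampling to the standardized voxel size.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    return tuple(
        int(round(n * s / t))
        for n, s, t in zip(shape_zyx, spacing_zyx, TARGET_SPACING_ZYX)
    )


# Toy volume ordered (z, y, x) with air-like, soft-tissue-like, and bone-like values.
toy = np.array([[[-1000.0, -500.0], [0.0, 250.0]],
                [[500.0, 1500.0], [-250.0, 40.0]]])
normalized = normalize_ct(toy)

# A hypothetical 150-slice scan with 3 mm slices and 0.9765625 mm pixels.
new_shape = resampled_shape((150, 512, 512), (3.0, 0.9765625, 0.9765625))
```

Values at or below -500 HU map to 0, values at or above +500 HU map to 1, and 0 HU maps to 0.5.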

For the first stage in this approach, the standardized voxel image and ROI masks were resized to 64 x 128 x 128 using a tri-linear interpolator. Prior to this resizing step, a binary 3D dilation was applied to the esophagus and spinal cord masks to ensure that these thin ROIs are preserved and not averaged out during downsampling. This 3D dilation is applied only when generating input data for the first stage, because the first-stage model is intended only to localize these normal tissues rather than to produce an anatomically correct segmentation.
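The reason for dilating thin structures before downsampling can be shown with a toy example. The sketch below is only illustrative: it uses block averaging as a simple stand-in for tri-linear resizing, a 6-connected structuring element, and a single dilation iteration, none of which are specified in the chapter.

```python
import numpy as np


def dilate_6(mask, iterations=1):
    """Binary dilation with a 6-connected (face-neighbour) structuring element."""
    out = np.asarray(mask, dtype=bool)
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = padded[1:-1, 1:-1, 1:-1].copy()
        grown |= padded[:-2, 1:-1, 1:-1] | padded[2:, 1:-1, 1:-1]  # z neighbours
        grown |= padded[1:-1, :-2, 1:-1] | padded[1:-1, 2:, 1:-1]  # y neighbours
        grown |= padded[1:-1, 1:-1, :-2] | padded[1:-1, 1:-1, 2:]  # x neighbours
        out = grown
    return out


def block_mean_downsample(mask, factor):
    """Average non-overlapping factor^3 blocks, then threshold at 0.5.

    Block averaging is a stand-in for tri-linear resizing in this sketch.
    The mask shape must be divisible by `factor` along every axis.
    """
    z, y, x = mask.shape
    blocks = np.asarray(mask, dtype=float).reshape(
        z // factor, factor, y // factor, factor, x // factor, factor
    )
    return blocks.mean(axis=(1, 3, 5)) > 0.5


# A one-voxel-wide "cord" running along z in an 8 x 8 x 8 volume.
cord = np.zeros((8, 8, 8), dtype=bool)
cord[:, 3, 3] = True

lost = block_mean_downsample(cord, 2)            # thin structure is averaged out
kept = block_mean_downsample(dilate_6(cord), 2)  # dilated structure survives
```

In this toy case the undilated cord disappears entirely after downsampling, whereas the dilated cord remains visible in every downsampled slice.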

In the second stage of this approach, the standardized voxel image and ROI masks were used as “full resolution” inputs for the fine-detail segmentation models. Here, the individual ROIs were cropped by identifying their left-right (LR), superior-inferior (SI), and anterior-posterior (AP) borders, and a margin was then applied to ensure a buffer large enough to encompass the random translations applied during training. Individual ROI input sizes are determined based on the volume sizes on the standardized voxel masks. For all cases in the training set, the displacement in the LR, SI, and AP directions was calculated for a specific ROI. Then, Equations 12.1-12.3 were used to determine the optimal value for each individual direction.

Here A is a vector containing all displacements for a specific direction (either LR, SI, or AP), and multiple is a constant used to ensure that the input size values are multiples of a power of 2. This constant was chosen to be 32 (2^5). The resulting input sizes for each ROI are listed in Table 12.1.
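The rounding to a multiple of 32 can be illustrated as follows. The exact form of Equations 12.1-12.3 is not restated here, so this sketch makes its own assumptions: it takes the largest displacement in A, adds an optional margin, and rounds up to the next multiple. The function names and the `margin` parameter are hypothetical.

```python
import math

import numpy as np


def roi_extent(mask):
    """Extent (z, y, x), in voxels, of the bounding box around a binary mask."""
    coords = np.argwhere(mask)
    if coords.size == 0:
        raise ValueError("mask is empty")
    return tuple(int(v) for v in coords.max(axis=0) - coords.min(axis=0) + 1)


def input_size(displacements, margin=0, multiple=32):
    """Round (largest displacement + margin) up to the next multiple.

    Using the maximum of A is an assumption of this sketch; the chapter's
    Equations 12.1-12.3 may use a different statistic.
    """
    return int(math.ceil((max(displacements) + margin) / multiple) * multiple)


# Toy mask ordered (z, y, x): a box spanning 5 x 35 x 61 voxels.
toy_mask = np.zeros((10, 50, 70), dtype=bool)
toy_mask[2:7, 10:45, 5:66] = True
extent_zyx = roi_extent(toy_mask)

# One displacement vector A per direction, collected over hypothetical cases.
size_si = input_size([5, 3])
size_ap = input_size([35, 40])
size_lr = input_size([61, 64], margin=8)
```

Keeping every input size a multiple of 32 means the volume can be halved five times without leaving a fractional voxel count.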

Stage 1: Localization through Coarse Segmentations

Several studies have demonstrated the effectiveness of segmentation networks to initially localize a region of interest prior to generating a final segmentation. Detection networks require large amounts of training data to identify useful patterns to accurately localize objects within an image. In contrast, segmentation networks can take advantage of the label map information to localize an ROI within an image with higher accuracy than detection networks, especially when training data is limited. It is for this reason that several works have highlighted the success of using this approach in medical image segmentation.

FIGURE 12.3 (A) Illustration of the two-stage approach in this chapter. The first stage generates coarse segmentations for all OARs, whereas the second stage uses these coarse segmentations to focus individual segmentation networks about the desired OARs to be automatically segmented. (B) The U-net network used in this work. Input sizes are detailed in Table 12.1. The numbers at each resolution stage represent the number of filters used for the stage; concatenation layers combine features from the encoding path with decoding path features. The number of features is doubled at each max pooling layer and halved at each de-convolutional layer. The final number of features in the softmax layer depends on the stage (i.e. six features/classes representing the five OARs and background in the first stage, and two features/classes representing background and foreground in the second stage).

TABLE 12.1
Input Size for 3D U-Net Networks Used in this Chapter. For Individual Segmentation Networks, Input Sizes Were Determined Individually for Each Normal Tissue by Sampling Volume Sizes for Each Structure on the Training Dataset

                SI Direction (z)    AP Direction (y)    LR Direction (x)
Left Lung
Right Lung
Spinal Cord

SI: superior-inferior (i.e. cranio-caudal), AP: anterior-posterior, LR: left-right

There are several advantages to using a segmentation network to localize an ROI as a first step in a segmentation model. First, the localization network is being used to find a specific ROI within the image space, which can be compared to “finding a needle in a haystack” for some volumes. In medical imaging, the image space can be large, and reducing this FOV through localization (and then cropping to the ROI) can lead to faster and more efficient training of a neural network. Second, since the localization network is used only to find a region to focus on during the segmentation stage, there is no requirement to maintain the original resolution of the image space (i.e. pixel or slice spacing). This allows the medical image to be resized to a smaller input size, which can better accommodate graphics processing unit (GPU) memory limitations when training a network.

There are a few things to consider when resizing an image to a smaller size. One unintended consequence of reducing the image size in medical imaging is that small or thin volume masks can lose useful information or be completely averaged out by the interpolation method chosen during the resizing process. Another disadvantage of using a localization stage in a segmentation task is that it can be computationally expensive during training and can increase the time needed to auto-segment a new patient. Lastly, using a multi-class localization stage assumes that all cases in the training data have ground truth volumes available for all normal tissues or structures to be auto-segmented. While this is true for well-curated datasets such as those often found in public segmentation challenges, it is not necessarily true for clinical data, where often only critical organs at risk within or near the treatment region are contoured. A way to address this could be to use multiple localization networks for organs that are less frequently contoured or contoured under a specific protocol for a limited number of cases.

In this chapter, a generic implementation of the 3D U-net was used for the localization stage (Figure 12.3). This network uses an input size of 64 x 128 x 128 (see Table 12.1), a kernel size of 3 x 3 x 3, and has two convolutional layers at each resolution level (all convolutional layers apply padding), which are followed by a max pooling operation. The network has a depth of six resolution levels; the first level uses eight convolutional filters for each convolutional layer, and the number of filters is doubled after each max pooling operation. Each convolutional layer (Conv3D) is followed by batch normalization (BN), which is then followed by the rectified linear unit (ReLU) activation function (Conv3D + BN + ReLU), prior to the next convolutional layer or max pooling layer. The localization network outputs coarse segmentations for six classes, which include the five organs-at-risk (OARs) (left and right lung, heart, esophagus, and spinal cord) and the background.

TABLE 12.2
Limit Values Selected for Augmentations Used During Training

Augmentation Type
Roll (rotation)
Pitch (rotation)
[90%, 110%]
Gaussian (sigma)

a x, y, and z translations were independently defined.
b In some cases, this could have been smaller to prevent cropping outside of the image space.
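The filter counts and feature-map sizes implied by the localization network described above can be worked out with simple arithmetic. The sketch below assumes 2 x 2 x 2 max pooling (the pooling size is not stated in the text) and padded convolutions that preserve spatial size; the function name is hypothetical.

```python
def unet_levels(input_size=(64, 128, 128), depth=6, base_filters=8):
    """List the filter count and feature-map size at each encoder level.

    Assumes padded convolutions (size preserved) and 2x2x2 max pooling
    (size halved) between levels; the pooling size is an assumption.
    """
    levels = []
    size = tuple(input_size)
    for level in range(depth):
        levels.append({
            "level": level + 1,
            "filters": base_filters * 2 ** level,  # doubled after each pooling
            "size_zyx": size,
        })
        size = tuple(n // 2 for n in size)
    return levels


levels = unet_levels()
for row in levels:
    print(row)
# Deepest level: 256 filters on a 2 x 4 x 4 feature map.
```

Under these assumptions, six resolution levels imply five pooling steps, which is consistent with requiring input sizes to be multiples of 32 (2^5).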

During training, resized volumes (Table 12.1) were used, and traditional augmentations such as translation, pitch and roll rotations, and zoom were applied. In addition, a Gaussian filter was applied with randomly selected sigma values to make the network robust to variations in contrast and sharpness. Values used for augmentations were randomly generated using a uniform distribution within the predefined limits shown in Table 12.2. Image padding was applied when needed using the reflection of the edge of the image requiring padding. A batch size of four per iteration was used. To train the model, the Dice loss function was used, as it generally converges faster than categorical cross-entropy. Chapter 10 gives more details regarding the use of loss functions in training deep learning models for auto-segmentation. The Adam optimizer was used with a learning rate of 0.001 and with β1 and β2 values of 0.9 and 0.999, respectively. To prevent overfitting, early stopping was used: six training cases (out of 36) were randomly selected prior to the start of training and were used to independently assess the progress of the model during training.
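For readers unfamiliar with the Dice loss, a minimal NumPy sketch is given here. The chapter does not specify which Dice variant was used, so the per-class averaging and the smoothing term `eps` are assumptions, and the function name is hypothetical.

```python
import numpy as np


def soft_dice_loss(probs, onehot, eps=1e-6):
    """Return 1 minus the mean per-class soft Dice coefficient.

    probs and onehot are arrays shaped (classes, z, y, x). Averaging over
    classes and the smoothing term eps are assumptions of this sketch.
    """
    probs = np.asarray(probs, dtype=np.float64)
    onehot = np.asarray(onehot, dtype=np.float64)
    axes = tuple(range(1, probs.ndim))  # sum over all spatial axes
    intersection = np.sum(probs * onehot, axis=axes)
    denominator = np.sum(probs, axis=axes) + np.sum(onehot, axis=axes)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return float(1.0 - dice.mean())


# Toy two-class label map ordered (z, y, x).
labels = np.zeros((2, 4, 4), dtype=int)
labels[:, :2, :] = 1
onehot = np.stack([labels == 0, labels == 1]).astype(float)

loss_perfect = soft_dice_loss(onehot, onehot)       # prediction equals ground truth
loss_wrong = soft_dice_loss(1.0 - onehot, onehot)   # every voxel misclassified
```

A perfect prediction gives a loss of essentially zero, and a completely wrong prediction gives a loss close to one.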

When running inference on a new patient, the coarse segmentations from the localization stage are used to identify which regions of the “full resolution” image contain each OAR. These regions are then cropped and passed to the individual segmentation models, which produce the fine-detail segmentations.

Stage 2: OAR Segmentation through Fine-Detail Segmentation

The second stage of this approach focuses on fine-detail segmentations by training individual ROI 3D U-net models using “full resolution” inputs. There are a few advantages to using individual models per ROI. First, each model can be trained to focus on learning intensity features that are characteristic of its ROI. Second, using individual models per ROI allows for additional flexibility in model design and input size. Third, individual ROI models can be updated independently without making changes to other ROI models. This is advantageous when a segmentation model produces high-quality segmentations for most ROIs but produces inaccurate segmentations for a few, more challenging, ROIs. Lastly, using individual models per ROI allows for the use of segmentations from sparse datasets where most cases do not contain the full list of ROIs to be auto-segmented. Using individual ROI models has its disadvantages, though: training individual ROI models can be computationally expensive and less efficient when computational resources are limited. Also, using multiple models can increase the time required to auto-segment a large list of ROIs, because each model needs to be loaded onto the GPU, its ROI input data has to be generated, and this input data must then be fed through the neural network to predict the resulting auto-segmentations.

During training, the CT images were cropped to regions around the ground truth segmentations for each individual ROI provided in the training data. Individual ROI models use different input sizes, which were determined based on training data volumes (see the Image Preprocessing section), except for the left and right lung, where the image size is the same (Table 12.1). The architecture used to train each ROI model has identical parameters to the 3D U-net model trained for the localization stage; here, the training details (loss, optimizer, etc.) and image augmentations remain the same as those described in the section on the localization stage.

In the testing phase, a clustering approach was used to identify cropped CT image inputs for each model by using the segmentations generated from the localization stage network. This approach is described in more detail in Section 12.2.2.

Test-Time Cluster Cropping Technique

Instead of using a traditional tile-and-stitch approach, a cluster cropping technique was introduced to focus the cropped CT image volumes on the localized region identified in the first stage of the model. Here, the resulting coarse segmentation masks (size of 64 x 128 x 128) from the first-stage model were resized back to the standardized voxel size for a specific patient. Then, K-means clustering was applied to the coarse segmentations to identify cluster centroids, which serve as the centers of the regions of the standardized voxel CT image that are cropped and used as inputs to the trained second-stage models at test time (Figure 12.4). K-means clustering assigns individual voxels within an ROI to the nearest cluster centroid by distance, resulting in an even distribution of clusters (and therefore centroids) throughout a volume mask. Using many clusters and/or a large input size ensures large overlap between patches, which in turn increases the confidence in the foreground probability assigned to individual voxels. To increase the confidence in the predictions, the number of clusters (K) was set to 24 for all ROIs.
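The cluster cropping idea can be sketched with a small NumPy implementation of K-means (Lloyd's algorithm) on the voxel coordinates of a coarse mask. This is an illustration under stated assumptions, not the chapter's code: the initialization, iteration count, random seed, toy patch size, and the policy of shifting patches to stay inside the volume are all hypothetical choices.

```python
import numpy as np


def kmeans_centroids(mask, k=24, iterations=20, seed=0):
    """Run Lloyd's K-means on the (z, y, x) coordinates of a binary mask."""
    coords = np.argwhere(mask).astype(np.float64)
    if len(coords) < k:
        raise ValueError("mask has fewer voxels than clusters")
    rng = np.random.default_rng(seed)
    centroids = coords[rng.choice(len(coords), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every voxel to every centroid, then nearest assignment.
        distances = np.linalg.norm(coords[:, None, :] - centroids[None, :, :], axis=2)
        nearest = distances.argmin(axis=1)
        for j in range(k):
            members = coords[nearest == j]
            if len(members):  # keep the previous centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def crop_bounds(centroid, patch_size, volume_shape):
    """Start/stop indices of a patch centred on a centroid.

    The patch is shifted where needed so that it stays inside the volume
    (assumes the patch is no larger than the volume along each axis).
    """
    bounds = []
    for c, p, n in zip(centroid, patch_size, volume_shape):
        start = int(round(float(c))) - p // 2
        start = max(0, min(start, n - p))
        bounds.append((start, start + p))
    return bounds


# Toy coarse segmentation ordered (z, y, x).
coarse_mask = np.zeros((16, 32, 32), dtype=bool)
coarse_mask[2:14, 8:24, 8:24] = True

centroids = kmeans_centroids(coarse_mask, k=24)
patches = [crop_bounds(c, (8, 16, 16), coarse_mask.shape) for c in centroids]
```

Each entry in `patches` describes one crop of the standardized CT volume that would be passed to a second-stage model; neighboring centroids yield overlapping crops.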

The predictions of the cluster patch inputs are then stacked so that individual voxel averages can be calculated using the overlapping prediction values for each voxel. The resulting probability prediction map is resized from the “full resolution” image back to the original image size for that patient. The probabilities are then converted to a mask (using p > 0.5), and post-processing is applied before converting the auto-segmentations to Digital Imaging and Communications in Medicine (DICOM) format.
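The averaging of overlapping patch predictions can be sketched as follows. This is a minimal illustration with hypothetical function names; assigning a probability of zero to voxels not covered by any patch is an assumption of the sketch.

```python
import numpy as np


def average_patches(volume_shape, patches):
    """Average overlapping patch probabilities voxel by voxel.

    patches is an iterable of (bounds, probs), where bounds is
    ((z0, z1), (y0, y1), (x0, x1)) and probs has the matching shape.
    Voxels covered by no patch are left at 0 (an assumption).
    """
    total = np.zeros(volume_shape, dtype=np.float64)
    counts = np.zeros(volume_shape, dtype=np.int64)
    for bounds, probs in patches:
        region = tuple(slice(start, stop) for start, stop in bounds)
        total[region] += probs
        counts[region] += 1
    mean = np.zeros(volume_shape, dtype=np.float64)
    np.divide(total, counts, out=mean, where=counts > 0)
    return mean


def to_mask(probabilities, threshold=0.5):
    """Convert a probability map to a binary mask using p > threshold."""
    return probabilities > threshold


# Two overlapping toy patches along x in a 1 x 1 x 4 volume.
patch_a = (((0, 1), (0, 1), (0, 3)), np.array([[[0.9, 0.8, 0.2]]]))
patch_b = (((0, 1), (0, 1), (1, 4)), np.array([[[0.6, 0.6, 0.1]]]))

mean_probs = average_patches((1, 1, 4), [patch_a, patch_b])
mask = to_mask(mean_probs)
```

In the toy example the two middle voxels are covered by both patches, so their probabilities are averaged before thresholding.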
