Organ-Specific Segmentation Versus Multi-Class Segmentation Using U-Net

Xue Feng and Quan Chen


In clinical radiation treatment planning, multiple organs need to be segmented from the CT images to calculate the dose distribution on each organ. Deep convolutional neural networks (DCNNs), represented by U-net, have been widely used in this application and have demonstrated performance far superior to that of traditional methods [1]. For the multi-organ segmentation task, there are two design choices: a single network can be designed and trained with a direct multi-class segmentation output, or multiple organ-specific networks can be trained, each performing a binary segmentation. This study compares these two options using the data from the 2017 AAPM Thoracic Auto-segmentation Challenge and evaluates the advantages and disadvantages of each method.

Materials and Methods


The 2017 AAPM Thoracic Auto-segmentation Challenge dataset was used in this study. Details of this dataset are provided in Chapter 1 and in Yang et al. [1]. The organs to be segmented included the esophagus, heart, left and right lungs, and spinal cord. The Radiation Therapy Oncology Group (RTOG) contouring guideline [2-3] was used for the ground truth labeling and evaluation of the output.

Network Structure

As CT images are often acquired in three dimensions (3D), a 3D U-net was used as the backbone network structure to exploit the full 3D spatial information directly from the image volume. Chapter 8 evaluates the performance difference between two-dimensional (2D) and 3D networks. All 2D operations were replaced with their 3D counterparts [4]. Padding was used during convolution to maintain the spatial dimension of each layer so that the output labels would have the same size as the input images. The input image dimensions were set to 72 x 208 x 208, determined by the available graphics processing unit (GPU) memory and the average aspect ratio of the chest CT; the 72 corresponds to the craniocaudal direction. The number of features at the first layer was set to 24. Each encoding block consisted of two sets of 3 x 3 x 3 3D convolution filters, a batch normalization layer, and a rectified linear activation function. With each pooling step, the spatial dimension was reduced in all directions and the number of features was doubled. The final segmentation map contained p classes. For the organ-specific networks, p is 2, as the map only contains the target organ and background; for the multi-organ network, p is 6 to represent the five organs and the background. Figure 9.1 shows the network structure.
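The shape bookkeeping described above can be illustrated with a short sketch. The number of pooling steps is not stated in the text, so `num_poolings=3` is an assumption, as is the use of 2 x 2 x 2 pooling; the function name is hypothetical.

```python
def encoder_shapes(input_shape=(72, 208, 208), base_features=24, num_poolings=3):
    """Return (spatial_shape, num_features) at each encoder level.

    Assumes 2x2x2 pooling and padded convolutions, so only the pooling
    steps change the spatial size. num_poolings is an assumed depth.
    """
    shapes = []
    shape, features = tuple(input_shape), base_features
    for level in range(num_poolings + 1):
        shapes.append((shape, features))
        if level < num_poolings:
            if any(s % 2 for s in shape):
                raise ValueError(f"shape {shape} is not divisible by 2")
            shape = tuple(s // 2 for s in shape)  # pooling halves every axis
            features *= 2                         # features double per level
    return shapes


for spatial, feats in encoder_shapes():
    print(spatial, feats)
```

Under these assumptions, a fourth pooling step would fail because 9 slices cannot be halved evenly, which is one reason the input size and the network depth have to be chosen together.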

Pre-Processing and Downsampling

The CT imaging protocol, including pixel spacing, axial slice thickness, and field-of-view in the z-direction, can vary between scans. Image pre-processing was therefore performed to reduce the variability within the dataset, for both training and testing cases, and to fit every image to the same input matrix size of the network. In addition, because of the limited GPU memory, the original images were downsampled. As the axial field-of-view (FOV) of CT images is often larger than the body, all axial slices were first resampled to an in-plane resolution of 1.9512 mm x 1.9512 mm and then center-cropped to 208 x 208. The number of slices was resampled to 72 without cropping or padding, as some organs extend to the very top or bottom slices. Thus, the resulting 3D input image size was 72 x 208 x 208, matching the network input. The ground truth labels were processed using the same workflow to be consistent with the CT images. Finally, to normalize the image intensity, voxel values below -1000 Hounsfield units (HU) were set to -1000 and values above 600 HU were set to 600. The resulting images were then normalized to the range [0, 1].
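The cropping and intensity normalization steps can be sketched as follows. This is a minimal illustration that assumes the volume has already been resampled to the target resolution and slice count (the resampling itself is omitted); the function names and the synthetic volume are hypothetical.

```python
import numpy as np

HU_MIN, HU_MAX = -1000.0, 600.0  # clipping window stated in the text


def center_crop_inplane(volume, size=208):
    """Center-crop the two in-plane axes of a (z, y, x) volume."""
    _, ny, nx = volume.shape
    if ny < size or nx < size:
        raise ValueError("in-plane size is smaller than the crop size")
    y0, x0 = (ny - size) // 2, (nx - size) // 2
    return volume[:, y0:y0 + size, x0:x0 + size]


def normalize_hu(volume):
    """Clip to [HU_MIN, HU_MAX] and rescale linearly to [0, 1]."""
    clipped = np.clip(np.asarray(volume, dtype=np.float32), HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)


# Synthetic stand-in for a CT volume already resampled to 72 slices.
rng = np.random.default_rng(0)
ct = rng.integers(-1200, 1500, size=(72, 256, 256)).astype(np.int16)
x = normalize_hu(center_crop_inplane(ct))
print(x.shape)
```

The same crop would be applied to the label maps so that they stay aligned with the images.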

Quantitative Evaluation Metrics

When ground truth contours are available, the automatic segmentation results can be evaluated using quantitative measures. A detailed discussion of quantitative evaluation measures is given in Chapter 15. The measures used in the 2017 AAPM Thoracic Auto-segmentation Challenge (the Dice coefficient, mean surface distance, and 95% Hausdorff distance) were used in this study. The definitions of these measures follow those used in the challenge and may differ from the definitions or implementations given in Chapter 15; therefore, the definitions are provided here.

FIGURE 9.1 General structure for 3D U-net used in this application. Each encoding block consists of two sets of consecutive convolution, batch normalization, and rectified linear activation layers. Padding was used to maintain the spatial dimension during convolution. The number of features was doubled after each pooling layer. Long range connections were used by concatenating the outputs from the corresponding encoding blocks with the decoding blocks. For the multi-organ segmentation network, p = 1 + number of organs; for organ-specific networks, p = 2.

The Dice coefficient (D) is calculated as:

$$D = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where X and Y are the ground truth and the algorithm segmented contours, respectively. The directed average Hausdorff measure is the average distance of a point in X to its closest point in Y, given as:

$$\vec{d}_{H,\mathrm{avg}}(X, Y) = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} d(x, y)$$

where d(x, y) is the distance between points x and y. The mean surface distance (MSD) is then defined as the average of the two directed average Hausdorff measures:

$$\mathrm{MSD} = \frac{\vec{d}_{H,\mathrm{avg}}(X, Y) + \vec{d}_{H,\mathrm{avg}}(Y, X)}{2}$$

The 95% directed percent Hausdorff measure is the 95th percentile distance over all distances from points in X to their closest point in Y. Denoting the 95th percentile as K_{95}, this is given as:

$$\vec{d}_{H,95}(X, Y) = K_{95}\left(\min_{y \in Y} d(x, y)\right) \quad \forall x \in X$$

The undirected 95% Hausdorff distance (HD95) is then defined as the average of the two directed distances:

$$\mathrm{HD95} = \frac{\vec{d}_{H,95}(X, Y) + \vec{d}_{H,95}(Y, X)}{2}$$
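The three measures described in this section can be sketched in NumPy as below. This is an illustrative implementation, not the challenge's evaluation code: it assumes the Dice coefficient is computed on binary masks, that the surface points of each contour have already been extracted as coordinate arrays in millimeters, and that distances are Euclidean. The brute-force distance matrix is only practical for small point sets.

```python
import numpy as np


def dice(x_mask, y_mask):
    """Dice coefficient of two binary masks of equal shape."""
    x_mask, y_mask = np.asarray(x_mask, bool), np.asarray(y_mask, bool)
    denom = x_mask.sum() + y_mask.sum()
    if denom == 0:
        return 1.0  # assumed convention when both masks are empty
    return 2.0 * np.logical_and(x_mask, y_mask).sum() / denom


def closest_distances(x_pts, y_pts):
    """Distance from every point in x_pts to its closest point in y_pts."""
    diff = x_pts[:, None, :] - y_pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def msd(x_pts, y_pts):
    """Mean surface distance: average of the two directed averages."""
    return 0.5 * (closest_distances(x_pts, y_pts).mean()
                  + closest_distances(y_pts, x_pts).mean())


def hd95(x_pts, y_pts):
    """Average of the two directed 95th percentile distances."""
    return 0.5 * (np.percentile(closest_distances(x_pts, y_pts), 95)
                  + np.percentile(closest_distances(y_pts, x_pts), 95))


# Two parallel rows of points, 2 mm apart.
xs = np.arange(10, dtype=float)
X = np.stack([xs, np.zeros(10)], axis=1)
Y = np.stack([xs, np.full(10, 2.0)], axis=1)
print(msd(X, Y), hd95(X, Y))
```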

Implementation and Comparison Experiments

Both the multi-organ network and each organ-specific network model were implemented using the TensorFlow framework. To reduce the effects of the imbalance in voxel counts among organs, a weighted cross-entropy loss was used. For the multi-organ network, the relative weights for the background and five organs were: background: 1.0, spinal cord: 2.0, right lung: 1.0, left lung: 1.0, heart: 1.0, esophagus: 3.0. For the organ-specific networks, the same relative weights were used, but the background was defined to include all voxels other than the specific organ. In practice, as smaller organs have far fewer corresponding voxels, their weights in the cross-entropy loss function are often increased to avoid vanishing gradients when learning these structures. However, a very large weight can also lead to increased false positives. The weights for the spinal cord and the esophagus were thus empirically set to 2.0 and 3.0, respectively, without an in-depth investigation of the actual effect on performance. The Adam optimizer [5] with a learning rate of 0.0005 was used. Training ran for 200 epochs, with each epoch looping through all the training cases once. During training, data augmentation was performed by applying random translations, rotations, and scaling to the input images and the corresponding ground truth label maps at each iteration. Further discussion of data augmentation is given in Chapter 11.
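A weighted cross-entropy of this kind can be sketched in NumPy as follows. The class-to-index mapping and the normalization (a plain mean over voxels) are assumptions made for illustration; the text does not specify them, and the study itself used TensorFlow.

```python
import numpy as np

# Weights from the text; the index order is an assumed mapping:
# 0 background, 1 spinal cord, 2 right lung, 3 left lung, 4 heart, 5 esophagus.
CLASS_WEIGHTS = np.array([1.0, 2.0, 1.0, 1.0, 1.0, 3.0])


def weighted_cross_entropy(logits, labels, class_weights=CLASS_WEIGHTS):
    """Mean class-weighted cross-entropy.

    logits: (n_voxels, p) raw network outputs.
    labels: (n_voxels,) integer class index per voxel.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(labels.size), labels]       # per-voxel loss
    return float((class_weights[labels] * nll).mean())


# With uniform predictions, an esophagus voxel contributes three times
# as much to the loss as a background voxel.
logits = np.zeros((2, 6))
labels = np.array([0, 5])
print(weighted_cross_entropy(logits, labels))
```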

Note that although the multi-label maps often use integer values from 0 to 5 to denote the different classes, these values are not numerically continuous. Interpolating them directly can therefore introduce errors: for example, two neighboring voxels with labels of 1 and 5 may generate an intermediate voxel with a value of 3, which corresponds to another class. To avoid such errors, the multi-label map was first converted to multiple binary maps, one for each organ, and the transformations were then applied to each binary map separately. After the random transformations were applied, a threshold value of 0.5 was applied to each interpolated organ segmentation to convert it back to binary values. The trained networks were then deployed on the testing dataset. As a simple post-processing step, isolated voxels that were labeled as a specific organ but were not connected to the majority of the voxels belonging to that organ were regarded as false positives and removed.
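Both ideas in this paragraph can be illustrated with a small sketch. The first function interpolates a one-dimensional label row through per-organ binary maps and a 0.5 threshold; the 1-D linear interpolation is a simplified stand-in for the 3-D random transformations. The second keeps only the largest connected component of a mask; 6-connectivity and the "largest component" rule are assumptions, as the text does not specify them. All function names are hypothetical.

```python
from collections import deque

import numpy as np


def interp_labels_via_binary(labels, new_coords, num_classes=6):
    """Interpolate a 1-D label row using one binary map per organ."""
    old_coords = np.arange(labels.size)
    out = np.zeros(new_coords.size, dtype=labels.dtype)
    for c in range(1, num_classes):  # class 0 (background) is the default
        binary = (labels == c).astype(float)
        interp = np.interp(new_coords, old_coords, binary)
        out[interp > 0.5] = c        # threshold back to a binary map
    return out


def keep_largest_component(mask):
    """Keep only the largest 6-connected component of a 3-D mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    best = []
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        seen[start] = True
        queue, component = deque([start]), [start]
        while queue:  # breadth-first flood fill from the seed voxel
            z, y, x = queue.popleft()
            for dz, dy, dx in steps:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= n[i] < mask.shape[i] for i in range(3))
                if inside and mask[n] and not seen[n]:
                    seen[n] = True
                    queue.append(n)
                    component.append(n)
        if len(component) > len(best):
            best = component
    out = np.zeros_like(mask)
    if best:
        out[tuple(np.array(best).T)] = True
    return out


row = np.array([0, 0, 5, 5])
coords = np.array([0.0, 1.0, 1.4, 1.6, 2.0, 3.0])
naive = np.rint(np.interp(coords, np.arange(4), row)).astype(int)
print(naive)                                  # contains spurious classes 2 and 3
print(interp_labels_via_binary(row, coords))  # only classes 0 and 5
```

In practice a library routine for connected-component labeling would be much faster than the pure Python flood fill shown here.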

In the experiments, 24 cases were randomly selected for training both types of network and the remaining 12 were used for performance evaluation and comparison. The Dice coefficient, MSD, and HD95 were calculated for each organ. Student's t-tests were used, with p < 0.05 as the criterion for statistical significance.
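As an illustration of the significance test, the sketch below computes a paired t statistic for per-case scores from the two strategies. The text does not state whether the tests were paired, so the pairing is an assumption, and the Dice values are hypothetical numbers, not data from this study. The constant 2.201 is the two-sided 5% critical value of Student's t distribution for 11 degrees of freedom (12 test cases).

```python
import numpy as np

T_CRIT_DF11 = 2.201  # two-sided 5% critical value for 11 degrees of freedom


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return float(t), n - 1


# Hypothetical per-case Dice scores for 12 test cases (illustration only).
multi = np.array([0.80, 0.85, 0.78, 0.82, 0.81, 0.79,
                  0.84, 0.77, 0.83, 0.80, 0.82, 0.78])
single = np.array([0.81, 0.84, 0.80, 0.82, 0.83, 0.78,
                   0.85, 0.79, 0.82, 0.81, 0.83, 0.79])
t, df = paired_t_statistic(multi, single)
significant = df == 11 and abs(t) > T_CRIT_DF11
print(f"t = {t:.3f}, df = {df}, significant at 0.05: {significant}")
```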


Tables 9.1, 9.2, and 9.3 show the Dice scores, MSD, and HD95 of the multi-organ network and organ-specific networks. No statistically significant differences were observed for any organ, indicating that the multi-organ and organ-specific networks yielded comparable results. Furthermore, although the organ-specific networks showed slightly better Dice scores for the spinal cord and esophagus, the MSD and HD95 of the spinal cord were larger, which makes it difficult to claim the superiority of one method over the other.

TABLE 9.1 Dice Scores of Multi-Organ Networks and Organ-Specific Networks

| Network | Spinal Cord | Right Lung | Left Lung | Heart | Esophagus |
| --- | --- | --- | --- | --- | --- |
| Multi-organ | 0.807 ± 0.056 | 0.972 ± 0.007 | 0.965 ± 0.012 | 0.899 ± 0.032 | 0.512 ± 0.141 |
| Organ specific | 0.816 ± 0.044 | 0.973 ± 0.007 | 0.964 ± 0.009 | 0.892 ± 0.027 | 0.543 ± 0.125 |

TABLE 9.2 MSD (mm) of Multi-Organ Networks and Organ-Specific Networks

| Network | Spinal Cord | Right Lung | Left Lung | Heart | Esophagus |
| --- | --- | --- | --- | --- | --- |
| Multi-organ | 1.832 ± 0.903 | 0.996 ± 0.236 | 1.102 ± 0.471 | 3.380 ± 1.373 | 6.693 ± 5.934 |
| Organ specific | 1.964 ± 1.119 | 0.936 ± 0.236 | 1.158 ± 0.449 | 3.653 ± 1.214 | 5.681 ± 4.664 |

TABLE 9.3 HD95 (mm) of Multi-Organ Networks and Organ-Specific Networks

| Network | Spinal Cord | Right Lung | Left Lung | Heart | Esophagus |
| --- | --- | --- | --- | --- | --- |
| Multi-organ | 8.268 ± 7.717 | 3.798 ± 0.897 | 4.498 ± 3.084 | 9.677 ± 3.921 | 20.884 ± 18.00 |
| Organ specific | 9.801 ± 9.158 | 3.525 ± 0.875 | 4.737 ± 3.091 | 11.44 ± 5.508 | 19.44 ± 15.94 |

Figure 9.2 shows the Dice scores for the training and validation datasets during the training process. For the multi-organ network, the mean values of all organs are shown. Comparing different organs with the organ-specific networks, the spinal cord and esophagus are the two organs that are more difficult to learn, as their training and testing Dice scores showed sudden jumps around iteration 1000, while the jumps happened at a much earlier stage for the other three organs. This is likely because the spinal cord and esophagus are relatively small organs, so the network learned to segment the background first. The multi-organ network also showed a stepwise increase, with the second jump occurring during iterations 1300-1800, which is assumed to be due to the spinal cord and esophagus. Compared with the organ-specific networks, the multi-organ network also took longer to reach its final performance, likely because the gradients were reduced once only the two organs that had not yet been learned still contributed to the gradient descent process. Despite the differences in convergence speed, the plots showed that all networks were able to converge to their final performance in fewer than half of the total number of iterations.


In this study, the performance of the multi-organ segmentation network and the organ-specific networks was compared with all other settings, such as input images, network structure, and training and testing augmentations, kept the same. The only difference between these two strategies is the last layer: the former aims to segment all organs simultaneously, while the latter segments one organ at a time by treating all voxels not pertaining to the target organ as background. Intuitively, the multi-organ network has more information provided during training, as a multi-label segmentation map is provided; therefore, ideally it is possible to utilize the label information from other organs to help the segmentation of the target organ. In contrast, in the organ-specific network, the detailed information of other organs is not available but is merged with the background. This is similar to the concept of multi-task learning vs single-task learning [6-7]. It is generally regarded that multi-task learning can yield better performance than single-task learning if the parallel tasks are strongly correlated, or share common features, so that one task can benefit from other tasks. However, it is often challenging to prove this rigorously for CNNs: although many parameters are shared among the different tasks, the detailed correlations are hard to analyze. For this specific application of thoracic organ segmentation, although the voxels of each organ are mutually exclusive and segmenting one organ may benefit from the contours of another organ, the segmentation criteria mostly rely on the target organ itself, or a combination with some key anatomical biomarkers. Therefore, it is hard to determine how much the segmentation of one organ benefits from the availability of the segmentation of other organs. The experiments in this study suggest that there is negligible benefit, as both strategies yield almost the same results on the validation datasets.

FIGURE 9.2 Dice scores of the training cases and testing cases during training for the multi-organ network and organ-specific networks. The Dice scores of the multi-organ network showed a stepwise increase, meaning that the segmentation of different organs was learned consecutively rather than simultaneously. The spinal cord and esophagus were more difficult to learn as the performance only increased after iteration 1000.

Comparing the two strategies, the multi-organ segmentation network ensures that all output labels are mutually exclusive, as the class with the highest probability is assigned to each voxel; therefore, no potential conflict exists. For organ-specific networks, as each network yields a binary label map, it is possible that one voxel is classified as belonging to the specific organ by multiple networks, so conflict resolution is needed. This often happens for boundary voxels between two organs. In this case, the heart shares some boundaries with both lungs, so conflicts may occur; Figure 9.3 shows an example of a final segmentation. One solution is to record the probabilistic output of each organ-specific network and, for a given voxel, follow the network yielding the highest foreground probability. In addition, although organs are very unlikely to overlap, the organ-specific networks provide a convenient way to handle overlapping regions of interest (ROIs), such as an organ and a lesion within the organ, as separate networks can be trained for organ and lesion segmentation. However, one significant disadvantage of the organ-specific networks is the prolonged training and testing time, as multiple models need to be trained and deployed to accomplish the same task.

FIGURE 9.3 Five organs automatically segmented on a validation case. All OARs have satisfactory contours with no obvious mistakes. Minimal human interaction is needed.
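The conflict-resolution rule described above (follow the organ-specific network with the highest foreground probability) can be sketched as follows. The array layout, the 0.5 foreground threshold, and the function name are assumptions for illustration.

```python
import numpy as np


def merge_organ_probabilities(fg_probs, threshold=0.5):
    """Merge organ-specific outputs into one label map.

    fg_probs: (n_organs, z, y, x) foreground probability from each
    organ-specific network. Returns an integer map in which 0 is
    background and k is the k-th organ (1-based).
    """
    fg_probs = np.asarray(fg_probs, dtype=float)
    winner = fg_probs.argmax(axis=0)  # network with the highest probability
    best = fg_probs.max(axis=0)
    return np.where(best > threshold, winner + 1, 0)


# Three voxels, two organs: the middle voxel is claimed by both networks.
probs = np.array([
    [[[0.9, 0.6, 0.1]]],  # organ 1
    [[[0.2, 0.7, 0.3]]],  # organ 2
])
print(merge_organ_probabilities(probs)[0, 0])
```

Here the contested middle voxel is assigned to organ 2, whose network reports the higher foreground probability, and the last voxel stays background because no network exceeds the threshold.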

Although the multi-organ and organ-specific networks did not show any differences in performance, a key factor that can affect the performance is the spatial resolution of the input images. In a previous study [8], the output of a multi-organ network was used to crop the images so that each contained only one organ, and organ-specific networks were then trained on the cropped images; the performance was significantly improved compared to the multi-organ network. This is expected, as more spatial detail becomes available, making organ segmentation more accurate. This strategy is only possible with the organ-specific networks, as many organs will be missing from the cropped images, so the multi-organ network will not be able to make correct segmentations. Alternatively, although it is possible to use a patch-based strategy [9] that extracts small patches and slides over the whole volume to train the multi-organ network, the spatial extent is limited by the patch size, so no global information can be learned.
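The cropping step of such a two-stage approach can be sketched as below: the coarse multi-organ output defines a bounding box around one organ, which is then used to crop the image. The margin value and the function name are hypothetical, and the sketch does not reproduce the details of the method in [8].

```python
import numpy as np


def crop_to_organ(image, label_map, organ_label, margin=8):
    """Crop image to the bounding box of one organ plus a margin.

    label_map is a coarse multi-organ segmentation on the same grid as
    image. Returns the cropped image and the slices used, so that the
    fine segmentation can be pasted back into the full volume.
    """
    coords = np.nonzero(label_map == organ_label)
    if coords[0].size == 0:
        raise ValueError("organ not found in the label map")
    slices = tuple(
        slice(max(int(c.min()) - margin, 0),
              min(int(c.max()) + 1 + margin, dim))
        for c, dim in zip(coords, label_map.shape)
    )
    return image[slices], slices


image = np.arange(1000, dtype=float).reshape(10, 10, 10)
coarse = np.zeros((10, 10, 10), dtype=int)
coarse[4:6, 4:6, 4:6] = 3  # a small organ with label 3
patch, where = crop_to_organ(image, coarse, organ_label=3, margin=1)
print(patch.shape, where)
```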

In this study, the Dice curves were analyzed during the training process to investigate the convergence of the networks. As expected, the smaller organs are more difficult to learn, as they converge more slowly, even with organ-specific networks. The curves for the multi-organ network also showed that, instead of learning all organs gradually, the network first learned to segment the easy organs with a sudden increase, followed by a later increase during which the segmentation of the hard organs was learned. A detailed comparison did show that the organ-specific networks can learn the segmentation more rapidly, with fewer iterations between the “not learned” and “learned” states. However, as all models converge rather rapidly, these differences are not likely to make a significant impact in practice.

One significant limitation of this study is the relatively small dataset, as only 36 cases were used in the experiments. While such a small training set may be insufficient for high quality segmentation, the dataset is sufficient for a comparison between multi-organ networks and organ-specific networks, which is the main focus of this chapter. Furthermore, both strategies are able to yield acceptable performance for all organs with the exception of the esophagus, which suffers the most from the reduced spatial resolution.

Another limitation is that this study only investigated thoracic organ segmentation. It is possible that for other body parts where the inter-connection of different organs is more apparent, such as the head and neck region, a multi-organ segmentation network can benefit more from multi-task learning, especially with a greatly increased number of organs.


In conclusion, this chapter has demonstrated that in thoracic organ segmentation, there are no statistically significant differences in performance between a multi-organ segmentation network and organ-specific networks. This is likely because the segmentation of each organ is largely independent. However, organ-specific networks are more attractive, as they can take high resolution images that contain only the specific organ as input to improve the performance, although at the cost of prolonged training and testing times.


The authors would like to thank NVIDIA Corporation for providing the GPU grant support for Dr. Quan Chen’s lab.


  1. Yang J, Veeraraghavan H, Armato SG III, et al. Autosegmentation for thoracic radiation treatment planning: a grand challenge at AAPM 2017. Med Phys. 2018;45(10):4568-4581. doi:10.1002/mp.13141.
  2. Kong FM, Ten Haken RK, Schipper M, et al. Effect of midtreatment PET/CT-adapted radiation therapy with concurrent chemotherapy in patients with locally advanced non-small-cell lung cancer: a phase 2 clinical trial. JAMA Oncol. 2017;3(10):1358-1365. doi:10.1001/jamaoncol.2017.0982.
  3. Kong FM, Ritter T, Quint DJ, et al. Consideration of dose limits for organs at risk of thoracic radiotherapy: atlas for lung, proximal bronchial tree, esophagus, spinal cord, ribs, and brachial plexus. Int J Radiat Oncol Biol Phys. 2011;81(5):1442-1457. doi:10.1016/j.ijrobp.2010.07.1977.
  4. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O. 3D U-Net: learning dense volumetric segmentation from sparse annotation. arXiv:1606.06650 [cs.CV].
  5. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv:1412.6980 [cs.LG].
  6. Ruder S. An overview of multi-task learning in deep neural networks. arXiv:1706.05098 [cs.LG].
  7. Zhang Y, Yang Q. A survey on multi-task learning. arXiv:1707.08114 [cs.LG].
  8. Feng X, Qing K, Tustison NJ, Meyer CH, Chen Q. Deep convolutional neural network for segmentation of thoracic organs-at-risk using cropped 3D images. Med Phys. 2019;46(5):2169-2180. doi:10.1002/mp.13466.
  9. Kim H, Jung J, Kim J, et al. Abdominal multi-organ auto-segmentation using 3D-patch-based deep convolutional neural network. Sci Rep. 2020;10(1):6204. doi:10.1038/s41598-020-63285-0.