Deep Learning Architecture Design for Multi-Organ Segmentation

Yang Lei, Yabo Fu, Tonghe Wang, Richard L.J. Qiu, Walter J. Curran, Tian Liu, and Xiaofeng Yang


An increasing number of deep learning (DL) techniques have been proposed in the computer vision field in recent years. Inspired by their success, researchers have extended them to organ segmentation tasks in medical images [1-20]. DL-based methods adaptively explore deep features from medical images to represent the image structural information in detail. This fundamental change in methodology enables them to achieve state-of-the-art performance in medical image segmentation, especially in multi-organ segmentation. Using convolutional neural networks as a basic component, current studies have proposed a variety of DL network architectures. These architectures vary in network structure, complexity, and implementation, so their performance varies and is task dependent. Reviewing recently developed network architectures indicates the progress in current DL and facilitates the clinical transition of auto-segmentation methods. This review also reveals the limitations in current network design that need to be addressed in future studies. In this chapter, popular deep learning network designs for multi-organ segmentation are summarized. Specifically, thoracic organ segmentation is used as an example to discuss network performance and the challenges of organ-at-risk (OAR) auto-contouring for thoracic radiation treatment planning. This survey aims to:

  • Summarize the latest architectural developments in DL-based medical image multi-organ segmentation
  • Highlight contributions, identify challenges, and outline future trends
  • Provide benchmark evaluations of recently published DL-based multi-organ segmentation methods

Deep Learning Architecture in Medical Image Multi-Organ Segmentation

The task of medical image multi-organ segmentation is typically defined as assigning each voxel of a medical image to one of several labels that represent the objects of interest. Segmentation is one of the most commonly studied DL-based applications in the medical field. Consequently, there is a wide variety of methodologies with many different network architectures.

There are many ways to categorize the DL-based multi-organ segmentation methods according to their properties, such as network architecture, training process (supervised, semi-supervised, unsupervised, transfer learning), input size (patch-based, whole volume-based, 2D, 3D), and so on. In this chapter, approaches are classified into six categories based on their architecture, namely: (1) auto-encoder, (2) convolutional neural network, (3) fully convolutional network, (4) generative adversarial network (GAN), (5) regional convolutional neural network, and (6) hybrid DL-based methods. For each category, a comprehensive table is provided, listing all the surveyed works belonging to the category and summarizing their important features. Besides multi-organ segmentation methods, single-organ segmentation methods are also included, since single-organ segmentation can be easily transformed into multi-organ segmentation by replacing the last layer's binary output with a multi-channel binary output. The difference between multi-organ and single-organ approaches is considered in Chapter 4. Similarly, medical image object detection methods are included, as they can be used to first obtain the region of interest (ROI) to aid the segmentation procedure and improve the segmentation accuracy.
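The step from a single-organ to a multi-organ output layer can be illustrated with a minimal NumPy sketch. It is not taken from any surveyed work: the feature map, the channel count, and the function name `segmentation_head` are assumptions made for illustration. It also uses a softmax over the output channels so that each voxel receives exactly one label, whereas a per-channel sigmoid (one binary mask per organ) is the other common choice.

```python
import numpy as np

def segmentation_head(features, weights, bias):
    """Per-voxel linear map, equivalent to a 1x1x1 convolution.

    features: (C, D, H, W) feature map from any backbone (hypothetical here).
    weights:  (K, C), one row per output channel; bias: (K,).
    Returns logits of shape (K, D, H, W).
    """
    return np.einsum("kc,cdhw->kdhw", weights, features) + bias[:, None, None, None]

def softmax(logits, axis=0):
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 16, 16))  # C=8 channels on a 4x16x16 volume

# Single-organ head: one output channel, sigmoid and threshold give a binary mask.
w1, b1 = rng.normal(size=(1, 8)), np.zeros(1)
organ_prob = 1.0 / (1.0 + np.exp(-segmentation_head(features, w1, b1)[0]))
binary_mask = organ_prob > 0.5

# Multi-organ head: K output channels (background plus organs), softmax and argmax.
num_labels = 5  # illustrative: background + 4 organs
wk, bk = rng.normal(size=(num_labels, 8)), np.zeros(num_labels)
probs = softmax(segmentation_head(features, wk, bk), axis=0)
label_map = probs.argmax(axis=0)  # one integer label per voxel
```

Only the last layer differs between the two heads; everything upstream of `features` can stay the same.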

Before diving into the details of each category, a detailed overview of DL-based medical image multi-organ segmentation methods with their corresponding components and features is provided in Figure 7.1. The aim of Figure 7.1 is to give readers an overall understanding by listing important features of each category. The definition, features, and challenges of each category are also listed in Table 7.1.

Works cited in this review were collected from various databases, including Google Scholar, PubMed, Web of Science, Semantic Scholar, and so on. To collect as many works as possible, a variety of keywords was used, including but not limited to deep learning, multi-organ, medical segmentation, convolutional neural network, and so on. Over 180 papers that are closely related to DL-based medical image segmentation and over 40 papers that are closely related to multi-organ segmentation were reviewed. Most of these works were published between 2017 and 2019. The number of multi-organ publications is plotted against year by stacked bar charts in Figure 7.2, with the number of papers in each of the six categories shown. The dotted line in Figure 7.2 indicates increased interest in DL-based multi-organ segmentation methods over the years, highlighting the dramatic recent growth in publications.

FIGURE 7.1 Overview of six categories of DL-based methods in medical image segmentation.

Auto-Encoder Methods

Auto-Encoder and Its Variants

In the literature, the autoencoder (AE) and its variants have been extensively studied and continue to be utilized in medical image analysis [21]. AEs are often used for unsupervised [22] and semi-supervised [23] neural network learning. As shown in Figure 7.3, AEs usually consist of neural network encoder layers that transform the input into a latent or compressed representation by minimizing the reconstruction errors between input and output values of the network, and network decoder layers that restore the original input from the low-dimensional latent space. By constraining the dimension of latent representation, AEs can discover relevant patterns from the data.
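The encoder/decoder structure and the reconstruction objective described above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not a method from the cited works: it uses a purely linear encoder and decoder, synthetic data, and a hand-written gradient step, and all sizes and the learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 8-D that actually lie on a 2-D subspace.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
x = latent_true @ mixing

n_in, n_latent = 8, 2  # the latent code is narrower than the input
w_enc = rng.normal(scale=0.1, size=(n_in, n_latent))
w_dec = rng.normal(scale=0.1, size=(n_latent, n_in))

def reconstruction_loss(x, w_enc, w_dec):
    z = x @ w_enc        # encoder: input -> latent representation
    x_hat = z @ w_dec    # decoder: latent representation -> reconstruction
    return np.mean((x_hat - x) ** 2)

initial_loss = reconstruction_loss(x, w_enc, w_dec)

lr = 0.05
for _ in range(2000):
    z = x @ w_enc
    x_hat = z @ w_dec
    grad_out = 2.0 * (x_hat - x) / x.size      # d(loss)/d(x_hat)
    grad_dec = z.T @ grad_out                  # d(loss)/d(w_dec)
    grad_enc = x.T @ (grad_out @ w_dec.T)      # d(loss)/d(w_enc)
    w_dec -= lr * grad_dec
    w_enc -= lr * grad_enc

final_loss = reconstruction_loss(x, w_enc, w_dec)
```

Because the latent code has only two dimensions, the network can lower the reconstruction error only by capturing the dominant structure of the data, which is the sense in which constraining the latent dimension forces it to discover relevant patterns.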

To prevent an AE from learning an identity function, several improved AEs have been proposed. The most widely used model among deep unsupervised architectures is the stacked AE (SAE). An SAE, also known as a deep AE, is constructed by stacking AEs on top of each other, so that the output of each layer is wired to the inputs of the successive layer [22]. To obtain good parameters, SAEs use greedy layer-wise training. The benefit of an SAE is that it represents a deeper network with more hidden layers and therefore has greater expressive power. Furthermore, it usually captures a useful hierarchical grouping of the input [22].
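Greedy layer-wise training can be sketched as follows: each AE is trained on the codes produced by the AE below it, and the trained encoders are then stacked. This is a minimal NumPy sketch under stated assumptions (tanh encoder, linear decoder, random input data, and the illustrative helper names `train_autoencoder` and `encode`); it is not the training procedure of any specific cited work.

```python
import numpy as np

def train_autoencoder(x, n_hidden, steps=500, lr=0.1, seed=0):
    """Fit one tanh-encoder / linear-decoder AE and return (w_enc, b_enc, losses)."""
    rng = np.random.default_rng(seed)
    n_in = x.shape[1]
    w_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
    b_enc = np.zeros(n_hidden)
    w_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
    b_dec = np.zeros(n_in)
    losses = []
    for _ in range(steps):
        z = np.tanh(x @ w_enc + b_enc)
        x_hat = z @ w_dec + b_dec
        err = x_hat - x
        losses.append(np.mean(err ** 2))
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ w_dec.T) * (1.0 - z ** 2)  # backprop through tanh
        w_dec -= lr * (z.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        w_enc -= lr * (x.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)
    return w_enc, b_enc, losses

def encode(x, w_enc, b_enc):
    return np.tanh(x @ w_enc + b_enc)

rng = np.random.default_rng(1)
x = rng.normal(size=(256, 16))  # stand-in for flattened image patches

# Greedy layer-wise training: each AE is fit on the codes of the layer below.
layer_sizes = [8, 4]
stack, h = [], x
for n_hidden in layer_sizes:
    w, b, losses = train_autoencoder(h, n_hidden)
    stack.append((w, b))
    h = encode(h, w, b)  # the output of this layer feeds the next AE

deep_code = h  # representation produced by the full encoder stack
```

In practice the stacked encoders are usually fine-tuned jointly afterwards; that step is omitted here for brevity.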

TABLE 7.1 Summary of Six Categories of DL-Based Methods in Medical Image Segmentation

Category | Definition | Features | Challenges
Auto-encoder (AE) | Single neural network encoder/decoder layer | Low model complexity | Poor performance on target contours with large shape variability; large computational complexity when stacking multiple AEs for a deeper network
Convolutional neural network (CNN) | Input/output layers and multiple hidden layers including convolutional layers, max pooling layers, batch normalization layers, dropout layers, fully connected layers, and normalization layers | Facility for deeper networks | The fully connected layer requires the classification step to be performed voxel-wise or patch-wise
Fully convolutional network (FCN) | CNN with fully connected layer replaced by convolutional layer | Enables end-to-end segmentation | Fixed receptive field size; voxel-wise loss introduces boundary leakage in low-contrast regions
Generative adversarial network (GAN) | Two competing networks: a generator and a discriminator | High accuracy in low-contrast regions, attributed to the adversarial loss provided by the discriminator | Less effective in simultaneous multi-organ segmentation due to imbalance of the loss function among different organs
Regional convolutional neural network | CNN using selective search to extract candidate regions | Enables simultaneous multi region-of-interest detection and multi-organ segmentation | Large computational burden when training with 3D image volumes
Hybrid DL-based methods | Two or more networks with different architectures for different functional purposes | Better performance and lower demand on training data size | High model complexity

FIGURE 7.2 Overview of number of publications in DL-based multi-organ segmentation.

FIGURE 7.3 Diagram showing the basic architecture of auto-encoder.

Denoising autoencoders (DAEs) are another variant of the AE and are used to build better higher-level representations and extract useful features [24]. DAEs prevent the model from learning a trivial solution: the model is trained to reconstruct a clean input from a version corrupted by noise or another type of corruption [23]. A stacked denoising autoencoder (SDAE) is a deep network utilizing the power of DAEs [25].
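The denoising objective can be made concrete with a short NumPy sketch. The corruption scheme (random masking plus Gaussian noise), its parameters, and the function names are assumptions chosen for illustration, not the settings of the cited works. The key point is that the network sees the corrupted input but is scored against the clean one.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, drop_prob=0.3, noise_std=0.1):
    """Masking noise plus additive Gaussian noise (both levels are illustrative)."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep + rng.normal(scale=noise_std, size=x.shape)

def denoising_loss(x_clean, encode, decode):
    """Reconstruct from the corrupted input, but compare with the clean input."""
    x_noisy = corrupt(x_clean)
    x_hat = decode(encode(x_noisy))
    return np.mean((x_hat - x_clean) ** 2)

x = rng.normal(size=(64, 10))  # toy data standing in for image patches

def identity(v):
    return v

# An identity mapping simply copies the corruption through, so it is penalized.
identity_loss = denoising_loss(x, identity, identity)
```

Under this objective, copying the input to the output is no longer a zero-loss solution, so the model has to learn structure that lets it undo the corruption.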

One of the limitations of AEs is the relatively small number of hidden units, which results from their fully connected nature and graphics card memory limitations. This restricts the depth of the network and the information that can be learned from the input data. To overcome this limitation, other constraints, such as prior knowledge, can be imposed on the network to help it learn deep information. A sparsity constraint is the constraint typically used in a sparse AE. The aim of a sparse autoencoder is to make a large number of neurons have a low average output so that neurons are inactive most of the time [26]. Sparsity can be achieved by adding a sparsity penalty to the loss function during training or by manually zeroing all but a few of the strongest hidden unit activations.
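Two common ways of imposing sparsity on hidden activations can be sketched as follows: a Kullback-Leibler penalty that pushes the mean activation of each hidden unit toward a small target value, and a hard rule that keeps only the k strongest activations per sample. The target value, the function names, and the toy inputs are illustrative assumptions rather than settings from the cited works.

```python
import numpy as np

def kl_sparsity_penalty(hidden, target=0.05, eps=1e-8):
    """Sum over hidden units of KL(target || mean activation).

    hidden: (n_samples, n_units) activations in (0, 1), e.g. sigmoid outputs.
    """
    rho_hat = np.clip(hidden.mean(axis=0), eps, 1.0 - eps)
    return np.sum(
        target * np.log(target / rho_hat)
        + (1.0 - target) * np.log((1.0 - target) / (1.0 - rho_hat))
    )

def keep_top_k(hidden, k):
    """Zero every activation except the k largest in each row."""
    out = np.zeros_like(hidden)
    top = np.argpartition(hidden, -k, axis=1)[:, -k:]  # indices of the k largest
    rows = np.arange(hidden.shape[0])[:, None]
    out[rows, top] = hidden[rows, top]
    return out

dense_activations = np.full((4, 6), 0.5)    # units active half of the time
sparse_activations = np.full((4, 6), 0.05)  # units matching the target rate

penalty_dense = kl_sparsity_penalty(dense_activations)
penalty_sparse = kl_sparsity_penalty(sparse_activations)

example = np.array([[0.1, 0.9, 0.3, 0.7]])
top2 = keep_top_k(example, k=2)  # keeps 0.9 and 0.7, zeros the rest
```

The penalty would be added to the reconstruction loss with a weighting factor during training, while the top-k rule is applied directly to the hidden layer in the forward pass.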

SAEs require layer-wise pre-training, and training them can be time consuming because they are built with fully connected layers. Li et al. first investigated training convolutional autoencoders (CAEs) directly in an end-to-end manner without pre-training [27]. Guo et al. suggested that CAEs are beneficial for learning image features, preserving local structures, and avoiding distortion of the feature space [28]. Wang et al. proposed an automated chest screening method based on a hybrid model of transfer learning and CAE [29].

Overview of Works

Since abnormalities, e.g., abnormal tissue types and irregular organ shapes, are often present in medical images, it is challenging to obtain ground truth labels of multiple organs for supervised learning. However, organ segmentation in such abnormal datasets is meaningful in radiation therapy. Shin et al. applied an SAE method for organ detection in magnetic resonance imaging (MRI) [22]. Their method was used to detect the locations of the liver, heart, kidney, and spleen in MRI scans of the abdominal region containing liver or kidney metastatic tumors. Only weakly supervised training is required to learn visual and temporal hierarchical features that represent object classes from unlabeled multimodal dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) data. A probabilistic patch-based method was employed for multiple organ detection, with the features learned from the SAE model.

Accurate and automated segmentation of glioma from MRI is important for treatment planning and monitoring disease progression. Vaidhya et al. used an SDAE to address the challenge of the variable shape and texture of glioma tissue in MRI for this segmentation task [25]. 3D patches were extracted from multiple MRI sequences and were then fed into the SDAE model to obtain the glioma segmentation. During training, two SDAE models were trained, one on high-grade glioma (HGG) data and the other on a combination of HGG and low-grade glioma (LGG) data. During testing, the segmentation was obtained by combining the predictions from the two trained networks via maximum a posteriori (MAP) estimation. Similarly, Alex et al. applied an SDAE for brain lesion detection, segmentation, and false-positive reduction [23]. An SDAE was pre-trained using a large number of unlabeled patient volumes and fine-tuned with 2D patches drawn from a limited number of patients. LGG segmentation was achieved using a transfer learning approach in which an SDAE network pre-trained with HGG data was fine-tuned using LGG data.

TABLE 7.2 Overview of AE Methods

Weakly supervised | 3D patch
3D patch | Brain Gliomas
2D patch | Brain lesion
Transfer learning | 2D slice
Transfer learning | 2D slice | Chest X-rays
2D patch
2D patch
Hierarchical 3D neural networks | Head and neck

Ahmad et al. proposed a deep SAE (DSAE) for CT liver segmentation [30]. First, deep features were extracted from unlabeled data using the AE. Second, these features were fine-tuned to classify the liver among other abdominal organs.

To efficiently detect lung lesions and identify normal cases in chest X-rays (CXRs) during mass chest screening, Wang et al. proposed a convolutional SDAE (CSDAE) that determines which of three levels (i.e., normal, abnormal, or uncertain) each CXR belongs to [29].

Accurate vertebrae segmentation in the spine is essential for spine assessment, surgical planning, and clinical diagnosis and treatment. Qadri et al. proposed a stacked sparse autoencoder (SSAE) model for the segmentation of vertebrae from CT images [26]. High-level features were extracted in an unsupervised way by feeding 2D patches into the SSAE model. To improve the discriminability of these features, they were further refined through supervised fine-tuning. Similarly, Wang et al. proposed to localize and identify vertebrae by combining SSAE contextual features and a structured regression forest (SRF) [31]. Contextual features were extracted via the SSAE in an unsupervised way and were then fed into the SRF to achieve whole-spine localization (Table 7.2).
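Several of the works above feed 2D patches, rather than whole images, into the AE. A minimal sketch of sliding-window patch extraction is given below; the patch size, stride, and synthetic slice are hypothetical values chosen for illustration and are not taken from the cited papers.

```python
import numpy as np

def extract_patches_2d(image, patch, stride):
    """Return an array of shape (n_patches, patch * patch) of flattened square patches."""
    rows = range(0, image.shape[0] - patch + 1, stride)
    cols = range(0, image.shape[1] - patch + 1, stride)
    return np.stack([
        image[r:r + patch, c:c + patch].ravel()
        for r in rows
        for c in cols
    ])

# Synthetic 64x64 array standing in for a single CT slice.
ct_slice = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)

# Overlapping 16x16 patches with a stride of 8 voxels.
patches = extract_patches_2d(ct_slice, patch=16, stride=8)
```

Each row of `patches` can then serve as one input vector to a fully connected AE, which is why the patch size directly controls the width of the first layer.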


In contrast to previous machine learning approaches, whose performance depends on hand-crafted features, an AE can learn the important contextual features of a medical image, improving its contextual discrimination ability [31].

For the segmentation of the public BraTS 2013 and BraTS 2015 data [33], which are multi-modality brain MRI tumor segmentation datasets, an SDAE can provide good segmentation performance [23]. For segmenting the liver on CT images, DSAEs showed high classification accuracy and can speed up the clinical task [30].

In mass chest screening of CXRs for lung lesions, the CSDAE method achieved promising results, with precisions of 98.7% and 94.3% for the normal and abnormal cases, respectively [29]. These results indicate that the framework can classify the disease level with high accuracy. CSDAEs can potentially save radiologists time and effort, allowing them to focus on higher-risk CXRs.
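For reference, per-class precision is the fraction of cases predicted as a given class that truly belong to it, i.e., TP / (TP + FP). The short sketch below computes it on made-up toy labels; the labels are purely illustrative and are unrelated to the results reported in [29].

```python
def precision(y_true, y_pred, positive):
    """Precision for one class: true positives / all cases predicted as that class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else float("nan")

# Toy labels, invented for illustration only.
y_true = ["normal", "normal", "abnormal", "abnormal", "normal", "abnormal"]
y_pred = ["normal", "normal", "abnormal", "normal", "normal", "abnormal"]

precision_normal = precision(y_true, y_pred, "normal")      # 3 of 4 predictions correct
precision_abnormal = precision(y_true, y_pred, "abnormal")  # 2 of 2 predictions correct
```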

Validated on the public MICCAI CS2014 dataset, a challenging set of 98 spine CT scans, the SSAE method could effectively and automatically locate and identify spinal targets in CT scans, achieving higher localization accuracy while maintaining low model complexity and without making any assumptions about the field of view of the CT scans [26].

Although AEs have many benefits, they face some challenges and limitations in medical multi-organ segmentation. One limitation is related to data regularity. For example, for anatomical structures such as the lung, heart, and liver, even if the inter-subject variability of the dataset is high, the shape variety of the segmentation masks remains low. Unlike organs, which tend to have similar structures, irregular lesions and tumors with large shape variability are difficult for AEs to encode and remain challenging for unsupervised AE methods. Furthermore, the number of layers can be limited by the large computational complexity of the fully connected networks used in AE methods, compared with convolutional neural networks (CNNs), which use convolution kernels with shared learnable parameters.
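The effect of parameter sharing can be made concrete with a simple count. The sketch below compares one fully connected layer with one convolutional layer on a hypothetical 64x64 single-channel patch; the patch size, channel count, and kernel size are illustrative assumptions, and the two layers do not produce outputs of the same size, so the comparison only indicates the order of magnitude.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv2d_params(in_channels, out_channels, kernel):
    """Weights plus biases of a 2D convolution; independent of the image size."""
    return in_channels * out_channels * kernel * kernel + out_channels

side = 64                       # hypothetical 64x64 single-channel patch
n_voxels = side * side          # 4096 inputs when the patch is flattened

fc_layer = dense_params(n_voxels, n_voxels)   # 4096 * 4096 + 4096 parameters
conv_layer = conv2d_params(1, 32, 3)          # 1 * 32 * 3 * 3 + 32 parameters
```

The fully connected layer grows quadratically with the patch size, whereas the convolutional layer reuses the same small kernels at every position, which is why CNNs can be made much deeper within the same memory budget.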
