GAN Methods

Network Designs

GANs have gained considerable attention in medical imaging due to their capability for data generation without explicitly modeling the probability density function. The adversarial loss brought by the discriminator provides a way of incorporating unlabeled samples into training and imposing higher-order consistency. This has proven useful in many cases, such as image reconstruction [106], image enhancement [107, 108], segmentation [12, 109], classification and detection [110], augmentation [111], and cross-modality synthesis [112].

A typical GAN consists of two competing networks: a generator and a discriminator [113]. The generator is trained to generate artificial data that approximate the target data distribution from a low-dimensional latent space. The discriminator is trained to distinguish the artificial data from actual data. The workflow is shown in Figure 7.6 [12]. The discriminator encourages the generator to predict realistic data by penalizing unrealistic predictions; the discriminative loss can therefore be considered a dynamic, network-based loss term. The two networks compete with each other in a zero-sum game.


FIGURE 7.6 The process of generative adversarial network. Reprinted by permission from John Wiley and Sons: Medical Physics, “Automatic multiorgan segmentation in thorax CT images using U-net-GAN” by Dong et al. [12], copyright 2020.

The multiple variants of GAN can be grouped into three categories: (1) variants of the discriminator's objective, (2) variants of the generator's objective, and (3) variants of the architecture, as summarized in Yi et al. [114].
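To make the adversarial game concrete, the sketch below shows one training iteration in PyTorch. It is a minimal illustration of the original zero-sum formulation, not any specific variant; the generator G, discriminator D, and their optimizers are placeholders.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z):
    # Discriminator update: push D(real) toward 1 and D(G(z)) toward 0.
    fake = G(z).detach()                  # detach so this step only updates D
    logits_real, logits_fake = D(real), D(fake)
    d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator update: push D(G(z)) toward 1, i.e. fool the discriminator.
    logits_fake = D(G(z))
    g_loss = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```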

Overview of Works

As discussed in the section on FCN methods, one challenge of medical image segmentation with traditional FCN methods is that they may introduce boundary leakage in low-contrast regions. An adversarial loss, introduced via a discriminator, can take high-order structures into account and thus potentially solve this problem [115]. For medical image segmentation tasks, the adversarial loss can be regarded as a learned similarity measure between the segmented contours and the annotated ground truth (manual contours). Instead of only measuring similarity in the voxel domain (as with Dice loss or cross-entropy loss), the additional discriminator maps the segmented and ground truth contours to a low-dimensional feature space that represents shape information, and then uses a logistic loss to measure the similarity of the feature vectors of segmented and manual contours. The idea is similar to the perceptual loss; the difference is that the perceptual loss is computed from a classification network pre-trained on natural images, whereas the adversarial loss is computed from a network that is trained adaptively as the generator evolves.
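As an illustration of this combined objective, the following hedged PyTorch sketch adds an adversarial term to a voxel-wise Dice loss; seg_net, disc, and the weighting lam are hypothetical names for illustration, not taken from any of the cited works.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice over the voxel domain (pred in [0, 1] after sigmoid).
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def segmenter_loss(seg_net, disc, image, gt_mask, lam=0.1):
    """Voxel-wise Dice plus an adversarial 'shape' term: the discriminator
    maps contours to a feature space and scores how realistic they look."""
    pred_mask = torch.sigmoid(seg_net(image))
    voxel_term = dice_loss(pred_mask, gt_mask)
    # Logistic loss that is low when the discriminator judges the
    # predicted contours to resemble manual contours.
    logits = disc(pred_mask)
    adv_term = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return voxel_term + lam * adv_term
```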

Dai et al. proposed the structure correcting adversarial network (SCAN) to segment the lung fields and the heart in CXR images [109]. SCAN uses an FCN as the generator to produce binary masks of the segmented organs and incorporates a critic network (discriminator) to capture the structural regularities emerging from human physiology. During training, the critic network learns to discriminate the ground truth organ annotations from the masks synthesized by the segmentation network. Through this adversarial process, the critic network learns the higher-order structures and guides the segmentation model toward realistic segmentation outcomes.

In medical image multi-organ segmentation, a major limitation of traditional DL-based segmentation methods is their requirement for a large number of paired training images with ground truth contours as learning targets. To address this challenge, Mondal et al. proposed a GAN-based method for 3D multi-modal brain MRI segmentation from a few-shot learning perspective [116]. The main idea of this work is to leverage the recent success of GANs to train a DL-based model with a highly limited training set of labeled images, without sacrificing the performance of full supervision. The proposed adversarial network encourages the segmentation to have a similar distribution of outputs for images with and without annotations, thereby helping generalization. In addition, the few-shot learning method seeks good generalization on problems with a limited labeled dataset, typically containing just a few training samples of the target classes.

Dong et al. proposed an adversarial network to train deep neural networks for the segmentation of multiple organs on thoracic CT images [12]. The proposed design, called U-net-generative-adversarial-network (U-net-GAN), jointly trains a set of U-nets as generators and FCNs as discriminators, forming a conditional GAN. Specifically, the generator, composed of a U-net, produces a segmentation map of multiple organs through an end-to-end mapping learned from the CT image and its labeled organs. The discriminator, structured as an FCN, discriminates between the ground truth and the segmented organs produced by the generator. The generator and discriminator compete against each other in an adversarial learning process to produce the optimal segmentation map of multiple organs (Table 7.5).
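A conditional discriminator of this kind can be sketched as a small FCN that scores (image, mask) pairs; the architecture below is purely illustrative (layer sizes are arbitrary) and is not the exact discriminator used in U-net-GAN.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Illustrative FCN discriminator: maps an (image, mask) pair to a grid
    of real/fake logits rather than a single scalar score."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(64, 1, 3, padding=1),   # per-patch logits
        )

    def forward(self, image, mask):
        # Conditioning: the discriminator sees the CT together with the mask,
        # concatenated channel-wise, so it judges the pair, not the mask alone.
        return self.net(torch.cat([image, mask], dim=1))
```

For a one-channel CT and n organ channels, such a discriminator would be built as PatchDiscriminator(in_ch=1 + n), with the ground truth and generated masks supplying the "real" and "fake" pairs.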

Discussion

In segmentation tasks, a GAN is efficient at the prediction stage, since it only needs to perform a forward pass through the generator network; the discriminator network is used only during training. Using the adversarial loss as a shape regulator is most beneficial when the learning target (organ) has a regular shape, e.g. the lungs and heart, but is less useful for irregular objects, such as vessels and catheters.

TABLE 7.5

Overview of GAN Methods

Ref. | Year | Network | Supervision | Dimension | Site | Modality
[109] | 2015 | SCAN | Supervised | 2D slice | Chest | X-rays
[117] | 2017 | Multi-connected adversarial networks | Unsupervised | 2D slice | Brain | Multi-modality MRI
[118] | 2017 | Dilated GAN | Supervised | 2D slice | Brain | MRI
[119] | 2017 | Conditional GAN | Supervised | 2D slice | Brain tumor | MRI
[120] | 2017 | GAN | Supervised | 2D patch | Retinal vessel | Fundoscopic
[115] | 2017 | Adversarial image-to-image network | Supervised | 3D volume | Liver | CT
[121] | 2017 | Adversarial FCN-CRF nets | Supervised | 2D slice | Mass | Mammograms
[122] | 2018 | GAN | Supervised | N/A* | Brain tumor | MRI
[116] | 2018 | Few-shot GAN | Semi-supervised | 3D patch | Brain | MRI
[123] | 2018 | Context-aware GAN | Supervised | 2D cropped slices | Cardiac | MRI
[124] | 2018 | Conditional generative refinement adversarial networks | Supervised | 2D slice | Brain | MRI
[125] | 2018 | SegAN | Supervised | 2D slice | Brain | MRI
[126] | 2018 | MDAL | Supervised | 2D slice | Left and right ventricles | Cardiac MRI
[127] | 2018 | TD-GAN | Unsupervised | 2D slice | Whole body | X-ray
[12] | 2019 | U-net-GAN | Supervised | 3D volume | Thorax | CT
[128] | 2019 | Conditional GAN | Supervised | 2D slice | Nuclei | Histopathology images
[129] | 2019 | Distance-aware GAN | Supervised | 2D slice | Chest | CT

*N/A: not available, i.e. not explicitly indicated in the publication

In Dong et al. [12], the GAN was applied to delineate the left and right lungs, spinal cord, esophagus, and heart using 35 patients' chest CTs. The average DSCs for these five OARs were 0.97, 0.97, 0.90, 0.75, and 0.87, respectively. The mean surface distance of the five OARs obtained with the GAN method ranged between 0.4 and 1.5 mm on average among all 35 patients. The mean dose differences on the 20 SBRT lung plans using the segmented results ranged from -0.001 to 0.155 Gy for the five OARs. This demonstrates that the GAN is a potentially valuable method for improving the efficiency of lung radiation therapy treatment planning.

Patient movement, such as translation and rotation, does not change the relative positions among organs. Including transformed data can help avoid overfitting and help the segmentation algorithm learn this invariance. However, for multi-organ segmentation, the size and shape differences among organs and the variation among patients make it difficult to balance the loss function among different organs; integrating all the segmentations into one network complicates the training process and reduces segmentation accuracy. To simplify the method, the GAN method of Dong et al. [12] grouped OARs of similar dimensions and used three subnetworks for segmentation: one for the lungs and heart, and the other two for the esophagus and spinal cord, respectively. This approach improves segmentation accuracy, but at the cost of additional computation time for both training and prediction, which could become an issue if the method were required to segment more OARs simultaneously. Simultaneously determining the location of organs and segmenting the organs within that location for multi-organ segmentation would be a future direction for research.

R-CNN Methods

Network Designs

In medical image multi-organ segmentation, as discussed above, simultaneously segmenting multiple organs is challenging, because it requires the correct detection of all organs in an image volume as well as the accurate segmentation of the organs within each detection. This is similar to the classical computer vision task of instance segmentation, which comprises two subtasks: object detection, whose goal is to classify individual objects and localize each with a bounding box (an ROI in medical imaging), and semantic segmentation, whose goal is to classify each pixel into a fixed set of categories without differentiating object instances. Recently, the development of the region-CNN (R-CNN) family introduced a simple and flexible way to address this challenge.

An R-CNN is a network based on ROIs [130]. To bypass the problem of evaluating a very large number of candidate regions, the R-CNN uses selective search [131] to extract about 2000 candidate regions, called region proposals, from each image. After being warped to the same size, these region proposals are fed into a CNN, which acts as a feature extractor and outputs a 4096-dimensional feature vector per proposal. The extracted features are fed into a support vector machine (SVM) to classify the presence of the object within that region proposal. In addition to predicting the presence of an object within the region proposals, the algorithm also predicts four values (in the 2D version) that are offsets used to refine the bounding box.
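The pipeline can be summarized in a few lines of Python. The sketch below assumes the proposals have already been produced by selective search; cnn and svms (a mapping from class name to a fitted per-class SVM, e.g. from scikit-learn) are placeholders.

```python
import numpy as np
import torch
import torch.nn.functional as F

def rcnn_inference(image, proposals, cnn, svms, size=224):
    """R-CNN pipeline sketch: warp each region proposal to a fixed size,
    extract a feature vector with a CNN, then score it with per-class SVMs.
    `image` is an NCHW tensor; `proposals` are integer (x1, y1, x2, y2) boxes."""
    feats = []
    for (x1, y1, x2, y2) in proposals:
        crop = image[:, :, y1:y2, x1:x2]                      # crop one proposal
        warped = F.interpolate(crop, size=(size, size),
                               mode="bilinear", align_corners=False)
        feats.append(cnn(warped).flatten().detach().numpy())  # e.g. 4096-d vector
    feats = np.stack(feats)
    # One binary SVM per object class scores every proposal.
    return {cls: svm.decision_function(feats) for cls, svm in svms.items()}
```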

An R-CNN needs a long computation time to train, with 2000 region proposals per 2D image slice. To address this issue, Girshick proposed a faster object detection algorithm called the Fast R-CNN [132]. Compared to the R-CNN, instead of cropping the region proposals and feeding each one into a CNN, the original image is first fed into an FCN to obtain a convolutional feature map, from which the region proposals are identified; the proposals are then warped into squares and reshaped into a fixed size using an ROI pooling layer. A fully connected layer projects each region proposal to an ROI feature vector. Finally, a softmax layer is used to predict the class of that region proposal


FIGURE 7.7 The architecture of Faster R-CNN. Reprinted by permission from Elsevier: Computerized Medical Imaging and Graphics, “Fast and fully-automated detection and segmentation of pulmonary nodules in thoracic CT scans using deep convolutional neural networks” by Huang et al. [134], copyright 2020.

and also the offset values for the bounding box. The reason that the Fast R-CNN is faster than the R-CNN is that it does not need to feed 2000 region proposals into the CNN for each input image. Instead, the convolution operation is done only once per image, and a feature map is generated from it.
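The sketch below illustrates this sharing with torchvision's roi_pool operator, assuming a backbone with stride 16; the feature map and proposal boxes are dummy tensors chosen for illustration.

```python
import torch
from torchvision.ops import roi_pool

# Fast R-CNN idea: run the backbone once per image, then pool every region
# proposal from the shared feature map instead of re-running the CNN per crop.
feature_map = torch.randn(1, 256, 50, 50)   # backbone output for an 800x800 image
proposals = torch.tensor([[0, 10.0, 10.0, 200.0, 300.0],    # (batch_idx, x1, y1, x2, y2)
                          [0, 120.0, 40.0, 380.0, 420.0]])  # in image coordinates
# spatial_scale maps image coordinates onto the feature map (1/16 for stride 16).
rois = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(rois.shape)   # torch.Size([2, 256, 7, 7]) -- fixed size, ready for FC layers
```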

Both the R-CNN and the Fast R-CNN use selective search to identify the region proposals, but selective search is time-consuming. To solve this problem, Ren et al. proposed an object detection algorithm, the Faster R-CNN, that eliminates selective search and lets the network learn the region proposals [133]. Similar to the Fast R-CNN, the Faster R-CNN first feeds the image into an FCN to extract a convolutional feature map. Instead of applying selective search to the feature map to identify the region proposals, a separate network is used to predict them. The predicted region proposals are then reshaped using an ROI pooling layer, which is then used to classify the image within each proposed region and to predict the offset values for the bounding boxes. The architecture of the Faster R-CNN is illustrated in Figure 7.7 [134].
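For reference, recent torchvision releases ship this architecture ready-made; the sketch below (class count and image size are arbitrary assumptions) shows that detection is a single forward pass once the learned RPN replaces selective search.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Off-the-shelf Faster R-CNN: the built-in RPN produces region proposals,
# so inference is one forward pass per image.
model = fasterrcnn_resnet50_fpn(weights=None, num_classes=2)  # background + 1 class
model.eval()
images = [torch.rand(3, 512, 512)]       # a list of CHW images
with torch.no_grad():
    outputs = model(images)
# Each output dict holds 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
print(outputs[0]["boxes"].shape, outputs[0]["scores"].shape)
```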

Building on the groundwork of feature extraction and region proposal identification laid by the Faster R-CNN, performing image segmentation within the detected bounding box (ROI) is easy to achieve. After the ROI pooling layer of the Faster R-CNN, He et al. integrated two more convolution layers to perform semantic segmentation within the ROI, yielding the Mask R-CNN [135]. Another major contribution of the Mask R-CNN is the refinement of ROI pooling. In the earlier Faster R-CNN, Fast R-CNN, and R-CNN methods, the ROI warping is quantized: the cell boundaries of the target feature map are forced to realign with the boundaries of the input feature map, so the target cells may not all be the same size. The Mask R-CNN instead uses ROIAlign, which does not quantize the cell boundaries, keeps every target cell the same size, and applies interpolation to better calculate the feature map values within each cell.
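The difference is easy to see with torchvision's two pooling operators; the box coordinates below are arbitrary fractional values chosen to make the quantization visible.

```python
import torch
from torchvision.ops import roi_align, roi_pool

feature_map = torch.randn(1, 256, 50, 50)
boxes = torch.tensor([[0, 13.7, 9.2, 150.3, 221.8]])   # fractional image coords

# RoIPool snaps box and bin edges to integer feature-map cells (quantization),
# so bins can cover unequal areas and sub-cell offsets are lost.
pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1 / 16)

# RoIAlign keeps the fractional edges, divides the box into equal-size bins,
# and bilinearly interpolates feature values at sample points in each bin.
aligned = roi_align(feature_map, boxes, output_size=(7, 7),
                    spatial_scale=1 / 16, sampling_ratio=2)
print(pooled.shape, aligned.shape)   # both torch.Size([1, 256, 7, 7])
```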

Overview of Works

To address the problems of low-quality CT images, the lack of annotated data, and the complex shapes of lung nodules, Liu et al. applied a 2D Mask R-CNN to pulmonary nodule segmentation in a transfer learning manner [136]. The Mask R-CNN was trained on the Common Objects in Context (COCO) dataset, a natural image dataset, and was then fine-tuned to segment pulmonary nodules. As an improvement, Kopelowitz and Engelhard applied a 3D Mask R-CNN to 3D CT image volumes to detect and segment lung nodules [137].
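A common way to realize this kind of transfer learning with a COCO-pretrained Mask R-CNN is the torchvision head-swap pattern sketched below; it requires a recent torchvision, and the two-class setup (background plus nodule) is an assumption for illustration, not the cited authors' exact configuration.

```python
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

# Start from COCO weights, then replace the box and mask heads so the model
# predicts two classes (background + nodule) before fine-tuning.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")    # COCO-pretrained
num_classes = 2
in_feat = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_classes)
in_feat_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feat_mask, 256, num_classes)
# The model can now be fine-tuned on nodule slices with a standard training loop.
```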

Xu et al. proposed a novel heart segmentation pipeline that combines the Faster R-CNN and the U-net, abbreviated as CFUN [138]. Combining the Faster R-CNN's precise localization ability with the U-net's powerful segmentation ability, CFUN needs only one-step detection and segmentation inference to obtain the whole-heart segmentation result, achieving good results at a significantly reduced computational cost. Furthermore, CFUN adopts a new loss function based on edge information, the 3D Edge-loss, as an auxiliary loss to accelerate the convergence of training and improve the segmentation results. Extensive experiments on a public dataset show that CFUN exhibits competitive segmentation performance with sharply reduced inference time. Similarly, Bouget et al. proposed a combination of a Mask R-CNN and a U-net for the segmentation and detection of mediastinal lymph nodes and anatomical structures in CT data for lung cancer staging [139].

Li et al. proposed a lung nodule detection method based on the Faster R-CNN for thoracic MRI in a transfer learning manner [140]. A false positive (FP) reduction scheme based on anatomical characteristics was designed to reduce FPs and preserve true nodules. Similarly, the Faster R-CNN has also been used for pulmonary nodule detection on CT images [134].

Xu et al. proposed an efficient detection method for multi-organ localization in CT images using a 3D region proposal network (RPN) [141]. Since the proposed RPN is implemented in a 3D manner, it can take advantage of the spatial context information in a CT volume. AlexNet was used to build a backbone network architecture that generates high-resolution feature maps to further improve the localization performance for small organs. The method was evaluated on abdomen and brain datasets and achieved high detection precision and localization accuracy with fast inference speed (Table 7.6).

TABLE 7.6

Overview of R-CNN Methods

Ref. | Year | Network | Supervision | Dimension | Site | Modality
[136] | 2018 | Mask R-CNN | Transfer learning | 2D slice | Lung nodule | CT
[138] | 2018 | Combination of Faster R-CNN and U-net (CFUN) | Supervised | 3D volume | Cardiac | CT
[139] | 2019 | Combination of U-net and Mask R-CNN | Supervised | 2D slice | Chest | CT
[134] | 2019 | Faster R-CNN | Supervised | 2D slice | Thorax/pulmonary nodule | CT
[137] | 2019 | 3D Mask R-CNN | Supervised | 3D volume | Lung nodule | CT
[140] | 2019 | 3D Faster R-CNN | Supervised | 3D volume | Thorax/lung nodule | MRI
[142] | 2019 | Mask R-CNN | Supervised | N/A* | Chest | X-ray
[141] | 2019 | 3D RPN | Supervised | 3D volume | Whole body | CT
[143] | 2019 | Multiscale Mask R-CNN | Supervised | 2D slice | Lung tumor | PET

*N/A: not available, i.e. not explicitly indicated in the publication

Discussion

In the work of Liu et al. [136], researchers used a Mask R-CNN to segment lung nodules for the first time. After a series of comparative experiments, ResNet101 and a feature pyramid network (FPN) were selected as the backbone of the Mask R-CNN. Experimental results showed that the method not only identified the locations of nodules but also provided nodule contour information, yielding more detailed information for cancer treatment. The proposed method was validated on the Lung Image Database Consortium - Image Database Resource Initiative (LIDC-IDRI) dataset and achieved the desired accuracy. However, due to the 2D network design, some spatial information in the CT image is lost. 3D context plays an important role in recognizing nodules; a 3D Mask R-CNN would be expected to perform better than the 2D version, as it also captures cranio-caudal information.

Limitations still exist for the detection of lung nodules using the Faster R-CNN. First, small and low-contrast nodules are not successfully detected by the Faster R-CNN. This challenge may also arise in other multi-organ segmentation problems involving small organs, such as the esophagus in thoracic segmentation. Second, the researchers found that some air artifacts and juxta-cardiac tissues may be falsely detected as nodules. To alleviate these problems, Li et al. designed a filter to improve the image quality and remove these artifacts [140]. In addition, a multi-scale strategy was introduced into the whole detection system to increase the detection rate of small and low-contrast nodules.

The R-CNN family of methods can be an efficient tool for several multi-organ segmentation and detection tasks. However, technical adjustments and optimizations may be required for the extended model to achieve performance comparable to methods dedicated to organ segmentation. Due to the higher data dimensionality and larger number of weight parameters, training a 3D R-CNN-based model is more time-consuming than training a 2D version; however, it may have significant advantages, such as higher localization accuracy and higher prediction speed. To speed up the training procedure, one potential solution is to apply batch normalization after each convolutional layer in the backbone network to improve model convergence, and to conduct most calculations on a graphics processing unit with parallel computing [141].
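Such a block is straightforward to write; the sketch below shows the conv-BN-ReLU pattern for a 3D backbone (layer sizes are arbitrary assumptions for illustration).

```python
import torch.nn as nn

def conv_bn_relu_3d(in_ch, out_ch):
    """Backbone building block sketch: batch normalization after each
    convolution, as suggested for faster convergence of 3D R-CNN training."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm3d(out_ch),   # normalizes activations; conv bias folded into BN
        nn.ReLU(inplace=True),
    )
```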

Hybrid Methods

Network Designs

Recently, some methods have used hybrid designs to address the challenge of poor image quality, such as low contrast around organ boundaries. A hybrid design involves two or more networks serving different functional purposes: for example, one network enhances the image quality, while the other segments the OARs from the enhanced image. This has the potential to become a new trend in multi-organ segmentation.

Overview of Works

Accurate segmentation of pelvic OARs on CT images for treatment planning is challenging due to the poor soft-tissue contrast [144, 145]. MRI has been used to aid prostate delineation, but its accuracy is limited by MRI-CT registration errors [146, 147]. Lei et al. developed a deep attention-based segmentation strategy based on CT-derived synthetic MRI (sMRI) images created using a CycleGAN [112]. This was done to address the segmentation of low-contrast soft-tissue organs (such as the bladder, prostate, and rectum) without an MRI acquisition [5]. This hybrid method includes two main steps: first, a CycleGAN is used to estimate sMRI from CT images; second, a deep attention FCN is trained based on the sMRI and manual contours deformed from MRIs. Attention models were introduced to pay more attention to the prostate boundary. Inspired by this method, Dong et al. developed an sMRI-aided segmentation method for male pelvic CT multi-organ segmentation [4]. Similarly, Lei et al. introduced this kind of method for the multi-organ segmentation of cone-beam computed tomography (CBCT) pelvic data in a CBCT-guided adaptive radiation therapy workflow [3].
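At inference time, such a hybrid reduces to a two-stage pipeline, sketched below with hypothetical names (synth_net for the CycleGAN generator, seg_net for the attention FCN); this is a schematic of the idea, not the authors' implementation.

```python
import torch

def smri_aided_segmentation(ct, synth_net, seg_net):
    """Two-stage hybrid inference sketch:
    1) a CycleGAN generator maps CT to synthetic MRI with better soft-tissue
       contrast; 2) an FCN segments the organs from the sMRI."""
    with torch.no_grad():
        smri = synth_net(ct)        # stage 1: CT -> synthetic MRI
        logits = seg_net(smri)      # stage 2: sMRI -> per-organ scores
    return logits.argmax(dim=1)     # per-voxel organ labels
```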

TABLE 7.7

Overview of Hybrid Methods

Ref. | Year | Network | Supervision | Dimension | Site | Modality
[4, 5] | 2019 | Synthetic MRI-aided FCN | Supervised | 2.5D patch | Pelvic | CT
[3] | 2019 | Synthetic MRI-aided deep attention FCN | Supervised | 3D volume | Pelvic | CBCT
[148] | 2019 | Deep multi-planar co-training (DMPCT) | Co-training | 3D volume | Abdomen | CT

In the multi-organ segmentation of abdominal CT scans, as discussed earlier, supervised DL-based algorithms require a large number of voxel-wise annotations, which are usually difficult, expensive, and slow to obtain, whereas massive unlabeled 3D CT volumes are usually easily accessible. Zhou et al. proposed deep multi-planar co-training (DMPCT), a semi-supervised learning method, to solve this problem [148]. The DMPCT architecture involves three steps: (1) a DL-based network, called the "teacher model" in their work, is learned in a co-training manner to mine consensus information from 2D patches extracted from multiple planes; (2) the trained teacher model is used to assign pseudo labels to the unlabeled data, with multi-planar fusion applied to generate more reliable labels, alleviating the errors occurring in the pseudo labeling and thus helping to train better segmentation networks; (3) an additional network, called the "student model", is trained on the union of the manually labeled data and the automatically labeled data (the self-labeled samples) to enlarge the variation of the training data (Table 7.7).
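The pseudo-labeling step can be sketched as follows; the code is a loose schematic of the idea (fusion by simple averaging, with teachers assumed to return volume-space probabilities), not the authors' exact fusion scheme.

```python
import torch

def pseudo_label_step(teachers, volume):
    """DMPCT-style sketch: 'teacher' networks trained on different 2D planes
    each predict the full volume; multi-planar fusion (simple averaging here)
    yields more reliable pseudo labels for training the student model."""
    probs = []
    with torch.no_grad():
        for teacher in teachers:
            # Each teacher is assumed to slice the volume along its own plane,
            # predict per-slice, and re-stack predictions into volume space.
            probs.append(torch.softmax(teacher(volume), dim=1))
    fused = torch.stack(probs).mean(dim=0)   # consensus across planes
    return fused.argmax(dim=1)               # pseudo labels for unlabeled data
```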

Discussion

Compared to CT and CBCT images, the superior soft-tissue contrast of sMRI improves the prostate segmentation accuracy and alleviates the overestimation of prostate volume that occurs when using CT images alone. However, in sMRI-aided segmentation methods, the registration between the training MRI and the CT or CBCT affects the sMRI image quality and thus the segmentation network performance; the registration error therefore ultimately affects the delineation accuracy. This kind of method thus relies on accurate deformable image registration.

Hybrid methods can be practical for clinical applications, since they do not need a large number of multi-modality training samples to provide comprehensive information, nor a large number of manually delineated contours as learning targets (annotating multiple organs in 3D volumes requires massive labor from radiologists). Testing the performance of hybrid methods in segmenting multiple complex anatomical structures, such as in the 2017 AAPM Thoracic Auto-segmentation Challenge, will be a future research direction.

 