FCN Methods

Network Designs

As discussed previously, CNN methods feed a downsized input image or patch through convolutional layers and fully connected layers, and then output the predicted label. Shelhamer et al. first proposed a CNN whose last fully connected layer is replaced by a convolutional layer. Since all layers in this network are convolutional, it is named a fully convolutional network (FCN). Owing to the deconvolution kernels used to up-sample the feature maps, an FCN produces a dense voxel-wise prediction for the full-size volume instead of the patch-wise classification of a traditional CNN [70]. This is also called "end-to-end segmentation". With an FCN, segmentation of the whole image is achieved in a single forward pass. To achieve better localization, high-resolution activation maps are combined with the up-sampled outputs and then passed to convolution layers to assemble a more accurate output.
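The deconvolution (transposed convolution) up-sampling that distinguishes an FCN from a classification CNN can be illustrated in one dimension. The following NumPy sketch is illustrative only (the function name, kernel, and stride are assumptions, not taken from the cited works): each input value scatters a scaled copy of the kernel into a higher-resolution output.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Up-sample a 1D feature map by a transposed convolution:
    each input value scatters a scaled copy of the kernel into the output."""
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, v in enumerate(x):
        out[i * stride : i * stride + k] += v * kernel
    return out

# A stride-2 transposed convolution roughly doubles the resolution:
feat = np.array([1.0, 2.0, 3.0])
up = transposed_conv1d(feat, kernel=[0.5, 1.0, 0.5], stride=2)
```

In a real FCN the kernel weights are learned, so the network can do better than fixed interpolation when restoring spatial detail.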

One of the most well-known FCN structures using deconvolution for medical image segmentation is the U-net, initially proposed by Ronneberger et al. [68]. The U-net architecture builds upon the elegant FCN design with an encoding path and a decoding path. Besides increasing the depth of the network to 19 layers, the U-net introduced long skip connections from layers in the encoding path to the layers of equal resolution in the decoding path. These connections provide essential high-resolution features to the deconvolution layers. This design overcomes the trade-off between organ localization and use of context that arises in patch-based architectures: large patches require more pooling layers, which reduces localization accuracy, whereas small patches observe only a small context of the input.
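The long skip connection amounts to simple shape bookkeeping: pool, up-sample, then concatenate the encoder and decoder maps of equal resolution along the channel axis. This minimal NumPy sketch is an assumption-laden stand-in (nearest-neighbour up-sampling replaces the learned deconvolution, and the channel counts are arbitrary):

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling on a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour up-sampling, standing in for a deconvolution layer."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Long skip connection: concatenate the encoder map of equal resolution
# with the up-sampled decoder map along the channel axis.
enc = np.random.rand(8, 32, 32)     # encoder features at full resolution
bottleneck = max_pool2x2(enc)       # (8, 16, 16) after pooling
dec = upsample2x(bottleneck)        # back to (8, 32, 32)
merged = np.concatenate([enc, dec], axis=0)   # (16, 32, 32)
```

The concatenated tensor is what the subsequent decoder convolutions see, so fine spatial detail from the encoder survives the pooling bottleneck.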

Inspired by the U-net, Milletari et al. proposed an improved network called the V-net [71]. The V-net architecture is similar to the U-net: it also consists of an encoding (compression) path and a decoding path, with long skip connections between them. The first improvement of the V-net over the U-net is that, at each stage of the encoding and decoding paths, the V-net adds a residual block as a short skip connection between early and later convolutional layers; this aids convergence compared with a non-residual network such as the U-net. Second, the V-net replaces the max pooling operations with convolutional layers, giving the network a smaller memory footprint during training, since no switches mapping the outputs of pooling layers back to their inputs are needed for back-propagation. Third, in contrast to the binary cross entropy loss used in the original U-net, the V-net uses a Dice loss, so weighting samples of different organs to balance multi-organ and background voxels is not needed.
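The Dice loss follows directly from the overlap definition. A minimal NumPy version for binary segmentation is sketched below (the soft variant with an epsilon for numerical stability is a common convention, not necessarily the exact formulation in [71]):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|).
    pred holds foreground probabilities, target holds binary labels."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Toy mask: top half foreground.
mask = np.zeros((4, 4))
mask[:2] = 1.0
```

Because the loss normalizes by the total foreground volume, a small organ contributes as strongly as a large background, which is why no per-class weighting is needed.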

As another improvement on the U-net, Christ et al. proposed a new FCN, called a cascade FCN, that cascades two U-nets to improve segmentation accuracy [72]. The main idea of the cascade FCN is to stack a series of FCNs such that each model utilizes the contextual features extracted from the prediction map of the previous model. A simple design combines the FCNs in a cascade, where the first FCN segments the image into ROIs for the second FCN, which performs the organ segmentation. The advantage of this design is that separate sets of filters can be applied at each stage, so the quality of segmentation can be significantly improved.
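The two-stage idea can be sketched independently of any particular network: stage 1 produces a coarse mask, a padded bounding box is extracted from it, and stage 2 runs only on that crop. In this NumPy sketch the `stage1` and `stage2` callables are hypothetical placeholders for the two trained U-nets:

```python
import numpy as np

def roi_from_mask(mask, margin=2):
    """Bounding box of the coarse stage-1 mask, padded by a safety margin."""
    ys, xs = np.nonzero(mask)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, mask.shape[0])
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, mask.shape[1])
    return y0, y1, x0, x1

def cascade_segment(image, stage1, stage2):
    """Stage 1 localizes the organ; stage 2 segments only inside the ROI."""
    coarse = stage1(image)
    y0, y1, x0, x1 = roi_from_mask(coarse)
    fine = np.zeros_like(coarse)
    fine[y0:y1, x0:x1] = stage2(image[y0:y1, x0:x1])
    return fine
```

Restricting stage 2 to the ROI is what lets it spend its capacity on fine boundaries rather than on rejecting background.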

The main idea of deep supervision in deeply supervised FCN methods [10, 13] is to supervise the hidden layers directly and propagate this supervision to lower layers, instead of supervising only the output layer as in traditional FCNs. In this manner, supervision is extended to the deep layers of the network, which enhances the ability of the feature maps to discriminate the multiple classes in multi-organ segmentation tasks. In addition, attention gates have recently been used in FCNs to improve image classification and segmentation performance [73]. An attention gate learns to suppress irrelevant features and highlight salient features useful for a specific task.
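Deep supervision can be expressed as a weighted sum of losses over the intermediate decoder outputs, each compared against the ground truth down-sampled to that stage's resolution. The following NumPy sketch assumes binary cross entropy, strided down-sampling, and the stage weights shown, none of which are prescribed by [10, 13]:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross entropy averaged over voxels."""
    p = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

def downsample(mask, f):
    """Strided down-sampling of the ground truth to a coarser stage."""
    return mask[::f, ::f]

def deep_supervision_loss(stage_preds, target, weights=(0.25, 0.5, 1.0)):
    """Sum of losses over hidden decoder stages, each compared against the
    ground truth down-sampled to that stage's resolution."""
    total = 0.0
    for w, pred in zip(weights, stage_preds):
        f = target.shape[0] // pred.shape[0]
        total += w * bce(pred, downsample(target, f))
    return total
```

Because every stage receives its own gradient signal, early layers are pushed toward discriminative features instead of waiting for gradients to survive the full back-propagation path.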

Overview of Works

Zhou et al. proposed a 2.5D FCN segmentation method to automatically segment 19 organs in CT images of the whole body [74]. In that work, a 2.5D patch, consisting of several consecutive slices in the axial plane, was used as the multi-channel input to a 2D FCN. A separate FCN was trained for each 2D sectional view, resulting in three FCNs, and the segmentation results of the three directions were fused to generate the final output. The technique produced higher accuracy for big organs such as the liver (a Dice value of 0.937) but lower accuracy for small organs such as the pancreas (a Dice value of 0.553). In addition, by implementing the convolution kernels in a 3D manner, the FCN has also been used for multi-organ segmentation of 3D medical images [54].
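The two ingredients of such a 2.5D pipeline, slice stacking and multi-view fusion, are easy to make concrete. This NumPy sketch is a simplified assumption (one neighbour on each side, majority-vote fusion; [74] does not necessarily use these exact choices):

```python
import numpy as np

def patch_2_5d(volume, z, n=1):
    """Stack 2n+1 consecutive axial slices around slice z as input channels."""
    lo, hi = max(z - n, 0), min(z + n + 1, volume.shape[0])
    return volume[lo:hi]          # shape (channels, H, W)

def fuse_views(axial, sagittal, coronal):
    """Majority vote over the three per-view binary segmentations."""
    return ((axial + sagittal + coronal) >= 2).astype(np.uint8)
```

The stacked neighbouring slices give the 2D network some through-plane context at a fraction of the memory cost of a full 3D convolution.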

An FCN trained on whole 3D images suffers from a high class imbalance between foreground and background, which results in inaccurate segmentation of small organs. One possible solution is to apply two-step segmentation in a hierarchical manner, where the second stage uses the output of the first stage and focuses more on boundary regions. Christ et al. performed liver segmentation by cascading two FCNs, where the first FCN detects the liver location to estimate an ROI and the second FCN extracts features from that ROI to segment the liver lesions [72]. This system achieved a Dice of 0.823 for lesion segmentation in CT images and 0.85 in MRI images. Similarly, Wu et al. investigated cascaded FCNs to improve fetal boundary detection in ultrasound images [75]; their results showed better performance than other boundary refinement techniques for ultrasound fetal segmentation.

Transrectal ultrasound (TRUS) is a versatile and real-time imaging modality that is commonly used in image-guided prostate cancer interventions (e.g., biopsy and brachytherapy). Accurate segmentation of the prostate is key to biopsy needle placement, brachytherapy treatment planning, and motion management. However, TRUS image quality around the prostate base and apex is often degraded by low contrast and image noise. To address these challenges, Lei et al. proposed a deeply supervised V-net for accurate prostate segmentation [10]. To cope with the difficulty of training a DL-based model with limited data, a deep supervision strategy with a hybrid loss function (logistic and Dice loss) was introduced at different stages of the decoding path. To reduce segmentation errors at the prostate apex and base in TRUS images, a multi-directional contour refinement model was introduced to fuse the transverse, sagittal, and coronal plane-based segmentations.

Similarly, for MRI pelvic segmentation, the prostate is challenging to segment due to inhomogeneous intensity distributions and variation in prostate anatomy. Wang et al. proposed a 3D FCN with deep supervision and group dilated convolution to segment the prostate on MRI [13]. In this method, the deep supervision mechanism was introduced into the FCN to alleviate the common exploding or vanishing gradient problems in training deep models, forcing the updates of the hidden layer filters to favor highly discriminative features. A group dilated convolution, which aggregates multi-scale contextual information for dense prediction, was proposed to enlarge the effective receptive field. In addition, a combined loss (including cosine and cross entropy) was used to improve segmentation accuracy from the perspectives of both similarity and dissimilarity. Its architecture is shown in Figure 7.5 as an example of an FCN implementation.
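Why dilation enlarges the effective receptive field can be checked with a few lines of arithmetic: a kernel of size k with dilation d spans k + (k-1)(d-1) inputs. This small helper (a generic illustration, not the architecture of [13]) computes the receptive field of a stack of stride-1 convolutions:

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions, each given as
    (kernel, dilation)."""
    rf = 1
    for kernel, dilation in layers:
        k_eff = kernel + (kernel - 1) * (dilation - 1)   # dilated kernel extent
        rf += k_eff - 1
    return rf

# Three plain 3x3 convolutions see 7 voxels; dilations 1, 2, 4 see 15
# with exactly the same number of parameters.
plain = receptive_field([(3, 1)] * 3)
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])
```

This is the motivation for dilated designs: context grows exponentially with depth instead of linearly, without extra pooling that would cost localization accuracy.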

FIGURE 7.5 Schematic representation of an exemplary FCN architecture. Reprinted by permission from John Wiley and Sons: Medical Physics, "Deeply supervised 3D fully convolutional networks with group dilated convolution for automatic MRI prostate segmentation" by Wang et al. [13], copyright 2020.

Segmenting glands is essential in cancer diagnosis. However, accurate automated DL-based segmentation of glands is challenging because glandular morphology varies widely across tissues and pathological subtypes, and a large number of accurate gland annotations from several tissue slides is required. Binder et al. investigated cross-domain (organ type) approximation, which aims at reducing the need for organ-specific annotations [76]. Two proposed dense-U-nets are trained on hematoxylin- and eosin-stained colon adenocarcinoma samples, focusing on gland and stroma segmentation. Unlike the U-net, dense-U-nets use an asymmetric encoder and decoder. The encoder is designed to learn the spatial hierarchies of features automatically and adaptively, from low- to high-level patterns coded within the image. It applies transition layers (convolutions with stride 2) and dense convolution blocks consecutively to extract a compressed encoded feature representation. The dense convolution blocks from DenseNet [43] are used to strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. The decoder is composed of deconvolution layers and convolution blocks, and skip connections between the encoder and decoder allow for feature reuse and information flow. The architecture has two decoders: one predicts the relevant gland locations and the second predicts the gland contours. Thus, the decoders output a gland probability map and a contour probability map, and the network is supervised to jointly optimize the prediction of gland locations and gland contours (Table 7.4).
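A common way to use such a gland map and contour map at inference time is to subtract the predicted contour pixels from the gland mask so that touching glands remain separate instances. The sketch below is one plausible post-processing step under that assumption, not the exact procedure of [76]:

```python
import numpy as np

def split_touching_glands(gland_prob, contour_prob, t=0.5):
    """Subtract predicted contour pixels from the thresholded gland map so
    that touching gland instances stay separated."""
    return ((gland_prob > t) & (contour_prob < t)).astype(np.uint8)
```

Training the contour decoder explicitly is what makes this separation reliable; a single foreground/background decoder tends to merge adjacent glands.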

Discussion

A problem of the FCN is that the receptive field size is fixed, so if the object size changes, the FCN struggles to detect it. One solution is multi-scale networks, where resized versions of the input image are fed to the network; such multi-scale techniques can overcome the fixed receptive field size of an FCN [90]. However, sharing the parameters of the same network across resized images is not very effective, as objects of different scales require different parameters to process. Another solution, for images with a larger field of view than the receptive field, is to apply the FCN as a sliding window across the entire image [45].
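Sliding-window inference is straightforward to sketch: tile the image with overlapping windows, run the network on each tile, and average the overlapping predictions. The window and stride values below are arbitrary illustrative choices, and `model` is a hypothetical per-tile predictor:

```python
import numpy as np

def sliding_window_predict(image, model, win=64, stride=32):
    """Apply an FCN tile by tile and average overlapping predictions."""
    h, w = image.shape
    prob = np.zeros((h, w))
    count = np.zeros((h, w))
    for y in range(0, max(h - win, 0) + 1, stride):
        for x in range(0, max(w - win, 0) + 1, stride):
            prob[y:y + win, x:x + win] += model(image[y:y + win, x:x + win])
            count[y:y + win, x:x + win] += 1
    return prob / np.maximum(count, 1)
```

Overlapping the tiles (stride smaller than the window) smooths out the boundary artifacts that each tile would otherwise produce at its edges.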

As compared to a single FCN architecture, the advantage of cascade FCNs is that separate sets of filters can be applied at each stage, and therefore the quality of segmentation can be significantly increased. For example, Trullo et al. proposed two collaborative FCNs to jointly segment multiple organs in thoracic CT images, where one is used for organ localization and the other segments the organ within that ROI [105]. However, because an additional network and two or more steps are involved, the computation time of this kind of method is longer than that of a single FCN architecture. In addition, if the first network of a cascade FCN is used for organ localization, the performance of the method relies largely on the accuracy of that localization.

TABLE 7.4

| Ref. | Year | Network | Supervision | Dimension | Site | Modality |
|------|------|---------|-------------|-----------|------|----------|
| [68] | 2015 | U-net | Supervised | 2D slice | Neuronal structure | Electron microscopy |
| [72] | 2016 | Cascaded FCN | Supervised | 3D volume | Liver and lesion | CT |
| [77] | 2016 | 3D U-net | Supervised | 3D volume | Kidney | Xenopus |
| [78] | 2017 | Dilated FCN | Supervised | 2D slice | Abdomen | CT |
| [79] | 2017 | 3D FCN feature-driven regression forest | Supervised | 3D patch | Pancreas | CT |
| [74] | 2017 | 2D FCN | Supervised | 2.5D slices | Whole body | CT |
| [80] | 2018 | Foveal fully convolutional nets | Supervised | N/A* | Whole body | CT |
| [81] | 2018 | DRINet | Supervised | 2D slice | Brain, abdomen | CT |
| [82] | 2018 | 3D U-net | Supervised | 3D volume | Prostate | MRI |
| [83] | 2018 | Dense V-net | Supervised | 3D volume | Abdomen | CT |
| [84] | 2018 | NiftyNet | Supervised | 3D volume | Abdomen | CT |
| [85] | 2018 | PU-net, CU-net | Supervised | 2D slice | Pelvis | CT |
| [86] | 2018 | Dilated U-net | Supervised | 2D slice | Chest | CT |
| [87] | 2018 | 3D U-JAPA-Net | Supervised | 3D volume | Abdomen | CT |
| [88] | 2018 | U-net | Supervised | 2D slice | Pelvis | CT |
| [89] | 2018 | Cascade 3D FCN | Supervised | 3D patch | Abdomen | CT |
| [90] | 2018 | Multi-scale pyramid of 3D FCN | Supervised | 3D patch | Abdomen | CT |
| [91] | 2018 | Shape representation model constrained FCN | Supervised | 3D volume | Head and neck | CT |
| [92] | 2018 | Hierarchical dilated neural networks | Supervised | 2D slice | Pelvis | CT |
| [8] | 2018 | CNN with correction network | Supervised | 2D slice | Abdomen | MRI |
| [93] | 2019 | Dilated FCN | Supervised | 2D slice | Lung | CT |
| [76] | 2019 | Dense-U-net | Supervised | 2D slice | Head and neck | Stained colon adenocarcinoma dataset |
| [94] | 2019 | 2D and 3D FCNs | Supervised | 2D slice and 3D volume | Pulmonary nodule | CT |
| [95] | 2019 | Dedicated 3D FCN | Supervised | 3D patch | Thorax/abdomen | DECT |
| [96] | 2019 | 2D FCN (DeepLabV3+) | Transfer learning | 2D slice | Pelvis | MRI |
| [97] | 2019 | 2D FCN | Supervised | 2D patch | Pulmonary vessels | CT |
| [98] | 2019 | Dual U-net | Supervised | 2D slice | Glioma nuclei | Hematoxylin and eosin (H&E)-stained histopathological image |
| [99] | 2019 | Consecutive deep encoder-decoder network | Supervised | 2D slice | Skin lesion | CT |
| [100] | 2019 | U-net | Supervised | 2D slice | Lung | HRCT |
| [101] | 2019 | 3D U-net | Supervised | 3D volume | Chest | CT |
| [7] | 2019 | 3D U-net with multi-atlas | Supervised | 3D volume | Brain tumor | Dual-energy CT |
| [102] | 2019 | Triple-branch FCN | Supervised | N/A | Abdomen/torso | CT |
| [10] | 2019 | 2.5D deeply supervised V-net | Supervised | 2.5D patch | Prostate | Ultrasound |
| [13] | 2019 | Group dilated deeply supervised FCN | Supervised | 3D volume | Prostate | MRI |
| [11] | 2019 | 3D FCN | Supervised | 3D volume | Arteriovenous malformations | Contrast-enhanced CT |
| [14] | 2019 | 3D FCN | Supervised | 3D volume | Left ventricle | SPECT |
| [15] | 2019 | DeepMAD | Supervised | 2.5D patch | Vessel wall | MRI |
| [103] | 2019 | 3D U-net | Supervised | 3D volume | Head and neck | CT |
| [104] | 2019 | OBELISK-Net (sparse deformable convolution) | Supervised | 3D volume | Abdomen | CT |

*N/A: not available, i.e., not explicitly indicated in the publication.

Although FCN-based prostate segmentation [10, 13] offers good performance, this kind of method still has limitations. First, due to the three or more stages of deep supervision and the corresponding up-sampling convolutional kernels involved, its computational complexity is higher than that of the U-net and V-net methods. In addition, when a 2.5D patch is used as input, the segmented contours from the three directions may not match well; introducing an adaptive, non-linear contour refinement model, such as a conditional random field, would be future work for applying this kind of method to multi-organ segmentation. Second, FCNs use a voxel-wise loss, such as cross entropy, for segmentation, so spatial consistency is not guaranteed in the final segmentation map. Recently, conditional random field and graph cut methods have been incorporated into FCN-based workflows as segmentation refinement that exploits spatial correlation. The limitation of these refinements is that they consider only pair-wise potentials, which can allow boundary leakage in low-contrast regions.

 