CNN Methods

Network Designs

CNNs derive their name from the type of hidden layers they contain. The hidden layers of a CNN typically consist of convolutional layers, max pooling layers, batch normalization layers, and dropout layers. Fully connected layers are normally used at later stages, if used at all. The last layer of a CNN is typically a sigmoid or softmax layer for classification/segmentation and a tanh layer for regression. Figure 7.4 shows a typical CNN architecture [34].

Convolutional layers are the core of CNNs and are used for feature extraction [2]. A convolutional layer extracts different feature maps depending on its learned convolution kernels. The pooling layer performs a down-sampling operation, replacing a defined neighborhood with its maximum or average value to reduce the spatial size of each feature map. The rectified linear unit (ReLU) and its modifications, such as the leaky ReLU, are among the most commonly used activation functions; a ReLU clips negative input values to zero while passing positive values unchanged [35]. A fully connected layer connects every neuron to all activations in the previous layer. Fully connected layers are placed before the classification output of a CNN, where the flattened feature maps are fed to linear classifiers. Through one or more fully connected layers, the feature maps extracted by the convolutional layers are converted to a probability-like representation used to classify the medical image, image patch, or voxel.
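To make these building blocks concrete, the following is a minimal sketch (in PyTorch) of a generic patch-classification CNN assembled from the layers described above. The channel counts, kernel sizes, patch size, and number of classes are illustrative assumptions, not the configuration of any published model.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Illustrative patch classifier: conv -> batch norm -> ReLU -> max pool, then fully connected layers."""
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # feature extraction
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),           # clips negatives to zero
            nn.MaxPool2d(2),                 # down-sampling by neighborhood maximum
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # flatten feature maps before the fully connected layers
            nn.Linear(32 * 8 * 8, 64),       # assumes 32 x 32 input patches
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(64, num_classes),      # class scores; softmax turns them into probabilities
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: probability-like outputs for a batch of 32 x 32 single-channel patches.
probs = torch.softmax(SimpleCNN()(torch.randn(4, 1, 32, 32)), dim=1)
```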


FIGURE 7.4 An exemplary diagram of CNN architecture. Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Nature, DeepOrgan: Multi-level Deep Convolutional Networks for Automated Pancreas Segmentation by Roth et al. [34], copyright 2020.

During training of a CNN, the model predicts class scores for the training images, computes the loss using the selected loss function, and updates the weights by gradient descent via back-propagation. Cross-entropy loss is one of the most widely used loss functions, and stochastic gradient descent (SGD) and Adam are the most popular optimization methods.
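The training procedure described above can be sketched as a short loop. The model (the SimpleCNN sketch above), the dummy data, and the learning rates are assumptions for illustration only.

```python
import torch
import torch.nn as nn

model = SimpleCNN()                                # any CNN, e.g. the sketch above
criterion = nn.CrossEntropyLoss()                  # cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam is a common alternative

# Dummy data standing in for a DataLoader of (patch, label) pairs.
train_loader = [(torch.randn(4, 1, 32, 32), torch.randint(0, 2, (4,))) for _ in range(10)]

for images, labels in train_loader:
    optimizer.zero_grad()
    scores = model(images)                         # predict class scores
    loss = criterion(scores, labels)               # compute the loss
    loss.backward()                                # back-propagate gradients
    optimizer.step()                               # update the weights
```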

LeCun et al. first proposed a CNN model, named LeNet, for hand-written digit recognition [36]. LeNet is composed of convolutional layers, pooling layers, and fully connected layers. With the development of computer hardware and the increase in the amount of data available for neural network training, in 2012 Krizhevsky et al. proposed AlexNet and won the ILSVRC-2012 image classification competition [37] with a far lower error rate than the second-place entry [38]. The improvements of AlexNet over LeNet include (1) ReLU layers for nonlinearity and sparsity, (2) data augmentation to enlarge the dataset variety, (3) dropout layers to prevent overfitting, (4) local response normalization to normalize responses over neighboring feature maps, and (5) overlapping pooling. Since then, CNNs have attracted widespread attention, and variants of CNNs have been developed that achieve state-of-the-art performance in various image processing tasks. Additionally, Zeiler and Fergus proposed ZFNet to improve the performance of AlexNet [39] and showed that shallow layers learn edge, color, and texture features of images while deeper layers learn abstract features. They also demonstrated that better performance can be achieved with a deeper network. The main contribution of ZFNet is a deconvolution network used to visualize the feature maps.

Simonyan and Zisserman proposed VGG to further explore the performance of deeper network models [40]. The main contribution of VGG is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which showed that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. Similarly, GoogLeNet was proposed to broaden the network structure [41]. By integrating the proposed inception module, GoogLeNet won the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14), an image classification and detection competition. The inception module helps the CNN better describe the input data content while further increasing the depth and width of the network.
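The sketch below illustrates these two design ideas with made-up channel counts: a VGG-style stack of small 3 × 3 convolutions, and a simplified inception-style module that concatenates parallel branches with different kernel sizes (the actual GoogLeNet module also inserts 1 × 1 bottleneck convolutions, omitted here).

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs=2):
    """Stack of small 3x3 convolutions followed by 2x2 max pooling (VGG-style)."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class InceptionLikeModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pooling branches concatenated along the channel axis."""
    def __init__(self, in_ch, branch_ch=16):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)
        self.bp = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                nn.Conv2d(in_ch, branch_ch, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 3, 64, 64)
y = InceptionLikeModule(32)(vgg_block(3, 32)(x))   # -> (1, 64, 32, 32)
```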

Many of the earlier developments of CNNs increased the depth and width of the network to improve performance. However, simply increasing the depth can lead to vanishing/exploding gradients. To ease the difficulty of training deep CNNs and address the degradation caused by increasing network depth, He et al. proposed the residual network (ResNet) for image recognition [42]. ResNet, which is mainly composed of residual blocks, was demonstrated to break through the 100-layer barrier and even reach 1000 layers.
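A minimal sketch of a residual block is shown below: the convolutional layers learn a residual that is added back to the input through an identity shortcut, which is what eases the training of very deep networks. The channel count is an illustrative assumption, not the published ResNet configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut: output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)        # identity shortcut eases gradient flow

y = ResidualBlock(16)(torch.randn(1, 16, 64, 64))
```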

Inspired by residual networks, Huang et al. later proposed the densely connected convolutional network (DenseNet), in which each layer is connected to every other layer [43]. In contrast to residual blocks, which focus on learning the structural difference between the input features and the desired output, DenseNet aims to combine both low-frequency and high-frequency feature maps from previous and current convolutional layers via dense blocks.
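The dense connectivity can be sketched as follows: each layer receives the concatenation of all preceding feature maps, so earlier features are reused directly. The growth rate and number of layers are illustrative assumptions rather than the published DenseNet settings.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer takes the concatenation of all previous feature maps as input."""
    def __init__(self, in_ch, growth=12, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1)))
            ch += growth                                 # channels grow with each layer

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # dense connectivity
        return torch.cat(features, dim=1)

y = DenseBlock(16)(torch.randn(1, 16, 64, 64))   # -> (1, 16 + 3*12, 64, 64)
```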

Overview of Works

In medical image segmentation, CNNs can be used to classify each voxel or patch in the image individually by presenting the network with patches extracted around that voxel or patch. Roth et al. proposed a multi-level deep CNN approach for pancreas segmentation in abdominal CT images [34]. Dense local patch labels were obtained by extracting axial, coronal, and sagittal patches in a sliding-window manner, and the CNN learned to assign class probabilities to the center voxel of each patch. Finally, a stacked CNN leveraged the joint space of CT intensities and the dense probability maps. The CNN architecture used consists of five convolutional layers followed by max pooling, three fully connected layers, two dropout layers, and a softmax operator to perform binary classification. This architecture can be introduced into multi-organ segmentation frameworks by specifying more tissue types, since CNNs naturally support multi-class classification [44].
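The patch-based, per-voxel classification scheme can be sketched as a sliding-window loop over one slice, producing a dense probability map. The patch size, stride, and classifier (the SimpleCNN sketch above) are assumptions for illustration, not the settings of [34].

```python
import numpy as np
import torch

def classify_slice(model, ct_slice, patch_size=32, stride=4):
    """Assign an organ probability to voxels by classifying the patch centered on each one."""
    half = patch_size // 2
    h, w = ct_slice.shape
    prob_map = np.zeros((h, w), dtype=np.float32)
    model.eval()
    with torch.no_grad():
        for y in range(half, h - half, stride):
            for x in range(half, w - half, stride):
                patch = np.ascontiguousarray(ct_slice[y - half:y + half, x - half:x + half])
                inp = torch.from_numpy(patch).float()[None, None]   # (1, 1, H, W)
                prob = torch.softmax(model(inp), dim=1)[0, 1]       # probability of the "organ" class
                prob_map[y, x] = prob.item()
    return prob_map

prob_map = classify_slice(SimpleCNN(), np.random.rand(128, 128).astype(np.float32))
```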

In contrast to 2D input, which loses spatial information along one axis, Hamidian et al. proposed a 3D patch-based CNN to detect pulmonary nodules in chest CT images [45], using volumes of interest extracted from the Lung Image Database Consortium (LIDC) dataset [46]. They extended a previous 2D CNN to three dimensions, which is more suitable for volumetric CT data.
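Extending the 2D building blocks to volumetric patches mainly amounts to replacing 2D convolution and pooling with their 3D counterparts, as in the brief sketch below; the layer sizes are illustrative and do not reproduce the architecture of [45].

```python
import torch
import torch.nn as nn

# Hypothetical 3D patch classifier operating on (depth, height, width) volumes.
model3d = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True), nn.MaxPool3d(2),
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True), nn.MaxPool3d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8 * 8, 2),                 # assumes 32 x 32 x 32 input patches
)
scores = model3d(torch.randn(2, 1, 32, 32, 32))   # input shape: (batch, channel, D, H, W)
```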

For highly pathologically affected cases, segmenting and classifying lytic and sclerotic metastatic lesions in CT images is challenging because these lesions are ill-defined, making it hard for traditional machine learning-based methods to extract relevant features that represent their texture and shape. To address this problem, Chmelik et al. applied a deep CNN (DCNN) to segment and classify these lesions [47]. The network takes three perpendicular 2D patches around each voxel of the 3D CT image as input and outputs a classification into three categories (healthy, lytic, and sclerotic) for that voxel. The proposed CNN consists of several convolutional layers, each followed by ReLU and max pooling, to extract features; several fully connected layers with dropout to combine the feature maps into feature vectors; and a final fully connected layer that converts the feature vector to a three-element output of class scores. A high score corresponds to a high probability of the corresponding class. An L2-regularized cross-entropy and class-error loss is used for optimization, and mini-batch gradient descent with momentum back-propagation is used to optimize the learnable parameters of the CNN.
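The three-perpendicular-patch (2.5D) input can be sketched as follows: axial, coronal, and sagittal patches around a voxel are stacked as channels and fed to a three-class classifier. The patch size and the small network are assumptions for illustration, not the architecture used in [47].

```python
import numpy as np
import torch
import torch.nn as nn

def orthogonal_patches(volume, z, y, x, half=16):
    """Extract axial, coronal, and sagittal patches centered at voxel (z, y, x)."""
    axial    = volume[z, y - half:y + half, x - half:x + half]
    coronal  = volume[z - half:z + half, y, x - half:x + half]
    sagittal = volume[z - half:z + half, y - half:y + half, x]
    return np.stack([axial, coronal, sagittal])                  # (3, 32, 32)

net = nn.Sequential(                                             # hypothetical 3-class classifier
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 16 * 16, 3))                    # scores: healthy / lytic / sclerotic

vol = np.random.rand(64, 64, 64).astype(np.float32)
patches = orthogonal_patches(vol, 32, 32, 32)
scores = net(torch.from_numpy(patches)[None])                    # (1, 3)
```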

During radiotherapy for nasopharyngeal carcinoma (NPC), accurate segmentation of OARs in head and neck (H&N) CT images is a key step for effective planning. Due to the low contrast and surrounding adhesion tissues of the parotids, thyroids, and optic nerves, automatic segmentation of these regions is challenging and yields lower accuracy than for other organs. To address this challenge, Zhong et al. proposed a cascaded CNN combined with a boosting algorithm to delineate these three OARs for NPC radiotherapy [48]. In their study, CT images of 140 NPC patients treated with radiotherapy were collected, and manual contours of the three OARs were used as the learning target. A hold-out test was used to evaluate the performance of the proposed method, i.e., the data were divided into a training set (100 patients), a validation set (20 patients), and a test set (20 patients). Following the boosting strategy of combining multiple classifiers, three cascaded segmentation CNNs were combined. The first network was trained in the traditional way. The second was trained on pixels filtered by the first net, that is, on a mix of pixels of which 50% were accurately classified by the first net. Finally, the third net was trained on the pixels screened jointly by the first and second networks. At test time, the outputs of the three nets were combined to obtain the final output. A 2D patch-based ResNet [42] was used to build the cascaded CNNs.

For multi-OAR segmentation in thoracic radiotherapy, Harten et al. proposed a combination of 2D and 3D CNNs for automatic segmentation of OARs (esophagus, heart, trachea, and aorta) on thoracic treatment planning CT scans of patients diagnosed with lung, breast, or esophageal cancer [49]. The two CNNs are as follows: a 3D patch-based network containing a deep stack of residual blocks [50] with a sigmoid layer performing per-class binary classification, and a 2D patch-based network (with patches extracted from the axial, coronal, and sagittal planes) containing dilated convolutions [51] with a softmax layer performing classification. A hold-out validation (40 scans for training and 20 for testing) was used to evaluate the performance of the proposed method (Table 7.3).

Discussion

For multi-OAR segmentation of H&N CT, the Dice similarity coefficient (DSC), the 95th percentile of the Hausdorff distance (95% HD), and the volume overlap error (VOE) were used to assess the performance of the cascaded CNN [48]. The mean DSC values were above 0.92 for parotids, above 0.92 for thyroids, and above 0.89 for optic nerves. The mean 95% HDs were approximately 3.08 mm for parotids, 2.64 mm for thyroids, and 2.03 mm for optic nerves. The mean VOE metrics were

TABLE 7.3

Overview of CNN Methods

| Ref. | Year | Network | Supervision | Dimension | Site | Modality |
|------|------|---------|-------------|-----------|------|----------|
| [52] | 2017 | Deep deconvolutional neural network (DDNN) | Supervised | 2D slice | Brain | CT |
| [34] | 2015 | Multi-level DCNN | Supervised | 2D patch | Pancreas | CT |
| [53] | 2016 | Holistically nested CNN | Supervised | 2D patch | Pancreas | CT |
| [45] | 2017 | 3D CNN | Supervised | 3D patch | Chest | CT |
| [54] | 2017 | 3D DCNN | Supervised | N.A.* | Abdomen | CT |
| [55] | 2017 | CNN | Supervised | 3D patch | Head and neck | CT |
| [56] | 2017 | Fuzzy-C-Means CNN | Supervised | 3D patch | Lung nodule | CT |
| [57] | 2017 | DCNN | Supervised | 2D slice | Body, chest, abdomen | CT |
| [58] | 2018 | Fusion Net | Supervised | 2D patch | 100 ROIs | HRCT |
| [47] | 2018 | DCNN | Supervised | 2D patch | Spinal lesion | CT |
| [59] | 2018 | DCNN | Supervised | 2D slice | Malignant pleural mesothelioma | CT |
| [60] | 2018 | 2D and 3D CNN | Supervised | 2D slice, 3D volume | Artery/vein | CT |
| [61] | 2018 | 3D ConvNets | Transfer learning | 3D volume | Brain | MRI |
| [62] | 2018 | CNN with specific fine-tuning | Supervised or unsupervised | 2D slice, 3D volume | Brain, abdomen | Fetal MRI |
| [63] | 2018 | 2D and 3D DCNN | Supervised | 2D slice, 3D volume | Whole body | CT |
| [64] | 2019 | Deep fusion network | Supervised | 2D slice | Chest | CXR |
| [65] | 2019 | DCNN | Supervised | 2D slice | Abdomen | CT |
| [66] | 2019 | 2.5D CNN | Supervised | 2.5D patch | Thorax | CT |
| [48] | 2019 | Cascaded CNN | Supervised | 2D slice | Head and neck | CT |
| [49] | 2019 | 2D and 3D CNN | Supervised | 2D slice, 3D volume | Thorax | CT |
| [67] | 2019 | U-net neural network | Supervised | 3D patch | Lung | CT |

*N.A.: not available, i.e. not explicitly indicated in the publication.

approximately 14.16% for parotids, 14.94% for thyroids, and 19.07% for optic nerves. In the comparison by Zhong et al. [48], the proposed boosting-based cascaded CNN outperformed U-net [68] in segmenting the three OARs. Despite the strong accuracy of the boosting structure, its pixel-based classification took more time than U-net, because all classifiers in the boosting structure must classify every pixel in the image.
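For reference, two of the overlap metrics used here can be computed from binary masks as in the short sketch below: DSC = 2|A∩B| / (|A| + |B|) and VOE = 1 - |A∩B| / |A∪B|. The 95% HD additionally requires surface-to-surface distances and is omitted; the masks below are synthetic examples.

```python
import numpy as np

def dsc(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def voe(a, b):
    """Volume overlap error: one minus the Jaccard index."""
    a, b = a.astype(bool), b.astype(bool)
    return 1.0 - np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

auto = np.zeros((64, 64, 64), bool); auto[20:40, 20:40, 20:40] = True   # predicted mask
ref  = np.zeros((64, 64, 64), bool); ref[22:42, 20:40, 20:40] = True    # manual contour
print(dsc(auto, ref), voe(auto, ref))   # ~0.90 and ~0.18 for these synthetic masks
```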

In the study reported in [49], the researchers evaluated the performance of the 2D CNN, the 3D CNN, and the combination of the two individually, and demonstrated that the combined network produces the best results. The DSCs for the esophagus, heart, trachea, and aorta were 0.84 ± 0.05, 0.94 ± 0.02, 0.91 ± 0.02, and 0.93 ± 0.01, respectively. These results demonstrate the potential for automating segmentation of OARs in routine radiotherapy treatment planning.

A drawback of CNNs is that classification must be performed separately for every voxel or small patch. By sliding a window with large overlap between neighboring patches, CNN models perform classification on each voxel of the whole volume. This approach is inefficient since it requires repeated forward predictions for every voxel of the image. Fortunately, convolution and the dot product are both linear operators, so inner products can be written as convolutions and vice versa [69]. By rewriting the fully connected layers as convolutions, a traditional CNN can take input images larger than its training images and produce a likelihood map rather than an output for a single voxel. However, this may lead to an output with far lower resolution than the input due to the pooling layers used.
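The fully-connected-to-convolution rewriting can be sketched as follows: a linear layer applied to flattened C × H × W features is equivalent to a convolution with an H × W kernel, so the converted network can slide over a larger input and emit a (coarse) score map in a single forward pass. The shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

fc = nn.Linear(32 * 8 * 8, 2)                       # trained on 8 x 8 feature maps with 32 channels
conv = nn.Conv2d(32, 2, kernel_size=8)              # equivalent convolutional form
conv.weight.data = fc.weight.data.view(2, 32, 8, 8) # reshape the linear weights into a kernel
conv.bias.data = fc.bias.data

feat = torch.randn(1, 32, 8, 8)
print(torch.allclose(fc(feat.flatten(1)), conv(feat).flatten(1), atol=1e-5))  # True

# On a larger feature map the converted layer yields a score map instead of a single output.
big = torch.randn(1, 32, 20, 20)
print(conv(big).shape)                               # torch.Size([1, 2, 13, 13])
```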

 