Data Augmentation for Training Deep Neural Networks
Zhao Peng, Jieping Zhou, Xi Fang, Pingkun Yan, Hongming Shan, Ge Wang, X. George Xu, and Xi Pei
Data augmentation is a popular technique for reducing overfitting and improving the generalization of deep neural networks. It encompasses a suite of techniques that enhance the size and diversity of training datasets. It plays a critical role when high-quality ground truth data are limited and acquiring new examples is costly and time-consuming, a very common problem in medical image analysis, including auto-segmentation for radiation therapy. This chapter reviews current advances in data augmentation techniques applied to auto-segmentation in radiation oncology, including geometric transformations, intensity transformations, and artificial data generation. In addition, an example application of these data augmentation methods to training deep neural networks for segmentation in radiation therapy is provided.
Introduction and Literature Review
With the support of big data, deep convolutional neural networks have performed remarkably well on many computer vision tasks [2-8]. However, many application domains, such as medical image analysis, do not have access to big data. It is especially difficult to build big medical image datasets due to the rarity of diseases, patient privacy, the requirement of medical experts for labeling, and the expense and manual effort needed to conduct medical imaging processes. To build deep models that generalize well, a huge amount of ground truth data is needed to prevent such a large-capacity neural network from overfitting, that is, from “memorizing” the training set. It is a generally accepted notion that bigger datasets result in better deep learning models [10-12]. To combat the problem of limited-size medical training sets, data augmentation has been widely used in medical image analysis [13-16]. It encompasses a series of techniques that enhance the size and diversity of training datasets so that better deep learning models can be built from them.
FIGURE 11.1 The taxonomy of data augmentation methods in auto-segmentation for radiation oncology applications.
Shorten and Khoshgoftaar summarized general data augmentation algorithms for natural images, including geometric transformations, color space augmentations, kernel filters, mixing images, random erasing, feature space augmentation, adversarial training, generative adversarial networks, neural style transfer, and meta-learning. However, in radiation oncology, the images involved are medical images such as computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET), which differ from natural images. For example, CT images are grayscale, while natural images are usually in color. In addition, data augmentation methods differ somewhat between learning tasks. In this chapter, the literature on data augmentation for auto-segmentation in radiation oncology is reviewed, and several types of data augmentation methods are summarized. Figure 11.1 illustrates the range of methods that can be employed for data augmentation. Geometric transformations, intensity transformations, and artificial data generation are considered in this chapter.
The most commonly used geometric transformations for data augmentation include flipping, rotation, translation, scaling, shearing, and cropping. The flipping operation creates a mirror reflection of an original image along one or more selected axes. The rotation operation rotates the image clockwise or counterclockwise about an axis by an angle between 1° and 359°. The translation operation shifts the entire image by a given number of pixels in a chosen direction, padding the vacated region accordingly. The scaling operation zooms an image in or out along one or more selected axes. The shear operation displaces each point in an image in a selected direction by an amount proportional to its distance from the line that passes through the origin and is parallel to this direction. The cropping operation can be used as a practical processing step for image data with mixed height and width dimensions by cropping a central patch of each image. Additionally, random cropping is usually adopted to increase the variety of training examples [18-20]. Arbitrary combinations of flipping, rotation, translation, scaling, and shearing are usually called affine transformations.
FIGURE 11.2 Applying geometric transformations to CT images.
FIGURE 11.3 Applying intensity transformations to CT images.
Affine transformations preserve the parallelism of lines between the input and output images. These operations are the easiest to implement and have proven useful on datasets such as ImageNet. They are also widely used in medical image segmentation tasks [22-27].
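The basic geometric operations described above can be sketched with NumPy alone. This is a minimal illustration, not the implementation used in any of the cited works; the function names (`flip`, `rotate90`, `translate`, `random_crop`) and the toy 4 × 4 image are hypothetical, and rotation is restricted to multiples of 90° so that no interpolation is needed.

```python
import numpy as np

def flip(image, axis):
    """Mirror the image along the given axis."""
    return np.flip(image, axis=axis)

def rotate90(image, k, axes=(0, 1)):
    """Rotate by k * 90 degrees in the plane spanned by `axes`."""
    return np.rot90(image, k=k, axes=axes)

def translate(image, shift, axis, pad_value=0.0):
    """Shift the image by `shift` pixels along `axis`, padding the vacated region."""
    out = np.full_like(image, pad_value)
    n = image.shape[axis]
    if abs(shift) >= n:
        return out  # everything has been shifted out of view
    src = [slice(None)] * image.ndim
    dst = [slice(None)] * image.ndim
    if shift >= 0:
        src[axis] = slice(0, n - shift)
        dst[axis] = slice(shift, n)
    else:
        src[axis] = slice(-shift, n)
        dst[axis] = slice(0, n + shift)
    out[tuple(dst)] = image[tuple(src)]
    return out

def random_crop(image, size, rng):
    """Extract a patch of shape `size` at a uniformly random location."""
    starts = [rng.integers(0, dim - s + 1) for dim, s in zip(image.shape, size)]
    window = tuple(slice(st, st + s) for st, s in zip(starts, size))
    return image[window]

# Toy example on a 4 x 4 "image" (hypothetical data).
rng = np.random.default_rng(0)
image = np.arange(16, dtype=np.float32).reshape(4, 4)
flipped = flip(image, axis=1)
rotated = rotate90(image, k=1)
shifted = translate(image, shift=1, axis=1)
patch = random_crop(image, size=(2, 2), rng=rng)
```

For segmentation, the same geometric transformation has to be applied to the image and to its label map so that the two stay aligned.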
Another common geometric transformation is elastic transformation, which distorts the shapes in the image and therefore produces training examples that differ from those obtained with affine transformations. Considering the great variability in the shape and appearance of tissues, organs, and tumors, this operation can be especially useful in medical image analysis. Elastic transformations are often used in combination with affine transformations [29-32], which can greatly increase the diversity of the training examples. Figure 11.2 presents examples of applying these geometric transformations to a CT image.
Intensity transformation refers to changing pixel intensity values, either locally or across the entire image. Methods include adding Gaussian noise, random dropout, shifting and scaling of pixel-intensity values (for example, modifying the image brightness), sharpening, blurring, and more [33-36]. Such operations can be especially useful in medical image analysis, where training images are acquired at different locations with different scanners and can therefore be intrinsically heterogeneous in pixel intensities or intensity gradients. Figure 11.3 shows some examples of intensity transformations applied to a CT image.
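A few of these intensity operations can be sketched as follows. The sketch assumes images already normalized to the range 0 to 1; the function names and parameter values (`sigma`, `shift`, `scale`, `drop_prob`) are illustrative choices, not values taken from the cited studies.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and keep values inside [0, 1]."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def shift_scale_intensity(image, shift, scale):
    """Scale then shift all intensities (a global brightness/contrast change)."""
    return np.clip(image * scale + shift, 0.0, 1.0)

def random_dropout(image, drop_prob, rng):
    """Set each pixel to zero independently with probability `drop_prob`."""
    keep = rng.random(image.shape) >= drop_prob
    return image * keep

# Toy example on a constant image (hypothetical data).
rng = np.random.default_rng(0)
image = np.full((4, 4), 0.5)
noisy = add_gaussian_noise(image, sigma=0.05, rng=rng)
brighter = shift_scale_intensity(image, shift=0.1, scale=1.2)
dropped = random_dropout(image, drop_prob=0.3, rng=rng)
```

Unlike geometric transformations, these operations change only the image, so the segmentation labels are left untouched.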
Artificial Data Generation
Oversampling is a traditional method for data augmentation that synthesizes new samples from the existing training data. This approach primarily focuses on alleviating problems due to class imbalance. Random oversampling (ROS) is a naive approach that duplicates images randomly from the minority class until a desired class ratio is achieved. Intelligent oversampling techniques date back to the synthetic minority over-sampling technique (SMOTE), developed by Chawla et al. SMOTE creates new instances by interpolating new points from existing instances via k-nearest neighbors. Later, Inoue introduced a simple but surprisingly effective data augmentation technique named SamplePairing, in which a new sample is synthesized from one image by overlaying another image randomly chosen from the training data (i.e. taking the average of the two images at each pixel). Zhang et al. introduced a data oversampling routine, termed mixup, which blends two examples drawn at random from the training data by weighted summation. Their experiments showed that mixup improves the generalization of state-of-the-art neural network architectures.
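The two blending schemes can be sketched in a few lines. In this sketch, `sample_pairing` only averages the two images and leaves the choice of label to the caller, and `mixup` draws the blending weight from a Beta(α, α) distribution as in the original mixup formulation; the value `alpha=0.4` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def sample_pairing(image_a, image_b):
    """Average two images pixel by pixel (SamplePairing-style overlay)."""
    return 0.5 * (image_a + image_b)

def mixup(image_a, label_a, image_b, label_b, alpha, rng):
    """Blend two examples and their one-hot labels with weight lam ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    image = lam * image_a + (1.0 - lam) * image_b
    label = lam * label_a + (1.0 - lam) * label_b
    return image, label, lam

# Toy example with two constant images and one-hot labels (hypothetical data).
rng = np.random.default_rng(0)
image_a, image_b = np.zeros((2, 2)), np.ones((2, 2))
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
paired = sample_pairing(image_a, image_b)
mixed_image, mixed_label, lam = mixup(image_a, label_a, image_b, label_b, alpha=0.4, rng=rng)
```

Because the mixed label is a convex combination of two one-hot vectors, it still sums to one and can be used directly with a cross-entropy or Dice-type loss.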
Generative adversarial nets (GANs) are a method for synthesizing data using deep neural networks. A GAN consists of a generator, which synthesizes samples, and a discriminator, which evaluates how realistic the synthetic samples are. GANs were first introduced in 2014, and since then various extensions, such as DCGAN, WGAN, and CycleGAN, have been published. GANs have been widely used for data augmentation. Sandfort et al. used CycleGAN-based data augmentation to improve generalizability in CT segmentation tasks. Frid-Adar et al. used GAN-based data augmentation for liver lesion classification, which improved classification performance from 78.6% sensitivity and 88.4% specificity with classic augmentation to 85.7% sensitivity and 92.4% specificity with GAN-based augmentation. Tang et al. used pix2pix-GAN-based data augmentation to enhance lymph node segmentation; the Dice score increased by about 2.2 percentage points (from 80.3% to 82.5%). Zou et al. used a CycleGAN-based framework to generate domain-adaptive images and thereby achieve unsupervised segmentation of images in the target domain.
Applications of Data Augmentation
Datasets and Image Preprocessing
In this chapter, two datasets were used: the 2017 Lung CT Segmentation Challenge (LCTSC) dataset [48-50], detailed in Chapter 1, and a Pancreas-CT (PCT) dataset, which contains abdominal contrast-enhanced CT scans of 43 patients with eight segmented organs (the spleen, left kidney, gallbladder, esophagus, liver, stomach, pancreas, and duodenum) [22, 30, 49, 51].
For each patient in these datasets, the Hounsfield unit (HU) values were clipped to a minimum of -200 and a maximum of 300 and then normalized to values between 0 and 1. To focus on the organs and suppress background information, each image was cropped to a region of interest defined by the body contour in the original CT image before being used as training data. Finally, to circumvent computer memory limitations, the data were resampled using linear interpolation for the CT images and nearest-neighbor interpolation for the labels. The resulting resolution after resampling was 2.0 mm × 2.0 mm × 2.5 mm for the LCTSC dataset and 2.0 mm × 2.0 mm × 1.0 mm for the PCT dataset.
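The intensity windowing and the region-of-interest cropping can be sketched as follows. The thresholds of -200 HU and 300 HU come from the text; the function names and the assumption that the body contour is available as a binary mask (`body_mask`) are illustrative. Resampling is not shown.

```python
import numpy as np

HU_MIN, HU_MAX = -200.0, 300.0  # thresholds stated in the text

def normalize_hu(volume_hu, hu_min=HU_MIN, hu_max=HU_MAX):
    """Clip HU values to [hu_min, hu_max] and rescale linearly to [0, 1]."""
    clipped = np.clip(volume_hu.astype(np.float32), hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)

def crop_to_mask(volume, mask):
    """Crop the volume to the bounding box of the nonzero voxels of `mask`."""
    coords = np.argwhere(mask)
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1
    window = tuple(slice(a, b) for a, b in zip(lo, hi))
    return volume[window]

# Toy example (hypothetical data): five HU values and a small body mask.
normalized = normalize_hu(np.array([-1000, -200, 50, 300, 2000]))
volume = np.zeros((4, 5, 6), dtype=np.float32)
body_mask = np.zeros_like(volume, dtype=bool)
body_mask[1:3, 2:4, 0:2] = True
cropped = crop_to_mask(volume, body_mask)
```

The same bounding box would be applied to the label volume so that image and labels keep identical shapes.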
Training, Validation, and Testing for Organ Segmentation
The network used in this study was based on the 3D U-net [52, 53] shown in Figure 11.4; the network consists of an encoder and a decoder. The role of the decoder network is to map the low-resolution encoder feature maps to full-input-resolution feature maps for pixel-wise classification. The encoder contains four repeated residual blocks. Each block consists of four convolutional modules. Each convolutional module is composed of a convolution layer with a 3 × 3 × 3 kernel, an instance normalization layer, and a leaky rectified linear unit with a coefficient of 0.3. In each residual block, the stride of the convolution layer in the convolutional modules is 1 × 1 × 1, except in the last convolutional module, where the stride is 2 × 2 × 2 to achieve downsampling. A spatial dropout layer is placed between the first two convolutional modules to prevent the network from overfitting. The decoder contains four repeated segmentation blocks. Each block consists of two convolutional modules and one deconvolutional module. The four dashed arrows in the figure indicate four skip connections that copy early feature maps and reuse them, by concatenation, as input to later layers with the same feature-map size in order to preserve high-resolution features. In the final three segmentation blocks, a 1 × 1 × 1 convolution layer maps the feature tensor to a probability tensor whose number of channels equals the desired number of classes; the results are then merged by upsampling to enhance the precision of the segmentation. Finally, a softmax activation outputs the probability of each class for every voxel.
FIGURE 11.4 The network architecture used in the chapter.
A five-fold cross-validation method was adopted for this work. The entire dataset was randomly split, using the “random.shuffle()” function in Python, into five non-overlapping subsets for training, validation, and testing in the ratio of 3:1:1 (i.e. three subsets for training, one subset for validation, and one subset for testing). The validation subset was used to monitor the training process and to prevent overfitting. To reduce the potential for bias, the five randomly split subsets were rotated five times, and the average performance over the five different holdout testing subsets is reported, as illustrated in Figure 11.5. The five-fold cross-validation strategy is key to ensuring the independence of the testing data, i.e. each sample is used in the testing subsets only once.
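The split-and-rotate scheme can be sketched with the standard library. The text only states that `random.shuffle()` was used; the fixed seed, the helper name `five_fold_splits`, the way cases are dealt into subsets, and the order in which subsets take the validation and training roles are assumptions made for this illustration.

```python
import random

def five_fold_splits(case_ids, seed=0):
    """Shuffle the cases, deal them into five subsets, and rotate the roles 3:1:1."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # seeded only to make the sketch reproducible
    folds = [ids[i::5] for i in range(5)]
    splits = []
    for r in range(5):
        test = folds[r]
        val = folds[(r + 1) % 5]
        train = [c for j in range(2, 5) for c in folds[(r + j) % 5]]
        splits.append({"train": train, "val": val, "test": test})
    return splits

# Toy example with 60 hypothetical case identifiers.
splits = five_fold_splits(range(60))
for rotation, split in enumerate(splits):
    print(rotation, len(split["train"]), len(split["val"]), len(split["test"]))
```

Across the five rotations every case appears in exactly one testing subset, which is the property the text relies on for independent testing.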
FIGURE 11.5 Example of splitting and rotation using the five-fold cross-validation method for the dataset involving five subsets.
FIGURE 11.6 An example to illustrate patches from LCTSC, the database used in the training in terms of axial, sagittal, and coronal views.
To assess the value of data augmentation, the segmentation model was trained with and without data augmentation. Geometric transformations, including flipping, rotation, and random cropping, were used for data augmentation. Patches were first randomly extracted from the resampled CT images and then flipped or rotated along some axes. In this experiment, the rotation angles were 0°, 90°, 180°, or 270°, and the patch size was 96 × 96 × 96. Figure 11.6 shows an example of such patches from LCTSC used in the training in terms of axial, sagittal, and coronal views. Finally, the network was trained with the patches and their corresponding labels. The weighted Dice similarity coefficient was used as the loss function, defined as:
where $p_{ikv}$ is the predicted probability that voxel $v$ of sample $i$ belongs to class $k$, $y_{ikv}$ is the ground truth label (0 or 1), $N$ is the number of samples, $K$ is the number of classes, $V$ is the number of voxels in one sample, and $\epsilon$ is a smoothing factor (set to 1 in this study). The initial learning rate was 0.0005, and the Adam algorithm was used to optimize the parameters of the network. The validation loss was calculated after every epoch, and the learning rate was halved when the validation loss had not decreased for 30 consecutive epochs. To prevent overfitting, training was terminated when the validation loss had not decreased for 50 consecutive epochs.
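A soft Dice loss that is consistent with the symbols defined above can be sketched as follows. The exact weighting used by the authors is not reproduced here, so the form implemented below is an assumption: one minus the Dice coefficient, averaged uniformly over the $N$ samples and $K$ classes, that is, $1 - \frac{1}{NK}\sum_{i=1}^{N}\sum_{k=1}^{K}\frac{2\sum_{v=1}^{V} p_{ikv}\,y_{ikv} + \epsilon}{\sum_{v=1}^{V} p_{ikv} + \sum_{v=1}^{V} y_{ikv} + \epsilon}$. The function name `soft_dice_loss` and the toy data are hypothetical.

```python
import numpy as np

def soft_dice_loss(p, y, eps=1.0):
    """Soft Dice loss for arrays of shape (N, K, V).

    p holds predicted probabilities, y holds one-hot ground truth labels,
    and eps is the smoothing factor (1 in the text). Averaging uniformly
    over samples and classes is an assumption of this sketch.
    """
    intersection = np.sum(p * y, axis=2)            # shape (N, K)
    denom = np.sum(p, axis=2) + np.sum(y, axis=2)   # shape (N, K)
    dice = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice.mean()

# Toy example: N = 2 samples, K = 3 classes, V = 10 voxels (hypothetical data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=(2, 10))
y = np.stack([(labels == k) for k in range(3)], axis=1).astype(np.float64)
perfect = soft_dice_loss(y, y)                      # prediction equals ground truth
uniform = soft_dice_loss(np.full_like(y, 1.0 / 3.0), y)
```

With the smoothing factor in both numerator and denominator, a class that is absent from a sample and correctly predicted as absent contributes a Dice value of one rather than an undefined ratio.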
At the testing stage, patches were first extracted from each CT image with a moving window. The window size was 96 × 96 × 96 and the stride was 48 in each direction. In other words, multiple patches were extracted from one patient and fed into the network. The output of the network was a probability tensor for each patch. All probability tensors from the same patient were then merged, using a mean operator in the overlapping areas, to obtain the final probability tensor. Next, the class of each voxel was taken to be the one with the largest probability, which gave the preliminary organ segmentation. Using nearest-neighbor interpolation, the preliminary segmentation was resampled to the size of the original CT image to obtain the final organ segmentation result.
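The moving-window inference described above can be sketched as follows. The window size of 96 and stride of 48 are taken from the text; everything else is an assumption of this sketch, including the function names, the placement of a final window flush with the volume border when the stride does not divide evenly, and the stand-in `dummy_predict`, which replaces the trained network. The sketch also assumes each volume dimension is at least as large as the window.

```python
import numpy as np

def window_starts(length, window, stride):
    """Start indices covering [0, length), with the last window flush to the end."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts

def sliding_window_predict(volume, predict_fn, num_classes, window=96, stride=48):
    """Average overlapping patch probabilities, then take the per-voxel argmax."""
    prob_sum = np.zeros((num_classes,) + volume.shape, dtype=np.float32)
    count = np.zeros(volume.shape, dtype=np.float32)
    grids = [window_starts(n, window, stride) for n in volume.shape]
    for z in grids[0]:
        for y in grids[1]:
            for x in grids[2]:
                sl = (slice(z, z + window), slice(y, y + window), slice(x, x + window))
                prob_sum[(slice(None),) + sl] += predict_fn(volume[sl])
                count[sl] += 1.0
    mean_prob = prob_sum / count  # mean over all patches covering each voxel
    return np.argmax(mean_prob, axis=0)

def dummy_predict(patch):
    """Hypothetical stand-in for the network: two-class probabilities from intensity."""
    return np.stack([1.0 - patch, patch], axis=0)

# Toy example with a small window so that it runs quickly (hypothetical data).
rng = np.random.default_rng(0)
volume = np.where(rng.random((10, 12, 9)) > 0.5, 0.8, 0.2)
labels = sliding_window_predict(volume, dummy_predict, num_classes=2, window=4, stride=2)
```

Because the stride is half the window size, interior voxels are covered by several patches, and averaging their probabilities reduces artifacts at patch borders.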
All experiments described above were performed on a Linux computer system. Keras with TensorFlow as the backend was used as the platform for designing and training the neural network. The hardware included (1) GPU: NVIDIA GeForce Titan X graphics card with 12 GB memory, and (2) CPU: Intel Xeon Processor X5650 with 16 GB memory.