Implementation Details

Since CT intensities (corresponding to Hounsfield units) should be standard across different devices, they were not normalized but were clipped to the range of -1024 to 700 before training and testing. The pseudo code for the PyTorch Dataset class's __getitem__() function is summarized as follows:

    Image = read_image(image_filename)
    Image = intensity_clip(Image)
    Segmentation = read_image(ground_truth_filename)
    Image = resample(Image)
    Segmentation = resample(Segmentation)
    Index = generate_index(Segmentation)
    Image = patch_sampler(Image, Index).cuda()
    Segmentation = patch_sampler(Segmentation, Index).cuda()
    return {'image': Image, 'segmentation': Segmentation}
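As a rough illustration of the steps above, the following sketch implements the intensity clipping and wraps it in a plain Python class that follows the same __len__/__getitem__ protocol as a map-style dataset. It is not the chapter's implementation: `PatchDataset`, `read_image`, and `sample_patch` are hypothetical names, images are represented as flat lists of intensities for simplicity, and the resampling and GPU transfer steps are omitted.

```python
from typing import Callable, Dict, List, Sequence, Tuple

HU_MIN, HU_MAX = -1024, 700  # clipping range stated in the text


def intensity_clip(values: Sequence[float],
                   lo: float = HU_MIN,
                   hi: float = HU_MAX) -> List[float]:
    """Clamp every intensity to [lo, hi]; no rescaling is applied."""
    return [min(max(v, lo), hi) for v in values]


class PatchDataset:
    """Map-style dataset sketch (hypothetical, not the chapter's code).

    read_image:   callable mapping a file name to a flat list of values
    sample_patch: callable mapping (image, segmentation) to a pair of patches
    Both callables are assumptions supplied by the caller.
    """

    def __init__(self,
                 pairs: Sequence[Tuple[str, str]],
                 read_image: Callable[[str], List[float]],
                 sample_patch: Callable[[List[float], List[float]],
                                        Tuple[List[float], List[float]]]):
        self.pairs = list(pairs)
        self.read_image = read_image
        self.sample_patch = sample_patch

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, idx: int) -> Dict[str, List[float]]:
        image_file, truth_file = self.pairs[idx]
        # Only the image is clipped; the label map is left untouched.
        image = intensity_clip(self.read_image(image_file))
        segmentation = self.read_image(truth_file)
        image_patch, seg_patch = self.sample_patch(image, segmentation)
        return {"image": image_patch, "segmentation": seg_patch}
```

Clipping only the image and not the segmentation mirrors the pseudo code, where intensity_clip is applied to Image alone.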

Note that the image patches used for training were picked randomly according to the ground truth segmentation. The sampling rule was that half of the patches would have their center points located within the region of interest (ROI) of the ground truth masks, and the other half would not. In this way, the training data sampler was well balanced between the background regions and the ROIs.
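The balanced sampling rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's generate_index function: the mask is a nested list indexed as mask[z][y][x], `sample_patch_centers` is a hypothetical name, and patches that would extend past the image border are not handled.

```python
import random
from typing import List, Sequence, Tuple

Voxel = Tuple[int, int, int]


def split_voxels(mask: Sequence[Sequence[Sequence[int]]]
                 ) -> Tuple[List[Voxel], List[Voxel]]:
    """Return (ROI voxels, background voxels) of a 3D nested-list mask."""
    roi, background = [], []
    for z, plane in enumerate(mask):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                (roi if value else background).append((z, y, x))
    return roi, background


def sample_patch_centers(mask: Sequence[Sequence[Sequence[int]]],
                         n_patches: int,
                         rng: random.Random) -> List[Voxel]:
    """Pick patch centers: half inside the ROI, the rest in the background."""
    roi, background = split_voxels(mask)
    if not roi or not background:
        raise ValueError("mask needs both ROI and background voxels")
    n_roi = n_patches // 2
    centers = [rng.choice(roi) for _ in range(n_roi)]
    centers += [rng.choice(background) for _ in range(n_patches - n_roi)]
    rng.shuffle(centers)  # avoid a fixed ROI-then-background order
    return centers
```

Passing an explicit random.Random instance keeps the sampling reproducible when a seed is fixed.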

The training procedure is illustrated by the following pseudo code:

    for epoch in range(Number_Of_Epochs + 1):
        for i, data in enumerate(dataLoader):
            image = data['image']
            segmentation = data['segmentation']
            ... (prepare tensor data)
            output = model(image)
            ... (prepare output)
            loss = dice_loss(output, segmentation)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
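The text does not spell out the exact form of dice_loss (for example, whether the denominator uses squared terms or how classes are averaged). The sketch below shows one common soft Dice formulation for a single class on flattened values, written in plain Python so that the arithmetic is easy to follow; the smoothing term `eps` is an assumption.

```python
from typing import Sequence


def dice_loss(probabilities: Sequence[float],
              target: Sequence[float],
              eps: float = 1e-6) -> float:
    """Soft Dice loss, 1 - 2|P.T| / (|P| + |T|), for one class.

    probabilities: predicted foreground probabilities in [0, 1]
    target:        binary ground truth labels (0 or 1)
    eps:           small constant that avoids division by zero (assumed)
    """
    if len(probabilities) != len(target):
        raise ValueError("prediction and target must have the same length")
    intersection = sum(p * t for p, t in zip(probabilities, target))
    denominator = sum(probabilities) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)
```

A perfect prediction gives a loss close to 0 and a prediction with no overlap gives a loss close to 1, so minimizing the loss maximizes the overlap with the ground truth.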

The major parameters and configurations of the experiments are listed in Table 8.1. Because the number of available samples was limited, no validation set was used for model selection. The final model was the last one obtained after the learning rate had been reduced from 0.0001 to 0.00001 and the Dice loss curve had become reasonably stable.

Training all five networks was fairly straightforward. However, due to limited GPU memory, only one 3D patch of size 256 x 256 x 256 could be fed to the network per batch, which resulted in bumpy Dice loss plots. Therefore, the maximum patch size was set to 128 x 128 x 128. To train the 3D U-net at 1 mm resolution, the results of the 2 mm model on the training set were used to crop the images into smaller volumes, and the 1 mm 3D U-net was trained only on these cropped images. In this sense, the 1 mm 3D U-net is a refinement step for the corresponding 2 mm model.

In theory, the patch size does not change network behavior. In practice, however, behavior may depend on the padding and on how deep the network is. Each network has an effective field of view that depends on its depth. For example, if the convolutions use no padding, a 128 x 128 x 128 patch yields a probability map of size 80 x 80 x 80 for the 3D U-net. The boundary effect, or effective covering range, for a voxel in the segmentation map is therefore about 24 voxels, that is, 2.4 cm in radius at 1 mm resolution, which is fairly small. In other words, the network used here may not be deep enough to cover a large spatial region, so the 3D U-net at 1 mm resolution could generate false positive regions outside the ROI. On the other hand, the fine resolution may allow the U-net to detect image boundaries and detailed information with greater accuracy, and therefore improve performance.
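The 2.4 cm figure follows from simple arithmetic: a 128-voxel patch that shrinks to 80 voxels loses (128 - 80) / 2 = 24 voxels on each side, which is 24 mm at 1 mm spacing. The sketch below reproduces that calculation and, as a separate illustration, computes the output size of a generic U-net with unpadded convolutions. The second function is an assumption about a typical layout (two 3 x 3 convolutions per level, 2 x pooling and upsampling); the chapter's exact 3D U-net configuration is not given here, so it is not expected to reproduce the 80-voxel output.

```python
def border_margin_mm(input_size: int, output_size: int,
                     spacing_mm: float) -> float:
    """Width in mm of the border lost on each side of a patch."""
    lost = input_size - output_size
    if lost < 0 or lost % 2:
        raise ValueError("output must be smaller by an even number of voxels")
    return (lost // 2) * spacing_mm


def valid_unet_output_size(n: int, depth: int,
                           convs_per_level: int = 2,
                           kernel: int = 3) -> int:
    """Output size along one axis of a U-net with unpadded convolutions.

    depth is the number of 2x pooling steps. The layout is a generic
    assumption, not the configuration used in the chapter.
    """
    shrink = convs_per_level * (kernel - 1)  # voxels lost per level
    for _ in range(depth):                   # contracting path
        n -= shrink
        if n <= 0 or n % 2:
            raise ValueError("size before pooling must be positive and even")
        n //= 2
    n -= shrink                              # bottleneck convolutions
    for _ in range(depth):                   # expanding path
        n = 2 * n - shrink
    if n <= 0:
        raise ValueError("input too small for this depth")
    return n
```

With these definitions, border_margin_mm(128, 80, 1.0) gives 24.0 mm, and the same 24-voxel margin at 2 mm spacing corresponds to 48 mm, which illustrates why a coarser network covers a wider spatial region. As a sanity check of the second function, valid_unet_output_size(572, 4) returns 388, the input and output sizes reported for the original 2D U-net.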

TABLE 8.1 List of Major Parameters and Configurations for Training

Parameter                   3D U-net (1 mm)   3D U-net (2 mm)   3D U-net (5 mm)   2D U-net (1 mm)   2D U-net (2 mm)
Number of training samples  36 volumes        36 volumes        36 volumes        5858 slices       5858 slices
Patch size                  128 x 128 x 128   128 x 128 x 128   64 x 64 x 64      256 x 256         128 x 128
Batch size/minimal epochs                                       6/1000
Learning rate                                                   0.0001 -> 0.00001
Loss function               Dice (3D)         Dice (3D)         Dice (3D)         Dice (2D)         Dice (2D)
Machine and GPU             Tesla V100 DGXS   Tesla V100 DGXS   Tesla V100 DGXS   Tesla V100 DGXS   Tesla V100 DGXS

Thus, capturing both global and local information with a U-net is difficult, even though the architecture is inherently multi-resolution. Increasing the depth so that the network has a wider effective field of view may handle such situations better and improve performance, but the computational cost and memory requirements then become very high. An alternative solution is to apply the network to multi-resolution images. For example, a coarse-resolution network can first be used to segment the ROI so that the ROI can be cropped at high resolution, and a fine-resolution network can then be applied in a subsequent refinement step. Such an implementation works like a cascade model: the first stage has a wider field of view, and the last stage can be considered as fine-tuning the results within a smaller spatial domain.
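The cropping step of such a cascade can be sketched as follows. This is an illustration under stated assumptions rather than the chapter's code: the coarse and fine grids are assumed to share the same origin, the fine spacing is assumed to divide the coarse spacing by an integer factor (`scale`, for example 2 when going from 2 mm to 1 mm), and the safety `margin` in fine voxels is a hypothetical parameter.

```python
from typing import Optional, Sequence, Tuple

Voxel = Tuple[int, int, int]


def bounding_box(mask: Sequence[Sequence[Sequence[int]]]
                 ) -> Optional[Tuple[Voxel, Voxel]]:
    """Inclusive (lo, hi) corners of the nonzero voxels, or None if empty."""
    coords = [(z, y, x)
              for z, plane in enumerate(mask)
              for y, row in enumerate(plane)
              for x, value in enumerate(row) if value]
    if not coords:
        return None
    lo = tuple(min(c[axis] for c in coords) for axis in range(3))
    hi = tuple(max(c[axis] for c in coords) for axis in range(3))
    return lo, hi


def fine_crop_box(coarse_box: Tuple[Voxel, Voxel], scale: int, margin: int,
                  fine_shape: Voxel) -> Tuple[Voxel, Voxel]:
    """Map a coarse bounding box to fine-grid crop bounds.

    Returns (start, stop) with stop exclusive, padded by `margin` fine
    voxels and clamped to the fine image extent.
    """
    lo, hi = coarse_box
    start = tuple(max(l * scale - margin, 0) for l in lo)
    stop = tuple(min((h + 1) * scale + margin, size)
                 for h, size in zip(hi, fine_shape))
    return start, stop
```

The margin gives the fine-resolution network some context around the coarse segmentation, so that small errors in the first stage do not cut off part of the organ.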


Before reporting the quantitative metrics, segmentation results of one subject produced using the 2D networks are visualized in Figure 8.3. Overall, the two segmentations at resolutions of 2 mm and 1 mm were similar, but the segmentation of the high-resolution model was more accurate. For example, the 1 mm model yielded a more detailed segmentation shape and matched the boundaries better. The sagittal views also show that the esophagus masks were not smooth along the z-direction for either 2D U-net model.

When comparing different organs, the segmentations of the lungs and the heart were much better than those of the esophagus and the spinal cord. This is confirmed quantitatively in Figure 8.4, which plots the average and standard deviation values of (a) the Dice similarity coefficient and (b) the 95% Hausdorff distance between the segmentation results and the ground truth. For convenience, the results of the 3D U-net are also plotted in this figure. Across the 2D cases, especially for the esophagus, the spinal cord, and the heart, the high-resolution model outperformed the low-resolution one. The 95% Hausdorff distances of the left and right lungs were also much smaller for the high-resolution model.
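For readers unfamiliar with the 95% Hausdorff distance, the sketch below shows one common way to compute it from two sets of surface points: the 95th percentile of the nearest-neighbor distances is taken in each direction, and the larger of the two values is reported. Definitions vary between tools (some pool both directions before taking the percentile, and percentile interpolation rules differ), so this is an assumption rather than the exact metric used for Figure 8.4. The brute-force search is only practical for small point sets.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("empty input")
    position = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def directed_distances(a: Sequence[Point], b: Sequence[Point]) -> List[float]:
    """Distance from every point in a to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def hausdorff_95(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric 95% Hausdorff distance between two point sets."""
    return max(percentile(directed_distances(a, b), 95.0),
               percentile(directed_distances(b, a), 95.0))
```

Using the 95th percentile instead of the maximum makes the metric less sensitive to a few outlier voxels, such as isolated false positive spots far from the organ.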

FIGURE 8.3 Comparison of low-resolution (2 mm, top) and high-resolution (1 mm, bottom) 2D U-net segmentation results. The last row shows the ground truth.

FIGURE 8.4 Comparison of the mean and the standard deviation values of Dice coefficients of 2D and 3D U-nets at different resolutions. For each organ, the segmentation models from left to right are 3D coarse (5 mm), 3D coarse (2 mm), 2D coarse (2 mm), 3D fine (1 mm), and 2D fine (1 mm), respectively.

Figure 8.5 shows the 3D U-net segmentation results for the same subject. For convenience, the same ground truth is shown. Overall, the 1 mm U-net model yielded a more detailed segmentation, and compared to the 5 mm model, both the 2 mm and 1 mm models better matched the organ boundaries. The 2 mm model may have 1-2 voxel shifts because of the resampling procedure. Notice that there were occasional disconnected spots for the 1 mm model. As discussed previously, this is a result of the smaller patch size: because of the graphics card memory limitation and the size of the image, network inference must be performed by cropping the image into pieces and then stitching the results together. This would not be a problem if GPU memory were large enough for the images to be input as a whole.
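The cropping part of such patch-wise inference can be sketched as below. The functions compute where patches start so that the whole volume is covered, with the last patch shifted back to end at the image border. The names and the choice of `stride` are assumptions; the chapter does not state whether overlapping patches were used or how overlapping predictions were merged.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple


def tile_starts(length: int, patch: int, stride: int) -> List[int]:
    """Start indices along one axis so that patches cover [0, length)."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    if patch >= length:
        return [0]  # a single patch; the caller pads or crops it
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # final patch ends at the border
    return starts


def tile_grid(shape: Sequence[int], patch: int,
              stride: int) -> Iterator[Tuple[int, ...]]:
    """Yield the start corner of every patch of an N-dimensional volume."""
    return product(*(tile_starts(n, patch, stride) for n in shape))
```

For a 500 x 500 x 500 volume with 128-voxel patches and a stride of 96, each axis needs five patches, so the network is run 125 times per volume. A stride smaller than the patch size makes neighboring patches overlap, which is a common way to reduce seams when the results are stitched together.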

FIGURE 8.5 Comparison of 3D U-net results using three resolutions: 5 mm, 2 mm, and 1 mm.

The average and standard deviation of the Dice coefficients and 95% Hausdorff distances between the segmentation results and the ground truth for the 3D U-net models are shown in Figure 8.4. For all five organs, the high-resolution 1 mm model outperformed the 2 mm and 5 mm low-resolution models in both Dice and 95% Hausdorff distance. Segmentation performance increased with resolution. The segmentation results of the lungs and the heart were much better than those of the esophagus and the spinal cord.

When comparing the 2D and 3D U-nets at 2 mm resolution, the 3D model achieved a higher Dice than the 2D model for the spinal cord and the heart, and a better 95% Hausdorff distance across all five organs. At 1 mm resolution, the Dice values of the 3D model were not as good as those of the 2D model. This may have been caused by the patch cropping operation and the padding effect. Overall, the results of the 2D and 3D models were comparable, and no significant improvement was observed from using a 3D U-net on this dataset. This may seem counter-intuitive, but only a limited number of volumetric samples were available for the 3D model. Data augmentation could improve the performance of both models and is discussed in Chapter 11.


There are many variations of U-nets in the literature [6-9, 11, 14, 16-22], and it should be noted that this chapter does not intend to fully evaluate every configuration of the network structures or compare them. Rather, it seeks to evaluate two basic 2D and 3D U-nets so that their effectiveness can be better understood. How to choose comparable network structures and blocks so that a 2D U-net and a 3D U-net have similar capacity also remains an open question. In this chapter, the 2D and 3D U-nets were configured to be as similar as possible in terms of network structure, but with different numbers of channels in order to fully use the GPU memory. An alternative would be to keep the number of channels the same and reduce the capacity of the 2D network according to the memory constraint of the 3D network.

The choice of batch size could also affect learning ability. Because of the GPU memory limit, the batch size for the 3D U-net cannot be as large as for the 2D U-net. Moreover, the patch size must be restricted for the 3D U-net for the same reason. In this chapter, the patch size was set to 128 x 128 x 128 for the 3D models. This limits the training of the 3D U-net, because the patch is relatively small compared to the size of the image (typically 500 voxels along each axis after resampling to 1 mm isotropic resolution). The choice of patch size is also closely related to the padding in the convolution blocks, as previously mentioned. Ideally, one should not use padding and should compute the loss only on the reduced-size patches output by the network, which eliminates padding discrepancies. However, as the network becomes deeper, the resultant feature map becomes very small. An effective way to overcome this problem is to use a cascade model, whereby a coarse-resolution segmentation is used to crop the fine-resolution images so that segmentation at isotropic 1 mm resolution can be performed with a fine-resolution network.
