# Compound Loss Functions

Many of the loss functions discussed in this chapter exhibit unique properties which make them well suited for segmentation tasks. Occasionally, however, problems require properties at the intersection of multiple loss functions. Fortunately, different loss functions can be combined to span a larger set of properties.

## Dice + Cross Entropy

The combination of cross entropy and Dice loss is a popular pairing [15]. Alone, the Dice loss is robust to minor class imbalances but does not allow for weighting of false positives or false negatives. The two terms within a weighted binary cross entropy function, however, can be weighted to increase or decrease the penalty for false negatives or false positives. When Dice and cross entropy losses are combined, the result is a loss function that is partially robust to class imbalance and has adjustable sensitivity to false positive and false negative predictions.
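As a minimal sketch of this pairing, the functions below blend a soft Dice loss with a weighted binary cross entropy over a flattened list of foreground probabilities. The function names, the `mix` blending factor, and the `fn_weight` and `fp_weight` parameters are illustrative assumptions rather than the formulation used in [15], and plain Python is used only to keep the arithmetic visible; a practical implementation would operate on tensors in a deep learning framework.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-7):
    """Return 1 - soft Dice coefficient for flattened foreground probabilities."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def weighted_bce_loss(probs, targets, fn_weight=1.0, fp_weight=1.0, eps=1e-7):
    """Mean binary cross entropy with a separate weight on each of its two terms."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= fn_weight * t * math.log(p)                # penalizes missed foreground
        total -= fp_weight * (1.0 - t) * math.log(1.0 - p)  # penalizes spurious foreground
    return total / len(probs)


def dice_ce_loss(probs, targets, mix=0.5, fn_weight=1.0, fp_weight=1.0):
    """Linear blend of both losses; `mix` is the share given to cross entropy."""
    ce = weighted_bce_loss(probs, targets, fn_weight, fp_weight)
    dice = soft_dice_loss(probs, targets)
    return mix * ce + (1.0 - mix) * dice
```

Raising `fn_weight` above `fp_weight` makes missed foreground voxels more costly than spurious ones, while the Dice term keeps the loss tied to the overall overlap.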

## Dice + Focal Loss

A further example of a combined loss function is Dice loss and focal loss [16]. More precisely, this loss function implementation utilized the Tversky loss function with $\alpha = \beta = 0.5$, which is equivalent to the Dice loss, although these hyperparameters could have been tuned differently for this task. Through the combination, this joint loss function gains both the volumetric dependency of the Dice loss and the focal loss property of increased importance of highly uncertain predictions.
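The sketch below illustrates this combination for a binary task, assuming flattened foreground probabilities. With both Tversky weights set to 0.5, the Tversky index TP / (TP + 0.5 FP + 0.5 FN) equals the Dice coefficient 2 TP / (2 TP + FP + FN). The function names and the `focal_weight` and `gamma` defaults are assumptions chosen for illustration and are not taken from [16].

```python
import math


def tversky_loss(probs, targets, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index; alpha weights false positives, beta false negatives."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1.0 - t) for p, t in zip(probs, targets))
    fn = sum((1.0 - p) * t for p, t in zip(probs, targets))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; gamma = 0 recovers plain cross entropy."""
    total = 0.0
    for p, t in zip(probs, targets):
        p_true = p if t == 1 else 1.0 - p    # probability assigned to the true class
        p_true = min(max(p_true, eps), 1.0)  # clamp to avoid log(0)
        # (1 - p_true)^gamma shrinks the contribution of confident, correct voxels
        total -= (1.0 - p_true) ** gamma * math.log(p_true)
    return total / len(probs)


def dice_focal_loss(probs, targets, focal_weight=0.5, gamma=2.0):
    """Dice term (Tversky with alpha = beta = 0.5) plus a weighted focal term."""
    dice_term = tversky_loss(probs, targets, alpha=0.5, beta=0.5)
    return dice_term + focal_weight * focal_loss(probs, targets, gamma)
```

Choosing `beta` greater than `alpha` in `tversky_loss` would penalize false negatives more heavily than false positives, which is the tuning freedom that the Tversky form leaves open.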

## Non-Linear Combinations

To generate the most utility from a combined loss function, the balance between the terms should exhibit non-linear behavior. A strong loss function combination should choose loss functions which each possess unique properties. For some tasks, these behaviors can be more powerful at the early or late stages of training.

For example, take the Hausdorff loss function. The 100th percentile Hausdorff distance is highly sensitive to spatial outliers, which limits its usefulness during early training. However, this sensitivity becomes an asset during late training stages, as it can accurately discriminate against spatial outliers, thus fine-tuning performance.

Another example of a potential non-linear combination is Dice and focal loss. In the original loss function implementation, the Dice loss term dominates for epochs with poor validation set performance. Then, the importance of the focal loss term increases as the validation set performance improves. This gradual shift in balance allows the model to partially train on Dice loss before becoming dominated by focal loss and being penalized for high prediction uncertainty.
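One way to picture such a performance-dependent balance is sketched below. The weighting rule, a power of the most recent validation Dice score, and the `sharpness` parameter are hypothetical choices made for illustration; they are not the schedule of the original implementation.

```python
def focal_weight(val_dice, sharpness=2.0):
    """Hypothetical non-linear schedule: the focal share grows with validation Dice."""
    val_dice = min(max(val_dice, 0.0), 1.0)  # keep the score inside [0, 1]
    return val_dice ** sharpness


def scheduled_loss(dice_loss, focal_loss, val_dice, sharpness=2.0):
    """Blend two precomputed loss values using the current validation Dice score."""
    w = focal_weight(val_dice, sharpness)
    return (1.0 - w) * dice_loss + w * focal_loss
```

With `sharpness` above 1, the focal term stays small while validation performance is poor and takes over only as the score approaches 1, so the Dice term dominates early training.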

It should be noted that non-linear loss function combinations will require additional hyperparameter tuning and are more likely to train inconsistently. A suggested workflow is to begin training the model with only the initially dominant term. Then, once hyperparameter-tuned, the loss function can be expanded with the minor terms, before re-tuning the hyperparameters.

# Dealing with Imperfect Data

For most medical image segmentation tasks, the training data set must be large, diverse, and high quality. Unfortunately, particularly in medicine, creating such a training set is a time-consuming undertaking. This is particularly problematic when the generation of ground truth labels requires an expert, whose time is likely at a premium.

An ongoing field of research attempts to create methods and loss functions to train high quality models from imperfect data. In many clinical cases, only the clinically relevant subset of organs-at-risk is segmented. This means that the original clinical dataset may not be densely populated with all structures on all cases. For cases that lack a labeled structure, gradient backpropagation will penalize a model’s potentially accurate prediction due to imperfections in the ground truth.

A few attempts to account for imperfect data, particularly sparsely labeled ground truths, have achieved success through modification of the loss function. For example, Bokhorst et al. [17] trained a U-net model from sparsely labeled histology images by only backpropagating the loss function from channels which had “valid” ground truth labels. Zhu et al. [16] extended this concept by not only masking out channels without “valid” ground truths but also weighting each class by the inverse of its occurrence. In doing so, the loss function compensated for the inter-class imbalance deriving from the sparsely labeled ground truth. Although these are promising first steps, the further adaptation of loss functions to train robustly on imperfect data will continue to garner interest for medical image segmentation. For further discussion on data set preparation, see Chapter 14.
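The masking and weighting ideas can be sketched as follows for a per-class Dice loss. The data layout (one flattened probability list per class), the `valid` flags, and the normalization of the weights are assumptions made for this illustration and do not reproduce the exact formulations of [16] or [17].

```python
def inverse_frequency_weights(label_counts):
    """Weight each class by the inverse of how often it is labeled, summing to 1."""
    inverse = [1.0 / count if count > 0 else 0.0 for count in label_counts]
    total = sum(inverse)
    return [w / total for w in inverse]


def masked_dice_loss(probs, targets, valid, class_weights, eps=1e-7):
    """Weighted mean Dice loss over the classes that carry a valid label.

    probs, targets: one flattened list per class.
    valid: one boolean per class, False when the structure was not annotated.
    """
    loss, weight_sum = 0.0, 0.0
    for p_c, t_c, is_valid, w in zip(probs, targets, valid, class_weights):
        if not is_valid:
            continue  # an unlabeled channel adds no loss and therefore no gradient
        intersection = sum(p * t for p, t in zip(p_c, t_c))
        dice = (2.0 * intersection + eps) / (sum(p_c) + sum(t_c) + eps)
        loss += w * (1.0 - dice)
        weight_sum += w
    return loss / weight_sum if weight_sum > 0 else 0.0
```

Because skipped channels never enter the sum, a correct prediction for an unannotated structure is not penalized, and the inverse-frequency weights keep rarely labeled structures from being drowned out by common ones.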

# Evaluating a Loss Function

In the preceding sections, many differing loss functions and their applications were discussed. With the numerous loss function choices, picking a starting point can be overwhelming. A decision tree to help choose an initial loss function is provided in Figure 10.6. However, to get the most out of the chosen loss function, a user should understand how to evaluate and tune the loss function’s performance.

Typical deep learning strategy dictates that a dataset be separated into three unique subsets: training, validation, and testing. The training set, as the name implies, is used to train the model and is the largest of the three subsets. During the training process, predictions made from this data are used for backpropagation weight updates. Following every epoch, the model makes predictions on a smaller subset of data, the validation set, without updating the model’s weights.

It should be repeated that deep learning models are lazy and will take whatever shortcuts are available. Commonly, this shortcut is overfitting by memorization. When a model memorizes, it begins to perform outstandingly on the training dataset without learning generalizable features, which means it cannot replicate this performance on unseen data, such as the validation set. The model’s progress can be monitored in real time by frequently predicting on the validation dataset, preventing time from being wasted when training is non-ideal. Typically, the relationship between training and validation loss falls into one of four categories, as shown in Figure 10.7.
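A three-way split of the kind described above can be sketched as follows. The 70/15/15 proportions, the fixed seed, and the function name are assumptions for illustration; splitting is done on case identifiers so that no case contributes to more than one subset.

```python
import random


def split_dataset(case_ids, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle case identifiers and return disjoint (train, val, test) lists."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_val = int(round(len(ids) * val_fraction))
    n_test = int(round(len(ids) * test_fraction))
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]      # the remainder, and the largest subset
    return train, val, test
```

The test subset is set aside here and left untouched until training and tuning are complete.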

FIGURE 10.6 A flowchart to aid in determining the proper loss function for a given task.

FIGURE 10.7 A representation of different types of relationships between the training loss (slightly lighter) and validation loss (slightly darker). Top-left: A model which does not train. Top-right: A highly imbalanced data set with a poorly suited loss function. Bottom-left: A model which overfits on the training set. Bottom-right: A model which trains.

A model that consistently performs poorly on both losses across all epochs, as seen in the top-left of Figure 10.7, is indicative of a model that is not training. Unfortunately, there is no clear-cut reason why a model does not train, but troubleshooting should progress through the training process. Beginning with the data, this issue may arise from training data or ground truth labels that are incorrectly formatted or not properly corresponding. Within the model, errant graph connections or incorrect final activation and loss function pairings can prevent the model from properly backpropagating the gradient. Finally, hyperparameters may be poorly selected, causing weights to change too quickly or too coarsely to successfully converge to a minimum.

A model which immediately produces outstanding and desirable results, like that shown in the top-right of Figure 10.7, is indicative of a highly unbalanced task paired with a loss function that does not account for the imbalance. At the start of training, a model’s weights are randomly initialized, so the model is never expected to perform perfectly after only a few iterations of the training cycle. This behavior is typically characterized by a model becoming trapped in an overwhelming local minimum, such as predicting one class for the entire volume. This can be troubleshot through experimentation with alternative loss functions.

An overfitting model, as given in the bottom-left of Figure 10.7, has a training loss that consistently decreases while the validation loss remains unchanged. To prevent overfitting, common techniques are to introduce dropout into the model or to utilize optimizer regularization. Additionally, the training data can be augmented to simulate a more diverse dataset. Approaches to data augmentation are discussed in Chapter 11.

When everything comes together, and a deep learning model learns properly, both the training and validation losses are expected to decrease relatively steadily and asymptotically to the same value, as shown in the bottom-right of Figure 10.7. It is important to note that the rate of convergence will vary based on task, model, and optimizer. In this instance, the model was able to learn generalizable features from the training data and perform equally well on the validation set. The possibility exists, however, that the chosen loss function is not indicative of desired performance. To check this, the model’s predictions on the validation set should be compared to the ground truth with additional metrics. If these metrics also indicate strong performance, a final prediction on the test set can be made. A more detailed discussion of evaluation of model performance is presented in Chapter 15.
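The four patterns can be told apart with a rough heuristic such as the one sketched below. It assumes losses bounded in [0, 1] (as with a Dice loss) recorded once per epoch; the thresholds, the function name, and the returned labels are hypothetical and would need adjusting for any real task.

```python
def diagnose_curves(train_loss, val_loss, flat_tol=0.05, gap_tol=0.1,
                    early_epochs=2, near_zero=0.05):
    """Roughly assign per-epoch loss histories to one of four patterns."""
    early = list(train_loss[:early_epochs]) + list(val_loss[:early_epochs])
    if max(early) < near_zero:
        # Near-perfect losses within the first epochs: check the class balance
        # and whether the loss suits the task.
        return "suspicious"
    train_drop = train_loss[0] - train_loss[-1]
    val_drop = val_loss[0] - val_loss[-1]
    if train_drop < flat_tol and val_drop < flat_tol:
        return "not_training"  # neither loss improves
    if val_loss[-1] - train_loss[-1] > gap_tol:
        return "overfitting"   # training loss falls while validation lags behind
    return "training"          # both losses fall and end close together
```

A result of `"training"` only says that the curves look healthy; the additional metrics mentioned above are still needed to confirm that the loss reflects the desired performance.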

For a deep learning model to converge upon a generalizable solution, the method by which it gauges performance, the loss function, must be carefully chosen. The loss function dictates the backpropagation process, and in turn how a model learns, because the loss function quantifies the fitness of the model’s predictions. While educated guessing may assist in selecting a loss function, finding the ideal function typically requires experimentation with different loss functions or combinations. The most popular loss functions were described within this chapter, but there exist many niche functions which were not discussed. As techniques for medical image segmentation evolve, pioneering individuals will continue to develop novel loss functions capable of greater admissibility and ease of trainability.

# References

• 1. Li H, Xu Z, Taylor G, Studer C, Goldstein T. Visualizing the Loss Landscape of Neural Nets. arXiv:171209913 [cs, stat] [Internet]. 2017 Dec 28 [cited 2019 Jul 9]. Available from: http://arxiv.org/abs/1712.09913
• 2. Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging. 2015;34(10):1993-2024. PMID: 25494501
• 3. Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby JS, et al. Advancing the Cancer Genome Atlas glioma MRI Collections with Expert Segmentation Labels and Radiomic Features. Sci Data. 2017;4:170117. PMID: 28872634
• 4. Bakas S, Reyes M, Jakab A, Bauer S, Rempfler M, Crimi A, et al. Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge. 2019 Apr 12 [cited 2020 Oct 2]. Available from: https://www.repository.cam.ac.uk/handle/1810/291597
• 5. Dice LR. Measures of the Amount of Ecologic Association Between Species. Ecology [Internet]. 1945;26(3):297-302 [cited 2019 Jun 24]. Available from: https://esajournals.onlinelibrary.wiley.com/doi/abs/10.2307/1932409
• 6. Milletari F, Navab N, Ahmadi S. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV). 2016. pp. 565-571.
• 7. Karimi D, Salcudean SE. Reducing the Hausdorff Distance in Medical Image Segmentation with Convolutional Neural Networks. arXiv:190410030 [cs, eess, stat] [Internet]. 2019 Apr 22 [cited 2020 Mar 30]. Available from: http://arxiv.org/abs/1904.10030
• 8. Maier O. loli/medpy [Internet]. 2020 [cited 2020 Oct 2]. Available from: https://github.com/loli/medpy
• 9. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF, editors. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. Springer International Publishing; 2015. pp. 234-241. (Lecture Notes in Computer Science).
• 10. Ribera J, Güera D, Chen Y, Delp EJ. Locating Objects Without Bounding Boxes. arXiv:180607564 [cs] [Internet]. 2019 Apr 3 [cited 2020 Mar 30]. Available from: http://arxiv.org/abs/1806.07564
• 11. Sudre CH, Li W, Vercauteren T, Ourselin S, Cardoso MJ. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support Lecture Notes in Computer Science [Internet]. 2017;10553:240-248 [cited 2019 Jun 24]. Available from: http://arxiv.org/abs/1707.03237
• 12. Lin T-Y, Goyal P, Girshick R, He K, Dollár P. Focal Loss for Dense Object Detection. arXiv:170802002 [cs] [Internet]. 2018 Feb 7 [cited 2020 Mar 24]. Available from: http://arxiv.org/abs/1708.02002
• 13. Brosch T, Yoo Y, Tang LYW, Li DKB, Traboulsee A, Tam R. Deep Convolutional Encoder Networks for Multiple Sclerosis Lesion Segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF, editors. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. Springer International Publishing; 2015. pp. 3-11. (Lecture Notes in Computer Science).
• 14. Salehi SSM, Erdogmus D, Gholipour A. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. arXiv:170605721 [cs] [Internet]. 2017 Jun 18 [cited 2019 Aug 24]. Available from: http://arxiv.org/abs/1706.05721
• 15. Taghanaki SA, Zheng Y, Zhou SK, Georgescu B, Sharma P, Xu D, et al. Combo Loss: Handling Input and Output Imbalance in Multi-Organ Segmentation. arXiv:180502798 [cs] [Internet]. 2018 Oct 22 [cited 2020 Mar 23]. Available from: http://arxiv.org/abs/1805.02798
• 16. Zhu W, Huang Y, Zeng L, Chen X, Liu Y, Qian Z, et al. AnatomyNet: Deep Learning for Fast and Fully Automated Whole-Volume Segmentation of Head and Neck Anatomy. Med Phys [Internet]. 2019;46(2):576-89 [cited 2020 Mar 30]. Available from: https://aapm.onlinelibrary.wiley.com/doi/abs/10.1002/mp.13300
• 17. Bokhorst JM, Pinckaers H, van Zwam P, Nagtegaal I, Laak J van der, Ciompi F. Learning from Sparsely Annotated Data for Semantic Segmentation in Histopathology Images. 2018 [cited 2020 Mar 30]. Available from: https://openreview.net/forum?id=SkeBT7BxeV