# Effect of Loss Functions in Deep Learning-Based Segmentation

**Evan Porter, David Solis, Payton Bruckmeier, Zaid A. Siddiqui, Leonid Zamdborg, and Thomas Guerrero**

## Introduction

Traditional problem-solving algorithms define a problem and a specific set of steps required to arrive at a solution. In contrast, a deep learning model is a statistical framework, which, when trained stochastically, arrives at a solution. For the model to effectively converge to a solution, it must be able to evaluate the quality of candidate solutions as it learns. Loss functions, also called objective functions or cost functions, quantify the quality of a candidate solution during the model training process. For each step during training, the model’s weights are progressively updated to yield predictions that minimize the loss function. Because the loss function dictates the model’s measure of success and the degree to which the weights are updated, choosing the proper loss function for a given task is vital.

At the beginning of training a model, the weights are randomly initialized and generally incapable of making any useful predictions. However, through backpropagation training, models can learn to solve tasks across many divergent domains. Take, for example, the simple problem of segmenting the skull on a CT image, as shown in Figure 10.1.

The backpropagation training process is broken into three steps: prediction, evaluation, and backpropagation. During the first step, the training input data flows through the model, which is simply a series of mathematical operations, most commonly convolutional operations. The data returned by the model is referred to as a prediction. In the skull segmentation example, the model is provided with a two-dimensional CT image slice as input, from which it generates a prediction for a segmentation mask. In the example in Figure 10.1, the current model’s skull prediction is non-ideal, and further training, or updates to the model’s weights, is warranted. Next, the error of the prediction, in relation to the ground truth, is calculated using the loss function. In the final training step, the gradient of the error is calculated with respect to each model weight. Then every weight is updated by the scaled gradient of the error, with the intent of minimizing each weight’s contribution to the error in subsequent predictions. The scaling factor, commonly called the learning rate, is represented by *X* in Figure 10.1. Therefore, to allow for backpropagation training, a loss function must have scalar-valued output and be differentiable with respect to the model weights. A complete training process repeats these three steps until the output of the loss function, or prediction error, is minimized. Ideally, upon finishing training, the model weights should converge upon a state capable of robustly solving the given task.
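The three steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the chapter's model: it assumes a hypothetical one-weight logistic model, a mean squared error loss, and made-up normalized intensities standing in for "bone" versus "not bone" pixels. The names `predict`, `loss_fn`, `gradients`, and `train` are illustrative only.

```python
import math

def predict(weight, bias, x):
    """Step 1 (prediction): forward pass of a one-weight logistic model."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

def loss_fn(predictions, targets):
    """Step 2 (evaluation): mean squared error, scalar-valued and differentiable."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def gradients(weight, bias, inputs, targets):
    """Step 3a: gradient of the loss with respect to each model weight."""
    grad_w = grad_b = 0.0
    for x, t in zip(inputs, targets):
        p = predict(weight, bias, x)
        # Chain rule: dL/dp * dp/dz, where z = weight * x + bias
        common = 2.0 * (p - t) * p * (1.0 - p) / len(targets)
        grad_w += common * x
        grad_b += common
    return grad_w, grad_b

def train(inputs, targets, learning_rate=0.5, steps=2000):
    # Fixed starting weights for reproducibility; in practice these are random.
    weight, bias = 0.1, 0.0
    history = []
    for _ in range(steps):
        predictions = [predict(weight, bias, x) for x in inputs]
        history.append(loss_fn(predictions, targets))
        grad_w, grad_b = gradients(weight, bias, inputs, targets)
        # Step 3b: update each weight by the gradient scaled by the learning rate.
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias, history

# Hypothetical normalized intensities and their ground truth labels (1 = bone).
inputs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
targets = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
weight, bias, history = train(inputs, targets)
print(f"loss before: {history[0]:.4f}, loss after: {history[-1]:.4f}")
```

Because the loss is a single scalar and is differentiable in `weight` and `bias`, each update has a well-defined direction, which is the requirement stated above.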

In addition to dictating what is learned, a loss function can influence how easily a model converges upon a solution. Like many optimization problems, the training of deep learning models utilizes a multi-dimensional gradient descent. A simple visual representation of the training process would be the act of navigating to the lowest point on an uneven surface, such as those shown in Figure 10.2. If the surface possesses many depressions in addition to the true lowest point, it is difficult to tell whether a given low point is the lowest globally or merely locally; after all, the only knowledge available is of the local surroundings, not of whether a deeper depression exists elsewhere on the surface.

To adapt this to deep learning terminology, the x and y axes of the surface represent all potential model weight combinations, and the z-axis indicates the value of the loss function for the current weight combination. During training, the model is initialized randomly within the space of possible weights. Then, as the model trains, it explores the space of its possible weight combinations to minimize the loss function. Optimal loss functions therefore have an easily computed gradient path towards the global minimum.

The set of weights which minimizes the loss function is referred to as the global minimum, and the other sets of weights which produce loss values lower than their surroundings are referred to as local minima. If a loss function completely unsuited to the data is selected, it is unlikely the model will train at all. An example of such a visualized loss space [1] is given in Figure 10.2a. If a poorly suited, but trainable, loss function is chosen instead, there will be both a global minimum and local minima, as in Figure 10.2b. However, if a carefully chosen loss function well suited to the task is used, finding the global minimum will be both simple and efficient, as seen in Figure 10.2c.

FIGURE 10.1 The steps in training a deep learning model. Step 1, from the training data, a prediction is made. Step 2, using the loss function, the ground truth and prediction are compared, and an error is determined. Step 3, each weight is updated proportionally to the gradient of the error.

FIGURE 10.2 A visual depiction of loss functions, where the x and y axes represent model weight combinations and the z-axis is the value of the loss function: (A) an incorrectly chosen loss function, (B) a poorly suited loss function, and (C) an easily trainable loss function.
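The difference between a global and a local minimum can be illustrated with a one-dimensional toy loss. The sketch below assumes a hypothetical loss surface, w⁴ − 3w² + w, chosen only because it has one global and one local minimum; it is not derived from any segmentation loss. Plain gradient descent ends in a different minimum depending on where the single weight starts.

```python
def loss(w):
    """Hypothetical non-convex loss with a global and a local minimum."""
    return w**4 - 3.0 * w**2 + w

def loss_gradient(w):
    """Analytic derivative of the toy loss."""
    return 4.0 * w**3 - 6.0 * w + 1.0

def descend(w, learning_rate=0.01, steps=500):
    """Plain gradient descent using only local slope information."""
    for _ in range(steps):
        w -= learning_rate * loss_gradient(w)
    return w

w_from_right = descend(2.0)   # settles in the shallower, local minimum
w_from_left = descend(-2.0)   # settles in the deeper, global minimum

print(f"start  2.0 -> w = {w_from_right:.3f}, loss = {loss(w_from_right):.3f}")
print(f"start -2.0 -> w = {w_from_left:.3f}, loss = {loss(w_from_left):.3f}")
```

Both runs stop where the gradient vanishes, and neither has any way of knowing whether a deeper minimum exists elsewhere.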

A well-chosen loss function plays a significant role in reaching an optimal solution for a given deep learning task. This chapter covers the necessary elements of a loss function, the challenges that segmentation tasks pose for loss functions, common loss functions and their applications, dealing with imperfect data, choosing a starting loss function, and troubleshooting methods to help overcome frequent challenges in medical image segmentation.

## Admissibility of a Loss Function

To understand the importance of admissibility, imagine that two people are bidding to build a fence enclosure for a farmer’s sheep. The farmer tells both designers only that whoever designs the fence with the shortest length will be hired. The first designer, using his knowledge of geometry, designs a circular fence, large enough to encircle the flock. The second designer, on the other hand, proposes to build a fence only around himself, declaring himself ‘outside’ the fence. Clearly, this second solution fails to enclose the flock, which is the original purpose of building a fence. However, the farmer presented the ideal solution as that which minimized fence length, not that which minimized the danger to the sheep. In a deep learning context, the farmer’s loss function, length of fence, was not admissible with respect to his true intentions behind building the fence.

While the second solution may seem outlandish, deep learning models are inherently prone to converging upon these lazy solutions. For segmentation tasks, common lazy solutions are models which do not predict every structure, models which predict highly smoothed structures, and models which uniformly predict a single structure. To prevent these lazy solutions, a loss function must be carefully chosen which defines the ideal solution to the task, minimizes the risk of unintended results, and ensures effective convergence to a robust solution.
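A small numerical sketch shows how a lazy solution can score well under a measure that does not match the intent. The two metrics used here, per-pixel accuracy and the Dice overlap coefficient, are illustrative choices made for this example, and the 10 × 10 mask with a four-pixel structure is invented.

```python
def accuracy(prediction, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    matches = sum(p == t for p, t in zip(prediction, truth))
    return matches / len(truth)

def dice(prediction, truth):
    """Dice overlap: 2 * |P and T| / (|P| + |T|); 1.0 if both masks are empty."""
    intersection = sum(p and t for p, t in zip(prediction, truth))
    total = sum(prediction) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total

def mask_with(indices, size=100):
    """Flattened 10 x 10 binary mask with the given pixel indices set to 1."""
    mask = [0] * size
    for i in indices:
        mask[i] = 1
    return mask

truth = mask_with([44, 45, 54, 55])      # small structure: 4 of 100 pixels
lazy = mask_with([])                     # predicts background everywhere
imperfect = mask_with([44, 45, 54, 56])  # one missed pixel, one false positive

for name, pred in (("lazy", lazy), ("imperfect", imperfect)):
    print(f"{name}: accuracy = {accuracy(pred, truth):.2f}, "
          f"dice = {dice(pred, truth):.2f}")
```

The all-background prediction reaches 96% accuracy while finding none of the structure, whereas the overlap score separates it clearly from the genuine but imperfect prediction.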

## Presenting the Problem

The remainder of this chapter covers the proper combination of ground truth data and loss functions and presents a selection of different losses useful for image segmentation. For discussion, a segmentation task is considered where a ground truth label mask is available in which each voxel is designated as either a member of the class or not. These ground truth label masks can be organized as either a multi-label or a multi-class segmentation task, both of which can be used to train a deep learning model. Multi-label and multi-class segmentation are discussed further in Chapter 9, but a brief discussion is included here, since the definition of the problem impacts the choice of loss function.

A multi-label segmentation allows each voxel to be a member of multiple classes, or of no class at all. An example of a multi-label segmentation is a patient with multiple thoracic structures and a body contour. In this case, every voxel classified as “heart” would also be a member of the “body” class, and any voxel exterior to the body would not need to be a member of any class.

A multi-class segmentation is a restriction of a multi-label segmentation task, in which the classifications of each voxel are mutually exclusive. This means that each voxel must be a member of exactly one segmentation class. For example, when contouring the left and right lung, each voxel will be one of three classes: left lung, right lung, or neither lung. Through the inclusion of the “neither” class, also referred to as the “background” class, the problem allows for every voxel to be a member of a class. To restrict voxels from having membership of multiple classes, or likewise to reduce a multi-label to a multi-class segmentation problem, binary operators (i.e. AND, OR, and NOT) can be utilized.
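As a minimal sketch of that reduction, the example below takes two overlapping multi-label masks, "body" and "heart", on a made-up row of six voxels and derives three mutually exclusive classes with AND, OR, and NOT. The class numbering (0 = background, 1 = body only, 2 = heart) and the function name are assumptions made for illustration.

```python
# Multi-label masks: every "heart" voxel is also a "body" voxel.
body = [False, True, True, True, True, False]
heart = [False, False, True, True, False, False]

def to_multi_class(body, heart):
    """Reduce two overlapping masks to one mutually exclusive label per voxel."""
    labels = []
    for b, h in zip(body, heart):
        background = not (b or h)  # NOT (body OR heart)
        body_only = b and not h    # body AND (NOT heart)
        # Exactly one of the three conditions holds for every voxel.
        if background:
            labels.append(0)
        elif body_only:
            labels.append(1)
        else:
            labels.append(2)       # heart
    return labels

labels = to_multi_class(body, heart)
print(labels)
```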

Strict adherence to the multi-class labeling rules is important because any mislabeled voxels will interfere with the model’s training. Take, for example, a voxel which was not assigned any of left lung, right lung, or neither. During the training process, a prediction of any class membership will falsely be evaluated as an error and will be backpropagated into the model weights, potentially interfering with the otherwise properly trained parameters.
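A simple check before training can catch such voxels. The sketch below assumes the ground truth is stored as one binary channel per class (here flattened to a short list of voxels) and reports every voxel that is not a member of exactly one class. The function name `invalid_voxels` and the example masks are hypothetical.

```python
def invalid_voxels(one_hot_channels):
    """Return indices of voxels that do not belong to exactly one class."""
    num_voxels = len(one_hot_channels[0])
    return [
        i for i in range(num_voxels)
        if sum(channel[i] for channel in one_hot_channels) != 1
    ]

# Channels: left lung, right lung, neither (background).
left_lung = [1, 0, 0, 0, 1]
right_lung = [0, 1, 0, 0, 1]
neither = [0, 0, 0, 1, 0]

# Voxel 2 has no class assigned; voxel 4 is assigned to both lungs.
bad = invalid_voxels([left_lung, right_lung, neither])
print(bad)
```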

Although multi-class labeling restricts the preparation and data organization of the ground truth labels, doing so also restricts the complexity of any prediction. By reducing the degrees of freedom possible in a solution, the overall solution space is restricted, and the gradient descent is simplified. This means that, for most tasks, preparing the ground truth as a multi-class problem will result in quicker convergence to a solution.

As a depiction of both label types, Figure 10.3 demonstrates different representations of an arbitrary 2D image composed of a partially overlapping circle and triangle. Figure 10.3b shows a “one-hot encoded” multi-label representation of the original image, Figure 10.3a. In this case, a third dimension is added to the 2D image, with each position along this dimension called a channel, where each channel represents membership of the pixel position in a different category, or class, of data. A pixel value of 1 in channel 1 (Figure 10.3b, left) indicates that the pixel belongs to the circle region, and a pixel value of 1 in channel 2 indicates that the pixel belongs to the triangle region. It is important to note that in a multi-label representation of the data, a given pixel position may hold a value of 1 in both channels, indicating that the pixel position belongs to both the circle region and the triangle region. This contrasts with the multi-class representation of the dataset, which must hold mutually exclusive classifications. In Figure 10.3c, a multi-class label-encoded data representation of Figure 10.3a is shown. In this representation, a unique integer label is assigned to each pixel, which indicates the classification category to which the pixel belongs: 0 - background, 1 - circle only, 2 - triangle only, 3 - intersection region of the circle and triangle. Because this is a multi-class representation, a new classification is needed to indicate membership of the pixel in the overlapping region. In Figure 10.3d, a one-hot encoded multi-class representation of Figure 10.3a is shown. In a similar fashion to Figure 10.3b, multiple channels are again utilized to indicate the category to which a given pixel belongs (from left to right): channel 1 - background, channel 2 - circle only, channel 3 - triangle only, channel 4 - circle and triangle intersection. As will be discussed later, though similar in their composition, the use of either a multi-label or multi-class representation (Figure 10.3b vs Figure 10.3d) for a dataset may hold distinct advantages for loss functions and their application.

FIGURE 10.3 (A) The original image of a circle and triangle sharing an overlapped region is shown. (B) A one-hot encoded multi-label representation of image A. (C) A multi-class label encoding (LE) representation of image A. (D) A one-hot encoded multi-class representation of image A.

The output of a neural network needs to match the dimensionality of the target ground truth labels. For segmentation, this requires a special output layer to convert the regression output of the network into class membership values for each voxel in the input. Multi-class segmentation requires a softmax function, a scaled activation which maps the network’s output to a normalized distribution representing the per-channel estimate of class membership (the class values for a given voxel prediction sum to one). Despite the output of a softmax activation being normalized, the model output should not be confused with a probabilistic (i.e. Frequentist or Bayesian) output for class membership. This means that applying probabilistic statistical tests to a network’s output, or using it as a probabilistic determination to inform clinical decisions, is not a valid interpretation. Instead, in order to make an inference, each voxel is assigned the class of the channel with the highest value, typically by applying a maximum argument (argmax) function, ensuring each voxel is a member of only a single class. However, during model training, the loss is computed from the outputs without the argmax function applied, so that the gradient of the error can be computed and backpropagated with respect to all possible classes.
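A minimal sketch of the multi-class output stage, using invented raw network outputs for two voxels and three classes: softmax normalizes each voxel's channel values so they sum to one, and argmax is applied only at inference to pick a single class. The max-subtraction inside `softmax` is a common numerical stability measure added here as an implementation choice.

```python
import math

def softmax(logits):
    """Normalize one voxel's channel values so that they sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the channel with the highest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical raw outputs: one list of three channel values per voxel.
voxel_outputs = [[2.0, 0.5, -1.0],
                 [0.1, 0.2, 3.0]]

scores = [softmax(v) for v in voxel_outputs]     # used when computing the loss
class_labels = [argmax(s) for s in scores]       # used only at inference

print(class_labels)
```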

For a model to achieve multi-label segmentation, the model should conclude with a sigmoid function as the final activation. This ensures that the model outputs normalized, class-independent, per-voxel class membership predictions. Since the sigmoid function is independent for each output channel, a voxel having membership in multiple classes is a valid prediction. Then, during inference, a sigmoid-activated prediction is rounded to the nearest binary value, allowing each voxel the potential of being a member of multiple classes. And, similarly to multi-class segmentation training, the loss function should be computed on the raw, or unrounded, predictions.
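The multi-label counterpart can be sketched the same way, again with invented raw outputs for two channels ("heart" and "body") at three voxels. Each channel passes through its own sigmoid, and at inference each value is thresholded at 0.5, which is one way to implement the rounding described above (an explicit threshold is used because Python's built-in `round` sends exactly 0.5 to 0).

```python
import math

def sigmoid(x):
    """Map one raw channel value to the range (0, 1), independently of others."""
    return 1.0 / (1.0 + math.exp(-x))

def to_binary(scores, threshold=0.5):
    """Round each channel's score to 0 or 1; several channels may be 1."""
    return [[1 if s >= threshold else 0 for s in voxel] for voxel in scores]

# Hypothetical raw outputs per voxel: [heart channel, body channel].
voxel_outputs = [[-3.0, -2.0],   # outside the body
                 [-1.0, 4.0],    # body, not heart
                 [2.5, 3.0]]     # heart, and therefore also body

scores = [[sigmoid(v) for v in voxel] for voxel in voxel_outputs]  # for the loss
masks = to_binary(scores)                                          # for inference

print(masks)
```

Unlike the softmax case, the per-voxel scores are not constrained to sum to one, so a voxel can be assigned to both classes or to neither.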