II Deep Learning for Auto-Segmentation

Introduction to Deep Learning-Based Auto-Contouring for Radiotherapy

Mark J. Gooding


The 2017 AAPM Thoracic Auto-segmentation Challenge could be regarded as a turning point for auto-segmentation in radiation oncology. This was the first challenge in the domain in which deep learning-based approaches were used, although not all entries used this method. All entries to the previous similar challenge in radiotherapy, at the 2015 conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), used either atlas-based or model-based segmentation approaches, while in the subsequent challenge at AAPM 2019, all entries were deep learning-based. It was also in 2017 that the first journal publications using deep learning for organ-at-risk [1] and target volume [2-4] segmentation appeared, although conference publications had preceded these [5, 6]. Following this, an early clinical validation of commercial systems was reported in a journal publication as early as 2018 [7], again preceded by conference publications [8, 9].

Historical Context

Given this rapid shift in focus to using deep learning for auto-segmentation in radiotherapy, one might be led to believe that this innovation was developed within the field of radiotherapy. However, a look at the references of some of the earliest papers quickly leads back to the broader fields of medical imaging and computer vision. This innovation in auto-segmentation for radiation oncology is very much standing on the shoulders of giants. Figure 6.1, reproduced from the work of Wang and Raj [10], gives an indication of the significant steps taken in the development of deep learning.

Table 1: Major milestones that will be covered in this paper

| Year | Contributor | Contribution |
| --- | --- | --- |
| 300 BC | Aristotle | introduced Associationism, started the history of human’s attempt to understand brain. |
| 1873 | Alexander Bain | introduced Neural Groupings as the earliest models of neural network, inspired Hebbian Learning Rule. |
| 1943 | McCulloch & Pitts | introduced MCP Model, which is considered as the ancestor of Artificial Neural Model. |
| 1949 | Donald Hebb | considered as the father of neural networks, introduced Hebbian Learning Rule, which lays the foundation of modern neural network. |
| 1958 | Frank Rosenblatt | introduced the first perceptron, which highly resembles modern perceptron. |
| 1974 | Paul Werbos | introduced Backpropagation |
| 1980 | Teuvo Kohonen | introduced Self Organizing Map |
| 1980 | Kunihiko Fukushima | introduced Neocognitron, which inspired Convolutional Neural Network |
| 1982 | John Hopfield | introduced Hopfield Network |
| 1985 | Hinton & Sejnowski | introduced Boltzmann Machine |
| 1986 | Paul Smolensky | introduced Harmonium, which is later known as Restricted Boltzmann Machine |
| 1986 | Michael I. Jordan | defined and introduced Recurrent Neural Network |
| 1990 | Yann LeCun | introduced LeNet, showed the possibility of deep neural networks in practice |
| 1997 | Schuster & Paliwal | introduced Bidirectional Recurrent Neural Network |
| 1997 | Hochreiter & Schmidhuber | introduced LSTM, solved the problem of vanishing gradient in recurrent neural networks |
| 2006 | Geoffrey Hinton | introduced Deep Belief Networks, also introduced layer-wise pretraining technique, opened current deep learning era. |
| 2009 | Salakhutdinov & Hinton | introduced Deep Boltzmann Machines |
| 2012 | Geoffrey Hinton | introduced Dropout, an efficient way of training neural networks |

FIGURE 6.1 Key stages in the development of deep learning. Table reproduced from Wang and Raj [10]. Reproduced with permission.

With the focus of this book being on auto-segmentation, it is not the place of this chapter, or book, to provide a detailed history of deep learning or a technical introduction. For historical context, readers should look to the reviews of Wang and Raj [10] or Schmidhuber [11], while for a preliminary introduction to artificial neural networks and deep learning, there are numerous papers (e.g. [12, 13]) and books (e.g. [14, 15]) available. There are also a large number of online courses (e.g. [16, 17]) and videos (e.g. [18, 19]). Nevertheless, it is worth pursuing a brief recap, in order to motivate an analysis of why the field has migrated to deep learning from atlas-based methods and to assess what weaknesses may remain.

Artificial Neural Networks

Deep learning traces its roots back to the work done in the 1950s and 1960s on artificial neurons. The concept of a mathematical model neuron contributing to a network behavior was first explored in 1943 by McCulloch and Pitts [20]. In their theoretical paper, they mathematically modeled each neuron as an “all-or-nothing” activation, exploring conceptually what this would mean for neurology/psychology. This model of a neuron could be implemented in simple binary logic but is limited in that any response to a stimulus becomes a simple logical algorithm rather than an attempt to accurately model biological behavior. In 1958, Rosenblatt proposed the “perceptron”, in which neuron behavior was modeled probabilistically instead. Importantly, this model allowed for some element of learning whereby the response to stimulus varied with experience. This perceptron model was subsequently simulated in a computer, demonstrating machine learning [21], using what could be described as a neural network.

FIGURE 6.2 Model of an artificial neuron. The response R is determined based on whether the sum of the stimuli (here x, y, z) is greater than a threshold θ.

FIGURE 6.3 An example of the fully connected network architecture used in a study using neural networks to evaluate treatment plans [22]. In that study, 13 input nodes (treatment plan features) were used to predict a one-hot encoded score using five output nodes. Five hidden nodes were used, resulting in 90 weights to tune.

These artificial neurons follow a simple model, as illustrated in Figure 6.2; the response activation (R) is a binary output determined by whether the sum of all input stimuli (in this example three are shown, x, y, z; however, there could be more or fewer) exceeds a threshold (θ). In Rosenblatt’s model, each neuron has a binary activation response, 1 or 0; however, multiple neurons are combined into an “A-unit” whose response is the weighted sum of multiple neurons. With varying thresholds for different neurons, a weighted output system is generated.

Fast-forwarding to modern networks, the approach used in artificial neural networks does not differ much from this early model. Each neuron takes a weighted sum of its input stimuli and provides an output response based on an activation function. This activation function typically is some form of continuous threshold function, such as a sigmoid.
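The two neuron models described above can be compared in a few lines of Python. This is a minimal sketch rather than an implementation from any of the cited works: the function names, the input values, and the use of a bias term standing in for a negated threshold are all illustrative assumptions.

```python
import math


def threshold_neuron(stimuli, threshold):
    """Binary response: 1 if the summed stimuli exceed the threshold, else 0."""
    return 1 if sum(stimuli) > threshold else 0


def sigmoid(value):
    """Smooth, continuous stand-in for a hard threshold."""
    return 1.0 / (1.0 + math.exp(-value))


def sigmoid_neuron(stimuli, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid activation.

    The bias plays the role of a negated threshold (an assumed convention).
    """
    weighted_sum = sum(w * s for w, s in zip(weights, stimuli)) + bias
    return sigmoid(weighted_sum)


# Illustrative stimuli for x, y, z (hypothetical values).
stimuli = [0.5, 0.2, 0.9]

print(threshold_neuron(stimuli, threshold=1.0))                     # 1, since 1.6 > 1.0
print(sigmoid_neuron(stimuli, weights=[1.0, 1.0, 1.0], bias=-1.0))  # about 0.646
```

The threshold neuron can only answer 1 or 0, whereas the sigmoid neuron gives a graded response that changes smoothly as the weights are adjusted, which is what makes gradient-based training possible.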

While there have been many notable contributions to the development of neural networks, as illustrated in Figure 6.1, two factors may be considered as critical to the success that has been achieved in auto-contouring: convolution and computation.

Convolutional Neural Networks

A basic neural network design is to have every layer fully connected to the next, as illustrated in Figure 6.3. Such a network architecture works well for small numbers of inputs and a few hidden layers. However, a challenge arises with images in the number of potential inputs. A CT image, as used for radiotherapy treatment planning, is normally 512 pixels square. A planning volume may consist of 200 or so such 2D images. Thus, a network processing a single 2D image would have 262,144 inputs. Fully connecting such an input to a single hidden layer of the same size would result in 68.7 billion parameters (the input weights) to be tuned, making training intractable.
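The weight counts quoted for fully connected networks can be checked with a short calculation. The sketch below uses a hypothetical helper, `dense_weight_count`, that counts connection weights only (no bias terms), and it assumes that the hidden layer for the CT example has as many nodes as there are input pixels, which is what the 68.7 billion figure implies.

```python
def dense_weight_count(layer_sizes):
    """Connection weights in a fully connected network, biases excluded."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


# Network of Figure 6.3: 13 inputs, 5 hidden nodes, 5 outputs.
small_network = dense_weight_count([13, 5, 5])

# One 512 x 512 CT slice fully connected to a hidden layer of equal size
# (the equal-size hidden layer is an assumption).
pixels = 512 * 512
ct_slice_network = dense_weight_count([pixels, pixels])

print(small_network)     # 90
print(pixels)            # 262144
print(ct_slice_network)  # 68719476736, i.e. about 68.7 billion
```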

FIGURE 6.4 The network architecture introduced by LeCun et al. [24]. Multiple convolution layers were used to enable complex features to be learned, before fully connected layers were used to do character classification.

Convolutional networks recognize that “Distinctive features of an object can appear at various locations on the input image. Therefore it seems judicious to have a set of feature detectors that can detect a particular instance of a feature anywhere on the input plane” [23]. Therefore, the same weights are required throughout the image. This sharing of weights between connections is convolution [24]. This achieves two things: the number of parameters for each layer is vastly reduced since each layer only learns the required convolution kernels, and the network becomes less sensitive to the location of objects within the input array. Stacking of multiple convolution layers, as shown in Figure 6.4, enables complex features to be learned and provides greater learning capacity, while at the same time keeping the number of parameters more tractable. Recent networks used for organ-at-risk segmentation have only had in the order of 10-20 million weights to tune [25].
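Weight sharing can be illustrated with a toy example. The sketch below slides one small kernel over an image (strictly a cross-correlation, which is the convention many deep learning libraries call convolution) and shows that the same four weights detect a feature wherever it appears. The kernel, image size, and helper names are illustrative assumptions, not taken from the cited networks.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image, reusing the same weights at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def place_blob(size, row, col):
    """Blank square image with a 2 x 2 bright blob whose top-left corner is (row, col)."""
    image = [[0] * size for _ in range(size)]
    for i in range(2):
        for j in range(2):
            image[row + i][col + j] = 1
    return image


def argmax2d(grid):
    """Position (row, col) of the largest response."""
    positions = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    return max(positions, key=lambda rc: grid[rc[0]][rc[1]])


kernel = [[1, 1], [1, 1]]  # a hypothetical 2 x 2 blob detector: four shared weights

print(argmax2d(convolve2d_valid(place_blob(8, 1, 1), kernel)))  # (1, 1)
print(argmax2d(convolve2d_valid(place_blob(8, 4, 3), kernel)))  # (4, 3)
```

Moving the blob moves the peak response by the same amount, while the number of weights stays at four regardless of image size. A fully connected layer would instead need a separate weight for every input-output pair.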

Computational Power

Neural networks, and particularly convolutional neural networks, are inherently suited to parallel processing, both through task parallelization and data parallelization. During training, the gradient updates for each of the neurons can be computed in parallel. Furthermore, the computation can be split according to the data, with each item of data providing independent updates during training. Such parallel processing lends itself to the use of graphics processing units (GPUs) for computation. As shown in Figure 6.5, the processing power of GPUs has grown enormously in the past decade. At the same time, memory capacity and bandwidth have also increased. Meanwhile, the costs have remained relatively stable, enabling a wider participation of researchers without the need to access high-performance computers.
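The idea that each item of data contributes an independent update can be sketched with a toy one-parameter model. The code below assumes a squared-error loss on a model `weight * x`, and simulates workers with plain Python lists rather than real devices; all names and values are hypothetical.

```python
import math


def sample_gradient(weight, x, y):
    """Gradient of the squared error (weight * x - y) ** 2 with respect to weight."""
    return 2.0 * (weight * x - y) * x


def batch_gradient(weight, samples):
    """Mean gradient computed over all samples in one place."""
    return sum(sample_gradient(weight, x, y) for x, y in samples) / len(samples)


def sharded_gradient(weight, samples, n_workers):
    """Mean gradient assembled from per-worker partial sums.

    Each shard's partial sum depends only on its own data, so in practice
    the shards could be processed simultaneously on separate devices.
    """
    shards = [samples[i::n_workers] for i in range(n_workers)]
    partial_sums = [
        sum(sample_gradient(weight, x, y) for x, y in shard) for shard in shards
    ]
    return sum(partial_sums) / len(samples)


# Hypothetical training pairs (x, y).
samples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]

full = batch_gradient(0.5, samples)
split = sharded_gradient(0.5, samples, n_workers=3)
print(math.isclose(full, split, rel_tol=1e-9))  # True
```

Because the per-sample gradients simply add, splitting the data across workers changes where the arithmetic is done but not the resulting update.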

The IBM 704 used by Rosenblatt for the initial experiments in neural networks had a processing power of 4 kFLOPS [26], cost $2m [27] and only 123 were produced [27], whereas the NVIDIA Titan RTX produced in 2018 had a processing speed of 16.3 TFLOPS [28], and cost around $2,500 on first release. While NVIDIA does not disclose the number of units produced, distribution can be assumed to be in the millions (although only a fraction will be used for deep learning) with the company’s revenues for 2018 totaling $2.21bn [29]. Such readily available processing power has not only enabled more researchers to participate in this field, but also facilitated training of deeper networks with many more degrees of freedom. In turn, with greater capacity to learn, more complex tasks have been tackled such as organ-at-risk segmentation for radiotherapy.
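The scale of this change can be made concrete with a rough calculation using only the figures quoted above. The prices are taken at face value and are not adjusted for inflation, so the cost ratio should be read as an order-of-magnitude indication only.

```python
# Figures as quoted in the text; prices are nominal (not inflation-adjusted).
ibm_704_flops = 4e3        # 4 kFLOPS
ibm_704_cost = 2e6         # $2m
titan_rtx_flops = 16.3e12  # 16.3 TFLOPS
titan_rtx_cost = 2.5e3     # about $2,500

speedup = titan_rtx_flops / ibm_704_flops
ibm_cost_per_flops = ibm_704_cost / ibm_704_flops
titan_cost_per_flops = titan_rtx_cost / titan_rtx_flops

print(f"Speed-up: {speedup:.3e}x")  # roughly four billion times faster
print(f"IBM 704: ${ibm_cost_per_flops:.0f} per FLOPS")
print(f"Titan RTX: ${titan_cost_per_flops:.2e} per FLOPS")
```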

FIGURE 6.5 Improvement in GPU processing power has been rapid over the last decade, enabling large networks to be trained and used in reasonable amounts of time. Figure reproduced using approximated data from the NVIDIA CUDA C Programming Guide [30].


High-Level Consideration of the Properties of Various Segmentation Methods

| Segmentation Type | Underlying Assumption | Use of Data | Degrees of Freedom |
| --- | --- | --- | --- |
| Model-based | Organ shape can be represented by modes of variation | Data used to train parameters. Data not required at run-time | In the order of thousands |
| Atlas-based | Atlas case(s) anatomically similar to patient case | Data aligned to each patient at run-time | In the order of tens to hundreds of thousands depending on the registration method |
| Deep learning-based | Information in the image can identify the object | Data used to train parameters. Data not required at run-time | In the order of tens of millions |
