II Deep Learning for Auto-Segmentation
Introduction to Deep Learning-Based Auto-Contouring for Radiotherapy
Mark J. Gooding
Introduction
The 2017 AAPM Thoracic Autosegmentation Challenge could be regarded as a turning point for autosegmentation in radiation oncology. This was the first challenge in the domain in which deep learning-based approaches were used, although not all entries used this method exclusively. All entries to the previous similar challenge in radiotherapy, at the 2015 conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), used either atlas-based or model-based segmentation approaches, while in the subsequent challenge at AAPM in 2019, all entries were deep learning-based. It was also in 2017 that the first journal publications using deep learning for organ-at-risk [1] and target volume [2-4] segmentation appeared, although conference publications had preceded these [5, 6]. Following this, an early clinical validation of commercial systems was reported in a journal publication as early as 2018 [7], again preceded by conference publications [8, 9].
Historical Context
Given this rapid shift in focus to using deep learning for autosegmentation in radiotherapy, one might be led to believe that this innovation was developed within the field of radiotherapy. However, a look at references of some of the earliest papers quickly leads back to the broader fields of medical imaging and computer vision. This innovation in autosegmentation for radiation oncology is very much standing on the shoulders of giants. Figure 6.1, reproduced from the work of Wang and Raj [10], gives an indication of the significant steps taken in the development of deep learning.
With the focus of this book being on autosegmentation, it is not the place of this chapter, or book, to provide a detailed history of deep learning or a technical introduction. For historical context,
Table 1: Major milestones that will be covered in this paper

Year | Contributor | Contribution
300 BC | Aristotle | Introduced Associationism, starting the history of humanity's attempts to understand the brain.
1873 | Alexander Bain | Introduced Neural Groupings as the earliest models of neural networks; inspired the Hebbian Learning Rule.
1943 | McCulloch & Pitts | Introduced the MCP Model, which is considered the ancestor of the Artificial Neural Model.
1949 | Donald Hebb | Considered the father of neural networks; introduced the Hebbian Learning Rule, which lays the foundation of modern neural networks.
1958 | Frank Rosenblatt | Introduced the first perceptron, which highly resembles the modern perceptron.
1974 | Paul Werbos | Introduced Backpropagation.
1980 | Teuvo Kohonen | Introduced the Self-Organizing Map.
1980 | Kunihiko Fukushima | Introduced the Neocognitron, which inspired the Convolutional Neural Network.
1982 | John Hopfield | Introduced the Hopfield Network.
1985 | Hinton & Sejnowski | Introduced the Boltzmann Machine.
1986 | Paul Smolensky | Introduced the Harmonium, which is later known as the Restricted Boltzmann Machine.
1986 | Michael I. Jordan | Defined and introduced the Recurrent Neural Network.
1990 | Yann LeCun | Introduced LeNet, showing the possibility of deep neural networks in practice.
1997 | Schuster & Paliwal | Introduced the Bidirectional Recurrent Neural Network.
1997 | Hochreiter & Schmidhuber | Introduced LSTM, solving the problem of vanishing gradients in recurrent neural networks.
2006 | Geoffrey Hinton | Introduced Deep Belief Networks and the layer-wise pretraining technique, opening the current deep learning era.
2009 | Salakhutdinov & Hinton | Introduced Deep Boltzmann Machines.
2012 | Geoffrey Hinton | Introduced Dropout, an efficient way of training neural networks.
FIGURE 6.1 Key stages in the development of deep learning. Table reproduced from Wang and Raj [10]. Reproduced with permission.
readers should look to the overviews of Wang and Raj [10] or of Schmidhuber [11], while for a preliminary introduction to artificial neural networks and deep learning there are numerous papers (e.g. [12, 13]) and books (e.g. [14, 15]) available, as well as a large number of online courses (e.g. [16, 17]) and videos (e.g. [18, 19]). Nevertheless, it is worth pursuing a brief recap, in order to motivate an analysis of why the field has migrated from atlas-based methods to deep learning and to assess what weaknesses may remain.
Artificial Neural Networks
Deep learning traces its roots back to work done in the 1950s and 1960s on artificial neurons. The concept of a mathematical model of a neuron contributing to a network behavior was first explored in 1943 by McCulloch and Pitts [20]. In their theoretical paper, they mathematically modeled each neuron as an "all-or-nothing" activation, exploring conceptually what this would mean for neurology/psychology. This model of a neuron could be implemented in simple binary logic, but it is limited in that any response to a stimulus becomes a simple logical algorithm rather than an attempt to accurately model biological behavior. In 1958, Rosenblatt proposed the "perceptron", in which neuron behavior was instead modeled probabilistically. Importantly, this model allowed for some element of learning, whereby the response to a stimulus varied with experience. The perceptron model was subsequently simulated on a computer, demonstrating machine learning [21] using what could be described as a neural network.
FIGURE 6.2 Model of an artificial neuron. The response R is determined based on whether the sum of the stimuli (here x, y, z) is greater than a threshold θ.
FIGURE 6.3 An example of the fully connected network architecture used in a study using neural networks to evaluate treatment plans [22]. In that study, 13 input nodes (treatment plan features) were used to predict a one-hot encoded score using five output nodes. Five hidden nodes were used, resulting in 90 weights to tune.
These artificial neurons follow a simple model, as illustrated in Figure 6.2; the response activation (R) is a binary output determined by whether the sum of all input stimuli (in this example three are shown, x, y, z; however, there could be more or fewer) exceeds a threshold (θ). In Rosenblatt's model, each neuron has a binary activation response, 1 or 0; however, multiple neurons are combined into an "A-unit" whose response is the weighted sum of multiple neurons. With varying thresholds for different neurons, a weighted output system is generated.
Fast-forwarding to modern networks, the approach used in artificial neural networks does not differ much from this early model. Each neuron takes a weighted sum of its input stimuli and provides an output response based on an activation function. This activation function is typically some form of continuous threshold function, such as a sigmoid.
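This neuron model can be sketched in a few lines of Python. The stimulus values, weights, and thresholds below are illustrative only; they are not taken from the original models:

```python
import math

def neuron(stimuli, weights, threshold):
    """Binary threshold neuron: fires (1) only if the weighted sum
    of the input stimuli exceeds the threshold, as in the early
    perceptron model."""
    s = sum(w * x for w, x in zip(weights, stimuli))
    return 1 if s > threshold else 0

def sigmoid_neuron(stimuli, weights, bias):
    """Modern variant: a continuous sigmoid activation instead of a
    hard threshold, making the response differentiable."""
    s = sum(w * x for w, x in zip(weights, stimuli)) + bias
    return 1.0 / (1.0 + math.exp(-s))

# Three stimuli, as in Figure 6.2 (x, y, z with equal weights)
print(neuron([1, 0, 1], [0.5, 0.5, 0.5], threshold=0.8))  # fires: 1
print(sigmoid_neuron([1, 0, 1], [0.5, 0.5, 0.5], bias=-0.8))
```

The sigmoid variant returns a value just above 0.5 here rather than a hard 1, which is what makes gradient-based training possible.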
While there have been many notable contributions to the development of neural networks, as illustrated in Figure 6.1, two factors may be considered as critical to the success that has been achieved in autocontouring: convolution and computation.
Convolutional Neural Networks
A basic neural network design is to have every layer fully connected to the next, as illustrated in Figure 6.3. Such a network architecture works well for small numbers of inputs and a few hidden layers. However, a challenge arises with images owing to the number of potential inputs. A CT image, as used for radiotherapy treatment planning, is normally 512 pixels square, and a planning volume may consist of 200 or so such 2D images. Thus, a 2D processing network would have 262,144 inputs. Fully connecting such an input to a single hidden layer of the same size would result in 68.7 billion parameters (the input weights) to be tuned, making training intractable.
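The arithmetic behind that count can be checked directly (weights only, biases ignored; a hidden layer the same width as the input is assumed):

```python
# Back-of-the-envelope weight count for a fully connected layer
# operating on a single 512 x 512 CT slice.
inputs = 512 * 512        # 262,144 pixels fed into the network
hidden = inputs           # hidden layer of the same width
weights = inputs * hidden # one weight per input-to-node connection
print(f"{inputs:,} inputs -> {weights / 1e9:.1f} billion weights")
# 262,144 inputs -> 68.7 billion weights
```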
Convolutional networks recognize that “Distinctive features of an object can appear at various locations on the input image. Therefore it seems judicious to have a set of feature detectors that
FIGURE 6.4 The network architecture introduced by LeCun et al. [24]. Multiple convolution layers were used to enable complex features to be learned, before fully connected layers were used to do character classification.
can detect a particular instance of a feature anywhere on the input plane" [23]. Therefore, the same weights are used throughout the image. This sharing of weights between connections is convolution [24]. It achieves two things: the number of parameters for each layer is vastly reduced, since each layer only learns the required convolution kernels, and the network becomes less sensitive to the location of objects within the input array. Stacking multiple convolution layers, as shown in Figure 6.4, enables complex features to be learned and provides greater learning capacity, while at the same time keeping the number of parameters tractable. Recent networks used for organ-at-risk segmentation have had only on the order of 10-20 million weights to tune [25].
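A rough comparison of the two layer types makes the saving concrete. The 3x3 kernel and 64 feature maps below are illustrative choices, not drawn from any specific published network:

```python
def dense_params(n_in, n_out):
    """Weights in a fully connected layer (biases ignored)."""
    return n_in * n_out

def conv_params(kernel, ch_in, ch_out):
    """Weights in a 2D convolution layer: the kernel is shared across
    every image position, so the count is independent of image size."""
    return kernel * kernel * ch_in * ch_out

image = 512 * 512
print(dense_params(image, image))  # 68,719,476,736 weights
print(conv_params(3, 1, 64))       # 576 weights: 3x3 kernel, 64 feature maps
```

The convolutional layer's cost depends only on kernel size and channel counts, which is why even deep stacks of such layers remain in the tens of millions of weights rather than billions.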
Computational Power
Neural networks, and particularly convolutional neural networks, are inherently suited to parallel processing, both through task parallelization and data parallelization. During training, the gradient updates for each of the neurons can be computed in parallel. Furthermore, the computation can be split according to the data, with each item of data providing independent updates during training. Such parallel processing lends itself to the use of graphics processing units (GPUs) for computation. As shown in Figure 6.5, the processing power of GPUs has grown enormously in the past decade. At the same time, memory capacity and bandwidth have also increased, while costs have remained relatively stable, enabling wider participation of researchers without the need for access to high-performance computers.
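The data-parallel idea can be sketched with a toy example. The linear least-squares model below is chosen purely for brevity and is not a network from this chapter; the point is that each shard's gradient is independent, so the two gradient computations could run on separate devices before being averaged:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # a mini-batch of 8 samples
y = X @ np.array([1.0, -2.0, 0.5])   # targets from a known linear model
w = np.zeros(3)                      # parameters to learn

def grad(X_shard, y_shard, w):
    """Gradient of the mean squared error for one shard of the batch."""
    err = X_shard @ w - y_shard
    return 2 * X_shard.T @ err / len(y_shard)

# Data parallelism: each call below is independent of the other, so
# they could run on separate devices before the results are averaged.
g = np.mean([grad(X[:4], y[:4], w), grad(X[4:], y[4:], w)], axis=0)
w -= 0.1 * g                         # one synchronized update step
```

With equally sized shards, the averaged gradient is identical to the full-batch gradient, so parallelizing over data does not change the update that is applied.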
The IBM 704 used by Rosenblatt for the initial experiments in neural networks had a processing power of 4 kFLOPS [26], cost $2 million [27], and only 123 units were produced [27], whereas the NVIDIA Titan RTX produced in 2018 had a processing speed of 16.3 TFLOPS [28] and cost around $2,500 at first release. While NVIDIA does not disclose the number of units produced, distribution can be assumed to be in the millions (although only a fraction will be used for deep learning), with the company's revenues for 2018 totaling $2.21bn [29]. Such readily available processing power has not only enabled more researchers to participate in this field, but has also facilitated the training of deeper networks with many more degrees of freedom. In turn, with greater capacity to learn, more complex tasks have been tackled, such as organ-at-risk segmentation for radiotherapy.
FIGURE 6.5 Improvement in GPU processing power has been rapid over the last decade, enabling large networks to be trained and used in reasonable amounts of time. Figure reproduced using approximated data from the NVIDIA CUDA C Programming Guide [30].
TABLE 6.1
High-Level Consideration of the Properties of Various Segmentation Methods

Segmentation Type | Assumptions | Use of Data | Degrees of Freedom
Model-based | Organ shape can be represented by modes of variation | Data used to train parameters; data not required at runtime | On the order of thousands
Atlas-based | Atlas case(s) anatomically similar to the patient case | Data aligned to each patient at runtime | On the order of tens to hundreds of thousands, depending on the registration method
Deep learning-based | Information in the image can identify the object | Data used to train parameters; data not required at runtime | On the order of tens of millions