Evaluation of Image-Based Atlas Selection

While there has been a wealth of research looking at approaches to atlas selection, few have addressed the underlying assumption as to whether the image similarity measured is a good surrogate for contouring performance. Although it has been shown that the contouring performance is improved by using a more similar image, e.g. Aljabar et al. [9], other studies have called into question whether such image-based measures work well. In motivating learned image-based selection, Sanroma et al. [15] state, “However, the problem of atlas selection still remains unexplored. Traditionally, image similarity is used to select a set of atlases. Unfortunately, this heuristic criterion is not necessarily related to the final segmentation performance”. Furthermore, Ramus and Malandain [16] found greater correlation of image-based selection methods with random selection than with ground truth measures of contouring performance. Thus, the question remains as to whether image-based selection is a good method of atlas selection. This section mirrors the investigation of Schipaanboord et al. [17] in assessing how good image-based atlas selection is compared to the optimal, but uses the data from the Thoracic Contouring challenge [18] as an example.


While the study by Schipaanboord et al. is valuable research using a large dataset, the use of proprietary code in their experiments and the restricted access to the clinical dataset mean that exact recreation of the results by others is impossible. To allow full reproducibility of this study and the figures presented in this chapter, the Python code implementation has been made available on GitHub at: https://github.com/Auto-segmentation-in-Radiation-Oncology/Chapter-3.

Brute-Force Search

First, the concept of an Oracle is introduced so that image-based selection can be compared with optimal selection. The Oracle has perfect foreknowledge of the result of auto-contouring and is able to select atlases based on the resulting contouring performance. At the other end of the selection spectrum is Random selection, whereby atlases are selected without any consideration of performance or similarity. The question to be assessed is where image-based selection lies between these two extremes.
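The three selection strategies can be sketched as follows. This is a minimal illustration, assuming that the DSC achieved by each atlas and an image similarity score per atlas are already available as dictionaries for one patient. The function names and the toy values are hypothetical and are not taken from the chapter's repository.

```python
import random


def oracle_select(dsc_by_atlas, k):
    """Pick the k atlases with the highest known DSC (perfect foreknowledge)."""
    return sorted(dsc_by_atlas, key=dsc_by_atlas.get, reverse=True)[:k]


def random_select(atlas_ids, k, seed=None):
    """Pick k atlases with no regard to performance or similarity."""
    return random.Random(seed).sample(sorted(atlas_ids), k)


def similarity_select(similarity_by_atlas, k, higher_is_better=True):
    """Pick the k atlases ranked best by an image similarity measure.

    Use higher_is_better=True for NMI and False for RMSE.
    """
    return sorted(similarity_by_atlas, key=similarity_by_atlas.get,
                  reverse=higher_is_better)[:k]


# Toy values for a single patient (illustrative only).
dsc = {"A": 0.62, "B": 0.81, "C": 0.74, "D": 0.55}
nmi = {"A": 1.21, "B": 1.18, "C": 1.25, "D": 1.10}

oracle_choice = oracle_select(dsc, 2)
image_based_choice = similarity_select(nmi, 2)
random_choice = random_select(dsc, 2, seed=0)
```

In the toy values the image-based choice only partly overlaps with the Oracle's choice, which is the situation the remainder of this section sets out to quantify.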

In this experiment, the online test cases (patient IDs in the form LSTSC-Test-SX-2YY) are taken as patient cases. The training data (patient IDs in the form LSTSC-Train-SX-0YY) and the offline test cases (patient IDs in the form LSTSC-Test-SX-1YY) are used as atlases. In the challenge, the offline test cases were not available for use as atlases because their contours were not provided. However, they are included here to increase the size of the atlas pool now that these contours are available.

A brute-force search approach is implemented where every atlas is used to contour every patient image. As noted previously, the use of templates and manifolds may optimize this search, either in terms of efficiency or by projecting the similarity more appropriately far from the test patient. However, the underlying assumption remains the same. A brute-force search removes any impact that the choice of template or manifold may have, ensuring that the best possible selection using the similarity measure is assessed. Image similarity measures (root mean square error of intensities [19] and NMI [3]) are computed after both rigid and deformable registration over the whole image. Furthermore, these image measures are computed for each organ within the deformed atlas contour following deformable registration, as a measure of local image similarity. These measures can be used for atlas selection based on image similarity.
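The two similarity measures can be sketched with NumPy as below. This is an illustrative implementation under stated assumptions rather than the code used for the chapter: NMI is taken in the common form (H(A) + H(B)) / H(A, B) estimated from a joint histogram, the bin count of 32 is arbitrary, and the optional `mask` argument stands in for the deformed atlas contour used for the local measures.

```python
import numpy as np


def rmse(fixed, moving, mask=None):
    """Root mean square error of intensities, optionally within a boolean mask."""
    fixed = np.asarray(fixed, dtype=float)
    moving = np.asarray(moving, dtype=float)
    if mask is not None:
        fixed, moving = fixed[mask], moving[mask]
    return float(np.sqrt(np.mean((fixed - moving) ** 2)))


def nmi(fixed, moving, mask=None, bins=32):
    """Normalized mutual information, (H(A) + H(B)) / H(A, B).

    Assumes the images are not constant, otherwise the joint entropy is zero.
    """
    fixed = np.asarray(fixed, dtype=float)
    moving = np.asarray(moving, dtype=float)
    if mask is not None:
        fixed, moving = fixed[mask], moving[mask]
    joint, _, _ = np.histogram2d(fixed.ravel(), moving.ravel(), bins=bins)
    p_ab = joint / joint.sum()   # joint probability
    p_a = p_ab.sum(axis=1)       # marginal of the fixed image
    p_b = p_ab.sum(axis=0)       # marginal of the moving image

    def entropy(p):
        p = p[p > 0]             # empty bins contribute nothing
        return float(-np.sum(p * np.log(p)))

    return (entropy(p_a) + entropy(p_b)) / entropy(p_ab)
```

With this form, NMI ranges from 1 for statistically independent images to 2 for identical images, so higher values indicate greater similarity, whereas lower RMSE indicates greater similarity.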

The contouring performance using each atlas is calculated against the “ground truth” contour for each patient case using the DSC [20] implemented using a voxel mask. DSC has been used in this instance as it is easy to compute, and necessary for comparison to previous publications. Additional, and perhaps more clinically relevant, measures are discussed in Chapter 15. The final DSC measure is deemed known to the Oracle for atlas selection.
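A voxel-mask DSC can be written in a few lines. The sketch below assumes both contours have already been rasterized to boolean arrays on the same grid; treating two empty masks as perfect agreement is a convention chosen here, not something stated in the chapter.

```python
import numpy as np


def dice(mask_a, mask_b):
    """Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|), for two voxel masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(2.0 * np.logical_and(a, b).sum() / total)


# Two 2x2x2 cubes offset by one voxel along the last axis.
truth = np.zeros((4, 4, 4), dtype=bool)
truth[1:3, 1:3, 1:3] = True
auto = np.zeros((4, 4, 4), dtype=bool)
auto[1:3, 1:3, 2:4] = True
overlap_score = dice(truth, auto)
```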

Atlas Selection Performance Assessment

First, using the results of selection, it is possible to produce a plot similar to Figure 3.8 following the work of Aljabar et al. [9]. Figure 3.10 shows the average DSC for all test cases plotted for atlases ranked according to NMI for the esophagus. Similar figures can be produced for all organs in the challenge case. Such figures suggest that there is broad correlation between the similarity measure (in this instance, NMI) and the performance (as measured by DSC). However, as with Aljabar et al. [9], this figure shows average performance for the test cases. While this indicates that there is some benefit to be gained on average, it reveals nothing about the impact of selection on an individual case.

Figure 3.11 shows the contouring performance for all individual cases when plotted against the atlas rank according to NMI for the esophagus. Each color in the figure represents a different test case. Showing the data in this way reveals a much weaker correlation, suggesting that the performance following selection may vary substantially. Thus, while performance may be improved on average, there is no guarantee of improved performance for any particular case.
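The difference between the averaged and the per-case view can be reproduced on toy data. The sketch below assumes nested dictionaries of the form patient -> {atlas: value} for both NMI and DSC; the structure, names, and numbers are illustrative only.

```python
from statistics import mean


def dsc_by_similarity_rank(similarity, dsc, higher_is_better=True):
    """For each patient, list the DSC of each atlas in order of similarity rank."""
    curves = {}
    for patient, scores in similarity.items():
        ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
        curves[patient] = [dsc[patient][atlas] for atlas in ranked]
    return curves


def average_curve(curves):
    """Average DSC over patients at each rank position."""
    return [mean(values) for values in zip(*curves.values())]


# Toy data: two patients, three atlases.
nmi_scores = {"P1": {"A": 1.3, "B": 1.2, "C": 1.1},
              "P2": {"A": 1.1, "B": 1.3, "C": 1.2}}
dsc_scores = {"P1": {"A": 0.8, "B": 0.6, "C": 0.7},
              "P2": {"A": 0.5, "B": 0.6, "C": 0.9}}

per_case = dsc_by_similarity_rank(nmi_scores, dsc_scores)  # one curve per patient
averaged = average_curve(per_case)                         # one curve overall
```

Plotting `averaged` against rank corresponds to the view in Figure 3.10, while plotting each entry of `per_case` separately corresponds to Figure 3.11.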

FIGURE 3.10 Average contouring performance on the esophagus over all test cases by atlases ranked according to NMI.

FIGURE 3.11 Contouring performance of atlases ranked by NMI. Each color represents a different test case.

Figure 3.7 also showed improvement (on average) against random atlas selection. A similar plot is shown in Figure 3.12 using the thoracic data for all six image-based selection measures implemented. The average performance of the first ten selected atlases is shown, rather than the result after contour fusion. The bars indicate the mean performance over all test cases, while the whiskers indicate the minimum and maximum observed performance over the 12 test cases. Ten atlases were selected, rather than 20 as in Figure 3.7, because of the relatively low number of available atlases (n = 48) compared to Aljabar et al. [9] (n = 274). Figure 3.12 also includes the Oracle, in addition to random selection, for comparison with the best achievable performance. As in [9], it is observed that image-based selection methods perform better than random selection on average. However, there is variation between subjects, leading to large whiskers. Nevertheless, for some organs, image-based selection appears to perform close to the best achievable, as observed in Rohlfing et al. [3] and shown in Figure 3.5.

So far there appears to be conflicting information. Figure 3.10 and Figure 3.11 appear to suggest selection that is better than random, but highly variable and unlikely to result in substantial performance gains, while Figure 3.12 suggests that performance close to the best achievable can be expected for some organs. To understand this, it is necessary to look at the rank of the selected atlases. Figure 3.13 shows the average rank of the ten selected atlases rather than their contouring performance. For the Oracle, the mean rank is 5.5 for all test cases (atlases ranked 1 to 10 would be selected). For Random selection, a mean rank of around 24.5 would be expected; however, this varies between cases and organs, leading to the whiskers shown. The mean rank of atlases selected using the image-based measures ranges from around 12 to 22, with the whiskers extending this range for individual patients from around 6 to 36. In many cases the whiskers extend to a higher average rank than random, showing that for some test cases selection has performed worse than might be expected with random selection. Therefore, this figure strengthens what is observed in Figure 3.10 and Figure 3.11: image-based selection is on average better than random, but highly variable and not robust.
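The reference values of 5.5 and 24.5 follow directly from the pool size of 48 atlases and the selection of ten. The sketch below shows that arithmetic and a hypothetical helper for scoring any selection by its mean Oracle rank; the helper names and the synthetic DSC values are assumptions made for illustration.

```python
from statistics import mean


def oracle_ranks(dsc_by_atlas):
    """Rank atlases by DSC for one patient: rank 1 is the best-performing atlas."""
    ordered = sorted(dsc_by_atlas, key=dsc_by_atlas.get, reverse=True)
    return {atlas: rank for rank, atlas in enumerate(ordered, start=1)}


def mean_rank_of_selection(selected, dsc_by_atlas):
    """Mean Oracle rank of the selected atlases (lower is better)."""
    ranks = oracle_ranks(dsc_by_atlas)
    return mean(ranks[atlas] for atlas in selected)


n_atlases, n_selected = 48, 10

# The Oracle selects ranks 1..10, so its mean rank is (1 + 10) / 2.
oracle_mean_rank = mean(range(1, n_selected + 1))

# A random draw has an expected rank equal to the mean of ranks 1..48.
random_expected_rank = mean(range(1, n_atlases + 1))

# Synthetic DSC values in which atlas00 is best and atlas47 is worst.
toy_dsc = {f"atlas{i:02d}": 1.0 - i / 100 for i in range(n_atlases)}
worst_ten = [f"atlas{i:02d}" for i in range(38, 48)]
worst_case_mean_rank = mean_rank_of_selection(worst_ten, toy_dsc)
```

A selection method whose mean rank for a patient exceeds `random_expected_rank` has, for that patient, done worse than a random draw would be expected to.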

FIGURE 3.12 Average performance following selection of the ten best atlases using various selection methods. The whiskers indicate minimum and maximum performance observed within the 12 test cases. The image similarity measures (NMI, RMSE) used for selection were calculated over the whole image following affine registration (Affine) and following deformable registration (Def). The measures used for selection were also calculated within the deformed atlas contour only following deformable registration (Local Def) to give a local measure of similarity.

Next, performance with respect to atlas rank is considered. Figure 3.14 shows performance for each organ, averaged over the 12 test cases according to the Oracle’s ranking. The mean rank position for the ten selected atlases according to each selection method is indicated on the figure. For the lungs and heart, where DSC is normally expected to be high because they are larger organs, the performance curve is quite shallow, except for approximately the last 20% of poorly performing atlases. Therefore, the performance improvement from atlas selection is expected to be small. Conversely, a performance close to the best achievable is still expected. However, for the esophagus and spinal cord, the impact of the choice of atlas is much greater. Here the image-based selection has a larger impact on performance, but the gulf between the observed performance and the best achievable remains large. Thus, this figure links the observations back to Figure 3.12, where it was observed that following selection some organs perform close to the best achievable, despite less than perfect atlas selection. Figure 3.14 also highlights the need to achieve perfect atlas selection if the promise of near-perfect contouring suggested in Schipaanboord et al. [2] is to be realized. It can be observed that the contouring performance shows a small but marked improvement for the first or second ranked atlases compared to even the third or fourth ranked ones. When searching for an atlas similar to the patient, only a very small percentage of the atlas population will constitute a very good match, which places high importance on an exceptional selection method to achieve exceptional contouring performance.

Discussion and Implications for Atlas Selection

The study conducted above has a few notable limitations, particularly with respect to learnt similarity measures. However, before these limitations are considered, what the study does show, rather than what it does not, should be considered.

FIGURE 3.13 Average rank of ten selected atlases using various selection methods. The bar indicates the mean rank over the 12 test cases, while whiskers show the minimum and maximum. The image similarity measures (NMI, RMSE) used for selection were calculated over the whole image following affine registration (Affine) and following deformable registration (Def). The measures used for selection were also calculated within the deformed atlas contour only following deformable registration (Local Def) to give a local measure of similarity.

FIGURE 3.14 Performance of the Oracle. Performance on all organs of atlases ranked by performance averaged over all test cases.


Clinical Impact of Atlas Contouring and Atlas Selection

Study | Site | Number of atlases | Time saving (mins) | Time saving (%) | Selection method
Teguh et al. [26] | Head and neck | | | | Online selection - mutual information following rigid registration
Stapleford et al. [45] | Head and neck | | | |
Young et al. [27] | | | | | Online selection - commercial software, using mutual information
Gambacorta et al. [46] | | | | |
Hwee et al. [47] | | | | | Online selection - commercial software, no details given
Lin et al. [48] | | | | | Stratification by bladder size, followed by online selection - commercial software, no details given
Granberg et al. [49] | | | | |
Langmack et al. [50] | | | | |

Recapping, atlas selection was first motivated by Rohlfing et al. [3]. In Figure 3.4, it was seen that NMI-based selection outperformed the use of a single fixed atlas or an average atlas. The height of each bar represents the percentage of structures with a DSC higher than a particular threshold. While a greater percentage of structures are above a threshold of 0.7 for image-based selection than for the average atlas, the conclusion would change as the threshold increases. An average atlas would outperform the similarity-selected one at a threshold of 0.75 or greater. At a threshold of 0.85 the performance of a fixed atlas looks equivalent to a selected one. Therefore, the conclusion could be drawn that image-based selection increases performance by rejecting poor results at the lower end of the performance spectrum rather than by improving results at the upper end.

Rohlfing et al. also considered the best possible performance (the Oracle) and found that image similarity-based selection performed close to this, as shown in Figure 3.5. This is similar to the finding in this study in Figure 3.12. However, it has been seen in this study that looking at performance measures alone can be misleading in the evaluation of selection performance. The plot of performance against ranking (Figure 3.14) demonstrates the importance of understanding the performance profile for a particular structure to evaluate how close selection is to optimal.

While Rohlfing et al. considered performance against the optimum, very few subsequent studies have done so. The majority of studies choose to consider any improvement in contouring performance with respect to some reference alternative, or just report the contouring performance of the method reported. Of the studies listed in Table 3.1 subsequent to Rohlfing et al. [3], only Lotjonen et al. [21], Akinyemi et al. [7], Raudaschl et al. [22], Sanroma et al. [15], Zhao et al. [23], and Zaffino et al. [24] consider performance with respect to the optimum. Furthermore, only Sanroma et al. [15] and Zhao et al. [23] consider performance in terms of rank rather than contouring performance, finding that only about a third of the atlases selected using NMI would be considered relevant selections by rank. Thus, the focus of most studies is improving performance from the current state, without considering how much room for improvement exists.


The study presented has clear limitations, as it considers only simple image similarity measures for selection. However, it is noted that only such basic atlas selection approaches have been implemented in clinical software [25].

While it is argued that this assessment adequately addresses the potential of template and manifold type approaches, this is not demonstrated. Conversely, no study has investigated the impact of these techniques with respect to optimal selection. Therefore, this remains an area for future exploration.

Fusion of selected atlas contours is also not considered in this experiment, although it is touched on in Schipaanboord et al. [17]. Contour fusion has been repeatedly shown to result in improved contouring compared to single atlas segmentation, even in the extreme case [2], and therefore it is widely used. Yet most atlas selection approaches ignore the subsequent fusion that is likely to take place, opting for a greedy atlas selection method whereby the best atlases are individually chosen prior to generating the consensus. A better approach would be to choose the best set of atlases in combination [24]; however, this comes with a potentially prohibitive computational cost as database size increases. A recent contribution suggested exploring the large search space using a genetic algorithm to optimize the combinatorial selection [11], but this approach has yet to be shown to result in an optimal selection. Zaffino et al. [24] trained a neural network to predict performance on groups of atlases, demonstrating that selection of a group equivalent to the single atlas oracle could be achieved. However, this was still below the performance of a group atlas oracle. Researching optimal combinatorial selection requires considerable computation, yet there could be substantial performance gains if an efficient method can be found that can be proven to be optimal, or near-optimal, compared to an oracle-based selection.
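The distinction between greedy and combinatorial selection, and the cost of the latter, can be illustrated with a small sketch. It assumes simple majority voting as the fusion rule and uses the ground truth to score candidates, so both selectors below act as oracles; none of this is taken from the cited studies, and all names are hypothetical.

```python
from itertools import combinations
from math import comb

import numpy as np


def dice(a, b):
    """Dice similarity coefficient between two boolean masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    return 1.0 if total == 0 else float(2.0 * np.logical_and(a, b).sum() / total)


def majority_vote(masks):
    """Fuse masks: a voxel is foreground if more than half of the masks agree."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    return stack.sum(axis=0) * 2 > len(masks)


def fused_dice(names, masks, truth):
    """DSC of the majority-vote fusion of the named atlas contours."""
    return dice(majority_vote([masks[n] for n in names]), truth)


def greedy_selection(masks, truth, k):
    """Choose the k atlases with the best individual DSC."""
    return sorted(masks, key=lambda n: dice(masks[n], truth), reverse=True)[:k]


def best_combination(masks, truth, k):
    """Exhaustively score every k-atlas subset by the DSC of its fused contour."""
    return max(combinations(sorted(masks), k),
               key=lambda names: fused_dice(names, masks, truth))


# Number of ten-atlas subsets that an exhaustive search over 48 atlases must score.
n_subsets = comb(48, 10)
```

Because the greedy choice is itself one of the candidate subsets, the exhaustive search can never do worse after fusion, but the number of subsets grows combinatorially with the pool size, which is why `best_combination` is only practical for very small pools.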

Atlas selection approaches using machine learning have also not been considered in this chapter, on account of the range of implementations that could be adopted and the need for an additional training set. Such methods have shown promise compared to image-based selection, but the evidence to date is that these approaches still fall short of optimal atlas selection [15].
