Subjective Evaluation

Quantitative evaluation tries to measure the error in contouring but, until recently, has failed to properly consider the endpoint of auto-contouring: the clinical use of those contours. Clinical use will depend on the acceptance or editing of those contours by a human, and this is inherently subjective. Subjective assessment of auto-contouring in radiotherapy has largely followed the same format since 2011, when a number of studies were published using this approach [14, 16, 27], and can be broadly broken down into considering the acceptance of contouring, the source of contouring, and the preference for contouring.

The Acceptance of Contours

The majority of investigations have asked whether the contours being evaluated are suitable for clinical use, although the exact framing of the question has varied. Hwee et al. [27] asked whether the contours were clinically acceptable, offering the binary choice of “yes” or “no”. In contrast, Teguh et al. [14] anticipated editing of contours, asking experts to rate contours on a four-point scale as: 0 poor, 1 major deviation (editable), 2 minor deviation (editable), 3 perfect. Curiously, in the same study, a three-point scale was used to assess manual clinical contours and edited auto-contours, making direct comparison difficult. In the same year, Granberg [16] asked clinicians to rate how helpful they perceived the contours to be on a three-point scale of: not at all, a little, very much.

Table 15.2 summarizes the studies in the domain of radiotherapy that have used a subjective approach to evaluate contour acceptability. In most papers the question asked is not explicitly stated but can be inferred from the answers available; thus the table may not reflect the exact wording used by the investigators.

Since Hwee et al. [27], few papers have explicitly stated the score that is considered clinically acceptable, with most opting to allow major editing in some way. This implicit expectation of editing could introduce bias in the interpretation of results, with contours being considered useful even when they require substantial editing. Furthermore, most investigations since Hwee et al. have not hidden the source of the contours from the evaluators, also potentially introducing bias. Gooding et al. [28],[1], recognizing that there may be bias if the evaluator knows the source of the contouring, implemented their assessment in a blinded manner but also expressed the question so as to place the assessment within a clinical context. Peer review of contouring is considered best practice, yet to reject another person’s contours requires confidence in the significance of the error. Thus, the question was phrased to express a peer review context, with a scoring scale balanced between acceptance and rejection of the contours. It is interesting to note that both Hwee et al. [27] and Gooding et al. [28] found that clinical observers will reject around 25% of clinical contours (ones that have previously been approved for clinical use) when blinded as to the source of contouring.
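As an illustration of how responses on such a balanced scale might be reduced to an acceptance rate per contour source, a minimal Python sketch follows. The numeric coding (1 to 4, with 3 and above counted as accepted) and the function names are assumptions made for this example; they are not taken from the cited studies.

```python
# Assumed coding of a four-point peer-review scale (illustrative only):
# 1 = correct, large errors      2 = correct, minor errors
# 3 = accept, errors not clinically significant      4 = accept, very precise
ACCEPT_THRESHOLD = 3  # assumption: the two "accept" answers count as acceptance


def acceptance_rate(scores):
    """Fraction of scores at or above the acceptance threshold."""
    if not scores:
        raise ValueError("at least one score is required")
    accepted = sum(1 for score in scores if score >= ACCEPT_THRESHOLD)
    return accepted / len(scores)


def acceptance_by_source(ratings):
    """Group (source, score) pairs by source and compute each acceptance rate."""
    grouped = {}
    for source, score in ratings:
        grouped.setdefault(source, []).append(score)
    return {source: acceptance_rate(scores) for source, scores in grouped.items()}


if __name__ == "__main__":
    # Made-up ratings, not study data.
    example = [("clinical", 4), ("clinical", 2), ("auto", 3), ("auto", 1)]
    print(acceptance_by_source(example))
```

Stating the threshold explicitly, as done here with ACCEPT_THRESHOLD, makes clear which scores are being treated as clinically acceptable.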

However, the acceptance of contours still requires agreement as to what the correct contouring standard is. As a consequence, the assessment of acceptance of contours from an auto-contouring system may reflect the style of contouring with which the system has been developed/trained, rather than the efficacy of the system itself. In Gooding et al. [29], wide variability in the acceptance of contours was found between institutions, despite the same contours being shown.

The Source of Contouring

This blinding to the source of the contour, together with the use of AI (deep learning) in their auto-contouring research [30], inspired Gooding et al. to frame the question not as one of acceptance but as one of identification [1, 28], following the idea of the Imitation Game proposed by Turing [31] and asking the question: Was this contour drawn by a human or a computer? While the question of the source of contours was also used by Hwee et al. [27], their purpose was to assess whether blinding observers to the source of contouring reduced potential bias in asking about contour acceptability. Gooding et al. propose that the inability to determine the source of the contours could itself be used as a performance criterion [1].

They pose the argument that where quantitative measures are blind to the type of error, as discussed above, the human observer is not. If it is assumed that a clinically drawn contour is acceptable, then the inability to distinguish the clinical contour from the auto-contour would suggest that the auto-contour is equally good. This is not to say that either is correct or incorrect, which overcomes the issue of inter-observer variability or disagreement, but that the types of error being made by the auto-contouring system are similar in nature to those made by a human expert. Thus, the auto-contours require no more editing than the contours of another clinical expert. In their initial study, they provide indicative results that the misclassification rate of the source of contouring is a better surrogate of editing time than quantitative measures such as Dice. One objection that can be leveled at this approach is that it does not easily allow for the auto-contouring to outperform manual contouring, with the aim being to imitate human contouring behavior. To achieve imitability, the auto-contour must be equally as good, but also equally as bad.

TABLE 15.2

Summary of Studies Using Subjective Evaluation to Ask about the Clinical Acceptability of Auto-Contours in the Domain of Radiotherapy. Question Phrasing Has Been Inferred Where Not Explicitly Stated in the Chapter

| Study | Question | Response options |
| Hwee et al. [27] | Are the contours acceptable? | Yes / No |
| Teguh et al. [14] | How well do the contours agree with published guidelines? | Poor / Major deviation, editable / Minor deviation, editable / Perfect |
| Granberg [16] | How useful are the segmentation proposals? | Not at all / A little / Very much |
| Gooding et al. [50, 29] | How useful are the contours, how much time would you expect to save? | None of the results would form a useful basis for further editing, no time is expected to be saved compared to manual contouring / Some of the results form a useful basis for further editing, little time would be saved compared to manual contouring / Many of the results form a useful basis for further editing, a moderate time saving is expected compared to manual contouring / Most of the results form a useful basis for further editing, a significant time saving is expected compared to manual contouring |
| Hoang Duc et al. [53] | How well do the contours adhere to guidelines for clinical use? | The segmentation does not meet universal guidelines. Some slices show gross mis-delineation that cannot be attributed to segmentation variability / The segmentation is reasonably acceptable but needs some manual editing. Some contour lines need to be corrected to meet universal guidelines / The segmentation is clinically acceptable and satisfies universal OAR delineation guidelines and can be used as created for radiotherapy planning |
| Van Dijk-Peters et al. [54] | How much editing is required prior to clinical use? | Major editing required / Minor editing required / No editing required |
| Lustberg et al. [30] | How useful are the contours, how much time would you expect to save? | None of the results would form a useful basis for further editing, no time is expected to be saved compared to manual contouring / Some of the results form a useful basis for further editing, little time would be saved compared to manual contouring / Many of the results form a useful basis for further editing, a moderate time saving is expected compared to manual contouring / Most of the results form a useful basis for further editing, a significant time saving is expected compared to manual contouring |
| McCarroll et al. [11] | How much editing is required for use in dose-volume-histogram (DVH)-based planning? | Major editing needed / Minor editing needed / No editing needed |
| Gooding et al. 2018 [28] | You have been asked to QA these contours for clinical use by a colleague. Would you... | Require them to be corrected; there are large, obvious errors / Require them to be corrected; there are minor errors that need a small amount of editing / Accept them as they are; there are minor errors, but these are clinically not significant / Accept them as they are; the contours are very precise |

FIGURE 15.11 Preference for contouring for three different methods (clinical contours, atlas-based auto-contouring, and deep learning contouring) after blinded side-by-side assessment, as reported in Gooding et al. [28].
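The misclassification rate described above can be computed directly from observer guesses, and compared with chance-level guessing using an exact binomial test. The following Python sketch is illustrative only: the function names and the choice of a two-sided exact test against a 50% guessing rate are assumptions of this example, not the analysis used by Gooding et al.

```python
from math import comb


def misclassification_rate(true_sources, guessed_sources):
    """Fraction of contours whose source the observer identified wrongly."""
    if len(true_sources) != len(guessed_sources) or not true_sources:
        raise ValueError("need two non-empty sequences of equal length")
    wrong = sum(t != g for t, g in zip(true_sources, guessed_sources))
    return wrong / len(true_sources)


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided p-value for k misclassifications out of n guesses.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis of guessing with probability p.
    """
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= threshold))


if __name__ == "__main__":
    # Made-up guesses, not study data.
    truth = ["human", "computer", "human", "computer"]
    guess = ["human", "human", "computer", "computer"]
    wrong = sum(t != g for t, g in zip(truth, guess))
    print(misclassification_rate(truth, guess), binomial_two_sided_p(wrong, len(truth)))
```

Note that a large p-value only indicates that the guesses are compatible with chance for the sample collected; with few cases it is weak evidence that the contours are indistinguishable.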

The Preference for Contouring

The remaining type of subjective assessment used to date was also posed in the discussion by Gooding et al. [1] and used in practice in Gooding et al. [28]; this is to ask which contour is preferred when two contours are shown side by side in a blinded fashion. This approach again sidesteps the question of contour correctness, allowing contours to be incorrect and/or subject to inter-observer variability. Such a question allows the comparison of multiple contouring methods. A reference contour (or manual clinical contour) can then be used to benchmark performance while still allowing auto-contouring to outperform this benchmark, as illustrated in Figure 15.11.
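A simple way to summarize blinded side-by-side responses is to tally, for each method, the fraction of comparisons in which it was preferred. The Python sketch below assumes each response is recorded as a (method A, method B, preferred method) triple; this record layout and the function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def preference_shares(comparisons):
    """Return, per method, the share of its comparisons in which it was preferred.

    comparisons: iterable of (method_a, method_b, preferred) triples, where
    preferred must be one of the two methods shown.
    """
    wins = defaultdict(int)
    shown = defaultdict(int)
    for method_a, method_b, preferred in comparisons:
        if preferred not in (method_a, method_b):
            raise ValueError("preferred method must be one of the pair shown")
        shown[method_a] += 1
        shown[method_b] += 1
        wins[preferred] += 1
    return {method: wins[method] / shown[method] for method in shown}


if __name__ == "__main__":
    # Made-up responses, not study data.
    responses = [
        ("clinical", "atlas", "clinical"),
        ("clinical", "deep learning", "deep learning"),
        ("atlas", "deep learning", "deep learning"),
    ]
    print(preference_shares(responses))
```

Because the clinical contour is treated as just another method in the tally, an auto-contouring method can score above it, which is the property that distinguishes this question from source identification.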

Challenges of Subjective Assessment

Subjective assessment is very helpful in understanding the acceptance and utility of auto-contouring. However, there are several limitations and challenges to using this approach.

As has already been noted when considering Table 15.2, there is a substantial risk of introducing bias into any subjective study since human observers are involved. Consequently, any study must be carefully designed. Much research has been conducted in polling[2] but, as professionals in radiation oncology, medical physics, medical devices, computer science, etc., the authors are not experts in the psychology behind the design of such studies. Therefore, caution must be exercised to avoid introducing bias in how questions are phrased and in the choice of answers given. As is commonplace in medicine, blinding and controls are important to reduce bias and should be used for any subjective assessment.
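One practical element of blinding is to randomize both the order in which cases are presented and the side on which each contour appears, while keeping the answer key separate from what the observer sees. The following Python sketch shows one way this might be done; the function name, record layout, and use of a fixed seed for reproducibility are assumptions of this example rather than a description of any cited study.

```python
import random


def blind_pairs(cases, seed=0):
    """Prepare a blinded side-by-side presentation.

    cases: list of (case_id, source_a, source_b) triples.
    Returns (presentation, key): the presentation lists items without any
    source labels; the key records which source was shown on which side.
    """
    rng = random.Random(seed)  # fixed seed so the allocation can be audited
    order = list(cases)
    rng.shuffle(order)  # randomize case order
    presentation, key = [], {}
    for item_no, (case_id, source_a, source_b) in enumerate(order, start=1):
        # Randomize which source appears on the left.
        if rng.random() < 0.5:
            left, right = source_a, source_b
        else:
            left, right = source_b, source_a
        presentation.append({"item": item_no, "case": case_id})
        key[item_no] = {"case": case_id, "left": left, "right": right}
    return presentation, key


if __name__ == "__main__":
    shown, answer_key = blind_pairs([("case01", "clinical", "deep learning")], seed=1)
    print(shown)       # what the observer's worksheet would contain
    print(answer_key)  # held back until scoring
```

The key would be held by someone other than the observers until all responses are collected.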

An additional factor that needs consideration is whether the assessment should be performed in 2D or 3D. 2D assessment reflects slice-by-slice contouring and offers more data for the same number of patients, yet does not allow the assessment of correctness of the axial extent of contouring. However, 3D assessment will require more cases to be meaningful, and human slice-by-slice contouring results in telltale jaggedness on a coronal or sagittal view, giving away the source of contouring. Similarly, efforts must be made to avoid clear differences that result from the contouring method but are not related to the quality of the contouring. Some treatment planning systems and auto-contouring solutions produce pixelized RTSS, while others produce smooth contours. Such artifacts, if different between methods, can provide evidence that diverts the observer from the purpose of the assessment.

The qualitative nature of these assessments introduces challenges in the development of auto-contouring methods. While results should be reasonably reproducible for large studies, a degree of variation would be expected, particularly for smaller evaluations. Consequently, it may be difficult to perform repeated assessments to demonstrate the improvement of a system over time as it is developed, or to assess the current performance during the development. This limitation is compounded because subjective assessment requires the time of clinical experts. This time is valuable, and while their input is worthwhile where subjective assessment is useful in commissioning a clinical system, their involvement will be limited in the development or comparison of systems. Consequently, studies focusing on the development or comparison of methods, such as the grand challenges [2-4], will resort to quantitative assessment over subjective assessment.
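The expected variation in smaller evaluations can be made concrete with a confidence interval on an observed proportion, such as an acceptance rate. The Python sketch below uses the Wilson score interval, which is one common choice and is an assumption of this example; the counts are made up to show how the interval narrows as the number of rated contours grows.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Same observed acceptance rate (75%), different study sizes; illustrative counts only.
    for accepted, total in [(15, 20), (150, 200)]:
        low, high = wilson_interval(accepted, total)
        print(f"{accepted}/{total}: {low:.3f} to {high:.3f}")
```

An interval this wide for a small evaluation illustrates why a modest improvement between two development versions may not be demonstrable without many more expert ratings.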

Summary of Subjective Evaluation

In this section the use of subjective assessment for evaluation of auto-contouring in radiation oncology has been considered. Three types of question have been discussed, relating to the acceptance of contouring, the source of contouring, and the preference for contouring. It has been highlighted that study design is an important factor in such investigations on account of the human element involved. Despite the challenges of implementing subjective evaluation well, this approach to assessing auto-contouring can overcome some of the limitations of quantitative assessment in determining the suitability of a system for clinical use, particularly those related to inter- and intra-observer variability.

  • [1] The published abstract does not contain this detail. However, the survey questions are still available online at
  • [2] As an outsider to the field of opinion polling, the most suitable review paper to cite is unknown to the author. Therefore, no reference is offered.