Inter-Rater Reliability

Previously in this chapter, we introduced the notion that an assessment system needs to demonstrate reliability, which refers to the accuracy and consistency of the assessment tool. With respect to the assessment of non-technical skills in high-risk industries, the concept of inter-rater reliability is perhaps the most important feature of reliability, given that often a large number of instructors will be responsible for assessment across the workforce.

In an ideal world, the same performance in a scenario would be rated in exactly the same way by the same instructor on multiple occasions, or indeed by all the different instructors in the training department. In reality, there are a number of barriers to achieving perfect inter-rater reliability, and the research literature highlights that often only modest inter-rater reliability is achieved when using some of the most common behavioural marker assessment tools. Much of this modest level of inter-rater reliability may be due to sources of rating error, such as biases, which will be examined shortly. However, the lack of high levels of inter-rater reliability probably also reflects the limited time devoted to training instructors and assessors and to calibrating their ratings.


A non-technical skills training program was being developed for intensive care teams at a large hospital to enhance skills in situation awareness and decision-making. A training needs analysis had been completed, and competency specifications with associated behavioural markers had been written. The training curriculum had also been finalised, and this involved core knowledge development in a workshop, followed by a full team simulation with the scenario of a patient being admitted to the unit with significant post-operative complications, and a post-simulation debrief.

All that remained was to train the instructors in assessing the non-technical skills described by the training needs analysis and competency specifications with associated behavioural markers. A train-the-trainer day was arranged, and videos of examples of good, average and poor performance in the scenario had been recorded.

To facilitate calibration of the instructors, a simple inter-rater reliability tool was used, based on the within-group agreement (rwg) statistic. The instructors viewed each video, and afterwards one of the non-technical skills was assessed, with each instructor making a rating on a four-point scale. These scores were collated and entered into a spreadsheet that performed the calculation of the rwg statistic. A value greater than .7 was deemed to represent a sufficient level of agreement. However, all instances where there was disagreement on scoring were discussed, and a calibrated score for the performance on the video was agreed on as part of the instructor training and calibration process. The inter-rater agreement tool is shown here.

Inter-rater agreement (rwg)

rwg = 1 − (Sx² / σE²)

where Sx² is the observed variance of the instructors' ratings and σE² is the variance that would be expected if ratings were made at random, uniformly across the points of the scale.
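The calculation can be sketched in a few lines of code. This is a minimal illustration of the standard single-item rwg formula, where the expected variance for a uniform distribution over A scale points is (A² − 1)/12; the function name and the example ratings are hypothetical, not taken from the chapter.

```python
def rwg(ratings, scale_points):
    """Within-group agreement: rwg = 1 - (Sx^2 / sigma_E^2).

    Sx^2 is the sample variance of the observed ratings; sigma_E^2
    is the variance expected under uniformly random responding,
    (A^2 - 1) / 12 for a scale with A points.
    """
    n = len(ratings)
    mean = sum(ratings) / n
    # Sample variance of the observed ratings (n - 1 denominator).
    s_x2 = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    sigma_e2 = (scale_points ** 2 - 1) / 12
    return 1 - (s_x2 / sigma_e2)

# Example: six instructors rate one video on the four-point scale.
scores = [3, 3, 4, 3, 3, 4]
agreement = rwg(scores, scale_points=4)
print(round(agreement, 2))  # prints 0.79
```

With these illustrative scores the agreement exceeds the .7 threshold described above; perfect agreement (all instructors giving the same score) yields rwg = 1, while ratings as scattered as random responding yield rwg near 0.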

The degree of inter-rater reliability achieved in the assessment of non-technical skills is a product of the assessment tool itself combined with the knowledge and skills of those undertaking the assessment. The reliability of the assessment tool has been discussed in detail earlier in this chapter. Just as with non-technical skills themselves, developing skills in the assessment of these skills takes considerable knowledge development and practice in the application of a behavioural marker system.


Instructors and assessors in non-technical skills training programs require general skills in instructional techniques, specialised knowledge in non-technical skills domains and calibration in assessment techniques to ensure inter-rater reliability.
