Reliability refers to the consistency of a measure, or the extent to which a score or outcome remains unchanged across assessments (if no change is expected) or among different assessors. For example, one would not expect a dramatic change in an individual’s weight within an hour if the same scale was used, or if different nurses in a clinic took the weight measurement. Similarly, one would not choose a depression scale if the data indicated that the same individuals had radically different scores throughout a day. Such a lack of consistency and stability over time would make the scale a poor choice as an outcome measure for a clinical intervention for depression: it would be difficult to assess whether changes between initial and follow-up depression scores reflected the effects of the intervention or merely intrasubject variability and the unreliability of the scale. Lack of reliability, that is, inconsistency in an instrument’s measurements, introduces considerable error variance into statistical models and can make it virtually impossible to detect true effects as a function of treatment.
For behavioral intervention research, the most critical types of reliability include test-retest reliability, inter-rater reliability, parallel forms, and the internal consistency of a measure.
Test-retest reliability refers to the consistency or stability of a measure when it is administered repeatedly over relatively short time intervals. For example, if an individual completes a measure assessing attitudes about weight loss, one would not expect large changes in these attitudes if the measure were administered again later that same day.
Inter-rater reliability refers to consistency of measurement across assessors. For example, if an outcome measure involved independent clinician ratings of anxiety, one would anticipate a high degree of agreement between two clinicians on an individual’s level of anxiety. Sometimes, use of a checklist can help ensure inter-rater reliability. High inter-rater reliability (usually quantified by a statistic such as kappa) indicates that the measure is structured so that different raters will obtain similar results. Similarly, intra-rater reliability refers to the degree to which a single assessor administers and scores a measure consistently from one occasion to the next.
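As an illustration of the kappa statistic mentioned above, the following is a minimal sketch of Cohen's kappa for two raters who each assign one category per individual. The function name `cohens_kappa` and the example ratings are hypothetical and are not drawn from any study; kappa compares the observed proportion of agreement with the agreement expected by chance from each rater's category frequencies.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters rating the same individuals."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must rate the same non-empty set of individuals.")
    n = len(ratings_a)

    # Observed agreement: proportion of individuals given the same rating.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the two raters' marginal proportions,
    # summed over categories.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1.")
    return (observed - expected) / (1 - expected)


# Hypothetical anxiety ratings from two clinicians for four individuals.
clinician_a = ["high", "high", "low", "low"]
clinician_b = ["high", "low", "low", "low"]
print(cohens_kappa(clinician_a, clinician_b))  # 0.5
```

In this made-up example the clinicians agree on three of four individuals (75%), but half of that agreement would be expected by chance, so kappa is 0.5 rather than 0.75.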
Parallel-forms reliability is another indicator of stability. It is obtained by administering different versions of a measure to the same group of individuals. The correlation between scores on the two parallel forms is the measure of reliability. The order in which the forms are administered should be varied across individuals to reduce order effects, and the forms must cover the same content. For example, one might develop two versions of a knowledge test. One issue, of course, is the need to generate a large number of items that reflect the same content.
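The correlation between two forms can be sketched as follows, assuming the Pearson product-moment correlation is the coefficient intended (the text does not specify one). The function name `pearson_r` and the scores are hypothetical.

```python
import math


def pearson_r(scores_x, scores_y):
    """Pearson correlation between paired scores from the same individuals."""
    n = len(scores_x)
    if n != len(scores_y) or n < 2:
        raise ValueError("Need at least two paired scores.")
    mean_x = sum(scores_x) / n
    mean_y = sum(scores_y) / n

    # Sums of cross-products and squared deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores_x, scores_y))
    ss_x = sum((x - mean_x) ** 2 for x in scores_x)
    ss_y = sum((y - mean_y) ** 2 for y in scores_y)

    if ss_x == 0 or ss_y == 0:
        raise ValueError("Correlation is undefined when one set of scores is constant.")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical knowledge-test scores for the same five individuals.
form_a = [12, 15, 9, 18, 14]
form_b = [11, 16, 10, 17, 13]
print(round(pearson_r(form_a, form_b), 3))
```

A coefficient close to 1 would suggest that the two forms rank individuals similarly. The same calculation could be applied to scores from two administrations of a single form.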
Internal consistency is an important aspect of reliability when a measure is designed to assess one overall construct or contains specific subscales. For example, the Short Form Health Survey (SF-36) (Ware & Sherbourne, 1992) includes eight subscales related to various aspects of quality of life (e.g., emotional health, physical health). Internal consistency reflects the degree to which the items in a measure or subscale “hang together” or are addressing the same underlying construct.
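The text does not name a statistic for internal consistency; one commonly used option is Cronbach's alpha, which compares the sum of the individual item variances with the variance of the total score. The sketch below assumes complete data (every respondent answers every item) and uses a hypothetical function name and made-up responses.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha; `responses` has one row of item scores per respondent."""
    if len(responses) < 2:
        raise ValueError("Need at least two respondents.")
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("Each respondent needs scores on the same two or more items.")

    # Sample variance of each item (column) and of the total score (row sums).
    item_variances = [variance(column) for column in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])

    if total_variance == 0:
        raise ValueError("Alpha is undefined when total scores do not vary.")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of four individuals to a three-item subscale.
subscale = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
]
print(round(cronbach_alpha(subscale), 3))
```

For a measure with subscales, the calculation would typically be run separately on the items of each subscale rather than on all items pooled together.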