Reliability

Reliability is an empirical estimate of the extent to which an instrument produces the same result (measure or score) when applied two or more times. Imagine taking a blood pressure (BP) reading twice on the same person. The first reading is 140/100: hypertension. Worried, the clinician takes another reading: 120/80, which is normal. If BP readings vary this much, the procedure or instrument is not reliable. The patient’s BP may be unstable, the instrument may need to be recalibrated, or the assessor may be poorly trained. Two or three BP readings will usually produce different values, but they should fall within a range of +/- 5 mm Hg. Reliable BP readings allow a clinician to decide with reasonable confidence whether the patient has a problem.

Reliability may be better understood by considering the dimension of error. Think of most measures (M) as having a true score (T) component and an error (E) component: M = T + E. While we want a measure that captures a person’s true score (T), all measures have some percentage of error. We assume that these errors are random. Random errors include any effects that introduce something other than a true measure. Suppose a patient takes the same blood or urine test on five consecutive days. The deviation of each test administration from the true response indicates random error. Thus, errors are randomly distributed (+ or -) around the true score.

The five measures may vary around the true response for many reasons: for example, the person got up late one day and was rushed, was too tired one day, or was anxious at work one day. Because of random error, the obtained scores vary around the true score. If a larger number of tests, for example two per day for 10 days, were performed on one person or a group of 30 with a valid instrument and laboratory analyses, the mean of the measures would be very close to the true response.
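The idea that the mean of repeated measures approaches the true score can be illustrated with a small simulation of M = T + E. This is only a sketch under stated assumptions: the true score of 120, the error standard deviation of 5, the normal distribution of errors, and the function name simulate_measures are all hypothetical choices made for illustration, not values from the text.

```python
import random
import statistics


def simulate_measures(true_score, error_sd, n, rng):
    """Return n observed measures M = T + E.

    Assumption: E is drawn from a normal distribution with mean 0,
    which is one simple way to model purely random error.
    """
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n)]


rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_SCORE = 120.0        # hypothetical true score (T)
ERROR_SD = 5.0            # hypothetical spread of the random error (E)

# Five administrations, as in the five-consecutive-days example.
five = simulate_measures(TRUE_SCORE, ERROR_SD, 5, rng)
# Twenty administrations, as in two per day for 10 days.
twenty = simulate_measures(TRUE_SCORE, ERROR_SD, 20, rng)

for label, scores in (("5 measures", five), ("20 measures", twenty)):
    mean_score = statistics.mean(scores)
    print(f"{label}: mean = {mean_score:.2f}, "
          f"distance from true score = {abs(mean_score - TRUE_SCORE):.2f}")
```

Individual measures scatter above and below the true score, while the mean of a larger set tends to sit closer to it. Any single run can deviate from that pattern, which is why the comparison is best read as a tendency rather than a guarantee.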

In practice, however, the same test will not be administered 20 times to document the distribution of scores around a true score. When a test is administered once, the result may be close to or far from the person’s true score. We assume, however, that with trained assessors and standardized methods these errors are randomly distributed across all people tested. Because random error is assumed, the mean of multiple administrations is used as the best estimate of the true score. There may not be a totally stable true score for a person, because a person’s true score can vary from day to day; for example, the amount of fat or cholesterol consumed varies from day to day. Errors decrease reliability, making it more difficult to detect a true score. Measurement error reduces statistical power, attenuates an effect size, and reduces or eliminates the probability of observing a significant impact. A reliable instrument has less error and produces measurements close to the true score.
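The attenuation of an effect size by random error can also be sketched in a simulation. The example below assumes two hypothetical groups whose true scores differ by half a standard deviation, then adds random error with the same spread as the true scores. Under the standard classical test theory definition (not stated in the text), this corresponds to a reliability of about 0.5, meaning half of the observed variance is true-score variance. The group means, standard deviations, sample size, and function names are all illustrative assumptions.

```python
import random
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference (b minus a) using the pooled sample SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    return (statistics.mean(group_b) - statistics.mean(group_a)) / pooled_var ** 0.5


def add_error(true_scores, error_sd, rng):
    """Turn true scores into observed scores: M = T + E, with E ~ Normal(0, error_sd)."""
    return [t + rng.gauss(0.0, error_sd) for t in true_scores]


rng = random.Random(7)    # fixed seed so the run is repeatable
N = 5000                  # hypothetical number of people per group

# Hypothetical true scores: the program group is 5 points (0.5 SD) higher.
control_true = [rng.gauss(50.0, 10.0) for _ in range(N)]
program_true = [rng.gauss(55.0, 10.0) for _ in range(N)]

# Observed scores from an unreliable instrument (error SD equal to true-score SD).
control_obs = add_error(control_true, 10.0, rng)
program_obs = add_error(program_true, 10.0, rng)

d_true = cohens_d(control_true, program_true)
d_observed = cohens_d(control_obs, program_obs)
print(f"effect size from true scores:     {d_true:.2f}")
print(f"effect size from observed scores: {d_observed:.2f}")
```

The difference between group means is roughly the same in both cases, but the random error inflates the spread of observed scores, so the standardized effect shrinks. A smaller standardized effect is harder to detect at a given sample size, which is the loss of statistical power described above.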