# Criterion Validity—The Gold Standard

An instrument has high criterion validity if there is a close fit between the measures it produces and the measures produced by some other instrument that is known to be valid. Agreement with a trusted instrument of this kind is the gold standard test of validity.
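One common way to put a number on "close fit" is to correlate the scores from the instrument under test with the scores from the trusted instrument for the same people. The sketch below is only an illustration under that assumption: the two lists of heights are made up, and `pearson_r` is a helper written for this example, not part of any particular package.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


# Made-up heights (inches) for the same eight people, measured two ways.
criterion = [62.0, 64.5, 66.0, 67.5, 69.0, 70.5, 72.0, 74.0]  # trusted instrument
candidate = [61.5, 65.0, 65.5, 68.0, 68.5, 71.0, 71.5, 74.5]  # instrument under test

r = pearson_r(candidate, criterion)
print(f"criterion validity coefficient r = {r:.3f}")
```

A coefficient near 1 means the candidate instrument tracks the criterion closely. How close is close enough depends on what you plan to do with the measurements.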

A tape measure, for example, is known to be an excellent instrument for measuring height. If you knew that a man in the United States wore shirts with 35" sleeves, and pants with 34" cuffs, you could bet that he was over 6' tall and be right more than 95% of the time. On the other hand, you might ask: “Why should I measure his cuff length and sleeve length in order to know most of the time, in general, how tall he is, when I could use a tape measure and know all of the time, precisely how tall he is?”

Indeed. If you want to measure someone’s height, use a tape measure. Don’t substitute a lot of fuzzy proxy variables for something that’s directly measurable by known, valid indicators. But if you want to measure things like acculturation or quality of life or socioeconomic class—things that don’t have universally accepted, valid indicators—then a complex measure will just have to do until something simpler comes along (box 2.4).

BOX 2.4

THE PRINCIPLE OF OCKHAM’S RAZOR

The preference in science for simpler explanations and measures over more complicated ones is called the principle of parsimony. It is also known as Ockham’s razor, after William of Ockham (1285-1349), a medieval philosopher who argued Pluralitas non est ponenda sine necessitate, or, loosely translated, “Don’t make things more complicated than they need to be.”

You can tap the power of criterion validity for complex constructs with the known-group comparison technique. If you develop a scale to measure political ideology, you might try it out on members of the American Civil Liberties Union and on members of the Christian Coalition of America. Members of the ACLU should get high “left” scores, and members of the CCA should get high “right” scores. If they don’t, there’s probably something wrong with the scale. In other words, the known-group scores are the criteria for the validity of your instrument.
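A known-group comparison can be sketched in a few lines. The example below assumes a hypothetical ideology scale running from 0 (far left) to 100 (far right) and uses made-up scores for two groups whose positions are known in advance. It checks that the group means fall in the expected order and reports a standardized mean difference (Cohen's d) as one possible summary of how well the scale separates the groups.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_b minus group_a), pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)


# Made-up scores on a hypothetical scale: 0 = far left, 100 = far right.
left_group = [12, 18, 25, 9, 30, 22, 15, 27]     # members of a known left-leaning group
right_group = [81, 74, 90, 68, 85, 77, 93, 70]   # members of a known right-leaning group

d = cohens_d(left_group, right_group)
print(f"mean (left group)  = {mean(left_group):.2f}")
print(f"mean (right group) = {mean(right_group):.2f}")
print(f"Cohen's d          = {d:.2f}")

if mean(left_group) >= mean(right_group):
    print("Groups are not in the expected order: check the scale.")
```

If the two groups overlapped heavily or came out in the wrong order, that would be the signal that something is probably wrong with the scale.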

A particularly strong form of criterion validity is predictive validity: whether an instrument lets you accurately predict something else you’re interested in. “Stress” is a complex construct. It occurs when people interpret events as threatening to their lives. Some people interpret a bad grade on an exam as a threat to their whole life, and others just blow it off. Now, stress is widely thought to produce a lowered immune response and increase the chances of getting sick. A really good measure of stress, then, ought to predict the likelihood of getting sick.

Remember the life insurance problem? You want to predict whether someone is likely to die in the next 365 days to know how much to charge them in premiums. Age and sex tell you a lot. But if you know their weight, whether they smoke, whether they exercise regularly, what their blood pressure is, whether they have ever had any one of a list of diseases, and whether they test-fly experimental aircraft for a living, then you can predict, with a higher and higher degree of accuracy, whether they will die within the next 365 days. Each piece of data, each component of a construct you might call “lifestyle,” adds to your ability to predict something of interest.
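The way each added component sharpens a prediction can be shown with a toy calculation. The sketch below is not how insurers actually price risk. It assumes twelve made-up records with hypothetical fields (`age_band`, `smoker`, `test_pilot`, `died`), groups people by whichever predictors are in use, predicts the most common outcome within each group, and reports how often that prediction matches the records.

```python
from collections import Counter, defaultdict


def stratified_accuracy(records, predictors, outcome="died"):
    """In-sample accuracy of predicting the majority outcome within each
    stratum defined by the chosen predictors."""
    strata = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[p] for p in predictors)  # () when no predictors are used
        strata[key][rec[outcome]] += 1
    # Within each stratum, the majority outcome is the prediction.
    correct = sum(max(counts.values()) for counts in strata.values())
    return correct / len(records)


# Made-up records; the field names are hypothetical.
FIELDS = ("age_band", "smoker", "test_pilot", "died")
rows = [
    ("old", True, False, True),
    ("old", True, False, True),
    ("old", False, False, False),
    ("old", False, False, False),
    ("old", False, True, True),
    ("old", True, False, True),
    ("young", False, False, False),
    ("young", False, False, False),
    ("young", True, False, False),
    ("young", True, False, True),
    ("young", False, True, True),
    ("young", False, False, False),
]
records = [dict(zip(FIELDS, row)) for row in rows]

steps = [
    [],
    ["age_band"],
    ["age_band", "smoker"],
    ["age_band", "smoker", "test_pilot"],
]
accuracies = [stratified_accuracy(records, step) for step in steps]
for step, acc in zip(steps, accuracies):
    label = ", ".join(step) or "(no predictors)"
    print(f"{label:35s} accuracy = {acc:.2f}")
```

Because the accuracy here is computed on the same records used to form the groups, it can only hold steady or rise as predictors are added. A fair test of predictive validity would check the predictions against new cases.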