Cano and Hobart have been two of the most vocal and consistent critics of the use of traditional psychometric methods to develop PROMs. They have also been two of the most ardent supporters of the use of “new psychometric methods.” In this section, I focus on Cano and Hobart’s (2011) suggestion for correcting PROMs’ current limitations.

While Hobart et al. agree that most of the PROMs in use lack theoretical development, they trace this error to CTT. The problem with CTT is that it does not provide the theoretical resources needed to model the measuring instrument, in this case a PROM. CTT theorizes that a person’s observed score on the scale is the sum of the unobserved score to be estimated, i.e., the person’s true score, plus measurement error (Hobart et al. 2007). Consider a physical functioning scale with a scoring range of 11-44, where higher scores indicate more limited functioning. Imagine that someone’s observed score was 23. CTT tells us that this observed score is the result of their true score plus measurement error. Respondents’ true scores are what we would like to know, but to get them, we need some idea of what, e.g., quality of life scores look like for this particular cohort of respondents (say, respondents with lung cancer). At the same time, we need some idea of what counts as instances of measurement error. For instance, does response shift count as an instance of measurement error or part of a person’s true score? Response shift is defined as the change in the meaning of one’s self-evaluation of a target construct (Schwartz and Sprangers 1999). A classic example of response shift is when a respondent becomes accustomed to her disease/illness/disability and recalibrates her internal standard of measurement. Is this recalibration best understood as measurement error—as much of the quality of life literature treats it—or is it best understood as a legitimate aspect of the quality of life construct, i.e., to be incorporated in our theory of quality of life? CTT leaves us unable to answer this question.^{[1]}
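The decomposition that CTT posits can be written compactly. The symbols X, T, and E below are conventional notation assumed here for illustration, not notation taken from Hobart et al.; the second line simply restates the physical functioning example.

```latex
% Assumed notation: X = observed score, T = true score, E = measurement error.
\[
  X = T + E
\]
% Physical functioning example (scale range 11--44, observed score 23):
\[
  23 = T + E
\]
```

A single equation with two unknowns cannot be solved from the observed score alone, which is why further assumptions about the true score or the error are needed.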

The problem with CTT is that it does not give us a theoretical ideal for the true score in the way that, say, the measurement of time has a theoretical ideal: the second is defined as the duration of exactly 9,192,631,770 periods of the radiation corresponding to a hyperfine transition of cesium-133 in the ground state (BIPM (Bureau International des Poids et Mesures) 2006). Nor are CTT’s target constructs sufficiently embedded within theories that would allow for approximations of measurement error in the way that time is embedded within physical theory. For example, the definition of the second assumes that cesium is in a flat space-time, but the cesium fountains (primary standards) that metrologists build are subject to gravitational redshift. Relativity theory helps us to estimate the error associated with these phenomena (Tal 2011). In other words, CTT does not provide a theoretical representation of the measurement interaction, i.e., the relationship between the construct of interest and its instruments (McClimans 2015).

In place of CTT, Hobart et al. argue for the use of new psychometric methods, particularly Rasch methodology. How is Rasch different from CTT? One important difference is that Rasch has an explicit mathematical model that provides a representation of the measurement interaction. Rasch measurement theory says that a person’s response to an item is determined by the difference between the person’s location on the ruler (i.e., how much ability they have) and the item’s location on the ruler (i.e., how much ability the item requires). In particular, Rasch states that the higher a person’s ability relative to the difficulty of an item, the higher the probability that the person will answer “yes” to that item. The Rasch scale runs from minus to plus infinity, with the zero point at the place where the difficulty of the items in the survey is equal to the ability of the sample population. Each item is located on the ruler at the point at which there is equal probability of respondents answering “yes” or “no” to that particular item. In Rasch, the probability of answering “yes” or “no” is modeled as a logistic function. The mathematical equation that governs this function is the model that represents Rasch measurement theory.
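For reference, the logistic function described above is usually written as follows for a yes/no item. This is the standard dichotomous form of the Rasch model; the symbols \theta_n (ability of person n), b_i (difficulty of item i), and X_{ni} (the response, with 1 for “yes”) are assumed notation, not drawn from Hobart et al.

```latex
\[
  P(X_{ni} = 1 \mid \theta_n, b_i)
    = \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}
\]
% When \theta_n = b_i the exponent is zero and the probability is 1/2,
% which is the point at which item i is located on the ruler.
```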

A second difference between Rasch and CTT is that Rasch prioritizes its mathematical model over the data. CTT defines what it is measuring, say, quality of life, by the items that are generated, usually from a qualitative sample. It is a descriptive approach. The Rasch model, however, operationally defines the construct one is trying to measure as a relationship between a person’s ability and the probability that they will answer “yes” to an item. When applied to a sample population, this model provides the characteristics and regression weights for selecting items and determining their difficulty, i.e., their place on the ruler (Stenner and Smith 1982). For example, if I ask respondents whether they can balance when seated and 95% say yes, and I ask whether they can climb a flight of stairs and only 60% say yes, then balancing when seated is considered less difficult (i.e., requires less ability) than climbing stairs. Climbing stairs will therefore sit farther toward the difficult end of the ruler than balancing when seated. Likewise, those who answer yes to more difficult items are considered to have more ability.
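As a rough illustration of how endorsement rates translate into positions on the ruler, the sketch below converts the two percentages from the example into log-odds (logits). It rests on a strong simplifying assumption, namely that every respondent sits at the zero point of the ruler, so it is only a crude first approximation and not the estimation procedure used in an actual Rasch analysis. The function name `approximate_difficulty` is hypothetical.

```python
import math

def approximate_difficulty(proportion_yes: float) -> float:
    """Crude item difficulty in logits from the share of "yes" answers.

    Assumption (not from the source): every respondent is treated as sitting
    at the zero point of the ruler, so difficulty = ln((1 - p) / p).
    """
    if not 0.0 < proportion_yes < 1.0:
        raise ValueError("proportion_yes must be strictly between 0 and 1")
    return math.log((1.0 - proportion_yes) / proportion_yes)

# Endorsement rates taken from the example in the text.
balance_when_seated = approximate_difficulty(0.95)
climb_stairs = approximate_difficulty(0.60)

# The item that fewer people endorse lands higher (more difficult) on the ruler.
print(balance_when_seated < climb_stairs)
```

Under this simplification, the item endorsed by 95% of respondents receives a lower logit value than the item endorsed by 60%, which matches the ordering described in the text.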

Within Rasch, the validity of a measure is determined by how well the data (i.e., respondent answers to survey questions) fit the predictions of the mathematical model applied to a sample (e.g., is balancing when seated less difficult than climbing stairs?). These predictions are made explicit in the construct specification equation in terms of the amount of variance that we should expect to find around the mean with respect to the balancing and climbing questions. If the model correctly describes the probability distributions of people responding to questions about balancing and climbing, then we can say that the observed rating scale data satisfy the measurement model, i.e., this is a valid measure. If the model does not correctly describe the observed data, then so much the worse for the data. Within the confines of Rasch, data that do not fit the model cannot be measured.

For Hobart et al., it is not only the validity of PROMs that Rasch improves but also their interpretability. As I discussed above, PROMs developed using CTT are notoriously difficult to interpret. The clinical significance of a 10-point increase on a particular scale is unclear in part because CTT can only deliver ordinal-level data. These PROMs are also difficult to interpret because they have dubious validity: if we do not know what something measures, it is difficult to interpret the significance of changes in scale scores. Rasch instruments, on the other hand, give each item a precise location on the scale. If a mobility scale is validated, then an improvement from -1.5 to 0.5 indicates an improvement from, say, having the ability to climb the stairs to having the ability to walk on uneven ground. Every increase or decrease in ability is tagged on a Rasch scale to items of various difficulties and thus provides estimates of clinical significance for each move up or down the ruler.
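To see how item locations anchor the interpretation of a change score, the sketch below evaluates the dichotomous Rasch probability for a person before and after the improvement in the mobility example. The item locations (stairs at -1.5, uneven ground at 0.5) are hypothetical values chosen to match that example, and `rasch_probability` and `endorsement_profile` are illustrative helpers, not part of any particular software package.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of answering "yes" under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical item locations in logits, chosen to match the example.
ITEMS = {"climb stairs": -1.5, "walk on uneven ground": 0.5}

def endorsement_profile(ability: float) -> dict:
    """Map each item to the modeled probability of a "yes" at this ability."""
    return {name: rasch_probability(ability, loc) for name, loc in ITEMS.items()}

before = endorsement_profile(-1.5)  # person located at the stairs item
after = endorsement_profile(0.5)    # person located at the uneven-ground item

for name in ITEMS:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

Under these assumed locations, the person moves from an even chance of managing stairs to an even chance of managing uneven ground, while the modeled probability of managing stairs rises to roughly 0.88.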

[1] For a longer discussion of the difficulties that CTT has with distinguishing between true scores and measurement error, see McClimans 2017.