The accuracy of subjective well-being measures
Accuracy is concerned with whether the measure in question accurately describes the qualities of the concept being measured. This, in turn, is usually assessed in terms of reliability and validity. Reliability concerns the extent to which a measure yields consistent results (i.e. whether it has a high signal-to-noise ratio). Validity, on the other hand, is about the extent to which it actually captures the underlying concept that it purports to measure (i.e. whether it is measuring the right thing). Some degree of reliability is a necessary but not sufficient condition for validity.
Reliability is a fundamental component of accuracy. For any statistical measure it is desirable that the measure produce the same results when carried out under the same circumstances. This is essential if the measure is to distinguish genuine changes in the condition being measured from changes that simply reflect measurement error. While no statistical measure is completely reliable, it would be of concern if measures of subjective well-being performed significantly worse than other commonly-used measures.
There are two main ways to measure reliability. Internal consistency reliability concerns the extent to which different items on an overall scale or measure agree with one another, and is assessed through examination of inter-item correlations. If the correlation between two items is high, this suggests that they capture the same underlying concept. If the correlation is low, on the other hand, it does not follow that both items are poor, but at least one of them must be.
The second approach involves looking at test-retest reliability, where the same question is administered to the same respondent more than once, separated by a fixed period of time. Test-retest reliability places a lower bound on the overall reliability of the measure, but not an upper bound. For example, a low test-retest score could indicate that a measure lacks reliability, but it could also be associated with a high level of actual reliability and a genuine change in the subject of interest.
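The point that a low test-retest score can reflect genuine change rather than poor reliability can be illustrated with a small simulation. This is a minimal sketch under classical test theory assumptions that are not stated in the text: each observed score is a true score plus independent random error, true scores have unit variance in both waves, and reliability is the share of observed variance due to the true score. The function names and parameters (`simulate_test_retest`, `reliability`, `true_stability`) are hypothetical and chosen for illustration only.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate_test_retest(n, reliability, true_stability, seed=0):
    """Simulate a two-wave survey and return the observed test-retest correlation.

    reliability    -- assumed share of observed variance that is true score, in (0, 1]
    true_stability -- assumed correlation between true scores at the two waves, in [0, 1]
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    if not 0 <= true_stability <= 1:
        raise ValueError("true_stability must be in [0, 1]")
    rng = random.Random(seed)
    # True-score variance is fixed at 1, so the error variance follows from reliability.
    error_sd = ((1 - reliability) / reliability) ** 0.5
    wave1, wave2 = [], []
    for _ in range(n):
        true1 = rng.gauss(0, 1)
        # The second-wave true score keeps unit variance and correlates with
        # the first-wave true score at true_stability.
        true2 = true_stability * true1 + (1 - true_stability ** 2) ** 0.5 * rng.gauss(0, 1)
        wave1.append(true1 + rng.gauss(0, error_sd))
        wave2.append(true2 + rng.gauss(0, error_sd))
    return pearson(wave1, wave2)


# Same reliability in both runs; only the amount of genuine change differs.
no_change = simulate_test_retest(20000, reliability=0.9, true_stability=1.0)
real_change = simulate_test_retest(20000, reliability=0.9, true_stability=0.6)
print(f"no genuine change:   {no_change:.2f}")
print(f"with genuine change: {real_change:.2f}")
```

Under these assumptions the observed correlation is approximately the product of reliability and true-score stability, so the second run yields a noticeably lower test-retest score even though the instrument is equally reliable in both.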
Both aspects of reliability have been extensively tested for measures of life evaluation and affect over the past twenty years. There is strong evidence for convergence between different life satisfaction measures. In a meta-review of the reliability and validity of subjective well-being measures, Diener (2011) reports a Cronbach’s alpha for multiple item measures of life satisfaction (including the Satisfaction With Life Scale) of between 0.8 and 0.96. A Cronbach’s alpha of 0.7 is typically taken to be the threshold of acceptable convergence, and the scores Diener reports indicate a very high degree of convergence between the different questions used in the life satisfaction scales.
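For readers unfamiliar with the statistic, the following is a minimal sketch of how Cronbach's alpha is commonly computed from item-level responses, using the standard formula based on the number of items, the sum of the item variances, and the variance of the total score. The function name `cronbach_alpha` and the response data are hypothetical; they are not drawn from any of the studies cited here.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a multi-item scale.

    items -- list of k lists; each inner list holds one item's scores
             for the same n respondents, in the same respondent order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    n = len(items[0])
    if any(len(scores) != n for scores in items):
        raise ValueError("every item must have a score for every respondent")
    # Sample variance of each item across respondents.
    item_variances = [variance(scores) for scores in items]
    # Each respondent's total score across all items.
    totals = [sum(respondent) for respondent in zip(*items)]
    total_variance = variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from six people to a three-item scale (0-10 each).
responses = [
    [7, 5, 8, 3, 9, 6],  # item 1
    [6, 5, 9, 2, 8, 7],  # item 2
    [8, 4, 7, 3, 9, 5],  # item 3
]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Alpha rises as the items move together: it equals 1 when every item gives identical scores and 0 when equal-variance items are uncorrelated, which is why values above the conventional 0.7 threshold are read as evidence that the items tap a common concept.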
Whilst it is not possible to compute Cronbach’s alpha for single item measures, other estimation procedures can be employed. Comparisons across countries using different measures of the same construct generally show slightly lower correlations, but are still relatively high given that the scores not only represent different questions, but are also sourced from different surveys. Bjørnskov (2010), for example, finds a correlation of 0.75 between the average Cantril Ladder measure of life evaluation from the Gallup World Poll and life satisfaction as measured in the World Values Survey for a sample of over 90 countries.
Test-retest results for single item life evaluation measures tend to yield correlations of between 0.5 and 0.7 for time periods of 1 day to 2 weeks (Krueger and Schkade, 2008). Michalos and Kahlke (2010) report that a single-item measure of life satisfaction had a correlation of 0.67 for a one-year period and of 0.65 for a two-year period. In a larger study, Lucas and Donnellan (2011) estimated the reliability of life satisfaction measures in four large representative samples with a combined sample size of over 68 000, taking into account specific errors. They found test-retest correlations in the range of 0.68 to 0.74, with a mean of 0.72 over a period of one year between reports.
Multiple item measures of subjective well-being generally do better than single questions on test-retest reliability. Krueger and Schkade (2008) report test-retest scores in the range of 0.83 to 0.84 for a period of 2 weeks to 1 month between tests, with correlations declining to 0.64 at 2 months and to 0.54 over 4 years. The pattern of decline here is as expected, with longer periods of time showing lower reliability due to a higher likelihood that there has been a genuine change in the respondent’s circumstances. In another study, the “satisfaction with life” scale (a multi-item measure of life satisfaction) showed a correlation coefficient of 0.56, dropping to 0.24 after 16 years (Fujita and Diener, 2005).
Generally speaking, country averages of measures of subjective well-being show higher levels of stability than do those for individuals. Diener (2011), for example, reports a correlation coefficient of 0.93 for the Cantril Ladder in the Gallup World Poll over 1 year, and a correlation of 0.91 across 4-year intervals.
There is less information available on the reliability of measures of affect and eudaimonic well-being than is the case for measures of life evaluation. However, the available information is largely consistent with the picture for life satisfaction. In terms of internal consistency reliability, Diener et al. (2009) report that their Psychological Well-Being scale has a Cronbach’s alpha of 0.86 (N = 568), whilst the positive, negative and affect balance subscales of their Scale of Positive and Negative Experience (SPANE) have alphas of 0.84, 0.80 and 0.88, respectively (N = 572, 567 and 566).
Test-retest reliability might be expected to be low for momentary affect, but higher for longer-term affective experiences. Krueger and Schkade (2008) report test-retest scores of between 0.5 and 0.7 for a range of different measures of affect over a 2-week period. Watson, Clark and Tellegen (1988) report slightly lower scores of between 0.39 and 0.71 for a range of different measures over an 8-week period. The lower scores are recorded by measures of momentary affect, while the upper scores are for questions focusing on affective states over a longer period of time, so the range of scores is consistent with expectations. Diener et al. (2009) report a correlation of 0.71 (N = 261) between measures of Psychological Well-Being issued one month apart, whilst the positive, negative and affect balance measures of the SPANE had coefficients of 0.62, 0.62 and 0.68, respectively (N = 261). Turning back to internal consistency, Clark and Senik (2011) report a Cronbach’s alpha of 0.63 for their eudaimonia measure, which is derived from items in the well-being module of the European Social Survey (N = over 30 000 respondents).