As noted, the measurement of clinical significance has received considerable attention in the psychotherapy literature. The focus in this field is typically on measuring clinical significance in terms of symptom change (e.g., changes in level of depression, anxiety, social phobia), and methods have been derived that focus on changes in symptomatology in specific clinical populations. However, measures of symptoms clearly also apply to interventions and populations beyond those found in the field of psychotherapy. For example, caregiver intervention studies often include measures such as indices of depression, anxiety, and health symptoms, for which symptom reduction would be a relevant clinically significant outcome.
A classic example of a measure of symptom reduction that has clinical meaning is the Patient Health Questionnaire-9 (PHQ-9), a screen for depression that is widely used in trials. This scale yields a symptom severity score, and scores can in turn be categorized as indicating no symptomatology or mild, moderate, moderately severe, or severe depression, mapping on to DSM-5 classifications of depression. Interventions designed to reduce depressive symptoms that use the PHQ-9 as an outcome measure can demonstrate clinical significance by showing the percentage of individuals who entered remission, who changed diagnostic categories, or who reduced symptomatology by 10 points, all approaches identified as having clinical relevance (Gitlin et al., 2013; Kroenke & Spitzer, 2002).
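As a rough illustration of the category-based approach, the following Python sketch assigns PHQ-9 totals to the conventional severity bands and reports the share of participants who moved to a less severe band or who ended below a remission threshold. The band boundaries follow the commonly published PHQ-9 scoring. The remission threshold (a posttreatment score below 5), the function names, and the example scores are illustrative assumptions rather than criteria taken from the studies cited above.

```python
# Conventional PHQ-9 severity bands for the 0-27 total score.
PHQ9_BANDS = [
    (0, 4, "none/minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]


def phq9_band(score):
    """Return (rank, label) of the severity band for a PHQ-9 total score."""
    for rank, (low, high, label) in enumerate(PHQ9_BANDS):
        if low <= score <= high:
            return rank, label
    raise ValueError(f"PHQ-9 total must be between 0 and 27, got {score}")


def summarize_change(pre_scores, post_scores, remission_below=5):
    """Percentage who moved to a less severe band and percentage in remission.

    `remission_below` is an assumed threshold, not a value from the source.
    """
    if not pre_scores or len(pre_scores) != len(post_scores):
        raise ValueError("pre and post scores must be non-empty and paired")
    n = len(pre_scores)
    improved = sum(
        phq9_band(post)[0] < phq9_band(pre)[0]
        for pre, post in zip(pre_scores, post_scores)
    )
    remitted = sum(post < remission_below for post in post_scores)
    return {
        "pct_improved_band": 100.0 * improved / n,
        "pct_remitted": 100.0 * remitted / n,
    }


# Hypothetical pre- and posttreatment totals for four participants.
pre = [18, 12, 22, 7]
post = [9, 12, 3, 4]
print(summarize_change(pre, post))
```

Reporting both percentages side by side makes it clear that a participant can change bands without reaching remission.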
Comparison methods, which involve comparisons of individuals who receive an intervention with other individuals (e.g., normative samples, dysfunctional samples), are commonly used to determine the clinical significance of changes in symptoms. These methods can also be used with other measures if comparative norms are available. A widely used comparison approach is the method developed by Jacobson and Truax (1991; see also Jacobson, Roberts, Berns, & McGlinchey, 1999), which is based on a change score approach in which intraindividual comparisons are made on an outcome measure pre- and posttreatment. Participants who receive the treatment are compared posttreatment with an untreated sample that had a similar level of dysfunction prior to treatment. The idea is that, following treatment, treated individuals will be significantly different from that group. The method assumes that there are two distributions for the outcome measure of interest: a functional distribution and a dysfunctional distribution. Using this method, there are two criteria for establishing clinically significant change. First, a cutoff point must be established for the outcome measure of interest, and a person’s posttest score must cross this cutoff point for that person to move from the dysfunctional to the functional group. The cutoff is typically a weighted midpoint between the means of the two distributions. For example, a depressed caregiver who is involved in a coping skills-training intervention must have a CES-D score following treatment that is more similar to a CES-D score for the general population than to the score of a depressed caregiver who has not received the intervention. Different criteria can be used to determine if the change in the treated individual is significantly different from the untreated dysfunctional group.
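The weighted midpoint can be sketched in a few lines of Python. The formula below is the one usually associated with the Jacobson and Truax approach, in which each group mean is weighted by the other group's standard deviation so that the cutoff lies the same number of standard deviations from both means. The function name and the normative values in the example are hypothetical and are not CES-D norms.

```python
def jt_cutoff(mean_functional, sd_functional, mean_dysfunctional, sd_dysfunctional):
    """Weighted midpoint between a functional and a dysfunctional distribution.

    Each mean is weighted by the other group's standard deviation, so the
    cutoff sits closer to the mean of the less variable group.
    """
    if sd_functional <= 0 or sd_dysfunctional <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        sd_functional * mean_dysfunctional + sd_dysfunctional * mean_functional
    ) / (sd_functional + sd_dysfunctional)


# Hypothetical norms on a symptom scale where higher scores mean more symptoms.
cutoff = jt_cutoff(
    mean_functional=10.0, sd_functional=5.0,
    mean_dysfunctional=30.0, sd_dysfunctional=10.0,
)
print(round(cutoff, 2))
```

With these illustrative values the cutoff falls closer to the functional mean because the functional distribution is the narrower of the two.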
Second, the change from pre- to posttest must be large enough to be reliable and not due to measurement error. Reliability is assessed by calculating the Reliable Change Index (RCI), which is based on the pretreatment score, the posttreatment score, and the standard error of the difference between the two scores. A common criterion is that an RCI beyond ±1.96 (i.e., a change larger than 1.96 standard errors of the difference) indicates reliable change. Using this method, one can determine the percentage of individuals who improved but did not recover, the percentage of individuals who are no longer depressed, and the percentage of individuals who remain unchanged or who have gotten worse. These percentages can then be compared between groups (e.g., treatment vs. control) using contingency table analyses to determine whether the observed differences between the groups in symptom improvement are statistically significant.
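A minimal Python sketch of the two criteria taken together is given below. It assumes the usual formulation in which the standard error of measurement is the pretreatment standard deviation times the square root of one minus the measure's reliability, and the standard error of the difference is the square root of twice the squared standard error of measurement. It also assumes a scale on which lower scores mean fewer symptoms. The cutoff is supplied as an argument, and the function names, category labels, and example values are hypothetical.

```python
import math
from collections import Counter


def reliable_change_index(pre, post, sd_pre, reliability):
    """RCI = (post - pre) / standard error of the difference score."""
    if sd_pre <= 0 or not 0 <= reliability < 1:
        raise ValueError("need sd_pre > 0 and 0 <= reliability < 1")
    se_measurement = sd_pre * math.sqrt(1.0 - reliability)
    se_difference = math.sqrt(2.0 * se_measurement ** 2)
    return (post - pre) / se_difference


def classify(pre, post, sd_pre, reliability, cutoff, critical=1.96):
    """Classify one participant, assuming lower scores mean fewer symptoms."""
    rci = reliable_change_index(pre, post, sd_pre, reliability)
    if rci < -critical:
        # Reliable improvement; recovery also requires crossing the cutoff.
        return "recovered" if post < cutoff else "improved"
    if rci > critical:
        return "deteriorated"
    return "unchanged"


# Hypothetical (pre, post) scores; sd_pre, reliability, and cutoff are assumed.
participants = [(30, 12), (30, 19), (30, 25), (20, 32)]
counts = Counter(
    classify(pre, post, sd_pre=10.0, reliability=0.85, cutoff=16.67)
    for pre, post in participants
)
for label, count in counts.items():
    print(label, f"{100.0 * count / len(participants):.0f}%")
```

The resulting counts per category for each study arm are what would be entered into the contingency table analysis described above.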
There are limitations to this approach. Of course, the method works best when adequate norms are available for chosen outcome measures for both the dysfunctional and functional populations. It is also difficult to make comparisons about the clinical significance of a given treatment across studies if different outcome measures are used (e.g., the CES-D [Radloff, 1977] vs. the Beck Depression Inventory [Beck, Ward, Mendelson, Mock, & Erbaugh, 1961]). The method is also limited to the extent that return to normal functioning is a feasible goal of the intervention. There may be some populations for which return to normal functioning is not possible (e.g., individuals with schizophrenia) or for whom other outcomes such as better coping skills or QoL may be more reasonable or of greater practical value. In addition, the method does not address a person’s level of functioning at the end of an intervention. A significant change in the level of a symptom does not necessarily mean that a person is functioning at a “normal” level. As noted by Kazdin (1994), using statistical criteria to determine a clinically important change is problematic, as is relying on paper-and-pencil tests to assess symptoms, because such tests may not adequately capture a person’s level of functioning.
Several alternatives to the Jacobson and Truax (JT) method have been proposed. These represent statistical refinements to the JT method and are designed to improve sensitivity in detecting clinically meaningful change. The Edwards-Nunnally method (McGlinchey, Atkins, & Jacobson, 2002) derives reliable change by observing an individual’s posttest score relative to an established confidence interval, which is intended to reduce problems with measurement error and misclassification of individuals. The hierarchical linear modeling (HLM) method (Speer, 2001) is useful for studies that have missing data points. Studies comparing the predictive utility of these methods (McGlinchey et al., 2002; Speer & Greenbaum, 1995) provide little evidence that the refinements yield different information or are superior to the JT approach.
An alternative comparison approach for estimating clinically significant change is based on normative comparison (Kendall et al., 1999), where the behavior or symptoms of individuals at posttreatment are compared to a sample of peers who are functioning well or without significant problems on the outcome measure of interest. In essence, normative comparisons are used to determine if treated individuals are indistinguishable from “well-functioning” individuals on the outcome measure(s) of interest. Clinical significance is defined as end-state functioning that falls within normal range on the critical dependent measures.
For example, Kazdin and colleagues (Kazdin, Siegel, & Bass, 1992) evaluated three interventions for children with aggression and antisocial behavior patterns: a problem-solving skills training (PSST) intervention, parent management training (PMT), and PSST + PMT. Treatment was provided to the children and/or their parents, and the outcome measures included standardized scales that were completed by both the children and parents and had available normative data. The investigators found that the 90th percentile cutoff on the measures from the normative sample best separated the clinical from the community samples. In the intervention study, scores at this percentile were therefore used to define the upper limit of the normal range, and clinically significant change was defined as scores that fell below this cutoff. Overall, they found that, although statistically significant change was evident across many measures, the findings were more modest when clinical significance (return to normative levels of function) was considered. For example, using the parent evaluation measure, 33% of the PSST group, 39% of the PMT group, and 64% of the combined treatment group returned to “normative” levels of performance.
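The percentile-based criterion can be sketched as follows in Python. The sketch derives a cutoff from a normative sample and reports the percentage of treated individuals whose posttreatment scores fall below it. It assumes a scale on which higher scores indicate more problem behavior. The data, function names, and the use of the standard library's default quantile interpolation are illustrative assumptions and do not reproduce the procedures or data of the study described above.

```python
import statistics


def normative_cutoff(normative_scores, percentile=90):
    """Score at the given percentile (1-99) of a normative sample."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    return statistics.quantiles(normative_scores, n=100)[percentile - 1]


def pct_in_normative_range(post_scores, cutoff):
    """Percentage of posttreatment scores below the cutoff (higher = worse)."""
    if not post_scores:
        raise ValueError("post_scores must be non-empty")
    return 100.0 * sum(score < cutoff for score in post_scores) / len(post_scores)


# Hypothetical normative sample and posttreatment scores for one study arm.
normative = list(range(1, 101))
treated_post = [50, 95, 80, 92]

cutoff = normative_cutoff(normative)
print(pct_in_normative_range(treated_post, cutoff))
```

Computing this percentage separately for each arm gives figures of the same kind as those reported for the three treatment groups.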
Typically, equivalence testing is employed to determine if an intervention group performs in a manner that is statistically equivalent to a functional sample. The use of equivalence testing requires the availability of a normative nonpatient sample that is comparable to the treatment group on key dimensions (e.g., age, ethnicity, socioeconomic status). Thus, careful consideration must be given to the selection of the normative group in terms of sample representativeness and sampling equivalence. For example, if the study is concerned with a weight loss intervention for obese adults, it is important that the normative data used for comparison are based on adults with characteristics similar to those of the obese sample included in the study (e.g., age, gender, height). Decisions about which group will serve as the reference group have a large impact on conclusions regarding clinical significance. It is also important that the normative data are current, as norms for various metrics can change. Another potential shortcoming of this approach is that there may be a lack of normative data for the outcome of interest.
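One common way to carry out an equivalence test is the two one-sided tests (TOST) procedure, sketched below in Python using summary statistics. The treated group is declared equivalent to the normative sample only if its mean is shown to lie both above the lower and below the upper bound of a prespecified equivalence margin. The sketch uses a large-sample normal approximation (a t-based version would be preferable for small samples), and the margin, function name, and example values are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def tost_equivalence(mean_t, sd_t, n_t, mean_n, sd_n, n_n, margin, alpha=0.05):
    """Two one-sided z tests for equivalence of two group means.

    Returns (p_value, is_equivalent). `margin` is the largest difference
    in means that would still be regarded as clinically negligible.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    diff = mean_t - mean_n
    se = math.sqrt(sd_t ** 2 / n_t + sd_n ** 2 / n_n)
    z = NormalDist()
    p_lower = 1.0 - z.cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = z.cdf((diff - margin) / se)        # H0: diff >= +margin
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Hypothetical posttreatment and normative summary statistics.
p, equivalent = tost_equivalence(
    mean_t=12.0, sd_t=6.0, n_t=100,
    mean_n=10.0, sd_n=5.0, n_n=400,
    margin=5.0,
)
print(f"p = {p:.4f}, equivalent = {equivalent}")
```

Because the conclusion depends directly on the normative mean, standard deviation, and margin, the choice of reference group discussed above carries through to the result of the test.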
One general issue with the comparison approaches relates to the clinical relevance of the measures used to evaluate treatment outcomes. For example, the Revised Memory and Behavior Problem Checklist (RMBPC) (Teri et al., 1992) is often used in caregiver intervention studies to assess the extent and severity of behavior problems in persons with dementia. A significant reduction in ratings of behavior problems by caregivers following an intervention does not necessarily equate with real changes in behavioral occurrences, a change in the caregiver’s level of distress, or an improvement in the quality of his or her life. Also, what constitutes a meaningful change in certain measures is unclear; for example, there is no agreed-upon cutoff score for many psychosocial measures such as burden or well-being. Furthermore, some people may experience a change in functioning that is not within normative limits, but the change makes a significant improvement in their everyday functioning. A caregiver may experience a reduction in the frequency of behavioral symptoms although the behaviors still persist. Nevertheless, the reduction in the frequency or severity of their occurrence may be of importance to the caregiver. A person who is severely depressed might experience a reduction in symptoms sufficient to allow a return to work even though he or she is still more depressed at the end of treatment than someone in the normative range.
A second issue is that most measures are unidimensional and tap constructs such as depression, anxiety, or burden, yet many intervention studies target multidimensional problems. Thus, one challenge is determining the measure or measures that best reflect that an intervention has achieved a clinically significant impact. This problem is compounded if there is discordance among measures. For example, a caregiver may not show any change in symptoms of depression but may report a decrease in burden and better coping skills.
As noted, domains other than symptoms also hold importance in defining clinical significance. Thus, symptoms are not the sole criterion for making judgments about clinical significance. There are other key constructs along which clinical significance could be evaluated, depending on the goals of the intervention. For example, an intervention might aim to increase mobility or amount of exercise, or to enhance coping skills or knowledge about a topic.