# P-VALUE (CONFIDENCE INTERVAL) FUNCTIONS

To illustrate how estimation does a better job of expressing strength of relation and precision, we describe a curve that is often called a P-value function but is also referred to as a confidence interval function. The P-value function enlarges on the concept of the P value. The P value is a statistic that can be viewed as a measure of the compatibility between the data in hand and the null hypothesis. We can enlarge on this concept by imagining that instead of testing just the null hypothesis, we also calculate a P value for a range of other hypotheses. Consider the rate ratio measure, which can range from 0 to infinity and equals 1.0 if the null hypothesis is correct. The ordinary P value is a measure of the consistency between the data and the hypothesis that RR = 1.0. Mathematically, however, we are not constrained to test only the hypothesis that RR = 1.0. For any set of data, we can in principle calculate a P value that measures the compatibility between those data and any value of RR. We can even calculate an infinite number of P values that test every possible value of RR. If we did so and plotted the results, we would end up with the P-value function. An example of a P-value function is given in Figure 8-1, which is based on the data in Table 8-1 describing a case-control study of drug exposure during pregnancy and congenital heart disease.1

Figure 8-1 P-value function for the case-control data in Table 8-1.

The curve in Figure 8-1, which resembles a tepee, plots the P value that tests the compatibility of the data in Table 8-1 with every possible value of RR. When RR = 1.0, the curve gives the P value testing the hypothesis that RR = 1.0; this is the usual P value testing the null hypothesis. For the data depicted in Figure 8-1, the ordinary P value is .08. This value would be described by many observers as not significant, because the P value is greater than .05. To many people, not significant implies that there is no relation between exposure and disease in the data. It is a fallacy, however, to infer a lack of association from a P value. The curve also gives the P values testing every other possible value of the RR, thus indicating the degree of compatibility between the data and every possible value of RR. The full P-value function in Figure 8-1 makes it clear that there is a strong association in the data, despite the ordinary P value being greater than .05. Where the curve reaches its maximum (for which P = 1.0), the value of RR at that point is the value most compatible with the observed data. This RR value is called the point estimate. In Figure 8-1, the point estimate is RR = 3.2. As the RR departs from the point estimate in either direction, the corresponding P values decline, indicating less compatibility between the data and these relative risk hypotheses. The curve provides a quantitative overview of the statistical relation between exposure and disease. It indicates the best single value for the RR based on the data, and it gives a visual appreciation for the degree of precision of the estimate, which is indicated by the narrowness or the breadth of the tepee.

Table 8-1 Case-Control Data for Congenital Heart Disease and Chlordiazepoxide Use in Early Pregnancy

| | Chlordiazepoxide use: Yes | Chlordiazepoxide use: No | Total |
|---|---|---|---|
| Cases | 4 | 386 | 390 |
| Controls | 4 | 1250 | 1254 |
| Total | 8 | 1636 | 1644 |

OR = (4 × 1250)/(4 × 386) = 3.2

Data from Rothman et al.
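The construction of a P-value function can be sketched in a few lines of Python. This is only an illustration under a stated assumption: it uses the Wald (normal) approximation on the log odds ratio scale, which is not necessarily the method behind Figure 8-1, and the function name `wald_p_value` is our own. With cell counts as small as 4 the approximation is rough; it gives a null P value of about .10 rather than the .08 quoted in the text, but the shape and location of the curve are the same.

```python
import math

def wald_p_value(a, b, c, d, rr_hypothesis):
    """Two-sided P value testing OR = rr_hypothesis.

    Assumption: Wald (normal) approximation on the log scale.
    a, b = exposed and unexposed cases; c, d = exposed and unexposed controls.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = (log_or - math.log(rr_hypothesis)) / se
    # erfc(|z| / sqrt(2)) equals the two-sided normal tail area
    return math.erfc(abs(z) / math.sqrt(2))

# Counts from Table 8-1
a, b, c, d = 4, 386, 4, 1250
point_estimate = (a * d) / (b * c)

# Evaluate the function at a handful of hypothesized values of RR
for rr in (0.5, 1.0, 2.0, point_estimate, 5.0, 10.5, 20.0):
    print(f"RR = {rr:5.2f}   P = {wald_p_value(a, b, c, d, rr):.3f}")
```

Evaluating the function on a fine grid of RR values and plotting P against RR on a logarithmic horizontal axis would trace out the tepee shape described above.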

For those who rely on statistical significance for their interpretation of data, the ordinary P value (testing the hypothesis that RR = 1.0) of .08 in Figure 8-1 may be taken to imply that there is no relation between exposure and disease. But that interpretation is already contradicted by the point estimate, which indicates that the best estimate is more than a threefold increase in risk among those who are exposed. Moreover, the P-value function shows that values of RR that are reasonably compatible with the data extend over a wide range, from roughly RR = 1 to RR = 10. The P value for RR = 1 is identical to the P value for RR = 10.5, so there is no reason to prefer the interpretation of RR = 1 over the interpretation that RR = 10.5. A better estimate than either of these is RR = 3.2, the point estimate. The main lesson here is how misleading it can be to try to base an inference on a test of statistical significance, or, for that matter, on a P value.

The lesson is reinforced when we consider another P-value function that describes a set of hypothetical data given in Table 8-2. These hypothetical data lead to a narrow P-value function that reaches a peak slightly above the null value, RR = 1. Figure 8-2 contrasts the P-value function for the data in Table 8-2 with the P-value function given earlier for the data in Table 8-1. The narrowness of the second P-value function reflects the larger size of the second set of data. Large size translates to better precision, for which the visual counterpart is the narrow P-value function.

There is a striking contrast in messages from these two P-value functions. The first function suggests that the data are imprecise but reflect an association that is strong; the data are readily compatible with a wide range of effects, from very little or nothing to more than a 10-fold increase in risk. The first set of data thus raises the possibility that the exposure is a strong risk factor. Although the data do not permit a precise estimate of effect, the range of effect values consistent with the data includes mostly strong effects that would warrant concern about the exposure. This concern comes from data that give a "nonsignificant" result for a test of the null hypothesis. In contrast, the other set of data, from Table 8-2, gives a precise estimate of an effect that is close to the null. The data are not very compatible with a strong effect and, indeed, may be interpreted as reassuring about the absence of a strong effect. Despite this reassurance, the P value testing the null hypothesis is .04; a test of the null hypothesis would give a "statistically significant" result, rejecting the null hypothesis. In both cases, reliance on the significance test would be misleading and conducive to an incorrect interpretation. In the first case, the association is "not significant," but the study is properly interpreted as raising concern about the effect of the exposure. In the second case, the study provides reassurance about the absence of a strong effect, but the significance test gives a result that is "significant," rejecting the null hypothesis. This perverse behavior of the significance test should serve as a warning against using significance tests to interpret data.

Table 8-2 Hypothetical Case-Control Data

| | Exposure: Yes | Exposure: No | Total |
|---|---|---|---|
| Cases | 1,090 | 14,910 | 16,000 |
| Controls | 1,000 | 15,000 | 16,000 |
| Total | 2,090 | 29,910 | 32,000 |

OR = (1,090 × 15,000)/(1,000 × 14,910) = 1.1
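The contrast between the two data sets can be checked numerically. The sketch below again assumes a Wald approximation on the log odds ratio scale (our choice for illustration, not necessarily the method used for the figures), and `odds_ratio_summary` is a hypothetical helper name. It returns the point estimate, the standard error of the log odds ratio (a measure of precision), and the P value testing the null hypothesis.

```python
import math

def odds_ratio_summary(a, b, c, d):
    """Return (odds ratio, SE of log odds ratio, two-sided null P value).

    Assumption: Wald approximation on the log scale.
    a, b = exposed and unexposed cases; c, d = exposed and unexposed controls.
    """
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = math.log(odds_ratio) / se
    p_null = math.erfc(abs(z) / math.sqrt(2))
    return odds_ratio, se, p_null

table_8_1 = odds_ratio_summary(4, 386, 4, 1250)
table_8_2 = odds_ratio_summary(1090, 14910, 1000, 15000)

for name, (odds_ratio, se, p_null) in (("Table 8-1", table_8_1),
                                       ("Table 8-2", table_8_2)):
    print(f"{name}: OR = {odds_ratio:.2f}, SE(log OR) = {se:.3f}, null P = {p_null:.3f}")
```

Under this approximation the small study yields a large estimate with a large standard error and a "nonsignificant" P value, whereas the large study yields an estimate near the null with a small standard error and a P value just under .05.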

Although it may superficially seem like a sophisticated application of quantitative methods, significance testing is only a qualitative proposition. The end result is a declaration of "significant" or "not significant" that provides no quantitative clue about the size of the effect. Contrast that approach with the P-value function, which is a quantitative visual message about the estimated size of the effect. The message comes in two parts, one relating to the strength of the effect and the other to precision. Strength is conveyed by the location of the curve along the horizontal axis and precision by the amount of spread of the function around the point estimate.

Figure 8-2 P-value function for the data in Table 8-1 and the hypothetical case-control data in Table 8-2.

Because the P value is only one number, it cannot convey two separate quantitative messages. To get the message about both strength of effect and precision, at least two numbers are required. Perhaps the most straightforward way to get both messages is from the upper and lower confidence limits, the two numbers that form the boundaries to a confidence interval. The P-value function is closely related to the set of all confidence intervals for a given estimate. This relation is depicted in Figure 8-3, which shows three different confidence intervals for the data in Figure 8-1. These three confidence intervals differ only in the arbitrary level of confidence that determines the width of the interval. In Figure 8-3, the 95% confidence interval can be read from the curve along the horizontal line where P = .05 and the 90% and 80% intervals along the lines where P = .1 and .2, respectively. The different confidence intervals in Figure 8-3 reflect the same degree of precision but differ in their width only because the level of confidence for each is arbitrarily different. The three confidence intervals depicted in Figure 8-3 are described as nested confidence intervals. The P-value function is a graph of all possible nested confidence intervals for a given estimate, reflecting all possible levels of confidence between 0% and 100%. It is this ability to find all possible confidence intervals from a P-value function that leads to its description as either a P-value function or a confidence interval function.
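The nesting of confidence intervals can be illustrated with the Table 8-1 data. The sketch below assumes Wald intervals on the log odds ratio scale, so the limits it prints may differ somewhat from those drawn in Figure 8-3; the helper names `wald_interval` and `p_value` are ours. It also checks the relation described above: the P value at each limit of a confidence interval equals 1 minus the confidence level.

```python
import math
from statistics import NormalDist

# Table 8-1: exposed cases, unexposed cases, exposed controls, unexposed controls
A, B, C, D = 4, 386, 4, 1250
LOG_OR = math.log((A * D) / (B * C))
SE = math.sqrt(1 / A + 1 / B + 1 / C + 1 / D)

def wald_interval(level):
    """Wald confidence limits for the odds ratio at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return math.exp(LOG_OR - z * SE), math.exp(LOG_OR + z * SE)

def p_value(rr):
    """Two-sided Wald P value testing OR = rr."""
    z = (LOG_OR - math.log(rr)) / SE
    return math.erfc(abs(z) / math.sqrt(2))

intervals = {level: wald_interval(level) for level in (0.80, 0.90, 0.95)}
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} CI: {lower:.2f} to {upper:.2f}; "
          f"P at limits = {p_value(lower):.2f}, {p_value(upper):.2f}")
```

Each interval contains the one computed at the next lower level of confidence, which is what "nested" means here.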

Figure 8-3 P-value function for the data from Table 8-1, showing how nested confidence intervals can be read from the curve.

It is common to see confidence intervals reported for an epidemiologic measure, but it is uncommon to see a full P-value function or confidence interval function. Fortunately, it is not necessary to calculate and display a full P-value function to infer the two quantitative messages, strength of relation and precision, for an estimate. A single confidence interval is sufficient, because the upper and lower confidence bounds from a single interval are sufficient to determine the entire P-value function. If we know the lower and upper limit to the confidence interval, we know the location of the P-value function along the horizontal axis and the spread of the function. Thus, from a single confidence interval, we can construct an entire P-value function. We do not need to go through the labor of calculating this function if we can visualize the two messages that it can convey directly from the confidence interval.
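One way to make this reconstruction explicit is sketched below. It rests on an assumption that the text does not state: that the interval was computed symmetrically on the logarithmic scale using a normal approximation. The symbols are ours: RR_L and RR_U are the lower and upper limits of a confidence interval with confidence level 1 − α, z is the corresponding standard normal quantile, Φ is the standard normal distribution function, and RR_0 is any hypothesized value.

```latex
\begin{align*}
\widehat{RR} &= \sqrt{RR_L \cdot RR_U}
  && \text{(location of the peak)} \\
SE &= \frac{\ln RR_U - \ln RR_L}{2\, z_{1-\alpha/2}}
  && \text{(spread of the function)} \\
P(RR_0) &= 2\left[1 - \Phi\!\left(\frac{\lvert \ln \widehat{RR} - \ln RR_0 \rvert}{SE}\right)\right]
  && \text{(height of the curve at } RR_0\text{)}
\end{align*}
```

Under that assumption, the two limits fix both the location and the spread, and the third line gives the P value for every other hypothesized value of RR.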

Regrettably, confidence intervals are too often not interpreted with the image of the corresponding P-value function in mind. A confidence interval can unfortunately be used as a surrogate test of statistical significance: a confidence interval that contains the null value within it corresponds to a significance test that is "not significant," and a confidence interval that excludes the null value corresponds to a significance test that is "significant." The allure of significance testing is so strong that many people use a confidence interval merely to determine "significance" and thereby ignore the potentially useful quantitative information that the confidence interval provides.

Example: Is Flutamide Effective in Treating Prostate Cancer?

In a randomized trial of flutamide, which is used to treat prostate cancer, Eisenberger et al.2 reported that patients who received flutamide fared no better than those who received placebo. Their interpretation that flutamide was ineffective contradicted the results of 10 previous studies, which collectively had pointed to a modest benefit. The 10 previous studies, in aggregate, indicated about an 11% survival advantage for patients receiving flutamide [odds ratio (OR) = 0.89]. The actual data reported by Eisenberger et al. are given in Table 8-3. From these data, we can calculate an OR of 0.87, almost the same result (slightly better) as was obtained in the 10 earlier studies. (We usually calculate odds ratios only for case-control data; for data such as these from an experiment, we normally calculate risk ratios or mortality rate ratios. The meta-analysis of the first 10 experiments on flutamide, however, reported only the OR, so we use that measure also for consistency.) Why did Eisenberger et al.2 interpret their data to indicate no effect when the data indicated about the same beneficial effect as the 10 previous studies? They based their conclusion solely on a test of statistical significance, which gave a result of P = .14. By focusing on statistical significance testing, they ignored the small beneficial effect in their data and came to an incorrect interpretation.

Table 8-3 Summary of Survival Data from the Study of Flutamide and Prostate Cancer

| | Flutamide | Placebo |
|---|---|---|
| Died | 468 | 480 |
| Survived | 229 | 205 |
| Total | 697 | 685 |

OR = 0.87; 95% CI: 0.70-1.10

Data from Eisenberger et al.
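The odds ratio and confidence interval in Table 8-3 can be reproduced from the four cell counts. The sketch below assumes a Wald interval on the log odds ratio scale, which happens to agree with the tabulated limits to two decimal places; the variable names are ours. It does not attempt to reproduce the P value of .14 reported by the investigators, which may have come from a different analysis.

```python
import math
from statistics import NormalDist

# Cell counts from Table 8-3
died_flutamide, died_placebo = 468, 480
survived_flutamide, survived_placebo = 229, 205

# Odds of death with flutamide relative to odds of death with placebo
odds_ratio = (died_flutamide * survived_placebo) / (died_placebo * survived_flutamide)

# Assumption: Wald standard error of the log odds ratio
se_log_or = math.sqrt(1 / died_flutamide + 1 / died_placebo
                      + 1 / survived_flutamide + 1 / survived_placebo)

z95 = NormalDist().inv_cdf(0.975)
lower = math.exp(math.log(odds_ratio) - z95 * se_log_or)
upper = math.exp(math.log(odds_ratio) + z95 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI: {lower:.2f} to {upper:.2f}")

# The pooled estimate from the 10 earlier studies (OR = 0.89) lies inside this interval
print("Interval includes 0.89:", lower < 0.89 < upper)
```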

The original 10 studies on flutamide were published in a review that summarized the results.3 It is helpful to examine the P-value function from these 10 studies and to compare it with the P-value function after adding the study of Eisenberger et al.2 to the earlier studies (Fig. 8-4).4 The only change apparent from adding the data of Eisenberger et al.2 is a slightly improved precision of the estimated benefit of flutamide in reducing the risk of dying from prostate cancer.

Figure 8-4 P-value functions for the first 10 studies of flutamide and prostate cancer survival (solid line)3 and for the first 11 studies (dashed line) after adding the study by Eisenberger et al.2 The study by Eisenberger et al. did not shift the overall findings toward the null value but instead shifted the overall findings a minuscule step away from the null value. Nevertheless, because of an inappropriate reliance on statistical significance testing, the data were incorrectly interpreted as refuting earlier studies and indicating no effect of flutamide, despite the fact that the findings replicated previous results. (Reproduced with permission from Rothman et al.4)

Example: Is St. John's Wort Effective in Relieving Major Depression?

Extracts of St. John's Wort (Hypericum perforatum), a small, flowering weed, have long been used as a folk remedy. It is a popular herbal treatment for depression. Shelton et al.5 reported the results of a randomized trial of 200 patients with major depression who were randomly assigned to receive either St. John's Wort or placebo. Of 98 who received St. John's Wort, 26 responded positively, whereas 19 of the 102 who received placebo responded positively. Among those whose depression was relatively less severe at entry into the study (a group that the investigators thought might be more likely to show an effect of St. John's Wort), the proportion of patients who had remission of disease was twice as great among the 59 patients who received St. John's Wort as among the 50 who received a placebo (Table 8-4).

Table 8-4 Remissions Among Patients with Less Severe Depression

| | St. John's Wort | Placebo |
|---|---|---|
| Remission | 12 | 5 |
| No remission | 47 | 45 |
| Total | 59 | 50 |

RR = 2.0; 90% CI: 0.90-4.6

Data from Shelton et al.5

In Table 8-4, risk ratio refers to the "risk" of having a remission in symptoms, which is an improvement, so any increase above 1.0 indicates a beneficial effect of St. John's Wort; the RR of 2.0 indicates that the probability of a remission was twice as great for those receiving St. John's Wort. Despite these and other encouraging findings in the data, the investigators based their interpretation on a lack of statistical significance and concluded that St. John's Wort was not effective. A look at the P-value function that corresponds to the data in Table 8-4 is instructive (Fig. 8-5).
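The risk ratio and 90% confidence interval in Table 8-4 can be reproduced from the counts. Because these are cohort-type data from a trial, the sketch below uses the usual large-sample standard error for the log of a risk ratio rather than the odds ratio formula; this choice is our assumption, although it agrees with the tabulated limits. The variable names are ours.

```python
import math
from statistics import NormalDist

# Counts from Table 8-4
remission_sjw, total_sjw = 12, 59
remission_placebo, total_placebo = 5, 50

# Ratio of remission proportions, St. John's Wort relative to placebo
risk_ratio = (remission_sjw / total_sjw) / (remission_placebo / total_placebo)

# Assumption: large-sample standard error of the log risk ratio
se_log_rr = math.sqrt(1 / remission_sjw - 1 / total_sjw
                      + 1 / remission_placebo - 1 / total_placebo)

z90 = NormalDist().inv_cdf(0.95)  # two-sided 90% interval
lower = math.exp(math.log(risk_ratio) - z90 * se_log_rr)
upper = math.exp(math.log(risk_ratio) + z90 * se_log_rr)

print(f"RR = {risk_ratio:.1f}, 90% CI: {lower:.2f} to {upper:.1f}")
```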

Figure 8-5 P-value function for the effect of St. John's Wort on remission from major depression among relatively less severely affected patients. (Data from Shelton et al.5)

Figure 8-5 shows that the data regarding remissions among the less severely affected patients hardly support the theory that St. John's Wort is ineffective. The data for other outcomes were also generally favorable for St. John's Wort but, for almost all comparisons, not statistically significant. Instead of concluding, as they should have, that these data are readily compatible with moderate and even strong beneficial effects of St. John's Wort, the investigators drew the wrong conclusion, based on the lack of statistical significance in the data. Although the P value from this study is not statistically significant, the P value for the null hypothesis has the same magnitude as the P value testing the hypothesis that the RR = 4.1 (on the graph, the dashed line intersects the P-value function at RR = 1.0 and RR = 4.1). Although the investigators interpreted the data as supporting the hypothesis that RR = 1.0, the data are equally compatible with values of 1.0 or 4.1. Furthermore, it is not necessary to construct the P-value function in Figure 8-5 to reach this interpretation. An investigator need look no farther than the confidence interval given in Table 8-4 to appreciate the location and the spread of the underlying P-value function.
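As a rough check on this reading, the P-value function can be rebuilt from nothing more than the interval in Table 8-4. The sketch assumes that the interval is symmetric on the log scale and based on a normal approximation, which is our assumption rather than something stated in the source; `p_value_function_from_ci` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def p_value_function_from_ci(lower, upper, level):
    """Rebuild a P-value function from one confidence interval.

    Assumption: the interval is symmetric on the log scale (normal approximation).
    Returns the implied point estimate and a function giving the two-sided
    P value for any hypothesized ratio.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    log_estimate = (math.log(lower) + math.log(upper)) / 2
    se = (math.log(upper) - math.log(lower)) / (2 * z)

    def p_value(rr):
        return math.erfc(abs(log_estimate - math.log(rr)) / se / math.sqrt(2))

    return math.exp(log_estimate), p_value

# 90% confidence interval from Table 8-4
estimate, p_value = p_value_function_from_ci(0.90, 4.6, 0.90)

# The value on the far side of the peak that has the same P value as RR = 1.0
mirror_of_null = estimate ** 2

print(f"Implied point estimate: {estimate:.2f}")
print(f"P at RR = 1.0: {p_value(1.0):.3f}")
print(f"P at RR = {mirror_of_null:.1f}: {p_value(mirror_of_null):.3f}")
```

Under this approximation the hypothesis that RR is about 4.1 has the same P value as the null hypothesis, in line with the dashed line described for Figure 8-5.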