P-VALUE (CONFIDENCE INTERVAL) FUNCTIONS
To illustrate how estimation does a better job of expressing strength of relation and precision, we describe a curve that is often called a P-value function but is also referred to as a confidence interval function. The P-value function enlarges on the concept of the P value. The P value is a statistic that can be viewed as a measure of the compatibility between the data in hand and the null hypothesis. We can enlarge on this concept by imagining that instead of testing just the null hypothesis, we also calculate a P value for a range of other hypotheses. Consider the rate ratio measure, which can range from 0 to infinity and equals 1.0 if the null hypothesis is correct. The ordinary P value is a measure of the consistency between the data and the hypothesis that RR = 1.0. Mathematically, however, we are not constrained to test only the hypothesis that RR = 1.0. For any set of data, we can in principle calculate a P value that measures the compatibility between those data and any value of RR. We can even calculate an infinite number of P values that test every possible value of RR. If we did so and plotted the results, we would end up with the P-value function. An example of a P-value function is given in Figure 8-1, which is based on the data in Table 8-1 describing a case-control study of drug exposure during pregnancy and congenital heart disease.^{1}
Figure 8-1 P-value function for the case-control data in Table 8-1.
The curve in Figure 8-1, which resembles a tepee, plots the P value that tests the compatibility of the data in Table 8-1 with every possible value of RR. When RR = 1.0, the curve gives the P value testing the hypothesis that RR = 1.0; this is the usual P value testing the null hypothesis. For the data depicted in Figure 8-1, the ordinary P value is .08. This value would be described by many observers as not significant, because the P value is greater than .05. To many people, not significant implies that there is no relation between exposure and disease in the data. It is a fallacy, however, to infer a lack of association from a P value. The curve also gives the P values testing every other possible value of the RR, thus indicating the degree of compatibility between the data and every possible value of RR. The full P-value function in Figure 8-1 makes it clear that there is a strong association in
Table 8-1 Case-Control Data for Congenital Heart Disease and Chlordiazepoxide Use in Early Pregnancy
              Chlordiazepoxide Use
             Yes        No     Total
Cases          4       386       390
Controls       4     1,250     1,254
Total          8     1,636     1,644

OR = (4 × 1,250)/(4 × 386) = 3.2
Data from Rothman et al.
the data, despite the ordinary P value being greater than .05. Where the curve reaches its maximum (for which P = 1.0), the value of RR at that point is the value most compatible with the observed data. This RR value is called the point estimate. In Figure 8-1, the point estimate is RR = 3.2. As the RR departs from the point estimate in either direction, the corresponding P values decline, indicating less compatibility between the data and these relative risk hypotheses. The curve provides a quantitative overview of the statistical relation between exposure and disease. It indicates the best single value for the RR based on the data, and it gives a visual appreciation for the degree of precision of the estimate, which is indicated by the narrowness or the breadth of the tepee.
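For case-control data such as these, the point estimate is the cross-product (odds) ratio shown beneath Table 8-1. A minimal sketch (the variable names are ours, not from the text):

```python
# Cross-product (odds) ratio for the Table 8-1 case-control data.
cases_exposed, cases_unexposed = 4, 386
controls_exposed, controls_unexposed = 4, 1250

odds_ratio = (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)
print(round(odds_ratio, 1))  # → 3.2
```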
For those who rely on statistical significance for their interpretation of data, the ordinary P value (testing the hypothesis that RR = 1.0) of .08 in Figure 8-1 may be taken to imply that there is no relation between exposure and disease. But that interpretation is already contradicted by the point estimate, which indicates that the best estimate is more than a threefold increase in risk among those who are exposed. Moreover, the P-value function shows that values of RR that are reasonably compatible with the data extend over a wide range, from roughly RR = 1 to RR = 10. The P value for RR = 1 is identical to the P value for RR = 10.5, so there is no reason to prefer the interpretation of RR = 1 over the interpretation that RR = 10.5. A better estimate than either of these is RR = 3.2, the point estimate. The main lesson here is how misleading it can be to try to base an inference on a test of statistical significance, or, for that matter, on a P value.
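The curve itself can be sketched numerically. The snippet below uses a normal approximation on the log odds-ratio scale (Woolf's standard error) as a stand-in for whatever method produced Figure 8-1; it gives a null P value of about .10 rather than the .08 quoted in the text, but it reproduces the key qualitative features: P = 1.0 at the point estimate, and equal P values at RR = 1 and at the mirror-image value near 10.5.

```python
import math

# Approximate P-value function for the Table 8-1 data, using a normal
# approximation on the log odds-ratio scale (Woolf's standard error).
a, b, c, d = 4, 386, 4, 1250           # cell counts from Table 8-1
log_or = math.log((a * d) / (b * c))   # log of the point estimate, ln 3.2
se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of the log odds ratio

def p_value(rr):
    """Two-sided P value testing the hypothesis OR = rr."""
    z = abs(log_or - math.log(rr)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

for rr in (1.0, 3.2, 10.5):
    print(f"OR = {rr:4.1f}: P = {p_value(rr):.2f}")
```

Plotting `p_value` over a fine grid of OR values would trace out the tepee-shaped curve.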
The lesson is reinforced when we consider another P-value function that describes a set of hypothetical data given in Table 8-2. These hypothetical data lead to a narrow P-value function that reaches a peak slightly above the null value, RR = 1. Figure 8-2 contrasts the P-value function for the data in Table 8-2 with the P-value function given earlier for the data in Table 8-1. The narrowness of the second P-value function reflects the larger size of the second set of data. Large size translates to better precision, for which the visual counterpart is the narrow P-value function.
There is a striking contrast in messages from these two P-value functions. The first function suggests that the data are imprecise but reflect an association that is strong; the data are readily compatible with a wide range of effects, from very little or nothing to more than a 10-fold increase in risk. The first set of data thus raises the possibility that the exposure is a strong risk factor. Although the data do not permit a precise estimate of effect, the range of effect values consistent with the data includes mostly strong effects that would warrant concern about the exposure. This concern comes from data that give a "nonsignificant" result for a test of the null hypothesis. In contrast, the other set of data, from Table 8-2, gives a precise estimate of an effect that is close to the null. The data are not very compatible with a strong effect and, indeed, may be interpreted as reassuring about the absence of a strong effect. Despite this reassurance, the P value testing the null hypothesis is .04; a test of the null hypothesis would give a "statistically significant" result, rejecting the null hypothesis. In both cases, reliance on the significance test would be misleading and conducive to an incorrect interpretation. In the first case, the association is "not significant," but the study is properly interpreted as raising concern about the effect of the exposure. In the second case, the study provides reassurance about the absence of a strong effect, but the
Table 8-2 Hypothetical Case-Control Data
              Exposure
             Yes         No      Total
Cases      1,090     14,910     16,000
Controls   1,000     15,000     16,000
Total      2,090     29,910     32,000
OR = (1,090 × 15,000)/(1,000 × 14,910) = 1.1
significance test gives a result that is "significant," rejecting the null hypothesis. This perverse behavior of the significance test should serve as a warning against using significance tests to interpret data.
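The numbers behind this second curve can be checked the same way. Under a normal approximation on the log odds-ratio scale (our sketch, not necessarily the method used to draw Figure 8-2), the Table 8-2 data give an OR of 1.1 and a null-hypothesis P value of .04:

```python
import math

# Point estimate and null P value for the hypothetical data in Table 8-2.
a, b, c, d = 1090, 14910, 1000, 15000  # cell counts from Table 8-2
log_or = math.log((a * d) / (b * c))
se = math.sqrt(1/a + 1/b + 1/c + 1/d)
z = log_or / se
p_null = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"OR = {math.exp(log_or):.1f}, P = {p_null:.2f}")  # → OR = 1.1, P = 0.04
```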
Although it may superficially seem like a sophisticated application of quantitative methods, significance testing is only a qualitative proposition. The end result is a declaration of "significant" or "not significant" that provides no quantitative clue about the size of the effect. Contrast that approach with the P-value function, which is a quantitative visual message about the estimated size of the effect. The message comes in two parts, one relating to the strength of the effect and the other to precision. Strength is conveyed by the location of the curve along the horizontal axis and precision by the amount of spread of the function around the point estimate.
Because the P value is only one number, it cannot convey two separate quantitative messages. To get the message about both strength of effect and precision,
Figure 8-2 P-value functions for the data in Table 8-1 and the hypothetical case-control data in Table 8-2.
at least two numbers are required. Perhaps the most straightforward way to get both messages is from the upper and lower confidence limits, the two numbers that form the boundaries of a confidence interval. The P-value function is closely related to the set of all confidence intervals for a given estimate. This relation is depicted in Figure 8-3, which shows three different confidence intervals for the data in Figure 8-1. These three confidence intervals differ only in the arbitrary level of confidence that determines the width of the interval. In Figure 8-3, the 95% confidence interval can be read from the curve along the horizontal line where P = .05, and the 90% and 80% intervals along the lines where P = .1 and .2, respectively. The different confidence intervals in Figure 8-3 reflect the same degree of precision but differ in their width only because the level of confidence for each is arbitrarily different. The three confidence intervals depicted in Figure 8-3 are described as nested confidence intervals. The P-value function is a graph of all possible nested confidence intervals for a given estimate, reflecting all possible levels of confidence between 0% and 100%. It is this ability to find all possible confidence intervals from a P-value function that leads to its description as either a P-value function or a confidence interval function.
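Reading nested intervals off the curve amounts to inverting the test at different P levels. A sketch for the Table 8-1 data, again under a normal approximation on the log scale (our choice of method, so the exact limits in the figure may differ somewhat):

```python
import math

# Nested confidence intervals for the Table 8-1 data, corresponding to
# horizontal lines drawn at P = .05, .10, and .20 on the curve.
a, b, c, d = 4, 386, 4, 1250
log_or = math.log((a * d) / (b * c))
se = math.sqrt(1/a + 1/b + 1/c + 1/d)

for level, z in ((95, 1.960), (90, 1.645), (80, 1.282)):
    lo = math.exp(log_or - z * se)
    hi = math.exp(log_or + z * se)
    print(f"{level}% CI: {lo:.1f}-{hi:.1f}")
```

All three intervals share the same point estimate and standard error; only the arbitrary multiplier z differs, which is why they convey the same precision at different widths.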
It is common to see confidence intervals reported for an epidemiologic measure, but it is uncommon to see a full P-value function or confidence interval function. Fortunately, it is not necessary to calculate and display a full P-value function to infer the two quantitative messages, strength of relation and precision, for an estimate. A single confidence interval is enough, because the upper and lower bounds of a single interval determine the entire P-value function. If we know the lower and upper limit of the confidence interval, we know the location of the P-value function along the horizontal axis
Figure 8-3 P-value function for the data from Table 8-1, showing how nested confidence intervals can be read from the curve.
and the spread of the function. Thus, from a single confidence interval, we can construct an entire P-value function. We do not need to go through the labor of calculating this function if we can visualize the two messages that it can convey directly from the confidence interval.
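To see this concretely: if the published limits were computed with a normal approximation on the log scale (an assumption on our part), the midpoint of the log limits recovers the point estimate and their half-width recovers the standard error, which together determine the whole function. Using the 95% limits reported later in Table 8-3:

```python
import math

# Recover the point estimate and standard error from a single 95% CI.
lower, upper = 0.70, 1.10              # 95% limits from Table 8-3
log_point = (math.log(lower) + math.log(upper)) / 2
se = (math.log(upper) - math.log(lower)) / (2 * 1.96)

point_estimate = math.exp(log_point)
print(f"point estimate = {point_estimate:.2f}")
# Any other nested interval now follows, e.g. the 90% interval:
print(f"90% CI: {math.exp(log_point - 1.645*se):.2f}-"
      f"{math.exp(log_point + 1.645*se):.2f}")
```

The recovered point estimate, about 0.88, sits a whisker from the reported 0.87; the small gap reflects rounding of the published limits.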
Regrettably, confidence intervals are too often not interpreted with the image of the corresponding P-value function in mind. A confidence interval can unfortunately be used as a surrogate test of statistical significance: a confidence interval that contains the null value corresponds to a significance test that is "not significant," and a confidence interval that excludes the null value corresponds to a significance test that is "significant." The allure of significance testing is so strong that many people use a confidence interval merely to determine "significance" and thereby ignore the potentially useful quantitative information that the confidence interval provides.
Example: Is Flutamide Effective in Treating Prostate Cancer?
In a randomized trial of flutamide, which is used to treat prostate cancer, Eisenberger et al.^{2} reported that patients who received flutamide fared no better than those who received placebo. Their interpretation that flutamide was ineffective contradicted the results of 10 previous studies, which collectively had pointed to a modest benefit. The 10 previous studies, in aggregate, indicated about an 11% survival advantage for patients receiving flutamide [odds ratio (OR) = 0.89]. The actual data reported by Eisenberger et al. are given in Table 8-3. From these data, we can calculate an OR of 0.87, almost the same result (slightly better) as was obtained in the 10 earlier studies. (We usually calculate odds ratios only for case-control data; for data such as these from an experiment, we normally calculate risk ratios or mortality rate ratios. The meta-analysis of the first 10 experiments on flutamide, however, reported only the OR, so we use that measure also for consistency.) Why did Eisenberger et al.^{2} interpret their data to indicate no effect when the data indicated about the same beneficial effect as the 10 previous studies? They based their conclusion solely on a test of statistical significance, which gave a result of P = .14. By focusing on statistical significance testing, they ignored the small beneficial effect in their data and came to an incorrect interpretation.
Table 8-3 Summary of Survival Data from the Study of Flutamide and Prostate Cancer
            Flutamide   Placebo
Died              468       480
Survived          229       205
Total             697       685

OR = 0.87; 95% CI: 0.70-1.10
Data from Eisenberger et al.
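The figures in Table 8-3 can be reproduced with the cross-product ratio and, for the interval, a normal approximation on the log scale (Woolf's method, which here happens to match the published limits, though we cannot be sure it is the method the authors used):

```python
import math

# Odds ratio and 95% confidence interval for the flutamide trial (Table 8-3).
died_flu, died_pla = 468, 480
surv_flu, surv_pla = 229, 205

odds_ratio = (died_flu * surv_pla) / (died_pla * surv_flu)
se = math.sqrt(1/died_flu + 1/died_pla + 1/surv_flu + 1/surv_pla)
lower = math.exp(math.log(odds_ratio) - 1.96 * se)
upper = math.exp(math.log(odds_ratio) + 1.96 * se)
print(f"OR = {odds_ratio:.2f}, 95% CI: {lower:.2f}-{upper:.2f}")
# → OR = 0.87, 95% CI: 0.70-1.10
```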
The original 10 studies on flutamide were published in a review that summarized the results.^{3} It is helpful to examine the P-value function from these 10 studies and to compare it with the P-value function after adding the study of Eisenberger et al.^{2} to the earlier studies (Fig. 8-4).^{4} The only change apparent from adding the data of Eisenberger et al.^{2} is a slightly improved precision of the estimated benefit of flutamide in reducing the risk of dying from prostate cancer.
Example: Is St. John’s Wort Effective in Relieving Major Depression?
Extracts of St. John's Wort (Hypericum perforatum), a small, flowering weed, have long been used as a folk remedy. It is a popular herbal treatment for depression. Shelton et al.^{5} reported the results of a randomized trial of 200 patients with major depression who were randomly assigned to receive either St. John's Wort or placebo. Of 98 who received St. John's Wort, 26 responded positively, whereas 19 of the 102 who received placebo responded positively. Among those whose depression was relatively less severe at entry into the study (a group that the investigators thought might be more likely to show an effect of St. John's Wort), the
Figure 8-4 P-value functions for the first 10 studies of flutamide and prostate cancer survival (solid line)^{3} and for the first 11 studies (dashed line) after adding the study by Eisenberger et al.^{2} The study by Eisenberger et al. did not shift the overall findings toward the null value but instead shifted the overall findings a minuscule step away from the null value. Nevertheless, because of an inappropriate reliance on statistical significance testing, the data were incorrectly interpreted as refuting earlier studies and indicating no effect of flutamide, despite the fact that the findings replicated previous results. (Reproduced with permission from Rothman et al.^{4})
Table 8-4 Remissions Among Patients with Less Severe Depression
              St. John's Wort   Placebo
Remission                  12         5
No remission               47        45
Total                      59        50

RR = 2.0; 90% CI: 0.90-4.6
Data from Shelton et al.^{5}
proportion of patients who had remission of disease was twice as great among the 59 patients who received St. John's Wort as among the 50 who received a placebo (Table 8-4).
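The RR = 2.0 and the 90% interval in Table 8-4 follow from the two arm-specific risks and a log-scale standard error for a risk ratio (our sketch of a standard calculation, which reproduces the published limits):

```python
import math

# Risk ratio and 90% CI for remissions among the less severely depressed.
rem_sjw, n_sjw = 12, 59   # St. John's Wort arm
rem_pla, n_pla = 5, 50    # placebo arm

risk_ratio = (rem_sjw / n_sjw) / (rem_pla / n_pla)
se = math.sqrt(1/rem_sjw - 1/n_sjw + 1/rem_pla - 1/n_pla)
lower = math.exp(math.log(risk_ratio) - 1.645 * se)
upper = math.exp(math.log(risk_ratio) + 1.645 * se)
print(f"RR = {risk_ratio:.1f}, 90% CI: {lower:.2f}-{upper:.1f}")
# → RR = 2.0, 90% CI: 0.90-4.6
```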
In Table 8-4, risk ratio refers to the "risk" of having a remission in symptoms, which is an improvement, so any increase above 1.0 indicates a beneficial effect of St. John's Wort; the RR of 2.0 indicates that the probability of a remission was twice as great for those receiving St. John's Wort. Despite these and other encouraging findings in the data, the investigators based their interpretation on a lack of statistical significance and concluded that St. John's Wort was not effective. A look at the P-value function that corresponds to the data in Table 8-4 is instructive (Fig. 8-5).
Figure 8-5 shows that the data regarding remissions among the less severely affected patients hardly support the theory that St. John's Wort is ineffective. The data for other outcomes were also generally favorable for St. John's Wort but, for
Figure 8-5 P-value function for the effect of St. John's Wort on remission from major depression among relatively less severely affected patients. (Data from Shelton et al.^{5})
almost all comparisons, not statistically significant. Instead of concluding, as they should have, that these data are readily compatible with moderate and even strong beneficial effects of St. John's Wort, the investigators drew the wrong conclusion, based on the lack of statistical significance in the data. Although the P value from this study is not statistically significant, the P value for the null hypothesis has the same magnitude as the P value testing the hypothesis that RR = 4.1 (on the graph, the dashed line intersects the P-value function at RR = 1.0 and RR = 4.1). Although the investigators interpreted the data as supporting the hypothesis that RR = 1.0, the data are equally compatible with values of 1.0 or 4.1. Furthermore, it is not necessary to construct the P-value function in Figure 8-5 to reach this interpretation. An investigator need look no further than the confidence interval given in Table 8-4 to appreciate the location and the spread of the underlying P-value function.
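The RR = 4.1 mirror value follows from the symmetry of the curve on the log scale: the hypothesis with the same P value as the null sits at the square of the point estimate (assuming an approximately log-symmetric P-value function, as in our earlier sketches):

```python
import math

# The value whose P value matches the null's, for the Table 8-4 estimate.
risk_ratio = (12 / 59) / (5 / 50)            # point estimate, about 2.0
mirror = math.exp(2 * math.log(risk_ratio))  # log-scale reflection of RR = 1
print(round(mirror, 1))  # → 4.1
```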