 # Hypothesis Tests

The third major frequentist statistical tool for inference is hypothesis testing. Rather than estimate the parameter or gauge the strength of the evidence against a hypothesized value (or set of values) for the parameter, frequentist hypothesis testing prescribes epistemic decisions with regard to these values. It gives us decision rules that are shaped by the probabilities of erroneous decisions. A report of a committee of the National Academy of Sciences (National Research Council Committee, 2004) summarized the basic idea for formulating a statistical decision rule for matching bullet lead fragments on the basis of measurements of the concentrations of various trace elements as follows (Kafadar and Spiegelman, 2004, pp. 170-71):

The classical approach to deciding between the two hypotheses was developed in the 1930s. The standard hypothesis-testing procedure consists of these steps:

1. Set up the two hypotheses. The "assumed" state of affairs is generally the null hypothesis, for example, "drug is no better than placebo." In the compositional analysis of bullet lead (CABL) context, the null hypothesis is "bullets do not match" or "mean concentrations of materials from which these two bullets were produced are not the same" (assume "not guilty"). The converse is called the alternative hypothesis, for example, "drug is effective" or in the CABL context, "bullets match" or "mean concentrations are the same."
2. Determine an acceptable level of risk posed by rejecting the null hypothesis when it is actually true. The level is set according to the circumstances. Conventional values in many fields are 0.05 and 0.01; that is, in one of 20 or in one of 100 cases when this test is conducted, the test will erroneously decide on the alternative hypothesis ("bullets match") when the null hypothesis actually was correct ("bullets do not match"). The preset level is considered inviolate; a procedure will not be considered if its "risk" exceeds it....
3. Calculate a quantity based on the data (for example, involving the sample mean concentrations of the seven elements in the two bullets), known as a test statistic. The value of the test statistic will be used to test the null hypothesis versus the alternative hypothesis.
4. The preset level of risk and the test statistic together define two regions, corresponding to the two hypotheses. If the test statistic falls in one region, the decision is to fail to reject the null hypothesis; if it falls in the other region (called the critical region), the decision is to reject the null hypothesis and conclude the alternative hypothesis.

The critical region has the following property: Over the many times that this protocol is followed, the probability of falsely rejecting the null hypothesis does not exceed the preset level of risk. [Ideally, a] test procedure ... has a further property: if the alternative hypothesis holds, the procedure will have the greatest chance of correctly rejecting the null hypothesis.

We will show how this procedure works using the glass example of Section 2.5 and how it relates to the p-value described there as well as to confidence intervals.

## Classical Hypothesis Tests for Refractive Index Matching

### Type I Errors and the Size of a Test

As in Section 2.5, the null hypothesis is $H_0: \theta_K = \theta_Q$, which is equivalent to saying that $\delta$, the difference in the parameter for the two groups, is $\delta = \theta_K - \theta_Q = 0$. Now we explicitly add the two-sided alternative hypothesis $H_1: \theta_K \neq \theta_Q$, which is to say that their difference is not zero ($\delta \neq 0$). In deciding between the two hypotheses, we can make two types of error, as indicated in Table 2.6.

If we conclude that the two sets of glass do not have the same refractive index (a nonmatch) when they do (a false rejection), we falsely exclude the window as the source of the glass on the suspect. That falsely exonerates the suspect. If we conclude that the two sets have the same refractive index (a match) when they do not, we falsely include the window as a possible source of the questioned glass. That false match incriminates the suspect, since it implies (albeit incorrectly) that the glass on the suspect either came from this window or another one like it.

TABLE 2.6

Two Types of Error and Two Types of Correct Decisions for a Hypothesis Test of the Equality of the Refractive Index

| Decision | $H_0$ is true ($\delta = 0$) | $H_1$ is true ($\delta \neq 0$) |
|---|---|---|
| Do not reject $H_0$ | Correct decision: true acceptance, true negative, true match (inclusion) | Type II error: false acceptance, false negative, false match (inclusion) |
| Reject $H_0$ | Type I error: false rejection, false positive, false nonmatch (exclusion) | Correct decision: true rejection, true positive, true nonmatch (exclusion) |

The usual nomenclature for statistical hypothesis testing labels a false rejection as a Type I error, a false positive, or a false alarm, and the researcher tries to avoid it. To do that here, we must choose a small number for the tolerable risk of a false rejection. Suppose we select $\alpha = 0.05$. This false rejection probability is the size or significance level of the test, and data that lead to rejection at this level will be called statistically significant.

Our test statistic is $D$, the difference in the two sample means. Under $H_0$, $D$ is normal with mean $\mu_d = \delta = 0$ and standard error $\sigma_d = 2.19 \times 10^{-5}$ (Section 2.5.1). Because 5% of the area under the standard normal curve lies in the region $|z| > 1.96$, the critical region for $D$ with $\alpha = 0.05$ is $|d| > 1.96\,\sigma_d = 4.29 \times 10^{-5}$. If $d$ falls into this critical region (also called the rejection region $R$), we will reject $H_0$ and conclude that $H_1$ is true: the broken window will be excluded as a possible source of the questioned glass on the suspect. This outcome will occur (in the long run) in no more than 5% of the cases in which a window is the source. These false rejections are known as Type I errors, and defining $R$ in this fashion protects us from a larger expected rate of these errors.

If we wanted still more protection from Type I errors, we could pick a smaller value for $\alpha$. Suppose we choose $\alpha = 0.01$. The rejection region then is $|d| > 2.58\,\sigma_d = 5.65 \times 10^{-5}$. The data must be more extreme (in terms of what $H_0$ predicts) to permit us to reject $H_0$. We are more attached to the null hypothesis because we are more averse to prematurely exonerating guilty suspects.

With respect to the test size $\alpha$, writers sometimes draw an analogy between a classical hypothesis test and a criminal trial in which the defendant is presumed not guilty until proven guilty (e.g., Chihara and Hesterberg, 2011, p. 221; Wasserman, 2004; Saks and Neufeld, 2012, p. 150). The analogy is questionable (and has proved misleading) because the significance level relates to $\Pr(\text{data} \mid H_0)$, whereas proof beyond a reasonable doubt pertains to $\Pr(H_0 \mid \text{data})$ (Kaye, 1987). In any event, the analogy does not apply to the hypothesis $H_0$ that $\delta = 0$ in this forensic context. If we demand extreme data (a small rejection region so as to keep $\alpha$ small), it will be harder to dislodge a "presumption" of guilt, or more accurately, an assumption that a defendant is associated with the crime (Zadora et al., 2013, p. 23). This is the opposite of the situation with examiners' decisions on "a match" for a fingerprint. There, a false positive meant a false match (a false inclusion), whereas here it means a false exclusion. The high specificity compared to the sensitivity of the examiners in the FBI-Noblis experiment, if extrapolated to case work, would protect defendants from false convictions (at the cost of increasing the rate of false acquittals).

This difference in the implications of Type I and Type II error in the two situations does not imply that it is wrong to use $\delta = 0$ for the null hypothesis and to demand a low risk of false rejection for that hypothesis (Kaye, 2017). The selection of $\delta = 0$ for $H_0$ might be understood better by regarding the refractive index comparison as a screening test for more elaborate chemical tests (of the elemental composition of glass) or further investigation of the suspect (Kaye, 2015). We would not want such a screening test to be too demanding, so we elect to keep $\alpha$ small.

For the data in Table 2.5, the value of $D$ is $d = -1.4 \times 10^{-5}$. As shown in Figure 2.5, it lies solidly in the acceptance region. The glass evidence seems to have incriminated the suspect.
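The size calculations in this subsection can be checked numerically. The following is a minimal sketch, assuming the normal model with known standard error described above; the function names `critical_value` and `reject_null` are hypothetical helpers introduced only for illustration, and only Python's standard library is used.

```python
from statistics import NormalDist

SIGMA_D = 2.19e-5   # standard error of D, as given in the text (Section 2.5.1)
D_OBS = -1.4e-5     # observed difference in sample means (Table 2.5)


def critical_value(alpha: float, sigma: float = SIGMA_D) -> float:
    """Boundary c of the two-sided rejection region |d| > c for a test of size alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile of N(0, 1)
    return z * sigma


def reject_null(d: float, alpha: float, sigma: float = SIGMA_D) -> bool:
    """True if the observed difference d falls in the rejection region."""
    return abs(d) > critical_value(alpha, sigma)


for alpha in (0.05, 0.01):
    c = critical_value(alpha)
    print(f"alpha = {alpha}: reject H0 if |d| > {c:.2e}; "
          f"reject for observed d? {reject_null(D_OBS, alpha)}")
```

Because the sketch uses the unrounded normal quantile (about 2.576 rather than 2.58), the boundary for a size of 0.01 comes out near $5.64 \times 10^{-5}$ instead of the $5.65 \times 10^{-5}$ quoted in the text; the decision for the observed difference is unaffected.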

### Type II Errors and the Power of a Test

The flip side of Type I error is Type II error: failing to reject $H_0$ when $H_1$ is true. Letting $A$ stand for the acceptance region and $T$ be the test statistic, the probability of making a Type II error (false acceptance) is $\beta = \Pr(T \in A \mid H_1) = 1 - \Pr(T \in R \mid H_1)$. The probability of rejecting $H_0$ when $H_1$ is true is called the power: $\text{Power} = \Pr(T \in R \mid H_1) = 1 - \beta$.

It is relatively simple to compute the probability for a statistic when the hypothesis states a specific value for the parameter, as does the null hypothesis that the true difference is zero. It is not so straightforward when the alternative $H_1$ is a set of many possible values. $H_1$ covers the entire parameter space for $\delta$ except for the single point $\delta = \theta_K - \theta_Q = 0$. Being composed of multiple possibilities for $\delta$, it is called a composite hypothesis. In contrast, the hypothesis of equality (that $\theta_K = \theta_Q$) or no difference ($\delta = \theta_K - \theta_Q = 0$) is a simple hypothesis (also called a point hypothesis or an exact hypothesis). There is no single error probability for a composite hypothesis. Instead, one has to compute the power $\Pr(T \in R \mid \delta)$ as a function of the unknown difference $\delta = \theta_K - \theta_Q$. The bigger this difference, the greater the probability that the test will detect it. A powerful test has a good probability of detecting even a small true difference. A weak test does not. Data that fail to produce a rejection under a test that lacks such power are not much evidence that the no-difference null hypothesis is true.

FIGURE 2.5

Rejection regions for a test of size 0.05 and one of size 0.01 for the glass data. The observed value of the test statistic falls into the acceptance region, so the window cannot be ruled out as the source of the glass on the suspect's clothing.

Size and power are prospective concepts. The level $\alpha$ is set before the test is conducted, and the best region for maximizing power that is consistent with that choice becomes the rejection region (if possible).* How powerful are the tests with our rejection regions $|d| > 1.96\,\sigma_d = 4.29 \times 10^{-5}$ (for $\alpha = 0.05$) and $|d| > 2.58\,\sigma_d = 5.65 \times 10^{-5}$ (for $\alpha = 0.01$)? Because power is a function that varies across all the values of $\delta \neq 0$, there is no single answer. But we will compute the power for one of these values to give a limited answer. Suppose the true difference is some number $\Delta$ of standard errors above the value of zero proposed by $H_0$. That is, $\Delta = z(\delta_1) - z(\delta_0)$, where $\delta_1$ is a point value for $\delta$ in the zone demarcated by $H_1$. For our example, we will use $\delta_1 = \sigma_d = 2.19 \times 10^{-5}$. Because the true difference $\delta$ is the mean of the sampling distribution of $D$, this choice for $\delta$ shifts the standard normal curve for $Z = D/\sigma_d$ one unit to the right ($\Delta = 1$). The critical region is the same because it comes from $H_0$. Figure 2.6 sketches the general picture.

The two shaded areas in the lower curve in Figure 2.6 represent the probability for obtaining measured differences $d$ in the rejection region. If $D$ falls into $R$, we accept the alternative hypothesis, which is the correct decision when $\delta \neq 0$. So these areas are the power: the probability of a correct rejection when the true difference $\delta$ has the particular alternative value $\delta_1$.†
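The two shaded areas can also be written out explicitly. The following is a sketch under the normal model used in this section; the symbols $\Phi$ (the standard normal cumulative distribution function) and $z_{\alpha/2}$ (the upper $\alpha/2$ quantile, for example 1.96 for $\alpha = 0.05$) are notation introduced here for illustration. When the true difference lies $\Delta$ standard errors from zero, $Z = D/\sigma_d$ is normal with mean $\Delta$ and unit variance, so

$$
\text{Power}(\Delta) = \Pr(Z > z_{\alpha/2}) + \Pr(Z < -z_{\alpha/2}) = \Phi(\Delta - z_{\alpha/2}) + \Phi(-\Delta - z_{\alpha/2}).
$$

For $\alpha = 0.05$ and $\Delta = 1$, this gives

$$
\text{Power}(1) = \Phi(-0.96) + \Phi(-2.96) \approx 0.1685 + 0.0015 \approx 0.17, \qquad \beta = 1 - \text{Power}(1) \approx 0.83.
$$

Nearly all of the power comes from the tail on the side toward which the curve is shifted; the opposite tail contributes very little.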

Table 2.7 lists this power and the complementary false-acceptance probability $\beta$ for $\delta_1 = \sigma_d = 2.19 \times 10^{-5}$ and for three rejection regions: $|d| > 1.96\,\sigma_d = 4.29 \times 10^{-5}$ (for $\alpha = 0.05$), $|d| > 2.58\,\sigma_d = 5.65 \times 10^{-5}$ (for $\alpha = 0.01$), and $|d| > 3.29\,\sigma_d = 7.21 \times 10^{-5}$ (for $\alpha = 0.001$). Although the choice of this alternative parameter value is arbitrary, computations across a range of values would show the kind of differences that the statistical test has a reasonable chance of detecting. The observed difference here falls into the region for which power is small. This suggests that the failure to exclude the window as the source of the questioned fragments provides only weak evidence that the true refractive indices for the known and questioned fragments are not very different.‡

FIGURE 2.6

The probability distribution of the measured difference $Z = D/\sigma_d$ of the means for the two sets of glass fragments when the true difference is not $\delta = 0$, but an alternative value $\delta_1$.

TABLE 2.7

Power and False Acceptance Probability ($\beta$) of the Test for $\delta = 0$ Against the Alternative $\delta_1 = 1$ Standard Error When the False Rejection Probabilities Are $\alpha = 0.05$, $0.01$, and $0.001$

| Significance level $\alpha$ | Rejection region | Power at the alternative ($\Delta = 1$) | $\beta$ at the alternative ($\Delta = 1$) |
|---|---|---|---|
| 0.05 | $\pm 1.96\,\sigma_d$ | 0.17 | 0.83 |
| 0.01 | $\pm 2.58\,\sigma_d$ | 0.058 | 0.94 |
| 0.001 | $\pm 3.29\,\sigma_d$ | 0.011 | 0.99 |

Notice that as the significance level $\alpha$ decreases (making it less likely to have a false positive), the power also decreases (making it more likely to have a false negative). This tradeoff is a general phenomenon (for a given sample size). Looking back at Figure 2.6, as we stretch out the acceptance region for the null distribution to guard against false rejections, there is more area under the alternative probability density curve for the region $A$. Conversely, if we are willing to tolerate a greater risk of false positives (here, false exclusions), we will have fewer false negatives (here, false matches) and thus greater power.
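The tradeoff between size and power can be reproduced with a few lines of code. This is a minimal sketch, assuming the two-sided normal test with known standard error used in this section; `power_two_sided` is a hypothetical helper name, and the shift is expressed in standard errors (the quantity written as $\Delta$ above).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def power_two_sided(alpha: float, shift: float) -> float:
    """Power of the two-sided z-test of size alpha when the true difference
    lies `shift` standard errors away from the null value of zero."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)  # Pr(Z > +z_crit | shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)     # Pr(Z < -z_crit | shift)
    return upper_tail + lower_tail


# Power and false-acceptance probability at a shift of one standard error.
for alpha in (0.05, 0.01, 0.001):
    power = power_two_sided(alpha, shift=1.0)
    print(f"alpha = {alpha:<5}  power = {power:.3f}  beta = {1 - power:.3f}")

# Power as a function of the shift, for a fixed size of 0.05.
for shift in (0.0, 0.5, 1.0, 2.0, 3.0, 4.0):
    print(f"shift = {shift:.1f}  power = {power_two_sided(0.05, shift):.3f}")
```

At a shift of zero the power equals the size of the test, and it rises toward 1 as the true difference grows, which is the sense in which power is a function of the unknown difference rather than a single number.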

## Hypothesis Testing with p-Values

A p-value can be used to perform a hypothesis test. Instead of explicitly defining the critical region, we compute the p-value and compare it to the pre-set threshold $\alpha$ for rejection. If $p \leq \alpha$, we reject $H_0$ and declare the data to be "statistically significant" at the $\alpha$ level; if $p > \alpha$, we do not reject $H_0$. For the glass data, we saw that the p-value was $p = 0.52$. The difference obviously is "not statistically significant" at the 0.05 level (or any of the levels that are normally used).

A p-value sometimes is called an "attained significance probability" because it corresponds to a test whose size $\alpha$ is the p-value for the observed data. For $\alpha = 0.52$, the observed difference $d = -1.4 \times 10^{-5}$ would have been (barely) significant. This idea of an attained significance level is misleading. The error probabilities $\alpha$ and $\beta$ are only applicable when the rejection region is known in advance of the data. The p-values are better understood as indicating how strongly the data seem to refute the null hypothesis. Nevertheless, there is nothing wrong with performing a hypothesis test with a preset $\alpha$ via a p-value.
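The p-value route to the same decision can be sketched as follows, assuming the two-sided normal test with known standard error; `two_sided_p_value` is a hypothetical helper name, and the numerical inputs are the observed difference and standard error quoted in the text.

```python
from statistics import NormalDist


def two_sided_p_value(d: float, sigma: float) -> float:
    """Two-sided p-value for an observed difference d under H0: delta = 0,
    assuming D is normal with known standard error sigma."""
    z = d / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


alpha = 0.05                                 # preset size of the test
p = two_sided_p_value(-1.4e-5, 2.19e-5)      # observed d and standard error
decision = "reject H0" if p <= alpha else "do not reject H0"
print(f"p = {p:.2f}: {decision} at alpha = {alpha}")
```

The comparison of `p` with `alpha` gives the same answer as checking whether the observed difference lies in the rejection region, provided that `alpha` is fixed before the data are examined.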

## Hypothesis Testing with Confidence Intervals

Confidence intervals also are intimately related to hypothesis tests. Let the null hypothesis $H_0$ be that a parameter $\theta$, such as the proportion of a given DNA allele in the population, has the value $\theta_0$. We want to test $H_0$ at a level $\alpha$. We form a confidence interval using a confidence coefficient $\gamma$ of $1 - \alpha$ and reject $H_0$ if and only if the resulting interval does not cover $\theta_0$. That does the trick. For example, Section 2.4.3.3 reported that the 95% CI for the false positive probability in the fingerprint study was [0.0008, 0.0036]. At a significance level of 0.05, we could not reject the null hypothesis that the long-run error rate is 0.36%. On the other hand, if we were willing to tolerate a greater risk of a false rejection, namely 0.10, we could reject this hypothesis; the 90% CI is [0.09%, 0.32%], which does not cover the proposed parameter value of 0.36%. Similarly, the 95% CI for estimating the true difference $\delta$ between the two sample means for the glass data (Table 2.5) is the observed difference $d$ bracketed by nearly two standard errors: $d \pm 1.96\,\sigma_d = [-5.69, 2.89] \times 10^{-5}$.† Because the interval includes zero, we cannot reject the null hypothesis that $\delta = 0$ at the 0.05 level.
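The interval-based decision rule can be written out in a short sketch. It assumes a normal estimator with known standard error for the glass data and simply takes the fingerprint intervals as reported in the text; `z_confidence_interval` and `reject_via_interval` are hypothetical helper names.

```python
from statistics import NormalDist


def z_confidence_interval(estimate: float, std_error: float,
                          confidence: float = 0.95) -> tuple[float, float]:
    """Normal-theory confidence interval: estimate +/- z * std_error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * std_error, estimate + z * std_error


def reject_via_interval(null_value: float, interval: tuple[float, float]) -> bool:
    """Reject H0 if and only if the interval does not cover the null value."""
    lower, upper = interval
    return not (lower <= null_value <= upper)


# Glass data: observed difference and standard error from the text.
glass_ci = z_confidence_interval(-1.4e-5, 2.19e-5, confidence=0.95)
print(glass_ci, reject_via_interval(0.0, glass_ci))

# Fingerprint study: intervals as reported in the text, null value 0.36%.
print(reject_via_interval(0.0036, (0.0008, 0.0036)))  # 95% CI, test of size 0.05
print(reject_via_interval(0.0036, (0.0009, 0.0032)))  # 90% CI, test of size 0.10
```

Treating an endpoint as covered (the `<=` comparisons) is a design choice made for this sketch; it matches the text's conclusion that the value 0.36% is not rejected by the 95% interval whose upper limit is 0.0036.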

Figure 2.7 displays the fact that "a coefficient $\gamma$ confidence set ... can be thought of as a set of null hypotheses that would be accepted at significance level $1 - \gamma$" (DeGroot and Schervish, 2002, p. 457). None of the sample intervals in the figure covers $\theta_0$. Rejecting $H_0$ for all such intervals, but not for the intervals (excluded from the picture) that cover $\theta_0$, creates a gap that is the acceptance region $A$ for the test of $H_0$ whose size is $\alpha = 1 - \gamma$. The midpoints of all the sample CIs that lead to rejection thus fill the rejection region $R$ for the test.

Another way to say it is that, in the long run, if many CIs are formed for an estimator in repeated samples, an expected fraction $\gamma$ of them will cover $\theta_0$ (when it is the true value for $\theta$), so using them to say whether $\theta_0$ is within them will lead us astray (when $H_0$ is true) only in the remaining fraction $1 - \gamma = \alpha$ of the samples. Hence, the decision procedure falsely rejects the claim that $\theta = \theta_0$ about $100\alpha\%$ of the time. The confidence-interval procedure keeps the long run error rate at or below the desired level.

FIGURE 2.7

Using confidence intervals for a sample proportion to test the hypothesis that the population proportion is the value $\theta_0$ proposed by the null hypothesis. The confidence (coverage probability $\gamma$) that corresponds to a significance level $\alpha$ is $1 - \alpha$.
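The long-run claim can be illustrated by simulation. The sketch below makes a simplifying assumption that is not taken from the figure: the estimator is normal with a known standard error (as in the glass example) rather than a sample proportion. The function name `false_rejection_rate`, the seed, and the number of trials are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist


def false_rejection_rate(theta0: float, sigma: float, alpha: float,
                         n_trials: int, seed: int = 0) -> float:
    """Fraction of simulated confidence intervals that fail to cover theta0
    when theta0 really is the true parameter value (that is, H0 is true)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_trials):
        estimate = rng.gauss(theta0, sigma)          # one sample estimate under H0
        lower, upper = estimate - z * sigma, estimate + z * sigma
        if not (lower <= theta0 <= upper):           # interval misses theta0
            rejections += 1
    return rejections / n_trials


rate = false_rejection_rate(theta0=0.0, sigma=2.19e-5, alpha=0.05, n_trials=100_000)
print(f"simulated false rejection rate: {rate:.3f}")
```

With a large number of trials, the simulated rate should settle close to the chosen size of 0.05, which is the coverage argument of the preceding paragraph in numerical form.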

* There could be several regions $R$ for which $\Pr(T \in R \mid H_0) = \alpha$. In that case, when testing two simple hypotheses, we choose the region that has the largest power. In other words, we pick a rejection region that minimizes $\beta$ for the fixed $\alpha$. When the alternative hypothesis is composite, we select the region $R$ for which the test is more powerful than any other rejection region for every value of the parameter as given in $H_1$. Although such uniformly most powerful tests do not always exist, they do for the examples in this chapter, and the test statistics we have used result in regions that minimize $\beta$ (maximize power) for the stipulated $\alpha$. The proof relies on likelihood ratios.

† By symmetry, the power is the same for an equal but negative $\delta_1$. When $\Delta = -1$, for example, the normal curve for the null distribution is shifted one standard unit to the left to obtain the alternative distribution.

‡ One might be tempted to compute only the power or false-acceptance probability for the simple alternative hypothesis that the true difference $\delta$ is the observed difference $d = -1.4 \times 10^{-5}$ ($Z = -0.639$ standard errors). Simulation studies (e.g., Yuan and Maxwell, 2005) have shown that this post hoc power is not a reliable indicator of the true power of a study. It is not a recommended procedure (Gelman, 2019).

* This approach is a hybrid of the perspectives associated with Sir Ronald Fisher, on the one hand, and Jerzy Neyman and Egon Pearson, on the other. How completely the Fisherian conception of p-values can be reconciled with the Neyman-Pearson theory of decision-oriented hypothesis tests is open to question (e.g., Lehmann, 1993).

† In the glass example, the standard error is known with good precision from separate studies. In most applications, it has to be estimated from the one set of sample data at hand. As a result, the width of the confidence intervals in the hypothetical ensemble of repeated intervals will vary. See §2.4.2.2. That is why the hypothetical intervals in Figure 2.7 have different widths.