Hypothesis Tests
The third major frequentist statistical tool for inference is hypothesis testing. Rather than estimate the parameter or gauge the strength of the evidence against a hypothesized value (or set of values) for the parameter, frequentist hypothesis testing prescribes epistemic decisions with regard to these values. It gives us decision rules that are shaped by the probabilities of erroneous decisions. A report of a committee of the National Academy of Sciences (National Research Council Committee, 2004) summarized the basic idea for formulating a statistical decision rule for matching bullet lead fragments on the basis of measurements of the concentrations of various trace elements as follows (Kafadar and Spiegelman, 2004, pp. 170-71):
The classical approach to deciding between the two hypotheses was developed in the 1930s. The standard hypothesis-testing procedure consists of these steps:
 1. Set up the two hypotheses. The "assumed" state of affairs is generally the null hypothesis, for example, "drug is no better than placebo." In the compositional analysis of bullet lead (CABL) context, the null hypothesis is "bullets do not match" or "mean concentrations of materials from which these two bullets were produced are not the same" (assume "not guilty"). The converse is called the alternative hypothesis, for example, "drug is effective" or, in the CABL context, "bullets match" or "mean concentrations are the same."
 2. Determine an acceptable level of risk posed by rejecting the null hypothesis when it is actually true. The level is set according to the circumstances. Conventional values in many fields are 0.05 and 0.01; that is, in one of 20 or in one of 100 cases when this test is conducted, the test will erroneously decide on the alternative hypothesis ("bullets match") when the null hypothesis actually was correct ("bullets do not match"). The preset level is considered inviolate; a procedure will not be considered if its "risk" exceeds it....
 3. Calculate a quantity based on the data (for example, involving the sample mean concentrations of the seven elements in the two bullets), known as a test statistic. The value of the test statistic will be used to test the null hypothesis versus the alternative hypothesis.
 4. The preset level of risk and the test statistic together define two regions, corresponding to the two hypotheses. If the test statistic falls in one region, the decision is to fail to reject the null hypothesis; if it falls in the other region (called the critical region), the decision is to reject the null hypothesis and conclude the alternative hypothesis.
The critical region has the following property: Over the many times that this protocol is followed, the probability of falsely rejecting the null hypothesis does not exceed the preset level of risk. [Ideally, a] test procedure ... has a further property: if the alternative hypothesis holds, the procedure will have the greatest chance of correctly rejecting the null hypothesis.
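For concreteness, the four steps can be sketched as a two-sided z-test with a known standard error. The numbers below are illustrative only (they are not from the CABL study):

```python
from statistics import NormalDist

def two_sided_z_test(d, se, alpha):
    """Steps 2-4: with a preset alpha, test H0 (mean difference 0) against
    a two-sided alternative, given an observed difference d whose standard
    error se is known."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # boundary of the critical region
    return abs(d / se) > z_crit                   # True = reject H0

# Step 1 (H0: no difference) is built into centering the null distribution at 0.
print(two_sided_z_test(d=3.0, se=1.0, alpha=0.05))  # True: |z| = 3 > 1.96
print(two_sided_z_test(d=1.0, se=1.0, alpha=0.05))  # False: |z| = 1 < 1.96
```

The function returns only the reject/fail-to-reject decision, mirroring the committee's description of the procedure as a decision rule rather than a measure of evidence.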
We will show how this procedure works using the glass example of Section 2.5 and how it relates to the p-value described there as well as to confidence intervals.
Classical Hypothesis Tests for Refractive Index Matching
Type I Errors and the Size of a Test
As in Section 2.5, the null hypothesis is H0: θ_K = θ_Q, which is equivalent to saying that δ, the difference in the parameter for the two groups, is δ = θ_K − θ_Q = 0. Now we explicitly add the two-sided alternative hypothesis H1: θ_K ≠ θ_Q, which is to say that their difference is not zero (δ ≠ 0). In deciding between the two hypotheses, we can make two types of error, as indicated in Table 2.6.
TABLE 2.6
Two Types of Error and Two Types of Correct Decisions for a Hypothesis Test of the Equality of the Refractive Index

                      H0 is true (δ = 0)                        H1 is true (δ ≠ 0)
Do not reject H0      Correct decision: true acceptance,        Type II error: false acceptance,
                      true negative, true match (inclusion)     false negative, false match (inclusion)
Reject H0             Type I error: false rejection, false      Correct decision: true rejection, true
                      positive, false nonmatch (exclusion)      positive, true nonmatch (exclusion)

If we conclude that the two sets of glass do not have the same refractive index (a nonmatch) when they do (a false rejection), we falsely exclude the window as the source of the glass on the suspect. That falsely exonerates the suspect. If we conclude that the two sets have the same refractive index (a match) when they do not, we falsely include the window as a possible source of the questioned glass. That false match incriminates the suspect, since it implies (albeit incorrectly) that the glass on the suspect came either from this window or from another one like it.
The usual nomenclature for statistical hypothesis testing labels a false rejection as a Type I error, a false positive, or a false alarm, and the researcher tries to avoid it. To do that here, we must choose a small number for the tolerable risk of a false rejection. Suppose we select α = 0.05. This false rejection probability is the size or significance level of the test, and data that lead to rejection at this level will be called statistically significant.
Our test statistic is D, the difference in the two sample means. Under H0, D is normal with mean μ_d = δ = 0 and standard error σ_d = 2.19 × 10⁻⁵. With α = 0.05, the rejection region is |d| > 1.96σ_d = 4.29 × 10⁻⁵.
If we wanted still more protection from Type I errors, we could pick a smaller value for α. Suppose we choose α = 0.01. The rejection region then is |d| > 2.58σ_d = 5.65 × 10⁻⁵. The data must be more extreme (in terms of what H0 predicts) to permit us to reject H0. We are more attached to the null hypothesis because we are more averse to prematurely exonerating guilty suspects.
With respect to the test size α, writers sometimes draw an analogy between a classical hypothesis test and a criminal trial in which the defendant is presumed innocent until proven guilty (e.g., Chihara and Hesterberg, 2011, p. 221; Wasserman, 2004; Saks and Neufeld, 2012, p. 150). The analogy is questionable (and has proved misleading) because the significance level relates to Pr(data|H0), whereas proof beyond a reasonable doubt pertains to Pr(H0|data) (Kaye, 1987). In any event, the analogy does not apply to the hypothesis H0 that δ = 0 in this forensic context. If we demand extreme data (a small rejection region so as to keep α small), it will be harder to dislodge a "presumption" of guilt, or more accurately, an assumption that a defendant is associated with the crime (Zadora et al., 2013, p. 23). This is the opposite of the situation with examiners' decisions on "a match" for a fingerprint. There, a false positive meant a false match (a false inclusion), whereas here it means a false exclusion. The low specificity compared to the sensitivity of the examiners in the FBI/Noblis experiment, if extrapolated to casework, would protect defendants from false convictions (at the cost of increasing the rate of false acquittals).
This difference in the implications of Type I and Type II error in the two situations does not imply that it is wrong to use δ = 0 for the null hypothesis and to demand a low risk of false rejection for that hypothesis (Kaye, 2017). The selection of δ = 0 for H0 might be understood better by regarding the refractive index comparison as a screening test for more elaborate chemical tests (of the elemental composition of glass) or further investigation of the suspect (Kaye, 2015). We would not want such a screening test to be too demanding, so we elect to keep α small.
For the data in Table 2.5, the value of D is d = −1.4 × 10⁻⁵. As shown in Figure 2.5, it lies solidly in the acceptance region. The glass evidence seems to have incriminated the suspect.
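Using the values stated in the text (σ_d = 2.19 × 10⁻⁵ and d = −1.4 × 10⁻⁵), a short Python check reproduces both cutoffs and the resulting decisions; this is a sketch, not code from the underlying study:

```python
from statistics import NormalDist

sigma_d, d = 2.19e-5, -1.4e-5                 # values quoted in the text
for alpha in (0.05, 0.01):
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 and 2.58
    cutoff = z * sigma_d                      # about 4.29e-5 and 5.65e-5
    print(f"alpha={alpha}: reject H0? {abs(d) > cutoff}")
# Both lines print False: d lies well inside the acceptance region.
```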
Type II Errors and the Power of a Test
The flip side of Type I error is Type II error: failing to reject H0 when H1 is true. Letting A stand for the acceptance region, R for the rejection region, and T for the test statistic, the probability of making a Type II error (false acceptance) is β = Pr(T in A|H1) = 1 − Pr(T in R|H1). The probability of rejecting H0 when H1 is true is called the power: Power = Pr(T in R|H1) = 1 − β.
FIGURE 2.5
Rejection regions for a test of size 0.05 and one of size 0.01 for the glass data. The observed value of the test statistic falls into the acceptance region, so the window cannot be ruled out as the source of the glass on the suspect's clothing.

It is relatively simple to compute the probability for a statistic when the hypothesis states a specific value for the parameter, as does the null hypothesis that the true difference is zero. It is not so straightforward when the alternative H1 is a set of many possible values. H1 covers the entire parameter space for δ except for the single point δ = θ_K − θ_Q = 0. Being composed of multiple possibilities for δ, it is called a composite hypothesis. In contrast, the hypothesis of equality (that θ_K = θ_Q) or no difference (δ = θ_K − θ_Q = 0) is a simple hypothesis (also called a point hypothesis or an exact hypothesis). There is no single error probability for a composite hypothesis. Instead, one has to compute the power Pr(T in R|δ) as a function of the unknown difference δ = θ_K − θ_Q. The bigger this difference, the greater the probability that the test will detect it. A powerful test has a good probability of detecting even a small true difference. A weak test does not. Data that fail to produce a rejection under a test that lacks such power are not much evidence that the no-difference null hypothesis is true.
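The dependence of power on the unknown difference can be made concrete by evaluating the power function of the size-0.05 test at several values of δ, expressed here in standard-error units (a sketch under the known-σ normal model used in the text):

```python
from statistics import NormalDist

def power(delta_se, z_crit=1.96):
    """Power of the two-sided test |Z| > z_crit when Z ~ N(delta_se, 1)."""
    nd = NormalDist()
    return nd.cdf(-z_crit - delta_se) + 1 - nd.cdf(z_crit - delta_se)

for delta_se in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(f"delta = {delta_se} SE -> power = {power(delta_se):.3f}")
# At delta = 0 the "power" is just the size (0.05); it climbs toward 1
# as the true difference grows.
```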
Size and power are prospective concepts. The level α is set before the test is conducted, and the best region for maximizing power that is consistent with that choice becomes the rejection region (if possible).[1] How powerful are the tests with our rejection regions |d| > 1.96σ_d = 4.29 × 10⁻⁵ (for α = 0.05) and |d| > 2.58σ_d = 5.65 × 10⁻⁵ (for α = 0.01)? To answer, we consider the distribution of D when the true difference is a particular alternative value δ1, as pictured in Figure 2.6.
The two shaded areas in the lower curve in Figure 2.6 represent the probability of obtaining measured differences d in the rejection region. If D falls into R, we accept the alternative hypothesis, which is the correct decision when δ ≠ 0. So these areas are the power: the probability of a correct rejection when the true difference δ has the particular alternative value δ1.†
Table 2.7 lists this power and the complementary false-acceptance probability β for σ_d = 2.19 × 10⁻⁵ and for three rejection regions: |d| > 1.96σ_d = 4.29 × 10⁻⁵ (for α = 0.05), |d| > 2.58σ_d = 5.65 × 10⁻⁵ (for α = 0.01), and |d| > 3.29σ_d = 7.21 × 10⁻⁵ (for α = 0.001).‡
FIGURE 2.6
The probability distribution of the measured difference Z = D/σ_d of the means for the two sets of glass fragments when the true difference is not δ = 0 but an alternative value δ1.
TABLE 2.7
Power and False-Acceptance Probability (β) of the Test of δ = 0 Against the Alternative δ = −1 Standard Error When the False-Rejection Probabilities Are α = 0.05, 0.01, and 0.001

                                          Alternative Parameter Value
Significance Level α   Rejection Region   Power for Δ = 1   β for Δ = 1
0.05                   ±1.96σ_d           0.17              0.83
0.01                   ±2.58σ_d           0.058             0.94
0.001                  ±3.29σ_d           0.011             0.99
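The entries in Table 2.7 can be reproduced from the normal model; the following sketch recomputes power and β against the alternative δ1 = −1 standard error:

```python
from statistics import NormalDist

nd = NormalDist()
results = {}
for alpha in (0.05, 0.01, 0.001):
    z = nd.inv_cdf(1 - alpha / 2)   # rejection cutoff in standard-error units
    # Under this alternative, Z = D/sigma_d ~ N(-1, 1); reject when |Z| > z.
    pw = nd.cdf(1 - z) + 1 - nd.cdf(1 + z)
    results[alpha] = (round(pw, 3), round(1 - pw, 2))
    print(alpha, results[alpha])
# Matches the table rows: (0.17, 0.83), (0.058, 0.94), (0.011, 0.99).
```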
Notice that as the significance level α decreases (making a false positive less likely), the power also decreases (making a false negative more likely). This tradeoff is a general phenomenon (for a given sample size). Looking back at Figure 2.6, as we stretch out the acceptance region for the null distribution to guard against false rejections, more of the area under the alternative probability density curve falls within that region.
Conversely, if we are willing to tolerate a greater risk of false positives (here, false exclusions), we will have fewer false negatives (here, false matches) and thus greater power.
Hypothesis Testing with p-Values
A p-value can be used to perform a hypothesis test. Instead of explicitly defining the critical region, we compute the p-value and compare it to the preset threshold α for rejection.[2] If p ≤ α, we reject H0 and declare the data to be "statistically significant" at the α level; if p > α, we do not reject H0. For the glass data, we saw that the p-value was p = 0.52. The difference obviously is "not statistically significant" at the 0.05 level (or at any of the levels that are normally used).
A p-value sometimes is called an "attained significance probability" because it corresponds to a test whose size α is the p-value for the observed data. For α = 0.52, the observed difference d = −1.4 × 10⁻⁵ would have been (barely) significant. This idea of an attained significance level is misleading. The error probabilities α and β are applicable only when the rejection region is known in advance of the data. The p-values are better understood as indicating how strongly the data seem to refute the null hypothesis. Nevertheless, there is nothing wrong with performing a hypothesis test with a preset α via a p-value.
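The quoted p-value itself can be recovered from d and σ_d under the normal model (a sketch using the values stated in the text):

```python
from statistics import NormalDist

d, sigma_d = -1.4e-5, 2.19e-5
z = d / sigma_d                          # about -0.639 standard errors
p = 2 * NormalDist().cdf(-abs(z))        # two-sided p-value
print(round(p, 2))                       # 0.52, as reported in Section 2.5
print(p <= 0.05)                         # False: not significant at the 0.05 level
```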
Hypothesis Testing with Confidence Intervals
Confidence intervals also are intimately related to hypothesis tests. Let the null hypothesis H0 be that a parameter θ, such as the proportion of a given DNA allele in the population, has the value θ0. We want to test H0 at a level α. We form a confidence interval using a confidence coefficient γ of 1 − α and reject H0 if and only if the resulting interval does not cover θ0. That does the trick. For example, Section 2.4.3.3 reported that the 95% CI for the false-positive probability in the fingerprint study was [0.0008, 0.0036]. At a significance level of 0.05, we could not reject the null hypothesis that the long-run error rate is 0.36%. On the other hand, if we were willing to tolerate a greater risk of a false rejection, namely 0.10, we could reject this hypothesis; the 90% CI is [0.09%, 0.32%], which does not cover the proposed parameter value of 0.36%. Similarly, the 95% CI for estimating the true difference δ between the two sample means for the glass data (Table 2.5) is the observed difference d bracketed by nearly two standard errors: d ± 1.96σ_d = [−5.69, 2.89] × 10⁻⁵.† Because the interval includes zero, we cannot reject the null hypothesis that δ = 0 at the 0.05 level.
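The reject-iff-not-covered rule can be stated in a few lines; the intervals below are the ones quoted in the text for the fingerprint study:

```python
theta0 = 0.0036                # hypothesized long-run false-positive rate (0.36%)
ci95 = (0.0008, 0.0036)        # 95% CI -> test of size 0.05
ci90 = (0.0009, 0.0032)        # 90% CI -> test of size 0.10

def reject(ci, theta0):
    lo, hi = ci
    return not (lo <= theta0 <= hi)   # reject H0 iff the interval misses theta0

print(reject(ci95, theta0))  # False: cannot reject at the 0.05 level
print(reject(ci90, theta0))  # True: can reject at the 0.10 level
```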
Figure 2.7 displays the fact that "a coefficient γ confidence set ... can be thought of as a set of null hypotheses that would be accepted at significance level 1 − γ" (DeGroot and Schervish, 2002, p. 457). None of the sample intervals in the figure covers θ0. Rejecting H0 for all such intervals, but not for the intervals (excluded from the picture) that cover θ0, creates a gap that is the acceptance region A for the test of H0 whose size is α = 1 − γ. The midpoints of all the sample CIs that lead to rejection thus fill the rejection region R for the test.
FIGURE 2.7
Using confidence intervals for a sample proportion to test the hypothesis that the population proportion is the value θ0 proposed by the null hypothesis. The confidence (coverage probability γ) that corresponds to a significance level α is 1 − α.

Another way to say it is that, in the long run, if many CIs are formed for an estimator in repeated samples, an expected fraction γ of them will cover θ0 (when it is the true value for θ), so using them to say whether θ0 is within them will lead us astray (when H0 is true) only in the remaining fraction 1 − γ = α of the samples. Hence, the decision procedure falsely rejects the claim that θ = θ0 about 100α% of the time. The confidence-interval procedure keeps the long-run error rate at or below the desired level.
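A small simulation illustrates the long-run claim: when H0 is true, the fraction of repeated 95% CIs that cover θ0 is about γ = 0.95, so the reject-iff-not-covered rule errs about α = 5% of the time. The sampling setup (a normal mean with known σ) is an illustrative assumption, not from the text:

```python
import random
from statistics import NormalDist

random.seed(1)
theta0, sigma, n = 10.0, 2.0, 25
z = NormalDist().inv_cdf(0.975)          # about 1.96
trials, covered = 10_000, 0
for _ in range(trials):
    xs = [random.gauss(theta0, sigma) for _ in range(n)]
    mean = sum(xs) / n
    half = z * sigma / n ** 0.5          # known-sigma interval half-width
    if mean - half <= theta0 <= mean + half:
        covered += 1
print(covered / trials)                  # close to 0.95
```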
 [1] There could be several regions R for which Pr(T in R|H0) = α. In that case, when testing two simple hypotheses, we choose the region that has the largest power. In other words, we pick a rejection region that minimizes β for the fixed α. When the alternative hypothesis is composite, we select the region R for which the test is more powerful than any other rejection region for every value of the parameter given in H1. Although such uniformly most powerful tests do not always exist, they do for the examples in this chapter, and the test statistics we have used result in regions that minimize β (maximize power) for the stipulated α. The proof relies on likelihood ratios.
 † By symmetry, the power is the same for an equal but negative δ1. When Δ = 1, for example, the normal curve for the null distribution is shifted one standard unit to the left to obtain the alternative distribution.
 ‡ One might be tempted to compute only the power or false-acceptance probability for the simple alternative hypothesis that the true difference δ is the observed difference d = −1.4 × 10⁻⁵ (Z = −0.639 standard errors). Simulation studies (e.g., Yuan and Maxwell, 2005) have shown that this post hoc power is not a reliable indicator of the true power of a study. It is not a recommended procedure (Gelman, 2019).
 [2] This approach is a hybrid of the perspectives associated with Sir Ronald Fisher, on the one hand, and Jerzy Neyman and Egon Pearson, on the other. How completely the Fisherian conception of p-values can be reconciled with the Neyman-Pearson theory of decision-oriented hypothesis tests is open to question (e.g., Lehmann, 1993).
 † In the glass example, the standard error is known with good precision from separate studies. In most applications, it has to be estimated from the one set of sample data at hand. As a result, the width of the confidence intervals in the hypothetical ensemble of repeated intervals will vary. See §2.4.2.2. That is why the hypothetical intervals in Figure 2.7 have different widths.