THE LOGIC OF HYPOTHESIS TESTING
One thing we can do, however, is test whether the mean of a sample of data, x̄, is likely to represent the mean of the population, μ, from which the sample was drawn. We’ll test whether the mean of FEMILLIT in table 20.7 is likely to represent the mean of the population of the 50 countries in table 20.8. To do this, we will use the logic of hypothesis testing. This logic is used very widely—not just in the social sciences, but in all probabilistic sciences, like meteorology and genetics.
The key to this logic is the statement that we can test whether the mean of the sample is likely to represent the mean of the population. Here’s how the logic works.
1. First, we set up a null hypothesis, written H0, which states that there is no difference between the sample mean and the mean of the population from which the sample was drawn.
2. Then we set up the research hypothesis (also called the alternative hypothesis), written H1, which states that, in fact, the sample mean and the mean of the population from which the sample was drawn are different.
3. Next, we decide whether the research hypothesis is only about magnitude or is directional. If H1 is only about magnitude—that is, it’s nondirectional—then it can be stated just as it was in (2) above: The sample mean and the mean of the population from which the sample was drawn are different. Period.
If H1 is directional, then it has to be stated differently: The sample mean is [bigger than] [smaller than] the mean of the population from which the sample was drawn. This decision determines whether we will use a one-tailed or a two-tailed test of the null hypothesis.
To understand the concept of one- and two-tailed tests, suppose you have a bell curve that represents the distribution of means from many samples of a population. Sample means are like any other variable. Each sample has a mean, and if you took thousands of samples from a population you’d get a distribution of means (or proportions). Some would be large, some small, and some exactly the same as the true mean of the population. The distribution would be normal and form a bell curve like the one in figure 20.4 and in figure 6.1.
The unlikely means (the very large ones and the very small ones) show up in the narrow area under the tails of the curve, while the likely means (the ones closer to the true mean of the population) show up in the fat, middle part. In research, the question you want to answer is whether the means of variables from one particular sample (the one you’ve got) probably represent the tails or the middle part of the curve.
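This sampling distribution can be simulated directly. The sketch below uses purely hypothetical numbers (a made-up population with a mean near 50) and only Python’s standard library; it draws many samples and shows that their means cluster tightly around the true population mean, with extreme means rare:

```python
import random
import statistics

# Hypothetical population: 10,000 values with a mean near 50 (illustration only).
random.seed(42)
population = [random.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Draw 2,000 samples of size 30 and record each sample's mean.
sample_means = [
    statistics.mean(random.sample(population, 30)) for _ in range(2_000)
]

# The means pile up around the true population mean (the fat middle of the
# bell curve); very large and very small means land in the narrow tails.
center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
print(f"true mean: {true_mean:.2f}, mean of sample means: {center:.2f}")
print(f"spread of the sample means (standard error): {spread:.2f}")
```

The spread of the sample means is much smaller than the spread of the raw data, which is why a single sample mean can tell you something about the population mean.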
Hypothesis tests are two-tailed when you are interested only in whether the magnitude of some statistic is significant (i.e., whether you would have expected that magnitude by chance). When the direction of a statistic is not important, a two-tailed test is called for.
As we’ll see in chapter 21, however, when you predict that one of two means will be higher than the other (like two tests taken a month apart), you would use a one-tailed test. After all, you’d be asking only whether the mean was likely to fall in one tail of the normal distribution. Look at appendix A carefully. Scores significant at the .10 level for a two-tailed test are significant at the .05 level for a one-tailed test.
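That relationship between the two tests can be checked numerically. The sketch below (a hypothetical calculation, using the standard normal curve via Python’s `math.erf`) shows that the area beyond z = 1.645 is about .05 in one tail, and therefore about .10 when both tails are counted:

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal curve."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = 1.645  # the familiar critical value for a one-tailed test at the .05 level

one_tailed_p = 1.0 - normal_cdf(z)          # area in one tail only
two_tailed_p = 2.0 * (1.0 - normal_cdf(z))  # same area, counted in both tails

print(f"one-tailed p: {one_tailed_p:.3f}")  # about .05
print(f"two-tailed p: {two_tailed_p:.3f}")  # about .10
```

A two-tailed p is always exactly twice the corresponding one-tailed p, which is why a score significant at .10 two-tailed is significant at .05 one-tailed.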
4. Finally, we determine the alpha level, written α, which is the level of significance for the hypothesis test. Typically, alpha is set at the .05 level or at the .01 level of significance. What this means is that if a mean or a proportion from a sample is likely to occur by chance more than alpha—say, more than 5% of the time—then we fail to reject the null hypothesis.
And conversely: If the mean or a proportion of a sample is likely to occur by chance less than alpha, then we reject the null hypothesis. Alpha defines the critical region of a sampling distribution—that is, the fraction of the sampling distribution small enough to reject the null hypothesis.
In neither case do we prove the research hypothesis, H1. We either reject or fail to reject the null hypothesis. Failing to reject the null hypothesis is the best we can do, since, in a probabilistic science, we can’t ever really prove any research hypothesis beyond any possibility of being wrong (box 20.3).
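The four steps above can be sketched as a one-sample z test. The numbers below are purely hypothetical (they are not the FEMILLIT data), and the sketch assumes the population mean and standard deviation are known:

```python
import math

# Hypothetical numbers, for illustration only (not the FEMILLIT data):
pop_mean = 47.0      # mean of the population
pop_sd = 12.0        # standard deviation of the population
sample_mean = 52.0   # mean of the sample we actually drew
n = 30               # sample size
alpha = 0.05         # step 4: the chosen level of significance

# Steps 1-2: H0 says the sample mean and population mean do not differ;
# H1 says they do. Step 3: H1 is nondirectional, so the test is two-tailed.
z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

def normal_cdf(x):
    """Cumulative distribution function of the standard normal curve."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Two-tailed p value: probability of a sample mean at least this far
# from the population mean, in either direction, if H0 is true.
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))

if p_value < alpha:
    decision = "reject H0"
else:
    decision = "fail to reject H0"
print(f"z = {z:.2f}, p = {p_value:.4f}: {decision}")
```

Note that even when the test rejects H0, it does not prove H1; it only says a sample mean this extreme would be rare if the null hypothesis were true.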
ON BEING SIGNIFICANT
By custom—and only by custom—researchers generally accept as statistically significant any outcome that is not likely to occur by chance more than 5 times in 100 tries. This p value, or probability value, is called the .05 level of significance. A p of .01 is usually considered very significant, and .001 is often labeled highly significant.
Many researchers use asterisks instead of p values in their writing to cut down on number clutter. A single asterisk signifies a p of .05, a double asterisk signifies a value of .01 or less, and a triple asterisk signifies a value of .001 or less. If you read: "Men were more likely than women** to report dissatisfaction with local schoolteacher training," you'll know that the double asterisk means that the difference between men and women on this variable was significant at the .01 level or better.
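A minimal sketch of this labeling convention, as a hypothetical helper function:

```python
def significance_stars(p):
    """Map a p value to the conventional asterisk labels described above."""
    if p <= 0.001:
        return "***"   # highly significant
    if p <= 0.01:
        return "**"    # very significant
    if p <= 0.05:
        return "*"     # significant by the customary .05 convention
    return ""          # not significant by that convention

for p in (0.0004, 0.008, 0.03, 0.20):
    print(f"p = {p}: '{significance_stars(p)}'")
```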
And remember: Statistical significance is one thing, but substantive significance is another matter entirely. In exploratory research, you might be satisfied with a .10 level of significance. In evaluating the side effects of a medical treatment, you might demand a .001 level—or an even more stringent test of significance.