The power of a statistical test is the probability of correctly accepting your research hypothesis. If you’re thinking: ‘‘You mean it’s the probability of taking ‘yes’ for an answer?’’ then you’re right on track. As you know, the traditional way to conduct research is to: (1) formulate a hypothesis; (2) turn the hypothesis around into a null hypothesis; and then (3) try as hard as we can to prove the null hypothesis.

This is not perverse. All of us positivists out here know that it’s impossible to absolutely, positively, prove any hypothesis to be forever unfalsifiably true. So we do the next best thing. We try our very best to disprove our best ideas (our research hypotheses) and hope that we fail, leaving us with the right to say that our best guess is that we were right to begin with.

What this means in the real life of researchers is that statistical power is the probability of avoiding a Type II error: accepting a null hypothesis when it's really false. (The other way to go wrong, rejecting the null hypothesis when it's really true, is a Type I error, and the risk of that one is fixed by the significance level you choose.)

This probability depends on three things: (1) the significance level you adopt; (2) the minimum size of the difference between two outcomes that you will accept as a real difference; and (3) the size of the sample. So, to achieve a given amount of statistical power in any experiment or survey, you need to calculate the size of the sample required, given your significance level and the minimum size of the difference between two outcomes (the effect size) that you will accept as a real difference (Cohen 1988; Kraemer and Thiemann 1987).
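To make this concrete, here is a minimal sketch in Python of how power falls out of effect size and sample size. It uses the normal approximation to the two-sample t-test, so it runs slightly optimistic for small samples; the function name and the numbers plugged in are mine, not from the text.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of means.

    d            -- effect size: (difference in means) / (pooled SD)
    n_per_group  -- sample size in each of the two groups
    alpha        -- significance level

    Normal approximation to the t-test, so slightly optimistic
    for small samples.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)        # two-tailed critical value
    delta = d * (n_per_group / 2) ** 0.5     # noncentrality parameter
    # Probability the test statistic lands beyond the critical value
    # on the side of the true difference
    return 1 - z.cdf(z_crit - delta)

# A "medium" effect (d = .5) with 100 people in each group:
print(round(approx_power(0.5, 100), 2))      # ~0.94 under these assumptions
```

Notice what the function makes explicit: with the significance level fixed, power is driven entirely by the effect size and the sample size, which is why planning a study means choosing two of the three and solving for the third.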

This is a very important and subtle issue. Suppose you ask 100 men and 100 women, matched for socioeconomic status, race, and religion, to take the Attitudes Toward Women Scale (AWS). The null hypothesis is that there is no difference between the mean scores of the men and the mean scores of the women on this scale. How big a difference do you need between the mean of the men and the mean of the women on this scale to reject the null hypothesis and conclude that, in fact, the difference is real—that men and women really differ on their attitudes toward women as expressed in the AWS?

The answer depends on the power of the test of the difference in the means. Suppose you analyze the difference between the two means with a t-test, and suppose that the test is significant at the .05 level. Statistical power is the probability that a test like this one detects a real difference of a given size; that is, it is the probability of correctly rejecting the null hypothesis when the null hypothesis is really false.

The result, at the p = .05 level, indicates that the difference you detected between the mean for the men and the mean for the women would be expected to occur by chance fewer than 5 times in 100 runs of the same experiment. It does not indicate that you are 1 − p, or 95%, confident that you have correctly rejected the null hypothesis. The power of the finding of p = .05 depends on the size of the sample and on the size of the difference that you expected to find before you did the study.
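The "fewer than 5 times in 100" claim can be checked directly by simulation. The sketch below (my own illustration, with made-up population values) draws both groups from the same population, so the null hypothesis is true by construction, and counts how often a "significant" difference turns up anyway.

```python
import random
from statistics import NormalDist, mean

random.seed(42)

def one_null_run(n=100):
    """Draw two groups of n from the SAME population (so the null
    hypothesis is true) and return the z statistic for the difference
    in means. SD is known to be 1 here, so a z-test stands in for the t."""
    men = [random.gauss(0, 1) for _ in range(n)]
    women = [random.gauss(0, 1) for _ in range(n)]
    se = (1 / n + 1 / n) ** 0.5
    return (mean(men) - mean(women)) / se

z_crit = NormalDist().inv_cdf(0.975)   # ~1.96 for a two-tailed .05 test
runs = 10_000
false_positives = sum(abs(one_null_run()) > z_crit for _ in range(runs))
print(false_positives / runs)          # hovers near .05, by design
```

The rate of false rejections hovers near .05 because that is exactly what the significance level controls; it says nothing about how often the test would catch a real difference, which is the power question.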

In the case of the AWS, there are 40 years of data available. These data make it easy to say how big a difference you expect to find if the men and women in your sample are really different in their responses to the AWS. Many surveys, especially those done in foreign fieldwork, are done without this kind of information available. You can offer a theory to explain the results from one experiment or survey. But you can’t turn around and use those same data to test your theory. As replications accumulate for questions of importance in the social sciences, the question of statistical power becomes more and more important.

So, what’s the right amount of statistical power to shoot for? Cohen (1992) recommends that researchers plan their work—that is, set the effect size they recognize as important, set the level of statistical significance they want to achieve (.05, for example, or .01), and calculate the sample size—to achieve a power of .80.

A power value of .80 means an 80% chance of recognizing that our original hypothesis is really true and a 20% chance of a Type II error, that is, of failing to confirm our hypothesis when it's really true. If you shoot for a power level much lower than .80, says Cohen, you run too high a risk of making a Type II error.

On the other hand, power ratings much higher than .80 might require such large ns that researchers couldn't afford them (Cohen 1992:156). If you want 90% power for a .01 (1%) two-tailed test of, say, the difference between two Pearson's rs, then you'd need 364 participants (respondents, subjects) to detect a difference of .20 between the scores of the two groups. If you were willing to settle for 80% power and a .05 (5%) two-tailed test, then the number of participants drops to 192. (To find the sample size needed for any given level of power, see Kraemer and Thiemann 1987:105-12.)
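The same normal approximation used above for power can be turned around to estimate the sample size needed for a given power, which is the planning step Cohen recommends. This is a sketch for the simple case of a difference between two means, with illustrative numbers of my own; exact tables such as Kraemer and Thiemann's rest on the t distribution and so will differ slightly.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate n per group for a two-sided, two-sample test of means.

    Inverts the normal approximation by solving
        d * sqrt(n / 2) = z(1 - alpha/2) + z(power)
    for n. Slightly underestimates for small samples (normal vs. t).
    """
    z = NormalDist()
    target = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (target / d) ** 2)

# 80% power, .05 two-tailed test, "medium" effect size d = .5:
print(n_per_group(0.5))          # 63 here; Cohen's t-based tables say 64
# Demanding 90% power and a stricter .01 test roughly doubles the n:
print(n_per_group(0.5, power=0.90, alpha=0.01))
```

The two calls illustrate the trade-off in the paragraph above: tightening the significance level and raising the power target both push the required sample size up sharply.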