At the end of this chapter, you should understand why it’s possible to estimate very accurately, most of the time, the average age of the 228 million adults in the United States by talking to just 1,600 of them. And you should understand why you can also do this pretty accurately, much of the time, by talking to just 400 of them. Sampling theory is partly about distributions, which come in a variety of shapes. Figure 6.1 shows what is known as the normal distribution.

FIGURE 6.1.

The normal curve and the first, second, and third standard deviations.

THE NORMAL CURVE AND Z-SCORES

The so-called normal distribution is generated by a formula that can be found in many intro statistics texts. In its standard form, the distribution has a mean of 0 and a standard deviation of 1. The standard deviation is a measure of how much the scores in a distribution vary from the mean score. The larger the standard deviation, the more dispersion around the mean. Here’s the formula for the standard deviation, or sd. (We will take up the sd again in chapter 20. The sd is the square root of the variance, which we’ll take up in chapter 21.)

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \qquad \text{(Formula 6.1)}$$

The symbol $\bar{x}$ in formula 6.1 is read “x-bar” and is used to signify the mean of a sample. The mean of a population (the parameter we want to estimate) is symbolized by $\mu$ (the Greek lower-case letter “mu,” pronounced “myoo”). The standard deviation of a population is symbolized by $\sigma$ (the Greek lower-case letter “sigma”), and the standard deviation of a sample is written as SD or sd or s. Read formula 6.1 as follows: The standard deviation is the square root of the sum of all the squared differences between every score in a set of scores and the mean, divided by the number of scores minus 1.
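To see the verbal reading of the standard deviation formula at work, here is a minimal Python sketch. The function name `sample_sd` and the eight scores are made up for illustration; they are not data from any study. The result is checked against `statistics.stdev` from the Python standard library, which also divides by the number of scores minus 1.

```python
import math
import statistics

def sample_sd(scores):
    """Sample standard deviation: divide the sum of squared
    differences from the mean by n - 1, then take the square root."""
    n = len(scores)
    mean = sum(scores) / n
    squared_diffs = [(x - mean) ** 2 for x in scores]
    return math.sqrt(sum(squared_diffs) / (n - 1))

# Hypothetical scores, invented for this example only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

sd = sample_sd(scores)
print(round(sd, 3))                         # 2.138
print(round(statistics.stdev(scores), 3))   # same value from the standard library
```

Here the mean is 5, the squared differences sum to 32, and 32 divided by 7 is about 4.571, whose square root is about 2.138.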

The standard deviation of a sampling distribution of means is the standard error of the mean, or SEM. The formula for calculating SEM is:

$$\mathrm{SEM} = \frac{sd}{\sqrt{n}}$$

where sd is the standard deviation of the sample and n is the sample size. In other words, the standard error of the mean gives us an idea of how much a sample mean varies from the mean of the population that we’re trying to estimate.
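The idea that the SEM is the standard deviation of a sampling distribution of means can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population is a hypothetical one in which scores are spread evenly between 0 and 1 (so its standard deviation is the square root of 1/12), and the sample size, number of samples, and seed are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

n = 100                  # size of each sample
num_samples = 2000       # how many separate samples we draw

# Hypothetical population: scores uniform between 0 and 1,
# so the population standard deviation is sqrt(1/12).
population_sd = math.sqrt(1 / 12)

# Draw many samples and record the mean of each one.
sample_means = []
for _ in range(num_samples):
    sample = [rng.random() for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

observed = statistics.stdev(sample_means)      # sd of the sample means
predicted = population_sd / math.sqrt(n)       # sd divided by root n

print(f"observed sd of sample means: {observed:.4f}")
print(f"predicted standard error:    {predicted:.4f}")
```

The two printed numbers should be close to each other, though not identical, because the observed value is itself based on a finite number of samples.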

Suppose that in a sample of 100 merchants in a small town in Malaysia, you find that the average income is RM12,600 (about $3,500 in 2009 U.S. dollars), with a standard deviation of RM4,000 (RM is the symbol for the Malaysian Ringgit). The standard error of the mean is:

$$\mathrm{SEM} = \frac{4{,}000}{\sqrt{100}}$$

Do the calculation:

$$\mathrm{SEM} = \frac{4{,}000}{10} = \text{RM}400$$
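The same arithmetic can be done in a couple of lines of Python. This is a sketch, and the function name `standard_error_of_mean` is just a label chosen here. The second call uses a hypothetical sample of 400 merchants with the same standard deviation, to show that quadrupling the sample size only halves the standard error.

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of the mean: sample sd divided by the square root of n."""
    return sd / math.sqrt(n)

# The Malaysian merchants example: sd = RM4,000, n = 100.
sem_100 = standard_error_of_mean(4000, 100)
print(sem_100)   # 400.0

# Hypothetical: same sd, but four times as many merchants.
sem_400 = standard_error_of_mean(4000, 400)
print(sem_400)   # 200.0
```

This is why going from a sample of 400 to a sample of 1,600 buys less added precision than you might expect.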

In a normal distribution (shown in figure 6.1 in its standard form, with a mean of 0 and a standard deviation of 1), about 34.13% of the area under the curve (the white space between the curve and the baseline) lies between the perpendicular line that represents the mean in the middle of the curve and the line that rises from the baseline at 1 standard deviation above the mean. Another 34.13% lies between the mean and the line at 1 standard deviation below the mean.

Appendix A is a table of z-scores, or standard scores. These scores are the number of standard deviations from the mean in a normal distribution, in increments of 1/100th of a standard deviation. For each z-score, beginning with 0.00 standard deviations (the mean) and on up to 3.09 standard deviations (on either side of the mean), appendix A shows the percentage of the physical area under the curve of a normal distribution. That percentage represents the percentage of cases that fall within any number of standard deviations above and below the mean in a normally distributed set of cases.

We see from appendix A that 34.13% of the area under the curve is one standard deviation above the mean and another 34.13% is one standard deviation below the mean. Thus, 68.26% of all scores in a normal distribution fall within one standard deviation of the mean. We also see from appendix A that 95.44% of all scores in a normal distribution fall within two standard deviations and that 99.7% fall within three standard deviations.
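If you don’t have a z-table handy, these percentages can be recomputed. Here is a minimal sketch using `NormalDist` from Python’s standard `statistics` module; the helper name `pct_within` is just a label chosen here. Note that the computed values differ from the table-based figures in the second decimal place (68.27 rather than 68.26, for example), because the table-based figures come from doubling entries that were already rounded.

```python
from statistics import NormalDist

# The standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def pct_within(k):
    """Percentage of the area under the normal curve that lies
    within k standard deviations on either side of the mean."""
    return 100 * (std_normal.cdf(k) - std_normal.cdf(-k))

# Area between the mean and 1 sd above it.
one_side = 100 * (std_normal.cdf(1) - std_normal.cdf(0))
print(round(one_side, 2))            # 34.13

for k in (1, 2, 3):
    print(k, round(pct_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```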

Look again at figure 6.1. You can see why so many cases are contained within 1 sd above and below the mean: The normal curve is tallest and fattest around the mean and much more of the area under the curve is encompassed in the first sd from the mean than is encompassed between the first and second sd from the mean.

If 95.44% of the area under a normal curve falls within two sd of the mean, and if 99.7% falls within three sd, then exactly 95% (a nice, round number) should fall within slightly less than two standard deviations and exactly 99% should fall within slightly less than three. And indeed, from appendix A, we see that 1.96 standard deviations above and below the mean account for 95% of all scores in a normal distribution and that 2.58 sd account for 99% of all scores. This, too, is shown in figure 6.1.
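You can also run the lookup in the other direction: start with the percentage you want and ask how many standard deviations it takes. The sketch below uses `NormalDist.inv_cdf` from Python’s standard library; the helper name `z_for_central_pct` is a label chosen for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def z_for_central_pct(pct):
    """Number of standard deviations on either side of the mean
    needed to enclose pct percent of a normal distribution."""
    tail = (1 - pct / 100) / 2          # area left over in each tail
    return std_normal.inv_cdf(1 - tail)

print(round(z_for_central_pct(95), 2))   # 1.96
print(round(z_for_central_pct(99), 2))   # 2.58
```

For 95%, each tail holds 2.5% of the area, so the function asks for the z-score below which 97.5% of all cases fall.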

The normal distribution is an idealized form. In practice, many variables are not distributed in the perfectly symmetric shape we see in figure 6.1. Figure 6.2 shows some other shapes for distributions. Figure 6.2a shows a bimodal distribution. Suppose the x-axis in figure 6.2a is age, and the y-axis is the percentage of people who respond “yes” to the question “Did you like the beer commercial shown during the Super Bowl yesterday?” The bimodal distribution shows that people in their 20s and people in their 60s liked the commercial, but others didn’t.

FIGURE 6.2.

Bimodal and skewed distributions.

Figure 6.2b and figure 6.2c are skewed distributions. A distribution can be skewed positively (with a long tail going off to the right) or negatively (with the tail going off to the left). Figures 6.2b and 6.2c look like the distributions of scores in two very different university courses. In figure 6.2b, most students got low grades, and there is a long tail of students who got high grades. In figure 6.2c, most students got relatively high grades, and there is a long tail of students who got lower grades.
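The direction of skew can be put into a number. One common choice (an assumption here, since the text does not name a particular statistic) is the moment coefficient of skewness: the average cubed deviation from the mean, divided by the cube of the standard deviation computed with n in the denominator. It comes out positive when the long tail goes off to the right and negative when it goes off to the left. The two sets of course grades below are invented for illustration.

```python
def skewness(scores):
    """Moment coefficient of skewness: third central moment
    divided by the second central moment raised to the power 1.5."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5

# Hypothetical grades: mostly low, with a few very high scores.
hard_course = [50, 52, 55, 55, 58, 60, 62, 65, 90, 98]

# Mirror image of the same grades: mostly high, a few very low.
easy_course = [150 - x for x in hard_course]

print(skewness(hard_course) > 0)   # True: tail to the right
print(skewness(easy_course) < 0)   # True: tail to the left
```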

FIGURE 6.3.

Three symmetric distributions including the normal distribution.

The normal distribution is symmetric, but not all symmetric distributions are normal. Figure 6.3 shows three variations of a symmetric distribution, that is, distributions for which the mean and the median are the same. The one on the left is leptokurtic (from Greek, meaning “thin bulge”) and the one on the right is platykurtic (meaning “flat bulge”). The curve in the middle is the famous bell-shaped, normal distribution. In a leptokurtic, symmetric distribution, the standard deviation is less than 1.0, and in a platykurtic, symmetric distribution, the standard deviation is greater than 1.0. The physical distance between marriage partners (whether among tribal people or urbanites) usually forms a leptokurtic distribution. People tend to marry people who live near them, and there are fewer and fewer marriages as the distance between partners increases (Sheets 1982). By contrast, we expect the distribution of height and weight of athletes in the National Basketball Association to be more platykurtic across teams since coaches are all recruiting players of more-or-less similar build.
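In statistical software, the peakedness of a distribution is usually summarized with a statistic called excess kurtosis: the average fourth-power deviation from the mean, divided by the squared variance, minus 3. It is about 0 for a normal distribution, positive for leptokurtic shapes, and negative for platykurtic ones. The sketch below is an illustration only. The three simulated data sets (a sharply peaked Laplace distribution, a normal distribution, and a flat uniform distribution), the seed, and the sample size are all assumptions chosen for this example.

```python
import random

def excess_kurtosis(scores):
    """Fourth central moment divided by the squared variance, minus 3."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    return m4 / m2 ** 2 - 3

rng = random.Random(1)   # fixed seed so the run is repeatable
size = 20000

# Peaked with heavy tails: the difference of two exponential
# draws follows a Laplace distribution.
peaked = [rng.expovariate(1) - rng.expovariate(1) for _ in range(size)]

# The bell-shaped normal distribution.
bell = [rng.gauss(0, 1) for _ in range(size)]

# Flat: every value between -1 and 1 is equally likely.
flat = [rng.uniform(-1, 1) for _ in range(size)]

for name, data in (("peaked", peaked), ("bell", bell), ("flat", flat)):
    print(name, round(excess_kurtosis(data), 2))
```

With samples this large, the peaked data should come out clearly positive, the bell-shaped data close to 0, and the flat data clearly negative.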

The shape of a distribution—normal, skewed, bimodal, and so on—contains a lot of information about what is going on, but it doesn’t tell us why things turned out the way they did. A sample with a bimodal or highly skewed distribution is a hint that you might be dealing with more than one population or culture.