MEASURES OF DISPERSION II: VARIANCE AND THE STANDARD DEVIATION

The best-known and most useful measure of dispersion for a sample of interval data is the standard deviation, usually written just s or sd. The sd is a measure of how much, on average, the scores in a distribution deviate from the mean score. It gives you a feel for how homogeneous or heterogeneous a population is. (We use s or sd for the standard deviation of a sample; we use the lowercase Greek sigma, σ, for the standard deviation of a population.)

The sd is calculated from the variance, written s^{2}, which is the average squared deviation from the mean of the measures in a set of data. To find the variance in a distribution: (1) Subtract the mean of the set of observations from each observation; (2) Square each difference, thus getting rid of negative numbers; (3) Sum the squared differences; and (4) Divide that sum by n − 1, one less than the sample size. Here is the formula for calculating the variance:

$$s^{2} = \frac{\sum (x - \bar{x})^{2}}{n - 1}$$

where s^{2} is the variance, x represents the raw scores in a distribution of interval-level observations, x̄ is the mean of the distribution of raw scores, and n is the total number of observations.

Notice that we need to square the difference of each observation from the mean and then take the square root later. As we saw in calculating the mean, Σ(x − x̄) = 0. That is, the simple sum of all the deviations from the mean is zero. Squaring each x − x̄ gets rid of the negative numbers.
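The steps for finding the variance can be sketched in a few lines of Python. This is only an illustration: it borrows the ten FEMILLIT scores listed in table 20.9, the variable names are invented for the example, and it assumes the n − 1 divisor that the worked examples in this chapter use.

```python
# FEMILLIT scores (percent adult female illiteracy) as listed in table 20.9
scores = [18.6, 20.7, 0.2, 11.1, 6.6, 0.3, 7.4, 1.4, 48.5, 24.2]

n = len(scores)
mean = sum(scores) / n

# Step 1: deviation of each observation from the mean
deviations = [x - mean for x in scores]

# The raw deviations cancel out, apart from floating-point rounding
sum_of_deviations = sum(deviations)

# Steps 2 and 3: square each deviation, then add up the squares
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide by n - 1 (assumed here, following the chapter's worked examples)
variance = sum_of_squares / (n - 1)

print(f"mean = {mean:.1f}")
print(f"sum of deviations is effectively zero: {abs(sum_of_deviations) < 1e-9}")
print(f"variance = {variance:.2f}")
```

Because the unsquared deviations always sum to zero, they carry no information about spread. The squared deviations are what make the statistic usable.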

Variance describes in a single statistic how homogeneous or heterogeneous a set of data is, and by extension, how similar or different are the units of analysis described by those data. Consider the set of scores in table 20.2 on the variable called REDUCE. These scores show people’s support for the idea that “Americans are going to have to drastically reduce their consumption over the next few years.” Suppose that for each level of education, you could predict the level of support for that attitudinal item about cutting back on consumption. If you could do this in 100% of all cases, then you would speak of “explaining all the variance” in the dependent variable.

I’ve never encountered this strength of association between two variables in the social sciences, but some things come pretty close, and in any event, the principle is what’s important.

The standard deviation, s, is the square root of the variance, s^{2}. The formula for the standard deviation is:

$$s = \sqrt{\frac{\sum (x - \bar{x})^{2}}{n - 1}}$$

Table 20.9 shows how to calculate the standard deviation for the data on female illiteracy in table 20.7. The sum of the (x − x̄)^{2} is 2,003.86 and n = 10.

Table 20.9 How to Calculate the Standard Deviation for the Data on Female Illiteracy in Table 20.7

| COUNTRY | FEMILLIT x | (Score − Mean)^{2}, (x − x̄)^{2} |
|---|---|---|
| El Salvador | 18.6 | (18.6 − 13.9)^{2} = 22.09 |
| Iran | 20.7 | (20.7 − 13.9)^{2} = 46.24 |
| Latvia | 0.2 | (0.2 − 13.9)^{2} = 187.69 |
| Namibia | 11.1 | (11.1 − 13.9)^{2} = 7.84 |
| Panama | 6.6 | (6.6 − 13.9)^{2} = 53.29 |
| Slovenia | 0.3 | (0.3 − 13.9)^{2} = 184.96 |
| Suriname | 7.4 | (7.4 − 13.9)^{2} = 42.25 |
| Armenia | 1.4 | (1.4 − 13.9)^{2} = 156.25 |
| Chad | 48.5 | (48.5 − 13.9)^{2} = 1,197.16 |
| Ghana | 24.2 | (24.2 − 13.9)^{2} = 106.09 |
| | | Σ(x − x̄)^{2} = 2,003.86 |

Substituting in the formula for standard deviation, we get:

$$s = \sqrt{\frac{2{,}003.86}{9}} = \sqrt{222.65} = 14.92$$

If we were reporting these data, we would say that “the average percentage of adult female illiteracy is 13.9 with sd 14.9.”
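As a cross-check on the hand calculation, the same figures can be obtained with Python's standard `statistics` module. This is a sketch, not part of the original procedure: `statistics.stdev` divides by n − 1 (the sample sd, s), while `statistics.pstdev` divides by n (the population sd, σ). The variable names are made up for the example.

```python
import statistics

# FEMILLIT scores from table 20.9
scores = [18.6, 20.7, 0.2, 11.1, 6.6, 0.3, 7.4, 1.4, 48.5, 24.2]

mean = statistics.mean(scores)
s = statistics.stdev(scores)        # sample sd: divisor is n - 1
sigma = statistics.pstdev(scores)   # population sd: divisor is n

# Report to one decimal place, as in the text
print(f"mean {mean:.1f}, sd {s:.1f}")
print(f"dividing by n instead of n - 1 gives {sigma:.1f}")
```

With only ten cases, the choice of divisor makes a visible difference. The gap shrinks as the sample grows.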

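For grouped data, we take the midpoint of each interval as the raw score and use formula 20.6 to calculate the standard deviation:

$$s = \sqrt{\frac{\sum f(x - \bar{x})^{2}}{n - 1}} \qquad \text{(Formula 20.6)}$$

where f is the number of observations in each interval and x is the midpoint of that interval.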

Table 20.10 shows the procedure for calculating the standard deviation for the grouped data in table 20.4. We know from table 20.6 that x̄ = 45 for the data in table 20.4.

Table 20.10 Calculating the Standard Deviation for the Grouped Data in Table 20.4

| AGEX | f | Midpoint x | x − x̄ | (x − x̄)^{2} | f(x − x̄)^{2} |
|---|---|---|---|---|---|
| 20-29 | 6 | 25 | 25 − 45 = −20 | 400 | 2,400 |
| 30-39 | 6 | 35 | 35 − 45 = −10 | 100 | 600 |
| 40-49 | 5 | 45 | 45 − 45 = 0 | 0 | 0 |
| 50-59 | 8 | 55 | 55 − 45 = 10 | 100 | 800 |
| 60+ | 5 | 65 | 65 − 45 = 20 | 400 | 2,000 |
| | n = 30 | | Σ(x − x̄) = 0 | | Σf(x − x̄)^{2} = 5,800 |

Substituting in formula 20.6 for sd, we get:

$$s = \sqrt{\frac{5{,}800}{29}} = \sqrt{200} = 14.14$$

and we report that “the mean age is 45.00, sd 14.14.” For comparison, the mean age and sd of the 30 ages (the ungrouped data) in table 20.3b are 45.033 and 15.52. Close, but not dead on.
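The grouped-data calculation can be sketched the same way. The frequencies and midpoints below are those shown in table 20.10. The variable names are invented for the example, and 65 is used as the midpoint of the open-ended 60+ interval only because the table does so.

```python
import math

# (interval, frequency f, midpoint x) as given in table 20.10
groups = [
    ("20-29", 6, 25),
    ("30-39", 6, 35),
    ("40-49", 5, 45),
    ("50-59", 8, 55),
    ("60+", 5, 65),
]

# Total number of observations is the sum of the frequencies
n = sum(f for _, f, _ in groups)

# Grouped mean: each midpoint counts f times
mean = sum(f * x for _, f, x in groups) / n

# Weighted sum of squared deviations, the quantity summed in formula 20.6
weighted_ss = sum(f * (x - mean) ** 2 for _, f, x in groups)

sd = math.sqrt(weighted_ss / (n - 1))

print(f"n = {n}, mean = {mean:.2f}, sd = {sd:.2f}")
```

Treating every case in an interval as if it sat at the midpoint discards the variation within intervals, which is one reason the grouped sd differs from the sd of the ungrouped ages.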

Are these numbers describing ages and percentages of illiteracy large, or small, or about normal? There is no way to tell except by comparison across cases. By themselves, numbers like means and standard deviations simply describe a set of data. But in comparative perspective, they help us produce theory; that is, they help us develop ideas about what causes things, and what those things, in turn, cause. We’ll compare cases when we get to bivariate analysis in the chapter coming up next (Further Reading: exploring data).