# Maximizing Between-Group Variance: The Wichita Study

Whenever you do multistage cluster sampling, be sure to take as large a sample as possible from the largest, most heterogeneous clusters. The larger the cluster, the larger the between-group variance; the smaller the cluster, the smaller the between-group variance. Counties in the same state in the United States are more like each other for many variables (such as average income, distribution of racial and ethnic groups, and average age) than states are. Towns within a county are more like each other than counties are; neighborhoods in a town are more like each other than towns are. And blocks within a neighborhood are more like each other than neighborhoods are. In sampling, the rule is: maximize between-group variance.

What does this mean in practice? The following is an actual example of multistage sampling from John Hartman’s study of Wichita, Kansas (Hartman 1978; Hartman and Hedblom 1979:160). At the time of the study, in the mid-1970s, Wichita had a population of about 193,000 persons over the age of 16. This was the population to which the study team wanted to generalize. The team decided that they could afford only 500 interviews. Wichita had 82 census tracts, from which the team randomly selected 20. These 20 tracts then became the actual population of their study. We’ll see in a moment how well their actual study population simulated (represented) the study population to which they wanted to generalize.

Hartman and Hedblom added up the total population in the 20 tracts and divided the population of each tract by the total. This gave the percentage of people that each tract, or cluster, contributed to the new population total. The researchers were going to do 500 interviews, so each tract was assigned that percentage of the interviews. If there were 50,000 people in the 20 tracts, and one of the tracts had a population of 5,000, or 10% of the total, then 50 interviews (10% of the 500) would be done in that tract.
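The allocation described above is simple proportional arithmetic, and a short sketch may make it concrete. The tract names and populations below are hypothetical (only the 50,000 total, the 5,000-person tract, and the 500 interviews come from the example in the text), and the rule for handling fractional interviews is an assumption, because the source does not say how the team rounded.

```python
def allocate_interviews(tract_populations, total_interviews):
    """Split a fixed number of interviews across tracts in proportion to population."""
    total_population = sum(tract_populations.values())
    # Exact (possibly fractional) share of the interviews for each tract.
    exact = {
        tract: total_interviews * pop / total_population
        for tract, pop in tract_populations.items()
    }
    # Assumed rounding rule: round every share down, then give the leftover
    # interviews to the tracts with the largest fractional parts.
    allocation = {tract: int(share) for tract, share in exact.items()}
    leftover = total_interviews - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - allocation[t], reverse=True)
    for tract in by_remainder[:leftover]:
        allocation[tract] += 1
    return allocation


# Hypothetical tracts totalling 50,000 people, as in the example in the text.
tracts = {"tract_a": 5_000, "tract_b": 20_000, "tract_c": 25_000}
allocation = allocate_interviews(tracts, total_interviews=500)
print(allocation)
```

Here the tract with 5,000 people holds 10% of the total, so it receives 50 of the 500 interviews, matching the worked example in the text.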

Next, the team numbered the blocks in each tract and selected blocks at random until they had enough for the number of interviews that were to be conducted in that tract. When a block was selected it stayed in the pool, so in some cases more than one interview was to be conducted in a single block. This did not happen very often, and the team wisely left it up to chance to determine this.
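Leaving a selected block in the pool amounts to sampling blocks with replacement. The sketch below illustrates that idea under stated assumptions: the number of blocks in the tract, the seed, and the function name `draw_blocks` are all hypothetical, and the sketch draws exactly one block per interview, which is one reading of the procedure rather than a record of what the team did.

```python
import random
from collections import Counter


def draw_blocks(block_ids, n_interviews, rng):
    """Draw one block per interview, with replacement, and count interviews per block."""
    # rng.choices samples with replacement, so a block can be drawn more than once.
    draws = rng.choices(block_ids, k=n_interviews)
    return Counter(draws)


rng = random.Random(1978)          # arbitrary seed, so the draw can be repeated
block_ids = list(range(1, 121))    # hypothetical tract with 120 numbered blocks
interviews_per_block = draw_blocks(block_ids, n_interviews=50, rng=rng)

# Blocks that chance assigned more than one interview.
repeated = {block: n for block, n in interviews_per_block.items() if n > 1}
print(len(interviews_per_block), "distinct blocks;", len(repeated), "with repeats")
```

When a tract has many more blocks than interviews, repeats are uncommon, which is consistent with the text's remark that this did not happen very often.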

This study team made some excellent decisions that maximized the heterogeneity (and hence the representativeness) of their sample. As clusters get smaller and smaller (as you go from tract to block to household, or from village to neighborhood to household), the homogeneity of the units of analysis within the clusters gets greater and greater. People in one census tract or village are more like each other than people in different tracts or villages. People in one census block or barrio are more like each other than people across blocks or barrios. And people in households are more like each other than people in households across the street or over the hill.

This is very important. Most researchers would have no difficulty with the idea that they should only interview one person in a household because, for example, husbands and wives often have similar ideas about things and report similar behavior with regard to kinship, visiting, health care, child care, and consumption of goods and services. Somehow, the lesson becomes less clear when new researchers move into clusters that are larger than households.

But the rule stands: Maximize heterogeneity of the sample by taking as many of the biggest clusters in your sample as you can, and as many of the next biggest, and so on, always at the expense of the number of clusters at the bottom where homogeneity is greatest. Take more tracts or villages and fewer blocks per tract or barrios per village. Take more blocks per tract or barrios per village and fewer households per block or barrio. Take more households and fewer persons per household.

Many survey researchers say that, as a rule, you should have no fewer than five households in a census block. The Wichita group did not follow this rule: they had only enough money and person power to do 500 interviews, and they wanted to maximize the likelihood that their sample would faithfully represent the characteristics of the 193,000 adults in their city.

The Wichita team drew two samples: one main sample and one alternate sample. Whenever they could not get someone on the main sample, they took the alternate. That way, they maximized the representativeness of their sample because the alternates were chosen with the same randomized procedure as the main respondents in their survey. They were not forced to take "next-door neighbors" when a main respondent wasn’t home. (This kind of winging it in survey research has a tendency to clobber the representativeness of samples. In the United States, at least, interviewing only people who are at home during the day produces results that represent women with small children, shut-ins, telecommuters, and the elderly, and not much else.)

Next, the Wichita team randomly selected the households for interview within each block. This was the third stage in this multistage cluster design. The fourth stage consisted of flipping a coin to decide whether to interview a man or a woman in households with both. Whoever came to the door was asked to provide a list of those in the household over 16 years of age. If there was more than one eligible person in the household, the interviewer selected one at random, conforming to the decision made earlier on sex of respondent.
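The within-household step can be sketched as follows. This is an illustration, not the team's actual protocol: the household records, field names, and the function `select_respondent` are hypothetical, and the sketch assumes the coin is flipped only when the household has eligible members of both sexes, as the text suggests.

```python
import random


def select_respondent(members, rng):
    """Pick one eligible household member, flipping a coin on sex when both are present."""
    # Eligible respondents are those over 16 years of age.
    eligible = [m for m in members if m["age"] > 16]
    if not eligible:
        return None
    sexes = sorted({m["sex"] for m in eligible})
    if len(sexes) > 1:
        # The "coin flip": choose which sex to interview in this household.
        target = rng.choice(sexes)
        eligible = [m for m in eligible if m["sex"] == target]
    # Random choice among the remaining eligible members.
    return rng.choice(eligible)


rng = random.Random(5)  # arbitrary seed for a repeatable illustration
household = [
    {"name": "adult_1", "age": 44, "sex": "F"},
    {"name": "adult_2", "age": 47, "sex": "M"},
    {"name": "adult_3", "age": 19, "sex": "M"},
    {"name": "child_1", "age": 12, "sex": "F"},
]
print(select_respondent(household, rng))
```

Deciding on sex before choosing the person keeps the choice out of the hands of the interviewer and of whoever happens to answer the door.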

Table 5.2 shows how well the Wichita team did. All in all, they did very well. In addition to the variables shown in the table here, the Wichita sample was a fair representation of marital status and occupation, though it was off a bit on education. For example, at the time, 8% of the population of Wichita had less than 8 years of schooling, but only 4% of the sample had this characteristic. Only 14% of the general population had completed from 1 to 2 years of college, but 22% of the sample had that much education.

Table 5.2 Comparison of Survey Results and Population Parameters for the Wichita Study by Hartman and Hedblom

| | Wichita in 1973 | Hartman and Hedblom's Sample for 1973 |
|---|---|---|
| White | 86.8% | 82.8% |
| African | 9.7% | 10.8% |
| Chicano | 2.5% | 2.6% |
| Other | 1.0% | 2.8% |
| Male | 46.6% | 46.9% |
| Female | 53.4% | 53.1% |
| Median age | 38.5 | 39.5 |

SOURCE: Methods for the Social Sciences: A Handbook for Students and Non-Specialists, by J. J. Hartman and J. H. Hedblom, 1979, p. 165. Reproduced with permission of Greenwood Publishing Group.
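One quick way to read table 5.2 is to subtract each population percentage from the matching sample percentage. The sketch below does that with the figures from the table; the variable names are illustrative only.

```python
# Percentages from table 5.2: Wichita in 1973 versus the 1973 sample.
population = {"White": 86.8, "African": 9.7, "Chicano": 2.5, "Other": 1.0,
              "Male": 46.6, "Female": 53.4}
sample = {"White": 82.8, "African": 10.8, "Chicano": 2.6, "Other": 2.8,
          "Male": 46.9, "Female": 53.1}

# Gap in percentage points: positive means the group is over-represented in the sample.
gaps = {group: round(sample[group] - population[group], 1) for group in population}

for group, gap in gaps.items():
    print(f"{group:<8}{gap:+.1f}")
```

The largest gap is for White respondents, who are under-represented by 4 percentage points; the gaps for sex are under half a point.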

All things considered, though, the sampling procedure followed in the Wichita study was a model of technique, and the results show it. Whatever they found out about the 500 people they interviewed, the researchers could be very confident that the results were generalizable to the 193,000 adults in Wichita.

In sum: If you don’t have a sampling frame for a population, try to do a multistage cluster sample, narrowing down to natural clusters that do have lists. Sample heavier at the higher levels in a multistage sample and lighter at the lower stages (box 5.5).

BOX 5.5

MULTISTAGE CLUSTER SAMPLING IN THE FIELD

Just in case you're wondering if you can do this under difficult field conditions, Oyuela-Cacedo and Vieco-Albarracín (1999) studied the social organization of the Ticuna Indians of the Colombian Amazon. Most of the 9,500 Ticuna in Colombia are in 32 hamlets, along the Amazon, the Loreta Yacu, the Cotuhe, and the Putumayo Rivers. The Ticuna live in large houses that comprise from one to three families, including grandparents, unmarried children, and married sons with their wives and children. To get a representative sample of the Ticuna, Oyuela-Cacedo and Vieco-Albarracín selected six of the 32 hamlets along the four rivers and made a list of the household heads in those hamlets. Then, they numbered the household heads and randomly selected 50 women and 58 men. Oyuela-Cacedo and Vieco-Albarracín had to visit some of the selected houses several times to secure an interview, but they wound up interviewing all the members of their sample.

Is the sample representative? We can't know for sure, but take a look at figure 5.2. Figure 5.2a shows the distribution of the ages of the household heads in the sample; figure 5.2b shows the distribution of the number of children in the households of the sample. Both curves look very normal—just what we expect from variables like age and number of children (more about normal distributions in chapter 6). If the sample of Ticuna household heads represents what we expect from age and number of children, then any other variables the research team measured are likely (not guaranteed, just likely) to be representative, too.