Data for constrained spatial clustering
As described in Appendix A.5, the data for the analyses presented here come from four sources: 1) a spatial contiguity network file; 2) county shape files from the Global Administrative Areas website (gadm.org); 3) statistical data about US counties from the US Census Bureau (census.gov/support/USACdataDownloads.html); and 4) additional county data from the National Oceanic and Atmospheric Administration (noaa.gov). Combining information from these databases was complicated (see Appendix A.5). We selected county-level variables congruent with Nine Nations of North America and available^{[10]} for 2000. Many candidate variables were excluded due to collinearity, leaving the 42 variables listed in Table 9.2. They cover the broad areas listed in Section 9.1.1.
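The text does not say how collinearity was assessed. Purely as an illustration, the sketch below shows one common screening approach: keep a variable only if its absolute Pearson correlation with every variable already kept stays below a threshold. The function names, the threshold of 0.9, and the toy values are assumptions made for this example, not details of the procedure actually used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_collinear(columns, threshold=0.9):
    """Greedily keep variables whose |r| with every kept variable is below threshold.

    `columns` maps a variable name to its list of county values.
    The default threshold is an illustrative assumption.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data for five counties (not real census values).
toy = {
    "var_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "var_b": [3.0, 5.0, 7.0, 9.0, 11.0],  # exact linear function of var_a
    "var_c": [2.0, 9.0, 4.0, 7.0, 1.0],   # only weakly related to var_a
}
kept_variables = drop_collinear(toy)
print(kept_variables)
```

A greedy screen like this depends on the order in which variables are examined, so in practice the order would need to reflect substantive priorities.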
Discriminant analysis for Garreau's nations
Starting with Garreau's Nine Nations implies having a categorical variable (with the nations as categories) into which counties are placed. It is reasonable to check whether these 42 variables can predict this categorical variable. If they cannot, then using them for clustering with a relational constraint becomes suspect. But if they do have predictive value, the cluster analysis is justified. According to Garreau, there are eight nations in the USA. These groups are known (see Figure 9.2), so no clustering (partitioning) using these variables was sought at this stage. Instead, discriminant function analysis was used to 'predict' membership in the categorical variable. In short, we sought to see whether these variables can discriminate the categories (nations).
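To make this check concrete, the following is a minimal sketch of a linear discriminant analysis written with NumPy and run on synthetic data. It is not the software or the data behind the results reported here: the three groups, four variables, group means, and the function name `fit_discriminant_functions` are all illustrative assumptions. The number of discriminant functions is at most the smaller of the number of groups minus one and the number of variables, which for eight nations and 42 variables gives seven.

```python
import numpy as np


def fit_discriminant_functions(X, y):
    """Linear discriminant analysis via the eigenproblem Sb w = lambda Sw w.

    Returns the eigenvalues (largest first) and a coefficient matrix W with
    one column per discriminant function, scaled so that the pooled
    within-group covariance of the scores is the identity.
    """
    groups = np.unique(y)
    n, p = X.shape
    grand_mean = X.mean(axis=0)
    Sw = np.zeros((p, p))  # pooled within-group scatter
    Sb = np.zeros((p, p))  # between-group scatter
    for g in groups:
        Xg = X[y == g]
        mg = Xg.mean(axis=0)
        Sw += (Xg - mg).T @ (Xg - mg)
        d = (mg - grand_mean)[:, None]
        Sb += len(Xg) * (d @ d.T)
    # Whiten with the Cholesky factor of Sw so a symmetric eigensolver applies.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    evals, V = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    n_funcs = min(len(groups) - 1, p)
    top = np.argsort(evals)[::-1][:n_funcs]
    W = (L_inv.T @ V[:, top]) * np.sqrt(n - len(groups))
    return evals[top], W


# Synthetic data: 3 groups of 60 units, 4 variables, unit variance.
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [4, 0, 1, 0], [0, 4, 0, 1]], dtype=float)
y = np.repeat(np.arange(3), 60)
X = means[y] + rng.normal(size=(180, 4))

eigenvalues, W = fit_discriminant_functions(X, y)
scores = (X - X.mean(axis=0)) @ W
centroids = np.array([scores[y == g].mean(axis=0) for g in range(3)])

# 'Predict' membership by assigning each unit to the nearest group centroid.
dists = np.linalg.norm(scores[:, None, :] - centroids[None, :, :], axis=2)
predicted = dists.argmin(axis=1)
accuracy = float((predicted == y).mean())
print(W.shape, round(accuracy, 3))
```

Accuracy computed on the same units used to fit the functions is optimistic, so it is best read as a rough indication of whether the variables separate the groups at all.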
^{[10]} There were some exceptions when data from nearby years were used.
Table 9.2 Final county variables used in constrained clustering analyses.
| Variable | Variable name |
| --- | --- |
| 1 | Median Age |
| 2 | Civilian labor force unemployment rate |
| 3 | Percent of population with at least a Bachelor's (for people 25 and older) |
| 4 | Percentage change of housing units between 1990 and 2000 |
| 5 | Median household income |
| 6 | Percentage of population living in poverty |
| 7 | Per capita personal income |
| 8 | Percentage change of population between 1990 and 2000 |
| 9 | Population density (population per square mile) |
| 10 | Percentage of population female |
| 11 | Percentage of population Black or African American |
| 12 | Percentage of population Native American |
| 13 | Percentage of population Asian |
| 14 | Percentage of population Hispanic or Latino |
| 15 | Overall birth rate |
| 16 | Overall death rate |
| 17 | Infant mortality rate |
| 18 | Per capita water use |
| 19 | Percentage of population under 18 |
| 20 | Percentage of population over 85 |
| 21 | Percentage of land area in farms |
| 22 | Percentage of labor force employed in the construction industry |
| 23 | Percentage of labor force employed in manufacturing |
| 24 | Percentage of labor force employed in transportation and warehousing |
| 25 | Percentage of labor force employed in finance and insurance |
| 26 | Percentage of labor force employed in professions, science, and technology |
| 27 | Percentage of labor force employed in education and health |
| 28 | Percentage of population over 25 with less than a ninth grade education |
| 29 | Percentage of labor force employed in farming |
| 30 | Percentage of labor force employed in government (federal, state, and local) |
| 31 | Percentage of housing units owner occupied |
| 32 | Percentage of occupied housing units lacking indoor plumbing |
| 33 | Percentage of population that is rural |
| 34 | Percentage change of the urban population between 1990 and 2000 |
| 35 | Change in per capita income between 1989 and 1999 |
| 36 | Per capita ground water use |
| 37 | Percentage net domestic migration |
| 38 | Percentage of native population born in state of residence |
| 39 | Ratio in labor force: male to female |
| 40 | Percentage voting difference of Democrats over Republicans |
| 41 | Percentage public high school enrollment |
| 42 | Percentage change of poverty between 1995 and 2000 |
Table 9.3 Centroids for the first two discriminant functions.
Nations are ordered from the lowest to the highest centroid scores. DF stands for discriminant function.
Seven discriminant functions were obtained. The first four discriminate the nations well; the last three fare less well.^{[1]} All are interpretable. For interpretation we use the loadings of the discriminant functions (the coefficients of the linear combination defining each function) and the centroids (the means of each discriminant function for each nation). The first discriminant function is defined primarily by the following variables (loadings with absolute values higher than 0.5 are shown in parentheses):
• race (percent Hispanic or Latino population (-0.850) and percent Black or African American population (-0.775))
• age (percent of the population under 18 (-0.775) and the percent of the population over 85 (-0.775))
• percentage of the population living in poverty (0.531)
• median household income
• education (percentage of population over 25 with less than a ninth grade education (0.553))
• percentage of land area in farms
• percentage of native population born in the state of residence
• civilian labor force unemployment rate
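The roles of loadings and centroids can be illustrated with a small sketch. A county's score on a discriminant function is a weighted sum of its variable values, and a nation's centroid is the mean score of its counties. The two loadings below are the values reported above for percent Hispanic or Latino and percent living in poverty. Everything else is an assumption made for illustration: the county values are invented standardized scores, and a real score would sum over all 42 variables rather than two.

```python
from collections import defaultdict

# Two first-function loadings taken from the list above; the remaining
# variables are omitted to keep the sketch short.
loadings = {"pct_hispanic": -0.850, "pct_poverty": 0.531}

# Hypothetical standardized values for six made-up counties.
counties = [
    ("MexAmerica", {"pct_hispanic": 2.0, "pct_poverty": 0.5}),
    ("MexAmerica", {"pct_hispanic": 1.5, "pct_poverty": 1.0}),
    ("Dixie", {"pct_hispanic": -0.5, "pct_poverty": 1.0}),
    ("Dixie", {"pct_hispanic": -0.7, "pct_poverty": 0.6}),
    ("Breadbasket", {"pct_hispanic": -0.3, "pct_poverty": -0.4}),
    ("Breadbasket", {"pct_hispanic": -0.1, "pct_poverty": -0.2}),
]


def discriminant_score(values, loadings):
    """Linear combination of a county's variable values and the loadings."""
    return sum(loadings[name] * values[name] for name in loadings)


# Collect the county scores for each nation.
scores_by_nation = defaultdict(list)
for nation, values in counties:
    scores_by_nation[nation].append(discriminant_score(values, loadings))

# A centroid is the mean score of the counties in a nation.
centroids = {n: sum(s) / len(s) for n, s in scores_by_nation.items()}

# Order nations from the lowest to the highest centroid, as in the tables.
ordered = sorted(centroids, key=centroids.get)
for nation in ordered:
    print(f"{nation:12s} {centroids[nation]: .4f}")
```

With a negative loading on percent Hispanic or Latino, counties with a large Hispanic or Latino population receive low scores, which is why a nation such as MexAmerica sits at the low end of the ordering in this toy example.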
Table 9.3 contains the centroids for each of Garreau's nations in the USA. For the first discriminant function centroids (left panel), higher positive values occur for lower percentages of Hispanic or Latino populations, lower percentages of young people (below 18) in the population, higher percentages of Black or African Americans in the population, lower percentages of old people (above 85) in the population, higher percentages of people with lower education levels in the population, and higher percentages of the population living in poverty. MexAmerica typically has: the highest percentage of Hispanics or Latinos and the lowest percentage of Blacks or African Americans in the population; the highest percentage of the young people in the population; the highest percentage of older people in the population; the highest percentage of people with lower educational levels and the highest rate of poverty. The Breadbasket, The Empty Quarter, and Ecotopia follow. Dixie is at the opposite extreme compared to MexAmerica.
Figure 9.10 Plot of the first two discriminant functions with counties marked by nation colors.
The eight nations in the US are distinguished by colors. The black circles mark the locations of the centroids.
The second discriminant function is defined mostly by the percentage of Hispanics or Latinos in the population (coefficient -1.223), followed weakly by median household income (coefficient 0.458).^{[2]} Given the signs of these coefficients, high values on the second discriminant function correspond to lower percentages of Hispanics or Latinos in the population and higher household income. The centroids for this function are in the right panel of Table 9.3. Unsurprisingly, MexAmerica has by far the largest presence of the Hispanic or Latino population. The Islands and Ecotopia follow. New England is at the opposite extreme. Both The Foundry and The Breadbasket also have lower percentages of Hispanics or Latinos.
The values of the first two discriminant functions are plotted in Figure 9.10. Dixie (green points) has the most tightly clustered set of counties in this (arbitrary) two-dimensional space. The Breadbasket (gray points) also has tightly clustered counties. The Foundry (red points) comes next, overlapping the regions occupied by Dixie and The Breadbasket. MexAmerica's counties, while not so tightly clustered, are located entirely in the lower left part of Figure 9.10. Most of The Empty Quarter counties (yellow points) occupy one region in the two-dimensional space. Ecotopia's counties (magenta points) are more scattered, overlapping the regions for The Empty Quarter and MexAmerica. Although New England (orange points) and The Islands (purple points) occupy less clearly defined regions, Figure 9.10 suggests that the eight nations in the USA are discriminated well in the two-dimensional space defined by the first two discriminant functions.
Table 9.4 Centroids for the second two discriminant functions.
Nations are ordered from the lowest to the highest centroid scores. DF stands for discriminant function.
We consider next the third and fourth discriminant functions. High values on the third discriminant function are driven mainly by two variables (coefficients in parentheses): a high percentage of land in farms (0.722) and a high percentage of Blacks or African Americans in the population (0.560). The ordered centroids for this function are in the left panel of Table 9.4. The nations most typical of the low end, with the lowest percentages of farm land and of Black or African American population, are New England, Ecotopia, The Empty Quarter, and The Foundry. Dixie, The Breadbasket, and MexAmerica have the opposite characteristics, but they are less extreme.
The fourth discriminant function is mostly defined by the following variables (coefficients in parentheses): median household income (-1.042); percent Hispanic or Latino population (-0.711); percent native population born in the state of residence (-0.615); percent people living in poverty (-0.583); percent of labor force employed in manufacturing (-0.553); and percent of the population 25 years old and over having a Bachelor's degree or higher (0.553). The centroids for this discriminant function are in the right panel of Table 9.4. The Empty Quarter has lower household income and also lower percentages of Hispanic or Latino populations, lower percentages of native-born people in the state of residence, lower percentages of the labor force employed in manufacturing, and lower levels of higher education attainment. Both New England and The Foundry, overall, have the opposite characteristics.
The remaining discriminant functions fare less well in discriminating nations, consistent with their low eigenvalues. Even so, the results of using them are interpretable. The centroids for these dimensions are shown in Table 9.5.
The fifth discriminant function is defined by the following variables (with their coefficients in parentheses): percent of the population 25 years or over with less than a ninth grade education (0.709); percent of the population at least 25 years old with a Bachelor's degree or higher (0.672); difference of voting Democratic over Republican (0.533); the civilian labor force unemployment rate (-0.519); and percent of the native population born in the state of residence (-0.515). The corresponding centroids are in the left panel of Table 9.5. Typically, New England and also The Islands (here, the southern tip of Florida) have higher percentages of both people with very low educational attainment and people with very high educational attainment, higher levels of voting Democratic over Republican, lower unemployment rates, and lower percentages of native-born people living in the state where they were born. At the opposite end of this distribution are The Foundry and The Empty Quarter, but they are far less extreme in these characteristics.
Table 9.5 Centroids for the last three discriminant functions.

| Nation | Fifth centroids | Nation | Sixth centroids | Nation | Seventh centroids |
| --- | --- | --- | --- | --- | --- |
| Foundry | -0.5228 | Ecotopia | -1.98706 | Ecotopia | -0.36186 |
| Empty Quarter | -0.2202 | Islands | -1.38047 | New England | -0.22428 |
| MexAmerica | 0.0208 | Breadbasket | -0.04461 | MexAmerica | -0.04802 |
| Dixie | 0.0373 | Dixie | -0.00244 | Dixie | -0.01740 |
| Breadbasket | 0.0435 | Foundry | 0.09901 | Breadbasket | -0.00498 |
| Ecotopia | 0.1092 | MexAmerica | 0.20593 | Empty Quarter | 0.01749 |
| Islands | 1.9537 | Empty Quarter | 0.30408 | Foundry | 0.08922 |
| New England | 2.4103 | New England | 0.45916 | Islands | 4.23150 |

Nations are ordered from the lowest to the highest centroid scores.
The sixth discriminant function is mostly defined by the percentage of Asians in the population (-0.756) and the civilian unemployment rate (-0.508). The corresponding centroids are shown in the middle panel of Table 9.5. Ecotopia's counties, in general, are extreme in having the highest percentage of Asians in the population as well as the highest unemployment rate. The counties of the southern tip of Florida follow.
The last discriminant function is defined by the following variables (coefficients in parentheses): percentage of the population 25 years old and over with a Bachelor's degree or higher (-0.858); percentage of Asians in the population (-0.682); percentage of the population under 18 years (-0.645); and percentage of the workforce employed in the professions, sciences, and technology (0.537). The centroids are shown in the right panel of Table 9.5. The Islands (southern tip of Florida) stands out as extreme, having a high presence of people with low educational attainment, lower percentages of Asians in the population, lower percentages of young people (under 18), and a higher percentage of people employed in the professions, sciences, and technology.
The discriminant analysis results using attributes of counties are consistent with much of Garreau's descriptive narrative, even though he described conditions in the late 1970s and early 1980s while the data we use come mostly from 2000. This is consistent with the idea advanced by Woodard that change in these large regions is relatively slow.
Because Woodard's account is based on historical records and contexts, we did not think it justifiable to construct a set of quantitative variables as we did for Garreau's account. As a result, we could not perform a discriminant analysis for his set of nations.
The analysis of this section did not utilize the adjacency of counties in geographic space. We turn next to consider clustering counties in terms of their attributes but constrained by their adjacency relations. We provide clustering results for different values of k as Garreau and Woodard have different numbers of nations in their characterizations of the USA.
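As a preview of what an adjacency constraint does, the sketch below implements a generic agglomerative procedure in which two clusters may merge only if at least one pair of their units is adjacent. It is an assumption-laden illustration, not the algorithm used for the results that follow: the function name `constrained_clustering`, the squared distance between cluster means as the merge criterion, and the six toy units arranged along a line are all chosen for this example.

```python
def constrained_clustering(attrs, edges, k):
    """Merge adjacent clusters with the closest means until k clusters remain.

    attrs: dict mapping unit -> tuple of attribute values
    edges: list of (u, v) pairs of adjacent units
    k:     requested number of clusters
    """
    clusters = {u: [u] for u in attrs}  # cluster id -> member units
    owner = {u: u for u in attrs}       # unit -> id of its cluster

    def mean(members):
        dim = len(attrs[members[0]])
        return [sum(attrs[m][i] for m in members) / len(members) for i in range(dim)]

    while len(clusters) > k:
        best = None
        for u, v in edges:
            a, b = owner[u], owner[v]
            if a == b:
                continue  # both units already share a cluster
            d = sum((x - y) ** 2 for x, y in zip(mean(clusters[a]), mean(clusters[b])))
            if best is None or d < best[0]:
                best = (d, a, b)
        if best is None:
            break  # no adjacent clusters left (disconnected network)
        _, a, b = best
        clusters[a].extend(clusters[b])
        for m in clusters[b]:
            owner[m] = a
        del clusters[b]
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical units on a line: c1 - c2 - c3 - c4 - c5 - c6.
attrs = {"c1": (1.0,), "c2": (1.2,), "c3": (5.0,), "c4": (5.1,), "c5": (1.1,), "c6": (0.9,)}
edges = [("c1", "c2"), ("c2", "c3"), ("c3", "c4"), ("c4", "c5"), ("c5", "c6")]

partition = constrained_clustering(attrs, edges, k=3)
print(partition)
```

In this toy example the units c1, c2, c5, and c6 have similar attribute values, but c1 and c2 are separated from c5 and c6 by c3 and c4, so the constraint keeps them in different clusters when k = 3.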