# Examining the Strength of Association and Direction of All Paired Variables Using a Scatterplot Matrix

The overall patterns among most of the variables suggest possible linear relationships (increasing or decreasing trends in both the x and y variables; some pairs are positively correlated, whereas others are negatively correlated). The exceptions are the pairings that involve the Asian group or proximity to police stations (Figure 5.2), which show neither a clear association (the correlation is weak) nor a clear direction.

## Fitting the Ordinary Least Squares Regression Model

We need to fit the best OLS regression model to ensure that we have a properly specified model before moving ahead with the GWR model.

### Primary Model

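We need to determine whether the primary model is statistically significant at α = 0.05. We do this by investigating whether the null hypothesis that all 12 regression coefficients are equal to zero can be rejected in favor of the alternative that at least one coefficient is not equal to zero.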

TABLE 5.4

A Correlation Matrix for all Paired Variables

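| | HS | POV | UEM | EDU | AEA | INC | HI | W | B | H | A | PL | HP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HS | 1 | | | | | | | | | | | | |
| POV | 0.32 | 1 | | | | | | | | | | | |
| UEM | 0.14 | 0.76 | 1 | | | | | | | | | | |
| EDU | 0.91 | 0.42 | 0.32 | 1 | | | | | | | | | |
| AEA | 0.24 | 0.40 | 0.60 | 0.42 | 1 | | | | | | | | |
| INC | -0.55 | -0.53 | -0.61 | -0.71 | -0.76 | 1 | | | | | | | |
| HI | 0.69 | 0.77 | 0.74 | 0.83 | 0.68 | -0.84 | 1 | | | | | | |
| W | -0.42 | -0.72 | -0.75 | -0.57 | -0.61 | 0.72 | -0.83 | 1 | | | | | |
| B | -0.17 | 0.65 | 0.79 | -0.06 | 0.49 | -0.36 | 0.45 | -0.73 | 1 | | | | |
| H | 0.72 | -0.19 | -0.27 | 0.71 | 0.02 | -0.29 | 0.27 | -0.06 | -0.60 | 1 | | | |
| A | -0.05 | -0.07 | -0.30 | -0.04 | -0.28 | 0.19 | -0.18 | 0.23 | -0.34 | -0.10 | 1 | | |
| PL | -0.10 | -0.23 | -0.17 | -0.09 | 0.11 | -0.04 | -0.13 | 0.23 | -0.16 | 0.08 | -0.19 | 1 | |
| HP | 0.02 | -0.26 | -0.06 | 0.04 | 0.31 | -0.17 | -0.01 | 0.16 | -0.21 | 0.21 | -0.15 | 0.52 | 1 |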

Note: The 16 paired variables shown in bold typeface have been identified as exhibiting some level of collinearity.

Abbreviations: A, Asian; AEA, percent of the population under 18 or over 64 years of age; B, Black; EDU, percent of persons aged 25 years or older without a high school diploma; H, Hispanic; HP, proximity to hospitals; HS, percent of occupied housing units with more than one person per room; INC, per capita income; PL, proximity to police station; POV, percent of households living below the federal poverty level; UEM, percent of persons aged 16 years or older in the labor force who are unemployed; W, White.
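One way to screen Table 5.4 for collinear pairs is to scan the lower triangle for large absolute correlations. The sketch below does this in plain Python. The 0.70 cutoff and the names `NAMES`, `LOWER`, and `flag_pairs` are assumptions made for illustration, since the source does not state which rule was used to select the pairs shown in bold.

```python
# Lower triangle of Table 5.4, one row per variable (diagonal included).
NAMES = ["HS", "POV", "UEM", "EDU", "AEA", "INC", "HI", "W", "B", "H", "A", "PL", "HP"]
LOWER = [
    [1],
    [0.32, 1],
    [0.14, 0.76, 1],
    [0.91, 0.42, 0.32, 1],
    [0.24, 0.40, 0.60, 0.42, 1],
    [-0.55, -0.53, -0.61, -0.71, -0.76, 1],
    [0.69, 0.77, 0.74, 0.83, 0.68, -0.84, 1],
    [-0.42, -0.72, -0.75, -0.57, -0.61, 0.72, -0.83, 1],
    [-0.17, 0.65, 0.79, -0.06, 0.49, -0.36, 0.45, -0.73, 1],
    [0.72, -0.19, -0.27, 0.71, 0.02, -0.29, 0.27, -0.06, -0.60, 1],
    [-0.05, -0.07, -0.30, -0.04, -0.28, 0.19, -0.18, 0.23, -0.34, -0.10, 1],
    [-0.10, -0.23, -0.17, -0.09, 0.11, -0.04, -0.13, 0.23, -0.16, 0.08, -0.19, 1],
    [0.02, -0.26, -0.06, 0.04, 0.31, -0.17, -0.01, 0.16, -0.21, 0.21, -0.15, 0.52, 1],
]


def flag_pairs(names, lower, threshold):
    """Return (row_var, col_var, r) for off-diagonal pairs with |r| >= threshold."""
    flagged = []
    for i, row in enumerate(lower):
        for j, r in enumerate(row[:i]):  # row[:i] skips the diagonal entry
            if abs(r) >= threshold:
                flagged.append((names[i], names[j], r))
    return flagged


THRESHOLD = 0.70  # assumption: the cutoff is not stated in the source
flagged = flag_pairs(NAMES, LOWER, THRESHOLD)
for row_var, col_var, r in flagged:
    print(f"{row_var}-{col_var}: {r:+.2f}")
print(len(flagged), "pairs flagged")
```

With this cutoff the sketch flags 16 pairs, the same count given in the table note, although the source does not confirm that this is the rule the authors applied.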

FIGURE 5.2

A scatterplot matrix and histogram showing all paired variables.

Given that the observed joint F-statistic is 258.03, and it is greater than the critical value at (12, 64) degrees of freedom, we can reject the null hypothesis and conclude that at least one regression coefficient is not equal to zero.

## Examining Variance Inflation Factor Results

The VIF is another formal measure for detecting the presence of collinearity. It is used to eliminate any potential redundancy among the independent variables, $X_i$, by adding or deleting a predictor variable. The VIF indicates how much the variance of a coefficient estimate is inflated by multicollinearity. Simply put, the existence of this problem in a regression model suggests large standard errors in the coefficient estimates. Most standard statistical textbooks suggest that a VIF greater than five indicates a concern for collinearity. This is because the expected sum of squared errors in the standardized regression coefficients is nearly five times as large as it would be if the predictor variables were uncorrelated. However, Neter et al. (1996) have suggested examining VIF values that greatly exceed 10. In ESRI's ArcGIS, the cutoff is placed at values larger than 7.5 when examining an OLS model for the collinearity problem. This book recommends the rule of thumb: VIF values that exceed five should be critically reviewed when deriving the best model.
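To make the definition concrete, the VIF of predictor $j$ is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on the others; equivalently, it is the $j$-th diagonal entry of the inverse of the predictors' correlation matrix. The sketch below is a minimal plain-Python illustration, not the ArcGIS implementation. The helper names `invert` and `vif_from_correlations` are hypothetical, and the input uses the two-decimal correlations from Table 5.4 for the three predictors of the best model, so the results are only approximate.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_from_correlations(corr):
    """Return the VIFs, i.e. the diagonal of the inverse correlation matrix."""
    inverse = invert(corr)
    return [inverse[j][j] for j in range(len(corr))]


# Correlations among HS, UEM, and AEA, rounded to two decimals as in Table 5.4.
predictors = ["HS", "UEM", "AEA"]
corr = [
    [1.00, 0.14, 0.24],
    [0.14, 1.00, 0.60],
    [0.24, 0.60, 1.00],
]

vifs = vif_from_correlations(corr)
for name, vif in zip(predictors, vifs):
    status = "review" if vif > 5 else "ok"  # rule of thumb from the text
    print(f"{name}: VIF = {vif:.3f} ({status})")
```

The sketch gives values of roughly 1.06, 1.56, and 1.63, close to the best-model VIFs reported in Table 5.5 (1.06, 1.58, and 1.64); small differences are expected because the input correlations are rounded.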

TABLE 5.5

Variance Inflation Factor (VIF) Values for the Three Ordinary Least Squares Models

| Factors | Primary Model | Reduced Model | Best Model |
|---|---|---|---|
| HS | 8.188543ᵃ | 3.242602 | 1.063577 |
| POV | 4.261475 | | |
| UEM | 6.238082ᵃ | 3.758031 | 1.577382 |
| EDU | 23.531012ᵃ | | |
| AEA | 3.766619 | 2.419355 | 1.642695 |
| INC | 5.055292ᵃ | | |
| W | >1000.0ᵃ | | |
| B | >1000.0ᵃ | 7.444437ᵃ | |
| H | >1000.0ᵃ | 5.985656ᵃ | |
| A | 283.243408ᵃ | 1.720664 | |
| PL | 1.662634 | 1.504726 | |
| HP | 2.069709 | 1.858672 | |

ᵃ VIF values that exceed 5; a conservative threshold is applied to critically evaluate the presence of collinearity.

Abbreviations: A, Asian; AEA, percent of the population under 18 or over 64 years of age; B, Black; EDU, percent of persons aged 25 years or older without a high school diploma; H, Hispanic; HP, proximity to hospitals; HS, percent of occupied housing units with more than one person per room; INC, per capita income; PL, proximity to police station; POV, percent of households living below the federal poverty level; UEM, percent of persons aged 16 years or older in the labor force who are unemployed; W, White.

Table 5.5 summarizes the VIF values for the three OLS models that were generated using ESRI's ArcGIS.

The primary model has eight variables with VIF values larger than five (HS, UEM, EDU, INC, W, B, H, and A), the reduced model has two variables with VIF values that exceed the threshold (B and H), and the best model shows a remarkable improvement, with its highest VIF value observed for AEA (1.643). This is far below the threshold of five.

### Reduced Model

After examining the well-being factors using a scatterplot and correlation analysis, the reduced model is as follows:

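We need to determine whether the reduced model is statistically significant at α = 0.05. We do this by investigating whether the null hypothesis that all eight regression coefficients are equal to zero can be rejected in favor of the alternative that at least one coefficient is not equal to zero.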

TABLE 5.6

A Summary of the Ordinary Least Squares Results for the Three Models

| Variables | Primary Model Coefficient Estimate (t-value) | Reduced Model Coefficient Estimate (t-value) | Best Model Coefficient Estimate (t-value) |
|---|---|---|---|
| HI | -88.1853 (-1.0549) | -39.206192 (-8.379899)ᵃ | -32.621131 (-6.612045)ᵃ |
| HS | 1.0057 (2.518495)ᵃ | 3.532561 (8.853233)ᵃ | 4.374369 (16.390050)ᵃ |
| POV | 0.623354 (6.802502)ᵃ | | |
| UEM | 0.647681 (3.573078)ᵃ | 1.782008 (7.976269)ᵃ | 2.105701 (12.456289)ᵃ |
| EDU | 1.024476 (5.112745)ᵃ | | |
| AEA | 0.594223 (4.361404)ᵃ | 0.837226 (4.828465)ᵃ | 0.910480 (5.456300)ᵃ |
| INC | -0.000116 (-1.507690) | | |
| W | 67.405221 (0.781687) | | |
| B | 76.487877 (0.889839) | 21.010896 (3.846550)ᵃ | |
| H | 68.516060 (0.786059) | 27.896733 (3.816535) | |
| A | 73.666656 (0.818366) | 50.236394 (4.509104)ᵃ | |
| PL | 0.000018 (0.140818) | 0.000302 (1.560332) | |
| HP | 0.000016 (0.118153) | -0.000293 (-1.449490) | |
| AIC | 468.93 | 533.37 | 550.25 |
| $r^2$ | 0.976 | 0.9394 | 0.9173 |
| Observations | 77 | 77 | 77 |
| Moran's I | -0.062 (-0.677) | -0.0204 (-0.2102) | 0.0254 (1.1158) |

ᵃ Statistically significant coefficient estimates.

Abbreviations: A, Asian; AEA, percent of the population under 18 or over 64 years of age; AIC, Akaike's information criterion (a measure of model performance with the smallest value preferred); B, Black; EDU, percent of persons aged 25 years or older without a high school diploma; H, Hispanic; HP, proximity to hospitals; HS, percent of occupied housing units with more than one person per room; INC, per capita income; PL, proximity to police station; POV, percent of households living below the federal poverty level; UEM, percent of persons aged 16 years or older in the labor force who are unemployed; W, White.

Given that the observed joint F-statistic is 148.17, and it is greater than the critical value at (8, 68) degrees of freedom, we can reject the null hypothesis and conclude that at least one regression coefficient is not equal to zero.

All three regression models explain more than 90% of the total variation in the well-being significance that is attributable to the independent variables, $X_i$, as defined by the model fit to the data (Table 5.6). Additionally, all three predictor variables identified in the best model have positive coefficients, implying that as these variables increase, the level of hardship in the community areas also increases. However, because of the severe collinearity in the primary and reduced models, we must resolve this concern by finding a more meaningful model.

### Best Model

The best model, selected after reviewing the fit statistics, the lack-of-fit test, and other relevant collinearity diagnostics, is as follows:

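We need to determine whether the best model is statistically significant at α = 0.05. We do this by investigating whether the null hypothesis that all three regression coefficients are equal to zero can be rejected in favor of the alternative that at least one coefficient is not equal to zero.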

Given that the observed joint F-statistic is 281.95, and it is greater than the critical value at (3, 73) degrees of freedom, we can reject the null hypothesis and conclude that at least one regression coefficient is not equal to zero.
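The joint F-statistics and their degrees of freedom can be related to the model fit with a short calculation. The sketch below assumes that the $r^2$ row of Table 5.6 reports adjusted $R^2$ (the source does not say so), converts it back to $R^2$, and applies the standard formula $F = (R^2/k) / ((1 - R^2)/(n - k - 1))$ with $n = 77$ observations and $k$ predictors. The function names are hypothetical.

```python
def r2_from_adjusted(adj_r2, n, k):
    """Recover R^2 from adjusted R^2 for n observations and k predictors."""
    return 1.0 - (1.0 - adj_r2) * (n - k - 1) / (n - 1)


def joint_f_statistic(r2, n, k):
    """Return (F, df1, df2) for the test that all k slope coefficients are zero."""
    df1, df2 = k, n - k - 1
    f_value = (r2 / df1) / ((1.0 - r2) / df2)
    return f_value, df1, df2


N = 77  # observations, from Table 5.6
# (r2 value from Table 5.6, number of predictors); treating r2 as adjusted
# R^2 is an assumption, not something the source states.
models = {
    "primary": (0.976, 12),
    "reduced": (0.9394, 8),
    "best": (0.9173, 3),
}

results = {}
for name, (adj_r2, k) in models.items():
    r2 = r2_from_adjusted(adj_r2, N, k)
    results[name] = joint_f_statistic(r2, N, k)
    f_value, df1, df2 = results[name]
    print(f"{name}: F({df1}, {df2}) = {f_value:.2f}")
```

Under that assumption the sketch gives F values of roughly 258.6, 148.3, and 282.0 with (12, 64), (8, 68), and (3, 73) degrees of freedom, close to the reported 258.03, 148.17, and 281.95; the small gap for the primary model is consistent with its $r^2$ being reported to only three decimals.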

In selecting the best equation, we must also determine which of the independent variables, $X_i$, are statistically significant at α = 0.05. We do this by investigating whether each regression coefficient differs from zero.

The Jarque-Bera statistic is a goodness-of-fit test that shows whether the residuals are normally distributed, using a chi-square distribution with 2 degrees of freedom; non-normal residuals would indicate that the model predictions are biased. The statistic was 1.701 (p-value = 0.427) for the primary model, 2.219 (p-value = 0.329) for the reduced model, and 3.1615 (p-value = 0.164) for the best model. Because none of these p-values falls below 0.05, we cannot reject the hypothesis of normality, and we conclude that the residuals in all three OLS models are normally distributed and unbiased.
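As an illustration of how this test works, the Jarque-Bera statistic combines the skewness $S$ and kurtosis $K$ of the residuals as $JB = \frac{n}{6}\left(S^2 + \frac{(K - 3)^2}{4}\right)$, and for a chi-square distribution with 2 degrees of freedom the upper-tail p-value is simply $e^{-JB/2}$. The sketch below is a minimal plain-Python version; the function names and the sample residuals are hypothetical and are not taken from the study data.

```python
import math


def jb_p_value(jb):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-jb / 2.0)


def jarque_bera(residuals):
    """Return (JB statistic, p-value) computed from a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments using the biased (divide-by-n) convention.
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return jb, jb_p_value(jb)


# Hypothetical residuals, for illustration only.
sample = [-2.1, -1.4, -0.9, -0.6, -0.2, 0.0, 0.1, 0.4, 0.7, 1.1, 1.3, 1.6]
jb, p = jarque_bera(sample)
print(f"JB = {jb:.3f}, p-value = {p:.3f}")

# p-value implied by the primary model's reported statistic.
print(f"p-value for JB = 1.701: {jb_p_value(1.701):.3f}")
```

For the primary model's statistic of 1.701, `jb_p_value` returns about 0.427, in line with the reported p-value.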