# Path Analysis

SEM has many applications in health and medical research. SEMs are useful for understanding the measurement properties of clinical screening, assessment and symptom scales. Relationships among these scales, demographic characteristics, risk factors and health outcomes can be evaluated using SEM approaches. SEM is useful for studying latent variables and hypothesized relationships between variables.

With the preceding discussion, we have provided the reader with some background on latent variables. Now, we discuss *path analysis.* Path analysis is a technique used to examine hypothesized causal relationships between variables and is a special case of SEM with only observed variables.

In clarifying the purpose of path analysis, we discuss how it diverges from other more traditional analyses. That is, in order to apply path analysis, one needs to understand explicitly how a typical biomedical interpretation of path analysis differs from traditional regression analyses. In referencing traditional regression, we are not merely discussing a technique, but also the standard assumptions and application of the technique for evaluating an associative relationship. In these traditional applications, one hypothesizes an associative relationship and draws conclusions regarding association (or lack thereof). For traditional ordinary least squares linear regression, statistical assumptions include linearity, homoscedasticity, independence of errors, weak exogeneity and little or no multicollinearity in the predictors. See, for example, Neter et al. [12] or Weisberg [13] for a full review of the statistical assumptions and standard applications of traditional regression analysis.

For this introduction to path analysis, again, we are merely discussing a type of analysis as opposed to a specific statistical framework. The technique of multivariate multiple regression could be used to perform path analysis, but one makes different assumptions (i.e. causal assumptions) in path analysis than in the traditional application of multiple regression. Similarly, once the model has been established, path analysis can be done by fitting one regression equation at a time. PROCESS, written by Andrew F. Hayes, is a macro for SPSS and SAS that can be used to conduct path analysis using this approach [14].
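The equation-by-equation approach can be sketched in a few lines of Python. This is a minimal illustration, not the PROCESS macro: the three-variable model (program → exercise → BMI, plus a direct program → BMI path), the simulated values and the helper `ols` are all assumptions made for this example.

```python
import random


def ols(y, *predictors):
    """Ordinary least squares; returns [intercept, b1, b2, ...]."""
    rows = [[1.0, *vals] for vals in zip(*predictors)]
    k = len(rows[0])
    # Build the normal equations (X'X) b = X'y as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical simulated data (all generating values are assumptions).
rng = random.Random(7)
n = 550
program = [rng.choice([0.0, 1.0]) for _ in range(n)]
exercise = [3.0 + 2.0 * p + rng.gauss(0.0, 1.5) for p in program]
bmi = [31.0 - 0.5 * e - 0.8 * p + rng.gauss(0.0, 2.5)
       for e, p in zip(exercise, program)]

# Equation 1: exercise regressed on program (path a).
_, a = ols(exercise, program)
# Equation 2: BMI regressed on exercise and program (paths b and c').
_, b, c_prime = ols(bmi, exercise, program)
# For comparison: the total effect of program on BMI.
_, total = ols(bmi, program)

print(f"a = {a:.3f}, b = {b:.3f}, direct c' = {c_prime:.3f}")
print(f"indirect a*b = {a * b:.3f}, total = {total:.3f}")
```

With ordinary least squares on the same cases, the total effect equals the direct effect plus the product `a * b`, which is one way to check that the separate equations are being combined consistently. Interpreting these coefficients as causal paths rests on the causal assumptions of the model, not on the arithmetic.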

Let us suppose that, from our example in Figure 2.1, we are focused on weekly exercise and BMI. We hypothesize that in our participants, BMI is inversely associated with weekly exercise. We use traditional linear regression analysis. Our hypothetical results, simulating 550 study participants, show a negative regression slope estimate with a confidence interval that does not contain zero and a statistically significant p-value (Figure 2.4).

What can we conclude? We can generally make a statement like, "in our sample, given our study assumptions, BMI is negatively associated with weekly exercise." We are assuming that we checked traditional regression assumptions and evaluated for multicollinearity, autocorrelation of residuals and high leverage values/outliers and found no substantial issues before drawing these conclusions.
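As a rough illustration of this kind of analysis, the following Python sketch simulates 550 hypothetical participants and regresses BMI on weekly exercise. The generating values (intercept, slope, noise level, a 0 to 10 hours-per-week range) and the function names are assumptions for illustration only and are not the data behind Figure 2.4. The interval uses a normal approximation rather than the exact t distribution.

```python
import random


def simulate_participants(n=550, seed=42):
    """Simulate hypothetical weekly exercise (hours) and BMI values."""
    rng = random.Random(seed)
    exercise = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # Assumed true relationship: BMI falls 0.6 units per hour of exercise.
    bmi = [30.0 - 0.6 * x + rng.gauss(0.0, 3.0) for x in exercise]
    return exercise, bmi


def slope_with_interval(x, y, z=1.96):
    """Least-squares slope and an approximate 95% confidence interval."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_error = (rss / (n - 2) / sxx) ** 0.5
    return slope, slope - z * std_error, slope + z * std_error


exercise, bmi = simulate_participants()
slope, lower, upper = slope_with_interval(exercise, bmi)
print(f"slope = {slope:.3f}, approx. 95% CI = ({lower:.3f}, {upper:.3f})")
```

A negative slope whose interval excludes zero supports only the associative statement above; nothing in this calculation encodes a causal assumption.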

We could use traditional multiple regression to evaluate other clinical risk factors (socioeconomic status, depressive symptoms, age, sex, race, etc.) that are associated with BMI. Traditional multiple regression could also be used for evaluating the wellness program vs. treatment as usual for lowering BMI, while adjusting for various clinical patient characteristics, such as age, race and sex. In these examples, given relevant data, we have a process for making biomedical interpretations that follows the left side of Figure 2.5 for traditional regression analysis. Unlike in the example corresponding to Figure 2.1, we did not make any causal assumptions in the traditional regression approach.

Path analysis is an extension of traditional regression analysis, in which a researcher estimates hypothesized causal relationships among a set of observed variables. The path analysis process flow on the right side of Figure 2.5 differs in important ways from that of traditional regression analysis. In path analysis, the researcher is often more interested in causal relationships than in mere associations among a set of measures. Such modeling under strong causal assumptions is more complex than traditional regression analysis.

FIGURE 2.4

Scatterplot with linear regression line and 95% confidence region for regression of body mass index on weekly exercise (N = 550).

FIGURE 2.5

Process flow diagram for biomedical interpretation for traditional regression analysis vs. path analysis. A researcher makes causal assumptions based on logic, theory and prior literature.

In a traditional univariate or multiple regression approach, only a single equation is fit at one time. Multivariate regression modeling, path modeling and SEM, in a single analysis, allow for the simultaneous estimation of multiple unknown model parameters in multi-equation models. For example, using the SEM framework, the magnitude and significance of the eight causal paths in the path diagram in Figure 2.1 are estimated simultaneously. Equivalently, the parameters representing the causal paths in the four regression-type mathematical equations corresponding to Figure 2.1 are estimated simultaneously. Recall that each regression equation in the structural model has a *disturbance term* that reflects the residual error in regressing an outcome variable on a predictor (or set of predictors).

In traditional multiple regression analysis, potential confounders are accounted for in order to decrease bias in the estimate of the main effect. The testing of any single causal path in the SEM framework can be interpreted in light of adjustment for potential confounders in the form of other model relationships (and covariates), since the analysis is performed at the same time. Therefore, while simultaneously estimating all paths in a structural equation model, one can focus on statistical output about a single causal path or a particular relationship across causal paths (e.g. wellness program → calorie intake → anxiety) in order to evaluate a specific research hypothesis of interest. As a result, SEM simplifies testing in complex causal models.

Hypothesized correlational relationships in addition to cause and effect relationships can still be evaluated while using path analysis or SEM. The SEM framework is sufficiently general to be used for many traditional statistical techniques. ANOVA, traditional linear regression and traditional logistic regression can also be performed in the SEM framework.

The SEM framework can handle many more specialized techniques when a researcher makes use of causal assumptions. SEM can be used to analyze feedback loops (Chapter 5), latent constructs and path analysis models. SEM can be used to conduct multiple group comparisons across the same measurement and/or structural model. The SEM framework, compared to traditional frameworks, provides more empirical and theoretical flexibility for researchers.

The assumptions one makes when using a traditional linear regression approach to evaluate an associative relationship do not allow a researcher to make causal conclusions. Traditional regression analysis implies a statistical relationship based on a conditional expected value. Additionally, perfect measurement in the observed variables (i.e. no measurement error) is assumed in traditional regression analysis. Strong a priori theory is more typically used in path analysis and SEM. These approaches more typically allow a functional relationship between variables expressed via a conceptual model, path diagram and mathematical equations [15]. In this way, path analysis and other SEM approaches are a combination of data-independent methods (conceptual modeling and path diagramming) and data-driven empirical analyses. The presence of latent variables and the explicit modeling of measurement error are outside of the paradigm of traditional regression analysis and path analysis. The structural model in the SEM framework, in which relationships between latent variables can be examined, can also be thought of as the combination of CFA and path analysis.

# Conducting SEM Analysis in Health and Medicine

In somewhat simplistic terms, a conventional structural equation model in health and medicine consists of a set of equations that fuse together CFA and path analysis, with a purpose in mind. That purpose is commonly to test hypotheses regarding relationships among latent and observed measures.

## Confirmatory Data Analysis for a Single Model

Recall that exploratory data analysis is data-driven, while confirmatory data analysis is hypothesis-driven (it relies on an a priori hypothesis). Here we discuss a single structural equation model used for a given study. Conducting confirmatory data analysis using SEM in health and medicine may involve four basic steps (Table 2.2).

## Model Specification

A researcher may first have a hypothesis of interest in mind and a conceptual model to represent the phenomenon being investigated. David A. Kenny has defined the process of building a structural equation model as the "translation of theory, previous research, design, and common sense into a structural model" [16]. In this process of model specification, a researcher translates the conceptual model into a formal structural equation model.

*Model misspecification* (or *specification error*) is a term describing a model that fails to account for everything it should. For example, a model that omits important explanatory variables or a meaningful causal path or correlation between two variables is misspecified. A misspecified model leads to the possibility of coming to incorrect conclusions due to biased estimates. In practice, all models, as approximations of the truth, are somewhat misspecified. SEM researchers using a team science approach and theory, logic and prior literature in developing models are employing sensible strategies toward minimizing the amount of specification error.

TABLE 2.2

Four Basic Steps for Conducting Confirmatory Data Analysis Using SEM

1. Specify an identifiable model
2. Estimate the model
3. Assess model fit
4. Test hypotheses of interest

## Model Identification

Importantly, for analytical purposes, a model must meet the condition of *identifiability*. *Identifiability* is a property in which a single solution is possible for all unknown parameters. Even with "perfect" data of an infinitely large sample size, one would not be able to uniquely estimate unknown parameters if a model were not identifiable. Thus, lack of identifiability is not an issue regarding data quality. We dedicate much of Chapter 5 to the topic of identifiability. In that chapter we discuss some necessary conditions for identifiability and provide some examples to help one understand the topic further.
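One widely used necessary (but not sufficient) condition is a simple counting check: a model cannot have more free parameters than there are unique variances and covariances among the observed variables. The sketch below illustrates that check only. The function names are hypothetical, and passing the check does not establish identifiability.

```python
def count_unique_moments(num_observed):
    """Number of distinct variances and covariances among p variables."""
    return num_observed * (num_observed + 1) // 2


def counting_rule(num_observed, num_free_parameters):
    """Necessary counting condition: free parameters <= unique moments."""
    moments = count_unique_moments(num_observed)
    df = moments - num_free_parameters
    return {"moments": moments, "df": df, "passes": df >= 0}


# Hypothetical path model with three observed variables X, M and Y:
# paths X -> M, M -> Y and X -> Y, one exogenous variance and two
# disturbance variances give 3 + 1 + 2 = 6 free parameters.
print(counting_rule(num_observed=3, num_free_parameters=6))

# Adding a seventh free parameter exceeds the available information.
print(counting_rule(num_observed=3, num_free_parameters=7))
```

A model that fails this count cannot be identified regardless of sample size, which is consistent with the point that identifiability is a property of the model rather than of the data.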

At this point in the textbook, we broadly mention the role of model identifiability within Table 2.2. There would be no substantial statistical analysis for the model without meeting this condition. Thus, one needs to specify an identifiable model before conducting data analysis. However, keep in mind that the existence of a unique solution for unknown parameters does not ensure that estimates will be unbiased. For example, the researcher may have applied a numerical optimization procedure that did not perform well on the data at hand, or there may be systematic error in the measurement of the variables in the sample. There are many places to go astray in the application of SEM, starting from conceptualizing a model that does not translate into an identifiable model and/or having poor quality data.

## Model Estimation and Evaluation

An important feature of SEM analyses is that formal tests and indices are available for assessing the adequacy of the fit of a model to the observed data (Chapter 4). Much emphasis is given to model fit. The classic approaches to SEM analysis use the covariance matrix of the data to estimate the unknown parameters in a specified model. In SEM analysis, different algorithms for numerical optimization can be used to estimate parameters with the aim of closely reproducing the covariance matrix. The better the model fits the observed data, the better one reproduces the covariance matrix after plugging in the estimates of unknown parameters.
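The idea of reproducing the covariance matrix can be made concrete with a small sketch. It assumes a hypothetical three-variable chain model (X → M → Y with no direct X → Y path), uses simple moment-based estimates rather than a full numerical optimizer, and simulates its own data. All names and generating values are assumptions for illustration.

```python
import random


def covariance_matrix(columns):
    """Sample covariance matrix (divisor n - 1) for a list of columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    k = len(columns)
    return [
        [sum((columns[i][t] - means[i]) * (columns[j][t] - means[j])
             for t in range(n)) / (n - 1) for j in range(k)]
        for i in range(k)
    ]


def fit_chain_model(s):
    """Moment-based estimates for X -> M -> Y from covariance matrix s."""
    phi = s[0][0]                      # variance of X
    a = s[0][1] / s[0][0]              # path X -> M
    b = s[1][2] / s[1][1]              # path M -> Y
    psi_m = s[1][1] - a * a * phi      # disturbance variance of M
    psi_y = s[2][2] - b * b * s[1][1]  # disturbance variance of Y
    return phi, a, b, psi_m, psi_y


def implied_covariance(phi, a, b, psi_m, psi_y):
    """Covariance matrix of (X, M, Y) implied by the chain model."""
    var_m = a * a * phi + psi_m
    return [
        [phi, a * phi, a * b * phi],
        [a * phi, var_m, b * var_m],
        [a * b * phi, b * var_m, b * b * var_m + psi_y],
    ]


def xy_residual(direct_effect, n=550, seed=11):
    """Observed minus implied cov(X, Y) after fitting the chain model."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = [xi + rng.gauss(0.0, 1.0) for xi in x]
    y = [0.5 * mi + direct_effect * xi + rng.gauss(0.0, 1.0)
         for mi, xi in zip(m, x)]
    s = covariance_matrix([x, m, y])
    sigma = implied_covariance(*fit_chain_model(s))
    return s[0][2] - sigma[0][2]


res_ok = xy_residual(direct_effect=0.0)   # chain model is correct
res_bad = xy_residual(direct_effect=1.0)  # omitted direct X -> Y path
print(f"residual when model holds:     {res_ok:.3f}")
print(f"residual when path is omitted: {res_bad:.3f}")
```

In this sketch the implied matrix matches every observed element except the covariance between X and Y, so that single residual carries the information about fit. It stays near zero when the chain model generated the data and grows when a direct path was omitted.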

*Maximum likelihood (ML)* is a commonly used estimator in the SEM framework for interval data. ML is based on the assumption of multivariate normality of the endogenous variables [9]. Multivariate normality has the properties of (1) univariate normality, (2) normality of all linear combinations between variables and (3) linearity of all bivariate associations, with a homoscedastic distribution of residuals [8]. Statistical tests are available to help detect violations of univariate and multivariate normality (Chapter 3).

Continuous data in health and medical research are often not normally distributed (Chapter 3). *Robust ML* (or *ML with robust standard errors*) is a commonly used estimator in the SEM framework for interval data that violate normality assumptions [9]. A modification of ML, *full information ML* (FIML), is a useful technique for handling missing data, under certain assumptions about the data missingness mechanism, allowing analysis of all available cases in a study sample. This is only a very brief sketch of model estimation and evaluation for SEM; we dedicate all of Chapter 4 to providing details about these procedures.

## Hypothesis Testing

Assuming the model is identifiable, we have discussed how SEM allows for the estimation of the parameters of a multi-equation model in a single analysis, which simplifies hypothesis testing in a complex causal model. One can, for example, go on to test the strength of a causal hypothesis between two variables using a corresponding point estimate and confidence interval estimate for that regression path coefficient. A plausible model with a good model fit provides evidence for internal validity and, as a result, has a specific set of parameters that defines the problem at hand well. However, in conducting confirmatory data analysis, the researcher might still perform hypothesis testing even with a poorly fitting model [17]. The researcher would just report all the results and note that this model should be evaluated further in future studies. An alternative, which we discuss briefly below, is to modify the model if the fit is poor; this would then be considered more along the lines of exploratory data analysis as opposed to confirmatory data analysis. Formal hypothesis testing for a structural model in the context of mediation analysis is discussed in Chapter 9.

## Exploratory Data Analysis, Model Re-specification and Comparison

There are many additional applications of SEM that do not involve a hypothesis of interest. One may use EFA in exploratory data analysis to help determine the factor structure of a set of questions, given no strong a priori hypothesis about the factor structure. One may also use SEM to revise an initial model for further use in reproducible research. That is, prior to any formal hypothesis testing, one may revise the model (and consider alternative models) if the model fit is not good and/or if the model is not theoretically defensible.

The analysis can be done iteratively with a single modification evaluated at each model comparison. The initial model is compared to a revised model. If the revised model has an improved model fit and is theoretically defensible, then that model may become the new standard model for the study. The new standard model may be compared to a model with another modification in a similar process until a "best" model is identified. We discuss in Chapter 6 how such an iterative procedure should be conducted prudently, with the researcher considering only a few theoretically meaningful potential model revisions.

Model re-specification may be viewed in many applications as mostly (or entirely) exploratory data analysis. Some researchers would discourage others from using the same data for revising a model and conducting confirmatory data analysis. This may lead to overfitting the model used for confirmatory data analysis given it was revised for an improved fit based on the same data. Another view is to consider results as exploratory/preliminary when testing hypotheses about relationships between variables in the re-specified model using the same data (or holdout data). There are practical reasons a health and medical researcher may not have access to different data (other than a holdout sample). For example, the systematic collection of certain validated measures on a condition under study may only have ever been done in a single sample.

Another approach for model comparison is to begin with several plausible models for the same data and compare their fit in a more confirmatory manner [17]. Using this approach, a series of models can be compared to determine the "best" model among competing models. Plausible measurement and structural models can be determined using theory, logic and prior literature for this analysis. For example, using the PHQ-9 items one may specify and compare three measurement models: a one-factor model of *depression*, a two-factor model of *cognitive/affective* and *somatic* and a two-factor model of *affective* and *somatic (and cognitive)*. Here a researcher is evaluating the dimensionality and measurement structure of a set of item responses. Hypothesis testing can be conducted using this approach (e.g. testing unidimensionality of a scale vs. plausible alternatives). We discuss model modification and comparison in Chapter 6 and measurement models and dimensionality in Chapter 7.