# I: Introduction to Concepts and Principles of Structural Equation Modeling for Health and Medical Research

- Introduction and Brief History of Structural Equation Modeling for Health and Medical Research
- An Overview of the Material in This Textbook
- Introduction to Structural Equation Modeling
- Path Diagrams, Confirmatory Factor Analysis and Path Analysis
- How Do Classic Approaches to SEM Analysis Work?
- First- and Second-Generation SEM

## Introduction and Brief History of Structural Equation Modeling for Health and Medical Research

### An Overview of the Material in This Textbook

Structural equation modeling (SEM) has its roots in the social sciences [1]. In writing this textbook, we aim to make SEM accessible to a wider audience of researchers across many disciplines, addressing issues unique to health and medicine. As SEM enthusiasts situated among clinicians and multidisciplinary researchers in medical settings, we provide a broad, current, on-the-ground understanding of the issues faced by clinical and health services researchers and decision scientists.

For the past several years we have given short courses, written statistical tutorials and authored applied research articles in the field of SEM. We have noticed in feedback from students, readers and editors a serious gap in the medical and bioscientific literature that makes this type of book fundamental for the fields of SEM, medicine and public health. Readers will be introduced to a full range of SEM methods within the context of health and medicine. Prior knowledge of SEM is not required. Readers with an understanding of the fundamentals of traditional regression analysis will benefit from this book. More seasoned SEM researchers with a focus on health and medicine will also benefit from it for the particular applications to health and medical research.

We present an overview of SEM principles, common nomenclature, diagrams and many real-world examples. We also present some more advanced latent variable approaches applicable to health and medicine, such as modeling to establish measurement invariance, latent variable mixture modeling and latent growth curve modeling. These techniques have a wide range of practical applications. For example, we will cover how to use SEM with patient-reported outcomes (PROs) and clinician rating scales as found in both clinical trial data and electronic health records. Readers of this book will be able to understand the theory behind SEM, understand modeling guidelines, interpret SEM analyses and understand the advantages and limitations of SEM relative to more traditional analytic techniques.

In brief:

- The book covers basic, intermediate and advanced SEM topics.
- Applications are detailed and relevant for health and medical scientists.
- Topics and examples are relevant to both new and more seasoned SEM researchers.
- Substantive issues in health and medicine in the context of SEM are discussed.
- Chapters are referenced with both methodological and applied examples.
- Illustrative figures and diagrams are included for the examples used.

### Introduction to Structural Equation Modeling

Defining *structural equation modeling* is a considerable task in itself. There is no straightforward way to describe SEM to a novice user that encompasses all that it is or conveys fully what it does and does not do. Metaphorically speaking, we immerse the reader in the water rather than merely getting their feet wet first. We provide here a broad view of SEM, focusing on the big picture, and then become progressively more nuanced throughout the chapters of this textbook. For a method this broad, this is likely the more comprehensible approach.

Let us begin our general description of SEM with a brief digression to distinguish between "multivariable" and "multivariate" analysis. *Multivariable analysis* refers to statistical techniques used to evaluate multiple variables. *Multivariate analysis,* more specifically, refers to statistical techniques that can be used to evaluate multiple dependent variables simultaneously. "Multivariate" and "multivariable" analysis are somewhat synonymous terms (both types of analysis can include multiple dependent variables and multiple independent variables). However, the term "multivariable" is typically used in the context of a multiple regression model with a single dependent variable and multiple independent variables. Meanwhile, the term "multivariate" is commonly reserved for analysis with multiple dependent variables at the same time. With that explanation, we can now provide a broad definition of SEM.

*Definition:* SEM is a very general and flexible multivariate technique that allows relationships among variables to be examined.

Structural equation models (SEMs) are multi-equation models that commonly involve multiple dependent and independent variables. Variables may play the role of both a dependent and independent variable in different equations within a structural equation model (Chapter 2). In practical settings, SEM, among many applications, includes a diverse set of methods equipped to handle *measurement error* and evaluate the hypothesized causal relations among observed and unobserved variables [2]. SEM researchers must differentiate between observed and unobserved variables and understand the concept of measurement error.

Observed variables are variables which are measured and recorded in the data (e.g. sex, age, height and weight). Unobserved variables are variables that are not directly measured. Unobserved variables are also called *latent variables, latent traits* or *latent constructs.* Phenomena such as depression, intelligence, perceived pain and happiness may be treated as latent variables in research studies. In SEM, multiple observed variables are commonly used as surrogates of a latent variable(s). For example, responses to multiple observed items from a questionnaire about mental health (multiple observed variables) can be used to measure a hypothetical construct of mental health (unobserved variable).

Most typically and as theoretically appropriate, an SEM researcher hypothesizes that the latent variable(s) influence the manifestation of the multiple observed items rather than the other way around. These causal assumptions will be elaborated on in an illustrative example in this chapter regarding a latent construct for pain behavior. Causal assumptions are strong assumptions. Given that one makes these assumptions about the hypothesized directionality between a set of observed items and latent variable(s), SEM uses the latent variable(s) to account for measurement error.

Measurement error is the difference between the underlying true values (e.g. mental health) and the observed values (e.g. a score based on responses to multiple items from a questionnaire about mental health). Observed measurements, being influenced by latent variables, have measurement error. Latent variables themselves do not have measurement error associated with them. Measurement error is accounted for in the model of relationships between a latent variable and its indicator variables. Refer to Chapter 2 for more details about different forms of measurement error.
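The idea that each observed item equals an underlying true value plus error can be illustrated with a small simulation. The sketch below is not an SEM analysis; it only assumes, for illustration, a standard normal true score, four hypothetical items and independent normal errors with a standard deviation of 1. Under those assumptions a single item correlates about 0.71 with the true score in the population, while the mean of four items correlates about 0.89, which is one way to see why multiple indicators are useful.

```python
import random


def pearson(x, y):
    """Pearson correlation computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_items(n, n_items, error_sd, seed=0):
    """Return (true_scores, items); each item is the true score plus random error."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
    items = [[t + rng.gauss(0.0, error_sd) for t in true_scores]
             for _ in range(n_items)]
    return true_scores, items


# Hypothetical settings: 5000 respondents, 4 items, error SD of 1.
true_scores, items = simulate_items(n=5000, n_items=4, error_sd=1.0)
single_item = items[0]
item_mean = [sum(vals) / len(vals) for vals in zip(*items)]

r_single = pearson(true_scores, single_item)
r_mean = pearson(true_scores, item_mean)
print(f"single item vs. true score:     r = {r_single:.2f}")
print(f"mean of 4 items vs. true score: r = {r_mean:.2f}")
```

Note that the item mean still contains some measurement error; a latent variable model goes further by separating the shared variance from the item-specific error.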

We have now provided a very basic description of how one typically views a latent variable in SEM. As we have defined it, in summary, SEM is an approach that can model latent variables and then conduct multivariate regression of latent variables on each other (as well as observed variables).

#### Path Diagrams, Confirmatory Factor Analysis and Path Analysis

Latent variables are treated as continuous in what we shall refer to as conventional SEM (or what is sometimes called *first-generation SEM* [3,4]). Later we discuss *second-generation SEM,* which includes more advanced, related techniques and uses a combination of continuous and categorical latent variables [3,4]. In this section, we describe the fundamental methods of conventional SEM as a basis.

One way SEM deals with continuous latent variables in practice is through *confirmatory factor analysis (CFA).* *Factor analyses* are approaches to represent the relationships among multiple observed variables in terms of a smaller number of hypothesized latent variables. In the context of factor analysis, latent variables are commonly also referred to as *factors.* *Exploratory factor analysis (EFA)* is a data-driven method used to help identify the underlying latent variable or variables from a set of observed variables. CFA is used to verify hypothesized relationships between the latent variables and the set of observed variables. EFA can also be viewed as a special case of CFA and vice versa. Both EFA and CFA employ the same common mathematical model under different constraints (Chapter 8).
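To make "representing relationships among observed variables in terms of a smaller number of latent variables" concrete, the following sketch computes the covariance matrix implied by a one-factor model. The function name, the loadings and the error variances are hypothetical values chosen for illustration, and the errors are assumed to be uncorrelated. Under this model the covariance between two items is simply the product of their loadings times the factor variance (e.g. 0.8 × 0.7 = 0.56 when the factor variance is 1).

```python
def implied_covariance(loadings, factor_variance, error_variances):
    """Covariance matrix implied by a one-factor model with uncorrelated errors.

    cov(x_i, x_j) = loading_i * loading_j * factor_variance        (i != j)
    var(x_i)      = loading_i**2 * factor_variance + error_variance_i
    """
    p = len(loadings)
    sigma = [[loadings[i] * loadings[j] * factor_variance for j in range(p)]
             for i in range(p)]
    for i in range(p):
        sigma[i][i] += error_variances[i]
    return sigma


# Hypothetical loadings for three questionnaire items.
loadings = [0.8, 0.7, 0.6]
# Error variances chosen so that every item has total variance 1.
error_variances = [1 - l ** 2 for l in loadings]

sigma = implied_covariance(loadings, 1.0, error_variances)
for row in sigma:
    print([round(v, 2) for v in row])
```

Three loadings and three error variances describe all six distinct entries of the 3 × 3 matrix; with more items per factor, the model describes many covariances with comparatively few parameters.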

Exploratory data analysis, such as EFA, is data-driven, while confirmatory data analysis, such as CFA, is hypothesis-driven (it relies on a priori hypotheses). The two can be used in sequence: EFA can help determine the number of factors to retain, and CFA can then be used to evaluate prespecified relationships between factors and their indicators. These techniques (EFA and CFA) can be successively applied to the same set of indicators in different data (to prevent overfitting of the CFA model). We will introduce latent variables and factor analysis in some more detail in Chapter 2 and then dedicate full chapters to CFA and EFA (Chapters 7 and 8) in the context of health and medicine.

*Full SEM* [5] emphasizes not just the measuring of continuous latent variables but the regression of these latent variables on each other. The term "full" is used in the sense that in the hypothesized model every observed variable is an indicator of a latent variable and every variable in the regression analysis is a latent variable. Hypothesized relationships amongst latent and observed variables in a structural equation model can be illustrated with a *path diagram.* A path diagram is a visualization of the *conceptual model.* A conceptual model is a general idea of the relationships under study. Another way to think about the distinction is that path diagrams are usually a reduction or elaboration of a conceptual model to something that can be both represented and tested in the SEM framework. We will construct path diagrams for illustrative examples later in this chapter.

Path diagrams can be used to represent full SEMs. Path diagrams can also be used to represent many special cases of SEMs that are not full, including a CFA model or a hypothesized structural equation model in which only observed variables are examined. *Path analysis* is a form of regression and multivariate statistical analysis most often used to evaluate hypothesized causal relationships between observed variables. The SEM framework links together conceptual models, path diagrams, CFA and path analysis.
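A minimal path analysis with observed variables only can be sketched as follows. The model assumed here (x → m, x → y and m → y) is a hypothetical example in which m is a dependent variable in one equation and an independent variable in another. The data are simulated with made-up path values (0.5, 0.4 and 0.3), and each equation is estimated by ordinary least squares from the sample covariances, rather than by the simultaneous estimation that SEM software performs.

```python
import random


def cov(u, v):
    """Sample covariance of two equally long lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)


def path_estimates(x, m, y):
    """Least-squares path estimates for the model x -> m, x -> y, m -> y."""
    sxx, smm = cov(x, x), cov(m, m)
    sxm, sxy, smy = cov(x, m), cov(x, y), cov(m, y)
    a = sxm / sxx                              # m regressed on x
    det = sxx * smm - sxm ** 2                 # normal equations for y on (x, m)
    direct = (sxy * smm - sxm * smy) / det     # x -> y, holding m fixed
    b = (sxx * smy - sxm * sxy) / det          # m -> y, holding x fixed
    return {"a": a, "b": b, "direct": direct, "indirect": a * b}


# Simulated data with hypothetical true paths: a = 0.5, b = 0.4, direct = 0.3.
rng = random.Random(1)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.3 * xi + 0.4 * mi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

est = path_estimates(x, m, y)
for name, value in est.items():
    print(f"{name:>8}: {value:.3f}")
```

The indirect effect of x on y through m is the product of the two paths along that route (here 0.5 × 0.4 = 0.2 in the simulated population), while the direct effect is the remaining x → y path.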

#### How Do Classic Approaches to SEM Analysis Work?

A researcher may theorize a causal model making use of strong causal assumptions. In the process of *model specification,* a researcher translates the conceptual model into a formal structural equation model and indicates causal paths and directionality between variables (latent or observed) under study. The plausibility of the hypothesized model under the model specification can be tested using observed data in the SEM framework. Much emphasis is given in SEM to examining the fit of the model to the data.

There is no statistical methodology in and of itself that can uncover causal relationships among variables. The classic approaches to SEM analysis use the covariance matrix of the data to estimate the *free parameters* (e.g. causal path estimates) in a hypothesized model under the model specification. A free parameter is unknown and of interest to estimate. Thus, free parameters are also referred to as *unknown parameters.*

Essentially, the researcher can supply either the raw data or a covariance matrix to available software for conducting these classic approaches to SEM analysis. SEM then estimates the unknown parameters with the aim of most closely reproducing the covariance matrix. Additionally, a researcher can analyze means in SEM. We will provide details regarding model estimation in Chapter 4.
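The notion of choosing parameter values that reproduce the covariance matrix can be shown in a special case that has a closed-form solution. The sketch below assumes a one-factor model with three indicators, a factor variance fixed at 1 and uncorrelated errors, so that each off-diagonal covariance equals the product of two loadings. The function name and the covariance matrix are hypothetical. This is a method-of-moments illustration only; SEM software typically relies on iterative estimators such as maximum likelihood.

```python
import math


def fit_one_factor_three_indicators(S):
    """Solve for loadings and error variances that reproduce a 3x3 covariance matrix.

    Assumes a one-factor model with the factor variance fixed at 1 and
    uncorrelated errors, so that s_ij = loading_i * loading_j for i != j.
    Requires positive off-diagonal covariances.
    """
    s12, s13, s23 = S[0][1], S[0][2], S[1][2]
    loadings = [
        math.sqrt(s12 * s13 / s23),
        math.sqrt(s12 * s23 / s13),
        math.sqrt(s13 * s23 / s12),
    ]
    error_variances = [S[i][i] - loadings[i] ** 2 for i in range(3)]
    return loadings, error_variances


# Hypothetical sample covariance matrix for three questionnaire items.
S = [
    [1.00, 0.56, 0.48],
    [0.56, 1.00, 0.42],
    [0.48, 0.42, 1.00],
]

loadings, error_variances = fit_one_factor_three_indicators(S)
print("loadings:       ", [round(l, 2) for l in loadings])
print("error variances:", [round(e, 2) for e in error_variances])
```

With three indicators there are six distinct variances and covariances and six free parameters, so the covariance matrix is reproduced exactly. With more indicators the model generally cannot reproduce the matrix perfectly, and the discrepancy becomes the basis for assessing model fit.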

A full structural equation model consists of two models [6-8]. These two models are the *measurement model* (i.e. CFA) that measures latent variables using multiple observed variables and a *structural model* (i.e. path analysis incorporating latent variables) that evaluates the relationships between latent variables. Each regression equation in the structural model has an error term referred to as a *disturbance term.* Disturbance terms reflect the residual error in regressing an outcome variable on a predictor (or set of predictors).
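For readers who find equations helpful, the two parts can be written compactly. The notation below is one common matrix convention and is an assumption on our part, since notation has not yet been introduced at this point in the text; intercepts are omitted by treating all variables as deviations from their means.

$$
\begin{aligned}
\mathbf{x} &= \boldsymbol{\Lambda}_x \boldsymbol{\xi} + \boldsymbol{\delta} && \text{(measurement model for exogenous latent variables)}\\
\mathbf{y} &= \boldsymbol{\Lambda}_y \boldsymbol{\eta} + \boldsymbol{\varepsilon} && \text{(measurement model for endogenous latent variables)}\\
\boldsymbol{\eta} &= \mathbf{B}\boldsymbol{\eta} + \boldsymbol{\Gamma}\boldsymbol{\xi} + \boldsymbol{\zeta} && \text{(structural model)}
\end{aligned}
$$

Here $\mathbf{x}$ and $\mathbf{y}$ are observed indicators, $\boldsymbol{\xi}$ and $\boldsymbol{\eta}$ are latent variables, $\boldsymbol{\Lambda}_x$ and $\boldsymbol{\Lambda}_y$ hold the factor loadings, $\boldsymbol{\delta}$ and $\boldsymbol{\varepsilon}$ are measurement errors, $\mathbf{B}$ and $\boldsymbol{\Gamma}$ hold the regression paths among latent variables, and $\boldsymbol{\zeta}$ collects the disturbance terms.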

In a full structural equation model, every variable in the structural model is latent. Most typically in our own health and medical research, a structural equation model is not full but still consists of a measurement and a structural model. That is, we use a structural model in which at least one of the variables is latent and at least one of the variables is observed [9]. Many different models are also special cases of a structural equation model (e.g. linear regression, ANOVA, CFA and path analysis).

We discuss in more detail the basic vocabulary, history, concepts and usages of SEM for health and medicine in these first two chapters of this textbook to help clarify this broad description of SEM. In Part II of this book we provide more technical details that allow the reader to represent and evaluate a structural equation model. In Part III, we build off these fundamentals to apply both first- and second-generation SEM to address problems in health and medicine.

#### First- and Second-Generation SEM

Some researchers (possibly unintentionally) will use the term SEM as an umbrella term for an extensive list of techniques involving latent variables. Other researchers prefer to distinguish between SEM as it was originally developed and the larger set of latent variable models that now coexist with SEM. In this setting, first-generation SEM applies to continuous latent variables. However, different approaches described later in this textbook will use continuous latent variables (e.g. CFA), categorical latent variables (e.g. latent class analysis and latent profile analysis) or both (e.g. growth mixture modeling). Second-generation SEM uses a combination of continuous *and* categorical latent variables [3]. Second-generation SEM is an expansion of the capacity of first-generation SEM [10]. If the first- and second-generation distinction and the particular methods mentioned are not clear as of yet, these nuances and approaches will become more evident throughout this textbook.

There are many other extensions of SEM that were developed during the first- and second-generations of SEM. SEM was originally derived to only consider continuous observed variables. We will present many applications of SEM with ordered-categorical observed variables and/or missing data. Applications of SEM can also involve other types of data such as count and survival time data. SEM has been extended for analysis with these other types of variables and data. Applications of SEM can involve either cross-sectional or longitudinal data. For example, latent growth curve modeling (Chapter 13) is an application of CFA for longitudinal data. Multilevel modeling (Chapter 13) and multigroup analysis (Chapters 9 and 10) have also been integrated for use with SEM. Further, Bayesian approaches have been adapted for SEM in parallel with the development of the first- and second-generation of SEM [4].

Beyond just using the methods in a straightforward manner, one has the opportunity to extend these methods and flexibly and creatively apply them to address specific research questions. For example, in applying latent growth curve modeling, one can analyze the antecedents and consequences of change, higher-order models and multiple growth trajectories related to each other. *We will use the term SEM hereafter in this textbook, unless denoting otherwise, as an umbrella term in reference to these extensions and first- and second-generation SEM.*

A very broad framework is needed to represent, and flexibly and creatively apply, the different latent variable methods commonly used in studies in health and medicine. These approaches (e.g. conventional linear SEM, conventional SEM with nonmetric data, factor analysis, latent class analysis, growth mixture modeling) can be described as belonging to a family of statistical techniques referred to as *general latent variable modeling (GLVM).* When one defines SEM as broadly as we have in this textbook, the terms GLVM and SEM can be viewed as synonymous. GLVM is a unified latent variable framework that uses a combination of continuous and categorical latent variables for statistical analysis [11]. The vision of GLVM is to provide a unified and organized set of tools for conducting many different types of statistical analysis. The interested reader should refer to Muthén [11] for an introduction to GLVM.

SEM techniques have many advantages in practice including:

- acknowledging and modeling forms of measurement error;
- modeling of hypothesized causal relationships for analyses of direct and indirect effects;
- modeling latent variable relationships;
- evaluating multiple equation models in a single analysis;
- flexible techniques for model selection and comparison;
- evaluating measurement equivalence in scales based on group differences by variables such as sex, race/ethnicity, age, area-level deprivation, language and cross-cultural factors;
- flexible techniques for understanding longitudinal relationships and subpopulation structures.

These advantages are well-suited to address research questions and theory development in the medical and health sciences.