# Deconfounder in Estimation of Treatment Effects

## Introduction

Dealing with confounders, which by definition affect both the treatment and the outcome, is a central problem in the estimation of treatment effects (T. VanderWeele 2015, Zhang et al. 2019, Lee, Mastronarde, and van der Schaar 2018). Consider an example in which smoking is the treatment (T), mortality is the outcome (Y), and gender is the confounder (X) (Figure 7.8(a)). Gender may affect both smoking and mortality: males are more likely to smoke than females and are also more likely to die young because of other risk behaviors. A difference in mortality between smokers and non-smokers may therefore be due to smoking, to gender, or to both. In this case, gender is a confounder. This example demonstrates that an estimate of the treatment effect that ignores confounders will be biased.
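This bias can be made concrete with a small simulation. All numbers below (smoking rates, mortality rates) are hypothetical and chosen only for illustration: in this toy world smoking has no effect on mortality at all, yet the naive comparison of smokers and non-smokers shows a sizable difference, which disappears once we stratify on the confounder.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical rates, chosen only for illustration.
male = rng.random(n) < 0.5                      # confounder X (gender)
p_smoke = np.where(male, 0.4, 0.1)              # males smoke more often
smoke = rng.random(n) < p_smoke                 # treatment T
# Mortality depends on gender (other risk behaviors) but, in this toy
# world, NOT on smoking, so the true treatment effect is zero.
p_die = np.where(male, 0.30, 0.10)
die = rng.random(n) < p_die                     # outcome Y

# Naive comparison: biased because it mixes in the gender effect.
naive = die[smoke].mean() - die[~smoke].mean()

# Adjusting: stratify on the confounder, then average over its strata.
adjusted = np.mean([
    die[smoke & (male == g)].mean() - die[~smoke & (male == g)].mean()
    for g in (True, False)
])
print(round(naive, 3), round(adjusted, 3))      # naive ≈ 0.08, adjusted ≈ 0.0
```

Stratification works here only because the confounder is observed; the rest of this section deals with the harder case where it is not.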

If the confounders can be measured, we can adjust for them directly. However, many confounders cannot be measured or are simply unknown. For example, socioeconomic status, which affects both the treatment the patients receive and the health conditions of the patients, is an unmeasured confounder. Unadjusted confounders seriously distort the estimation of treatment effects, so estimating treatment effects with little or no bias in the presence of unobserved confounders poses a great challenge. Methods for deconfounding include generative adversarial models (Lee, Mastronarde, and van der Schaar 2018), data reduction methods (Zhang et al. 2019), confounder selection (T. VanderWeele 2015a), reinforcement learning (C. Lu, Scholkopf, and Hernandez-Lobato 2018), and instrumental variable methods (Usaid Awan et al. 2019). In this section, we focus on generative adversarial learning (Lee, Mastronarde, and van der Schaar 2018).

## Causal Models with Latent Confounders

We introduce causal models with unobserved confounders (Lee, Mastronarde, and van der Schaar 2018). We assume that a latent confounder Z affects both the binary treatment variable T and the outcome variable Y. Suppose that there exists a proxy variable X that approximates the latent variable Z (Figure 7.8(b)). The data we collect are denoted by D = {(X_i, T_i, Y_i), i = 1, ..., n}. The outcome Y is a function of the treatment T, the feature vector (covariates) X, and the confounder Z, denoted by Y = f(T, X, Z).

FIGURE 7.8

(a) Causal diagram without confounder. (b) Causal diagram with confounder.

The joint distribution of the variables Y, Z, X, T is denoted by P(Y, Z, X, T). Since the confounder Z is not observed, to determine Z we use two approximations to factorize the joint distribution P(Y, Z, X, T) as follows:

q_E(Y, Z, X, T) = q_E(Z | Y, X, T) P_data(Y, X, T), (7.39)

q_P(Y, Z, X, T) = q_P(Y | Z, X, T) q_I(Z | X, T) P_data(X, T). (7.40)

Equation (7.39) shows that the joint distribution P(Y, Z, X, T) can be generated by an encoder-decoder, where the distribution q_E(Z | Y, X, T) is the distribution of the whole latent space mapped from the full original dataset (Y, X, T). The generated latent space therefore includes both the confounder and non-confounder parts. Equation (7.40) implies that the joint distribution P(Y, Z, X, T) can be approximately generated by the prediction principle, where the generated latent space includes only the confounders. To determine the confounders, we can make the two distributions q_E(Y, Z, X, T) and q_P(Y, Z, X, T) equal. Adversarial learning networks can be used to implement this (Lee, Mastronarde, and van der Schaar 2018, Dumoulin et al. 2019, Chen et al. 2019).

The ITE for an individual with covariates X is defined as

ITE(X) = E[Y | X, T = 1] - E[Y | X, T = 0]. (7.41)

Since the confounder Z affects both Y and T, using the law of total probability, we can express the conditional probability P(Y | X, T) as

P(Y | X, T) = ∫ P(Y | X, T, Z) P(Z | X, T) dZ. (7.42)

Since we assume that Z -> T, we then obtain

P(Z | X, T) ≈ q_I(Z | X, T). (7.43)

Substituting Equation (7.43) into Equation (7.42), we obtain

P(Y | X, T) ≈ ∫ q_P(Y | Z, X, T) q_I(Z | X, T) dZ. (7.44)

Using Equation (7.44), we can calculate ITE(X).
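Equation (7.44) reduces ITE estimation to a Monte Carlo average: sample Z from q_I(Z | X, T), push each sample through the prediction decoder, and average, once for each treatment arm. The sketch below uses toy closed-form stand-ins for q_I and the decoder G_P (both hypothetical, not the learned networks of this section) purely to make the integration step concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned components: q_I(Z | X, T) is a
# Gaussian whose mean depends on (X, T), and g_p returns E[Y | Z, X, T].
def sample_z(x, t, size):
    # Draw Z ~ q_I(Z | X, T)  (toy Gaussian, assumption for illustration)
    return rng.normal(loc=0.5 * x + 0.2 * t, scale=1.0, size=size)

def g_p(z, x, t):
    # Toy prediction decoder: true effect of t on y is 2.0 by construction
    return 2.0 * t + 0.3 * x + 0.5 * z

def ite(x, n_samples=50_000):
    # Equation (7.44): integrate the decoder over Z drawn from q_I,
    # separately under T = 1 and T = 0, then take the difference.
    y1 = g_p(sample_z(x, 1, n_samples), x, 1).mean()
    y0 = g_p(sample_z(x, 0, n_samples), x, 0).mean()
    return y1 - y0

print(round(ite(1.0), 2))
```

With these toy functions the Monte Carlo estimate recovers the built-in effect of the treatment plus the small shift that T induces through Z.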

## Adversarial Learning of Confounders

The basic idea of adversarial learning of confounders is to match the joint distribution q_E(Y, Z, X, T) with the joint distribution q_P(Y, Z, X, T) via a bidirectional model and adversarial learning (Lee, Mastronarde, and van der Schaar 2018, Chen et al. 2019, Dumoulin et al. 2019). The two joint distributions q_E(Y, Z, X, T) and q_P(Y, Z, X, T) will be produced by two generative models that are similar to autoencoders. The generative model that produces the joint distribution q_E(Y, Z, X, T) is called a reconstruction network, and the generative model that produces the joint distribution q_P(Y, Z, X, T) is called a prediction network. Adversarial learning inference is used to match the pair of joint distributions (Figure 7.9).

The reconstruction network produces the joint distribution q_E(Y, Z, X, T), which is factorized into two components: P_data(Y, X, T) and q_E(Z | Y, X, T). The data (Y, X, T) sampled from the distribution P_data(Y, X, T) are input to the encoder to generate the latent variables Z that follow the distribution q_E(Z | Y, X, T). The latent variables are denoted by Ẑ = G_E(Y, X, T, ε_E; θ_E), where ε_E is a noise input and θ_E are the encoder parameters. To ensure that the latent variables keep the original data information, we introduce the decoder G_R(Ẑ) to reconstruct the input of the encoder G_E.
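A minimal sketch of the reconstruction network, assuming small one-hidden-layer networks for the encoder G_E and decoder G_R (the layer sizes, dimensions, and random minibatch are arbitrary choices for illustration, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(params, inp):
    """One-hidden-layer network; params holds the weight matrices."""
    h = np.tanh(inp @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def init(n_in, n_hidden, n_out):
    return {"W1": rng.normal(0, 0.1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, 0.1, (n_hidden, n_out)), "b2": np.zeros(n_out)}

d_x, d_z, n = 5, 2, 8
enc = init(d_x + 1 + 1 + d_z, 16, d_z)   # encoder G_E(Y, X, T, eps_E)
dec = init(d_z, 16, d_x + 1 + 1)         # decoder G_R(Z) reconstructs (Y, X, T)

# A minibatch sampled from P_data(Y, X, T)
x = rng.normal(size=(n, d_x))
t = rng.integers(0, 2, size=(n, 1)).astype(float)
y = rng.normal(size=(n, 1))
eps = rng.normal(size=(n, d_z))          # noise input eps_E

z_hat = mlp(enc, np.concatenate([y, x, t, eps], axis=1))  # Z^ = G_E(Y, X, T, eps_E)
recon = mlp(dec, z_hat)                                   # G_R(Z^) ≈ (Y, X, T)
recon_loss = np.mean((np.concatenate([y, x, t], axis=1) - recon) ** 2)
print(z_hat.shape, recon.shape)
```

Only the forward pass is shown; training the parameters is deferred to the loss functions defined below.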

A prediction network is developed to implement the factorization of the joint distribution q_P(Y, Z, X, T) in Equation (7.40). To map the data (X, T) to the latent space, another encoder (inference network) is introduced. The output Z̃ of the inference network is a nonlinear function of X, T, and a noise input ε_I with the transformed distribution q_I(Z | X, T), denoted by Z̃ = G_I(X, T, ε_I; θ_I). The conditional distribution q_P(Y | Z, X, T) is generated by the prediction decoder G_P. The prediction decoder takes as input the data (X, T) sampled from the real data distribution P_data(X, T) and the latent variables Z̃ ~ q_I(Z | X, T), and predicts the outcome variable Ỹ ~ q_P(Y | Z, X, T), denoted by Ỹ = G_P(Z̃, X, T; θ_P). Therefore, the joint distribution q_P(Y, Z, X, T) in the prediction network is equal to q_P(Y | Z, X, T) q_I(Z | X, T) P_data(X, T), as in Equation (7.40).

FIGURE 7.9

The architecture of the adversarial learning inference network for confounder and ITE estimation.
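The prediction network admits a parallel sketch: an inference network G_I maps (X, T, ε_I) to Z̃, and a prediction decoder G_P maps (Z̃, X, T) to Ỹ. As before, the layer sizes and dimensions are arbitrary toy choices, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(params, inp):
    h = np.tanh(inp @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def init(n_in, n_hidden, n_out):
    return {"W1": rng.normal(0, 0.1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, 0.1, (n_hidden, n_out)), "b2": np.zeros(n_out)}

d_x, d_z, n = 5, 2, 8
inf_net = init(d_x + 1 + d_z, 16, d_z)   # inference network G_I(X, T, eps_I)
pred_dec = init(d_z + d_x + 1, 16, 1)    # prediction decoder G_P(Z~, X, T)

# (X, T) sampled from the real data distribution P_data(X, T)
x = rng.normal(size=(n, d_x))
t = rng.integers(0, 2, size=(n, 1)).astype(float)
eps = rng.normal(size=(n, d_z))          # noise input eps_I

z_tilde = mlp(inf_net, np.concatenate([x, t, eps], axis=1))       # Z~ ~ q_I(Z | X, T)
y_tilde = mlp(pred_dec, np.concatenate([z_tilde, x, t], axis=1))  # Y~ = G_P(Z~, X, T)
print(z_tilde.shape, y_tilde.shape)
```

Note that, unlike the reconstruction network, the outcome Y is never fed to the inference network; it is generated, which is what forces the latent space of this branch to carry only confounder information.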

## Loss Function and Optimization for Estimating ITEs in the Presence of Confounders

To match the joint distributions, similar to GANs, we introduce a discriminator and a loss function, which play an adversarial game between the two distributions. When the game reaches equilibrium, the two distributions q_E(Y, Z, X, T) and q_P(Y, Z, X, T) are matched. Let D(Z, X, T, Y) be the probability that the tuple (Z, X, T, Y) is drawn from the joint distribution q_E(Y, Z, X, T), and 1 - D(Z, X, T, Y) be the probability that the tuple (Z, X, T, Y) is drawn from the joint distribution q_P(Y, Z, X, T). The discriminator attempts to distinguish the two tuples (Ẑ, X, T, Y) and (Z̃, X, T, Ỹ), which are drawn from q_E(Y, Z, X, T) and q_P(Y, Z, X, T), respectively. The loss function is defined as

V(D, G) = E_{(Y, X, T) ~ P_data}[log D(Ẑ, X, T, Y)] + E_{(X, T) ~ P_data}[log(1 - D(Z̃, X, T, Ỹ))]. (7.45)

To match the two joint distributions, we solve the following minimax optimization problem:

min_{G_E, G_I, G_P} max_D V(D, G). (7.46)
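In practice, G_E, G_I, G_P, and D are all neural networks trained by alternating stochastic gradient steps. The sketch below shrinks the game to one dimension so the minimax structure stands alone: "real" samples play the role of q_E tuples, "fake" samples the role of q_P tuples, a single shift parameter theta stands in for the generator weights, and the inner maximization is solved in closed form (all toy assumptions, not the paper's training procedure).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Toy 1-D minimax game: reals ~ N(0, 1), fakes ~ N(theta, 1).
# The generator must learn theta -> 0 so the two distributions match.
theta, lr, n = 3.0, 0.05, 256

for step in range(2000):
    fake = rng.normal(0.0, 1.0, n) + theta
    # Inner max: for N(0,1) vs N(theta,1) the optimal logistic
    # discriminator D(v) = sigmoid(w*v + b) has w = -theta, b = theta**2/2.
    w, b = -theta, theta ** 2 / 2.0
    # Outer min (non-saturating form): ascend E[log D(fake)] w.r.t. theta.
    d_fake = sigmoid(w * fake + b)
    theta += lr * np.mean(1.0 - d_fake) * w

print(abs(theta) < 0.1)  # the fake distribution has matched the real one
```

At equilibrium the discriminator can no longer tell the two sides apart, which is exactly the matching condition on q_E and q_P above; with neural networks the inner maximization is itself approximated by gradient steps on D rather than solved in closed form.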

To ensure that the encoded latent variables Z contain the original data information, we define the following loss function for the reconstruction decoder: