# Learning Treatment Effect by Generative Adversarial Networks

## Introduction

In this section we propose an alternative to matching for imputing counterfactuals and estimating treatment effects: generative adversarial networks (GANs) (Goodfellow et al. 2014). Unlike classical non-neural-network methods, which estimate average treatment effects from observational data, GANs estimate individualized treatment effects (ITEs) (Yoon, Jordon, and Schaar 2018), an important step towards "precision medicine". A great challenge in estimating the ITE is removing the bias that confounders, which affect both treatment assignment and outcome, introduce into the treatment effect estimate. We present an extension of GANs to the estimation of ITEs in the presence of latent confounders. GANs and conditional GANs (CGANs) are powerful tools for imputing missing data. The next section introduces CGANs as a general framework for imputing counterfactuals and estimating ITEs.

## CGANs as a General Framework for Estimation of Individualized Treatment Effects

A challenge in the estimation of ITEs is the unbiased estimation of counterfactuals. Counterfactuals are never observed, so estimates of them cannot be checked directly against data. The recently proposed GANs and CGANs started a revolution in deep learning and offer a powerful tool for missing data imputation: they have the potential to accurately reproduce the distribution of the missing data given partial observations and other information in the data. Therefore, we can use GANs and CGANs to produce potential counterfactuals.

GANs and CGANs consist of two parts: the "generative" part, called the generator, and the "adversarial" part, called the discriminator. Both the generator and the discriminator are implemented by neural networks. Typically, a K-dimensional noise vector, which may be uniformly or normally distributed, is input to the generator network, which transforms it into a new fake data instance through a nonlinear transformation. The generated fake data instance is then input to the discriminator network, which evaluates it for authenticity. The generator constantly learns to produce better fake data instances, while the discriminator constantly receives both real and fake data and improves the accuracy of its evaluation.
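The interplay between the two networks can be sketched in a few lines of plain Python. This is only a toy illustration: the one-layer `generator` and `discriminator` functions, their parameter values, and the "real" observation are all made up for the example, and the value function is the standard GAN objective of Goodfellow et al. (2014), not a formula taken from this chapter.

```python
import math
import random

def generator(z, weights, bias):
    """Toy generator: affine map of the noise vector followed by tanh."""
    return math.tanh(sum(w * z_k for w, z_k in zip(weights, z)) + bias)

def discriminator(x, weight, bias):
    """Toy discriminator: logistic score, read as P(x is a real instance)."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

def value_function(p_real, p_fake):
    """Standard GAN value: log D(real) + log(1 - D(fake))."""
    return math.log(p_real) + math.log(1.0 - p_fake)

rng = random.Random(0)
K = 4
z = [rng.uniform(-1.0, 1.0) for _ in range(K)]        # uniformly distributed noise
g_weights = [rng.gauss(0.0, 1.0) for _ in range(K)]   # untrained, illustrative parameters
fake = generator(z, g_weights, bias=0.0)
real = 0.7                                            # made-up "real" observation

# The discriminator is trained to increase v; the generator is trained to decrease it.
v = value_function(discriminator(real, 1.5, -0.2), discriminator(fake, 1.5, -0.2))
```

In an actual implementation both functions would be multi-layer networks and the parameters would be updated alternately by gradient steps on `v`.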

**The Architecture of CGANs for Generating Potential Outcomes**

Observed factual outcomes and covariates provide the basic information for estimating counterfactuals. Therefore, we use CGANs as a general framework for ITE estimation. The CGANs for ITE estimation consist of two blocks. The first block imputes the counterfactuals and is referred to as the imputation block. The second block, the ITE block, estimates the distribution of the treatment effects using the complete dataset produced by the imputation block. The architecture of the CGANs is shown in Figure 7.5. ^{[1]}

FIGURE 7.5

CGANs for estimating ITEs.

given $X = x$, $T = t$, $W = w$, $Y^f = y^f$. The goal of the generator is to learn the neural network $G$ such that $G(x, y^f, t, w, z_G, \theta_G) \sim P_{Y|X,T,W,Y^f}(y)$. Unlike the discriminator in the standard CGANs, which evaluates the input data for authenticity (real or fake), the counterfactual discriminator $D_C$, which maps pairs $(x, \bar{y})$ to vectors in $[0, 1]^K$, attempts to distinguish the factual component from the counterfactuals. The output of the counterfactual discriminator $D_C$ is a vector of probabilities that each component represents the factual outcome. Let $D_C(x, \bar{y}, t, w, \theta_d)_k$ represent the probability that the $k^{th}$ component of $\bar{y}$ is the factual outcome, i.e., $k = \eta$, where $\theta_d$ denotes the parameters in the discriminator. The goal of the counterfactual discriminator is to maximize the probability $D_C(x, \bar{y}, t, w, \theta_d)_k$ of correctly identifying the factual component $\eta$ by changing the parameters of the discriminator neural network $D_C$.

**Loss Function**

The imputation block imputes counterfactuals by extending the loss function for binary treatments in (Yoon et al. 2018) to all types of treatments: binary, categorical, or continuous. We define the loss function $V(D_C, G, \theta_d)$ as

where the log is an element-wise operation.
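To make the element-wise log concrete, here is a minimal sketch of a per-sample value for the counterfactual discriminator. It assumes the loss keeps the cross-entropy form used for GANITE in Yoon et al. (2018), with the treatment assignment indicator vector `w` as the target; the function name, the `eps` guard, and the example numbers are illustrative and not taken from this chapter.

```python
import math

def counterfactual_discriminator_value(w, d, eps=1e-12):
    """Assumed per-sample value: sum_k [w_k * log d_k + (1 - w_k) * log(1 - d_k)].

    w : one-hot treatment assignment indicator vector
    d : discriminator output, d[k] = estimated P(component k is the factual outcome)
    """
    total = 0.0
    for w_k, d_k in zip(w, d):
        # log is applied to each component separately (element-wise)
        total += w_k * math.log(d_k + eps) + (1 - w_k) * math.log(1.0 - d_k + eps)
    return total

w = [0, 1, 0]                      # the second treatment was assigned
v_good = counterfactual_discriminator_value(w, [0.1, 0.8, 0.1])
v_bad = counterfactual_discriminator_value(w, [0.4, 0.2, 0.4])
```

A discriminator that puts most of its probability on the factual component (`v_good`) scores higher than one that does not (`v_bad`), which is the quantity the discriminator tries to maximize and the generator tries to minimize.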

The goal of the imputation block is to maximize $V$ with respect to the counterfactual discriminator $D_C$ and then minimize it with respect to the counterfactual generator $G$:
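Assuming the problem referred to later in this section as (7.4) has the usual adversarial form, this two-step goal can be written as the following min-max problem. The notation is a reconstruction from the surrounding text, not a quotation of the original equation.

```latex
\min_{G} \; \max_{D_C} \; V(D_C, G)
```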

In other words, we train the counterfactual discriminator $D_C$ to maximize the probability of correctly identifying the assigned treatment $T_\eta$ or $Y^f$ ($Y_\eta$), and then train the counterfactual generator $G$ to minimize the probability of correctly identifying $T_\eta$. After the imputation block is performed, the counterfactual generator $G$ produces the complete dataset $D = (X, \bar{Y})$. Next, we use the imputed complete dataset $D = (X, \bar{Y})$ to generate a distribution of potential outcomes and to estimate the individualized treatment effects via standard CGANs, which is called the ITE block.
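The construction of the complete outcome vector can be illustrated with a short sketch: the observed factual outcome is kept at the position of the assigned treatment, and the generator's output fills the remaining positions. The function name and the numbers are hypothetical, and a one-hot assignment indicator is assumed.

```python
def complete_outcomes(w, y_factual, y_generated):
    """Keep the factual outcome where w_k = 1 and the generated value elsewhere."""
    return [w_k * y_factual + (1 - w_k) * y_k for w_k, y_k in zip(w, y_generated)]

w = [0, 1, 0]                       # the second treatment was assigned
y_bar = complete_outcomes(w, y_factual=2.5, y_generated=[1.0, 2.2, 3.0])
# y_bar == [1.0, 2.5, 3.0]: the generated 2.2 is replaced by the observed 2.5
```

Applying this to every subject yields the imputed outcomes that the ITE block is trained on.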

**CGANs for Estimating ITEs**

**• ITE Block**

The standard CGANs consist of three parts: a generator, a discriminator, and a loss function, which are summarized as follows (Yoon, Jordon, and Schaar 2018; Ge et al. 2019).

**• ITE Generator**

The ITE generator $G_I$ is a nonlinear transform function of $X$ and $Z_I$:

$$\hat{Y} = G_I(X, Z_I, \theta_I),$$

where $\hat{Y}$ is the generated $K$-dimensional vector of potential outcomes, $X$ is a covariate vector, and $Z_I$ is a $K$-dimensional vector of random variables that follows the uniform distribution $Z_I \sim U((-1, 1)^K)$. The ITE generator attempts to find the transformation $\hat{Y} = G_I(X, Z_I, \theta_I)$ such that $\hat{Y} \sim P_{Y|X}(y)$.

**• ITE Discriminator**

Following the standard CGANs, we define a discriminator $D_I$ as a nonlinear classifier that takes a pair $(X, Y^* = \bar{Y})$ or $(X, Y^* = \hat{Y})$ as input and outputs a scalar: the probability that $Y^*$ is from the complete dataset $D$.

**• Loss Function**

A loss function for the ITE block is defined as

where $D_I(X, Y^*)$ is a nonlinear classifier that determines whether $Y^*$ is from the complete dataset $D$ or from the generator $G_I$.

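The ITE discriminator is trained to maximize the probability of correctly identifying whether $Y^*$ is from the complete dataset $D$, while the ITE generator is trained to minimize the probability of correct classification. Mathematically, the ITE block attempts to solve the corresponding min-max problem over $D_I$ and $G_I$.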
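For orientation, a standard CGAN objective written in this section's symbols would look as follows. This is a reconstruction under the assumption that the ITE block uses the usual conditional GAN loss (as in Yoon et al. 2018); the exact expression in the original may differ.

```latex
V_I(D_I, G_I) = \mathbb{E}_{X}\Big[
    \mathbb{E}_{\bar{Y} \mid X}\big[\log D_I(X, \bar{Y})\big]
  + \mathbb{E}_{Z_I}\big[\log\big(1 - D_I(X, G_I(X, Z_I, \theta_I))\big)\big]
\Big],
\qquad
\min_{G_I} \; \max_{D_I} \; V_I(D_I, G_I)
```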

To solve the optimization problems (7.4) and (7.7) numerically, we approximate the expectations by sample averages. We first discuss the implementation of the imputation block.

**• Imputation Block Optimization**

Assume that $n$ individuals are sampled. The sampling approximation of $V(D_C, G)$ is given by

where $\tilde{Y} = G(X, Y^f, T, W, (\mathbf{1} - W) \odot Z_G, \theta_G)$.

To enforce that the estimated factual outcome $\tilde{Y}_\eta$ is as close as possible to the observed factual outcome $Y^f$ (Yoon, Jordon, and Schaar 2018), we impose the following restriction:

The optimization problem (7.4) can be implemented by

Optimization problems (7.10) and (7.11) can be solved by stochastic gradient descent with backpropagation. The details of the algorithms are given in supplementary note A.
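The factual-fit restriction and its role in the generator's objective can be sketched as follows. The sketch assumes continuous outcomes, a squared-error penalty, and a weight `alpha` that trades the adversarial term against the penalty, as in GANITE (Yoon et al. 2018); the function names, `alpha`, and the toy batch are hypothetical and not taken from this chapter.

```python
def supervised_loss(w_batch, y_factual_batch, y_generated_batch):
    """Mean squared gap between the observed factual outcome and the
    generated value at the factual position, averaged over the batch."""
    total = 0.0
    for w, y_f, y_gen in zip(w_batch, y_factual_batch, y_generated_batch):
        eta = w.index(1)                 # position of the assigned treatment
        total += (y_f - y_gen[eta]) ** 2
    return total / len(w_batch)

def generator_objective(v_hat, l_s, alpha=1.0):
    """Quantity the generator minimizes: sample value function plus weighted penalty.
    (The discriminator, in turn, would minimize -v_hat.)"""
    return v_hat + alpha * l_s

w_batch = [[1, 0], [0, 1]]
y_factual_batch = [1.0, 3.0]
y_generated_batch = [[1.5, 2.0], [0.0, 2.0]]
l_s = supervised_loss(w_batch, y_factual_batch, y_generated_batch)
g_obj = generator_objective(v_hat=-1.0, l_s=l_s, alpha=2.0)
```

In training, the two objectives would be minimized alternately on mini-batches, each by a gradient step on its own network's parameters.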

**• ITE Block Optimization**

The ITE block estimates the counterfactuals using the observed outcomes and the imputed counterfactuals. Its performance metrics are defined as

The sampling formula for $V_I(D_I, G_I)$ is

The optimization problem (7.7) for ITE can be reformulated as

Again, stochastic gradient descent methods can be used to solve optimization problems (7.12) and (7.13). Algorithms for their numerical implementation are similar to the algorithms for the imputation block.
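Once an ITE generator has been trained, one way to turn it into an effect estimate is to draw the noise vector repeatedly and average the difference between two generated potential outcomes. The sketch below uses a hypothetical closed-form `toy_ite_generator` as a stand-in for a trained $G_I$ with $K = 2$; the coefficients, the number of draws, and the choice of contrast are illustrative assumptions.

```python
import random

def toy_ite_generator(x, z):
    """Hypothetical stand-in for a trained G_I: returns K = 2 potential outcomes."""
    y0 = 1.0 + 0.5 * x + 0.1 * z[0]
    y1 = 2.0 + 0.8 * x + 0.1 * z[1]
    return [y0, y1]

def estimate_ite(x, generator, k=2, n_draws=1000, seed=0):
    """Monte Carlo average of Y^1 - Y^0 over noise draws Z_I ~ U((-1, 1)^K)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        z = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        y = generator(x, z)
        total += y[1] - y[0]
    return total / n_draws

ite = estimate_ite(x=1.0, generator=toy_ite_generator)
```

For this toy generator the noise has mean zero, so the estimate should be close to the noise-free difference of 1.3 at `x = 1.0`.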

**CGANs for Estimating ITEs in Survival Analysis**

**• Background in Survival Analysis**

An essential issue in survival time analysis is how to deal with censoring. Let $Y_i = [Y_i^1, \ldots, Y_i^K]^T$ be the vector of potential outcomes (survival times) for the $i^{th}$ subject, where $Y_i^k$ is the potential outcome (survival time) if the $i^{th}$ subject had received the $k^{th}$ treatment. We only observe the component of the potential outcome vector $Y_i$ that corresponds to the assigned treatment, which is called the factual outcome and is denoted by $Y_i^f$. The other, unobserved potential outcomes (survival times) are called counterfactual outcomes (survival times), or simply counterfactuals, and are denoted by $Y_i^{cf}$. The factual outcome (survival time) $Y_i^f$ can be expressed as

Each subject is associated with $K$ potential outcomes (survival times) $Y_i^k$, but only one of them can be observed as $Y_i^f$. Let $X_i = [x_{i1}, \ldots, x_{ip}]^T$ be a vector of covariates associated with the $i^{th}$ subject, including miRNA, age, sex and other geographic and tumor feature variables. Assume that $n$ subjects are sampled. For the sake of simplicity, the subscript $i$ for indexing the subject is omitted when the context is clear.

Next, we consider censoring. The reasons for censoring include: 1) loss to follow-up and 2) deaths from other causes. Let $L_i$ be an indicator variable for the censored event, where $L_i = 1$ denotes that the $i^{th}$ subject is not censored and $L_i = 0$ denotes that the $i^{th}$ subject is censored. Let $C_i$ be the censoring time for the $i^{th}$ subject. The potential time to event (death or censored) $Z = [Z^1, \ldots, Z^K]^T$ is defined as

and the factual (observed) time to event is

Define the indicator variable for the $k^{th}$ potential survival time censoring, which indicates that the survival time is less than the censoring time, as

and

the indicator variable for censoring of all potential survival times as

The data for treatment effect estimation with censored survival times are $(Y, W, T, X, \delta, Z)$. Assumption 2 ($Z \perp\!\!\!\perp Y \mid (X, T, e)$) needs to be extended to censored data.
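The time-to-event and censoring-indicator definitions above can be illustrated with a small sketch. It uses the relation $Z^k = \min\{Y^k, C\}$ that this section states for the constraints on censored data; treating a tie ($Y^k = C$) as an observed event is an assumption of the example, and the function names and numbers are hypothetical.

```python
def time_to_event(y, c):
    """Potential time to event: Z^k = min(Y^k, C) for each treatment k."""
    return [min(y_k, c) for y_k in y]

def event_indicators(y, c):
    """1 if the k-th potential survival time is not censored, 0 otherwise.
    Ties (y_k == c) are counted as observed events (assumed convention)."""
    return [1 if y_k <= c else 0 for y_k in y]

y = [5.0, 12.0, 8.0]        # potential survival times under K = 3 treatments
c = 10.0                    # censoring time of the subject
w = [0, 0, 1]               # the third treatment was assigned

z = time_to_event(y, c)
delta = event_indicators(y, c)
z_factual = sum(w_k * z_k for w_k, z_k in zip(w, z))   # observed time to event
```

Here the second potential survival time exceeds the censoring time, so its time to event is capped at `c` and its indicator is 0; the factual time to event is the component selected by `w`.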

ASSUMPTION 3: The treatment assignment is independent of censored and uncensored survival time given the covariates. Formally, we assume

Similar to the assumption for the treatment assignment, we also make the following assumption about the independence of the censoring time from the survival times and covariates:

ASSUMPTION 4:

**• Loss Function of the CGANs for Estimating ITEs in Survival Analysis**

The data in survival analysis consist of two parts: 1) noncensored data $D_{nc}$ and 2) censored data $D_c$. In survival analysis, the potential outcomes (survival times) produced by the generator in Equation (7.3) should be changed to

The loss function for the uncensored data $D_{nc}$ is given by

For the censored data $D_c$, we should introduce the second loss function $V(D_c)$ to enforce the constraints $Z^k = \min\{Y^k, C\}$, $k = 1, \ldots, K$:

Using the Lagrangian multiplier, we can incorporate the constraints into the loss function (7.20) to obtain the total loss function of CGANs for estimating ITEs in the survival analysis:

where $\lambda$ is a tuning parameter that controls the trade-off between the censored and noncensored loss functions. We often set $\lambda = 1$.
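The weighting of the two loss terms can be sketched as below. The combination `v_noncensored + lam * v_censored` follows the description of the tuning parameter; the hinge-style `censored_penalty` is purely a hypothetical example of a term that penalizes generated factual survival times shorter than the censoring time, and is not the chapter's loss function.

```python
def censored_penalty(y_generated_factual, c):
    """Hypothetical penalty: for a censored subject the true survival time is
    at least C, so generated times below C are penalized quadratically."""
    return max(0.0, c - y_generated_factual) ** 2

def total_loss(v_noncensored, v_censored, lam=1.0):
    """Total loss: noncensored term plus lam times the censored-data term."""
    return v_noncensored + lam * v_censored

penalty_violated = censored_penalty(y_generated_factual=7.0, c=10.0)    # below C
penalty_satisfied = censored_penalty(y_generated_factual=12.0, c=10.0)  # above C
loss = total_loss(v_noncensored=0.5, v_censored=penalty_violated, lam=1.0)
```

Larger values of `lam` make violations of the censoring constraint more costly relative to the loss on the noncensored data.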

To obtain the ITEs in survival analysis, we solve the following optimization problems

- [1] Imputation Block. A counterfactual generator in the imputation block is a nonlinear function of the covariate vector $x$, the treatment vector $T$, the treatment assignment indicator vector $W$, the observed factual outcome $Y^f$, and a $K$-dimensional random vector $z_G$ with uniform distribution $z_G \sim U((-1, 1)^K)$, where $Y^f = Y_\eta$. The potential outcomes produced by the generator are denoted by $\tilde{Y} = G(x, Y^f, T, W, (\mathbf{1} - W) \odot z_G, \theta_G)$ (Equation 7.3), where the output $\tilde{Y}$ represents a sample of $G$. It can take binary values, categorical values, or continuous values. $\mathbf{1}$ is a vector of 1s, $\odot$ denotes element-wise multiplication, and $\theta_G$ are the parameters of the generator. The generator is implemented by a neural network. We use $\bar{Y}$ to denote the complete dataset that is obtained by replacing $\tilde{Y}_\eta$ with $Y^f$. By a transformation theorem (Ross, n.d.), the distribution of $\tilde{Y}$ depends on the determinant of the Jacobian matrix of the nonlinear transformation function $G(X, Y^f, T, W, z_G, \theta_G)$. Changing the transformation function can change the distribution of the generated counterfactuals. Let $P_{Y|X,T,W,Y^f}(y)$ be the conditional distribution of the potential outcomes,