# Bayesian Models for Comparative Analysis of Count Data

Once cell subsets have been identified using automated analysis or other approaches, the natural next step is to perform statistical analysis and inference to assess the significance of variations in the cell subsets identified. In clinical studies, researchers are interested in understanding the association between cellular heterogeneity and disease progression or vaccine efficacy. For example, current vaccine development for major infectious diseases, including HIV/AIDS, has been targeted to induce protective T cells (Gilbert 2012; McMichael and Koff 2014).

The cellular adaptive immune response depends in part on the generation of antigen-specific T cells. T cells undergo selective pressure during maturation so that those that recognize their specific antigen on antigen-presenting cells become activated, undergo clonal expansion, transition into the blood, and eventually become a long-lived memory population. These antigen-specific cells are critical for antigen recall. T cells specific for a particular antigen represent a very small fraction of an individual's T-cell repertoire. Clinicians and immunologists often rely on flow cytometry to distinguish and identify rare antigen-specific T cells within heterogeneous cell samples such as blood. In many applications (e.g., finding tumor-specific T cells), the task is to identify subjects for whom the proportion of antigen-specific cytokine-producing T cells is significantly different between two experimental conditions (e.g., antigen stimulation vs. nonstimulation).

Polyfunctional T cells (De Rosa et al. 2004)—subsets of antigen-specific T cells that simultaneously produce multiple effector cytokines and other functional markers in response to activation—are believed to be of clinical relevance, and there is evidence linking their frequency to clinical outcome (Seder et al. 2008). They have been shown to be important in protective immunity and nonprogression of diseases (Ciuffreda et al. 2008; Li et al. 2008; Rodrigue-Gervais et al. 2010; Seder et al. 2008; Vordermeier et al. 2002). The intracellular cytokine staining (ICS) assay is an important tool to measure such polyfunctional T cells, and because these polyfunctional cell subsets are typically defined using Boolean combinations, the number of cell subsets to be tested is large.

To summarize, in the comparative analysis of cell subsets identified in flow cytometry, typically based on the relative frequency of cells, there are two key analytical challenges: (i) Most of the cell subsets are found at low frequencies

(e.g., less than 0.1%); and (ii) the number of cell subsets that needs to be tested is large, typically increasing exponentially with the number of measured cell features.

Existing strategies for analyzing cytometry summary data include ad hoc rules based on fold-changes (Trigona et al. 2003), descriptive statistics (Roederer et al. 2011), Hotelling's T^{2} statistics (Nason 2006), and simple analysis using Fisher's test of 2 x 2 contingency tables (Proschan and Nason 2009). These approaches do not robustly address the challenges of low frequency cell subsets and the need to adjust for multiple comparisons.

To address the first challenge of low cell subset frequencies, a Bayesian version of Fisher's 2x2 test (MIMOSA) (Finak et al. 2014b) was proposed. MIMOSA is a Bayesian hierarchical framework based on a beta-binomial mixture model. MIMOSA allows the inference to be subject specific, while borrowing strength across subject through common prior distributions. However, MIMOSA cannot jointly model all cell subsets. Hence, multiple comparisons across cell subsets have to be taken into account and can reduce statistical power.

To account for the remaining challenge of loss of power from multiple testing, a polyfunctionality index (Larsen et al. 2012) using empirical cell subsets proportions was proposed to summarize all observed cell subsets into a single number. However, empirical summaries are known to be extremely noisy when cell counts are small. Instead, a computational framework for unbiased combinatorial polyfunctionality analysis of antigen-specific T-cell subsets (COMPASS) (Lin et al. 2015a) was recently developed. COMPASS uses a Bayesian hierarchical framework to model all observed cell subsets and select those most likely to have antigen-specific responses.

To formally study the count data, let us assume that cell counts are observed from I subjects in two conditions: stimulated (s) and unstimulated (u). Let M denote the number of markers measured; then there are K_{M} = 2^{M }possible Boolean combinations defining functional cell subsets, depending on whether the marker is expressed or not. In addition, let K(K < K_{M}) denote the actual number of cell subsets considered for statistical analysis (i.e., after filtering empty and very sparse cell subsets). Let n_{sik} and *n _{uik}, k* = 1, ...,

*K, i*= 1, ., I, denote the observed counts for the K categories in the stimulated and unstimulated samples, respectively. In addition, let n

_{si}= (n

_{si1}, ..., n

_{siK}), and nui = (nui1, ...,

*nuiK).*Then

*N*=

_{s}i*'Lknik*and

*Nui*= ^n^k are the total numbers of cells measured for subject i in the stimulated and unstimulated samples, respectively. We arrange all K cell subsets in ascending order with respect to their degree of functionality (the number of cytokines the cell subset expresses), except for the Kth subset, which we set to be the null subset with zero degree of functionality.

For a given subject i, i = 1, ..., I, the K observed cell counts n_{si} and n_{ui} are jointly modeled by multinomial distributions:

where MN denotes the multinomial distribution, p_{si} *= (p _{si1}, p_{siK})* and

p_{ui} = (p_{ui1}, ? ?•, p_{uiK}) are the two unknown proportion vectors associated with the stimulated and unstimulated samples. To identify subjects for whom the proportion of antigen-specific cytokine-producing T cells is significantly different between stimulated and unstimulated conditions is equivalent to comparing the two proportion vectors p_{si} and p_{ui} for each subject *i. *A Bayesian latent variable model is a natural modeling strategy to perform such comparison. In particular, a binary indicator vector y_{i} = (y_{i1}, ..., y_{iK}) can be introduced and defined as y_{ik} = 0 if p_{uik} = p_{sik}, and y_{ik} = 1 otherwise, for *k* = 1, ..., K. Figure 7.1 illustrates this Bayesian hierarchal model. More specifically, an individual p_{sik} disappears from the respective likelihood when the associated y_{ik} is zero, and the dimensionality of the parameter space is reduced.

A natural prior for pui is the Dirichlet distribution, by its conjugacy to the multinomial distribution:

where a_{u} = (a_{u1}, ., a_{uK}) is an unknown hyperparameter vector shared across all subjects under the unstimulated condition. This is given a vague exponential hyperprior: *a _{uk}* ~ exp(A.

_{uk}), with

*X*being predefined before fitting the model.

_{uk}Special care is needed when defining priors for y_{i} and p_{si}, *i* = 1, ..., I, as y_{i }introduces dependencies between psi and pui, and proportion vectors have to sum to 1. Therefore, both y_{i} and p_{si} are modeled sequentially. For each y_{i}, it is assumed that the first K - 1 elements of y_{i} are independently distributed Bernoulli random variables with probability parameter ra_{k}. The independence assumption used is mainly for computational convenience, but it is not unrealistic given that there is no *a priori* information about the dependence among the subsets (if any). The parameter ra_{k} represents the proportion of responding subjects for subset k; and this parameterization allows information sharing across subjects when estimating values for y_{ik}. Then the last element y_{iK},

FIGURE 7.1

Graphical representation of COMPASS model.

conditioning on the information for all the first K elements, follows a mixture distribution that satisfies the constraint that the probability vector must sum to 1:

where 5c(x) denotes a point mass distribution at c with *P* (x = c) = 1; 1[x] is the indicator function, equal to 1 if x is true and 0 otherwise. Let Be(x; *p)* denote the Bernoulli distribution with probability p that x equals 1. By choosing a common conjugate prior for ю_{к}, the prior for у by Equation 7.10 can be simplified by analytically integrating out ю.

Last, for each subject i, conditioning on *p _{ui}* and

*Y*

*i*

*, p*can be partitioned into

_{si}_{Y}0 _{Y}1

two subvectors: p_{s}i and p_{s}i, where the corresponding elements in Y_{i} are all c for p^{Y}i, c = 0, 1. Therefore, we can write

_{0} y0 y0

where according to the definition of y0 , p (p_{s}i |p_{ui}, Y_{i}) is a point mass at p_{u}j. The

conditional distribution p(p_{s}j |p_{s}i, p_{ui}, y_{i}) can be derived by applying the standard change-of-variable formula, which gives a scaled Dirichlet distribution. Because of the conjugate priors for p_{u} and p_{s}, a marginal likelihood can be obtained by analytically integrating out both pu and ps. This greatly facilitates MCMC computation and mixing.

In COMPASS, cell subset responses are quantified by the posterior probabilities of *Yik* = 1, which can be visualized using a heat map. Figure 7.2 illustrates the COMPASS analysis results on the ICS data generated through the RV144 HIV vaccine case-control study with 92TH023-Env peptide pool ex *vivo* stimulation, as described by Haynes et al. (2012) and Lin et al. (2015a). Columns in Figure 7.2 correspond to the different cell subsets modeled by COMPASS in the RV144 ICS data set, color-coded by the cytokines they express (white = off, shaded = on) and ordered by degree of functionality from one function on the left to five functions on the right. Rows correspond to subjects (226 vaccine recipients), who are ordered by their status: noninfected (top) and infected (bottom). Each cell of the heat map shows the probability that the corresponding cell subset (column) exhibits an antigen-specific response in the corresponding subject (row), where the probability is color-coded from white (zero) to black (one).

FIGURE 7.2

Heatmap of COMPASS posterior probabilities for the RV144 data set.

In addition to subject-level response probabilities, COMPASS defines two scores that summarize a subject's entire antigen-specific polyfunctional profile into a single numerical value. The functionality score is defined as the proportion of antigen-specific subsets detected among all possible ones. The polyfunctionality score is similar, but it weighs the different subsets by their degree of functionality, naturally favoring subsets with higher degrees of functionality, motivated by the observations that a higher-degree function has been correlated with better outcomes in certain vaccine studies (Attig et al. 2009; Ciuffreda et al. 2008; Rodrigue-Gervais et al. 2010). In the analysis of the RV144 vaccine efficacy trial data set, COMPASS improved characterization of antigen-specific T cells and revealed cellular "correlates of protection/ immunity" missed by other methods. COMPASS is available as an R package on BioConductor.