# GENE EXPRESSION SIGNATURES AND MOLECULAR DIAGNOSTICS

I showed in the earlier figure that it's possible to distinguish T cells from all other cells reasonably well based only on CD8 expression levels. Identifying combinations of genes whose expression levels reliably distinguish cells of different types is of substantial medical relevance. For example, given a biopsy or a patient's tumor sample, doctors would like to know (1) what types of cells are found in the sample, and more importantly, (2) do these cells represent a specific disease subtype? If so, (3) is this patient's disease likely to respond to a certain treatment or therapy? All of these are classification questions, and a large research effort has been directed toward finding specific classifiers for each of these challenges.

In fact, there are probably *many* combinations of genes whose expression levels distinguish between cell types. Once again, this represents a difficult feature selection problem: we have ~25,000 genes that can all be possible predictors, and we'd like to choose the genes or combinations of genes that are the best. The number of combinations is astronomical: even for two-gene combinations there are more than 300 million possible predictors. Previously we saw that regularization can be used to choose sparse predictors in linear regression. What about for classification? Well, since logistic regression *is* a linear model (the log-odds of class membership are a linear function of the predictors), we can go ahead and use the same regularization strategy that we used before to obtain sparse classifiers.
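The count of two-gene combinations quoted above can be checked with a line of arithmetic. This is a minimal sketch that takes the approximate figure of 25,000 genes from the text; the variable names are illustrative.

```python
import math

n_genes = 25_000  # approximate gene count used in the text

# Unordered pairs of distinct genes: n choose 2 = n * (n - 1) / 2
n_pairs = math.comb(n_genes, 2)
print(f"two-gene combinations: {n_pairs:,}")  # 312,487,500

# Allowing three genes per signature makes the search space far larger
n_triples = math.comb(n_genes, 3)
print(f"three-gene combinations: {n_triples:,}")
```

Exhaustively scoring every pair is already expensive, and every additional gene in the signature multiplies the search space by thousands, which is why a method that selects predictors automatically is attractive.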

Using logistic regression with *L1* regularization and λ = 0.25, I obtained a two-gene combination that does an almost perfect job of separating T cells from all other cells (Figure 10.2): only one false positive (a non-T cell on the T-cell side of the boundary) and one false negative (a T cell on the other side of the boundary). Note that this problem would have been impossible with standard logistic regression (without the penalty) because the number of genes (dimensions of *X*) is much greater than the number of cells (the number of observations), so the unpenalized fit has no unique solution. Impressively, one of the genes chosen by the model is the T-cell receptor, whose expression level can identify T cells reasonably well on its own, but the computer didn't know this was the T-cell receptor when it chose it.

FIGURE 10.2 Identification of a two-gene signature using L1-regularized logistic regression. The computer automatically identified two genes (out of millions of possible two-gene combinations) that almost perfectly separate T cells from other cells. One of these genes turns out to be the T-cell receptor.
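The kind of fit described above can be sketched on synthetic data. This is not the code or data behind Figure 10.2. It is a minimal illustration that assumes a simple proximal gradient solver written in plain Python, 40 hypothetical cells, and 100 hypothetical genes, of which only gene 0 carries signal. The penalty is applied to the mean log-loss, which is an assumption about scaling, so the value 0.25 here is not directly comparable to the λ in the text.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(v, t):
    # Proximal step for the L1 penalty: shrink toward zero, snap small values to 0
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logistic(X, y, lam, step=0.1, n_iter=300):
    """Minimise mean log-loss + lam * sum(|w_j|) by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals (predicted probability minus label) at the current parameters
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        b -= step * sum(resid) / n  # the intercept is not penalised
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            w[j] = soft_threshold(w[j] - step * grad_j, step * lam)
    return w, b

# Synthetic expression data: more genes than cells, one informative gene
rng = random.Random(0)
N_CELLS, N_GENES = 40, 100
y = [1] * (N_CELLS // 2) + [0] * (N_CELLS // 2)  # 1 = T cell, 0 = other cell
X = []
for label in y:
    row = [rng.gauss(0.0, 1.0) for _ in range(N_GENES)]   # uninformative genes
    row[0] = rng.gauss(3.0 if label == 1 else 0.0, 0.5)   # gene 0 marks T cells
    X.append(row)

w, b = fit_l1_logistic(X, y, lam=0.25)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("genes with nonzero weight:", selected)
```

The soft-thresholding step is what sets most weights to exactly zero, leaving a short list of genes as the signature. In practice one would more likely use an established implementation, for example scikit-learn's `LogisticRegression` with `penalty="l1"` and the `liblinear` or `saga` solver, keeping in mind that its `C` parameter is the inverse of the regularization strength.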

Identifying gene signatures that predict clinical outcomes is one of the most promising avenues for genomics research. However, there are many complex statistical issues that need to be considered, and developing methods that reliably identify these signatures is an area of current research interest (Majewski and Bernards 2011).