AIC AND FEATURE SELECTION AND OVERFITTING IN MULTIPLE REGRESSION
As the number of features (or covariates) becomes larger than a handful, one often wants to test which (if any) of the features are significantly associated with the response (Y). In practice there are several solutions to this so-called “feature selection” problem in multiple regression. The major problem in feature selection is that each additional feature will always improve the fit of the model to the data used to estimate it, because the additional parameters can always explain something (even if it is really only noise). This leads to overfitting, where a model can fit the data it was trained on very well, but have little predictive power on new data.
To illustrate the problem of overfitting, let’s consider fitting multiple regression models to predict gene expression from transcription factor binding sites again. However, this time, we’ll use different experiments where we have biological replicates. We’ll estimate the parameters of the model on one replicate (training data) and then we’ll measure the fit (the fraction of variance explained) on both the training data, and the other replicate (test data). Because regression models are fast to compute, we’ll go ahead and build models starting with the first transcription factor in the yeast genome, adding one more with each model until we get to the 216th, which corresponds to the model given earlier with all the transcription factors.
FIGURE 8.6 Overfitting in multiple regression and the AIC. The left panel shows the fit of the multiple regression model as a function of the number of dimensions of X used for the prediction. The gray points represent the fit on data used to estimate the parameters of the model, while the black points represent the fit on data from a replicate experiment. The right panel shows the AIC (on the data used to fit the model) as a function of the number of dimensions of X.
As more dimensions of X are used to predict Y, you can see that the fit to the training data always improves (Figure 8.6). The fit to the test data (that was not used in estimating the parameters of the model) plateaus and may even decrease. This is because the additional parameters are fitting noise in the training data that is not reflected in the held-out data. Two clear increases in fit occur for both training and test data when Rtg3 (a transcription factor with a similar motif to Pho4) and Gis1 (a transcription factor with similar motif to Msn2) are added to the model. Because the increase in model fit occurs for both the training and test data, we infer that these are real predictors of the gene expression response. You can see that the model might predict as little as about 2% of the biological variation, even though it appears to predict 10% on the training set. I say that it might only predict that little because that assumes that the biological replicates really are replicates and any nonrepeatable variation is noise. There could easily be unknown factors that make the two experiments quite different.
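The train/test comparison described above can be sketched on simulated data. This is only an illustration under stated assumptions, not the analysis behind Figure 8.6: the data are synthetic, the sizes (300 genes, 60 features) and effect sizes are made up, and the two "replicates" are assumed to share the same signal with independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_features = 300, 60   # hypothetical sizes, much smaller than the 216-factor example

# Synthetic stand-in for binding-site scores; only the first 3 features carry signal.
X = rng.normal(size=(n_genes, n_features))
beta = np.zeros(n_features)
beta[:3] = [0.5, -0.4, 0.3]     # made-up effect sizes
signal = X @ beta

# Two simulated "replicates": same signal, independent noise.
y_train = signal + rng.normal(size=n_genes)
y_test = signal + rng.normal(size=n_genes)

def r_squared(y, y_hat):
    """Fraction of variance explained: 1 - RSS / TSS."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

train_r2, test_r2 = [], []
for k in range(1, n_features + 1):
    A = np.column_stack([np.ones(n_genes), X[:, :k]])    # intercept + first k features
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)   # estimate on the training replicate only
    y_hat = A @ coef
    train_r2.append(r_squared(y_train, y_hat))           # fit to the data used for estimation
    test_r2.append(r_squared(y_test, y_hat))             # fit to the held-out replicate

print(f"k=3:  train R^2 = {train_r2[2]:.3f}, test R^2 = {test_r2[2]:.3f}")
print(f"k={n_features}: train R^2 = {train_r2[-1]:.3f}, test R^2 = {test_r2[-1]:.3f}")
```

Because the models are nested and fit by least squares, the training fit can never get worse as features are added, whereas nothing guarantees this for the held-out replicate.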
Although I have presented the problem of overfitting in the context of multiple regression, it is a theme that applies (at least in some form) to every statistical model (or algorithm) that estimates (or learns) parameters from data. As we learn a more and more complex model, we can improve prediction of the training set. However, the prediction accuracy of our model on new data (sometimes referred to as generalizability) may not increase, and might even decrease.
The classical statistical approach to avoid overfitting in linear regression is to add the additional covariates sequentially. As each new dimension of X is added to the model, all the previous ones are tested to see if the new feature captures their contribution to the model better. If so, previous Xs are removed in favor of the new one. Only if the new variable makes an additional statistically significant improvement to the fit is it retained in the final model. A major limitation of this approach is that the order that the features are added to the model can affect the result, and it’s tricky to perform multiple testing corrections in this framework: The number of tests must be related to the total number of features tried, not just the ones actually included in the model. Rather than statistically testing the covariates one by one, we’d prefer to measure the improvement of fit for each additional covariate and decide if the improvement in fit is “worth” the extra parameters. Because the objective function in regression is a likelihood, we could try to use the AIC (described in Chapter 5) to choose the best model, taking into account the number of parameters.
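One way to make the idea of deciding whether an improvement in fit is “worth” the extra parameters concrete is a greedy forward search that adds a feature only if it lowers the AIC. To be clear about what this sketch is: it swaps the significance tests of the classical stepwise procedure for an AIC comparison, it never removes a feature once added, and the function names (`fit_aic`, `forward_select`) and the simulated data are hypothetical choices made for illustration.

```python
import numpy as np

def fit_aic(X, y, cols):
    """AIC of a Gaussian linear model with an intercept plus the columns in `cols`."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    k = A.shape[1] + 1   # regression coefficients + noise variance
    # 2k - 2 ln(L_max), with the noise variance at its ML estimate rss / n
    return 2 * k + n * (np.log(2 * np.pi * rss / n) + 1)

def forward_select(X, y):
    """Greedily add the feature that lowers the AIC most; stop when none lowers it."""
    selected, remaining = [], list(range(X.shape[1]))
    best = fit_aic(X, y, selected)   # intercept-only model
    while remaining:
        score, j = min((fit_aic(X, y, selected + [j]), j) for j in remaining)
        if score >= best:            # no candidate is "worth" its extra parameter
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best

# Simulated data: only columns 4 and 11 truly affect y (made-up effect sizes).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
y = 1.0 * X[:, 4] - 1.0 * X[:, 11] + rng.normal(size=200)

selected, best_aic = forward_select(X, y)
print("selected columns, in order added:", selected)
```

A greedy search like this looks at only a tiny fraction of the possible models, so it offers no guarantee of finding the model with the lowest AIC.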
To illustrate the power of the AIC for combating overfitting, I plotted the AIC for each of the transcription factor models in the right panel of Figure 8.6. To be clear, the AIC is computed based on the maximum likelihood, which is obtained during the parameter estimation. Notice that unlike the fit to the training data, the AIC does not continue to improve (that is, decrease) as we add more and more parameters to the model. In fact, it reaches its minimum with around 75 transcription factors, and encouragingly, this is where the fit of the model to the (unseen) test data reaches its plateau. So of the 216 models that we tried, the AIC does seem to be telling us which model captured the most real information about the data, without including additional dimensions that were irrelevant for prediction of new data. An important subtlety about the minimum AIC model is also illustrated very clearly: Just because the model has the minimum AIC, and therefore is in some sense the “best” trade-off between data fit and model complexity, this does not imply that the fit (or predictive power or accuracy) that we measured on the training data actually reflects the fit of the model to new data (generalization accuracy). Indeed, in this case, the fit is still badly overestimated.
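For a linear regression with Gaussian noise, the AIC can be computed directly from the residuals of the fitted model, using only the training data. The sketch below does this for a sequence of nested models on simulated data. It is an illustration under assumptions, not the computation behind the figure: the data, sizes and effect sizes are made up, `gaussian_aic` is a hypothetical helper name, and I count the noise variance as one of the parameters.

```python
import numpy as np

def gaussian_aic(y, y_hat, n_coef):
    """AIC = 2k - 2 ln(L_max) for a linear model with Gaussian noise.

    k is the number of regression coefficients (n_coef) plus 1 for the noise variance.
    """
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    sigma2 = rss / n                                         # ML estimate of the noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)    # maximized log-likelihood
    return 2 * (n_coef + 1) - 2 * log_lik

# Simulated training data: only the first 3 of 60 features carry signal.
rng = np.random.default_rng(1)
n, p = 300, 60
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([0.5, -0.4, 0.3]) + rng.normal(size=n)

aic = []
for k in range(1, p + 1):
    A = np.column_stack([np.ones(n), X[:, :k]])     # intercept + first k features
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    aic.append(gaussian_aic(y, A @ coef, n_coef=k + 1))

best_k = int(np.argmin(aic)) + 1
print("number of features at the minimum AIC:", best_k)
```

Each added feature raises the maximized likelihood a little, but it also adds 2 to the penalty term, so the AIC only keeps falling while the gain in likelihood outweighs that penalty.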
The AIC is a very powerful and general way to compare models with different numbers of parameters. In practice, however, this approach only works when there are relatively few Xs (or features) to try. In the case of hundreds of possible features, it’s simply impractical to try all possible combinations of features and choose the one with the smallest AIC. In the previous example, I arbitrarily chose 216 models adding transcription factors one at a time starting with chromosome 1. We have absolutely no guarantee that the 75 transcription factor model that minimizes the AIC is the only model that achieves that AIC: The 7 transcription factor model I showed based on transcription factors I thought might be important has nearly the same AIC as the 75 parameter model that happened to be the best of the 216. Furthermore, there are likely to be models with much lower AICs: To be sure that we found the lowest, we would have to try all possible combinations of the 216, an astronomically large number. Even though multiple regressions are very fast to calculate, trying all 2^216 combinations (or even all possible 3 transcription factor models) would be computationally prohibitive. Not to mention the multiple testing problem. Finding the best set of features in a model is known as “feature selection” and is a fundamentally difficult problem for high-dimensional models.
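To get a feeling for the size of the search space, the counts can be computed exactly with integer arithmetic. This is a small sketch using only the Python standard library; the variable names are my own.

```python
import math

n_factors = 216

n_subsets = 2 ** n_factors                 # every subset of the 216 factors (including the empty model)
n_orderings = math.factorial(n_factors)    # every order in which the factors could be added one at a time
n_triples = math.comb(n_factors, 3)        # models containing exactly 3 factors

# Report the huge counts by their order of magnitude (number of digits minus one).
print(f"subsets of features:   about 10^{len(str(n_subsets)) - 1}")
print(f"orderings of features: about 10^{len(str(n_orderings)) - 1}")
print(f"3-factor models:       {n_triples:,}")
```

The number of subsets is what matters for an exhaustive search over models, while the number of orderings is what matters for procedures whose result depends on the order in which features are added.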
In general, feature selection is the analog of multiple hypothesis testing when we are building multivariate predictive models. Unfortunately, there is no solution as comprehensive and effective for feature selection as FDR and Bonferroni corrections are for multiple hypothesis testing.
MULTIPLE REGRESSION SO FAR
- Linear regression has a straightforward generalization if the data used to predict is more than one-dimensional.
- Partial correlations can be used to distinguish effects for two correlated covariates.
- Overfitting is when a model with lots of parameters has been trained on one dataset and predicts that data very well, but can't predict well on data that wasn't used for training. This usually indicates that the model has learned to predict the particular noise in the training data. This problem gets worse as the number of parameters that need to be estimated grows.
- The AIC is a simple and effective way to choose between models with different numbers of parameters, as long as the number of models doesn't get too large.