Regularization in Multiple Regression and Beyond
In the discussion of multiple regression in Chapter 8, we saw that as the number of dimensions of X becomes large, we have to ensure that the additional parameters in our model do not overfit the data. The classical statistical solution to this problem is to add covariates (or features, or dimensions of X) sequentially and test whether each newly estimated parameter is statistically significant. If it is, we then retest all the previous covariates to see whether any of their parameters are no longer significant with the new covariate included. Needless to say, this becomes tedious if there are more than a handful of features to be considered.
A somewhat more practical solution that I presented in Chapter 8 is to fit several models with different numbers of parameters and compare them using the Akaike information criterion (AIC). Although this approach can be applied more generally, we still need to fit every candidate model and then compute its AIC. As we saw in Chapter 8, this is not feasible for the hundreds of features (or more) that we typically have in genome-scale quantitative biology.
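To make the cost of this strategy concrete, here is a minimal sketch of an exhaustive AIC search over feature subsets. It assumes ordinary least-squares regression with Gaussian noise, for which the AIC can be written (up to an additive constant) as n log(RSS/n) + 2k. The function names gaussian_aic and best_subset_by_aic are hypothetical and chosen only for this illustration; they are not taken from Chapter 8.

```python
import itertools

import numpy as np


def gaussian_aic(y, X_subset):
    """AIC, up to an additive constant, for least squares with Gaussian noise."""
    n = len(y)
    # Always include an intercept column alongside the chosen features.
    design = np.column_stack([np.ones(n), X_subset])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    # Count the regression coefficients plus the noise variance.
    k = design.shape[1] + 1
    return n * np.log(rss / n) + 2 * k


def best_subset_by_aic(y, X):
    """Fit every subset of the columns of X and return the one with the lowest AIC."""
    n_features = X.shape[1]
    best_aic, best_subset = np.inf, ()
    n_models = 0
    for size in range(n_features + 1):
        for subset in itertools.combinations(range(n_features), size):
            columns = np.asarray(subset, dtype=int)
            aic = gaussian_aic(y, X[:, columns])
            n_models += 1
            if aic < best_aic:
                best_aic, best_subset = aic, subset
    return best_subset, best_aic, n_models
```

The loop visits every subset of the features, so with p features it fits 2^p models. This is manageable for a handful of features, but the count doubles with each feature added.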
The use of the AIC to trade off model complexity (number of parameters) against likelihood suggests that there might be a smarter way to fit complex models. Rather than maximizing the likelihood of many different models and then computing their AICs, we might try to maximize a different objective function that has the property we like about the AIC, namely, that it tends to choose models with fewer parameters unless the difference in likelihood between the models is large enough.
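The phrase "large enough" can be made precise with a short worked comparison. The sketch below assumes the standard definition of the AIC, with k the number of parameters and L the maximized likelihood; the notation in Chapter 8 may differ. Model 1 has k_1 parameters and model 2 has k_2 > k_1.

```latex
\begin{align*}
\mathrm{AIC} &= 2k - 2\log L \\
\mathrm{AIC}_2 < \mathrm{AIC}_1
  &\iff 2k_2 - 2\log L_2 < 2k_1 - 2\log L_1 \\
  &\iff \log L_2 - \log L_1 > k_2 - k_1
\end{align*}
```

Under this definition, the larger model is preferred only when its gain in log-likelihood exceeds the number of parameters it adds.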
Another way to think about this problem is to consider that when we fit the multiple regression model, we are not using a key piece of information: We expect most of the dimensions of X to be unrelated to our prediction problem. Said another way, we expect most of the components of the “b” vector to be 0. In machine learning jargon, we expect the parameter vector to be “sparse,” meaning most of the values are 0s. Sparsity keeps the complexity penalty in the AIC small because it means that we don’t include most of the possible parameters in the model. Of course, the problem is that beforehand, we don’t know which of the dimensions should have their parameters set to 0, and which ones we should include in our model. And as I pointed out in Chapter 8, the number of possible combinations means that for high-dimensional problems, it’s infeasible to try them all.
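A small simulation illustrates why ordinary multiple regression does not exploit sparsity. The sketch below assumes simulated Gaussian data in which only the first three components of b are non-zero. The helper simulate_sparse_regression and all its settings are hypothetical choices made only for this illustration.

```python
import numpy as np


def simulate_sparse_regression(n_samples=100, n_features=20, n_relevant=3, seed=0):
    """Simulate y = X b + noise where only the first n_relevant entries of b are non-zero."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    b_true = np.zeros(n_features)
    b_true[:n_relevant] = 2.0  # illustrative effect size for the relevant features
    y = X @ b_true + rng.normal(size=n_samples)
    return X, y, b_true


X, y, b_true = simulate_sparse_regression()

# Ordinary least squares: no information about sparsity is used.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("non-zero entries in true b:     ", np.count_nonzero(b_true))
print("non-zero entries in estimated b:", np.count_nonzero(b_hat))
```

Because of the noise, the least-squares estimates for the irrelevant features are typically small but not exactly 0, so the fitted parameter vector is not sparse even though the true one is.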