# REGULARIZATION AND PENALIZED LIKELIHOOD

This brings us to one of the most powerful ideas in machine learning: regularization. Regularization can solve the problem of overfitting by allowing us to fit simpler models to very high-dimensional data; the key idea is to encourage most of the parameters to be set to zero. Here, we'll consider one very elegant type of regularization for probabilistic models, called "penalized likelihood," where we modify the objective function by adding new terms to the likelihood. In other words, penalized likelihood methods amount to changing the objective function, in this case so that the objective function deems models to fit the data better if they have few nonzero parameters. It's important to remember that regularization is a more general term: even models that don't have probabilistic interpretations can have regularization schemes to encourage sparsity and avoid estimating too many parameters from too little data.

In the case of linear regression, we typically consider penalized likelihoods of the following form:

PL(θ) = log L(θ | X) − f(θ)

PL(b_0, *b*, σ) = log L(b_0, *b*, σ | X, Y) − λ1 Σ_j |b_j| − λ2 Σ_j b_j^2
where the first formula is the general formula for a penalized likelihood (PL), in which I have written the penalty as an arbitrary function, f, of the parameters, θ. The second formula is the kind of penalized likelihood that's typically used as an objective function for linear regression. You can see that the penalty terms for linear regression subtract from the likelihood based on the sizes of the slope parameters of the linear regression model. Note that when the *b* parameters are 0, there is no penalty. If a *b* parameter in some dimension is not zero, the penalty is weighted by a parameter λ, which controls the relative strength of the penalty. In machine learning jargon, the first penalty term is referred to as an L1 regularization because it is proportional to the L1 norm of the vector *b*; the second penalty term is an L2 regularization because it is proportional to the (squared) L2 norm of the vector. In practice, a mixture of these two (the so-called "elastic net"; Zou and Hastie 2005) or L1 alone (the "LASSO"; Tibshirani 1996) works well for linear regression.
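To make the objective concrete, here is a minimal sketch in Python of the elastic-net penalized log-likelihood for linear regression with Gaussian noise. The function names and the tiny dataset are my own illustration, not from any package:

```python
import math

def gaussian_log_likelihood(b0, b, sigma, X, Y):
    """Log-likelihood of a linear regression with Gaussian noise:
    Y_i ~ Normal(b0 + sum_j b[j] * X[i][j], sigma^2)."""
    ll = 0.0
    for x_i, y_i in zip(X, Y):
        mu = b0 + sum(bj * xj for bj, xj in zip(b, x_i))
        ll += -0.5 * math.log(2 * math.pi * sigma**2) - (y_i - mu)**2 / (2 * sigma**2)
    return ll

def penalized_log_likelihood(b0, b, sigma, X, Y, lam1, lam2):
    """Elastic-net penalized likelihood: the likelihood minus L1 and L2
    penalties on the slope parameters (the intercept b0 is not penalized)."""
    l1 = sum(abs(bj) for bj in b)      # L1 norm of the slopes
    l2 = sum(bj**2 for bj in b)        # squared L2 norm of the slopes
    return gaussian_log_likelihood(b0, b, sigma, X, Y) - lam1 * l1 - lam2 * l2

# A model with nonzero slopes pays a penalty; with lam1 = lam2 = 0
# the objective reduces to the ordinary log-likelihood.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [1.1, 0.9, 2.0]
ll = gaussian_log_likelihood(0.0, [1.0, 1.0], 1.0, X, Y)
pl = penalized_log_likelihood(0.0, [1.0, 1.0], 1.0, X, Y, lam1=0.5, lam2=0.0)
assert pl < ll  # the penalized likelihood is always less than the likelihood
```

With lam1 = 0.5 and two slopes of size 1, exactly 1.0 is subtracted from the log-likelihood, which is the trade-off discussed below.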

Terminology aside, I hope that the form of this new objective function makes some sense: the penalized likelihood explicitly trades off the fit to the data (the likelihood) against a penalty function based on the parameters. The more the parameters deviate from 0, the more gets subtracted from the likelihood. The larger the λ parameters, the bigger the change in likelihood needs to be before a nonzero parameter can be accepted into the model. In the limit of very small λ, the penalized likelihood objective function reduces to the ordinary likelihood.

It’s also worth noting that adding penalty terms to the likelihood (in particular the *L*1 penalty) means that there are no longer analytic solutions for the *b* parameters; the models must be fit using numerical methods. This means that to use them in practice, you’ll always be relying on the efficiency and accuracy of software packages. Be sure to check for the latest and greatest implementations. For example, there are several packages implementing these models in R. In general, penalized likelihoods can lead to computationally hard optimization problems; this means that it is often not possible to optimize these objective functions in a practical amount of time for very large datasets with millions of datapoints. However, for the specific case of regression (including generalized linear models), it’s known that the maximum of the penalized likelihood objective function can be identified efficiently (Friedman et al. 2010).
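For a sense of what those numerical methods do under the hood, here is a bare-bones Python sketch of cyclic coordinate descent for the LASSO, in the spirit of the approach of Friedman et al. (2010). This is an illustration under simplifying assumptions (no intercept, fixed iteration count, unstandardized inputs), not a substitute for an optimized package:

```python
import math

def soft_threshold(z, g):
    # S(z, g) = sign(z) * max(|z| - g, 0): the exact solution of the
    # one-dimensional lasso problem.
    return math.copysign(max(abs(z) - g, 0.0), z)

def lasso_coordinate_descent(X, Y, lam, n_iters=200):
    """Cyclic coordinate descent for the lasso objective
    (1/2n) * sum_i (y_i - sum_j b_j x_ij)^2 + lam * sum_j |b_j|.
    X is a list of rows; assumes Y and the columns of X are centered."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # partial residual: leave feature j out of the current fit
                r_i = Y[i] - sum(b[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * r_i
                z += X[i][j] ** 2
            b[j] = soft_threshold(rho / n, lam) / (z / n)
    return b

# Example: one feature perfectly predicting Y; the L1 penalty shrinks the
# slope from 1.0 down to 0.75, and a large enough lam sets it exactly to 0.
b = lasso_coordinate_descent([[1.0], [-1.0]], [1.0, -1.0], lam=0.25)
```

The soft-thresholding step is what sets small coefficients exactly to zero, which is how the L1 penalty produces sparse models.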

Fitting models using penalized likelihoods can in principle solve the problem of model complexity/feature selection for linear regression. Estimating parameters using L1 regularized likelihood tends to produce models with far fewer nonzero parameters than standard linear regression, and the parameters that are nonzero tend to be much more strongly associated with the data. Another great advantage of penalized regression is that it's possible to fit models where the number of covariates (predictors, dimensions of X) is larger than the number of observations. This is in contrast to standard multiple regression, where if the number of dimensions is greater than the number of observations, the estimation procedures will fail. However, one inconvenience of regularized regression compared to standard regression is that because the parameter estimates are no longer maximum likelihood estimates (they are maximum *penalized* likelihood estimates), we can no longer apply the well-developed theory that we used to derive the distribution of parameter estimates and P-values for hypothesis tests of association that we saw in Chapter 6. Of course, distributions of parameters and P-values for hypothesis tests can still be obtained numerically using permutations and resampling (see Chapter 2).

Perhaps the more serious drawback of the penalized likelihood objective functions that I've described is that although they can ensure that most of the *b* parameters are held at 0, to do so, I had to introduce two additional parameters: λ1 and λ2. These parameters also need to be estimated, and I hope it's obvious that if we tried to choose them to maximize the objective function, we would simply set them to 0: the penalized likelihood is always less than the likelihood. In practice, therefore, these parameters are usually chosen using a cross-validation procedure. The idea of the cross-validation procedure is to estimate the parameters of the model on a subset of the data, and see how well it predicts the data that was left out during the estimation. (We will return to cross-validation later in the book in our discussion of evaluation of classifiers.) Cross-validation allows the regularization parameters to be chosen so that the model is producing the best possible predictions of data that wasn't used in the training, and therefore can't be "overfit." Of course, the downside of this cross-validation procedure is that we need enough data to actually leave some of it out during training. With thousands of datapoints, this isn't usually a problem for continuous regression problems, but it can become an issue for logistic and multinomial regression in the classification setting. We will return to this in the next chapters. If there isn't enough data for cross-validation, it's possible to choose a regularization parameter that produces a number of nonzero parameters that matches some biological knowledge or intuition. We'll return to the intuition behind these regularization parameters later in this chapter.
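The cross-validation idea can be sketched in a few lines of Python. The `fit`/`predict` pair below is a deliberately toy penalized estimator (a constant prediction shrunk toward zero by lam), an assumption of mine just to make the sketch runnable; in practice you would plug in a penalized regression fit:

```python
def k_fold_cv_error(X, Y, lam, fit, predict, k=5):
    """Mean squared prediction error on held-out folds: train on k-1 folds,
    evaluate on the fold that was left out, for a given penalty strength lam."""
    n = len(X)
    total, count = 0.0, 0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        model = fit([X[i] for i in train], [Y[i] for i in train], lam)
        for i in test:
            total += (Y[i] - predict(model, X[i])) ** 2
            count += 1
    return total / count

# Toy stand-in for a penalized regression fit (illustration only):
# predict a constant, the training mean of Y shrunk toward zero by lam.
def fit(X_train, Y_train, lam):
    return sum(Y_train) / len(Y_train) / (1.0 + lam)

def predict(model, x):
    return model

# Choose the regularization strength that predicts held-out data best.
X, Y = [[0.0]] * 10, [2.0] * 10
lam_grid = [0.0, 0.1, 1.0]
best_lam = min(lam_grid, key=lambda lam: k_fold_cv_error(X, Y, lam, fit, predict))
# Here best_lam is 0.0: the toy data are noise-free, so no shrinkage helps.
```

The key point is that lam is chosen to minimize error on data the model never saw during fitting, so it cannot be driven to zero the way it would be if we maximized the penalized likelihood over lam directly.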

Applying regularization to the transcription factor regression problem (discussed in Chapter 8) allows us to automatically obtain sparse models in an unbiased way. For example, with an L1 penalty of 0.045, I got a 3-transcription factor model with R^{2} of about 0.029 on the training data, and 0.011 on the replicate. The model has an AIC of 6722, which compares pretty well to what I got with the models I trained with a lot more work in Chapter 8.