DIFFERENCES BETWEEN THE EFFECTS OF L1 AND L2 PENALTIES ON CORRELATED FEATURES
Although the form of the penalties I proposed is somewhat arbitrary at this point, it’s useful to consider the effect that the two penalties would have on highly correlated features. For example, let’s imagine the extreme case where two features, X1 and X2, are perfectly correlated, and the b for either of them alone is 0.5. In this case, either one could be used to predict Y, and there’s no additional information gained by including the other one. If we use the L2 penalty, the optimal solution is to assign b = (0.25, 0.25), which gives a total penalty of 0.125 (0.25² + 0.25²). Any unequal distribution between the two features gives a larger penalty; for example, b = (0.5, 0) gives 0.25. This means that the L2 penalty prefers to share the predictive power among the correlated features. In practice, this means that the L2 penalty doesn’t reliably produce sparse models when features are strongly correlated; instead, it just assigns smaller bs to all the correlated features.
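The L2 arithmetic in this example can be checked with a few lines of Python. This is only a sketch: it assumes the L2 penalty is the plain sum of squared coefficients, with no regularization weight applied, and the helper name l2_penalty and the grid of candidate splits are made up for illustration.

```python
# Assumption: the L2 penalty is the sum of squared coefficients,
# before multiplying by any regularization parameter.
def l2_penalty(b):
    return sum(x * x for x in b)

TOTAL = 0.5  # coefficient either feature would get on its own (from the text)

# Candidate ways to split TOTAL between X1 and X2:
# (0, 0.5), (0.05, 0.45), ..., (0.5, 0)
splits = [(i / 20, TOTAL - i / 20) for i in range(11)]
penalties = [l2_penalty(b) for b in splits]

# Pick the split with the smallest L2 penalty.
best_split = splits[penalties.index(min(penalties))]

for b, p in zip(splits, penalties):
    print(f"b = ({b[0]:.2f}, {b[1]:.2f})  L2 penalty = {p:.4f}")
print("smallest penalty at", best_split)
```

On this grid the penalty is smallest at the equal split and grows as the split becomes more lopsided, which is the sharing behavior described for the L2 penalty.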
If we use the L1 penalty, on the other hand, any solution where the two components are non-negative and add up to 0.5, such as b = (0.5, 0), b = (0, 0.5), or b = (0.25, 0.25), yields the same total penalty to the likelihood of 0.5. This means that the L1 penalty has no preference between sharing and just choosing one of the features if they are perfectly correlated. For any real features that are not perfectly correlated, one of the features will explain the data slightly better (possibly due to noise) and that feature will tend to be selected by the model, while the other one will be removed from the model. Thus, when features are correlated, it’s the L1 penalty, not the L2 penalty, that actually pushes parameters to 0. In practice, however, the L1 penalty can be overly “aggressive” at removing predictive features, especially in molecular biology where we may wish to preserve bona fide biological redundancy.
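To make the contrast concrete, the sketch below evaluates both penalties on three ways of dividing a total coefficient of 0.5 between X1 and X2. It assumes the L1 penalty is the sum of absolute values and the L2 penalty is the sum of squares, both without their regularization weights; the function names are hypothetical. It only compares penalty values and does not fit a model.

```python
# Assumption: penalties are computed without their regularization weights.
def l1_penalty(b):
    return sum(abs(x) for x in b)

def l2_penalty(b):
    return sum(x * x for x in b)

# Three ways of dividing a total coefficient of 0.5 between X1 and X2.
candidates = [(0.5, 0.0), (0.25, 0.25), (0.0, 0.5)]

l1_values = [l1_penalty(b) for b in candidates]
l2_values = [l2_penalty(b) for b in candidates]

for b, p1, p2 in zip(candidates, l1_values, l2_values):
    print(f"b = {b}: L1 = {p1}, L2 = {p2}")
```

The L1 column is the same for all three candidates, so the penalty alone cannot choose among them; the L2 column is lowest for the equal split.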
Because the two penalties behave differently on correlated features, it’s thought that a penalty that combines L1 and L2 can work better than either alone. The L1 part works well to get rid of features that are only fitting noise, while the L2 part encourages inclusion of multiple highly correlated features in the model. The cost of combining them is that two regularization parameters must be chosen for each model.
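A minimal sketch of such a combined penalty follows. The additive form, the function name combined_penalty, and the parameter names lam1 and lam2 (standing in for the two regularization parameters) are assumptions for illustration, as are the weight values; the exact form intended may differ.

```python
# Hypothetical combined penalty: lam1 * sum(|b|) + lam2 * sum(b^2).
# lam1 and lam2 stand in for the two regularization parameters;
# the values used below are arbitrary.
def combined_penalty(b, lam1, lam2):
    l1 = sum(abs(x) for x in b)
    l2 = sum(x * x for x in b)
    return lam1 * l1 + lam2 * l2

lam1, lam2 = 1.0, 1.0

pick_one = combined_penalty((0.5, 0.0), lam1, lam2)   # keep only X1
share = combined_penalty((0.25, 0.25), lam1, lam2)    # split between X1 and X2

# With the L2 weight set to 0, the L1 tie between the two solutions returns.
pick_one_l1_only = combined_penalty((0.5, 0.0), lam1, 0.0)
share_l1_only = combined_penalty((0.25, 0.25), lam1, 0.0)

print("pick one:", pick_one, " share:", share)
print("L1 only, pick one:", pick_one_l1_only, " share:", share_l1_only)
```

Under these assumptions, any positive lam2 makes the shared solution cheaper than picking a single feature, while the L1 term still charges for every nonzero coefficient, so features that only fit noise remain candidates for removal.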