REGRESSION IS NOT JUST "LINEAR"—POLYNOMIAL AND LOCAL REGRESSIONS
One very powerful feature of linear regression is that it can be generalized in several different ways. First, it does not assume that the relationship between variables is "linear" in the sense that there has to be a line connecting the dots. A simple way to extend linear regression beyond a simple line is to consider regressions on transformations of the X variables (features or covariates). For example, it's totally okay to write regressions like

E[Y|X] = b_0 + b_1 log(X)

E[Y|X] = b_0 + b_1 X + b_2 X^2
Because linear regression assumes the Xs are measured exactly, everything we worked out applies just fine to these transformations of X. You might notice that in one of the regressions I added an extra parameter, b_2, to weight the term corresponding to X^2. Although at first it looks like this will make the regression problem much more complicated, if you write out the likelihood for this regression (see Exercises), you'll see that you can derive a third formula for b_2, just like we derived formulas for b_0 and b_1. Although the algebra gets a bit tedious, in Chapter 8 we will discuss how to write out the equations for linear regression in linear algebra notation and solve them for an arbitrary number of b's. Thus, "linear" regression can be used on data where arbitrary nonlinear transformations have been applied to the predictors. In the case where we fit terms proportional to X and X^2, we are using linear regression to fit a parabola (or quadratic form) to our data. In fact, polynomials of arbitrary degree can be fit using linear regression.
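To make this concrete, here is a minimal sketch (using NumPy and made-up data, not code from this chapter) showing that fitting the parabola b_0 + b_1 X + b_2 X^2 really is an ordinary linear least-squares problem: the model is nonlinear in X but still linear in the parameters.

```python
import numpy as np

# Made-up illustrative data: a noisy parabola.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0.0, 0.3, size=x.size)

# "Linear" regression on the transformed features [1, X, X^2]:
# the design matrix has one column per (transformed) predictor.
design = np.column_stack([np.ones_like(x), x, x**2])
b0, b1, b2 = np.linalg.lstsq(design, y, rcond=None)[0]
```

With enough data, the estimates b0, b1, b2 recover the coefficients used to generate the parabola, even though no line could fit these points.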
Perhaps even more powerful than transformations of X are extensions of regression that do not use all of the values of X equally. These methods are known as "local" regressions. The idea is very simple: if you have plenty of data, you can use only nearby points to make a model that predicts well in a local region or neighborhood of the space. The simplest form of a local regression is "nearest neighbor regression," where, instead of the entire dataset, only a fixed window of, say, k datapoints is used to predict Y. In a nearest neighbor regression, we construct a simple average of the Y values at the k datapoints nearest each point of interest. The k-nearest datapoints can be chosen using any of the distance metrics we discussed in Chapter 5; in the univariate (one-dimensional) case, this just means the numerically closest points. I hope it's clear that different subsets of the dataset will have different averages, so the overall predictions of Y need not form a line at all. As k approaches the size of the entire dataset, the result of nearest neighbor regression approaches the average of the dataset. The main drawbacks of nearest neighbor regression are that it's very sensitive to the size of the neighborhood, and that we only get predictions of Y at observed values of X. For this reason, nearest neighbor regression often includes heuristic methods to interpolate between the observed Xs.
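A minimal sketch of univariate nearest neighbor regression might look like the following (the function name and toy data are my own, purely for illustration):

```python
import numpy as np

def knn_regress(x_train, y_train, x0, k):
    """Predict Y at x0 as the simple average of the k nearest datapoints."""
    # In the univariate case, "nearest" just means numerically closest.
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])
# Different neighborhoods give different averages, so the predictions
# need not fall on a line; with k equal to the dataset size, the
# prediction collapses to the overall mean of the data.
```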
Perhaps the most elegant formulation of local regression is known as kernel regression. In this formulation, a weighted average is used to predict Y at any point X_0, based on all the data, but using a weighting scheme that weights the Xs so that data near X_0 contribute more strongly to the average than distant datapoints. Specifically, in the so-called Nadaraya-Watson kernel regression, the prediction of Y at a point X_0 is given by

E[Y|X_0] = Σ_i K(|X_i − X_0|) Y_i / Σ_i K(|X_i − X_0|)
where K(|X_i − X_0|) is the so-called "kernel function," which is used to weight the data based on their distance |X_i − X_0| from the point of interest X_0. Thus, kernel regression solves both of the problems of nearest neighbor regression, but raises the question of how to choose the kernel function. We seek a function that is maximal when the distance between points is small and decays rapidly when the distance is large. A great example of such a function is the Gaussian probability density, and it is a very popular choice of kernel (where it is known as the radial basis function kernel, or RBF kernel).
When we use a Gaussian kernel, we also need to choose the standard deviation (or bandwidth) of the kernel. This determines how fast the kernel decays as a function of distance, effectively setting the size of the neighborhood of points used for local regression. If the bandwidth (the standard deviation of the Gaussian kernel) is too small, nearby points will dominate the local estimate, and the estimate will be too noisy. On the other hand, if the bandwidth is too large, the local estimate will be insensitive to the variation in the data. In the limit of infinite bandwidth, kernel regression converges to a simple average of the data (see Exercises). There are methods to choose this bandwidth automatically based on the data, but these should be used with caution. In most low-dimensional cases, you can obtain a good kernel regression by trying a few values of the bandwidth.
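Putting the pieces together, a sketch of Nadaraya-Watson kernel regression with a Gaussian kernel might look like this (the function and data are illustrative assumptions, not the chapter's code); note how the bandwidth parameter sets the effective neighborhood size.

```python
import numpy as np

def kernel_regress(x_train, y_train, x0, bandwidth):
    """Nadaraya-Watson estimate of E[Y|X = x0] with a Gaussian kernel."""
    # Unnormalized Gaussian weights; the density's normalizing constant
    # cancels between the numerator and denominator of the average.
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    return np.sum(w * y_train) / np.sum(w)

x = np.linspace(0.0, 10.0, 11)
y = x**2
# A tiny bandwidth lets the single nearest point dominate; a very large
# bandwidth gives every point weight near 1, so the estimate converges
# to the simple average of the data.
```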
To illustrate the effect of the bandwidth on kernel regression, I fit two kernel regressions to the mRNA and protein data (Figure 7.5). Notice that the result confirms that over a large range of mRNA and protein, the relationship appears linear; but at high mRNA levels, the linear relationship clearly becomes more complicated. However, when we fit the regression with a bandwidth that is too small, we find that the kernel regression becomes unstable and follows extreme datapoints.
The most widely used form of local regression is LOESS (Cleveland and Devlin 1988). In this method, a polynomial is used to model the conditional expectation of Y at any point X_0, based on a weighting scheme that weights the Xs so that only those near X_0 are given appreciable weight, and everything else is given a very small weight or a weight of zero. Note that, as with kernel regression, X_0 is not necessarily an actual datapoint (or observation): it can be, but it can also just be a point where we want an estimate of Y. Just as standard linear regression gives a prediction at every value of X, local regression gives a prediction at every value of X based on the nearby data.

FIGURE 7.5 Kernel regression of protein levels on mRNA levels. On the left is a kernel regression with a suitable choice of bandwidth (0.25) for this data. On the right is a kernel regression with a very small bandwidth (0.01) that clearly shows overfitting of the data. Gray "+"s indicate individual genes for which protein and mRNA abundance have been measured.
The estimate at X_0 is obtained by first fitting a polynomial to the nearby points using linear regression, minimizing a weighted version of the SSR:

SSR(b|X_0) = Σ_i w(|X_i − X_0|)(Y_i − b_0 − b_1 X_i)^2

where I have written the weighting function w as a function of |X_i − X_0| to emphasize that it depends on the distance of each datapoint from the point of interest. Because these weights don't depend on the parameters of the regression, they add no numerical complication to the minimization of the SSR (see Exercises). Once we have obtained the local estimates of the b parameters, we can go ahead and predict the value of Y at the point X_0.
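As a sketch, the local fit at X_0 is just a weighted least-squares problem; centering the polynomial at X_0 makes the fitted intercept the prediction there. The function name and Gaussian weighting function below are my own illustrative choices, not the book's code.

```python
import numpy as np

def loess_point(x_train, y_train, x0, bandwidth):
    """Locally weighted linear fit at x0; returns the local prediction of Y."""
    # Weighting function w(|x_i - x0|); a Gaussian is assumed here, but
    # any function that decays with distance would do.
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    # Design matrix for a first-order polynomial centered at x0.
    X = np.column_stack([np.ones_like(x_train), x_train - x0])
    # Solve the weighted normal equations (X^T W X) b = X^T W y.
    b = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ y_train)
    return b[0]  # the intercept is the fitted value at x0

x = np.linspace(0.0, 10.0, 21)
```

On exactly linear data, the locally weighted fit recovers the line at every X_0, since the weights don't change the minimizer when a perfect fit exists.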
Amazingly, many of the results from standard linear regression, regarding the convergence of the predictions of Y and the variance in the estimate, can be generalized to weighted regressions like kernel regression and LOESS. Of course, two important issues are the choice of the polynomial (e.g., linear, quadratic, cubic, etc.) and the choice of the weighting function w(). Note that if we choose a 0th-order polynomial, we have something very close to kernel regression. If a kernel is chosen as the weighting function, LOESS is equivalent to "local polynomial kernel regression." Since LOESS is a widely used method, there are standard choices that are usually made, and it's known in many cases how to obtain good results. Furthermore, there are LOESS implementations in standard statistics packages like R and MATLAB®.
Local regression approaches are used often in molecular biology to model time-course data where levels of genes or proteins are changing over time, but there is no reason to assume that these changes are linear functions (or any other simple form) of time.
Another interesting application of local polynomial regressions is that they can be used to estimate derivatives of the curves that they fit. For example, in the mRNA and protein modeling example, we wanted to test if the slope of the (log-transformed) regression was 1, but we noticed that
FIGURE 7.6 A second-order kernel regression (with bandwidth 0.5) is used to model the conditional expectation E[Y|X] (upper dashed trace) and the first derivative (d/dX)E[Y|X] (lower dashed trace). The hypothesis of simple proportionality of mRNA and protein abundances is indicated as a solid line (slope of 1), and the derivative follows this prediction over the range where most of the data is found.
the assumption of a constant slope over the whole data was unrealistic. I solved this by simply throwing out the data (at high mRNA levels) that seemed to violate the simple assumptions about the relationship between mRNA and protein abundance. However, using a polynomial kernel regression, we can estimate the slope (the first derivative) of the regression at every point without assuming a simple linear relationship in the data. A kernel estimate of the derivative is shown in Figure 7.6: you can see that for most of the range of mRNA concentrations, the slope is very close to 1. Note that unlike in the simple linear regression models shown in Figure 7.3, no assumptions about a simple “line” fitting the whole dataset are made in order to estimate the slope here. And still, we find that on average, there is a simple linear relationship between mRNA and protein abundance for the range of expression levels where most of the mRNAs are found.
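A sketch of this derivative estimate: fit a second-order polynomial centered at X_0 with kernel weights, and read off the coefficient of the linear term, which is the slope of the fitted curve at X_0 (the helper below is hypothetical and illustrative, not the chapter's implementation).

```python
import numpy as np

def local_slope(x_train, y_train, x0, bandwidth):
    """Estimate d/dX E[Y|X] at x0 from a weighted second-order polynomial fit."""
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    dx = x_train - x0
    # Second-order polynomial centered at x0: b0 + b1*dx + b2*dx^2,
    # so b1 is the first derivative of the fitted curve at x0.
    X = np.column_stack([np.ones_like(dx), dx, dx**2])
    b = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ y_train)
    return b[1]

x = np.linspace(0.0, 4.0, 41)
```

Evaluating local_slope on a grid of X_0 values traces out the derivative curve without ever assuming a single line fits the whole dataset.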