Density Estimation

This chapter considers the task of estimating a density from a sample of independent and identical observations. Previous chapters began with a review of parametric techniques. Parametric techniques for density estimation might involve estimating distribution parameters, and reporting the parametric den- sity with estimates plugged in; this technique will not be further reviewed in this volume.

Histograms

The most elementary approach to this problem is the histogram, which represents the density as a bar chart.

The bar chart represents frequencies of the values of a set of categorical variable in various categories as the heights of bars. When the categories are ordered, one places the bars in the same order. To construct a histogram for a continuous random variable, then, coarsen the continuous variable into a categorical variable, whose categories are subsets of the range of the original variable. Construct a bar plot for this categorical variable, again, with bars ordered according to the order of the intervals.

Because the choice of intervals is somewhat arbitrary, the boundary between bars is deemphasized by making the bars butt up against their neighbors. The most elementary version of the histogram has the height of the bar as the number of observations in the interval. A more sophisticated analysis makes the height of the bar represent the proportion of observations in the bar, and a still more sophisticated representation makes the area of the bar equal to the proportion of observation in the bar; this allows bars of unequal width while keeping the area under a portion of the curve to approximate the proportion of observation in that region. Unequally-sized bars are unusual.

Construction of a histogram, then, requires selection of bar width and bar starting point. One generally chooses the end points of intervals generating the bars to be round numbers. A deeper question is the number of such intervals. An early argument (Sturges, 1926) involved determining the largest number so that if the data were in proportion to a binomial distribution, every interval in the range would be non-empty. This gives a number of bars proportional to the log of the sample size.

The histogram is determined once the bin width Д,,, and any bin separation point, is selected. Once An is selected, many bin separation points determine the same histogram; without loss of generality, choose the smallest non-negative value and denote it by tn.

Scott (1979) calculates optimal sample size by minimizing mean square error of the histogram as an estimator of the density. Construct the histogram with bars of uniform width, and such that the bar area equals the proportion of observations out of the entire sample falling in the particular bar. Let f(x) represents the height of the histogram bar containing the potential data value x. The integrated mean squared error of the approximation is then

The quality of the histogram depends primarily on the width of the bin Дге; the choice of tn is less important . Scott (1979) takes tn = 0, and, using Taylor approximation techniques within the interval, shows that the integrated mean squared error of the approximation is

The first term in (8.2) represents the variance of the estimator, and the second term represents the square of the bias. Minimizing the sum of the first two terms gives the optimal bin size Д* = [G//J^ f(x)2dx][ '2n_1/3. One might approximate f'(x)2dx by using the value of the Gaussian density with variance matching that of the data, to obtain Д* = 3.49sn-1/2, for s the standard deviation of the sample.

Kernel Density Estimates

Histograms are easy to construct, but their choppy shape generally does not reflect our expectations about true density. A more sophisticated approach is kernel density estimation (Rosenblatt, 1956). Estimate the density as the average of densities centered at data points:

Here w is a density, also called a kernel: that is, a non-negative function integrating to 1.

This estimator / depends on the kernel w, and the smoothing parameter

Д,].

Some plausible choices for w are the (standard) Gaussian density, the Epanechnikov kernel w(x) = a— lGa3x2/9 for |ar| < Д; this kernel is scaled to have unit variance with a = j^=, and minimizes the integrated mean square error (8.1) among symmetric kernels (Epanechnikov, 1969).

Another plausible kernel is the triangle kernel w(x) = l//6— |a?/6|, for |*| < /6, and 0 otherwise. One might also consider the box kernel; consider the simplest version, w(x) = 1 if |a;| < 1/2, rather than the one standardized to unit variance, w(x) = /3/2 if |xj < /3- This kernel will be considered for pedagogical reasons below, although in practical terms its use abandons the advantages of the smoothness of the result.

Example 8.2.1 Refer again to the nail arsenic data from Example 2.3.2. Figure 8.1 displays kernel density estimates with a default bandwidth and for a variety of kernels. This figure was constructed using

cat(’ Density estimation ’) attach(arsenic)

#Save the density object at the same time it is plotted, plot(a<-density(nails),lty=l,

main="Density for Arsenic in Nails for Various Kernels") lines(density(nails,kernel="epanechnikov"),lty=2) lines(density(nails,kernel="triangular"),lty=3) lines(density(nails,kernel="rectangular"),lty=4) legend(l,2, lty=rep(l,4), legend=c("Normal","Quadratic", "Triangular","Rectangular"))

Note that the rectangular kernel is excessively choppy, and fits the poorest.

Generally, any other density, symmetric and with a finite variance, may be used. The parameter A„ is called the bandwidth. This bandwidth should depend on spread of data, and n; the spread of the data might be described using the standard deviation or interquartile range, or, less reliably, the sample range.

The choice of bandwidth balances effects of variance and bias of the density estimator, just as the choice of bin width did for the histogram. If the bandwidth is too high, the density estimate will be too smooth, and hide features of data. If the bandwidth is too low, the density estimate will provide too much clutter to make understanding the distribution possible.

We might consider minimizing the mean squared error for x fixed, rather than after integration. That is, choose Л,, to minimize the mean square error of the estimate,

FIGURE 8.1: Density for Arsenic in Nails for Various Kernels

Using the box kernel, the estimate f(x) has a rescaled binomial distribution, and E [^/(aOj = p/An for p = F(x + An/2) — F(x — An/2). Expanding F(x) as a Taylor series, noting that F'(x) = f(x), and canceling terms when possible,

The bias of the estimator is approximately An2f"(x*)/24. Hence, generally, if An does not converge to zero as n —> oo, then bias does not converge to zero, and the estimate is inconsistent; hence consider only strategies for which limn An = 0. Furthermore, since limn An = 0, then bias is approximated by

Furthermore, Var [/(ar)j = p( 1 - p)/(nAn2), and applying (8.4), and using the convergence of An to zero, one obtains

Minimizing the mean square requires minimizing

Differentiating and setting (8.7) to zero implies that C2(x)An~2jn = or

This gives a bandwidth dependent on x: a bandwidth independent of x may be constructed by minimizing the integral of the mean squared error (8.7), СъЦАпп) + C2A,,4, for C2 = C2(x) dx and C =

jI-ooCUX) dx-

Example 8.2.2 Refer again to the arsenic nail data of Examples 2.3.2 and 8.2.1. The kernel density estimate, using the suggested bandwidth, and bandwidths substantially larger and smaller than the optimal, are given in Figure 8.2. The excessively small bandwidth follows separate data points closely, but obscures the way they cluster together. The excessively large bandwidth wipes out all detail from the data set. Note the large probability assigned negative arsenic concentrations by the estimate with the excessively large bandwidth. This plot was drawn in R using

plot(density(nails,bw=a\$bw/10),xlab="Arsenic",type="l", main="Density estimate with inopportune bandwidths") lines(density(nails,bw=a\$bw*5),lty=2)

legend(1,2,legend=paste("Band width def ault",c("/10","*5")), lty=c(l,2)) detach(arsenic)

and the object a is the kernel density estimate with the default bandwidth, constructed in Example 8.2.1.

Silverman (1986, §3.3) discusses a more general kernel w(t). Equations (8.5) and (8.6) hold, with

and (8.8) continues to hold. Sheather and Jones (1991) further discuss the constants involved.

Epanechnikov (1969) demonstrates that (8.3) extends to multivariate distributions; estimate the multivariate density f(x) from independent and identically distributed observations Xl5.... Xn, where Xj = (Хц,..., X^), using

FIGURE 8.2: Density Estimate with Inopportune Bandwidths

Here w is a density over Э? , and Anj are dimension-dependent bandwidth parameters. Epanechnikov (1969) uses multivariate densities that are products of separate densities for each dimension, but surveys work using more general kernels in two dimensions. The number of observations needed for precise estimation grows dramatically as the dimension d increases.

Exercises

1. The data at

http://lib.stat.cmu.edu/datasets/CPS_85.Wages

reflects wages from 1985. The first 27 lines of this file represent an explanation of variables; delete these lines first, or skip them when you read the file. The first six fields are numeric. The sixth is hourly wage; you can skip everything else. Fit a kernel density estimate to the distribution of wage, and plot the result . Comment on what you see.

2. The data at

http://lib.stat.cmu.edu/datasets/cloud

contain data from a cloud seeding experiment. The first fifteen lines contain comment and label information; ignore these. The third field indicates season, and the sixth field represents a rainfall measure expected not to be impacted by the experimental intervention. On the same set of axes, plot kernel density estimates of summer and winter rain fall, and comment on the difference.