Least Squares, Errors and the Central Limit Theorem (1805-1810)

New conceptual breakthroughs were required in order to make further progress in the general framework of statistical inference. Relatively little progress was made in the twenty years following Laplace’s 1781 paper. Laplace himself appears to have put probability to one side and refocused on mathematical astronomy, where he made terrific contributions. Two vital breakthroughs were then delivered in succession in 1805 and 1809 by two other pre-eminent European mathematicians—Legendre and Gauss. Laplace then returned once again to the subject of probability and delivered the final prong of this trident in 1810. The synthesis of the three new concepts introduced over this five-year period represented a fundamental and permanent development in mathematical statistics.

Like so many of his peers, Adrien-Marie Legendre’s primary focus during his remarkable career had been on pure mathematics and astronomy. His interest in probability and statistics arose through the contemporary problem that had similarly attracted Laplace: finding a ‘best estimate’ from a number of discordant astronomical observations. In 1805 he published a short book on modelling comets, Nouvelles méthodes pour la détermination des orbites des comètes. The book included an appendix entitled ‘Sur la méthode des moindres quarrés’. In this appendix, he advocated choosing a parameter estimate by minimising the squared errors that it produces for a given set of observations. That is, using the notation developed above, if we have observations v_{1}, ..., v_{n} and we wish to make a best estimate of V, we would write e_{i} = v_{i} - V, and find the estimator that minimises the sum of e_{1}^{2}, ..., e_{n}^{2}. He noted that, in this form of example, the estimator produced by least squares would be the arithmetic mean of the sample (and this applied irrespective of the specific form of the error probability distribution). He did not attach any profound importance to this observation, but viewed it as a desirable property.
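Legendre’s observation is easy to verify numerically. The sketch below (the observation values are invented for the illustration) scans candidate estimates of V and confirms that the sum of squared errors is smallest at the arithmetic mean of the sample:

```python
import numpy as np

# Hypothetical observations v_1, ..., v_n (not from the original text).
v = np.array([9.8, 10.1, 10.4, 9.9, 10.2])

def sum_sq_errors(V, obs):
    """Legendre's criterion: the sum of the squared errors e_i = v_i - V."""
    return float(np.sum((obs - V) ** 2))

# Scan candidate estimates on a fine grid and pick the minimiser.
grid = np.linspace(v.min(), v.max(), 10001)
best_V = grid[np.argmin([sum_sq_errors(V, v) for V in grid])]

# The minimiser coincides (to grid precision) with the arithmetic mean.
print(best_V, v.mean())
```

The same conclusion follows analytically: differentiating the sum of squared errors with respect to V and setting the result to zero gives V equal to the sample mean.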

Legendre’s suggestion of the use of the sample arithmetic mean was clearly easy to practically implement, and had a beguiling mathematical elegance. But he appeared to have plucked the least squares fitting rationale out of thin air. He did not provide any fundamental mathematical or metaphysical rationale for why the least squares estimator property of the sample arithmetic mean made it, in some deeper sense, ‘best’. Moreover, Laplace had already established that, for an arbitrary choice of error probability distribution, the arithmetic mean of the sample would not necessarily be best (in the sense of being the median of the posterior distribution). Legendre had worked out that the arithmetic mean of the sample was the least squares estimator of the population mean, but for this observation to progress from being a curious nicety to a pillar of a rigorous probabilistic approach to statistical estimation, more was required.

The great Carl Gauss then took an important step towards an explanation of why the least squares estimator could be mathematically described as a ‘best’ estimate. Like Laplace and Legendre, he focused the bulk of his energy and talents on pure mathematics and astronomical problems. In 1809, he published a major work on the mathematics of planetary orbits, ‘Theoria motus corporum coelestium’.^{1} Like Legendre’s book on the orbits of comets, it included as an appendix a piece on statistical inference. Gauss’s statistical piece started where Laplace had finished. That is, he considered the Bayesian approach to developing a posterior distribution based on an assumption of a uniform prior distribution and the conditional distributions of the observed errors. He showed that the choice of parameter value implied by the median of the posterior distribution in this case was also the choice of parameter value that maximised the value of the joint conditional probability distribution of the errors. The parameter value could therefore be found by differentiating the joint error probability distribution and setting the derivative equal to zero. Interestingly, this was very close to the maximum likelihood concept that Fisher developed in the early twentieth century. However, without defining the form of the error probability distribution, Gauss had not practically advanced beyond where Laplace had reached twenty years earlier.
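In modern notation (not Gauss’s own), the step can be sketched as follows. With a uniform prior, the posterior density for V is proportional to the joint density of the errors, so the maximising value satisfies the first-order condition:

```latex
% With a uniform prior, the posterior for V is proportional to the
% joint error density, so the maximising value \hat{V} satisfies
\frac{d}{dV}\prod_{i=1}^{n}\phi(v_i - V) = 0
\quad\Longleftrightarrow\quad
\sum_{i=1}^{n}\frac{\phi'\!\left(v_i - \hat{V}\right)}{\phi\!\left(v_i - \hat{V}\right)} = 0
```

where φ denotes the (as yet unspecified) error density, assumed common to all observations; the second form follows by differentiating the logarithm of the product.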

To break the impasse, Gauss did something rather ad hoc. Instead of specifying a form of error distribution and then deriving the optimal parameter estimate that it generated, he specified a particular form of parameter estimate (the arithmetic mean of the sample), and then derived the form of error distribution that implied this form of parameter estimate was optimal. He found that the arithmetic mean of the sample maximises the joint distribution of the errors only when the error distribution has the form:

f(e) = (h/√π) exp(−h^{2}e^{2}), where h is a positive constant measuring the precision of the observations (in modern notation, h^{2} = 1/(2σ^{2})).

This is the normal distribution, though at this point in time no particular importance was attached to it. A probabilistic rationale for Legendre’s least squares method had now been produced: Legendre had shown that the arithmetic mean was the least squares estimator, and Gauss had now shown that the least squares estimator was the optimal estimator when the error distribution was normal. However, Gauss had not provided any good reason why the normal distribution was the best choice of error distribution. For as long as the normal distribution was just an arbitrary assumption for the error distribution, a significant ‘so what’ would hang over Gauss’s intriguing finding. ^{[1]}
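The link between the normal error distribution and least squares can be seen directly: the logarithm of the joint density of normal errors is, up to a constant, proportional to minus the sum of squared errors, so maximising the one is exactly minimising the other. The sketch below (observation values and the precision constant are invented for the illustration) checks numerically that the maximiser is the sample mean:

```python
import numpy as np

# Hypothetical observations and an assumed precision constant h.
v = np.array([4.9, 5.2, 5.0, 5.3, 4.8])
h = 1.0

def log_joint_density(V, obs):
    # log of prod (h/sqrt(pi)) * exp(-h^2 (v_i - V)^2): a constant
    # minus h^2 times the sum of squared errors.
    return float(np.sum(np.log(h / np.sqrt(np.pi)) - h**2 * (obs - V) ** 2))

# Scan candidate estimates and pick the maximiser of the joint density.
grid = np.linspace(v.min(), v.max(), 10001)
V_hat = grid[np.argmax([log_joint_density(V, v) for V in grid])]

# The maximiser coincides with the arithmetic mean, i.e. with the
# least squares estimator.
print(V_hat, v.mean())
```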

This ‘so what’ did not hang in the air for long. Laplace had returned to the subject of mathematical statistics during the first decade of the nineteenth century, and, without any knowledge of the work of Gauss, published in 1810 his ‘Mémoire sur les approximations des formules qui sont fonctions de très grands nombres et sur leur application aux probabilités’.^{15} This paper contained Laplace’s most important contribution to probability and statistics: the Central Limit Theorem. Speaking loosely, this said that the sum of a large number of independent random variables will be approximately normally distributed, almost regardless of the distributions of the individual variables.
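The phenomenon is easy to see by simulation. The sketch below (all sizes and the choice of exponential variables are illustrative) sums many independent, decidedly non-normal variables and compares the tail of the standardised sums with the normal benchmark:

```python
import numpy as np

# Sums of many independent exponential variables, a skewed, non-normal
# distribution; sample sizes here are illustrative.
rng = np.random.default_rng(0)
n_terms, n_sums = 500, 10000
sums = rng.exponential(scale=1.0, size=(n_sums, n_terms)).sum(axis=1)

# Standardise the sums and compare the frequency beyond one standard
# deviation with the normal benchmark of roughly 0.159.
z = (sums - sums.mean()) / sums.std()
frac_above_one = float(np.mean(z > 1.0))
print(frac_above_one)
```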

This suddenly propelled the normal distribution from obscurity to centre stage. De Moivre’s derivation of the normal distribution as the limiting case of the binomial distribution almost 100 years earlier could now be seen as just one example of an all-pervasive phenomenon. In an instant, Gauss’s work on normal error distributions went from being an interesting piece of ad hoc analysis to being the defining statement of statistical inference. By Laplace’s Central Limit Theorem, error distributions could be assumed to be normal (for large samples). In that case, Gauss’s optimal estimator for the population mean was the arithmetic mean, which Legendre had shown was the least squares estimator. An elegant and profound synthesis had been achieved that would form part of the permanent foundation of inferential statistics.