From Probability to Statistics, 1764-1810
Bayes and His Application of His Theorem (1764)
Jacob Bernoulli’s binomial analysis in Ars Conjectandi had been driven by the aim of making a breakthrough in the theory of statistical inference. He ultimately failed in this objective, although he established some other fundamental probability laws along the way. Thomas Bayes built on Bernoulli’s binomial analysis and, using the novel physical analogy of a table instead of an urn, introduced a new conceptual framework for inferential methods.
Bayes was a Fellow of the Royal Society but by all accounts led a quiet life as an English clergyman, largely detached from the intellectual fervour of the period. His now famous paper ‘An essay towards solving a problem in the doctrine of chances’ was unpublished during his lifetime. It, along with his other papers and possessions, was bequeathed to a fellow clergyman, Richard Price. Price was a significant mathematician and philosopher and he, like Bayes, was also a Fellow of the Royal Society. As a thinker aware of the promise of statistical inference for empirical science, Price immediately recognised the value of Bayes’ paper. He edited it and presented it to the Royal Society for publication in its journal some two years after Bayes’ death in 1761.
Bayes’ paper was both ahead of its time in concept and yet presented in an anachronistic geometric style. Both these properties led to the paper being somewhat impenetrable. The nineteenth-century mathematical historian Todhunter wrote that the paper’s introductory discussion of the general laws of probability was ‘excessively obscure, and contrasts most unfavourably with the treatment of the same subject by De Moivre’. But the paper’s statement of intent was quite clear:
Given the number of times in which an unknown event has happened and failed. Required the chance that the probability of it happening in a single trial lies somewhere between any two degrees of probability that can be named.
In essence, how can the population probability distribution be inferred from the observation of a limited sample of outcomes from that population? Richard Price viewed statistical inference as the rational solution to David Hume’s sceptical problem of induction, in which it was argued that no amount of empirical observation of the past could provide knowledge of what would happen in the future. In Price’s introduction to the paper, he argued that the problem stated above by Bayes is ‘necessary to be solved in order to assure foundation for all our reasoning concerning past facts, and what is likely to be hereafter ... it is necessary to be considered by anyone who would give a clear account of the strength of analogical or inductive reasoning’. He believed that Bayes’ new approach to statistical inference provided the fundamental solution to this problem of inductive reasoning.
The paper is perhaps most well-known for its eponymous Bayes’ Theorem:
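In modern notation, for two events A and B with P(B) > 0, the theorem is usually written as follows (a standard textbook form, not Bayes’ own geometric statement):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```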
This theorem, when considered separately from its application to statistical inference, is a straightforward conditional probability statement that is uncontroversial and unambiguously derived from the basic axioms of mathematical probability. In fact, it is so rudimentary and universal that scholars are unsure whether Bayes was in fact the first to use it. But the structure of the equation—where probabilities of B conditional on A can be transformed into probabilities of A conditional on B—opens the door to what was then often referred to as ‘inverse probabilities’. Mathematical probability was about making statements about the probabilities of observing specific outcomes, given known population characteristics. Bayes’ Theorem alluded to a potential means of inverting these probability statements into ones that infer something about the probability characteristics of the population, given some observed sample outcomes.
Bayes used a physical analogy to develop his application of this probability theorem to statistical inference. Where Jacob Bernoulli had used an urn filled with black and white balls, Bayes used a table and balls. Whilst his table was two dimensional, his analogy only considered how the balls were distributed along one dimension. The use of a table rather than a one-dimensional line facilitated a geometric interpretation of his probability calculations.
He supposed a first ball was randomly thrown onto the table, and that this was followed by a further n balls that were independently thrown randomly onto the table. He considered how knowledge of whether each of the subsequent n balls landed to the right or left of the first ball’s position could be used to make a statement about the probability distribution of θ, the position along the ‘x-axis’ of the table of the first ball. This set-up is summarised in Fig. 2.1.
Fig. 2.1 Bayes' table
Bayes wished to infer the probability distribution of the position, θ, of the first ball along the line AB, given only the number of times that the n subsequent random balls landed to the right or left of the first ball. For any given value of θ, the probability β of a subsequent ball landing to the right of the first ball is simply the ratio of the line distances θB/AB (as the ball’s position on AB is uniformly distributed). For a given θ, the number of times, M, that the n subsequent balls land to the right of the first ball is therefore binomially distributed with parameters n and β. So, we can write:
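In modern notation, writing θ for the position of the first ball and β for the probability of a later ball landing to its right, and assuming for illustration that AB has unit length with θ measured from A (so that β = 1 − θ), the binomial statement can be sketched as:

```latex
P(M = x \mid \theta) = \binom{n}{x}\, \beta^{x} (1-\beta)^{n-x},
\qquad \beta = 1 - \theta, \quad x = 0, 1, \dots, n
```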
So far, this looks extremely similar to Jacob Bernoulli’s work on the distribution of the number of black balls chosen from an urn that has a known proportion of black and white balls. But Bayes had merely laid the groundwork for his breakthrough step. He then noted that his theorem allowed him to find an expression for the probability that θ is between two points b and f on the line AB:
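A sketch of such an expression in modern notation, writing θ for the position of the first ball and assuming for illustration that AB has unit length, so that the uniform prior density of θ equals 1 on [0, 1]:

```latex
P(b < \theta < f \mid M = x)
  = \frac{\int_{b}^{f} P(M = x \mid \theta)\, d\theta}{P(M = x)}
```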
By construction of his physical example, where each ball is independently and randomly thrown onto the table, the position of every ball on the line AB is uniformly distributed. This means that the unconditional probability of M = x is 1/(n+1) for all x. The number of balls out of a total of n that land to the right of the first ball can take (n+1) different values (0, 1, 2, ..., n); and because all n+1 balls are thrown in the same way, the first ball is equally likely to occupy any of the n+1 positions in the left-to-right ordering, so each of these outcomes is equally likely. With this and the above equation, Bayes obtained:
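In modern notation the result can be sketched as follows, writing θ for the position of the first ball and assuming for illustration that AB has unit length with θ measured from A, so that a later ball lands to the right of the first with probability 1 − θ:

```latex
P(b < \theta < f \mid M = x)
  = (n+1) \binom{n}{x} \int_{b}^{f} (1-\theta)^{x}\, \theta^{\,n-x}\, d\theta
```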
Bayes had provided a rigorous mathematical solution for the probability distribution of the location, θ, of the first ball along the line AB, conditional on the observed results of how many of the subsequent n randomly distributed balls were positioned to the left or right of the first ball. Today, this distribution is known as the beta distribution. The integral is intractable for large values of n and (n - x), but is very straightforward for relatively small values. For example, consider the case of n = 6 given in Fig. 2.1 above. Without any knowledge of the distribution of the subsequent balls, the probability of θ being in the right-hand half of the line AB is simply 1/2. If three of the six balls were observed to land to the right of θ, the above distribution implies that the probability would, intuitively enough, remain 1/2. However, if, instead, four of the balls were found to have landed to the right of θ, the probability of θ being in the right-hand half of AB would be reduced to 29/128 (slightly less than a quarter).
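The figures in this example can be checked with exact rational arithmetic. The following is a minimal sketch, not Bayes’ own method: it assumes AB has unit length with θ measured from A (so a later ball lands to the right of the first with probability 1 − θ), and the function name `posterior_prob` is purely illustrative.

```python
from fractions import Fraction
from math import comb


def posterior_prob(n, x, lo, hi):
    """P(lo < theta < hi | M = x) under a uniform prior on [0, 1].

    Evaluates (n + 1) * C(n, x) * integral of (1 - t)^x * t^(n - x) dt
    exactly, by expanding (1 - t)^x with the binomial theorem and
    integrating term by term.
    """
    def antiderivative(t):
        t = Fraction(t)
        return sum(
            comb(x, k) * (-1) ** k * t ** (n - x + k + 1) / (n - x + k + 1)
            for k in range(x + 1)
        )

    return (n + 1) * comb(n, x) * (antiderivative(hi) - antiderivative(lo))


half = Fraction(1, 2)
print(posterior_prob(6, 3, half, 1))  # 1/2
print(posterior_prob(6, 4, half, 1))  # 29/128

# The posterior integrates to 1 for every x, which is equivalent to the
# unconditional probability of M = x being 1/(n + 1).
print(all(posterior_prob(6, x, 0, 1) == 1 for x in range(7)))  # True
```

Exact fractions are used here because n is small; for large n the alternating sum becomes unwieldy, which echoes the computational difficulty noted above.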
Thus far, Bayes’ work is a novel but still uncontroversial application of mathematical probability; within the physical framework set out, his results are unambiguously rigorous mathematics. In Bayes’ physical model of the table with randomly thrown balls, a ‘uniform prior distribution’ for θ arises by construction. It has an unambiguous physical interpretation that derives from the set-up that the balls are randomly thrown onto the table.
Bayes argued that the ‘inversion’ logic of his theorem was naturally applicable to statistical inference, that is, to making conditional statements about the properties of unknown populations (analogous to θ on his table) based on observed sample data (the position of the n balls). The more general application of this approach to the problem of statistical inference required treating the unknown parameters of the population that were being estimated as random variables (θ in the above example). The prior probability distribution would be a description of the information (or lack thereof) known about the parameter before the statistical observations were considered. The posterior distribution was the parameter’s probability distribution after being updated to reflect the information in the statistical observations. In his physical analogy, Bayes’ set-up had implied a uniform prior and he had derived a beta posterior distribution.
This ‘Bayesian’ approach to statistical inference involves two fundamental steps that have provoked much philosophical and mathematical controversy, particularly in the twentieth century when it was criticised by leading statistical thinkers such as Sir Ronald Fisher. The first issue is how to interpret the idea of modelling the unknown (but fixed constant) population parameter as a random variable. This made sense in Bayes’ analogy, where the position θ was physically distributed with a given probability distribution. But the standard deviation, say, of a normal distribution is not distributed in this way; it is simply an unknown fixed quantity. The use of a probability distribution to represent the degree of knowledge or prior information available about the parameter is a conceptual leap. To move from Bayes’ physical analogy to the application of Bayes’ Theorem in more general statistical inference, the physical randomness of Bayes’ ball must be translated into an epistemological statement of ignorance. This translation is in some sense philosophical rather than purely mathematical, and as such its viability is open to some debate.
The second issue closely follows: in the absence of any knowledge or information about the parameter, is it appropriate to assume the prior distribution is uniform? Bayes appears to have thought deeply about this question and he tackled it head on in the paper. He argued that for events where ‘we absolutely know nothing antecedently to any trials made concerning it’ then ‘concerning such an event, I have no reason to think that, in a certain number of trials, it should rather happen any one possible number of times than another’. This assertion of equiprobability straightforwardly translates into the assumption of a uniform probability distribution.
In modern times, this is sometimes referred to as the Principle of Insufficient Reason and it may still be invoked as the logical basis for the uniform prior distribution. It has been speculated that Bayes’ lack of confidence in the soundness of this argument restrained him from publishing his paper during his lifetime, but this is largely conjecture. It is equally possible that his inability to find a workable mathematical approximation for his beta posterior distribution was the main reason for his reluctance to publish. In any case, this argument attracted little dissent until around 100 years after the publication of the paper. As we shall see in the next section, the Bayesian uniform prior was adopted whole-heartedly by no lesser mathematicians than Laplace and Gauss in the decades following Bayes’ paper. In 1854, George Boole, an English mathematician and philosopher (and the father of Boolean logic), was the first to publish a significant critique of the logic of Bayesian statistical inference, arguing:
It has been said that the principle involved ... is that of the equal distribution of our knowledge, or rather of our ignorance—the assigning to different states of things of which we know nothing, and upon the very ground that we know nothing, equal degrees of probability. I apprehend, however, that this is an arbitrary method of procedure.
The philosophical argument that the assumption of a uniform prior distribution was in fact an arbitrary and therefore illegitimate general approach to statistical inference gained further traction through the latter part of the nineteenth century and the first half of the twentieth century. Sir Ronald Fisher, who was one of the most important statistical thinkers of the twentieth century, famously argued that, in the absence of any prior knowledge, a squared sine function was as logical a candidate for the prior distribution as the uniform distribution. In twentieth-century statistical practice, Bayesian statistical inference started to be superseded by other methodologies, such as maximum likelihood, that were argued to be less subjective or arbitrary. But the controversy rages on, and the Bayesian method has become increasingly popular once again in the twenty-first century (so far!).