Jacob Bernoulli and the First Law of Large Numbers (1692/1713)
Jacob Bernoulli (1654-1705) was the first in a veritable dynasty of Bernoullis who made significant contributions to mathematics and physics in the eighteenth century. At least five of the clan published on probability, but none with greater influence than Jacob. His book, Ars Conjectandi, is a seminal piece in the history of probability and statistics. It likely was written over the 1680s and 1690s—and the most important part, the first limit theorem in probability, is believed to have been written in 1692—but it was never published in Bernoulli’s lifetime. His nephew Nicolas, another famous mathematical Bernoulli (of whom more will be said later), edited and finally published the book in 1713, several years after Jacob’s death.
The first part of Ars Conjectandi considers combinatorial problems and games of chance similar to those treated by Pascal, Fermat and Huygens over the preceding decades. This section of the book is most notable for containing the first derivation of the binomial probability distribution. But the most important section of the book is in Chap. 4, where Bernoulli derives the first limit theorem in probability: Bernoulli’s Theorem, or what is often now referred to as the weak law of large numbers. The theorem is the first to consider the asymptotic behaviour of a random sample from a population with known characteristics.
Bernoulli uses the analogy of an urn filled with black and white balls, where the proportions of black and white balls in the urn are known. Say the proportion of all balls that are black is p. A sample of n balls is randomly selected, with replacement, from the urn. The number of black balls, m, among the n sampled balls is a random variable, as is the sampled proportion of black balls, m/n. At the time of writing, the intuition already existed that a larger sample size n would make m/n a more stable and reliable estimate of p. Bernoulli writes: ‘For even the most stupid of men, by some instinct of nature, by himself and without any instruction (which is a remarkable thing), is convinced that the more observations have been made, the less danger there is of wandering from one’s goal.’
But no quantification of how the accuracy of a sample increased with sample size had been developed. Even more vitally, it had yet to be determined whether there was some fundamental limit to the amount of certainty that a random sample could provide, even for very large sample sizes. Bernoulli’s treatment of these questions arguably signifies the moment where probability emerges as a fully formed branch of applied mathematics.
Mathematically, Bernoulli’s Theorem, or the weak law of large numbers, states:
$$\lim_{n \to \infty} P\left( \left| \frac{m}{n} - p \right| < \varepsilon \right) = 1$$
where p, m and n are defined as above, and ε is an arbitrarily small positive number.
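The statement can be illustrated numerically. The following Python sketch simulates the urn experiment and estimates, for several sample sizes, how often the sampled proportion m/n lands within a tolerance of p. The function names, the seed, the tolerance of 0.02 and the number of repetitions are all illustrative assumptions and do not come from Bernoulli's text.

```python
import random

def sample_proportion(p, n, rng):
    """Draw n balls with replacement; return the proportion m/n that are black."""
    m = sum(1 for _ in range(n) if rng.random() < p)
    return m / n

def fraction_within(p, n, eps, trials, rng):
    """Estimate P(|m/n - p| < eps) by repeating the n-ball experiment `trials` times."""
    hits = sum(
        1 for _ in range(trials)
        if abs(sample_proportion(p, n, rng) - p) < eps
    )
    return hits / trials

# Assumed settings: p = 0.6, tolerance 0.02, 200 repetitions per sample size.
rng = random.Random(1713)
for n in (10, 100, 1000, 10000):
    estimate = fraction_within(p=0.6, n=n, eps=0.02, trials=200, rng=rng)
    print(f"n = {n:>5}: estimated P(|m/n - p| < 0.02) = {estimate:.3f}")
```

Under these assumptions the estimated probability should move towards 1 as n grows, which is the behaviour the theorem describes. A simulation of this kind only illustrates the limit; it is not a proof of it.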
The proof of the theorem was mainly a matter of combinatorial algebra. Bernoulli recognised that m would have the binomial probability distribution that he developed earlier in the book. He then used the Binomial Theorem of algebra to expand the combinatorial terms arising in the probability distribution and show how they behaved as n grew without limit.
Bernoulli’s Theorem showed that a random sample of independent trials, each with a known and constant probability of ‘success’, would provide an unbiased estimate of the population ‘success’ probability. Furthermore, it showed that the noise in the sample probability asymptotically reduced to zero as the sample size increased. This work also provided, again for the first time, quantification of the sampling error associated with samples of a given finite size n (under the assumption that the population probability, p, is known). Bernoulli developed some quantitative examples. He considered the case where p = 0.6, and he asked how big n must be in order for the sample probability, m/n, to be between 29/50 and 31/50 with a probability of 9999/10,000 (a probability level Bernoulli referred to as moral certainty). He calculated that n needed to be at least 25,550. In his History of Statistics (Stigler 1986), Stigler noted that Ars Conjectandi ends abruptly with this calculation, and he speculated that the magnitude of this sample size would have been disheartening at a time when examples of large-scale empirical sampling had yet to emerge.
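Bernoulli's numerical example can be checked directly against the binomial distribution. The Python sketch below sums the exact binomial probabilities of all m for which m/n lies between 29/50 and 31/50 when p = 0.6. It assumes the interval includes both endpoints, and the function names are hypothetical. It makes no attempt to reproduce Bernoulli's own bounding argument; it only evaluates the probability for a chosen n.

```python
import math

def binom_log_pmf(m, n, p):
    """Log of the binomial probability of exactly m successes in n trials."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
    return log_coeff + m * math.log(p) + (n - m) * math.log(1 - p)

def prob_within(n, p=0.6, lo_num=29, hi_num=31, den=50):
    """P(lo_num/den <= m/n <= hi_num/den) for m ~ Binomial(n, p).

    The bounds are handled in integer arithmetic so that values of m lying
    exactly on an endpoint are included (an assumption about the interval).
    """
    m_lo = -(-lo_num * n // den)  # smallest m with m/n >= lo_num/den (ceiling)
    m_hi = hi_num * n // den      # largest m with m/n <= hi_num/den (floor)
    return math.fsum(math.exp(binom_log_pmf(m, n, p)) for m in range(m_lo, m_hi + 1))

MORAL_CERTAINTY = 9999 / 10000

for n in (100, 1000, 25550):
    prob = prob_within(n)
    print(f"n = {n:>5}: P(29/50 <= m/n <= 31/50) = {prob:.6f}, "
          f"morally certain: {prob >= MORAL_CERTAINTY}")
```

Working with logarithms of the probabilities avoids overflow in the binomial coefficients, which become astronomically large for n in the tens of thousands.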
Bernoulli’s Theorem fundamentally requires the population probability, p, to be known. The theorem tells us about the behaviour of a sample when the characteristics of the population are already given. The inverse problem of statistical inference, that of developing estimates of the characteristics of a population given the observed sample, is not advanced by the theorem. It does tell us that an infinite sample will provide an unbiased estimate of the population probability, but it says nothing about the behaviour of finite samples from an unknown population, which is the crux of statistical inference. For example, if we do not know p, and we observe m = 4 from a sample of size n = 10, the theorem does not tell us anything about the interval we can infer for p with some specified probability from these observations, or even that 0.4 is the ‘best estimate’ of p. Bernoulli’s writing suggests he was aware of this limitation, but he also understood that a solution to the statistical inference problem was the greater prize, and he tried hard to find applications of his theorem to it.
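The m = 4, n = 10 example can be made concrete with a few lines of Python. The sketch below evaluates the binomial probability of that exact outcome under several candidate values of p. The candidate values and the function name are illustrative assumptions. The calculation runs in the direction the theorem supports, from a known p to the sample, and is not an inference procedure.

```python
import math

def prob_of_outcome(m, n, p):
    """Binomial probability of exactly m successes in n independent trials."""
    return math.comb(n, m) * p**m * (1 - p) ** (n - m)

# Candidate population proportions, chosen purely for illustration.
for p in (0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"p = {p:.1f}: P(m = 4 | n = 10) = {prob_of_outcome(4, 10, p):.4f}")
```

Each candidate gives the observed outcome a non-negligible probability, so the sample alone does not single out one value of p, and the theorem offers no rule for choosing among them.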
Bernoulli’s efforts to wring some application to statistical inference out of his theorem led to a series of letters between Leibniz and Bernoulli during 1703. Leibniz disputed that Bernoulli’s Theorem could be meaningfully applied to statistical inference. Bernoulli claimed in a letter: ‘Had I observed it to have happened that a young man outlived his respective old man in 1,000 cases, for example, and to have happened otherwise only 500 cases, I could safely enough conclude that is twice as probable that a young man outlives an old man as it is that the latter outlives the former’. But he was not able to mathematically define ‘safely enough’, nor was he able to define in what sense the 2-1 probability estimate in his example was the ‘best’ estimate. Leibniz used an analogy: for any finite number of sample points, an infinite number of curves can be made to pass exactly through all of them, and there was no means of establishing which one is best. Similarly, for a finite sample that gives a particular sample probability m/n, there are many values of p that feasibly could have created such an outcome, and so the choice of value remains arbitrary. Concepts like least squares and maximum likelihood, which are foundations of statistical inference, would not start to appear for at least another 100 years.
- Stigler (1986).