Statistics Is for Testing (Lack of) Randomness
Statistics is not good at detecting fudged data. However, testing randomness is possible. Randomness in a randomized controlled trial means:
- a representative sample is “drawn at random” from a target population,
- each member of the target population has an equal chance of being selected,
- if other criteria for selection are applied, the result will not be the effect of treatment, but the effect of bias,
- the theory of statistical testing is based on randomness (see Chap. 1),
- unrandom data produce p-values that are pretty meaningless.
Two important causes of unrandomness are named here. The first cause is extreme inclusion criteria. An example is given. In 1991 Kaariainen published an interesting study in Scand J Gastroenterol (1991; 23: 58-66). A controlled clinical trial of Helicobacter-associated gastric bleedings was analyzed according to two different procedures, one applying strict inclusion criteria and the other applying fairly loose ones. As expected, the strict criteria had a huge effect on the number of patients excluded from the study.
- With strict inclusion criteria, 285 patients had to be excluded.
Complications, mainly in the form of bleedings, were observed in only two patients, which was 1.7 % of the studied population (only “superman” subjects were left in the trial).
- With loose inclusion criteria, 4 patients had to be excluded.
Complications in the form of bleedings were observed in 71 patients, which was 18 % of the studied population.
The authors of the above study concluded that a complication rate of only 1.7 % was not representative of the target population of this study. If you carry a briefcase full of exclusion criteria, then your trial data will be at risk of not being representative.

The second cause of unrandomness in a controlled clinical trial is inadequate data cleaning.

An example of inadequate data cleaning is provided by an otherwise highly respected scientist and great geneticist, the Augustinian friar Gregor Mendel. About 150 years ago, he used unselected samples of peas with different phenotypes. The results of his interbreedings were very close to what he expected. A simple chi-square test suggests that he somewhat misrepresented the data: the results were closer to expectation than could happen by randomness. For an explanation, see Statistics applied to clinical studies, 5th edition, Chap. 11, entitled Data closer to expectation than compatible with random sampling (2012, Springer Heidelberg Germany, from the same authors).
Gregor Johann Mendel (20 July 1822 - 6 January 1884) was a German-speaking Moravian scientist and Augustinian friar who gained posthumous fame as the founder of the modern science of genetics. Though farmers had known for centuries that crossbreeding of animals and plants could favor certain desirable traits, Mendel's pea plant experiments conducted between 1856 and 1863 established many of the rules of heredity, now referred to as the laws of Mendelian inheritance.
“Fudged” data cannot be identified by any statistical test, but you can assess whether your data are compatible with randomness. Many tests for assessing randomness are available. We will name a few of them.
- 1. Chi-square goodness of fit test.
- 2. Kolmogorov-Smirnov test.
- 3. Shapiro-Wilk test.
- 4. Exponential survival pattern: if the log-transformed survival data follow a straight line, the survival pattern is exponential, which supports randomness of the survival data.
- 5. Extreme p-values and/or standard deviations. If a confirmative trial is expected to produce a p-value of approximately 0.01, then you will have less than 5% chance of a p-value <0.0001. Such p-values are not compatible with randomness.
- 6. Investigating final digits of the main result values.
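The fourth method in the list above can be sketched in a few lines of Python. This is a minimal illustration, not an analysis of real data: the survival fractions are generated from an exact exponential curve with an assumed hazard of 0.2 per time unit, and `fit_line`, `HAZARD`, `times`, and `surviving` are hypothetical names introduced only for this example. With real survival data, a clearly curved log plot (an R-squared well below 1) would argue against an exponential pattern.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical illustration: fractions surviving at times 0..5, generated
# from an exponential curve with an assumed hazard (not taken from any trial).
HAZARD = 0.2
times = [0, 1, 2, 3, 4, 5]
surviving = [math.exp(-HAZARD * t) for t in times]

# Log-transform the survival fractions and check how well a line fits them.
log_surviving = [math.log(s) for s in surviving]
slope, intercept, r_squared = fit_line(times, log_surviving)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.4f}")
```

Because the generated data are exactly exponential, the fitted slope equals minus the assumed hazard and the line fits perfectly; real data will scatter around the line.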
An example of the above 6th method for demonstrating randomness is given.
In a statin trial, the results consisted of 96 risk ratios (RRs). It was observed that the final digit was often a 9 or a 1: for example, RRs of 0.99/0.89/1.01/1.011 were frequently observed. We can check the accuracy of these result-data with the help of a multiple comparison chi-square test, according to the table below.
| Final digit of RR | Observed frequency | Expected frequency | (observed - expected)² / expected |
|---|---|---|---|
| 0 | 24 | 9.6 | 21.6 |
| 1 | 39 | 9.6 | 90.0 |
| 2 | 3 | 9.6 | 4.5 |
| 3 | 0 | 9.6 | 9.6 |
| 4 | 0 | 9.6 | 9.6 |
| 5 | 0 | 9.6 | 9.6 |
| 6 | 0 | 9.6 | 9.6 |
| 7 | 1 | 9.6 | 7.7 |
| 8 | 2 | 9.6 | 6.0 |
| 9 | 27 | 9.6 | 31.5 |
| Total | 96 | 96.0 | 199.7 |
The above table is tested with chi-square. The differences between the observed and expected frequencies are much larger than could happen by chance. With 10 - 1 = 9 degrees of freedom, the probability that differences this large could happen by chance, if the null hypothesis were true, would be <0.001. The conclusion can be that the frequency distribution of the final digits in this study is not random. This would mean that the validity of this study is in jeopardy.
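The arithmetic behind the final-digit table can be reproduced with a short Python sketch. It assumes, as the table does, that each of the ten final digits is equally likely under the null hypothesis (expected frequency 96/10 = 9.6). The function name `chi_square_uniform` is hypothetical, and the critical value 27.88 is the standard tabulated chi-square value for 9 degrees of freedom at p = 0.001, hard-coded here rather than computed.

```python
# Observed counts of final digits 0..9 among the 96 risk ratios (from the table).
observed = [24, 39, 3, 0, 0, 0, 0, 1, 2, 27]


def chi_square_uniform(counts):
    """Chi-square goodness-of-fit statistic against equal expected counts.

    Returns (statistic, degrees_of_freedom, per-category contributions).
    """
    expected = sum(counts) / len(counts)
    contributions = [(o - expected) ** 2 / expected for o in counts]
    return sum(contributions), len(counts) - 1, contributions


statistic, df, contributions = chi_square_uniform(observed)

# Tabulated critical value for 9 degrees of freedom at p = 0.001 (assumed
# from standard chi-square tables, not derived in this sketch).
CRITICAL_P001_DF9 = 27.88

for digit, (count, part) in enumerate(zip(observed, contributions)):
    print(f"digit {digit}: observed {count:2d}, contribution {part:5.1f}")
print(f"chi-square = {statistic:.1f} with {df} degrees of freedom")
print("p < 0.001" if statistic > CRITICAL_P001_DF9 else "p >= 0.001")
```

Summing the unrounded contributions gives a statistic of about 199.8; the table's total of 199.7 results from adding the rounded column entries. Either value lies far above the critical value, in line with p < 0.001.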