# Statistics Is to Better Understand the Limitations of Clinical Research

Traditionally, clinical investigators use statistics like a lamp standard: for support rather than illumination (an expression of Hills, statistician, London UK, 1980).

In 1948, one of the first randomized controlled trials was published: the Streptomycin trial, Br Med J 1948; 2: 769. In the first years, trials were quite often negative, and this was due to:

- (1) small samples,
- (2) inappropriate hypotheses,
- (3) based on biased prior data.

Subsequently, further flaws were increasingly recognized:

- (1) interaction,
- (2) time effects,
- (3) negative correlations,
- (4) asymmetries of characteristics of treatment groups.

Nowadays, randomized controlled trials are seldom negative anymore. They are called confirmative, rather than explorative research.

In the past decades, we have also come a long way to better understand the limitations of clinical research, and this is, largely, thanks to statistical methods. We will give some examples.

- The medical literature is snowed under with mortality trials.
- Invariably, a 10-30 % relative rise in survival is observed after treatment.
- Mortality may be an important endpoint, given the above.
- However, a relative rise in survival of 30 % may equal an absolute risk reduction of only 1 %. The problem is this: if you go from a 3 % to a 2 % risk of death, then the absolute difference is 1 %, while the relative difference is 33 %.
- Besides, mortality is a pretty insensitive variable for a study begun at middle age, like most studies. At that age, comorbidity is huge, and the risk of death due to comorbidity is correspondingly huge. This reduces the sensitivity of such trials.
- A more sensitive endpoint in studies started in middle-aged subjects would be morbidity. However, the pharmaceutical industry and the drug administration services require mortality studies.
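
The arithmetic behind the 3 % versus 2 % example can be checked with a minimal sketch. The function name `risk_reductions` is an illustrative choice, not taken from the text.

```python
def risk_reductions(control_risk, treated_risk):
    """Return (absolute risk reduction, relative risk reduction)."""
    absolute = control_risk - treated_risk
    relative = absolute / control_risk
    return absolute, relative

# Risk of death falls from 3 % to 2 %, as in the example above.
arr, rrr = risk_reductions(0.03, 0.02)
print(f"absolute reduction: {arr:.1%}, relative reduction: {rrr:.1%}")
```

The same one-percentage-point gain looks about 33 times larger when it is reported on the relative scale.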

Regarding mortality trials, an important issue in clinical research, we should add that most patients would prefer a better quality of life to a tiny increase in survival time, e.g., from 99 to 100 years of age. Also, as illustrated above, relative risks are overemphasized compared to absolute risks in the medical literature.

It is time that we reviewed some limitations of statistical methodologies. First, we will address the so highly esteemed p-value. It is based on the null hypothesis, which states that there is no effect in your data, or no difference from a zero effect: your new treatment doesn't work, or your control group is not different from your intervention group. As an example, McCarthy, in the article Evil p-values (occamstypewriter.org), compared 50 brain surgeons with 50 rocket scientists, and found that their intelligence quotients were not significantly different, with means of 112 versus 114. When the comparison was repeated with 1000 versus 1000, a very significant difference was observed, with means of 110 versus 111. The conclusion here was obvious: the null hypothesis had to be rejected. The 1000 neurosurgeons were, obviously, significantly brighter than the 1000 rocket scientists, as selected.

Does that mean that, worldwide, all neurosurgeons are brighter than all rocket scientists? Probably not; this is, simply, a case of selection bias. Usually, selection bias can be overcome by increasing your sample sizes. But is that always true? Statisticians involved in big multidimensional data analysis know all about it, because they routinely observe unexpected p-values, and they hate it, because unexpected significances indicate that groups are no longer comparable when it comes to main endpoint comparisons. Regarding the neurosurgeons and the rocket scientists, you probably would need an international sample of 10,000 or more to make the samples representative, and to make the significance of the difference disappear.
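
How a smaller difference can become "very significant" simply through a larger sample can be sketched as follows. The text gives no standard deviation, so the value of 7 IQ points used here is purely an assumption, chosen so that the sketch behaves like the example; a two-sample z-test with equal group sizes is likewise assumed.

```python
import math

def two_sample_z_p(mean_a, mean_b, sd, n):
    """Two-sided p-value for a difference of two means (equal n and SD per group)."""
    se = sd * math.sqrt(2.0 / n)               # standard error of the difference
    z = (mean_a - mean_b) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area

ASSUMED_SD = 7.0  # hypothetical; not given in the text

p_small = two_sample_z_p(112, 114, ASSUMED_SD, 50)     # 50 versus 50
p_large = two_sample_z_p(110, 111, ASSUMED_SD, 1000)   # 1000 versus 1000
print(f"n = 50:   p = {p_small:.3f}")
print(f"n = 1000: p = {p_large:.4f}")
```

Under this assumption, the two-point difference in 50 subjects per group is not significant, while the one-point difference in 1000 subjects per group falls well below 0.01.
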
So, our first limitation of statistics is what we call the evil p-values. A list of additional limitations of statistics is given underneath.

1. Evil p-values. In some fields, e.g., the field of psychology, banning p-values has been suggested, and confidence intervals are given as a somewhat better alternative (Trafimow and Marks: Banning the p-values, Bas App Psychol 2015; 37: 1-2).
2. Type I/II errors.
3. Little clinical relevance in spite of statistical significance, and relative risks irrelevant to patients.
4. Statistics gives no certainty, but only predicts chances, on the understanding that
   - H0 is untrue,
   - H1 is true,
   - data follow a normal distribution,
   - data are representative of your target population,
   - your target population follows the same normal distribution as that of your data.

   (H0 = the null hypothesis of no effect; wrongly rejecting it is the type I error, the chance of finding an effect where there is none. H1 = the alternative hypothesis; wrongly rejecting it is the type II error, the chance of finding no effect where there is one.)
5. Statistics is not good at detecting “fudged” data.

The above type II error means that your trial was underpowered, and the solution here is, simply, a larger trial. A large type I error means that there is no difference/no effect, and yet a difference/effect is established. Now, what is the solution here? Large type I errors are observed with multiple testing/treatments (see Chap. 9 for additional details). How come? If you test twice, your chance of a false positive result will be not 5 %, but rather close to 10 %. As an example of multiple treatments, consider an analysis of variance of a study assessing three groups of patients treated for anemia. The analysis of the data is given underneath.

ANOVA (analysis of variance)

| Group | n | mean | SD |
|---|---|---|---|
| 1 | 16 | 8.7125 | 0.8445 |
| 2 | 16 | 10.6300 | 1.2841 |
| 3 | 16 | 12.3000 | 0.9419 |

grand mean 10.4926

$\text{SS between groups} = 16\,(8.7125 - 10.4926)^2 + 16\,(10.6300 - \ldots$

$\text{SS within groups} = 15 \times 0.8445^2 + \ldots$

$F = \dfrac{\text{SS between}/\text{dfs}}{\text{SS within}/\text{dfs}} = 49.9 \Rightarrow p < 0.01$

(SS = sums of squares, ANOVA = analysis of variance, n = sample size, SD = standard deviation, dfs = degrees of freedom)
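
The F ratio above can be recomputed from the summary statistics alone, as a rough sketch. It takes the group 1 mean as 8.7125 (the value used in the sums-of-squares line) and computes the grand mean from the three group means. The intermediate terms are elided in the text, so the result need not match the printed figures exactly.

```python
# (n, mean, SD) per group, taken from the table above
groups = [(16, 8.7125, 0.8445), (16, 10.6300, 1.2841), (16, 12.3000, 0.9419)]

def anova_from_summary(groups):
    """One-way ANOVA F ratio from per-group n, mean and SD."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

f_ratio, df_between, df_within = anova_from_summary(groups)
print(f"F({df_between}, {df_within}) = {f_ratio:.1f}")
```

With 2 and 45 degrees of freedom, the critical F value at the 0.01 level is roughly 5.1, so an F ratio of this size is consistent with p < 0.01.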

The conclusion of the above ANOVA is: a significant difference exists between the three treatments, but where is it?

- between group 1 versus 2? => t-test => t = mean diff/SEM = 1.9175/1.536 => ns
- between group 2 versus 3? => t = 1.6700/1.592 => ns
- between group 1 versus 3? => t = 3.5875/1.265 => p < 0.01

(Diff = difference, SEM = standard error of the mean, ns = not significant)

And, thus, a p-value < 0.01 is observed, which is highly significant, but unadjusted for multiple treatments. If the agreed chance of a false positive result in this study is 5 % with 1 test, then it is about 10 % with 2 tests, and about 15 % with 3 tests.
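
The figures of 10 % and 15 % are simple sums of the per-test chances. A minimal sketch, assuming the tests are independent (an assumption not stated in the text), compares these sums with the exact chance of at least one false positive, $1 - (1 - \alpha)^k$:

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive in k independent tests."""
    return 1.0 - (1.0 - alpha) ** k

ALPHA = 0.05
for k in (1, 2, 3):
    additive = k * ALPHA                  # the simple sum used in the text
    exact = familywise_error(ALPHA, k)    # exact value under independence
    print(f"{k} test(s): sum = {additive:.3f}, exact = {exact:.4f}")
```

Under independence the exact values are about 9.8 % for 2 tests and 14.3 % for 3 tests, so the simple sums are a slightly conservative approximation.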

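Bonferroni recommends rejecting the null hypothesis H0 at a lower significance level, according to the equation (k = number of treatment groups, so that k(k-1)/2 is the number of pairwise tests):

$$\text{rejection } p\text{-value} = 0.05 \times \frac{2}{k\,(k-1)}$$

With 3 groups, and thus 3 tests, the rejection p-value $= 0.05 \times 2/(3\,(3-1)) = 0.0166$.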

We can now conclude that the smallest calculated p-value (< 0.01) is still smaller than the rejection p-value of 0.0166. And so, the H0 can still be rejected, but the result is no longer highly significant, just borderline significant.
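
The rejection level can be reproduced with a short sketch. It reads k as the number of treatment groups, so that k(k-1)/2 counts the pairwise tests; this reading is an interpretation on our part, and the function name is illustrative.

```python
from math import comb  # available from Python 3.8

def bonferroni_rejection_level(alpha, n_groups):
    """Per-test rejection p-value when all pairs of groups are compared."""
    n_tests = comb(n_groups, 2)   # k (k - 1) / 2 pairwise comparisons
    return alpha / n_tests

level = bonferroni_rejection_level(0.05, 3)
print(f"rejection p-value with 3 groups: {level:.4f}")
```

Any pairwise p-value has to fall below this level, rather than below 0.05, before H0 is rejected.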

Alternative methods for analyzing multiple testing do exist: the Student-Newman-Keuls test, Tukey’s test (HSD, honestly significant difference), the Dunnett test, Hochberg’s procedure, and Hotelling’s T-square. More details are in Chap. 9. Still another alternative is to integrate data informally, and to look for trends without judging one low p-value among otherwise high p-values as proof. The problem here is that investigators and physicians generally want neither soft data nor meaningless p-values.

Increasingly popular in the medical literature is the composite endpoint methodology. In the analysis, only the composite is tested. Generally, the p-value is lower than it is with Bonferroni and LSD (least significant difference) procedures, because of the generally positive correlation between repeated observations in one person. A few examples of composite endpoints are given:

1. With a lipid study, a composite variable of various lipid variables could be (cholesterol + HDL cholesterol + LDL cholesterol + triglycerides).
2. With a rheumatoid arthritis study, a composite variable could be the Disease Activity Score, defined as the composite of (1) the joint pain score + (2) the number of swollen joints + (3) the erythrocyte sedimentation rate. Please note that, if scales are different, the separate variables must be standardized. This composite variable was used by Vitali et al. in the study underneath.

*Vitali C, Bencivelli W, Isenberg DA, Smolen JS, Snaith ML, Sciuto M, Neri R, Bombardieri S. Rheumatology Unit, University of Pisa, Italy.*

*Clinical and Experimental Rheumatology 1992; 10(5): 541-547. Type: Consensus Development Conference, Journal Article, Multicenter Study, Review. A European Consensus Group study, involving ...*

3. Another example is the composite endpoint of the recent lipid study underneath, consisting of all-cause mortality, recurrent stroke, and occurrence of ischemic heart disease.

Yeh PS, Yang CM, Lin SH, Wang WM, Chen PS, Chao TH, Lin HJ, Lin KC, Chang CY, Cheng TJ, Li YH. Low levels of high-density lipoprotein cholesterol in patients with atherosclerotic stroke: a prospective cohort study. Atherosclerosis 2013; 228: 472-7.

From August 2006 through December 2011, patients with acute atherosclerotic ischemic stroke were included. Total cholesterol, triglycerides, low-density lipoprotein cholesterol (LDL-C) and HDL-C were checked and National Institutes of Health Stroke Scale (NIHSS) scores were obtained at admission. The primary outcomes were a composite end point of all-cause mortality, recurrent stroke, or occurrence of ischemic heart disease during follow-up.
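
For composite variables built from continuous measurements on different scales, as in the rheumatoid arthritis example, the note on standardization can be illustrated with a minimal sketch. All patient values below are hypothetical, and the unweighted sum of z-scores is only one possible way of combining the variables; it is not the published Disease Activity Score formula.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical measurements for five patients (not from any study)
joint_pain = [3.0, 5.5, 7.0, 4.0, 6.5]         # pain score, 0-10 scale
swollen_joints = [2, 8, 12, 4, 9]              # number of swollen joints
sedimentation = [18.0, 35.0, 60.0, 22.0, 41.0] # ESR, mm/h

# Composite per patient: sum of the three standardized variables
composite = [
    sum(parts)
    for parts in zip(z_scores(joint_pain),
                     z_scores(swollen_joints),
                     z_scores(sedimentation))
]
print([round(c, 2) for c in composite])
```

After standardization each variable contributes on the same scale, so a variable measured in large units, like the sedimentation rate, no longer dominates the composite.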