# Methods for Lowering the Type I Error (α): The Group Sequential Method with α-Spending Function Approach

The alpha-spending function approach does not necessarily use the same significance level at each analysis. As an example, the O'Brien & Fleming approach, as implemented in the GroupSeq package of the R statistical software, can be used here. It calculates maximal sample sizes for various settings, such that the overall significance level is close to 0.05:

interim analysis 1: α* = 0.000207

interim analysis 2: α* = 0.012025

final analysis: α* = 0.046261

with an expected number of patients of ~ 91.

In the above setting it would be recommended to include 101 patients in order to obtain the above alphas.

A very simple alpha-spending function is Peto's method. It uses α* = 0.001 at all interim analyses, and α = 0.05 at the final analysis. In the above graph, in addition to the Peto method, the Pocock non-alpha-spending method and the O'Brien & Fleming alpha-spending function are given. The equations for three more alpha-spending functions and the corresponding graphs are given below, but many more exist:

A non-decreasing function f(t; α) (0 < α < 1):

• f(t; α) = α if t ≥ 1
• for i = 1, 2, ..., K, and ti = Ii/IK, define f(ti; α) = ...

• (1) Lan & DeMets (Pocock-type): f(t; α) = α log(1 + (e - 1)t)
• (2) Lan & DeMets (O'Brien-Fleming-type): f(t; α) = 2(1 - Φ(Φ^-1(1 - α/2)/√t))
• (3) Hwang, Shih & DeCani: f(t; α) = α(1 - exp(-γt))/(1 - exp(-γ)) = αt, if γ = 0

t = information fraction (the fraction of the trial's total information accrued so far), α = type I error, Φ = standardized normal cumulative distribution function, γ = gamma parameter for creating stopping boundaries, exp(..) = e^(..). The above graph shows the relationship between the magnitude of the alphas and that of the test statistics at the first interim analysis for the above three additional alpha-spending functions. When alpha-spending functions have been calculated, the method for calculating the confidence interval of the estimate of the outcome parameter also needs to be adapted! Regarding stopping rules, previously the rule at the interim analysis would be to stop if the observed difference was statistically "significant".
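The three spending functions above can be evaluated directly. The following is an illustrative Python sketch (the function names are mine, and an overall α = 0.05 is assumed, as in the chapter's examples):

```python
import math
from statistics import NormalDist

ALPHA = 0.05  # overall type I error, as in the chapter's examples

def pocock_type(t, alpha=ALPHA):
    # (1) Lan & DeMets, Pocock-type: alpha * log(1 + (e - 1) * t)
    return alpha * math.log(1 + (math.e - 1) * t)

def obrien_fleming_type(t, alpha=ALPHA):
    # (2) O'Brien-Fleming-type: 2 * (1 - Phi(Phi^-1(1 - alpha/2) / sqrt(t)))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t)))

def hwang_shih_decani(t, alpha=ALPHA, gamma=1.0):
    # (3) Hwang, Shih & DeCani; reduces to alpha * t in the limit gamma -> 0
    if gamma == 0:
        return alpha * t
    return alpha * (1 - math.exp(-gamma * t)) / (1 - math.exp(-gamma))

# All three spend the full alpha at t = 1, but at early information
# fractions the O'Brien-Fleming-type function spends far less.
for t in (0.25, 0.50, 0.75, 1.00):
    print(t, pocock_type(t), obrien_fleming_type(t), hwang_shih_decani(t))
```

Note how each function spends exactly α at t = 1, which is what makes the overall type I error come out at the nominal level.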

This means: stop if the p-value is adequately low. With alpha-spending function technologies, an additional stopping rule to be included is futility, meaning:

• stop, if the p-value is too large
• stop, if the observed difference is too small, so that the power to reach a significant result at the final analysis is too low.
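The second futility rule is usually operationalized via conditional power: the probability of still reaching significance at the final analysis, given the interim result. The sketch below is a generic "current trend" conditional-power calculation in Python; it illustrates the idea only, and the critical value 1.96 and the example thresholds are my assumptions, not any specific published procedure:

```python
import math
from statistics import NormalDist

def conditional_power(z_interim, t, z_crit=1.96):
    """Probability of crossing z_crit at the final analysis, given the
    interim z-value at information fraction t, assuming the effect
    estimated so far (the 'current trend') continues."""
    theta = z_interim / math.sqrt(t)                  # estimated drift
    mean_final = z_interim * math.sqrt(t) + theta * (1 - t)
    return 1 - NormalDist().cdf((z_crit - mean_final) / math.sqrt(1 - t))

# Halfway through a trial with no effect seen so far, the chance of a
# significant final result is tiny -> a candidate for stopping for futility.
print(conditional_power(0.0, 0.5))   # very low
print(conditional_power(2.0, 0.5))   # promising interim -> high power
```

A trial might, for example, stop for futility if this conditional power falls below some prespecified threshold such as 10 %.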

Snapinn's procedure, after Steve Snapinn, the Amgen epidemiologist, is an example of this futility assessment. In addition to the usual hypothesis test (reject H0 if the calculated p-value is below the appropriate alpha (type I error) level), a futility rule is applied. We will use the R package gsDesign as an example. By default the settings below will be used:

• design for comparing new versus old treatment
• two interim analyses + one final analysis: k = 3
• two-sided group-sequential trial
• 2.5 % type I error
• 80 % power
• stop for superiority or for futility
• gsDesign(k = 3, alpha = 0.025, beta = 0.2, test.type = 4).

The tables below give the appropriate rejection boundaries, type II and type I errors, as well as the required sample sizes for the above settings. The boundary crossing probabilities are also given.

Example 1

For the purpose of a parallel group study of patients with exacerbations of COPD (chronic obstructive pulmonary disease) a required sample size calculation was performed. There were, thus, two groups, untreated controls (usual care=UC), and UC + antibiotic patients.

The study design was as follows:

• patients randomized to either UC or UC + antibiotics
• the scientific question: "is UC + antibiotics superior to UC?"
• outcome: duration of the exacerbation
• UC: on average 4.5 days (sd 2.0)
• clinically relevant reduction: 0.5 day exacerbation time
• test statistic: unpaired Student's t-test.
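As a check, the fixed-design (no-interim) sample size implied by these settings can be computed with the standard normal-approximation formula. This Python sketch is illustrative arithmetic, not the gsDesign call itself:

```python
import math
from statistics import NormalDist

z_alpha = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-sided alpha = 0.05 -> 1.96
z_beta = NormalDist().inv_cdf(0.90)           # 90 % power -> 1.28
sd, delta = 2.0, 0.5                          # from the settings above

# n per group = 2 * (z_alpha + z_beta)^2 * sd^2 / delta^2
n_per_group = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2
n_total = 2 * math.ceil(n_per_group)
print(n_total)  # close to the fixed sample size n.fix = 672 used in the chapter
```

The result (about 674) agrees, up to rounding, with the fixed sample size of 672 used as n.fix in the chapter.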

The next question was: how many patients must be randomized in order to have 90 % power if the reduction is 0.5 day? The following settings were applied:

• two-sided significance level α = 0.05
• three interim analyses, after 25 %, 50 % and 75 % of the patients have been included
• alpha-spending with the O'Brien-Fleming function.

In R the following commands had to be given:

library(gsDesign)
gsDesign(k = 4, test.type = 2, alpha = 0.025, beta = 0.1, n.fix = 672, sfu = "OF")

Below are the required sample sizes (N) per analysis, and the expected p-values (alphas) at the interims, as calculated by the software program.

| Analysis | N | Z | Nominal p | Spending alpha |
|----------|-----|------|-----------|----------------|
| 1 | 172 | 4.05 | 0.0000 | 0.0000 |
| 2 | 344 | 2.86 | 0.0021 | 0.0021 |
| 3 | 516 | 2.34 | 0.0097 | 0.0083 |
| 4 | 687 | 2.02 | 0.0215 | 0.0145 |

Total p (1-sided): 0.0250
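The Z-boundaries in this output follow the classic O'Brien-Fleming pattern z_i = c·√(K/i) for K equally spaced analyses. A quick Python check, starting from the rounded final boundary 2.02, approximately reproduces the boundary and nominal-p columns (illustrative sketch, not the gsDesign computation):

```python
import math
from statistics import NormalDist

K = 4
z_final = 2.02                               # final-analysis Z from the output
# O'Brien-Fleming boundary: z_i = z_final * sqrt(K / i)
boundaries = [z_final * math.sqrt(K / i) for i in range(1, K + 1)]
# one-sided nominal p-value at each boundary: 1 - Phi(z_i)
nominal_p = [1 - NormalDist().cdf(z) for z in boundaries]
print(boundaries)   # approximately [4.04, 2.86, 2.33, 2.02]
print(nominal_p)    # approximately [0.0000, 0.0021, 0.0098, 0.0217]
```

This is why the early boundaries are so extreme (Z > 4 at the first interim): almost no alpha is spent early, preserving nearly the full 0.025 for the final analysis.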

Obviously, the total number of patients to be included had to be slightly larger than that of the fully standard analysis: 687 versus 672. But, then, the reward was three interim analyses that could check whether the results at the interim would justify early termination of the study.

Example 2

The ongoing ICD-2 Study (a prospective randomized controlled trial to evaluate prevention of sudden cardiac death using implantable cardioverter defibrillators (ICDs) in dialysis patients at the Leyden University Medical Center, LUMC) is used as an example. The study has been in the ISRCTN (International Standard Randomised Controlled Trial Number) registry, the joint registration of the World Health Organization and the International Committee of Medical Journal Editors, since 2007. The latest interim analysis gave the results below. For the main endpoint, sudden death, the statistics are:

• control group: 3 events in 92.0 pys (patient-years; 75 patients)
• ICD group: 1 event in 97.0 pys (86 patients)
• hazard ratio (HR) = 0.3 (95 % confidence interval 0.03-3.1), p = 0.30

Was this difference futile? What might happen when the remaining 40 patients are included, with an expected accrual time of 2 years? An additional follow-up of 1 year after the final inclusion was agreed. Regarding expectations, the investigators assumed the same risks as observed so far: approximately 3 vs 1 events per 100 pys. An additional follow-up of 55.0 versus 67.0 pys was expected, and the expected outcomes were 4.77 events in 147.0 pys versus 1.70 in 164.0 pys, with an HR = 0.3 (95 % confidence interval 0.06-1.8), p = 0.19. The conditional power for a significant result at the completion of the study was taken to be 25 %.
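The projected event counts follow from simple extrapolation of the observed rates to the expected total follow-up. The Python arithmetic below is illustrative; small differences from the chapter's figures reflect rounding of the person-year totals:

```python
# observed event rates so far (events per patient-year)
rate_control = 3 / 92.0
rate_icd = 1 / 97.0

# expected events over the projected total follow-up of 147.0 vs 164.0 pys
expected_control = rate_control * 147.0   # ~4.8 (chapter: 4.77)
expected_icd = rate_icd * 164.0           # ~1.7 (chapter: 1.70)

# hazard ratio under these constant-rate assumptions
hazard_ratio = (expected_icd / 164.0) / (expected_control / 147.0)
print(expected_control, expected_icd, hazard_ratio)
```

Under the constant-rate assumption the projected hazard ratio simply equals the observed rate ratio, about 0.32, in line with the reported HR = 0.3.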

Example 3

For the purpose of a parallel-group study of patients with ACS (acute coronary syndrome) a sample size requirement calculation was performed:

• patients would be randomized to either immediate catheterization (and/or interventions) or to a wait-and-see strategy (and stabilization with medicines)
• suppose: the wait-and-see strategy is the standard strategy
• the scientific question: "is the immediate strategy superior?"
• the expected outcome was mortality
• the wait-and-see mortality proportion was p1 = 0.15
• the clinically relevant reduction in mortality was 5 %, thus p2 = 0.10
• the test statistic used was the chi-square statistic or z-statistic for comparing two proportions.

With a fully standard analysis, a 90 % power should be obtained if the reduction in the probability of mortality was 0.05, with a two-sided significance level of α = 0.05, and no interim analyses. And, so, with a fully standard analysis:

χ² = z² = (zα + zβ)²

(p1 - p2)²/([p1(1 - p1) + p2(1 - p2)]/n) = (z0.05 + z0.1)²

(0.15 - 0.10)²/([0.15*(1 - 0.15) + 0.10*(1 - 0.10)]/n) = (1.96 + 1.28)²

n per group = (1.96 + 1.28)² [0.15*0.85 + 0.10*0.90]/(0.15 - 0.10)² = 918

and, thus, 1836 patients in total was the required sample size.
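Redoing this pocket-calculator arithmetic in Python gives roughly 913 per group; the chapter's 918 presumably reflects slightly different rounding of the z-values or a small correction. The sketch below is illustrative:

```python
p1, p2 = 0.15, 0.10
z_alpha, z_beta = 1.96, 1.28   # z-values as used in the formula above

# n per group = (z_alpha + z_beta)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
n_per_group = ((z_alpha + z_beta) ** 2
               * (p1 * (1 - p1) + p2 * (1 - p2))
               / (p1 - p2) ** 2)
print(n_per_group)   # roughly 913; the chapter reports 918
```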

With 2 interim analyses and an alpha-spending function analysis, how many patients must be randomized in order to have 90 % power, if the reduction in mortality is again 0.15 - 0.10 = 0.05, and if we have:

• 2 interim analyses
• a two-sided significance level with type I error α = 0.050
• an O'Brien-Fleming alpha-spending function
• the program gsDesign(k = 3, test.type = 2, sfu = "OF", n.fix = 1836).

The software program provided the underneath required sample size and p-values.

| Analysis | N | Z | Nominal p | Spending alpha |
|----------|------|------|-----------|----------------|
| 1st interim | 622 | 3.47 | 0.0003 | 0.0003 |
| 2nd interim | 1243 | 2.45 | 0.0071 | 0.0069 |
| final analysis | 1865 | 2.00 | 0.0225 | 0.0178 |

Total p-value: 0.0250

Obviously, the total number of patients to be included had to be slightly larger than that of the fully standard analysis: 1865 versus 1836. But, then, the reward was again two interim analyses that could check whether the results at the interim would justify early termination of the study. An interesting question is how the standard-analysis version and the interim-analysis version perform with equivalence testing. We will set the boundary of equivalence at |p1 - p2| < 0.02. Using the above software program, the required sample size for equivalence testing with the standard analysis is:

• n.fix = nBinomial(p1 = 0.10, p2 = 0.10, delta0 = 0.02)
• 9496 patients in total (4748 per group)
• the same result will be obtained when using the underneath pocket calculator method.

Using the above software program (commands: gsDesign(k = 3, test.type = 2, n.fix = 9496, sfu = "OF")), the required sample size for equivalence testing with 2 interim analyses is:

| Analysis | N | Z | Nominal p | Spending alpha |
|----------|------|------|-----------|----------------|
| interim 1 | 3216 | 3.47 | 0.0003 | 0.0003 |
| interim 2 | 6432 | 2.45 | 0.0071 | 0.0069 |
| final | 9648 | 2.00 | 0.0225 | 0.0178 |

Total: 0.0250

Obviously, the total number of patients to be included again had to be slightly larger than that of the fully standard analysis: 9648 versus 9496. But the reward was two interim analyses that could check whether the results at the interim would justify early termination of the study. Note that very large sample sizes are required for equivalence testing.
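For reference, the equivalence sample size can be approximated with the same pocket-calculator formula as before, with the equivalence margin replacing the treatment difference. The Python sketch below is illustrative; it lands close to, but not exactly at, nBinomial's 9496, since gsDesign uses a somewhat different computation:

```python
from statistics import NormalDist

z_alpha = NormalDist().inv_cdf(0.975)   # two-sided alpha = 0.05
z_beta = NormalDist().inv_cdf(0.90)     # 90 % power
p1 = p2 = 0.10
margin = 0.02                           # equivalence boundary |p1 - p2| < 0.02

# n per group = (z_alpha + z_beta)^2 * [p1(1-p1) + p2(1-p2)] / margin^2
n_per_group = ((z_alpha + z_beta) ** 2
               * (p1 * (1 - p1) + p2 * (1 - p2))
               / margin ** 2)
n_total = 2 * n_per_group
print(n_total)   # roughly 9457, close to nBinomial's 9496
```

The tiny margin of 0.02 in the denominator, squared, is what drives the very large sample sizes noted above.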