# Empirical Levels and Powers of Two-Sample Tests

- Adaptation to the Presence of Tied Observations
- Mann-Whitney-Wilcoxon Null Hypotheses
- Efficiency and Power of Two-Sample Tests
- Efficacy of the Gaussian-Theory Test
- Efficacy of the Mann-Whitney-Wilcoxon Test
- Summarizing Asymptotic Relative Efficiency
- Power for Mann-Whitney-Wilcoxon Testing
- Testing Equality of Dispersion

As in Table 2.1, one might simulate data from a variety of distributions, and compare levels of the various two-sample tests. Results are in Table 3.4.

Table 3.4 shows that the extreme conservativeness of Mood’s test justifies its exclusion from practical consideration. We see that the Wilcoxon test, calibrated exactly using its exact null distribution, falls short of the desired level; a less-conservative equal-tailed test would have a level exceeding the nominal target of 0.05. The conservativeness of the Savage Score test is somewhat surprising. The close agreement between the level of the t-test and the nominal level with Gaussian data is as expected, as is the poor agreement between the level of the t-test and the nominal level with Cauchy data.

As was done in Table 2.3, one might perform a similar simulation under the alternative hypotheses to calculate power. In this case, alternative hypotheses were generated by offsetting one group by one unit. Results are in Table 3.5.

Table 3.5 excludes the exact version of the Wilcoxon test and Mood’s test, since for these sample sizes ($M_j = 10$ for $j = 1, 2$), they fail to achieve the desired level for any data distribution. The approximate Wilcoxon test has power comparable to that of the t-test under the conditions optimal for the t-test, and also maintains high power throughout.
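A simulation of this kind might be sketched as follows. This is a minimal illustration and not the code behind Tables 3.4 and 3.5: the helper `reject_rate`, the seed, and the number of replications are assumptions made here, and only the pooled t-test and the normal-approximation Wilcoxon test are included. A shift of zero estimates the level; a shift of one unit estimates power.

```r
# Sketch: empirical rejection rates for two-sample tests with 10 observations
# per group. reject_rate is a hypothetical helper, not from any package.
set.seed(1)  # arbitrary seed, for reproducibility only
reject_rate <- function(rgen, shift, nsim = 2000, m = 10, alpha = 0.05) {
  hits <- replicate(nsim, {
    x <- rgen(m)
    y <- rgen(m) + shift  # second group offset by `shift`
    c(t = t.test(x, y, var.equal = TRUE)$p.value < alpha,
      wilcoxon = wilcox.test(x, y, exact = FALSE)$p.value < alpha)
  })
  rowMeans(hits)  # proportion of rejections for each test
}
level_gauss  <- reject_rate(rnorm,   shift = 0)
power_gauss  <- reject_rate(rnorm,   shift = 1)
level_cauchy <- reject_rate(rcauchy, shift = 0)
power_cauchy <- reject_rate(rcauchy, shift = 1)
print(rbind(level_gauss, power_gauss, level_cauchy, power_cauchy))
```

Other distributions can be examined by passing a different generator, such as `rlogis` or `rexp`, as `rgen`.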

# Adaptation to the Presence of Tied Observations

The Mann-Whitney-Wilcoxon statistic is designed to be used for variables arising from a continuous distribution. Processes expected to produce data with distinct values, however, sometimes produce tied values, frequently because of limits on measurement precision. Sometimes an observation from the first group is tied with one from the second group. Then the scheme for assigning scores must be modified. Tied observations are frequently assigned scores averaged over the scores that would have been assigned if the data had been distinct; for example, if $Z_{(1)}, \ldots, Z_{(N)}$ are the ordered values from the combination of the two samples, and if $Z_{(j)} = Z_{(j+1)}$, then both observation $j$ and observation $j + 1$ are assigned score $(a_j + a_{j+1})/2$. The variance of the test statistic must be adjusted for this change in scores.

When both tied observations come from the first group, or both from the second group, then one might assume that the tie arises because of imprecise measurement of a process that, measured more precisely, would have produced untied individuals. The test statistic is unaffected by assignment of scores to observations according to either of the potential orderings. However, the permutation distribution is affected, because many of the permutations considered will split the tied observations into different groups. Return to variance formula (3.11). The average rank $\bar a$ is unchanged by this modification of ranks, but the average squared rank $\hat a$ changes, because the sum of squared ranks is reduced by $a_j^2 + a_{j+1}^2 - (a_j + a_{j+1})^2/2 = (a_j - a_{j+1})^2/2$. Then, for each pair of ties in the data, the variance (3.11) is reduced by $M_1 M_2 (a_j - a_{j+1})^2/(2N(N - 1))$. This process could be continued for triplets, etc., with more complicated expressions for the correction. Lehmann (2006) derives these corrections for generic numbers of replicated values, in the simpler case in which $a_j = j$; in this case, the correction is applied to the simpler variance expression (3.22).

It is simpler, however, to bypass (3.22), and, instead of correcting (3.11), to recalculate (3.11) using the new scores.
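The recalculation can be sketched in a few lines. The sketch assumes that (3.11) has the usual permutation form $M_1 M_2 (\hat a - \bar a^2)/(N - 1)$, with $\bar a$ the average score and $\hat a$ the average squared score; the function name `score_variance` and the data are hypothetical. For a single tied pair with $a_j = j$, the drop in variance should then equal $M_1 M_2/(2N(N - 1))$.

```r
# Sketch: null variance of a sum-of-scores statistic, computed from the scores.
# Assumes (3.11) is M1 * M2 * (ahat - abar^2) / (N - 1).
score_variance <- function(scores, m1) {
  n <- length(scores)
  abar <- mean(scores)    # average score
  ahat <- mean(scores^2)  # average squared score
  m1 * (n - m1) * (ahat - abar^2) / (n - 1)
}
z <- c(1.2, 2.5, 2.5, 3.1, 4.0)  # hypothetical pooled sample with one tied pair
v_untied <- score_variance(seq_along(z), m1 = 2)  # scores 1..N, as if distinct
v_tied   <- score_variance(rank(z), m1 = 2)       # rank() averages tied ranks
# Pairwise correction with M1 = 2, M2 = 3, N = 5, and a_{j+1} - a_j = 1
reduction <- 2 * 3 * 1^2 / (2 * 5 * 4)
print(c(untied = v_untied, tied = v_tied, reduction = reduction))
```

Recomputing from the scores handles triplets and larger groups of ties with no change to the code, which is the practical advantage of bypassing the explicit corrections.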

When the assumption of continuity of the distributions of underlying measurements does not hold, the distribution of rank statistics is no longer independent of the underlying data distribution, since the rank statistic distribution will then depend on the probability of ties. Hence no exact calculation of the form in §3.4.1 is possible.

It was noted at the end of §3.4.1.1 that continuity correction is of little importance in the case of rank tests. When average ranks replace the original ranks, the continuity correction argument using Figure 2.2 no longer holds. Potential values of the test statistic are in some cases less than 1 unit apart, and, in such cases, the continuity correction might be abandoned.

# Mann-Whitney-Wilcoxon Null Hypotheses

The Mann-Whitney-Wilcoxon test was constructed to test whether the distribution $F$ of the $X$ variables is the same as the distribution $G$ of the $Y$ variables. This null hypothesis implies that $P[X_i < Y_j] = 1/2$. Unequal pairs $F$ and $G$ violate the null hypothesis of this test. However, certain distribution pairs violate the null hypothesis, and hence fall in the alternative hypothesis, and yet the Mann-Whitney-Wilcoxon test has no power to distinguish them from the null. This is true if $F$ and $G$ are unequal but symmetric about the same point. In this case, the standard error of the Mann-Whitney-Wilcoxon test statistic (3.11) is no longer correct, and the expectation under this alternative is the same as it is under the null. The same phenomenon arises whenever $\int_{-\infty}^{\infty} F(y) g(y)\,dy = 1/2$.

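As an example, suppose that $Y_j \sim \mathfrak{G}(\theta, 1)$, that is, Gaussian with mean $\theta$ and unit variance, and that the $X_i$ are exponential with unit rate, and seek the value of $\theta$ for which the test has power no larger than the test size against the above alternative. Solve $1/2 = \int_0^{\infty} (1 - \exp(-y)) \exp(-(y - \theta)^2/2)(2\pi)^{-1/2}\,dy$ to obtain $\theta = .876$.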
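The root can be found numerically. The sketch below reads the integrand as the unit exponential distribution function times a Gaussian density with mean $\theta$ and unit variance; the function name `excess` and the search interval are choices made here for illustration.

```r
# Sketch: find theta with P[X < Y] = 1/2 for X ~ exponential(1) and
# Y ~ Gaussian(theta, 1), using the integral displayed in the text.
excess <- function(theta) {
  integrand <- function(y) (1 - exp(-y)) * dnorm(y, mean = theta)
  integrate(integrand, lower = 0, upper = Inf)$value - 0.5
}
# excess() is negative at 0 and positive at 2, so the root is bracketed
theta_star <- uniroot(excess, interval = c(0, 2), tol = 1e-8)$root
print(round(theta_star, 3))
```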

# Efficiency and Power of Two-Sample Tests

In this section, consider models of the form (3.1), with the null hypothesis $\theta = \theta^0$. Without loss of generality, one may take $\theta^0 = 0$; otherwise, shift $Y_j$ by $\theta^0$.

Relative efficiency has already been defined for test statistics $T_i$ such that $(T_i - \mu_i(\theta))/(\sigma_i(\theta)/\sqrt{N}) \approx \mathfrak{G}(0,1)$, for $N$ the total sample size. Asymptotic relative efficiency calculations require specification of how $M_1$ and $M_2$ move together. Let $M_1 = \lambda N$, $M_2 = (1 - \lambda)N$, for $\lambda \in (0,1)$.

## Efficacy of the Gaussian-Theory Test

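As in the one-sample case, the large sample behavior of this test will be approximated by a version with known variance. Here $\mu(\theta) = \theta$, and

$$\mathrm{Var}[T] = \rho^2\left(\frac{1}{M_1} + \frac{1}{M_2}\right) = \frac{\rho^2}{N}\left(\frac{1}{\lambda} + \frac{1}{1 - \lambda}\right) = \frac{\rho^2}{N\zeta^2},$$

for $\rho^2$ the variance of each observation, and $\zeta = \sqrt{\lambda(1 - \lambda)}$.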

For example, suppose that $Y_j \sim \mathfrak{G}(\theta,1)$, and $X_i \sim \mathfrak{G}(0,1)$. In this case, the efficacy is $e = \zeta$.

Alternatively, suppose that the observations are logistically distributed. Each observation has variance $\pi^2/3$, and the efficacy is $e = \zeta\sqrt{3}/\pi = .551\zeta$. The analysis of §2.4.1, for tests as in (2.15) and variance scaled as in (2.19), allows for calculation of asymptotic relative efficiency, in terms of the separate efficacies, defined as the ratio of $\mu'(0)$ to $\sigma(0)$.

## Efficacy of the Mann-Whitney-Wilcoxon Test

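In order to apply the results for the asymptotic relative efficiency of §2.4, the test statistic must be scaled so that the asymptotic variance is approximately equal to a constant divided by the sample size, and must be such that the derivative of the mean function is available at zero. Using the Mann-Whitney formulation, and rescaling so that $T = \sum_{i=1}^{M_1} \sum_{j=1}^{M_2} I(X_i < Y_j)/(M_1 M_2)$, then

$$\mu(\theta) = P_\theta[X_i < Y_j] = \int_{-\infty}^{\infty} F(y + \theta) f(y)\,dy, \tag{3.25}$$

for $F$ and $f$ the common distribution function and density of $X_i$ and $Y_j - \theta$, and so

$$\mu'(0) = \int_{-\infty}^{\infty} f(y)^2\,dy.$$

Also,

$$\mathrm{Var}_0[T] \approx \frac{1}{12 N \lambda(1 - \lambda)} = \frac{1}{12 N \zeta^2}. \tag{3.24}$$

For example, suppose that $Y_j \sim \mathfrak{G}(\theta,1)$, and $X_i \sim \mathfrak{G}(0,1)$. The differences $X_i - Y_j \sim \mathfrak{G}(-\theta,2)$, and so

$$\mu(\theta) = \Phi(\theta/\sqrt{2}). \tag{3.26}$$

Hence $\mu'(0) = 1/(2\sqrt{\pi})$. Also, (3.24) still holds, and the efficacy is

$$\mu'(0)/\sigma(0) = \zeta\sqrt{12}/(2\sqrt{\pi}) = \zeta\sqrt{3/\pi}.$$

Alternatively, suppose that these distributions have a logistic distribution. In this case,

$$\mu(\theta) = e^{\theta}(e^{\theta} - 1 - \theta)/(e^{\theta} - 1)^2, \tag{3.27}$$

and

$$\mu'(0) = 1/6, \qquad \mu'(0)/\sigma(0) = \zeta\sqrt{12}/6 = \zeta/\sqrt{3}.$$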

Efficacies for more general rank statistics may be obtained using calculations involving expectations of derivatives of underlying densities, with respect to the model parameter, evaluated at order statistics under the null hypothesis, without providing rank expectations away from the null (Dwass, 1956).

TABLE 3.6: Asymptotic relative efficiencies for Two Sample Tests

| Test | Pooled | MWW | ARE of MWW to Pooled |
|---|---|---|---|
| General | | | |
| Normal, unit variance | | | |
| Logistic | | | |

$\zeta = (\lambda(1 - \lambda))^{1/2}$, $\varepsilon = \sqrt{\mathrm{Var}[X_i]}$.

## Summarizing Asymptotic Relative Efficiency

Table 3.6 contains results of calculations for asymptotic relative efficiencies of the Mann-Whitney-Wilcoxon test to the Pooled t-test. For Gaussian variables, as expected, the Pooled t-test is more efficient, but only by 5%. For a distribution with moderate tails, the logistic, the Mann-Whitney-Wilcoxon test is 10% more efficient.
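These relative efficiencies can be checked numerically. The sketch below assumes the standard forms of the two efficacies, $\zeta/\sqrt{\mathrm{Var}[X_i]}$ for the pooled test and $\zeta\sqrt{12}\int f(x)^2\,dx$ for the Mann-Whitney-Wilcoxon test, with $f$ the common null density; the common factor $\zeta$ cancels in the ratio and is omitted. The helper names are hypothetical.

```r
# Sketch: efficacies up to the common factor zeta = sqrt(lambda * (1 - lambda)).
# Pooled t: 1 / (sd of one observation); MWW: sqrt(12) * integral of f^2.
efficacies <- function(dens, sd_obs) {
  int_f2 <- integrate(function(x) dens(x)^2, -Inf, Inf)$value
  c(pooled = 1 / sd_obs, mww = sqrt(12) * int_f2)
}
# ARE of MWW to the pooled test is the squared ratio of efficacies
are <- function(e) unname((e["mww"] / e["pooled"])^2)
e_gauss    <- efficacies(dnorm,  sd_obs = 1)
e_logistic <- efficacies(dlogis, sd_obs = pi / sqrt(3))
print(rbind(gaussian = c(e_gauss, ARE = are(e_gauss)),
            logistic = c(e_logistic, ARE = are(e_logistic))))
```

Under these assumptions the Gaussian ratio is $3/\pi$ and the logistic ratio is $\pi^2/9$, in line with the 5% and 10% figures quoted above.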

## Power for Mann-Whitney-Wilcoxon Testing

Power may be calculated for Mann-Whitney-Wilcoxon testing, using (2.22) in conjunction with (3.24) for the null variance of the rescaled test, and (3.25), adapted to the particular distribution of interest. Applications to Gaussian and logistic observations are given by (3.26) and (3.27) respectively. Zhong and Kolassa (2017) give second moments for this statistic under the alternative hypothesis, and allow for calculation of $\sigma(\theta)$ for non-null $\theta$. The second moment depends not only on the probability (3.25), but also on probabilities involving two independent copies of $X$ and one copy of $Y$, and of two independent copies of $Y$ and one copy of $X$. This additional calculation allows the use of (2.17); calculations below involve the simpler formula.

Example 3.8.1 *Consider using two independent sets of 40 observations each to test the null hypothesis of equal distributions vs. the alternative that (3.1) holds, with $\theta = 1$, and with observations having a logistic distribution. Then, using (3.27), $\mu(1) = e(e - 2)(e - 1)^{-2} = 0.661$. The function $\mu(\theta)$ has a removable singularity at zero; fortunately the null probability is easily seen to be $1/2$. Then $\lambda = 1/2$, $N = 80$, $\mu(0) = 1/2$, $\mu(1) = 0.661$, $\sigma(0) = 1/\sqrt{12 \times (1/2) \times (1/2)} = 1/\sqrt{3}$. The power for the one-sided level $0.025$ test, from (2.22), is $\Phi(\sqrt{80}(0.661 - 0.5)/(1/\sqrt{3}) - 1.96) = \Phi(0.534) = .703$.*

*One could also determine the total sample size needed to obtain 80% power. Using (2.24), one needs $(1/\sqrt{3})^2(z_{0.025} + z_{0.2})^2/(0.661 - 0.5)^2 = 100.9$; choose 51 per group.*
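The arithmetic of this example can be reproduced in R. The sketch below assumes that $\mu(\theta)$ is the probability that the difference of two standard logistic variables falls below $\theta$, $e^{\theta}(e^{\theta} - 1 - \theta)/(e^{\theta} - 1)^2$, which reduces to the expression $e(e - 2)(e - 1)^{-2}$ used above at $\theta = 1$. It uses the power and sample size formulas in the form in which they are displayed in the example. Variable names are illustrative, and unrounded intermediate values may shift the last digit relative to a hand calculation.

```r
# Sketch: power and total sample size for the MWW test under a unit shift.
mu <- function(theta) {        # P[X < Y] when Y is shifted up by theta
  if (theta == 0) return(0.5)  # value at the removable singularity
  e <- exp(theta)
  e * (e - 1 - theta) / (e - 1)^2
}
lambda <- 0.5   # proportion of observations in the first group
N <- 80         # total sample size
alpha <- 0.025  # one-sided level
sigma0 <- 1 / sqrt(12 * lambda * (1 - lambda))  # null sd scale, 1 / sqrt(3)
effect <- mu(1) - mu(0)
power <- pnorm(sqrt(N) * effect / sigma0 - qnorm(1 - alpha))
# Total sample size for 80% power: sigma0^2 (z_alpha + z_beta)^2 / effect^2
N_needed <- sigma0^2 * (qnorm(1 - alpha) + qnorm(0.80))^2 / effect^2
print(c(mu1 = mu(1), power = power, N_needed = N_needed))
```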

In contrast with the shift alternative (3.1), one might consider the Lehmann alternative, under which the distribution (or survival) function of one sample is the $k$th power of that of the other, for some $k \neq 1$. Power calculations for Mann-Whitney-Wilcoxon tests for this alternative have the advantage that power does not depend on the underlying $G$ (Lehmann, 1953).

As noted above, while efficacy calculations are available for more general rank statistics, the non-asymptotic expectation of the test statistic under the alternative is difficult enough that it is omitted here.

# Testing Equality of Dispersion

One can adapt the above rank tests to test whether two populations have equal dispersion, assuming a common center. If one population is more spread out than another, then the members of one sample would tend to lie outside the points from the other sample. This motivates the Siegel-Tukey test. Rank the points, with the minimum getting rank 1, the maximum getting rank 2, then the second to the maximum getting rank 3, the second to the minimum getting rank 4, the third from the minimum getting rank 5, and continuing to alternate. Then sum the ranks associated with one of the samples. Under the null hypothesis, this statistic has the same distribution as the Wilcoxon rank-sum statistic. Alternatively, one might perform the Ansari-Bradley test, by ranking from the outside in, with extremes getting equal rank, and again summing the ranks from one sample.

The Ansari-Bradley test has a disadvantage with respect to the Siegel-Tukey test, in that one can’t use off-the-shelf Wilcoxon tail calculations. On the other hand, the Ansari-Bradley test is exactly invariant to reflection.
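The two ranking schemes can be sketched in base R. The functions `st_scores` and `dispersion_ranks` below are hypothetical helpers written for illustration, not the `siegel.tukey.ranks` function of any package; ties are handled by averaging the Siegel-Tukey scores within groups of equal values, which is an assumption about the intended convention.

```r
# Sketch: Siegel-Tukey scores for sorted positions 1..n. The minimum gets 1,
# then two from the top, two from the bottom, and so on, alternating.
st_scores <- function(n) {
  scores <- numeric(n)
  lo <- 1; hi <- n; r <- 1
  from_low <- TRUE
  take <- 1  # the first step takes only the minimum
  while (lo <= hi) {
    for (k in seq_len(take)) {
      if (lo > hi) break
      if (from_low) { scores[lo] <- r; lo <- lo + 1 }
      else          { scores[hi] <- r; hi <- hi - 1 }
      r <- r + 1
    }
    from_low <- !from_low
    take <- 2
  }
  scores
}
dispersion_ranks <- function(x) {
  ord <- order(x)
  st <- numeric(length(x))
  st[ord] <- ave(st_scores(length(x)), x[ord])  # average scores within ties
  ab <- pmin(rank(x), rank(-x))                 # Ansari-Bradley: outside in
  data.frame(x = x, st = st, ab = ab)
}
x <- c(3.1, 1.0, 4.2, 1.5, 9.0, 2.6, 5.3)  # hypothetical data, no ties
print(dispersion_ranks(x))
```

Summing the `st` column over one sample gives a statistic that can be referred to Wilcoxon rank-sum tables, while the `ab` scores are symmetric in the two tails of the pooled sample.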

TABLE 3.7: Yarn data with rankings for testing dispersion

| strength | type | ab | st | strength | type | ab | st |
|---|---|---|---|---|---|---|---|
| 12.8 | A | 1.0 | 1.00 | 15.8 | A | 24.0 | 47.00 |
| 13.0 | A | 2.0 | 4.00 | 15.9 | A | 22.0 | 43.67 |
| 13.8 | A | 3.0 | 5.00 | 15.9 | B | 22.0 | 43.67 |
| 14.2 | A | 4.5 | 8.50 | 15.9 | B | 22.0 | 43.67 |
| 14.2 | B | 4.5 | 8.50 | 16.0 | A | 19.5 | 38.50 |
| 14.5 | B | 6.0 | 12.00 | 16.0 | B | 19.5 | 38.50 |
| 14.8 | A | 7.5 | 14.50 | 16.2 | A | 17.0 | 33.33 |
| 14.8 | A | 7.5 | 14.50 | 16.2 | B | 17.0 | 33.33 |
| 14.9 | A | 10.0 | 19.33 | 16.2 | B | 17.0 | 33.33 |
| 14.9 | B | 10.0 | 19.33 | 16.4 | A | 15.0 | 30.00 |
| 14.9 | B | 10.0 | 19.33 | 16.8 | B | 14.0 | 27.00 |
| 15.0 | A | 13.5 | 26.50 | 16.9 | B | 13.0 | 26.00 |
| 15.0 | A | 13.5 | 26.50 | 17.0 | A | 11.0 | 21.33 |
| 15.0 | A | 13.5 | 26.50 | 17.0 | B | 11.0 | 21.33 |
| 15.0 | B | 13.5 | 26.50 | 17.0 | B | 11.0 | 21.33 |
| 15.2 | B | 16.5 | 32.50 | 17.1 | A | 9.0 | 18.00 |
| 15.2 | B | 16.5 | 32.50 | 17.2 | B | 8.0 | 15.00 |
| 15.5 | A | 19.0 | 37.67 | 17.6 | A | 7.0 | 14.00 |
| 15.5 | A | 19.0 | 37.67 | 18.0 | B | 6.0 | 11.00 |
| 15.5 | B | 19.0 | 37.67 | 18.1 | B | 5.0 | 10.00 |
| 15.6 | A | 22.0 | 43.33 | 18.2 | A | 3.5 | 6.50 |
| 15.6 | A | 22.0 | 43.33 | 18.2 | B | 3.5 | 6.50 |
| 15.6 | B | 22.0 | 43.33 | 18.5 | B | 2.0 | 3.00 |
| 15.7 | A | 24.0 | 48.00 | 19.2 | B | 1.0 | 2.00 |

Example 3.9.1 *Consider again the yarn data of Example 3.3.1. Test equality of dispersion between the two types of yarn. Ranks are given in Table 3.7, and are calculated in package* NonparametricHeuristic *as*

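```r
library(NonparametricHeuristic)  # supplies siegel.tukey.ranks
# Ansari-Bradley scores: rank from the outside in
yarn$ab <- pmin(rank(yarn$strength), rank(-yarn$strength))
# Siegel-Tukey scores, rounded for display
yarn$st <- round(siegel.tukey.ranks(yarn$strength), 2)
# Sort by strength and keep the columns shown in Table 3.7
yarnranks <- yarn[order(yarn$strength), c("strength", "type", "ab", "st")]
```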

*R functions may be used to perform the test.*

```r
library(DescTools)  # for SiegelTukeyTest
SiegelTukeyTest(strength ~ type, data = yarn)
# ansari.test is called here with one vector per yarn type
yarnsplit <- split(yarn$strength, yarn$type)
ansari.test(yarnsplit[[1]], yarnsplit[[2]])
```

*to find the Siegel-Tukey p-value as 0.7179, and the Ansari-Bradley p-value as 0.6786. There is no evidence of inequality of dispersion.*