# The Mann-Whitney-Wilcoxon Test

The Wilcoxon rank-sum test is defined as the general rank statistic of (3.7), with scores $a_j = j$ (Wilcoxon, 1945). That is, the statistic is the sum of ranks of observations coming from the second group. Denote this specific example of the general rank statistic as $T_W$. The term “rank-sum” is used to differentiate this test from another test proposed by the same author, to be discussed in a later chapter.

TABLE 3.3: Mood’s test for the yarn example

| | Type A | Type B | Total |
|---|---|---|---|
| Greater than Median | 9 | 15 | 24 |
| Less than Median | 15 | 9 | 24 |
| Total | 24 | 24 | 48 |

FIGURE 3.2: Boxplots of Yarn Strength (oz) by Type

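An alternative test statistic for detecting group differences is

$$T_U = \sum_{i=1}^{M_1} \sum_{j=1}^{M_2} I(X_i < Y_j). \tag{3.14}$$

This new statistic $T_U$ and the previous statistic $T_W$ can be shown to be equivalent, differing only by a constant. For $j \in \{1, \ldots, M_2\}$ indexing a member of the second sample, define $R_j$ as the rank of observation $j$ among all of the observations from the combined samples. Note that

$$T_W = \sum_{j=1}^{M_2} R_j = \sum_{j=1}^{M_2} \sum_{i=1}^{M_1} I(X_i < Y_j) + \sum_{j=1}^{M_2} \sum_{i=1}^{M_2} I(Y_i \le Y_j). \tag{3.15}$$

The first sum in (3.15) is $T_U$, and the second is $M_2(M_2+1)/2$, and so

$$T_W = T_U + M_2(M_2+1)/2.$$

The test based on $T_U$ is called the Mann-Whitney test (Mann and Whitney, 1947). This statistic is a U statistic; that is, a statistic formed by summing over pairs of observations in a data set.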
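The relationship between the rank-sum and pair-counting forms can be checked numerically. The following is a minimal sketch on hypothetical toy samples (the values of `x` and `y` are invented for illustration and are not from the yarn data); it assumes no ties, in line with the continuity assumption.

```r
# Hypothetical toy samples (not from the text); no ties, as continuity is assumed.
x <- c(1.2, 3.4, 2.2, 5.9)        # first sample, M1 = 4
y <- c(2.8, 4.1, 6.3)             # second sample, M2 = 3
M2 <- length(y)

# Wilcoxon rank-sum statistic: sum of combined-sample ranks of the second sample.
R <- rank(c(x, y))[-seq_along(x)]
T_W <- sum(R)

# Mann-Whitney statistic: number of (i, j) pairs with X_i < Y_j.
T_U <- sum(outer(x, y, "<"))

# The two statistics differ by the constant M2 (M2 + 1) / 2.
c(T_W = T_W, T_U = T_U, shift = M2 * (M2 + 1) / 2)
stopifnot(T_W == T_U + M2 * (M2 + 1) / 2)
```

Because the shift depends only on $M_2$, any test based on one statistic rejects exactly when the test based on the other does.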

## Exact and Approximate Mann-Whitney Probabilities

The distribution of test statistic (3.14) can be calculated exactly, via recursion (Festinger, 1946). Let $c_U(t, M_1, M_2)$ be the number of ways that $M_1$ symbols X and $M_2$ symbols Y can be written in a vector to give $T_U = t$, as in §3.3. Then

$$P[T_U = t] = c_U(t, M_1, M_2) \Big/ \binom{N}{M_1}.$$

The collection of vectors giving statistic value $t$ can be divided according to whether the last symbol is X or Y. If the last symbol was X, then ignoring this final value, the vector still gives the same statistic value $t$, and there are $c_U(t, M_1 - 1, M_2)$ such vectors. If the last symbol was Y, then ignoring this final value, the vector gives the statistic value $t - M_1$, and there are $c_U(t - M_1, M_1, M_2 - 1)$ such vectors. So

$$c_U(t, M_1, M_2) = c_U(t, M_1 - 1, M_2) + c_U(t - M_1, M_1, M_2 - 1). \tag{3.17}$$

The recursion stops once either sample size hits zero:

$$c_U(t, M_1, 0) = c_U(t, 0, M_2) = \begin{cases} 1 & \text{if } t = 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.18}$$

The maximal value for $t$ is $M_1 M_2$; hence the recursion can be stopped early by noting that

$$c_U(t, M_1, M_2) = 0 \quad \text{if } t < 0 \text{ or } t > M_1 M_2. \tag{3.19}$$

A natural way to perform these calculations is with recursive calls to a computer routine to calculate lower-order probabilities, although the algorithm can be implemented without such explicit recursion (Dinneen and Blakesley, 1973).
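The counting argument above might be sketched as a short recursive routine. The function name `count_u` is hypothetical (it is not taken from any package), and the plain recursion is meant only to mirror the argument, not to be efficient. The result is compared with `dwilcox` from base R, which gives the null distribution of the Mann-Whitney count.

```r
# Number of arrangements of M1 X's and M2 Y's giving Mann-Whitney statistic t.
# count_u is a hypothetical helper name, not a function from any package.
count_u <- function(t, M1, M2) {
  if (t < 0 || t > M1 * M2) return(0)                  # t outside [0, M1 * M2]
  if (M1 == 0 || M2 == 0) return(as.numeric(t == 0))   # one sample exhausted
  # Last symbol X: statistic unchanged. Last symbol Y: it exceeds all M1 X's.
  count_u(t, M1 - 1, M2) + count_u(t - M1, M1, M2 - 1)
}

M1 <- 4; M2 <- 4
counts <- sapply(0:(M1 * M2), count_u, M1 = M1, M2 = M2)
probs  <- counts / choose(M1 + M2, M1)

# Compare with base R's exact null distribution.
stopifnot(isTRUE(all.equal(probs, dwilcox(0:(M1 * M2), M1, M2))))
```

For larger samples the same counts are recomputed many times; caching intermediate results, or the non-recursive scheme cited above, would avoid this.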

### Moments and Approximate Normality

Using this recursion can be slow. Fortunately, a central limit theorem applies to this statistic (Erdős and Rényi, 1959), and the argument at the end of §3.2.1 can be used to show that the distribution of the test statistic is approximately Gaussian.

The Wilcoxon version of the Mann-Whitney-Wilcoxon statistic is given by (3.7) with $a_j = j$. Then $\sum_{j=1}^N a_j = \sum_{j=1}^N j = N(N+1)/2$, and $\bar a = (N+1)/2$. Hence $E_0[T_W]$ is $M_2(N+1)/2$, using (3.10).

In order to calculate $\mathrm{Var}_0[T_W]$ using (3.11), one needs $g(N) = \sum_{j=1}^N j^2$. One might guess it must be cubic in $N$. Examine functions $g(w) = aw^3 + bw^2 + cw + d$ so that $g(w) - g(w-1) = w^2$. Then $d = 0$, since $g(0) = 0$, and

$$g(w) - g(w-1) = 3a w^2 + (2b - 3a) w + (a - b + c) = w^2.$$

Equating quadratic terms above gives $a = 1/3$. Setting the linear term to zero gives $b = 1/2$, and setting the constant term to zero gives $c = 1/6$. Then

$$g(N) = N^3/3 + N^2/2 + N/6 = N(N+1)(2N+1)/6, \tag{3.21}$$

and from (3.10), (3.11) and (3.21),

$$E_0[T_W] = M_2(N+1)/2, \qquad \mathrm{Var}_0[T_W] = M_1 M_2 (N+1)/12. \tag{3.22}$$

In conjunction with the central limit theorem argument described above, one can test for equality of distributions, with critical values and p-values given by (3.8) and (3.9) respectively.
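The moment formulas can be checked by brute force for small samples. The sketch below assumes hypothetical sample sizes $M_1 = 3$ and $M_2 = 4$, enumerates every equally likely set of ranks for the second sample, and compares the resulting mean and variance of the rank sum with the closed forms.

```r
M1 <- 3; M2 <- 4; N <- M1 + M2   # hypothetical sample sizes for illustration

# Closed form for the sum of squares.
stopifnot(sum((1:N)^2) == N * (N + 1) * (2 * N + 1) / 6)

# Every equally likely set of ranks for the second sample under the null.
rank_sums <- colSums(combn(N, M2))

mean_exact <- mean(rank_sums)
var_exact  <- mean((rank_sums - mean_exact)^2)   # variance over all subsets

c(mean = mean_exact, formula = M2 * (N + 1) / 2)
c(var = var_exact, formula = M1 * M2 * (N + 1) / 12)
```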

Example 3.4.1 Refer again to the yarn data of Example 3.3.1. Consider yarn strengths for bobbin 3. Sum the ranks associated with Type B, to get $T_W = 1 + 2 + 4 + 6 = 13$. Here $M_1 = M_2 = 4$, and $N + 1 = M_1 + M_2 + 1 = 9$. From (3.22), under the null hypothesis of equality of distributions, the expected value of the rank sum is $4 \times 9/2 = 18$, and the variance is $4 \times 4 \times 9/12 = 12$. Hence the statistic, after standardizing to zero mean and unit variance, is $(13 - 18)/\sqrt{12} = -1.44$. The p-value is 0.149. This may be done using R by

    wilcox.test(strength~type, data=yarn[yarn$bobbin==3,], exact=FALSE, correct=FALSE)

The continuity-corrected p-value uses statistic $(13 + 0.5 - 18)/\sqrt{12}$, and is 0.194, and might be done by

    wilcox.test(strength~type, data=yarn[yarn$bobbin==3,], exact=FALSE)

Finally, p-values might be calculated exactly using (3.17), (3.18), and (3.19), and in R by

    wilcox.test(strength~type, data=yarn[yarn$bobbin==3,], exact=TRUE)
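The arithmetic in Example 3.4.1 might also be reproduced directly, without the yarn data, from the quoted values $T_W = 13$ and $M_1 = M_2 = 4$. This is a sketch only; in particular, the exact p-value below is an assumption about how to make the test two-sided (doubling the lower tail, which is reasonable here because the observed statistic lies below its null mean).

```r
T_W <- 13; M1 <- 4; M2 <- 4; N <- M1 + M2   # values quoted in Example 3.4.1

mu    <- M2 * (N + 1) / 2                 # null expectation of the rank sum
sigma <- sqrt(M1 * M2 * (N + 1) / 12)     # null standard deviation

z    <- (T_W - mu) / sigma
z_cc <- (T_W + 0.5 - mu) / sigma          # continuity correction moves T_W toward its mean

p_normal <- 2 * pnorm(-abs(z))
p_cc     <- 2 * pnorm(-abs(z_cc))

# Exact calculation on the Mann-Whitney scale, T_U = T_W - M2 (M2 + 1) / 2.
# Assumption: two-sided p-value taken as twice the lower tail, capped at 1.
T_U     <- T_W - M2 * (M2 + 1) / 2
p_exact <- min(1, 2 * pwilcox(T_U, M1, M2))

round(c(normal = p_normal, corrected = p_cc, exact = p_exact), 3)
```

The first two values should agree with the p-values 0.149 and 0.194 quoted in the example.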

Moments (3.22) apply to the statistic given by scores $a_j = j$. By contrast, the Mann-Whitney statistic $T_U$ is constructed using (3.7) from scores $a_j = j - (M_2+1)/2$. The variance of this statistic is still given by (3.22); the expectation is $E_0[T_U] = M_2(N+1)/2 - M_2(M_2+1)/2 = M_1 M_2/2$.

The Wilcoxon variance in (3.22) increases far more quickly than that of Mood’s test as the sample size N increases; relative to this variance, the continuity correction is quite small, and is of little importance.

## Other Scoring Schemes

One might construct tests using other scores $a_j$, and a variety of choices are available. One could use scores equal to the expected values of order statistics from a Gaussian distribution; these are called normal scores. Alternatively, one could use scores calculated from the Gaussian quantile function, $a_j = \Phi^{-1}(j/(N+1))$ (Waerden, 1952), called van der Waerden scores, or scores built from sums of reciprocals $i^{-1}$ (Savage, 1956), called Savage scores, or scores equal to the expected values of order statistics from an exponential distribution, called exponential scores. Van der Waerden scores are an approximation to normal scores. Calculating exact probabilities for general score tests, and the difficulties that this entails, was discussed at the end of §3.2.1.
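Some of these scoring schemes might be computed in base R as sketched below, for a hypothetical combined sample size $N = 10$. The normal scores are approximated by simulation rather than computed exactly, and the exponential scores use the standard result that the $j$th order statistic of $N$ unit exponentials has expectation $\sum_{i=N-j+1}^{N} 1/i$. Savage scores are omitted, since packages differ in how they center and orient them.

```r
N <- 10                                    # hypothetical combined sample size
j <- 1:N                                   # ranks in the combined sample

vdw_scores <- qnorm(j / (N + 1))           # van der Waerden scores
exp_scores <- cumsum(1 / (N:1))            # expected Exp(1) order statistics

# Monte Carlo approximation to normal scores (expected Gaussian order statistics).
set.seed(1)
normal_scores <- rowMeans(replicate(20000, sort(rnorm(N))))

round(cbind(rank = j, vdw = vdw_scores, normal = normal_scores,
            exponential = exp_scores), 3)
```

The van der Waerden and simulated normal scores are close in the middle ranks and differ most at the extremes.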

Scores may be chosen to be optimal for certain distributions. Normal scores are optimal for Gaussian observations. Exponential scores are optimal for exponential observations. Original ranks are optimal for logistic observations. Savage scores are optimal for Lehmann alternatives, discussed below at (3.28).

Example 3.4.2 Consider the nail arsenic data of Example 2.3.2. One might perform an analysis using these scoring methods.

    library(exactRankTests) # Gives savage and vw scores
    arsenic$savagenails <- as.numeric(cscores(arsenic$nails, type="Savage"))
    arsenic$vwnails <- as.numeric(cscores(arsenic$nails, type="Normal"))

The Savage scores are

0.603, 0.669, 0.850*, 0.669, -0.056*, -0.366, 0.902*, 0.371, -0.199, 0.794, 0.952, -1.149, -0.816*, -2.649, -1.649, 0.180, -0.566*, 0.454, 0.069*, 0.531*, 0.280*.

Asterisks denote men. The mean of these scores is $\bar a = -0.005$, and the mean of the squares is $\hat a = 0.823$. Test for equality of arsenic in nails between sexes. Here $M_1 = 8$ and $M_2 = 13$. The expectation and variance of the test statistic are given by (3.10) and (3.11), as $13 \bar a = -0.065$, and $13 \times 8 \times (\hat a - \bar a^2)/20 = 4.28$. Sum scores for women, the second gender group; here $k = 2$, and the statistic is $-1.32$, the sum of scores above without the asterisk. The z-statistic is $(-1.32 - (-0.065))/\sqrt{4.28} = -0.60$. The p-value is 0.548. Do not reject the null hypothesis. These calculations may be done using

    library(MultNonParam) # Contains genscorestat
    genscorestat(arsenic$vwnails, arsenic$sex)
    genscorestat(arsenic$savagenails, arsenic$sex)

giving the same results for Savage scores, and the p-value 0.7834 for van der Waerden scores.
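The hand calculation for the Savage scores in Example 3.4.2 might be reproduced in base R from the printed scores alone. This sketch assumes the scores as printed (rounded to three decimals) and takes the asterisked positions as men; because of the rounding, intermediate values can differ from those quoted in the text in the last digit.

```r
# Savage scores as printed in Example 3.4.2 (rounded to three decimals).
scores <- c(0.603, 0.669, 0.850, 0.669, -0.056, -0.366, 0.902,
            0.371, -0.199, 0.794, 0.952, -1.149, -0.816, -2.649,
            -1.649, 0.180, -0.566, 0.454, 0.069, 0.531, 0.280)
men <- c(3, 5, 7, 13, 17, 19, 20, 21)    # positions printed with an asterisk

N <- length(scores); M1 <- length(men); M2 <- N - M1

a_bar <- mean(scores)                    # mean of the scores
a_hat <- mean(scores^2)                  # mean of the squared scores

stat     <- sum(scores[-men])            # sum of scores for women (second group)
expected <- M2 * a_bar
variance <- M1 * M2 * (a_hat - a_bar^2) / (N - 1)

z <- (stat - expected) / sqrt(variance)
p <- 2 * pnorm(-abs(z))
round(c(stat = stat, expected = expected, variance = variance, z = z, p = p), 3)
```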

## Using Data as Scores: the Permutation Test

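One might instead use the original data as scores. That is, sort the combined data set $(X_1, \ldots, X_{M_1}, Y_1, \ldots, Y_{M_2})$ to obtain $(Z_{(1)}, \ldots, Z_{(N)})$, with $Z_{(i)} < Z_{(i+1)}$ for all $i$; still assuming continuity, each inequality is strict. Then use $a_j = Z_{(j)}$. Hence the test statistic is

$$T_P = \sum_{j=1}^{M_2} Y_j. \tag{3.23}$$

The analysis is performed conditionally on $(Z_{(1)}, \ldots, Z_{(N)})$; note that both the statistic, and its reference distribution, depend on these order statistics. Compare $T_P$ with the numerator of the two-sample pooled t-test (3.2):

$$\bar Y - \bar X = \frac{T_P}{M_2} - \frac{N \bar Z - T_P}{M_1} = \frac{N (T_P - M_2 \bar Z)}{M_1 M_2},$$

where $\bar Z = \sum_{i=1}^N Z_{(i)}/N$. The pooled variance estimate for the two-sample t statistic is

$$s_p^2 = \frac{\sum_{i=1}^{M_1} (X_i - \bar X)^2 + \sum_{j=1}^{M_2} (Y_j - \bar Y)^2}{N - 2}.$$

Some algebra shows this to be

$$s_p^2 = \frac{(N-1) s_Z^2 - N (T_P - M_2 \bar Z)^2/(M_1 M_2)}{N - 2}.$$

Hence, conditional on $(Z_{(1)}, \ldots, Z_{(N)})$, the two-sample pooled t statistic is

$$t = \frac{\sqrt{N-2}\,(T_P - M_2 \bar Z)}{\sqrt{(N-1) s_Z^2 M_1 M_2/N - (T_P - M_2 \bar Z)^2}},$$

for $s_Z$ the sample standard deviation of $(Z_{(1)}, \ldots, Z_{(N)})$.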

Hence the pooled two-sample t statistic is a strictly increasing function of the score statistic $T_P$ with ordered data used as scores. However, while the pooled t statistic is typically compared to a $t$ distribution, the rank statistic is compared to the distribution of values arising from random permutations of the group labels; this is the same mechanism that generates the distribution for the rank statistics with scores determined in advance. In the two-sample case, there are $\binom{N}{M_1}$ ways to assign $M_1$ labels 1, and $M_2$ labels 2, to the order statistics $(Z_{(1)}, \ldots, Z_{(N)})$. A less efficient way to think of this process is to specify $N$ labels, the first $M_1$ of them 1 and the remaining $M_2$ of them 2, and randomly assign, or randomly permute, $(Z_{(1)}, \ldots, Z_{(N)})$ without replacement; there are $N!$ such assignments, leading to at most $\binom{N}{M_1}$ distinct values. The observed value of $T_P$ is then compared with the sampling distribution arising from this random permutation of values; such a test is called a permutation test. The same permutation concept coincides with the desired reference distribution for all of the rank statistics in this chapter.
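The monotone relationship between the pooled t statistic and $T_P$ can be illustrated by enumerating every relabeling of a small data set. The sketch below uses hypothetical pooled values (invented for illustration, with no ties) and base R's `t.test` with `var.equal = TRUE`; a small numerical tolerance is allowed because different relabelings can share the same value of $T_P$.

```r
# Hypothetical toy data (not from the text): pooled values with no ties.
z  <- c(0.3, 1.1, 1.9, 2.4, 3.8, 4.0, 5.7)
M2 <- 3; N <- length(z); M1 <- N - M2

groups2 <- combn(N, M2)                  # each column: indices labelled as group 2
stopifnot(ncol(groups2) == choose(N, M1))

T_P <- apply(groups2, 2, function(idx) sum(z[idx]))
t_stat <- apply(groups2, 2, function(idx) {
  # Pooled two-sample t statistic, oriented as (group 2 mean) - (group 1 mean).
  unname(t.test(z[idx], z[-idx], var.equal = TRUE)$statistic)
})

# Sorting relabelings by T_P also sorts the t statistics (up to rounding error).
o <- order(T_P)
stopifnot(all(diff(t_stat[o]) > -1e-9))
```

Since the ordering is the same, the permutation distributions of $T_P$ and of the pooled t statistic lead to the same one-sided p-values.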

Example 3.4.3 Again consider the nail arsenic data of Example 2.3.2. Recall that there are 21 subjects in this data set, of whom 8 are male. The permutation test testing the null hypothesis of equality of distribution across gender may be performed in R using

    library(MultNonParam)
    aov.P(dattab=arsenic$nails, treatment=arsenic$sex)

to give a two-sided p-value of 0.482. In this case, all $\binom{21}{8} = 203490$ ways to reassign arsenic nail levels to the various groups were considered. The statistic $T_P$ of (3.23) was calculated for each assignment, this value was subtracted from the null expectation $M_2 \bar Z$, and the difference was squared to provide a two-sided statistic. The p-value reported is the proportion of reassignments for which the squared difference meets or exceeds that seen in the original data.

TABLE 3.4: Levels for various Two-Sample Two-Sided Tests, Nominal level 0.05, from 100,000 random data sets each, sample size 10 each

| Test | Gaussian | Laplace | Cauchy |
|---|---|---|---|
| T-test | 0.04815 | 0.04414 | 0.01770 |
| Exact Wilcoxon | 0.04231 | 0.04413 | 0.04424 |
| Approximate Wilcoxon | 0.05134 | 0.05317 | 0.05318 |
| Normal Scores | 0.04693 | 0.04744 | 0.04871 |
| Savage Scores | 0.04191 | 0.04340 | 0.04319 |
| Mood | 0.02198 | 0.02314 | 0.02314 |

TABLE 3.5: Powers for various Two-Sample Two-Sided Tests, Nominal level 0.05, from 100,000 random data sets each, sample size 10 each, samples offset by one unit

| Test | Gaussian | Laplace | Cauchy |
|---|---|---|---|
| T-test | 0.55445 | 0.35116 | 0.06368 |
| Approximate Wilcoxon | 0.54661 | 0.41968 | 0.20765 |
| Normal Scores | 0.53442 | 0.37277 | 0.16978 |
| Savage Scores | 0.47270 | 0.33016 | 0.15225 |
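The two-sided permutation p-value described in Example 3.4.3 might be sketched in base R as follows. The function `perm_test_2s` is a hypothetical helper written for illustration; it is not the `MultNonParam` implementation, and the toy data are invented rather than taken from the arsenic data set. It enumerates all relabelings, so it is practical only when $\binom{N}{M_2}$ is moderate.

```r
# perm_test_2s is a hypothetical helper, not the MultNonParam implementation.
perm_test_2s <- function(values, group2) {
  N  <- length(values)
  M2 <- sum(group2)
  null_mean <- M2 * mean(values)                  # null expectation of T_P
  observed  <- (sum(values[group2]) - null_mean)^2
  # T_P for every way of choosing which M2 values form the second group.
  all_T <- combn(N, M2, FUN = function(idx) sum(values[idx]))
  # Small tolerance guards against floating-point ties with the observed value.
  mean((all_T - null_mean)^2 >= observed - 1e-12)
}

# Hypothetical toy data (not the arsenic data).
values <- c(0.3, 1.1, 1.9, 2.4, 3.8, 4.0, 5.7)
group2 <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
perm_test_2s(values, group2)
```

For larger samples, one common alternative is to evaluate the statistic on a random sample of relabelings rather than on all of them.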