The Mann-Whitney-Wilcoxon Test
The Wilcoxon rank-sum test is defined as the general rank statistic of (3.7), with scores $a_j = j$ (Wilcoxon, 1945). That is, the statistic is the sum of the ranks of the observations coming from the second group. Denote this specific example of the general rank statistic as $T_W$. The term "rank-sum" is used to differentiate this test from another test proposed by the same author, to be discussed in a later chapter.

TABLE 3.3: Mood's test for the yarn example

|                     | Type A | Type B | Total |
|---------------------|--------|--------|-------|
| Greater than Median | 9      | 15     | 24    |
| Less than Median    | 15     | 9      | 24    |
| Total               | 24     | 24     | 48    |

FIGURE 3.2: Boxplots of Yarn Strength (oz) by Type
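An alternative test statistic for detecting group differences is

$$T_U = \sum_{i=1}^{M_1} \sum_{j=1}^{M_2} I(X_i < Y_j). \tag{3.14}$$

This new statistic $T_U$ and the previous statistic $T_W$ can be shown to be equivalent, differing only by a constant. For $j \in \{1, \ldots, M_2\}$ indexing a member of the second sample, define $R_j$ as the rank of observation $j$ among all of the observations from the combined samples. Note that

$$T_W = \sum_{j=1}^{M_2} R_j = \sum_{j=1}^{M_2} \sum_{i=1}^{M_1} I(X_i < Y_j) + \sum_{j=1}^{M_2} \sum_{k=1}^{M_2} I(Y_k \le Y_j). \tag{3.15}$$

The first sum in (3.15) is $T_U$, and the second is $M_2(M_2 + 1)/2$, and so

$$T_W = T_U + M_2(M_2 + 1)/2.$$

The test based on $T_U$ is called the Mann-Whitney test (Mann and Whitney, 1947). This statistic is a U-statistic; that is, a statistic formed by summing over pairs of observations in a data set.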
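The relationship between the rank-sum and pairwise-comparison forms can be checked numerically. The following is a minimal sketch in base R; the data values are hypothetical and chosen only so that there are no ties, and the variable names are illustrative.

```r
# Hypothetical data, chosen only for illustration (no ties).
x <- c(1.2, 3.4, 2.2, 5.1)   # first sample,  M1 = 4
y <- c(2.9, 4.8, 6.3)        # second sample, M2 = 3
M1 <- length(x)
M2 <- length(y)

# Wilcoxon rank-sum form: sum of the pooled ranks of the second sample.
pooled_ranks <- rank(c(x, y))
T_W <- sum(pooled_ranks[(M1 + 1):(M1 + M2)])

# Mann-Whitney form: number of (X_i, Y_j) pairs with X_i < Y_j.
T_U <- sum(outer(x, y, "<"))

# The two statistics differ by the constant M2 * (M2 + 1) / 2.
c(T_W = T_W, T_U = T_U, offset = M2 * (M2 + 1) / 2)
```

Because the offset depends only on $M_2$, tests based on either statistic reject for exactly the same data sets.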
Exact and Approximate Mann-Whitney Probabilities
The distribution of the test statistic (3.14) can be calculated exactly, via recursion (Festinger, 1946). Let $c_U(t, M_1, M_2)$ be the number of ways that $M_1$ symbols X and $M_2$ symbols Y can be written in a vector to give $T_U = t$, as in §3.3. Then

$$P_0[T_U = t] = c_U(t, M_1, M_2) \Big/ \binom{M_1 + M_2}{M_1}.$$

The collection of vectors giving statistic value $t$ can be divided according to whether the last symbol is X or Y. If the last symbol is X, then ignoring this final value, the vector still gives the same statistic value $t$, and there are $c_U(t, M_1 - 1, M_2)$ such vectors. If the last symbol is Y, then ignoring this final value, the vector gives the statistic value $t - M_1$, and there are $c_U(t - M_1, M_1, M_2 - 1)$ such vectors. So

$$c_U(t, M_1, M_2) = c_U(t, M_1 - 1, M_2) + c_U(t - M_1, M_1, M_2 - 1).$$

The recursion stops once either sample size hits zero:

$$c_U(t, M_1, 0) = c_U(t, 0, M_2) = \begin{cases} 1 & \text{if } t = 0, \\ 0 & \text{otherwise.} \end{cases}$$

The maximal value for $t$ is $M_1 M_2$; hence the recursion can be stopped early by noting that

$$c_U(t, M_1, M_2) = 0 \quad \text{for } t < 0 \text{ or } t > M_1 M_2.$$

A natural way to perform these calculations is with recursive calls to a computer routine to calculate lower-order probabilities, although the algorithm can be implemented without such explicit recursion (Dinneen and Blakesley, 1973).
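The counting recursion described above can be sketched with explicit recursive calls. This is an illustrative base R implementation, not the routine of any cited author; the function name `count_u` is hypothetical, and the statistic is assumed to count pairs in which an X precedes a Y. The built-in `dwilcox` serves as a cross-check.

```r
# Number of arrangements of M1 X's and M2 Y's giving pair count t.
count_u <- function(t, M1, M2) {
  if (t < 0 || t > M1 * M2) return(0)                 # outside the attainable range
  if (M1 == 0 || M2 == 0) return(as.numeric(t == 0))  # a single arrangement, count 0
  count_u(t, M1 - 1, M2) +        # last symbol is X: statistic unchanged
    count_u(t - M1, M1, M2 - 1)   # last symbol is Y: statistic drops by M1
}

# Exact null distribution for M1 = M2 = 4.
M1 <- 4
M2 <- 4
t_values <- 0:(M1 * M2)
counts <- sapply(t_values, count_u, M1 = M1, M2 = M2)
probs <- counts / choose(M1 + M2, M1)

# Cross-check against the distribution function shipped with R.
all.equal(probs, dwilcox(t_values, M1, M2))

# Two-sided exact p-value for an observed value of 3 (by symmetry, twice the lower tail).
p_exact <- 2 * sum(probs[t_values <= 3])
p_exact
```

Plain recursion recomputes the same counts many times, so for larger samples one would cache intermediate results or use an iterative scheme.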
Moments and Approximate Normality
Using this recursion can be slow, and the argument at the end of §3.2.1 can be used to show that the distribution of the test statistic is approximately Gaussian. Fortunately, a central limit theorem applies to this statistic (Erdős and Rényi, 1959).
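The Wilcoxon version of the Mann-Whitney-Wilcoxon statistic is given by (3.7) with $a_j = j$. Then $\sum_{j=1}^{N} a_j = \sum_{j=1}^{N} j = N(N + 1)/2$, and $\bar{a} = (N + 1)/2$. Hence $E_0[T_W]$ is $M_2(N + 1)/2$, using (3.10).
In order to calculate $\mathrm{Var}_0[T_W]$ using (3.11), one needs $g(N) = \sum_{j=1}^{N} j^2$. One might guess it must be cubic in $N$. Examine functions $g(w) = aw^3 + bw^2 + cw + d$ so that $g(w) - g(w - 1) = w^2$. Then $d = 0$, and

$$g(w) - g(w - 1) = 3aw^2 + (2b - 3a)w + (a - b + c).$$

Equating quadratic terms above gives $a = 1/3$. Setting the linear term to zero gives $b = 1/2$, and setting the constant term to zero gives $c = 1/6$. Then

$$\sum_{j=1}^{N} j^2 = N^3/3 + N^2/2 + N/6 = N(N + 1)(2N + 1)/6, \tag{3.21}$$

and from (3.10), (3.11) and (3.21),

$$E_0[T_W] = M_2(N + 1)/2, \qquad \mathrm{Var}_0[T_W] = M_1 M_2 (N + 1)/12. \tag{3.22}$$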
In conjunction with the central limit theorem argument described above, one can test for equality of distributions, with critical values and p-values given by (3.8) and (3.9) respectively.
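The null moments of the rank sum, $M_2(N + 1)/2$ and $M_1 M_2(N + 1)/12$, can be checked by brute force for small samples. The sketch below assumes no ties and enumerates every equally likely choice of ranks for the second sample; the sample sizes are illustrative.

```r
M1 <- 4
M2 <- 4
N <- M1 + M2

# Each column of combn(N, M2) is one possible set of ranks for the second sample.
rank_sums <- colSums(combn(N, M2))

# Moments over all equally likely assignments (divide by the count, not count - 1).
exact_mean <- mean(rank_sums)
exact_var <- mean((rank_sums - exact_mean)^2)

# Compare the enumerated moments with the closed forms.
rbind(enumerated = c(mean = exact_mean, var = exact_var),
      formula = c(mean = M2 * (N + 1) / 2, var = M1 * M2 * (N + 1) / 12))
```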
Example 3.4.1 Refer again to the yarn data of Example 3.3.1. Consider yarn strengths for bobbin 3.
Sum the ranks associated with Type B, to get $T_W = 1 + 2 + 4 + 6 = 13$. Here $M_1 = M_2 = 4$, and $N + 1 = M_1 + M_2 + 1 = 9$. From (3.22), under the null hypothesis of equality of distributions, the expected value of the rank sum is $4 \times 9/2 = 18$, and the variance is $4 \times 4 \times 9/12 = 12$. Hence the statistic, after standardizing to zero mean and unit variance, is $(13 - 18)/\sqrt{12} = -1.44$. The p-value is 0.149. This may be done using R by
wilcox.test(strength~type, data=yarn[yarn$bobbin==3,], exact=FALSE, correct=FALSE)
The continuity-corrected p-value uses statistic $(13 + 0.5 - 18)/\sqrt{12}$, and is 0.194, and might be done by
wilcox.test(strength~type, data=yarn[yarn$bobbin==3,], exact=FALSE)
Finally, p-values might be calculated exactly using (3.17), (3.18), and (3.19), and in R by
wilcox.test(strength~type, data=yarn[yarn$bobbin==3,], exact=TRUE)
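The normal-approximation figures in this example can also be reproduced by hand, without the `yarn` data frame. The sketch below uses only the rank sum and sample sizes quoted in the example; the variable names are illustrative.

```r
M1 <- 4
M2 <- 4
N <- M1 + M2
T_W <- 1 + 2 + 4 + 6                     # rank sum for Type B, from the example

mu <- M2 * (N + 1) / 2                   # null expectation, 18
sigma <- sqrt(M1 * M2 * (N + 1) / 12)    # null standard deviation, sqrt(12)

# Uncorrected standardized statistic and two-sided p-value.
z <- (T_W - mu) / sigma
p <- 2 * pnorm(-abs(z))

# Continuity correction: move the statistic half a unit toward its expectation.
z_cc <- (T_W + 0.5 - mu) / sigma
p_cc <- 2 * pnorm(-abs(z_cc))

round(c(z = z, p = p, z_cc = z_cc, p_cc = p_cc), 3)
```

The adjustment `+ 0.5` is specific to an observed value below the null expectation; for an observed value above it, one would subtract 0.5 instead.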
Moments (3.22) apply to the statistic given by scores $a_j = j$. By contrast, the Mann-Whitney statistic $T_U$ is constructed using (3.7) from scores $a_j = j - (M_2 + 1)/2$. The variance of this statistic is still given by (3.22); the expectation is $E_0[T_U] = M_2(N + 1)/2 - M_2(M_2 + 1)/2 = M_1 M_2/2$.
The Wilcoxon variance in (3.22) increases far more quickly than that of Mood’s test as the sample size N increases; relative to this variance, the continuity correction is quite small, and is of little importance.
Other Scoring Schemes
One might construct tests using other scores $a_j$, and a variety of choices are available. One could use scores equal to the expected values of order statistics from a Gaussian distribution; these are called normal scores. Alternatively, one could use scores calculated from the Gaussian quantile function, $a_j = \Phi^{-1}(j/(N + 1))$ (van der Waerden, 1952), called van der Waerden scores, or scores of the form $a_j = \sum_{i=N-j+1}^{N} i^{-1}$ (Savage, 1956), called Savage scores, or scores equal to the expected values of order statistics from an exponential distribution, called exponential scores. Van der Waerden scores are an approximation to normal scores. Calculating exact probabilities for general score tests, and the difficulties that this entails, was discussed at the end of §3.2.1.
Scores may be chosen to be optimal for certain distributions. Normal scores are optimal for Gaussian observations. Exponential scores are optimal for exponential observations. Original ranks are optimal for logistic observations. Savage scores are optimal for Lehmann alternatives, discussed below at (3.28).
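Two of these scoring schemes are simple to compute directly. The following base R sketch assumes the van der Waerden form $\Phi^{-1}(j/(N + 1))$ and the uncentered Savage form $\sum_{i=N-j+1}^{N} 1/i$; packages may center or negate Savage scores, so signs and offsets can differ from what is shown here. The sample size is illustrative.

```r
N <- 10                 # illustrative pooled sample size (an assumption)
j <- seq_len(N)         # ranks 1, ..., N

# Van der Waerden scores: Gaussian quantiles at j / (N + 1).
vdw_scores <- qnorm(j / (N + 1))

# Savage scores: the j-th score is 1/N + 1/(N - 1) + ... + 1/(N - j + 1).
savage_scores <- cumsum(1 / (N:1))

round(cbind(rank = j, vdw = vdw_scores, savage = savage_scores), 3)
```

Van der Waerden scores are symmetric about zero, while the uncentered Savage scores have mean one and grow quickly for the largest ranks, which is why they emphasize differences in the upper tail.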
Example 3.4.2 Consider the nail arsenic data of Example 2.3.2. One might perform an analysis using these scoring methods.
library(exactRankTests) # Gives savage and vw scores
arsenic$savagenails <- as.numeric(cscores(arsenic$nails, type="Savage"))
arsenic$vwnails <- as.numeric(cscores(arsenic$nails, type="Normal"))
The Savage scores are
0.603, 0.669, 0.850*, 0.669, -0.056*, -0.366, 0.902*,
0.371, -0.199, 0.794, 0.952, -1.149, -0.816*, -2.649,
-1.649, 0.180, -0.566*, 0.454, 0.069*, 0.531*, 0.280*.
Asterisks denote men. The mean of these scores is $\bar{a} = -0.005$, and the mean of the squares is $\hat{a} = 0.823$. Test for equality of arsenic in nails between sexes. Here $M_1 = 8$ and $M_2 = 13$. The expectation and variance of the test statistic are given by (3.10) and (3.11), as $13\bar{a} = -0.065$ and $13 \times 8 \times (\hat{a} - \bar{a}^2)/20 = 4.28$. Sum scores for women, the second gender group; here $k = 2$, and the statistic of (3.7) is $-1.32$, the sum of the scores above without an asterisk. The z-statistic is $(-1.32 - (-0.065))/\sqrt{4.28} = -0.60$. The p-value is 0.548. Do not reject the null hypothesis. These calculations may be done using
library(MultNonParam) # Contains genscorestat
genscorestat(arsenic$vwnails, arsenic$sex)
genscorestat(arsenic$savagenails, arsenic$sex)
giving the same results for Savage scores, and the p-value 0.7834 for van der Waerden scores.
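The hand calculation in this example extends to any score vector. The sketch below is an illustrative base R helper, assuming null expectation $M_2\bar{a}$ and null variance $M_1 M_2(\hat{a} - \bar{a}^2)/(N - 1)$ with $\hat{a}$ the mean of the squared scores; the name `score_test` and its arguments are hypothetical and are not part of MultNonParam.

```r
# General two-sample score test using the normal approximation.
# scores:    one score per observation in the pooled sample
# in_group2: logical vector, TRUE for members of the second group
score_test <- function(scores, in_group2) {
  N <- length(scores)
  M2 <- sum(in_group2)
  M1 <- N - M2
  a_bar <- mean(scores)              # mean of the scores
  a_hat <- mean(scores^2)            # mean of the squared scores
  stat <- sum(scores[in_group2])     # sum of scores in the second group
  e0 <- M2 * a_bar                   # null expectation
  v0 <- M1 * M2 * (a_hat - a_bar^2) / (N - 1)   # null variance
  z <- (stat - e0) / sqrt(v0)
  c(statistic = stat, expectation = e0, variance = v0,
    z = z, p.value = 2 * pnorm(-abs(z)))
}

# Check with Wilcoxon scores a_j = j: second group holds ranks 1, 2, 4, 6 of 8.
res <- score_test(1:8, c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE))
round(res, 3)
```

With ranks as scores the helper returns expectation 18 and variance 12, matching the Wilcoxon moments for $M_1 = M_2 = 4$.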
Using Data as Scores: the Permutation Test
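One might instead use the original data as scores. That is, sort the combined data set $(X_1, \ldots, X_{M_1}, Y_1, \ldots, Y_{M_2})$ to obtain $(Z_{(1)}, \ldots, Z_{(N)})$, with $Z_{(i)} < Z_{(i+1)}$ for all $i$; still assuming continuity, each inequality is strict. Then use $a_j = Z_{(j)}$. Hence the test statistic is

$$T_P = \sum_{j=1}^{M_2} Y_j. \tag{3.23}$$

The analysis is performed conditionally on $(Z_{(1)}, \ldots, Z_{(N)})$; note that both the statistic and its reference distribution depend on these order statistics. Compare $T_P$ with the numerator of the two-sample pooled t-test (3.2):

$$\bar{Y} - \bar{X} = \frac{N (T_P - M_2 \bar{Z})}{M_1 M_2},$$

where $\bar{Z} = \sum_{i=1}^{N} Z_{(i)}/N$. The pooled variance estimate for the two-sample t statistic is

$$s_p^2 = \frac{\sum_{i=1}^{M_1} (X_i - \bar{X})^2 + \sum_{j=1}^{M_2} (Y_j - \bar{Y})^2}{N - 2}.$$

Some algebra shows this to be

$$s_p^2 = \frac{(N - 1) s_Z^2 - N (T_P - M_2 \bar{Z})^2/(M_1 M_2)}{N - 2}.$$

Hence, conditional on $(Z_{(1)}, \ldots, Z_{(N)})$, the two-sample pooled t statistic is

$$t = \frac{\bar{Y} - \bar{X}}{s_p \sqrt{1/M_1 + 1/M_2}} = \frac{\sqrt{N - 2}\,(T_P - M_2 \bar{Z})}{\sqrt{(N - 1) s_Z^2 M_1 M_2/N - (T_P - M_2 \bar{Z})^2}},$$

for $s_Z$ the sample standard deviation of $(Z_{(1)}, \ldots, Z_{(N)})$.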
Hence the pooled two-sample t statistic is a strictly increasing function of the score statistic $T_P$ with ordered data used as scores. However, while the pooled t statistic is typically compared to a t distribution, the rank statistic is compared to the distribution of values arising from random permutations of the group labels; this is the same mechanism that generates the distribution for the rank statistics with scores determined in advance. In the two-sample case, there are $\binom{N}{M_1}$ ways to assign $M_1$ labels 1, and $M_2$ labels 2, to the order statistics $(Z_{(1)}, \ldots, Z_{(N)})$. A less efficient way to think of this process is to specify $N$ labels, the first $M_1$ of them 1 and the remaining $M_2$ of them 2, and randomly assign, or randomly permute, $(Z_{(1)}, \ldots, Z_{(N)})$ without replacement; there are $N!$ such assignments, leading to at most $\binom{N}{M_1}$ distinct values. The observed value of $T_P$ is then compared with the sampling distribution arising from this random permutation of values; such a test is called a permutation test. The same permutation concept coincides with the desired reference distribution for all of the rank statistics in this chapter.
Example 3.4.3 Again consider the nail arsenic data of Example 2.3.2. Recall that there are 21 subjects in this data set, of whom 8 are male. The permutation test testing the null hypothesis of equality of distribution across gender may be performed in R using
library(MultNonParam)
aov.P(dattab=arsenic$nails,treatment=arsenic$sex)
to give a two-sided p-value of 0.482. In this case, all $\binom{21}{8} = 203490$ ways to reassign arsenic nail levels to the various groups were considered. The statistic $T_P$ of (3.23) was calculated for each assignment, this value was subtracted from its null expectation $M_2 \bar{Z}$, and the difference was squared to provide a two-sided statistic. The p-value reported is the proportion of reassignments for which the squared difference meets or exceeds that seen in the original data.

TABLE 3.4: Levels for various Two-Sample Two-Sided Tests, Nominal level 0.05, from 100,000 random data sets each, sample size 10 each

| Test                 | Gaussian | Laplace | Cauchy  |
|----------------------|----------|---------|---------|
| T-test               | 0.04815  | 0.04414 | 0.01770 |
| Exact Wilcoxon       | 0.04231  | 0.04413 | 0.04424 |
| Approximate Wilcoxon | 0.05134  | 0.05317 | 0.05318 |
| Normal Scores        | 0.04693  | 0.04744 | 0.04871 |
| Savage Scores        | 0.04191  | 0.04340 | 0.04319 |
| Mood                 | 0.02198  | 0.02314 | 0.02314 |

TABLE 3.5: Powers for various Two-Sample Two-Sided Tests, Nominal level 0.05, from 100,000 random data sets each, sample size 10 each, samples offset by one unit

| Test                 | Gaussian | Laplace | Cauchy  |
|----------------------|----------|---------|---------|
| T-test               | 0.55445  | 0.35116 | 0.06368 |
| Approximate Wilcoxon | 0.54661  | 0.41968 | 0.20765 |
| Normal Scores        | 0.53442  | 0.37277 | 0.16978 |
| Savage Scores        | 0.47270  | 0.33016 | 0.15225 |
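For small samples, the full-enumeration permutation test described in this example can be sketched in base R. The data below are hypothetical, the variable names are illustrative, and the code is not the implementation used by `aov.P`; it assumes the statistic is the sum of the second-group observations with null expectation $M_2 \bar{Z}$.

```r
# Hypothetical data, for illustration only.
x <- c(1, 2, 3, 4)         # first group
y <- c(3.5, 5, 6, 7)       # second group
z <- c(x, y)               # pooled data
N <- length(z)
M2 <- length(y)

# Statistic for every way of choosing which M2 pooled values form group 2.
T_P_all <- combn(N, M2, FUN = function(k) sum(z[k]))
T_P_obs <- sum(y)

# Two-sided comparison: squared distance from the null expectation M2 * mean(z).
null_mean <- M2 * mean(z)
p_value <- mean((T_P_all - null_mean)^2 >= (T_P_obs - null_mean)^2)
p_value
```

Enumeration is feasible here because there are only `choose(8, 4)` assignments; for larger samples one would typically draw a random subset of permutations instead.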