# Confidence Regions for a Vector Shift Parameter

Proceed analogously to the one-dimensional confidence interval construction of §2.3.3. Introduce a shift parameter to move the data to an arbitrary point null hypothesis. In this one-sample case, one may apply the one-sample test for the null hypothesis that the marginal medians are 0 to the shifted data *X* − 1_{n} ⊗ *μ*, where 1_{n} is a vector of ones of length *n*, ⊗ is the outer product, and 1_{n} ⊗ *μ* is the matrix with entry *μ*_{j} in column *j* for all rows. That is, calculate *T*(*X* − 1_{n} ⊗ *μ*) from the data set *X* − 1_{n} ⊗ *μ*. Calculate the variance matrix ((7.9) for the multivariate sign test) from the shifted data set. Calculate *W*(*X* − 1_{n} ⊗ *μ*) from (7.10). Then the random set

𝒯 = {*μ* : *W*(*X* − 1_{n} ⊗ *μ*) ≤ χ²_{J,1−α}}

satisfies P[*μ* ∈ 𝒯] = 1 − *α*, using the test inversion argument of (1.16). This is the one-sample case of the region proposed by Kolassa and Seifu (2013).
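The test inversion construction can be sketched directly: for each candidate *μ* on a grid, compute the sign statistic and its variance estimate from the shifted data, form the quadratic form, and keep *μ* when the test fails to reject. The sketch below is illustrative only, assuming a simple form of (7.9) and (7.10) (per-variable sign sums and their empirical cross-products); the `shifter` function of MultNonParam used in Example 7.4.1 performs the actual construction.

```r
# Sketch: invert a multivariate sign test over a grid of candidate
# medians.  W is taken as the quadratic form of per-variable sign sums
# in a generalized inverse of their empirical cross-product matrix;
# check this against (7.9) and (7.10) before relying on it.
library(MASS)  # for ginv, in case the cross-product matrix is singular

sign.region <- function(X, mu1, mu2, alpha = 0.05) {
  crit <- qchisq(1 - alpha, df = ncol(X))
  accept <- matrix(NA, length(mu1), length(mu2))
  for (a in seq_along(mu1)) for (b in seq_along(mu2)) {
    S <- sign(sweep(X, 2, c(mu1[a], mu2[b])))  # signs of shifted data
    Tstat <- colSums(S)                        # multivariate sign statistic
    V <- t(S) %*% S                            # estimated null variance
    W <- drop(t(Tstat) %*% ginv(V) %*% Tstat)
    accept[a, b] <- W <= crit                  # keep mu if not rejected
  }
  accept
}

# Small worked example with an artificial bivariate sample:
X <- cbind(rep(1:10, 2), rep(1:10, each = 2))
grid <- seq(3, 8, by = 0.5)
region <- sign.region(X, grid, grid)
region[grid == 5.5, grid == 5.5]  # TRUE: the marginal medians lie in the region
```

The resulting logical matrix plays the same role as the contour in Figure 7.2; a finer grid traces the region boundary more precisely.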

Example 7.4.1 *Recall again the blood pressure data set of Example 6.4.2. Figure 7.2 is generated by*

library(MultNonParam); shifter(bp[,c("dpd","spd")])

*and exhibits the 0.05 contour of p-values for the multivariate test constructed from sign rank tests for each of systolic and diastolic blood pressure, and forms a 95% confidence region. Note the lack of convexity.*

# Two-Sample Methods

Two-sample methods are generally of more interest than the preceding one-sample methods. Consider a multivariate dataset *X*_{ij} for *i* ∈ {1, ..., *M*_{1} + *M*_{2}} and *j* ∈ {1, ..., *J*}, with data from the first sample occupying the first *M*_{1} rows of this matrix, and data from the second sample occupying the last *M*_{2} rows. Assume that the vectors (*X*_{i1}, ..., *X*_{iJ}) and (*X*_{i′1}, ..., *X*_{i′J}) are independent if *i* ≠ *i′*. Assume further that the vectors (*X*_{i1}, ..., *X*_{iJ}) all

FIGURE 7.2: Median Blood Pressure Change Confidence Region

have the same distribution for *i* ≤ *M*_{1}, and that the vectors (*X*_{i1}, ..., *X*_{iJ}) all have the same distribution for *i* > *M*_{1}. Let *g* be a vector indicating group membership: *g*_{i} = 1 if *i* ≤ *M*_{1}, and *g*_{i} = 2 if *i* > *M*_{1}. As in §7.1, consider testing and confidence interval questions.

## Hypothesis Testing

Combine the techniques for one-dimensional two-sample testing of Chapter 3 and for multi-dimensional one-sample testing of §7.1. Consider the null hypothesis *H*_{0} that the distribution of (*X*_{i1}, ..., *X*_{iJ}) is the same for all *i*.

### Permutation Testing

Under the null hypothesis of equality of distributions across the two groups, all assignments of the observed vectors to the two groups that keep the sizes of groups 1 and 2 at *M*_{1} and *M*_{2} respectively are equally likely. Hence a permutation test may be constructed by evaluating the Hotelling statistic, or any other parametric statistic, at each of the (*M*_{1} + *M*_{2})!/(*M*_{1}! *M*_{2}!) such reassignments of the observations to the groups. p-values are calculated by counting the number of such statistics with values equal to or exceeding the observed value, and dividing the count by (*M*_{1} + *M*_{2})!/(*M*_{1}! *M*_{2}!). Other marginal statistics may be combined; for example, one might use the max *t*-statistic, defined by first calculating univariate *t*-statistics for each manifest variable, and reporting the maximum. This statistic is inherently one-sided, in that it presumes an alternative in which each marginal distribution for the second group is systematically larger than that of the first group. Alternatively, one might take the absolute value of the *t*-statistics before maximizing. One might do a parallel analysis with either the maximum of Wilcoxon rank-sum statistics or the maximum of their absolute values, after shifting them to have null expectation zero.
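The absolute-value variant of the max *t*-statistic described above can be coded compactly. The sketch below uses simulated data rather than any data set from the text, and estimates the p-value from random reassignments rather than full enumeration.

```r
# Sketch: permutation p-value for the max-|t| statistic, the two-sided
# variant described above.  Data are simulated for illustration.
set.seed(1)
x <- matrix(rnorm(40), 20, 2)   # 20 observations on J = 2 variables
g <- rep(1:2, each = 10)        # group labels

maxabst <- function(x, g) max(abs(apply(x, 2, function(v)
  t.test(v[g == 1], v[g == 2])$statistic)))

obs <- maxabst(x, g)
perm <- replicate(999, maxabst(x, sample(g)))
# Include the observed labeling among the reassignments:
pval <- (sum(perm >= obs) + 1) / (999 + 1)
pval
```

Because the labels here carry no signal, the p-value should be unremarkable; with real data, `sample(g)` plays the role of the random reassignment of observations to groups.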

Finally, one might apply permutation testing to the statistic (7.3), calculated on ranks instead of data values, to make the statistic less sensitive to extreme values.

Example 7.5.1 *Consider the data on wheat yields, in metric tons per hectare (Cox and Snell, 1981, Set 5), reflecting yields of six varieties of wheat grown at ten different experimental stations, from*

*http://stat.rutgers.edu/home/kolassa/Data/set5.data .*

*Two of these varieties, Huntsman and Atou, are present at all ten stations, and so the analysis will include only these. Stations are included from three geographical regions of England; compare those in the north to those elsewhere. The standard Hotelling two-sample test may be performed in R using*

wheat<-as.data.frame(scan("set5.data",what=list(variety="",
  y0=0,y1=0,y2=0,y3=0,y4=0,y5=0,y6=0,y7=0,y8=0,y9=0),
  na.strings="-"))

# Observations are represented by columns rather than by
# rows. Swap this. New column names are in first column.
dimnames(wheat)[[1]]<-wheat[,1]
wheat<-as.data.frame(t(wheat[,-1]))

dimnames(wheat)[[1]]<-c("El","E2","N3","N4","N5","N6","W7", "E8","E9","N10")

wheat$region<-factor(c("North","Other")[1+(
  substring(dimnames(wheat)[[1]],1,1)!="N")],
  c("Other","North"))
attach(wheat)

plot(Huntsman,Atou,pch=(region=="North")+1,
  main="Wheat Yields")
legend(6,5,legend=c("Other","North"),pch=1:2)

*Data are plotted in Figure 7.3. The normal-theory p-value for testing equality of the bivariate yield distributions in the two regions is given by*

library(Hotelling) # for hotelling.test
print(hotelling.test(Huntsman+Atou~region))

*The results of* hotelling.test *must be explicitly printed, because the function codes the results as invisible, and so results won’t be printed otherwise. The p-value is 0.0327. Comparing this to the univariate results*

t.test(Huntsman~region);t.test(Atou~region)

*gives two substantially smaller p-values; in this case, treatment as a multivariate distribution did not improve statistical power. On the other hand, the normal quantile plot for Atou yields shows some lack of normality. Outliers do not appear to be present in these data, but if they were, they could be addressed by performing the analysis on ranks, either using asymptotic normality:*

cat('Wheat rank test, normal theory p-values')

print(hotelling.test(rank(Huntsman)+rank(Atou)~region))

*or using permutation testing to avoid the assumption of multivariate normality:*

# Brute-force way to get estimate of permutation p-value for
# both T2 and the max t statistic.
cat('Permutation Tests for Wheat Data, Brute Force')
obsh<-hotelling.test(Huntsman+Atou~region)$stats$statistic
obst<-max(c(t.test(Huntsman~region)$statistic,
  t.test(Atou~region)$statistic))
out<-array(NA,c(1000,2))
dimnames(out)<-list(NULL,c("Hotelling","t test"))
for(j in seq(dim(out)[[1]])){
  newr<-sample(region,length(region))
  hto<-hotelling.test(Huntsman+Atou~newr)
  out[j,1]<-hto$stats$statistic>=obsh
  out[j,2]<-max(t.test(Huntsman~newr)$statistic,
    t.test(Atou~newr)$statistic)>=obst
}
apply(out,2,mean)

*giving permutation p-values for the Hotelling and max-t statistics of 0.023 and 0.003 respectively. The smaller max-t p-value reflects the strong association between variety yields across stations. If one wants only the significance of the Hotelling statistic via permutation, one could use*

print(hotelling.test(Huntsman+Atou~region,perm=T,
  progBar=FALSE))

*The argument* progBar *controls whether a progress bar is printed, and an additional argument controls the number of random permutations.*

FIGURE 7.3: Wheat Yields

### Permutation Distribution Approximations

Let *T*_{j} be the Mann-Whitney-Wilcoxon statistic using manifest variable *j*, for *j* ∈ {1, ..., *J*}. Let *T* = (*T*_{1}, ..., *T*_{J}). Let Σ = Var[*T*] be the variance matrix of this statistic under the permutation distribution. Let σ_{jj′} be the element in row *j* and column *j′*. The diagonal elements σ_{jj} are independent of data values (and equal *M*_{1}*M*_{2}(*M*_{1} + *M*_{2} + 1)/12, but that's not important here). The remaining entries of Σ depend on the data. For *i* = 1, ..., *M*_{1} + *M*_{2}, let *F*_{ij} be the number of observations in group 2 that beat observation *i* on variable *j* if *i* is in group 1, and the number of observations in group 1 that *i* beats on variable *j* if *i* is in group 2. Then 4/(*M*_{1} + *M*_{2}) times the covariance matrix of *F* estimates the variance matrix of *T*. Kawaguchi et al. (2011) provide details of these calculations. Superior performance can be obtained using known diagonal values, and estimated correlations for the remaining entries of the variance matrix (Chen and Kolassa, 2018).
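The counts *F*_{ij} can be computed directly. The sketch below follows the verbal description above, taking the "covariance matrix of *F*" to be the sample covariance of the rows of *F*; both that reading and the scaling should be checked against Kawaguchi et al. (2011). The data and group sizes are illustrative.

```r
# Sketch: the counts F and a variance estimate for the vector of
# Mann-Whitney-Wilcoxon statistics, following the description above.
set.seed(2)
m1 <- 6; m2 <- 8; J <- 2
x <- matrix(rnorm((m1 + m2) * J), m1 + m2, J)
g <- rep(1:2, c(m1, m2))

F <- matrix(NA, m1 + m2, J)
for (i in seq_len(m1 + m2)) for (j in seq_len(J)) {
  F[i, j] <- if (g[i] == 1) {
    sum(x[g == 2, j] > x[i, j])  # group-2 values beating observation i
  } else {
    sum(x[g == 1, j] < x[i, j])  # group-1 values that observation i beats
  }
}
# Column sums agree across groups: both count the pairs in which the
# group-2 value exceeds the group-1 value.
colSums(F[g == 1, ]); colSums(F[g == 2, ])

# Scaled covariance of the rows of F, as described in the text;
# for comparison, the known diagonal is M1*M2*(M1+M2+1)/12.
Vhat <- 4 / (m1 + m2) * cov(F)
m1 * m2 * (m1 + m2 + 1) / 12  # 60
```

The refinement of Chen and Kolassa (2018) would replace the diagonal of this estimate by its known value, retaining only the estimated correlations.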

Example 7.5.2 *Consider again the wheat yield data of Example 7.5.1. Asymptotic nonparametric testing is performed using*

library(ICSNP) # for rank.ctest and HotellingsT2
rank.ctest(cbind(Huntsman,Atou)~region)

*to obtain a p-value of 0.039, or*

rank.ctest(cbind(Huntsman,Atou)~region,scores="sign")
detach(wheat)

*to obtain a multivariate version of Mood’s median test.*

*Alternate syntax for* rank.ctest *consists of calling it with two arguments corresponding to the two data matrices.*

# Exercises

1. The data set

http://ftp.uni-bayreuth.de/math/statlib/datasets/federalistpapers.txt

gives data from an analysis of a series of documents. The first column gives document number, the second gives the name of a text file, the third gives a group to which the text is assigned, the fourth represents a measure of the use of first person in the text, the fifth presents a measure of inner thinking, the sixth presents a measure of positivity, and the seventh presents a measure of negativity. There are other columns that you can ignore. (The version at Statlib, above, has odd line breaks. A reformatted version can be found at stat.rutgers.edu/home/kolassa/Data/federalistpapers.txt).

a. Test the null hypothesis that the multivariate distribution of first person, inner thinking, positivity, and negativity, are the same between groups 1 and 2, using a permutation test. Test at *a =* .05.

b. Construct new variables, the excess of positivity over negativity, and the excess of thinking ahead over thinking behind, by subtracting variable six minus variable seven, and variable eight minus variable nine. Test the null hypothesis that the multivariate distribution of these two new variables has median zero, versus the general alternative, using the multivariate version of the sign test. Test at *a* = .05.

2. The data at

http://lib.stat.cmu.edu/datasets/cloud

contain data from a cloud seeding experiment. The first fifteen lines contain comment and label information; ignore these. The second field contains the character S for a seeded trial, and U for an unseeded one.

a. The fourth and fifth fields represent rainfalls in two target areas. Test the null hypothesis that the bivariate distribution of observations after seeding is the same as that without seeding. Use the marginal rank sum test.

b. Repeat part (a) using the permutation version of Hotelling’s test.