# Analysis of Molecular Variance

The above framework has been used by Excoffier et al. [72]) to develop a very general approach: the analysis of molecular variance (AMOVA). It is based on an equivalence between variance and distance: the variance of a sample is indeed the mean distance of the observations to the sample mean (see also Fig. 7.5). We can thus write:

with SSD being the sum of squared deviations and *d,j* the distance between individuals *г* and *j.* Using the above hierarchy *T > C > S* (in Excoffier et al.’s notation), this can be decomposed in several components (R: residual):

After dividing by the appropriate numbers of degrees of freedom, we have the following variance components:

which are related to the Ф-statistics as follows:

In plain words, these different variances are: *a** ^{2}* variation among continents,

* ^{2}* variation within each population,

*a*

*total variation.*

^{2}This can be generalized to more than two levels: each additional variance component is the quantity of variation explained by the additional level.

The function amova in pegas is a general implementation of the AMOVA framework: it accepts any number of levels. With a single level, it is equivalent to the partitioning of nucleotide diversity introduced at the beginning of this chapter.^{[1]} The arguments are:

amova(formula, data = NULL, nperm = 1000, is.squared = FALSE)

The first argument is a formula defining the AMOVA. model to be fitted; it must be of the form d " conti/pop/..., where d contains the pairwise distances (either an object of class "dist" or a square symmetric matrix), and conti, pop, ..., are the hierarchical levels. The second argument is an optional data frame containing these levels (typically factors), nperm is the number of permutations of the randomization tests assessing the statistical significance of the variance components, and is. squared indicates whether d gives the squared pairwise distances or not (the default).

We try amovaO with the haploid data set Yloc created on page 204. We first calculate the Hamming distance matrix:

> dy <- dist.hamming(Yloc)

> dy

Indl Ind2 Ind3 Ind4 Ind5 Ind6 Ind7 Ind8 Ind2 2

Ind3 3 3

Ind4 223

Ind5 5 5 2 3

Ind6 4 6 3 4 1

Ind7 3 3 4 4 5 4

Ind8 2 4 3 3 4 3 1

Ind9 64352334

This data set has already a column named population so we can use the argument data:

> res.amova <- amova(dy ~ population, data = Yloc)

> res.amova

Analysis of Molecular Variance

Call: amova(formula = dy ~ population, data = Yloc)

SSD MSD df

population 26.33333 13.166667 2

Error 24.66667 4.111111 6

Total 51.00000 6.375000 8

Variance components:

sigma2 P.value population 3.0185 0.0589

Error 4.1111

Phi-statistics: population.in.GLOBAL 0.4233766

Variance coefficients:

a

3

The variance explained by population is not significantly different from zero which is not surprising since these data were randomly assigned into the three populations. As a purely hypothetical exercise, we artificially create structure in these data by assigning the individuals that are incidentally similar together in the same group. A simple way to do this is by performing a projection of the distances using MDS (p. 116):

> mds <- cmdscale(dy)

> mds

[,1] [,2]

Indl -2.4895645 -1.19685107 Ind2 -2.3447801 1.07546427

Ind3 0.4600040 -0.51935489 Ind4 -1.2126382 -1.47318435 Ind5 2.4951381 -1.05172331 Ind6 2.2607319 -1.10317072 Ind7 -0.9190935 2.01749998

Ind8 -0.5846970 0.09919587

Ind9 2.3348994 2.15212423

By definition of the MDS, the individuals that are close in the above coordinates are also close in terms of genetic distances. We take the ordering on the first axis to identify similar individuals:

> о <- order(mds[, 1])

> о

[1] 124783695

So individuals 1, 2, and 4 will be in the first (artificial) population, individuals 7, 8, and 3 will be in the second one, and individuals 6, 9, and 5 in the last one. We use the vector о as indices to create this new population factor easily (note that pop must be created beforehand because the ^{1} [’ operator will look for it):

> pop <- NULL # or: pop <- integer(nrow(Yloc))

> pop[o] <- rep(l:3, each = 3)

> pop <- factor(pop)

> pop

[1] 112133223 Levels: 123

We now perform the new AMOVA without using the data argument:

> amova(dy - pop)

Analysis of Molecular Variance

Call: amova(formula = dy - pop)

SSD MSD df

pop 33.66667 16.833333 2

Error 17.33333 2.888889 6

Total 51.00000 6.375000 8

Variance components: sigma2 P.value pop 4.6481 0.01

Error 2.8889

Phi-statistics: pop.in.GLOBAL 0.6167076

Variance coefficients:

a

3

The variance component = 4.65 is now significantly greater than zero, and Ф§т = 0.62 is higher than in the first analysis (0.42).

The case studies below show an application of AMOVA with more than one level.

- [1] The package ade4 has a function with the same name implementing the two-levelAMOVA (see p. 194 on how to use the operator).