Case Studies
Complete Genomes of the Fruit Fly
We conduct an AMOVA separately for each chromosome by drawing randomly
10,000 SNPs. The AMOVA test was significant only for the 3R chromosome (results for the other chromosomes not shown):
> i <- which(SNP & infо.droso$CHR0M == "3R")
> x <- read.vcfffl, which.loci=sample(i, size=le4), quiet=TRUE)
> d <- dist.asd(x)
> amovafd - Region/Locality, geo)
Analysis of Molecular Variance
Call: amovafformula = d ~ Region/Locality, data = geo)
SSD MSD df
Region 5.449476 1.0898951 5
Locality 1.344987 0.1344987 10
Error 11.757241 0.1119737 105
Total 18.551704 0.1545975 120
Variance components:
sigma2 P.value Region 0.041505 0.0390
Locality 0.011799 0.6973
Error 0.111974
Phi-statistics:
Region.in.GLOBAL (Phi_CT) Locality.in.GLOBAL (Phi_ST) 0.25112447 0.32251207
Locality.in.Region (Phi_SC)
0.09532639
Variance coefficients:
a b c
1.909091 15.656198 19.110744
The highest level clearly explains the largest amount of genetic variation.
Human Genomes
We do an AMOVA with the human mtGenomes similar to the previous analysis. We discard the indels in order to compute the pairwise distances from the SNPS only:
> x <- MIT0[, is.snp(MITO)]
> x <- as.DNAbin(sapply(x, as.character))
We check that there are only strict SNPs:
> checkAlignment(x, plot = FALSE)
Number of sequences: 2534 Number of sites: 3589
No gap in alignment.
Number of segregating sites (including gaps): 3586 Number of sites with at least one substitution: 3586 Number of sites with 1, 2, 3 or 4 observed bases:
- 12 3 4
- 3 3586 0 0
It appears that there are three sites with no polymorphism but this is not a problem here because they will be ignored when calculating the Hamming distances:
> dx <- dist.dna(x, "N")
> am <- amova(dx - Continent/population, MIT0, nperm = 100)
> am
Analysis of Molecular Variance
Call: amova(formula = dx ~ Continent/population, data = MITO, nperm = 100)
SSD MSD df
Continent 265645.97 66411.4925 4
population 44040.73 2097.1778 21
Error 1361496.72 542.8615 2508
Total 1671183.43 659.7645 2533
Variance components:
sigma2 P.value Continent 128.125 0
population 15.981 0
Error 542.862
Phi-statistics:
Continent.in.GLOBAL (Phi_CT)
- 0.18650742 population.in.GLOBAL (Phi_ST)
- 0.20977125
population.in.Continent (Phi_SC)
0.02859747
Variance coefficients:
a b c
97.25724 98.20395 501.84905
There are very significant variation at both continent and population levels. A way to visualize this result is to plot the histograms of the distances selected using the logical indexing as explained on page 108. Here we build two series of indices named ic and ip for the continent and population level, respectively:
> ic <- outer(MIT0$Continent, MIT0$Continent, "==")
> ip <- outer(MIT0$population, MIT0$population, "==")
> ic <- ic[lower.tri(ic)]
> ip <- ip[lower.tri(ip)]
The logical vector ic is of the same length as the "dist" object dx and has TRUE is a distance has been calculated between two individuals from the same continent, or FALSE otherwise; and similarly for the vector ip but with respect to the population level. We then look at the distribution of different categories of distances (Fig. 8.8):

Figure 8.8
Distribution of pairwise Hamming distances at different levels for human mtGenomes.
> layout(matrix(l:4, 2, 2, byrow = TRUE))
> hist(dx[ic], main = "Within continents")
> hist(dx[!ic], main = "Between continents")
> hist(dx[ip], main = "Within populations")
> hist(dx[!ip], main = "Between populations")
This shows that the contrast between continents (Фет) and between populations (Фят) is mainly due to the absence of very short distances (< 5). To understand the low value of population differentiation within continents (Фнс)> we plot the distribution of within-population distances for each continent separately. For this we use the function foo on page 109 and combine its results with the above indices. We first store the different continent names (Fig. 8.9):
> conti <- levels(MIT0$Continent)
As a reminder, foo returns TRUE for the distances calculated between one individual from a first population and another individual from a second population, both given as arguments to foo. If the two populations are the same, then the value TRUE is for the distances among individuals within a single population.
> layout(matrix(l:6, 3, 2, byrow = TRUE))
> for (i in 1:5) {
+ j <- foo(MITO$Continent, contifi], contifi], FALSE)
+ hist(dx[ip & j],
+ main = paste("Within populations in", conti[i]))
+>
Exercises
- 1. Two locations have coordinates N 43° 36', E 4° 00' and N 43° 36', E 4° 30'. After transformation to the UTM system, these coordinates are (1065044, 4851312) and (1105414, 4854854), respectively. Calculate the geographical distances between both locations using two methods and compare the results.
- 2. What is the effect of genetic drift on Фст, and
sc? - 3. Write down the equation of the variance component in the case of a one-level AMOVA. How these components relate to 4>st?
- 4. Load the package pegas in memory and execute the examples in ?pegas: : amova. Modify the factors g and p in order to obtain significant variance components (see the example in Sect. 8.2.2).
- 5. What would be the matrix W so that the Moran I is equivalent to the Pearson correlation coefficient?
- 6. The data set rupica provided with adegenet contains the genotypes of 335 chamois (Rupicapm rupicapra) and their geographical coordinates (see details on these data in ?rupica). Perform a spatial principal component analysis (sPCA) with these data.
- 7. Plot the geographical coordinates of the nancycats data set delivered with adegenet (hint: see the slot ©other in this data set). Can you perform (directly) an sPCA using this data set?

Figure 8.9
Distribution of pairwise Hamming distances within populations for each continent for human mtGenomes.
9