# Case Studies

## Complete Genomes of the Fruit Fly

We conduct an AMOVA separately for each chromosome by drawing randomly

10,000 SNPs. The AMOVA test was significant only for the 3R chromosome (results for the other chromosomes not shown):

> i <- which(SNP & infо.droso$CHR0M == "3R")

> x <- read.vcfffl, which.loci=sample(i, size=le4), quiet=TRUE)

> d <- dist.asd(x)

> amovafd - Region/Locality, geo)

Analysis of Molecular Variance

Call: amovafformula = d ~ Region/Locality, data = geo)

SSD MSD df

Region 5.449476 1.0898951 5

Locality 1.344987 0.1344987 10

Error 11.757241 0.1119737 105

Total 18.551704 0.1545975 120

Variance components:

sigma2 P.value Region 0.041505 0.0390

Locality 0.011799 0.6973

Error 0.111974

Phi-statistics:

Region.in.GLOBAL (Phi_CT) Locality.in.GLOBAL (Phi_ST) 0.25112447 0.32251207

Locality.in.Region (Phi_SC)

0.09532639

Variance coefficients:

a b c

1.909091 15.656198 19.110744

The highest level clearly explains the largest amount of genetic variation.

## Human Genomes

We do an AMOVA with the human mtGenomes similar to the previous analysis. We discard the indels in order to compute the pairwise distances from the SNPS only:

> x <- MIT0[, is.snp(MITO)]

> x <- as.DNAbin(sapply(x, as.character))

We check that there are only strict SNPs:

> checkAlignment(x, plot = FALSE)

Number of sequences: 2534 Number of sites: 3589

No gap in alignment.

Number of segregating sites (including gaps): 3586 Number of sites with at least one substitution: 3586 Number of sites with 1, 2, 3 or 4 observed bases:

- 12 3 4
- 3 3586 0 0

It appears that there are three sites with no polymorphism but this is not a problem here because they will be ignored when calculating the Hamming distances:

> dx <- dist.dna(x, "N")

> am <- amova(dx - Continent/population, MIT0, nperm = 100)

> am

Analysis of Molecular Variance

Call: amova(formula = dx ~ Continent/population, data = MITO, nperm = 100)

SSD MSD df

Continent 265645.97 66411.4925 4

population 44040.73 2097.1778 21

Error 1361496.72 542.8615 2508

Total 1671183.43 659.7645 2533

Variance components:

sigma2 P.value Continent 128.125 0

population 15.981 0

Error 542.862

Phi-statistics:

Continent.in.GLOBAL (Phi_CT)

- 0.18650742 population.in.GLOBAL (Phi_ST)
- 0.20977125

population.in.Continent (Phi_SC)

0.02859747

Variance coefficients:

a b c

97.25724 98.20395 501.84905

There are very significant variation at both continent and population levels. A way to visualize this result is to plot the histograms of the distances selected using the logical indexing as explained on page 108. Here we build two series of indices named ic and ip for the continent and population level, respectively:

> ic <- outer(MIT0$Continent, MIT0$Continent, "==")

> ip <- outer(MIT0$population, MIT0$population, "==")

> ic <- ic[lower.tri(ic)]

> ip <- ip[lower.tri(ip)]

The logical vector ic is of the same length as the "dist" object dx and has TRUE is a distance has been calculated between two individuals from the same continent, or FALSE otherwise; and similarly for the vector ip but with respect to the population level. We then look at the distribution of different categories of distances (Fig. 8.8):

Figure 8.8

Distribution of pairwise Hamming distances at different levels for human mtGenomes.

> layout(matrix(l:4, 2, 2, byrow = TRUE))

> hist(dx[ic], main = "Within continents")

> hist(dx[!ic], main = "Between continents")

> hist(dx[ip], main = "Within populations")

> hist(dx[!ip], main = "Between populations")

This shows that the contrast between continents (Фет) and between populations (Фят) is mainly due to the absence of very short distances (< 5). To understand the low value of population differentiation within continents (Фнс)> we plot the distribution of within-population distances for each continent separately. For this we use the function foo on page 109 and combine its results with the above indices. We first store the different continent names (Fig. 8.9):

> conti <- levels(MIT0$Continent)

As a reminder, foo returns TRUE for the distances calculated between one individual from a first population and another individual from a second population, both given as arguments to foo. If the two populations are the same, then the value TRUE is for the distances among individuals within a single population.

> layout(matrix(l:6, 3, 2, byrow = TRUE))

> for (i in 1:5) {

+ j <- foo(MITO$Continent, contifi], contifi], FALSE)

+ hist(dx[ip & j],

+ main = paste("Within populations in", conti[i]))

+>

# Exercises

- 1. Two locations have coordinates N 43° 36', E 4° 00' and N 43° 36', E 4° 30'. After transformation to the UTM system, these coordinates are (1065044, 4851312) and (1105414, 4854854), respectively. Calculate the geographical distances between both locations using two methods and compare the results.
- 2. What is the effect of genetic drift on Фст, and
sc? - 3. Write down the equation of the variance component in the case of a one-level AMOVA. How these components relate to 4>st?
- 4. Load the package pegas in memory and execute the examples in ?pegas: : amova. Modify the factors g and p in order to obtain significant variance components (see the example in Sect. 8.2.2).
- 5. What would be the matrix
*W*so that the Moran*I*is equivalent to the Pearson correlation coefficient? - 6. The data set rupica provided with adegenet contains the genotypes of 335 chamois
*(Rupicapm rupicapra)*and their geographical coordinates (see details on these data in ?rupica). Perform a spatial principal component analysis (sPCA) with these data. - 7. Plot the geographical coordinates of the nancycats data set delivered with adegenet (hint: see the slot ©other in this data set). Can you perform (directly) an sPCA using this data set?

Figure 8.9

Distribution of pairwise Hamming distances within populations for each continent for human mtGenomes.

**9**