# Principal Component Analysis of Coancestry

Zheng and Weir [312] developed a method, which they called EIGMIX, based on the same population divergence model outlined above for the analysis of $F_{ST}$ [294]. The procedure is to first perform a PCA on the SNP data with snpgdsEIGMIX, which assumes the above model. Then, using “surrogate” population data with *K* groups, the first *K* − 1 PCs are retained, and group means are calculated. The deviations from these means for each individual are taken as estimates of the ancestry proportions, which are calculated with snpgdsAdmixProp. There is a dedicated plot function, snpgdsAdmixPlot, but the results can also be displayed with compoplot from adegenet.

Figure 7.12

In the absence of admixture, there is a single evolutionary path between populations *W* and *X* (in grey). If population *X* is the result of admixture between *W* and *Y*, genes have followed two paths between *W* and *X*.

The package pophelper^{[1]} [80] offers an alternative way to visualize the results of the *Q* matrices from different analyses. It has several functions to import results from programs outside R such as STRUCTURE [229], ADMIXTURE, or TESS. The main plotting function is called plotQ with too many options to detail here. The *Q* matrices from the above analyses can also be analyzed by pophelper by putting them in a list and taking care of converting the matrices into data frames, for instance:

> Q <- list(snap = as.data.frame(res.snap$proba))

Then the plot can be done with:

> plotQ(Q)

which will write a file ‘snap.png’ with the assignment probabilities. Alternatively, the web server http://pophelper.com can produce these plots after the data are uploaded; the plots can be edited interactively and exported to files.

# A Second Look at F-Statistics

During the last decade, considerable progress has been made in analyzing genome-wide SNP data to assess population histories, admixture, and complex demographic scenarios [218, 220, 224, 235, 289]. In a way, these works draw a link between population genetics and phylogenetics [74]. Indeed, the basic idea is that populations are linked by a phylogenetic tree and that drift and mutations occurred along the branches of this tree. Different indices can be defined based on the expected divergence between two or more populations, and the distribution of these indices will depend on the demographic history of the populations.

Consider SNP data and four populations labelled *W*, *X*, *Y*, and *Z*, and let the proportions of an arbitrary allele in each population be denoted as $\xi_W$, $\xi_X$, $\xi_Y$, and $\xi_Z$. Take two of these populations, say *W* and *X*: a measure of the genetic drift that happened since they separated is given by:
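With $\xi_W$ and $\xi_X$ denoting the allele proportions in *W* and *X*, the standard definition of this statistic [218] is the expected squared difference between them. The display below is written under that assumption, with $E$ the expectation over loci:

$$F_2(W, X) = E\left[(\xi_W - \xi_X)^2\right]$$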

In the presence of admixture, the gene lineages took different paths leading to the present populations *W* and *X* (Fig. 7.12). Peter [220] proposed another formula which avoids the need to define ancestral alleles:
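Assuming the notation $\pi_W$, $\pi_X$, and $\pi_{WX}$ for the within-population and between-population nucleotide diversities, this formula is commonly written as:

$$F_2(W, X) = \pi_{WX} - \frac{\pi_W + \pi_X}{2}$$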

where $\pi_W$ and $\pi_X$ are the nucleotide diversities within populations *W* and *X*, and $\pi_{WX}$ is the inter-population nucleotide diversity (see p. 245). A similar statistic can be defined with three populations:
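In its usual form [218], and assuming that $\xi_W$, $\xi_X$, and $\xi_Y$ denote the allele proportions with *Y* taken as the focal population, the three-population statistic is:

$$F_3(Y; W, X) = E\left[(\xi_Y - \xi_W)(\xi_Y - \xi_X)\right]$$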

The order of the populations matters. Interestingly, $F_3$ can be calculated in terms of $F_2$'s [235]:
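Taking $F_2(W, X) = E[(\xi_W - \xi_X)^2]$ and $F_3(Y; W, X) = E[(\xi_Y - \xi_W)(\xi_Y - \xi_X)]$ as assumed definitions, expanding the squares on the right-hand side and cancelling the terms in $\xi_W^2$ and $\xi_X^2$ gives the following identity:

$$F_3(Y; W, X) = \frac{1}{2}\left[F_2(Y, W) + F_2(Y, X) - F_2(W, X)\right]$$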

If *Y* is not admixed from *W* and *X*, then $F_3 > 0$. If $F_3(Y; W, X) < 0$, then this is an indication that population *Y* has a “complex” history. According to Patterson et al. [218], this assessment of admixture is robust to the ascertainment of the ancestral state of the allele; on the other hand, admixture may also result in a positive value of $F_3$. Peter [220] suggests, using coalescent theory, that this test is quite restrictive for detecting admixture.

A four-population statistic is defined with:
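In the same spirit as the two- and three-population statistics, and assuming the usual ordering of the two pairs of populations [218] with $\xi$ denoting the allele proportions, the statistic is:

$$F_4(W, X; Y, Z) = E\left[(\xi_W - \xi_X)(\xi_Y - \xi_Z)\right]$$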

As for $F_3$, this index can be formulated with the pairwise $F_2$'s [235]:
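Under the definition $F_4(W, X; Y, Z) = E[(\xi_W - \xi_X)(\xi_Y - \xi_Z)]$, which is assumed here, expanding the four squared differences shows that all squared terms cancel and only the cross-products remain:

$$F_4(W, X; Y, Z) = \frac{1}{2}\left[F_2(W, Z) + F_2(X, Y) - F_2(W, Y) - F_2(X, Z)\right]$$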

Peter [220] defined a test of admixture of *W* from *X* and *Y* with:

where *A* and *B* are two populations distant from the three others, *B* has to be more closely related to either *X* or *Y*, and *W* is the population where admixture is assessed. Another way to calculate this quantity, if all populations are sampled at the same time, is:

where no outgroup population is required.

In practice, these indices are used assuming different scenarios to test hypotheses. For example, Raghavan et al. [232] analyzed a human genome from Siberia and computed a series of $F_3$ indices fixing *Y* as the most distant population (from Africa), *W* as the population from Siberia, and *X* as one of the 147 worldwide non-African populations. Peter [220] suggested using the pairwise $F_2$ to assess the “treeness” of the population history using traditional phylogenetic methods.
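A scan of this kind can be sketched in base R on simulated allele proportions. Everything in this block is an illustrative assumption: the toy drift model, the candidate populations X1 and X2, and the helpers f2 and f3, which are naive moment estimates and not the pegas functions.

```r
# Simulated allele proportions (one value per locus); the drift model is a
# toy assumption used only to generate data with a known history.
set.seed(1)
n_loci <- 2000
drift <- function(p, sd) pmin(pmax(p + rnorm(length(p), 0, sd), 0), 1)
anc <- runif(n_loci, 0.2, 0.8)   # proportions in the common ancestor
mid <- drift(anc, 0.08)          # ancestor shared by W and X1 only
freq <- cbind(Y  = drift(anc, 0.10),   # most distant population
              W  = drift(mid, 0.03),   # focal population
              X1 = drift(mid, 0.03),   # shares some drift with W
              X2 = drift(anc, 0.05))   # shares no drift with W

# Hypothetical helpers: naive estimates averaged over loci,
# without the finite-sample corrections needed on real data.
f2 <- function(a, b) mean((a - b)^2)
f3 <- function(y, w, x) mean((y - w) * (y - x))

# F3(Y; W, X) for each candidate X, with Y and W fixed
candidates <- c("X1", "X2")
f3_scan <- sapply(candidates, function(x)
    f3(freq[, "Y"], freq[, "W"], freq[, x]))

# The same quantities obtained from the pairwise F2's
f3_from_f2 <- sapply(candidates, function(x)
    (f2(freq[, "Y"], freq[, "W"]) + f2(freq[, "Y"], freq[, x]) -
         f2(freq[, "W"], freq[, x])) / 2)

print(round(f3_scan, 4))
```

In this toy setting the value for X1 is the larger of the two, because X1 shares drift with W after their common split from Y, and the two ways of computing the index agree up to rounding error.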

These three statistics can be calculated with the functions F2, F3, F4 in pegas. They have identical options which are shown here for the last one:

F4(x, allele.freq = NULL, population = NULL, check.data = TRUE, pops = NULL, jackknife.block.size = 10, B = 10000)

The data x are an object of class "loci"; alternatively, allele.freq can be used if the allele frequencies have been calculated with by (Sect. 5.4). population is the population variable (by default it is taken from x), check.data = TRUE checks that all loci are biallelic, pops can be used to specify the four populations and their order, jackknife.block.size is the number of loci that are considered as a block in the jackknife confidence intervals described by Patterson et al. [218], and B is the number of replications of the bootstrap procedure used to compute the same confidence intervals.
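To make the role of the block size concrete, the sketch below implements a delete-one-block jackknife for a naive $F_4$ estimate in base R. It is an illustration under stated assumptions rather than the code used by pegas: f4_naive and block_jackknife are hypothetical helpers, the allele proportions are simulated, loci are assumed to be ordered along the genome so that consecutive loci form a block, and all blocks have the same size.

```r
set.seed(2)
n_loci <- 1000
# Simulated allele proportions for four populations (columns W, X, Y, Z)
freq <- matrix(runif(4 * n_loci), ncol = 4,
               dimnames = list(NULL, c("W", "X", "Y", "Z")))

# Naive F4 estimate: mean over loci of (W - X) * (Y - Z)
f4_naive <- function(p) mean((p[, 1] - p[, 2]) * (p[, 3] - p[, 4]))

block_jackknife <- function(p, stat, block.size = 10) {
    block <- ceiling(seq_len(nrow(p)) / block.size)  # block index of each locus
    g <- max(block)                                  # number of blocks
    # Recompute the statistic with each block left out in turn
    theta_del <- vapply(seq_len(g), function(i)
        stat(p[block != i, , drop = FALSE]), numeric(1))
    # Jackknife standard error for g equally sized blocks
    se <- sqrt((g - 1) / g * sum((theta_del - mean(theta_del))^2))
    est <- stat(p)
    c(estimate = est, se = se, z = est / se)
}

res <- block_jackknife(freq, f4_naive, block.size = 10)
print(res)
```

Leaving out whole blocks instead of single loci is meant to account for the correlation between neighbouring loci; with blocks of unequal size a weighted version of the jackknife would be needed.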

F4 also returns the *D* statistic defined by Patterson et al. [218] as follows: suppose the four above populations are related by an unrooted tree ((*W*, *X*), (*Y*, *Z*)), then define the event “BABA” if an allele drawn at random agrees between populations *W* and *Y* and between populations *X* and *Z* but differs between these two pairs. Furthermore, define the event “ABBA” in a similar way if the allele agrees between populations *W* and *Z*. Then *D* is defined as:
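Writing $P(\mathrm{BABA})$ and $P(\mathrm{ABBA})$ for the probabilities of the two events (a notation assumed here), the statistic is usually expressed as their normalized difference:

$$D = \frac{P(\mathrm{BABA}) - P(\mathrm{ABBA})}{P(\mathrm{BABA}) + P(\mathrm{ABBA})}$$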

The value of *D* varies between −1 and +1.
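As a numerical check of this range, the sketch below computes *D* from allele proportions, treating the alleles drawn in the four populations as independent. The function d_stat and the input vectors are illustrative assumptions; this is not the pegas implementation.

```r
# Hypothetical helper: D from allele proportions w, x, y, z (one value per locus)
d_stat <- function(w, x, y, z) {
    # BABA: W agrees with Y, X agrees with Z, and the two pairs differ
    baba <- w * (1 - x) * y * (1 - z) + (1 - w) * x * (1 - y) * z
    # ABBA: W agrees with Z, X agrees with Y, and the two pairs differ
    abba <- w * (1 - x) * (1 - y) * z + (1 - w) * x * y * (1 - z)
    (sum(baba) - sum(abba)) / (sum(baba) + sum(abba))
}

# Two loci with fixed differences where only BABA patterns occur
d_baba <- d_stat(w = c(1, 0), x = c(0, 1), y = c(1, 0), z = c(0, 1))
# Swapping Y and Z turns every BABA pattern into an ABBA pattern
d_abba <- d_stat(w = c(1, 0), x = c(0, 1), y = c(0, 1), z = c(1, 0))
# Intermediate proportions give a value strictly inside the range
d_mid <- d_stat(w = c(0.2, 0.7), x = c(0.4, 0.5),
                y = c(0.6, 0.1), z = c(0.3, 0.9))

print(c(d_baba, d_abba, d_mid))
```

The first two calls reach the extremes +1 and −1. The numerator of this estimate simplifies to the sum over loci of (w - x) * (y - z), which is why *D* can be returned alongside $F_4$.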

The package admixturegraph can be used to analyze graphically the outputs of F4. This package provides tools to build admixture graphs. As a simple example, we build a phylogenetic tree with three populations:

> library(admixturegraph)

> leaves <- c("W", "Y", "X")

> inner_nodes <- c("WY", "WYX")

> edges <- parent_edges(c(edge("W", "WY"), edge("Y", "WY"),

+ edge("WY", "WYX"), edge("X", "WYX")))

> graph <- agraph(leaves, inner_nodes, edges)

> graph

$leaves

[1] "W" "Y" "X"

$inner_nodes

[1] "WY" "WYX"

$nodes

[1] "W" "Y" "X" "WY" "WYX"

$parents

W Y X WY WYX

W FALSE FALSE FALSE TRUE FALSE

Y FALSE FALSE FALSE TRUE FALSE

X FALSE FALSE FALSE FALSE TRUE

WY FALSE FALSE FALSE FALSE TRUE

WYX FALSE FALSE FALSE FALSE FALSE

$probs

W Y X WY WYX

W "" "" "" "" ""

Y "" "" "" "" ""

X "" "" "" "" ""

WY "" "" "" "" ""

WYX "" "" "" "" ""

$children

W Y X WY WYX

W FALSE FALSE FALSE FALSE FALSE

Y FALSE FALSE FALSE FALSE FALSE

X FALSE FALSE FALSE FALSE FALSE

WY TRUE TRUE FALSE FALSE FALSE

WYX FALSE FALSE TRUE TRUE FALSE

attr(,"class")

[1] "agraph"

We build a second graph with the same three populations but adding an admixture edge:

> inner_nodes2 <- c("w", "y", "x", "XWY")

> edges2 <- parent_edges(c(edge("W", "w"), edge("w", "XWY"),

+ edge("X", "x"), edge("x", "XWY"),

+ edge("Y", "y"),

+ admixture_edge("y", "w", "x", "alpha")))

> graph2 <- agraph(leaves, inner_nodes2, edges2)

Figure 7.13

Phylogenetic trees of three populations built with admixturegraph (A) with no admixture, and (B) with admixture.

We plot both graphs (Fig. 7.13):

> layout(matrix(1:2, 1))

> plot(graph, col = "grey")

> plot(graph2, col = "grey")

admixturegraph has other functions to fit admixture graphs to observed *F* statistics (e.g., fit_graph).

To conclude this chapter, Table 7.1 lists the methods reviewed in this chapter together with some from Chapter 8 and their main characteristics.

- [1] https://github.com/royfrancis/pophelper