Split Diversity: Measuring and Optimizing Biodiversity Using Phylogenetic Split Networks
Olga Chernomor, Steffen Klaere, Arndt von Haeseler, and Bui Quang Minh
Abstract About 20 years ago the concepts of phylogenetic diversity and phylogenetic split networks were separately introduced in conservation biology and evolutionary biology, respectively. While it has been widely recognized that biodiversity assessment should better take into account the phylogenetic tree of life, it has also been widely acknowledged that phylogenetic networks are more appropriate for phylogenetic analysis in the presence of hybridization, horizontal gene transfer, or contradicting trees among genomic loci. Here, we aim to combine phylogenetic diversity and networks into one concept, split diversity (SD), which properly measures biodiversity for conflicting phylogenetic signals. Moreover, we reformulate well-known conservation questions under the SD framework and present computational methods to solve these, in general, computationally intractable questions. Notably, integer programming, a technique widely used to solve many real-life problems, serves as a general and efficient strategy that delivers optimal solutions to many biodiversity optimization problems. We finally discuss future directions for the new concept.
Keywords Biodiversity optimization • Phylogenetic diversity • Phylogenetic split networks • Split diversity • Integer programming
Introduction
The previous book chapters show that in the presence of phylogenetic information it is more appropriate to assess biodiversity based on phylogenetic trees than on the concept of species richness (see also May 1990; Vane-Wright et al. 1991). Phylogenetic diversity (PD; Faith 1992) is a popular measure of the amount of evolutionary history encompassed by the species under consideration. Given a phylogenetic tree for a set of taxa, the PD of a taxon subset is defined as the sum of the branch lengths of the minimal subtree connecting those taxa. The definition of PD per se requires “a reliable estimate of phylogenetic relationships among the taxa” (Faith 1992). However, such a reliable estimate is sometimes hard to obtain due to, for example, model misspecification (Jermiin et al. 2008) or even intrinsically nontreelike evolutionary patterns. More recently, phylogenomic studies have often revealed conflicting phylogenetic signals among genomic loci, adding the complication of how to compute PD from multiple trees.
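The definition of PD given above can be made concrete with a minimal Python sketch. It assumes an unrooted tree supplied as a list of (node, node, branch length) triples and counts a branch whenever selected taxa occur on both of its sides, which is one way to identify the minimal connecting subtree. The function name, the input format, and the four-taxon toy tree with its branch lengths are illustrative assumptions and are not taken from the pheasant data.

```python
from collections import defaultdict


def phylogenetic_diversity(edges, subset):
    """Return the total length of the minimal subtree connecting `subset`.

    edges  -- iterable of (node_u, node_v, branch_length) describing a tree
    subset -- names of the selected taxa (leaves of the tree)
    """
    subset = set(subset)

    # Build an undirected adjacency list of the tree.
    adjacency = defaultdict(list)
    for u, v, length in edges:
        adjacency[u].append((v, length))
        adjacency[v].append((u, length))

    unknown = subset - set(adjacency)
    if unknown:
        raise ValueError(f"taxa not in tree: {sorted(unknown)}")
    if len(subset) < 2:
        return 0.0  # assumption: a single taxon spans no branch (unrooted view)

    total = 0.0

    def selected_below(node, parent):
        """Count the selected taxa in the part of the tree beyond `node`."""
        nonlocal total
        count = 1 if node in subset else 0
        for neighbour, length in adjacency[node]:
            if neighbour == parent:
                continue
            below = selected_below(neighbour, node)
            # A branch lies on the connecting subtree exactly when selected
            # taxa occur on both of its sides.
            if 0 < below < len(subset):
                total += length
            count += below
        return count

    selected_below(next(iter(subset)), None)
    return total


# Hypothetical toy tree ((A:1,B:2):0.5,(C:3,D:4)); "x" and "y" are inner nodes.
TOY_TREE = [
    ("A", "x", 1.0),
    ("B", "x", 2.0),
    ("x", "y", 0.5),
    ("C", "y", 3.0),
    ("D", "y", 4.0),
]

print(phylogenetic_diversity(TOY_TREE, {"A", "B"}))  # 3.0
print(phylogenetic_diversity(TOY_TREE, {"A", "C"}))  # 4.5
```

The recursion is adequate for small examples; for trees with many thousands of taxa an iterative traversal would avoid Python's recursion limit.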
Figure 1 illustrates the problem. Here, phylogenetic trees are reconstructed for ten pheasant species from the mitochondrial cytochrome b gene (CYB) and the intron 3 of the dimerization cofactor of hepatocyte nuclear factor 1 (DCoH3) (data from Kimball and Braun 2008). The two resulting trees, denoted by TCYB and TDCoH3, clearly separate the two genera Gallus (junglefowl) and Polyplectron (peacock-pheasant). However, they strongly contradict each other within the Gallus clade. For example, G. sonneratii (grey junglefowl) and G. varius (green junglefowl) are the basal Gallus species in TCYB and TDCoH3, respectively. The trees also disagree on the phylogenetic positions of P. emphanum (Palawan peacock-pheasant) and P. malacense (Malayan peacock-pheasant). Moreover, the edge lengths of the trees, represented by the expected numbers of substitutions per site, differ substantially between the trees. This particular example reflects the fact that the evolutionary relationships among these birds are still controversial and more data is needed to elucidate the galliform tree of life (e.g., Wang et al. 2013).
Fig. 1 Maximum likelihood phylogenetic trees inferred with IQ-TREE (Minh et al. 2013) from the mitochondrial CYB and the nuclear intron DCoH3 for four Gallus (junglefowl) and six Polyplectron (peacock-pheasant) species. The scalebar represents the expected number of nucleotide substitutions per site. Highlighted in boldface are the four species maximizing phylogenetic diversity
If one is interested in selecting four species maximizing PD, then one indeed ends up with two different sets of species (highlighted in boldface, Fig. 1), and only P. emphanum occurs in both subsets.
To resolve this issue, we introduced the concept of Split Diversity (SD), which generalizes PD by combining information from multiple trees (Minh et al. 2009). For example, SD of a taxon set can be defined as the average PD of the two trees. By maximizing SD one then simultaneously maximizes PDs over all trees, which captures conflicting phylogenetic signals between the trees. Moreover, computing SD this way is equivalent to computing “phylogenetic diversity” from the so-called phylogenetic split networks (Bandelt and Dress 1992a; Huson et al. 2010). SD has also recently been applied to prioritize populations for conservation (Volkmann et al. 2014). In the following we formalize the concept of split networks and the measure of split diversity. Further, we reformulate well-known biodiversity optimization problems under the framework of SD and present algorithmic solutions and computational tools for these problems. Finally, we conclude the chapter with future perspectives.
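The averaging idea can be illustrated with a small Python sketch before the formal treatment. It rests on two assumptions that are not spelled out in this introduction: each tree is stored as a dictionary of weighted splits (every branch separates the taxa into two sides), and the diversity of a subset is the sum of the weights of all splits that have selected taxa on both sides. Under these assumptions, averaging the split weights of two trees gives a single split system whose diversity equals the average PD of the trees. The taxon labels, the branch lengths, and all function names are hypothetical.

```python
from itertools import combinations

TAXA = frozenset("ABCD")  # hypothetical taxon labels


def canonical(side, taxa=TAXA):
    """Name a split by the side that does not contain the reference taxon."""
    side = frozenset(side)
    return taxa - side if min(taxa) in side else side


def tree_splits(weighted_sides):
    """Turn (side, branch_length) pairs into a {split: weight} dictionary."""
    return {canonical(side): weight for side, weight in weighted_sides}


def split_diversity(splits, subset):
    """Sum the weights of all splits with selected taxa on both sides."""
    subset = frozenset(subset)
    return sum(w for side, w in splits.items() if subset & side and subset - side)


def average_splits(*trees):
    """Average split weights over trees; an absent split counts as weight 0."""
    merged = {}
    for tree in trees:
        for side, weight in tree.items():
            merged[side] = merged.get(side, 0.0) + weight / len(trees)
    return merged


def best_subset(splits, k, taxa=TAXA):
    """Exhaustive search over all k-subsets; only feasible for tiny examples."""
    return max(combinations(sorted(taxa), k),
               key=lambda s: split_diversity(splits, s))


# Two conflicting toy trees: tree 1 groups AB|CD, tree 2 groups AC|BD.
TREE_1 = tree_splits([("A", 1.0), ("B", 2.0), ("C", 3.0), ("D", 4.0), ("AB", 0.5)])
TREE_2 = tree_splits([("A", 1.0), ("B", 1.0), ("C", 1.0), ("D", 1.0), ("AC", 1.5)])
NETWORK = average_splits(TREE_1, TREE_2)

pd_1 = split_diversity(TREE_1, "AB")   # 3.0
pd_2 = split_diversity(TREE_2, "AB")   # 3.5
sd = split_diversity(NETWORK, "AB")    # 3.25, the average of pd_1 and pd_2
print(pd_1, pd_2, sd)
print(best_subset(NETWORK, 2))         # ('C', 'D')
```

The exhaustive search in `best_subset` is only meant to show what is being optimized; it enumerates every subset of size k and does not scale beyond a handful of taxa.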