# Formulation

We first review the classic rarefaction formula for species richness to show how it extends to Phylogenetic Diversity (PD). The expected species richness (*S*) for a given amount of sampling is simply the sum of the probabilities (*p*) of each species occurring in a subset of *m* accumulation units (Eq. 1).
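With *i* indexing the species in the full data set, and writing E[·] for the expectation (both notational assumptions made here for illustration), Eq. 1 can be written as:

$$
E[S_m] = \sum_{i=1}^{S} p_i \tag{1}
$$

Here the upper limit *S* is the number of species in the full set of accumulation units, and the subscript *m* marks the size of the subset.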

To solve Eq. 1, we need to determine the probability (*p*) of each species being selected by a random draw of *m* accumulation units from the total set of *N* units. Regardless of whether the accumulation unit is an individual or a sample, this probability is a function of the frequency (*n*) with which species *i* occurs across the set of *N* accumulation units (Chiarucci et al. 2008). Since the set of *N* units is finite, random draws from it should be without replacement, and thus *p* is defined by the hypergeometric distribution (Hurlbert 1971). Substituting into Eq. 1, the expected species richness is as follows (Eq. 2).
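Writing the frequency of species *i* with an explicit subscript (a notational assumption), the hypergeometric form of Eq. 2 can be reconstructed as:

$$
E[S_m] = \sum_{i=1}^{S} \left[ 1 - \frac{\binom{N - n_i}{m}}{\binom{N}{m}} \right] \tag{2}
$$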

The quantity within the square brackets in Eq. 2 corresponds to *p* in Eq. 1. Note that the expressions in curved brackets are binomial coefficients and not simple fractions, while the quantity subtracted from one within the square brackets is a fraction. The denominator in this fraction gives the number of distinct subsets of size *m* that can be drawn from the total set of *N* units. The numerator gives the number of distinct subsets of size *m* that do not contain species *i*. Equation 2 is the same as that originally proposed by Hurlbert (1971).
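As an illustration, Eq. 2 can be evaluated directly with exact binomial coefficients. The following is a minimal Python sketch; the function name `expected_richness` and the example frequencies are hypothetical and not taken from the original work.

```python
from math import comb


def expected_richness(frequencies, N, m):
    """Expected species richness in a random subset of m of the N units.

    frequencies[i] is n_i, the number of accumulation units in which
    species i occurs (hypothetical input layout).
    """
    if not 0 <= m <= N:
        raise ValueError("m must satisfy 0 <= m <= N")
    total_subsets = comb(N, m)  # distinct subsets of size m
    # comb(N - n, m) counts the subsets of size m that miss species i.
    # It is 0 whenever m > N - n, so the probability is then exactly 1.
    return sum(1 - comb(N - n, m) / total_subsets for n in frequencies)


# Hypothetical data: ten individuals distributed among four species
richness_at_5 = expected_richness([4, 3, 2, 1], N=10, m=5)
```

Passing *N* explicitly, rather than summing the frequencies, keeps the same function usable for sample-based rarefaction, where *N* is the number of samples and not the number of individuals.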

Phylogenetic Diversity is simply the sum of a set of branch lengths spanning a set of species (or, more generally, tips). So, for a set of *S* species, there is a corresponding set of *T* branch segments. Each branch segment (*j*) has a length (*L*) measured as sequence substitutions, millions of years, or some other biologically meaningful estimate of difference. Considering only rooted phylogenetic trees, PD is calculated as follows (Eq. 3).
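With *j* indexing the *T* branch segments and the branch length carrying an explicit subscript (a notational assumption), Eq. 3 can be written as:

$$
PD = \sum_{j=1}^{T} L_j \tag{3}
$$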

In the original definition intended by Faith (1992), the PD of a subset of species is calculated by summing the branch lengths connecting that set of species to the root of the tree, even when the common ancestor of that subset is not the same as the root. In this definition, a subset containing a single species (or even a single individual) has a non-zero PD value, which, in this case, would be the total path length from the tip to the root. This corresponds to the *rooted PD* value of Pardi and Goldman (2007). The alternative, called *unrooted PD* by Pardi and Goldman (2007), includes only the branch segments connecting a subset of species to their common ancestor, and thus a subset containing only a single species would have zero PD. The former definition, rooted PD, is adopted here because it allows for the straightforward formulation of a whole class of derived PD measures (Faith 2013), and because it is concordant with the original idea of PD acting as a surrogate for the feature diversity of a set (Faith 1992; Faith et al. 2009). Obviously, rooted PD requires a rooted phylogenetic tree, even if the choice of root is arbitrary (Nipperess and Matsen 2013).
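The difference between the two definitions can be made concrete with a small Python sketch. The tree, its branch lengths, and the function names `rooted_pd` and `unrooted_pd` are hypothetical and chosen only for illustration; each branch segment is stored as its length together with the set of tips descended from it.

```python
# Hypothetical rooted tree ((A:1,B:1):2,(C:1.5,D:1.5):1.5),
# stored as (branch length, set of descendant tips) pairs.
branches = [
    (1.0, {"A"}),
    (1.0, {"B"}),
    (2.0, {"A", "B"}),
    (1.5, {"C"}),
    (1.5, {"D"}),
    (1.5, {"C", "D"}),
]


def rooted_pd(branches, subset):
    """Sum every branch on a path from a tip in the subset to the root."""
    subset = set(subset)
    return sum(length for length, tips in branches if tips & subset)


def unrooted_pd(branches, subset):
    """Sum only the branches joining the subset to its common ancestor.

    A branch whose descendants include the whole subset lies at or above
    the common ancestor, so it is excluded.
    """
    subset = set(subset)
    return sum(
        length
        for length, tips in branches
        if tips & subset and not subset <= tips
    )


single_rooted = rooted_pd(branches, {"A"})      # tip-to-root path length
single_unrooted = unrooted_pd(branches, {"A"})  # no branches are counted
```

For a subset whose common ancestor is the root itself, such as tips A and C in this example tree, the two functions return the same value.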

Given this definition, the rarefaction of PD involves finding the expected (average) sum of branch lengths (including the path to the root) over all possible distinct subsets of *m* accumulation units (Fig. 2). This is achieved by extending the classic rarefaction formula, substituting branch segments in a phylogenetic tree for species. Since PD is simply the sum of branch lengths, the expected PD must also be a sum of branch lengths, each weighted by the probability (*q*) of its occurrence in a subset of size *m* (O'Dwyer et al. 2012). So, for a rooted phylogenetic tree represented as a set of *T* branch segments, the expected PD is given as follows (Eq. 4).
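With the branch length and occurrence probability of segment *j* carrying explicit subscripts (a notational assumption), Eq. 4 can be written as:

$$
E[PD_m] = \sum_{j=1}^{T} L_j \, q_j \tag{4}
$$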

The probability of each branch segment occurring in a subset is again a function of the frequency with which it occurs among accumulation units. The frequency of occurrence of a particular branch segment (*o*) depends on the frequency of occurrence of the species that are descended from that branch segment. Let *x* be a binary value indicating whether species *i* is (1) or is not (0) a descendant of branch segment *j*. Multiplying *x* by *n* and summing across all species gives the total number of occurrences of branch segment *j* among *N* accumulation units (Eq. 5).
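With subscripts written explicitly (a notational assumption), so that the indicator for species *i* and branch segment *j* is indexed by both, Eq. 5 can be written as:

$$
o_j = \sum_{i=1}^{S} x_{ij} \, n_i \tag{5}
$$

For example, a branch segment whose only descendants are two species occurring 4 and 3 times has an occurrence of 7.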

**Fig. 2 An illustration of the process of rarefying Phylogenetic Diversity (PD) by units of individuals. An initial sample of ten individuals (m = 10) distributed among four tips (species) is rarefied to a subset of five individuals (m = 5) by a process of random sampling without replacement. For the rarefied samples, 2 of the 252 possible subsets are shown. The expected PD under rarefaction is the average sum of branch lengths represented by each of these distinct subsets. The branch lengths summed to calculate PD are black while those not represented (and thus not summed) are grey. Note that the rooted definition of PD is used where the path length to the root is always included, even in the case where only a single tip is represented**

Thus, by summing across branches instead of species, substituting branch occurrence for species occurrence, and including a branch length weighting, we are able to adapt the classic rarefaction formula for species richness for the purposes of calculating expected Phylogenetic Diversity (Eq. 6). Note this solution is equivalent to that of Nipperess and Matsen (2013) but is expressed in an expanded form for the specific case of calculating rooted PD. Equation 6 is very similar to the solution for *expected PD* of Faith (2013) but differs in that random draws are without replacement following the hypergeometric distribution.
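Replacing species frequency with branch occurrence in the hypergeometric term and weighting by branch length, and writing subscripts explicitly (a notational assumption), Eq. 6 can be reconstructed as:

$$
E[PD_m] = \sum_{j=1}^{T} L_j \left[ 1 - \frac{\binom{N - o_j}{m}}{\binom{N}{m}} \right] \tag{6}
$$

The quantity in square brackets plays the role of *q* in Eq. 4: it is the probability that at least one of the occurrences of branch segment *j* falls in the subset of size *m*.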

Finally, it is now possible to calculate the expected PD for a given number of *species*. A species, in this context, is simply a collection of individuals in much the same way as a sample is a collection of individuals, and the same equations apply.

Under these circumstances, *o<sub>j</sub>* is equal to the sum of *x<sub>ij</sub>* (over all species) as *n<sub>i</sub>* will always equal 1, and *N* is equal to *S*. Substituting into Eq. 6 gives the following formula for rarefaction by species (Eq. 7).
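With subscripts written explicitly (a notational assumption), and with *m* now counting species, Eq. 7 can be reconstructed as:

$$
E[PD_m] = \sum_{j=1}^{T} L_j \left[ 1 - \frac{\binom{S - o_j}{m}}{\binom{S}{m}} \right], \qquad o_j = \sum_{i=1}^{S} x_{ij} \tag{7}
$$

Because Eq. 7 is Eq. 6 with *N* equal to *S* and every frequency equal to 1, one routine can serve both cases. The following Python sketch illustrates this under stated assumptions: the tree, the abundances, and the function names `branch_occurrences` and `expected_pd` are hypothetical and are not taken from the original work.

```python
from math import comb


def branch_occurrences(descendants, frequencies):
    """Eq. 5: occurrences o_j of each branch segment among the N units."""
    # Summing n_i over the tips descended from branch j equals summing
    # x_ij * n_i over all species, because x_ij is 0 for every other tip.
    return [sum(frequencies[tip] for tip in tips) for tips in descendants]


def expected_pd(lengths, occurrences, N, m):
    """Eq. 6: expected rooted PD of m units drawn without replacement."""
    if not 0 <= m <= N:
        raise ValueError("m must satisfy 0 <= m <= N")
    total_subsets = comb(N, m)
    # comb(N - o, m) counts the subsets of size m that miss branch j.
    return sum(
        length * (1 - comb(N - o, m) / total_subsets)
        for length, o in zip(lengths, occurrences)
    )


# Hypothetical rooted tree ((A:1,B:1):2,(C:1.5,D:1.5):1.5)
lengths = [1.0, 1.0, 2.0, 1.5, 1.5, 1.5]
descendants = [{"A"}, {"B"}, {"A", "B"}, {"C"}, {"D"}, {"C", "D"}]

# Rarefaction by individuals (Eq. 6): ten individuals, subset of five
abundance = {"A": 4, "B": 3, "C": 2, "D": 1}
o_individuals = branch_occurrences(descendants, abundance)
pd_individuals = expected_pd(lengths, o_individuals, N=10, m=5)

# Rarefaction by species (Eq. 7): every n_i is 1 and N equals S
presence = {tip: 1 for tip in abundance}
o_species = branch_occurrences(descendants, presence)
pd_species = expected_pd(lengths, o_species, N=len(presence), m=2)
```

As a sanity check on the sketch, setting *m* equal to *N* returns the total rooted PD of the tree, and setting *m* to 1 returns the abundance-weighted mean path length from tip to root.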