In the following, we describe the open reference species delimitation pipeline that combines the EPA with PTP (EPA-PTP). The EPA first places a large number of query sequences (short reads) onto the branches of a given reference phylogeny. Thereafter, we execute PTP separately and independently on the query sequences assigned to each branch. This allows us to annotate each branch of the reference tree with the number of species induced by the query sequences placed on it. The input of our pipeline is a reference alignment in which each sequence represents one species, together with a reference phylogeny for that alignment. The PTP method and the pipeline are implemented in Python and rely on the Python Environment for Tree Exploration (ETE) package (Huerta-Cepas et al., 2010) for tree manipulation and visualization.
Our pipeline executes the following steps:

Run UCHIME (Edgar et al., 2011) against the reference alignment to remove chimeric query sequences.

Use the EPA to place the query sequences onto the reference tree. Query sequences whose maximum placement likelihood weight is <0.5 (i.e. an uncertain placement; see Berger et al., 2011 for details) are discarded.

For each branch in the reference tree, we extract the set of query sequences that have been placed into that branch and infer a tree on them using RAxML (Stamatakis, 2006). Because the PTP method requires a correctly rooted tree, we use one of two rooting strategies, depending on the branch type. If the branch leads to a tip, we extend the alignment of query sequences by the reference sequence at that tip and by the reference sequence that is furthest away from that tip. The most distant sequence is used as the outgroup. Note that the tree is thereby rooted at the longest branch (see the discussion below). For query sequences placed on internal branches, we use the RAxML -g constraint tree option to obtain a rooted tree of the query sequences. The constraint tree consists of the bifurcating reference tree and a polytomy comprising the query sequences, attached to the reference tree branch under consideration. The result of this constrained ML tree search is a resolved tree of the query sequences attached to that reference tree branch. The attachment point is used as the root.

Because we assume that the reference phylogeny is a species tree that reflects our knowledge about the speciation process and rate, we estimate the speciation rate only once, on the reference phylogeny. Thereafter, we apply PTP to each query sequence tree (one for each branch of the reference phylogeny) to delimit species. Note that in this scenario we only need to estimate the coalescent (within-species) rate, as the speciation rate remains fixed.

When PTP is applied to a placement of query sequences on a terminal branch, those queries that are delimited as one population with the reference sequence at the tip will be assigned taxonomically to the species represented by this reference sequence. Otherwise, they are identified as new species in the reference tree.
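The placement filter in the second step, together with the per-branch grouping that the third step starts from, can be sketched as follows. This is a minimal illustration, assuming the placements are available as a parsed jplace-style dictionary with a `fields` list naming the columns and, per query, candidate rows in `p` and names in `n`. The function name `filter_placements` and the toy data are hypothetical and not taken from the pipeline's code.

```python
from collections import defaultdict


def filter_placements(jplace, min_lwr=0.5):
    """Group confidently placed queries by the branch of their best placement.

    `jplace` is assumed to be a parsed jplace-style dict (an assumption of
    this sketch). Queries whose best likelihood weight ratio is below
    `min_lwr` are discarded.
    """
    edge_col = jplace["fields"].index("edge_num")
    lwr_col = jplace["fields"].index("like_weight_ratio")
    by_edge = defaultdict(list)
    for placement in jplace["placements"]:
        # The best candidate is the row with the highest likelihood weight.
        best = max(placement["p"], key=lambda row: row[lwr_col])
        if best[lwr_col] < min_lwr:
            continue  # uncertain placement: drop the query
        by_edge[best[edge_col]].extend(placement["n"])
    return dict(by_edge)


# Toy data (made up for illustration only).
example = {
    "fields": ["edge_num", "likelihood", "like_weight_ratio"],
    "placements": [
        {"p": [[3, -1200.1, 0.92], [4, -1203.0, 0.08]], "n": ["read_a"]},
        {"p": [[3, -1190.4, 0.45], [7, -1190.6, 0.40]], "n": ["read_b"]},
        {"p": [[7, -1185.2, 0.70], [3, -1186.0, 0.30]], "n": ["read_c"]},
    ],
}
queries_per_branch = filter_placements(example)
```

In the toy data, `read_b` is dropped because its best placement has a weight below 0.5, and the remaining reads are keyed by the branch they were placed on.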
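For terminal branches, the third step picks the reference sequence that is furthest from the current tip as the outgroup. The sketch below shows one way to make that choice. It assumes that "furthest away" means the largest patristic (path-length) distance on the reference tree, which the text does not specify, and it represents the tree as a plain adjacency dictionary of branch lengths. The helper names and the toy tree are hypothetical.

```python
def patristic_distances(adjacency, start):
    """Return path-length distances from `start` to every node of a tree."""
    distances = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in adjacency[node].items():
            if neighbor not in distances:
                distances[neighbor] = distances[node] + length
                stack.append(neighbor)
    return distances


def pick_outgroup(adjacency, tip):
    """Pick the tip furthest from `tip` (assumed criterion: patristic distance)."""
    distances = patristic_distances(adjacency, tip)
    # Tips are the nodes with exactly one neighbor.
    tips = [n for n, nbrs in adjacency.items() if len(nbrs) == 1 and n != tip]
    return max(tips, key=lambda n: distances[n])


# Toy unrooted reference tree ((A:0.1,B:0.2)x:0.05,(C:0.3,D:0.6)y), made up.
reference_tree = {
    "A": {"x": 0.1},
    "B": {"x": 0.2},
    "C": {"y": 0.3},
    "D": {"y": 0.6},
    "x": {"A": 0.1, "B": 0.2, "y": 0.05},
    "y": {"C": 0.3, "D": 0.6, "x": 0.05},
}
outgroup_for_a = pick_outgroup(reference_tree, "A")
```

If the intended criterion is a pairwise sequence distance instead, only the distance computation would change; the selection of the maximum stays the same.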
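The assignment rule for terminal branches in the last step can be illustrated with a short sketch. It assumes that the delimitation result is available as a list of sets of sequence names, one set per delimited species, which is a representation chosen here for illustration rather than the pipeline's actual output format. The function name `assign_queries` is hypothetical.

```python
def assign_queries(delimited_groups, reference_tip):
    """Split delimited groups into known-species queries and putative new species.

    Queries sharing a group with `reference_tip` are assigned to that
    reference species; every other group is reported as a new species.
    """
    assigned, new_species = [], []
    for group in delimited_groups:
        if reference_tip in group:
            assigned.extend(sorted(set(group) - {reference_tip}))
        else:
            new_species.append(sorted(group))
    return assigned, new_species


# Toy delimitation on one terminal branch (made up for illustration).
groups = [{"ref_tip", "read_a", "read_b"}, {"read_c"}, {"read_d", "read_e"}]
assigned, new_species = assign_queries(groups, "ref_tip")
```

Here `read_a` and `read_b` are assigned to the species of `ref_tip`, while the two remaining groups are reported as new species.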

As mentioned previously, we also combined the EPA with CROP (EPA-CROP). This method works like EPA-PTP, the only difference being that CROP is used instead of PTP to calculate the number of MOTUs for each placement.
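Because EPA-PTP and EPA-CROP differ only in the delimitation method applied per branch, the counting step can be sketched with the method passed in as a callable. This is an illustrative design, not the pipeline's actual interface. `count_units_per_branch` is a hypothetical name, and `toy_delimit_by_prefix` is a placeholder that stands in for PTP or CROP and implements neither.

```python
def count_units_per_branch(queries_per_branch, delimit):
    """Apply a delimitation callable to each branch's queries and count the groups."""
    return {
        branch: len(delimit(queries))
        for branch, queries in queries_per_branch.items()
    }


def toy_delimit_by_prefix(queries):
    """Placeholder delimiter: groups reads by the text before the first underscore.

    This is NOT PTP or CROP; it only makes the sketch runnable.
    """
    groups = {}
    for name in queries:
        groups.setdefault(name.split("_")[0], []).append(name)
    return list(groups.values())


# Toy per-branch query sets (made up for illustration).
counts = count_units_per_branch(
    {3: ["sp1_a", "sp1_b", "sp2_a"], 7: ["sp3_a"]},
    toy_delimit_by_prefix,
)
```

Swapping the delimitation method then amounts to passing a different callable, while the per-branch bookkeeping stays unchanged.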