Our pipeline executes the following steps:
Run UCHIME (Edgar et al., 2011) against the reference alignment to remove chimeric query sequences.
Use EPA to place the query sequences onto the reference tree. Sequences whose maximum placement likelihood weight is <0.5 (i.e. whose placement is uncertain; see Berger et al., 2011 for details) are discarded.
For each branch in the reference tree, we extract the set of query sequences that have been placed on that branch and infer a tree from them using RAxML (Stamatakis, 2006). Because the PTP method requires a correctly rooted tree, we use one of two rooting strategies, depending on the type of branch. If the branch leads to a tip, we extend the alignment of query sequences by two reference sequences: the sequence at that tip and the reference sequence that is furthest away from it. The most distant sequence is used as the outgroup. Note that the tree will therefore be rooted at the longest branch (see the discussion below). For query sequences placed on internal branches, we use the RAxML -g constraint tree option to obtain a rooted tree of the query sequences. The constraint tree consists of the bifurcating reference tree and a polytomy comprising the query sequences, attached to the reference tree branch under consideration. The result of this constrained ML tree search is a resolved tree of the query sequences that is attached to the reference tree branch. The attachment point is used as the root.
Because we assume that the reference phylogeny is a species tree that reflects our knowledge about the speciation process and rate, we initially estimate
When PTP is applied to a placement of query sequences on a terminal branch, those queries that are delimited as one population with the reference sequence at the tip will be assigned taxonomically to the species represented by this reference sequence. Otherwise, they are identified as new species in the reference tree.
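The filtering, per-branch grouping and terminal-branch assignment rules described in the steps above can be illustrated with a minimal Python sketch. It is not the pipeline's actual implementation: the function names, the branch identifiers and the in-memory data layout (a mapping from each query to a list of (branch, likelihood weight) pairs, and PTP clusters given as sets of sequence names) are assumptions made for illustration only. The sketch does not parse EPA output, run RAxML or perform the PTP delimitation itself.

```python
from collections import defaultdict

# Cutoff stated in the text: queries with a maximum weight < 0.5 are discarded.
LWR_THRESHOLD = 0.5


def group_confident_queries(all_placements, threshold=LWR_THRESHOLD):
    """Group queries by the branch of their best placement.

    all_placements: dict mapping a query name to a list of
    (branch_id, likelihood_weight) tuples (hypothetical layout).
    Queries whose best weight is below the threshold are dropped.
    """
    groups = defaultdict(list)
    for query, placements in all_placements.items():
        # The best placement is the one with the highest likelihood weight.
        branch, weight = max(placements, key=lambda p: p[1])
        if weight < threshold:
            continue  # uncertain placement, discard the query
        groups[branch].append(query)
    return dict(groups)


def assign_on_terminal_branch(clusters, reference_tip):
    """Apply the assignment rule for a terminal branch.

    clusters: list of sets of sequence names, one set per delimited
    cluster (assumed to come from a separate PTP run).
    Returns (assigned, new_species): queries that share a cluster with
    the reference tip, and the remaining clusters as candidate new species.
    """
    assigned, new_species = [], []
    for cluster in clusters:
        if reference_tip in cluster:
            assigned.extend(sorted(cluster - {reference_tip}))
        else:
            new_species.append(sorted(cluster))
    return assigned, new_species


if __name__ == "__main__":
    placements = {
        "q1": [("b3", 0.9), ("b4", 0.1)],
        "q2": [("b3", 0.4), ("b7", 0.35), ("b8", 0.25)],
        "q3": [("b7", 0.6), ("b3", 0.4)],
    }
    print(group_confident_queries(placements))
    print(assign_on_terminal_branch([{"ref_tip", "q1"}, {"q4", "q5"}], "ref_tip"))
```

In this toy input, q2 is dropped because its best weight (0.4) is below the cutoff, while a query with a weight of exactly 0.5 would be kept, matching the "<0.5" criterion in the text.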