A rooted species tree is required to identify the correct out-group in each orthogroup tree, because correct gene tree rooting is critical for assessing orthology from that tree [22]. Since an orthogroup can contain any subset of the species in the analysis, it is not sufficient to know only the out-group for the complete species set; the complete rooted species tree is required. If the user knows the rooted species tree for the set of species being analyzed, it is recommended to specify this tree manually at the command line to remove the possibility of species tree inference error. Such a tree can be provided as a Newick format text file. If a species tree is not provided (or not known), OrthoFinder infers it automatically.
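To illustrate why the out-group of the complete species set is not enough, the following minimal sketch prunes a rooted species tree to the species present in one orthogroup and reports the deepest split that remains. The nested-tuple tree encoding, the example species, and the function names (`leaves`, `restrict`, `root_split`) are assumptions made for illustration; they are not OrthoFinder's data structures or API.

```python
# Hypothetical rooted species tree as nested tuples (an assumption for this sketch).
SPECIES_TREE = (((("human", "mouse"), "chicken"), "zebrafish"), "yeast")


def leaves(tree):
    """Return the set of species names below a nested-tuple tree node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Prune a rooted tree to the species in `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [sub for sub in (restrict(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return tuple(kept)


def root_split(tree, present):
    """Return the sides of the deepest split among the species that are present."""
    pruned = restrict(tree, set(present))
    if pruned is None or isinstance(pruned, str):
        raise ValueError("need at least two species to define a root split")
    return [leaves(child) for child in pruned]


full_split = root_split(SPECIES_TREE, leaves(SPECIES_TREE))
subset_split = root_split(SPECIES_TREE, {"human", "mouse", "chicken"})
```

For the complete species set, the deepest split separates yeast from everything else. For an orthogroup containing only human, mouse, and chicken, the deepest split separates chicken from the human and mouse pair, so a different out-group applies. This can only be worked out from the complete rooted tree.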
Sets of one-to-one orthologs that are present in all species are often used for species tree inference; however, in real-world large-scale analyses, these can be rare [33]. A new algorithm, Species Tree from All Genes (STAG), was developed to allow species tree inference even for species sets with few or no complete sets of one-to-one orthologs present in all species [33]. Without this algorithm, species tree inference could fail if there were no such sets. STAG infers the species tree using the most closely related genes within single-copy or multi-copy orthogroups. In benchmark tests, STAG [24] had higher accuracy than other leading methods for species tree inference, including maximum likelihood species tree inference from concatenated alignments of protein sequences, ASTRAL [38], and NJst [39].
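The idea of using the most closely related genes can be sketched for a single multi-copy orthogroup as follows. This is only an illustration of that one step, assuming pairwise gene-to-gene distances are already available; the function name `closest_gene_distances`, the gene labels, and the distance values are hypothetical, and the later steps that turn such distances into a species tree are not shown.

```python
def closest_gene_distances(gene_species, gene_dist):
    """For each pair of species, keep the smallest distance between any of their genes.

    gene_species: dict mapping gene name -> species name
    gene_dist:    dict mapping (gene, gene) tuples -> distance
    """
    best = {}
    for (g1, g2), d in gene_dist.items():
        s1, s2 = gene_species[g1], gene_species[g2]
        if s1 == s2:
            continue  # paralogs within one species say nothing about species distance
        pair = frozenset((s1, s2))
        if pair not in best or d < best[pair]:
            best[pair] = d
    return best


# Hypothetical orthogroup: species A has two gene copies, B and C have one each.
gene_species = {"A1": "A", "A2": "A", "B1": "B", "C1": "C"}
gene_dist = {
    ("A1", "A2"): 0.3,
    ("A1", "B1"): 0.4,
    ("A2", "B1"): 0.2,
    ("A1", "C1"): 0.9,
    ("A2", "C1"): 0.7,
    ("B1", "C1"): 0.8,
}
species_dist = closest_gene_distances(gene_species, gene_dist)
```

Because the minimum is taken over all gene pairs, the orthogroup contributes a distance for every species pair even though species A is not single-copy, which is why one-to-one orthologs are not required.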
The Species Tree Root Inference from Duplication Events (STRIDE) algorithm [22] is used to root the species tree in OrthoFinder. STRIDE was developed to enable the rooting of the species tree using only information available in the set of gene trees. It does this by identifying the set of well-supported in-group gene duplication events in the complete set of unrooted orthogroup trees, and using these events to infer a probability distribution over the unrooted STAG species tree for the location of its root. As with STAG, STRIDE has been shown to identify the correct root of the species tree in multiple large-scale molecular phylogenetic data sets spanning a wide range of time scales and taxonomic groups [22]. In some cases, there may be few duplications within the gene trees, so STRIDE will not be able to identify the root of the species tree, or will only be able to exclude the root from clades in which gene duplication events are observed. In such cases, ortholog inference should still not be significantly affected, since the rooting of the gene tree only affects ortholog inference where gene duplication events are present [22]. This makes the STRIDE approach particularly suited to gene tree rooting for ortholog inference.
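The intuition that duplication events constrain the root can be shown with a toy scoring scheme. This is not STRIDE's probability model. It is a deliberately simplified sketch that assumes each observed duplication is summarized by the set of species descending from it, and that a candidate rooting is supported when that set forms a clade under it. The function name `root_support`, the four-species tree, and the duplication counts are all hypothetical.

```python
def root_support(candidate_rootings, duplication_clades):
    """Share of duplication support assigned to each candidate rooting.

    candidate_rootings: dict mapping a rooting label -> set of clades
                        (each clade a frozenset of species) implied by that rooting
    duplication_clades: list of frozensets, one per observed duplication event
    """
    counts = {
        name: sum(1 for dup in duplication_clades if dup in clades)
        for name, clades in candidate_rootings.items()
    }
    total = sum(counts.values())
    if total == 0:
        # No informative duplications: no rooting can be preferred.
        return {name: 1 / len(counts) for name in counts}
    return {name: n / total for name, n in counts.items()}


fs = frozenset
# The five possible rootings of the unrooted tree ((A,B),(C,D)), one per branch.
candidate_rootings = {
    "branch_A": {fs("BCD"), fs("CD")},
    "branch_B": {fs("ACD"), fs("CD")},
    "internal": {fs("AB"), fs("CD")},
    "branch_C": {fs("ABD"), fs("AB")},
    "branch_D": {fs("ABC"), fs("AB")},
}
# Hypothetical duplications: two shared by C and D, one shared by B, C, and D.
duplications = [fs("CD"), fs("CD"), fs("BCD")]
support = root_support(candidate_rootings, duplications)
no_info = root_support(candidate_rootings, [])
```

In this toy example the rooting on the branch leading to A receives the largest share, while rootings inside the clade containing C and D receive none, mirroring the statement that the root can be excluded from clades in which duplications are observed. With no duplications, every rooting is scored equally, so the root cannot be identified.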