Homologs of each of the 31 phylogenetic marker genes were identified from the 578 complete bacterial genomes by BLASTP searches (using marker sequences of Escherichia coli as query sequences and a cut-off E-value of 0.1) followed by HMMer searches (cut-off E-value 1 × 10⁻¹⁰). The corresponding protein sequences were retrieved, aligned, and trimmed as described above, and then concatenated by species into a mega-alignment. A maximum likelihood tree was then constructed from the mega-alignment using PHYML [35]. The model selected based on the likelihood ratio test was the WAG model of amino acid substitution with γ-distributed rate variation (five categories) and a proportion of invariable sites. The shape of the γ-distribution and the proportion of the invariable sites were estimated by the program.
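The concatenation step can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `concatenate_alignments`, the dictionary-based input format, and the use of gap characters to pad a marker that is absent from a genome are all assumptions made for illustration.

```python
def concatenate_alignments(marker_alignments, gap_char="-"):
    """Concatenate per-marker alignments by species into one mega-alignment.

    marker_alignments: list of dicts, one per marker gene, each mapping a
    species name to its aligned (and trimmed) protein sequence.
    A species lacking a marker is padded with gaps (an assumption here).
    """
    # Every species seen in at least one marker alignment.
    species = sorted({sp for aln in marker_alignments for sp in aln})
    parts = {sp: [] for sp in species}

    for aln in marker_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one marker alignment must share a length")
        width = lengths.pop()
        for sp in species:
            # Use the aligned sequence if present, otherwise an all-gap block.
            parts[sp].append(aln.get(sp, gap_char * width))

    return {sp: "".join(blocks) for sp, blocks in parts.items()}
```

Because the markers are appended in the same order for every species, each column of the result still corresponds to a single position of a single marker.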
To speed up bootstrapping analyses, very closely related taxa were removed from the original mega-alignment, leaving 310 taxa. Maximum likelihood trees were built from 100 bootstrapped replicates of this reduced dataset using PHYML with the same parameters described above.
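For readers unfamiliar with bootstrapping, the resampling behind each replicate can be sketched as follows. This shows only the standard nonparametric bootstrap of alignment columns, not the tree inference, and it is not taken from the authors' code; the function name `bootstrap_replicates`, the input format, and the fixed seed are illustrative assumptions.

```python
import random


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Yield bootstrap replicates of an alignment.

    alignment: dict mapping taxon name to an aligned sequence; all sequences
    must have the same length. Each replicate samples columns with
    replacement until it has as many columns as the original.
    """
    names = list(alignment)
    n_cols = len(alignment[names[0]])
    if any(len(alignment[name]) != n_cols for name in names):
        raise ValueError("all sequences must have the same length")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    for _ in range(n_replicates):
        # One shared set of column indices keeps the rows aligned.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        yield {name: "".join(alignment[name][c] for c in cols) for name in names}
```

A tree would then be inferred from each replicate, and the support for a branch is the fraction of replicate trees that contain it.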
With very few exceptions, the marker genes are single-copy genes in all of the bacterial genomes analyzed. In those rare cases in which two or more homologs were identified within a single species, a tree-guided approach was used to resolve the redundancy. If the redundancy resulted from a species-specific duplication event, then one homolog was randomly chosen as the representative. In all other cases, to avoid potential complications such as lateral gene transfer, we excluded that marker and treated it as 'missing' in that particular genome. It has been shown that as long as there is sufficient data, a few 'holes' in the dataset will not compromise the resulting tree [36].
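The decision rule for redundant homologs can be sketched under one stated assumption: that a "species-specific duplication" is recognized when the copies from one species form their own clade in a rooted gene tree. The text does not spell out the criterion, so this is an interpretation. The nested-tuple tree encoding and the names `leaf_set`, `clades`, and `resolve_redundancy` are hypothetical.

```python
import random


def leaf_set(node):
    """Return the set of leaf names under a node (a str leaf or a tuple of children)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def clades(node):
    """Yield the leaf set of every clade in a rooted nested-tuple tree."""
    yield leaf_set(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def resolve_redundancy(tree, copies, rng=None):
    """Pick one representative copy, or return None to mark the marker as missing.

    tree:   rooted gene tree as nested tuples with string leaves
    copies: leaf names of the homologs found in a single genome
    """
    rng = rng or random.Random(0)
    target = frozenset(copies)
    # Copies that form an exclusive clade are treated as a species-specific duplication.
    if any(clade == target for clade in clades(tree)):
        return rng.choice(sorted(target))
    # Otherwise (e.g. possible lateral gene transfer) the marker is dropped for this genome.
    return None
```

For example, with the tree `((("sp1_a", "sp1_b"), "sp2"), "sp3")` the two `sp1` copies are sisters, so one of them is returned; with `(("sp1_a", "sp2"), ("sp1_b", "sp3"))` they are not, and the function returns `None`.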