Trees were visualized using Interactive Tree Of Life (Letunic and Bork, 2019) in conjunction with the script deposited at
In addition to GBDP formula d5, which explores sequence (dis-)similarity and is the recommended one for phylogenetic inference (Auch et al., 2006; Meier-Kolthoff et al., 2014a), we here used formula d3, which compares the gene content of the investigated genomes after correcting for reduction in genome size (Henz et al., 2005). While this analysis was also done using the GBDP software, for consistency with previous work we will refer to the d5 phylogeny as the GBDP tree and to the d3 tree as the gene-content analysis. There are various reasons why a gene-content phylogeny may fail to recover the true tree, as detailed below; hence the gene-content analysis is not intended to lend phylogenetic support. Nevertheless, it may be of taxonomic interest whether or not a certain branch is supported by gene-content data, particularly since the gene content conveys metabolic capabilities (Zhu et al., 2015) and may yield independent evidence for conclusions from standard genome-scale phylogenies (Breider et al., 2014).
Full-length 16S rRNA gene sequences were extracted from the genomes using RNAmmer version 1.2 (Lagesen et al., 2007) and compared with the 16S rRNA gene reference database using BLAST and phylogenetic trees to verify the taxonomic affiliation of genomes. Non-matching genome sequences were discarded from further analyses. A comprehensive sequence alignment was generated with MAFFT version 7.271 with the “localpair” option (Katoh et al., 2005) using either the sequences extracted from the genome sequences or the previously published 16S rRNA gene sequences, depending on the length and number of ambiguous bases. Trees were inferred from the alignment with RAxML (Stamatakis, 2014) under the maximum-likelihood (ML) criterion and with TNT (Goloboff et al., 2008) under the maximum-parsimony (MP) criterion. In addition to unconstrained, comprehensive 16S rRNA gene trees (UCT), constrained comprehensive trees (CCT) were inferred with ML and MP using the bipartitions of the GBDP tree with ≥95% support as backbone constraint, as previously described (Hahnke et al., 2016; Nouioui et al., 2018).
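The backbone constraint used for the CCT can be illustrated with a minimal sketch: starting from a rooted tree with branch support values, every branch with less than 95% support is dissolved, leaving a multifurcating tree that contains only the well-supported bipartitions. The `Node` class and the functions `backbone` and `to_newick` are hypothetical names introduced for illustration, the toy tree and its support values are made up, and this is not the script used in the study.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Rooted tree node (hypothetical helper, not part of any GBDP tool)."""
    name: Optional[str] = None        # leaf label; None for internal nodes
    support: Optional[float] = None   # support (%) of the branch above the node
    children: List["Node"] = field(default_factory=list)


def backbone(root: Node, threshold: float = 95.0) -> Node:
    """Return a copy of the tree keeping only branches with support >= threshold."""

    def process(node: Node) -> List[Node]:
        if not node.children:
            return [Node(node.name)]              # leaves are always kept
        kids = [k for child in node.children for k in process(child)]
        if node.support is not None and node.support >= threshold:
            return [Node(None, node.support, kids)]
        return kids                               # dissolve the weak branch

    kids = [k for child in root.children for k in process(child)]
    return Node(None, None, kids)                 # the root itself is kept


def to_newick(node: Node) -> str:
    """Serialize the topology only (no support values, no branch lengths)."""
    if not node.children:
        return node.name
    return "(" + ",".join(to_newick(c) for c in node.children) + ")"


# Toy example: ((A,B)100,(C,(D,E)60)97); the (D,E) branch is below 95%.
example = Node(children=[
    Node(support=100, children=[Node("A"), Node("B")]),
    Node(support=97, children=[
        Node("C"),
        Node(support=60, children=[Node("D"), Node("E")]),
    ]),
])

print(to_newick(backbone(example)) + ";")
```

A multifurcating tree of this kind is the sort of input that RAxML accepts as a constraint through its `-g` option; how the constraint was actually supplied to RAxML and TNT is described in the cited publications.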
Taxa were analyzed to determine whether they were monophyletic, paraphyletic or polyphyletic (Farris, 1974; Wood, 1994). Taxa that were non-monophyletic according to the GBDP tree were tested for evidence for their monophyly in the UCT and in the 16S rRNA gene trees, if any, of the original publication. In the case of a significant conflict (i.e., high support values for contradicting bipartitions) between trees or low support in the GBDP tree, additional phylogenomic analyses of selected taxa were conducted. To this end, protein sequences of those taxa with the reciprocal best hits from GBDP/BLAST were clustered with MCL (Markov Chain Clustering) version 14-137 (Enright et al., 2002) under default settings and an e-value filter of 10⁻⁵ in analogy to OrthoMCL (Li et al., 2003). The resulting sets of orthologous proteins were aligned with MAFFT and concatenated to form a supermatrix after discarding the few clusters that still contained more than a single protein for at least one genome. Comprehensive supermatrices were compiled from all the orthologs that occurred in at least four genomes, whereas core-genome supermatrices were constructed for the orthologs that occurred in all of the genomes. Supermatrices were analyzed with TNT, and with RAxML under the “PROTCATLGF” model, in conjunction with 100 partition bootstrap replicates (Siddall, 2010; Simon et al., 2017).
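The cluster filtering and concatenation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names `select_clusters` and `concatenate`, the dictionary layout of a cluster, the toy sequences, and the use of "-" to pad genomes that lack an ortholog are all hypothetical choices.

```python
from typing import Dict, List, Tuple

# Assumed layout: one cluster maps genome ID -> aligned protein sequences
# contributed by that genome (all sequences of a cluster have equal length).
Cluster = Dict[str, List[str]]


def select_clusters(
    clusters: List[Cluster], genomes: List[str], min_genomes: int = 4
) -> Tuple[List[Cluster], List[Cluster]]:
    """Split usable clusters into 'comprehensive' and 'core' sets."""
    comprehensive, core = [], []
    for cluster in clusters:
        # Discard clusters with more than one protein for at least one genome.
        if any(len(seqs) > 1 for seqs in cluster.values()):
            continue
        present = {g for g, seqs in cluster.items() if seqs}
        if len(present) >= min_genomes:
            comprehensive.append(cluster)   # occurs in at least four genomes
        if present == set(genomes):
            core.append(cluster)            # occurs in all genomes
    return comprehensive, core


def concatenate(
    clusters: List[Cluster], genomes: List[str], missing: str = "-"
) -> Dict[str, str]:
    """Concatenate aligned clusters into a supermatrix, padding absent genomes."""
    parts: Dict[str, List[str]] = {g: [] for g in genomes}
    for cluster in clusters:
        width = len(next(seqs[0] for seqs in cluster.values() if seqs))
        for g in genomes:
            seqs = cluster.get(g, [])
            parts[g].append(seqs[0] if seqs else missing * width)
    return {g: "".join(p) for g, p in parts.items()}


# Toy data (made up): five genomes, four clusters.
genomes = ["g1", "g2", "g3", "g4", "g5"]
clusters = [
    {"g1": ["MK"], "g2": ["MK"], "g3": ["MR"], "g4": ["MK"], "g5": ["MR"]},
    {"g1": ["ACD"], "g2": ["AC-"], "g3": ["ACD"], "g4": ["GCD"]},
    {"g1": ["WW"], "g2": ["WY"]},
    {"g1": ["LL", "LI"], "g2": ["LL"], "g3": ["LL"], "g4": ["LL"], "g5": ["LL"]},
]

comprehensive, core = select_clusters(clusters, genomes)
supermatrix = concatenate(comprehensive, genomes)
core_matrix = concatenate(core, genomes)
```

In this toy case the first two clusters enter the comprehensive supermatrix, only the first enters the core-genome supermatrix, the third is too sparse, and the fourth is discarded because one genome contributes two proteins.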
Additionally, selected phenotypic features relevant for the taxonomic classification of Alphaproteobacteria were collected as comprehensively as possible from the taxonomic literature: motility by flagella, absence or presence of carotenoids, absence or presence of bacteriochlorophyll α, absence or presence of sphingolipids, average number of isoprene residues of the major ubiquinones, and relationship to oxygen. To avoid circular reasoning, missing features of a species were only inferred from features of its genus when species and genus were described in the same publication or when the species description had explicitly been declared as adding to the features of the genus. For the binary chemotaxonomic characters, an alternative coding was also investigated that treated all missing values as indicating absence. Ubiquinone percentages would be more informative than mere statements about being “major”, but mostly only the latter are provided in the literature. Oxygen conditions were coded as an ordered multi-state character: (1) strictly anaerobic; (2) facultatively aerobic, facultatively anaerobic, or microaerophilic; (3) strictly aerobic. Among all nine coding options tested, this yielded the highest fit to the tree (
Taxa that were unambiguously non-monophyletic according to the genome-scale analyses were screened for published evidence of their monophyly. The published evidence was judged as inconclusive when based on unsupported branches in phylogenetic trees, on probably homoplastic characters, or on probable plesiomorphic character states. Plesiomorphies might well be “diagnostic” but only for paraphyletic groups (Hennig, 1965; Wiley and Lieberman, 2011; Montero-Calasanz et al., 2017); hence “diagnostic” features alone are insufficient in phylogenetic systematics.
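As a minimal illustration of what "non-monophyletic" means operationally for the screening described above, the sketch below checks whether a taxon forms a clade in a rooted tree: the taxon is monophyletic if the smallest clade containing all of its members contains nothing else. The nested-tuple tree representation, the function names, and the toy tree are assumptions made for illustration. The sketch does not distinguish paraphyly from polyphyly, and the outcome depends on how the tree is rooted.

```python
def leaves(tree):
    """Return the set of leaf labels below a node (leaf = str, clade = tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def smallest_clade(tree, members):
    """Leaf set of the smallest clade of `tree` that contains all `members`."""
    if not isinstance(tree, str):
        for child in tree:
            if members <= leaves(child):
                return smallest_clade(child, members)
    return leaves(tree)


def is_monophyletic(tree, members):
    """True if `members` are exactly the leaves of one clade of the rooted tree."""
    members = frozenset(members)
    if not members <= leaves(tree):
        raise ValueError("taxon contains leaves that are not in the tree")
    return smallest_clade(tree, members) == members


# Toy rooted tree (made up): (((A1,A2),B1),(B2,C1))
tree = ((("A1", "A2"), "B1"), ("B2", "C1"))

genus_a = is_monophyletic(tree, {"A1", "A2"})   # forms a clade
genus_b = is_monophyletic(tree, {"B1", "B2"})   # scattered over two clades
```

Here the hypothetical genus A forms a clade, whereas genus B does not, because the smallest clade containing B1 and B2 is the entire tree.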
To fix the obviously non-monophyletic taxa, taxonomic consequences were proposed if new taxon delineations could be determined that were sufficiently supported by the CCT. In these cases, the uncertain phylogenetic placement of taxa whose genome sequences were not available at the time of writing would not affect the new proposals. Where necessary, taxa were tentatively placed in newly delineated groups.