The approach to taxon sampling and analysis was in almost all respects the same as previously described (Hahnke et al., 2016; Nouioui et al., 2018). A total of 1104 annotated type-strain genome sequences (Supplementary Table S1) for Alphaproteobacteria (ingroup) and Spirochaetes (outgroup) were collected. While some originated from GenBank, the majority were obtained de novo in the course of phase II (Mukherjee et al., 2017) and phase IV of the KMG project and deposited in the Integrated Microbial Genomes platform (Chen et al., 2019) and in the Type-Strain Genome Server database (Meier-Kolthoff and Göker, 2019). Among Alphaproteobacteria, KMG-II mainly targeted Rhodobacteraceae but also representatives of other families. All newly generated KMG sequences underwent standard quality control at DSMZ and JGI, as documented on the respective web pages, and had < 100 contigs. All accepted genome sequences had < 500 contigs and matched the 16S rRNA gene reference database described below. Structural annotation at JGI and DSMZ was done using Prodigal v. 2.6.2 (Hyatt et al., 2010). The features of all genome sequences that entered these analyses are provided in Supplementary Table S1. These annotated genome sequences were processed further as in our previous study using the high-throughput version of the Genome BLAST Distance Phylogeny (GBDP) approach in conjunction with BLAST+ v2.2.30 in blastp mode (Auch et al., 2006; Camacho et al., 2009; Meier-Kolthoff et al., 2014a) and FastME version 2.1.6.1, using the improved neighbor-joining algorithm BioNJ for obtaining starting trees, followed by branch swapping under the balanced minimum evolution criterion (Desper and Gascuel, 2004) with the subtree-pruning-and-regrafting algorithm (Desper and Gascuel, 2006; Lefort et al., 2015).
One hundred pseudo-bootstrap replicates (Meier-Kolthoff et al., 2013a, 2014a) were used to obtain branch-support values for these genome-scale phylogenies.
Trees were visualized using Interactive Tree Of Life (Letunic and Bork, 2019) in conjunction with the script deposited at https://github.com/mgoeker/table2itol. Outgroup-based rooting was compared with rooting using least-squares dating as implemented in LSD version 0.2 (To et al., 2016) after removing the outgroup taxa and inferring an accordingly reduced tree with FastME. Species and subspecies boundaries were investigated using digital DNA:DNA hybridization (dDDH) as implemented in the Genome-To-Genome Distance Calculator (GGDC) version 2.1 (Meier-Kolthoff et al., 2013a) and in TYGS, the Type (Strain) Genome Server (Meier-Kolthoff and Göker, 2019).
In addition to GBDP formula d5, which explores sequence (dis-)similarity and is the recommended one for phylogenetic inference (Auch et al., 2006; Meier-Kolthoff et al., 2014a), we here used formula d3, which compares the gene content of the investigated genomes after correcting for reduction in genome size (Henz et al., 2005). While this analysis was also done using the GBDP software, for consistency with previous work we will refer to the d5 phylogeny as the GBDP tree and to the d3 tree as the gene-content analysis. There are various reasons why a gene-content phylogeny may fail to recover the true tree, as detailed below; hence the gene-content analysis is not intended to lend phylogenetic support. However, it may nevertheless be of taxonomic interest whether or not a certain branch is supported by gene-content data, particularly since the gene content conveys metabolic capabilities (Zhu et al., 2015) and can yield independent evidence for conclusions from standard genome-scale phylogenies (Breider et al., 2014).
Full-length 16S rRNA gene sequences were extracted from the genomes using RNAmmer version 1.2 (Lagesen et al., 2007) and compared with the 16S rRNA gene reference database using BLAST and phylogenetic trees to verify the taxonomic affiliation of genomes. Non-matching genome sequences were discarded from further analyses. A comprehensive sequence alignment was generated with MAFFT version 7.271 with the “localpair” option (Katoh et al., 2005) using either the sequences extracted from the genome sequences or the previously published 16S rRNA gene sequences, depending on the length and number of ambiguous bases. Trees were inferred from the alignment with RAxML (Stamatakis, 2014) under the maximum-likelihood (ML) criterion and with TNT (Goloboff et al., 2008) under the maximum-parsimony (MP) criterion. In addition to unconstrained, comprehensive 16S rRNA gene trees (UCT), constrained comprehensive trees (CCT) were inferred with ML and MP using the bipartitions of the GBDP tree with ≥95% support as backbone constraint, as previously described (Hahnke et al., 2016; Nouioui et al., 2018).
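The selection of well-supported bipartitions for the backbone constraint can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the representation of clades as sets of taxon names and the toy support values are assumptions made for illustration only, and the clades are assumed to come from a single rooted tree so that they are nested or disjoint.

```python
from typing import Dict, FrozenSet, List


def backbone_clades(clade_support: Dict[FrozenSet[str], float],
                    threshold: float = 95.0) -> List[FrozenSet[str]]:
    """Keep only clades (with at least two taxa) whose support meets the threshold."""
    return [c for c, s in clade_support.items() if s >= threshold and len(c) >= 2]


def constraint_newick(taxa: FrozenSet[str], clades: List[FrozenSet[str]]) -> str:
    """Build a multifurcating Newick string from nested or disjoint clades.

    Unresolved taxa are attached directly to the enclosing node, so only the
    retained clades are enforced as constraints.
    """
    def build(members: FrozenSet[str], candidates: List[FrozenSet[str]]) -> str:
        # Proper sub-clades of the current node, largest first (ties by name).
        inside = sorted((c for c in candidates if c < members),
                        key=lambda c: (-len(c), sorted(c)))
        children, covered = [], set()
        for c in inside:
            if not (c & covered):  # c is a maximal sub-clade
                children.append(c)
                covered |= c
        parts = [build(c, inside) for c in children]
        parts += sorted(members - covered)
        return "(" + ",".join(parts) + ")"

    return build(taxa, clades) + ";"


# Hypothetical toy example: clade {D, E} has only 80% support and is dropped.
taxa = frozenset("ABCDE")
support = {frozenset("AB"): 100.0, frozenset("ABC"): 96.0, frozenset("DE"): 80.0}
print(constraint_newick(taxa, backbone_clades(support)))
```

In this toy case the weakly supported clade is collapsed into a polytomy, so a constrained search remains free to resolve D and E in any way that is compatible with the two retained clades.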
Taxa were analyzed to determine whether they were monophyletic, paraphyletic or polyphyletic (Farris, 1974; Wood, 1994). Taxa that were non-monophyletic according to the GBDP tree were tested for evidence of their monophyly in the UCT and in the 16S rRNA gene trees, if any, of the original publication. In the case of a significant conflict (i.e., high support values for contradicting bipartitions) between trees or low support in the GBDP tree, additional phylogenomic analyses of selected taxa were conducted. To this end, protein sequences of those taxa with the reciprocal best hits from GBDP/BLAST were clustered with MCL (Markov Chain Clustering) version 14-137 (Enright et al., 2002) under default settings and an e-value filter of 10⁻⁵ in analogy to OrthoMCL (Li et al., 2003). The resulting sets of orthologous proteins were aligned with MAFFT and concatenated to form a supermatrix after discarding the few clusters that still contained more than a single protein for at least one genome. Comprehensive supermatrices were compiled from all the orthologs that occurred in at least four genomes, whereas core-genome supermatrices were constructed for the orthologs that occurred in all of the genomes. Supermatrices were analyzed with TNT, and with RAxML under the “PROTCATLGF” model, in conjunction with 100 partition bootstrap replicates (Siddall, 2010; Simon et al., 2017).
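The cluster filtering and concatenation rules described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used in this study: the function names, the data layout (cluster identifier mapped to genome and protein pairs) and the choice of "-" as the fill character for missing genomes are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def select_clusters(clusters: Dict[str, List[Tuple[str, str]]],
                    genomes: List[str],
                    min_genomes: int = 4) -> Tuple[List[str], List[str]]:
    """Return (comprehensive, core) cluster IDs.

    Clusters with more than one protein for any genome are discarded.
    Comprehensive: present in at least `min_genomes` genomes.
    Core: present in every genome.
    """
    comprehensive, core = [], []
    for cid, members in clusters.items():
        counts = Counter(genome for genome, _protein in members)
        if any(n > 1 for n in counts.values()):
            continue  # cluster still contains paralogs for some genome
        if len(counts) >= min_genomes:
            comprehensive.append(cid)
        if set(counts) == set(genomes):
            core.append(cid)
    return comprehensive, core


def concatenate(alignments: Dict[str, Dict[str, str]],
                cluster_ids: List[str],
                genomes: List[str]) -> Dict[str, str]:
    """Concatenate per-cluster alignments; genomes lacking a cluster get gaps."""
    matrix = {g: "" for g in genomes}
    for cid in cluster_ids:
        aln = alignments[cid]
        width = len(next(iter(aln.values())))  # aligned length of this cluster
        for g in genomes:
            matrix[g] += aln.get(g, "-" * width)
    return matrix
```

Passing the comprehensive list to `concatenate` yields a matrix with gap-filled blocks for genomes lacking a cluster, whereas the core list yields a matrix without such fill-in.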
Additionally, selected phenotypic features relevant for the taxonomic classification of Alphaproteobacteria were collected as comprehensively as possible from the taxonomic literature: motility by flagella, absence or presence of carotenoids, absence or presence of bacteriochlorophyll α, absence or presence of sphingolipids, average number of isoprene residues of the major ubiquinones, and relationship to oxygen. To avoid circular reasoning, missing features of a species were only inferred from features of its genus when species and genus were described in the same publication or when the species description had explicitly been declared as adding to the features of the genus. For the binary chemotaxonomic characters, an alternative coding was also investigated that treated all missing values as indicating absence. Ubiquinone percentages would be more informative than just statements about being “major”, but mostly only the latter are provided in the literature. Oxygen conditions were coded as an ordered multi-state character: (1) strictly anaerobic; (2) facultatively aerobic, facultatively anaerobic, or microaerophilic; (3) strictly aerobic. Among all nine coding options tested, this yielded the highest fit to the tree (Supplementary Table S1), but the differences between the coding options were not pronounced. Phylogenetic conservation of selected phenotypic and genomic characters with respect to the GBDP tree (reduced to represent each set of equivalent strains by only a single genome) was evaluated using a tip-permutation test in conjunction with the calculation of maximum-parsimony scores with TNT as previously described (Simon et al., 2017; Carro et al., 2018) and 10,000 permutations. TNT input files were generated with opm (Vaas et al., 2013). The proportion of times the score of a permuted tree was at least as low as the score of the original tree yielded the p-value.
Maximum-parsimony retention indices (Farris, 1989; Wiley and Lieberman, 2011) were calculated to further differentiate between the fit of each character to the tree.
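The logic of the tip-permutation test and of the retention index can be illustrated with a small Python sketch. It is a stand-in for the TNT-based calculations rather than a reproduction of them, and it rests on several assumptions: the tree is strictly bifurcating and given as nested tuples, the character is treated as unordered (Fitch parsimony), and all function names are hypothetical. An ordered character such as the oxygen relationship would require Wagner parsimony instead.

```python
import random
from collections import Counter


def leaves(tree):
    """Return leaf names of a tree given as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def fitch_score(tree, states):
    """Fitch parsimony length of one unordered character on a binary tree."""
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        (lset, lcost), (rset, rcost) = visit(node[0]), visit(node[1])
        common = lset & rset
        if common:
            return common, lcost + rcost
        return lset | rset, lcost + rcost + 1  # one extra state change
    return visit(tree)[1]


def tip_permutation_test(tree, states, n_perm=10000, seed=0):
    """Return (observed score, p-value).

    The p-value is the proportion of tip permutations whose score is at
    least as low as the observed score.
    """
    rng = random.Random(seed)
    names = leaves(tree)
    observed = fitch_score(tree, states)
    values = [states[n] for n in names]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)
        if fitch_score(tree, dict(zip(names, values))) <= observed:
            hits += 1
    return observed, hits / n_perm


def retention_index(tree, states):
    """RI = (g - s) / (g - m) for an unordered character.

    s: observed steps, m: minimum possible steps, g: maximum possible steps.
    """
    counts = Counter(states[n] for n in leaves(tree))
    s = fitch_score(tree, states)
    m = len(counts) - 1
    g = sum(counts.values()) - max(counts.values())
    return (g - s) / (g - m) if g > m else float("nan")
```

A character whose states are confined to a single clade needs only one step and reaches a retention index of 1, whereas a character scattered across the tips approaches 0 and is rarely distinguishable from its permuted versions.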
Taxa that were unambiguously non-monophyletic according to the genome-scale analyses were screened for published evidence of their monophyly. The published evidence was judged as inconclusive when it was based on unsupported branches in phylogenetic trees, on probably homoplastic characters or on probable plesiomorphic character states. Plesiomorphies might well be “diagnostic” but just for paraphyletic groups (Hennig, 1965; Wiley and Lieberman, 2011; Montero-Calasanz et al., 2017); hence “diagnostic” features alone are insufficient in phylogenetic systematics.
To fix the obviously non-monophyletic taxa, taxonomic consequences were proposed if new taxon delineations could be determined that were sufficiently supported by the CCT. In these cases, the uncertain phylogenetic placement of taxa whose genome sequences were not available at the time of writing would not affect the new proposals. Where necessary, taxa were tentatively placed in newly delineated groups.