A 16S rRNA gpkg was created from the 2013/08 public release of the Greengenes database (33). GraftM create was run using the sequences and the taxonomy-decorated phylogenetic tree for the 97% nucleotide identity representative OTU set (ftp://greengenes.microbio.me/greengenes_release/gg_13_8_otus).

Ribosomal protein gpkgs. Gpkgs were created for ribosomal proteins, starting from the set of HMMs included with PhyloSift (20). These HMMs were searched with HMMER, using an E-value cutoff of 1e-40, against the finished and permanent draft proteomes from IMG (34) that were >90% complete and <5% contaminated according to CheckM v1.0.5 (35). To prevent contaminated genomes from introducing error into the taxonomic annotations, only genomes in which a single hit was found were used. To limit taxonomic bias toward lineages with more sequenced genomes, only a single protein from each species was used (one representative per species, using a type strain where possible, plus all genomes without species-level taxonomic classification). Proteomes were searched using GraftM graft with default parameters, after which 15 ribosomal markers were determined to be single copy because each was detected as a single hit in >5900 of the 6215 genomes. GraftM packages for the 15 protein-coding genes were generated with GraftM create using the sequences found in single copy, a previously generated HMM and the corresponding IMG taxonomy for each genome.

Functional and taxonomic McrA gpkgs. Two gpkgs were constructed for the gene encoding the alpha subunit of methyl-coenzyme M reductase (mcrA). Amino acid sequences for the McrA protein family and paralogous MrtA sequences were sourced from IMG (February 2014) using the BLASTP tool provided online. Spurious hits were removed by manual inspection.
Genes for the Bathyarchaeotal (36) and Verstraetearchaeotal (37) orthologues were sourced from NCBI. The first, taxonomy-annotated gpkg was created with the default GraftM create pipeline using the sequences and their associated genome taxonomy. The second was created by re-decorating the McrA tree with functional rather than taxonomic information. Clades in this second tree were annotated according to substrate utilization: acetoclastic (from acetate), comprising the order Methanosarcinales; hydrogenotrophic (from hydrogen, carbon dioxide and/or formate), comprising the Methanomicrobiales, Methanocellales, Methanococcales and Methanobacteriales; and methylotrophic (from methylated compounds), comprising the Methanomassiliicoccales, Methanofastidiales and Verstraetearchaeota. Lineages within the Bathyarchaeota were recently found to encode mcrA, though their metabolism is not yet confirmed; these sequences were included in the gpkg but left unannotated. The McrA tree was curated with these functional groupings using ARB (38), with the exception of the Methanosarcina, which are thought to be capable of producing methane from all three substrate groups (39). The Methanosarcinaceae were annotated as a clade separate from the exclusively acetoclastic Methanosaetaceae.
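
The functional re-decoration described above amounts to a lookup from each sequence's lineage to a substrate group. The following is a minimal Python sketch of that lookup, not the procedure actually used (the tree was curated manually in ARB). The function name `functional_label`, the label strings, and the lineage format (a list of taxon names from broad to specific, without rank prefixes) are all assumptions made for illustration.

```python
# Illustrative sketch only: label strings and lineage format are hypothetical.

# Substrate group per lineage, as listed in the text.
GROUP_BY_TAXON = {
    "Methanosarcinales": "acetoclastic",
    "Methanomicrobiales": "hydrogenotrophic",
    "Methanocellales": "hydrogenotrophic",
    "Methanococcales": "hydrogenotrophic",
    "Methanobacteriales": "hydrogenotrophic",
    "Methanomassiliicoccales": "methylotrophic",
    "Methanofastidiales": "methylotrophic",
    "Verstraetearchaeota": "methylotrophic",
}

# Kept as its own clade because it is thought to use all three substrate groups.
# The label text is an assumption; the source does not give the exact string.
SEPARATE_CLADES = {
    "Methanosarcinaceae": "Methanosarcinaceae (all three substrate groups)",
}

# Included in the package but deliberately left without a functional label.
UNANNOTATED = {"Bathyarchaeota"}


def functional_label(lineage):
    """Return the functional label for a lineage, or None if left unannotated."""
    taxa = set(lineage)
    if taxa & UNANNOTATED:
        return None
    # Family-level exceptions take precedence over the order-level grouping.
    for clade, label in SEPARATE_CLADES.items():
        if clade in taxa:
            return label
    for taxon in lineage:
        if taxon in GROUP_BY_TAXON:
            return GROUP_BY_TAXON[taxon]
    return None


if __name__ == "__main__":
    example = ["Archaea", "Euryarchaeota", "Methanomicrobia",
               "Methanosarcinales", "Methanosaetaceae", "Methanosaeta"]
    print(functional_label(example))
```

Checking the family-level exception before the order-level table mirrors the text: the Methanosarcinaceae sit inside the Methanosarcinales but are annotated separately from the exclusively acetoclastic Methanosaetaceae.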