Annotations from Enzyme Commission (EC) numbers (http://www.sbcs.qmul.ac.uk/iubmb/enzyme/), Pfam, TIGRFAM, Clusters of Orthologous Groups (COG), and IMG Terms [27, 30–33] for cobamide biosynthesis, cobamide-dependent enzymes, and cobamide-independent alternatives were chosen. Some of these were the annotations used by Degnan et al. [10]; in other cases, alternative annotations were chosen to improve the specificity of the identified genes (Supplementary Table 4). For example, EC 4.2.1.30 for glycerol dehydratase identifies both cobamide-dependent and -independent isozymes, so Pfam annotations specific to the cobamide-dependent version were used instead. These genes were identified in each genome using the IMG/M ER “function profile: genomes vs functions” tool, accessed Jan–May 2017 (Supplementary Table 1, 2 sheet 2).
For genes without functional annotations in the IMG/M ER database, we chose sequences that were genetically or biochemically characterized [34–37] to use as the query genes in one-way BLASTP [38] against the filtered genomes using the IMG/M ER “gene profile: genomes vs genes” tool, accessed Jan–May 2017 (Supplementary Table 4).
Output files for the cobamide genes were combined into a master file in Microsoft Excel (Supplementary Table 1, 2 sheet 2). This master file was used as input for custom Python 2.7 code that interpreted the presence or absence of genes as predicted phenotypes. We used Microsoft Excel and Python for further analysis. Genomes were scored for the presence or absence of cobamide-dependent enzymes and alternatives (Supplementary Table 5) based on the annotations in Supplementary Table 4. We then defined criteria for seven cobamide biosynthesis phenotypes based on the presence of certain sets of cobamide biosynthesis genes (Supplementary Table 7) and classified genomes accordingly (Supplementary Table 5). The seven phenotypes were: very likely cobamide producer, likely cobamide producer, possible cobamide producer, tetrapyrrole precursor salvager, cobinamide (Cbi) salvager, likely non-producer, and very likely non-producer. These phenotypes were grouped into complete biosynthesis (very likely, likely, and possible cobamide producer), partial biosynthesis (tetrapyrrole precursor salvager and Cbi salvager), and no biosynthesis (likely non-producer and very likely non-producer).
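The grouping of phenotypes described above can be illustrated with a minimal Python sketch. It is not the custom Python 2.7 code used in this study: the function names and the input layout (a mapping from genome ID to assigned phenotype) are assumptions made for illustration, and the criteria that assign a phenotype to a genome (Supplementary Table 7) are not reproduced here.

```python
from collections import Counter

# Grouping of the seven predicted phenotypes into three categories,
# exactly as listed in the text.
PHENOTYPE_GROUP = {
    "very likely cobamide producer": "complete biosynthesis",
    "likely cobamide producer": "complete biosynthesis",
    "possible cobamide producer": "complete biosynthesis",
    "tetrapyrrole precursor salvager": "partial biosynthesis",
    "cobinamide (Cbi) salvager": "partial biosynthesis",
    "likely non-producer": "no biosynthesis",
    "very likely non-producer": "no biosynthesis",
}


def presence_absence(gene_counts):
    """Turn per-genome annotation counts into presence/absence booleans.

    gene_counts is a hypothetical dict: annotation name -> number of hits.
    """
    return {gene: count > 0 for gene, count in gene_counts.items()}


def tally_groups(phenotype_by_genome):
    """Count genomes per biosynthesis category.

    phenotype_by_genome is a hypothetical dict: genome ID -> phenotype name.
    """
    return Counter(PHENOTYPE_GROUP[p] for p in phenotype_by_genome.values())
```

For example, `tally_groups({"g1": "likely cobamide producer", "g2": "cobinamide (Cbi) salvager"})` counts one genome under complete biosynthesis and one under partial biosynthesis.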
During cobamide biosynthesis, the lower ligand base is activated by CobT to allow attachment to the nucleotide loop. For phenolic lower ligands, this reaction is carried out by ArsA and ArsB, encoded by subfamilies of cobT homologs found in tandem [22, 39]. To distinguish putative arsAB homologs from other cobT homologs that are not known to produce phenolyl cobamides, IMG/M ER entries for all genes that were annotated as cobT homologs were downloaded. Tandem cobT homologs were defined as those with sequential IMG gene IDs. This list of tandem cobT genes was then filtered by size to eliminate genes encoding fewer than 300 or more than 800 amino acid (AA) residues, sizes that indicate annotation errors (CobT is approximately 350 AA residues) (Supplementary Table 9). The remaining tandem cobT homologs were assigned as putative arsAB homologs.
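The tandem-gene filter described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: it assumes that IMG gene IDs are integers, that "sequential" means IDs differing by exactly 1, and that a pair is kept only when both members pass the size filter. The function name and input layout are hypothetical.

```python
def find_putative_arsAB_pairs(cobt_genes, min_len=300, max_len=800):
    """Return pairs of tandem cobT homologs that pass the size filter.

    cobt_genes: list of (gene_id, aa_length) tuples for all genes
    annotated as cobT homologs in one genome (hypothetical layout).
    """
    length_by_id = dict(cobt_genes)
    pairs = []
    for gene_id in sorted(length_by_id):
        neighbor = gene_id + 1  # assumption: sequential IDs differ by 1
        if neighbor not in length_by_id:
            continue
        # Assumption: both genes must encode 300-800 AA residues.
        if all(min_len <= length_by_id[g] <= max_len
               for g in (gene_id, neighbor)):
            pairs.append((gene_id, neighbor))
    return pairs
```

With the input `[(100, 350), (101, 360), (200, 340), (300, 900), (301, 350)]`, only the pair `(100, 101)` is kept: gene 200 has no neighbor, and gene 300 exceeds the upper size limit.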
To identify the anaerobic benzimidazole biosynthesis genes bzaABCDEF, four new hidden Markov model profiles (HMMs) were created and two preexisting ones (TIGR04386 and TIGR04385) were refined. Generally, the process for generating the new HMMs involved performing a Position-Specific Iterated (PSI) BLAST search using previously classified instances of the Bza proteins aligned in Jalview [38, 40]. Due to their similarity, BzaA, BzaB, and BzaF were examined together, as were BzaD and BzaE. To help classify these sequences, Training Set Builder (TSB) was used [41]. None of the six HMMs had been assigned TIGRFAM accessions at the time of publication, but all will be included in the next TIGRFAM release and are provided as Supplementary HMM Files. Details for each protein are listed in the Supplementary Materials and Methods. Protein sequences for 10,591 of the filtered genomes were queried for each bza HMM using hmm3search (HMMER3.1) [96]. Only hits scoring above the trusted cutoff defined for each HMM are reported (Supplementary Table 8). Hits for both bzaA and bzaB, or a hit for bzaF, indicated that the genome had the potential to produce benzimidazole lower ligands. The specific lower ligand was predicted based on the bza genes present [19].
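The hit-filtering and decision rule above can be sketched in Python as follows. This is a hedged illustration, not the analysis code: it reads the rule as "(bzaA and bzaB) or bzaF", treats a score equal to the trusted cutoff as passing, and uses hypothetical function names, input layout, and cutoff values (the real cutoffs are those defined for each HMM).

```python
def models_with_trusted_hits(hits, trusted_cutoffs):
    """Return the set of HMMs with at least one hit at or above its cutoff.

    hits: iterable of (model_name, bit_score) tuples (hypothetical layout).
    trusted_cutoffs: dict of model_name -> trusted cutoff bit score.
    """
    return {model for model, score in hits
            if score >= trusted_cutoffs[model]}


def can_make_benzimidazole(models_found):
    """Apply the rule as read here: (bzaA and bzaB) or bzaF."""
    return {"bzaA", "bzaB"} <= models_found or "bzaF" in models_found
```

A genome with trusted hits for bzaA only would not be flagged under this reading, whereas a genome with a trusted hit for bzaF alone would be.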
We used BLASTP on IMG/M ER to search for tetrapyrrole precursor biosynthesis genes that appeared to be absent in the 201 species identified as tetrapyrrole precursor salvagers. Query sequences used were the following: Rhodobacter sphaeroides HemA (GenPept C49845); Clostridium saccharobutylicum DSM 13864 HemA, HemL, HemB, HemC, and HemD (GenBank: AGX44136.1, AGX44131.1, AGX44132.1, AGX44134.1, AGX4133.4, respectively). We additionally searched for the Bacillus subtilis HemD, which only has the UroIII synthase activity (UniProtKB P21248.2). We visually inspected the open reading frames near any BLASTP hits in the IMG/M ER genome browser. After this analysis, 180 species remained (Supplementary Table 10). Genomes were classified as a particular type of tetrapyrrole precursor salvager only if they were missing all genes upstream of a precursor.
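The final classification rule can be illustrated with a short Python sketch. The pathway ordering and intermediate names below are assumptions drawn from general knowledge of tetrapyrrole biosynthesis, not from this section, and the sketch ignores the visual inspection step. The gene labels, function name, and input layout are hypothetical.

```python
# Assumed pathway order: (genes for the step, precursor produced).
# This ordering is not given in the text and is used only for illustration.
PATHWAY = [
    (("hemA", "hemL"), "5-aminolevulinic acid"),
    (("hemB",), "porphobilinogen"),
    (("hemC",), "hydroxymethylbilane"),
    (("hemD",), "uroporphyrinogen III"),
]


def salvaged_precursor(genes_present):
    """Return the most downstream precursor whose upstream genes are all absent.

    genes_present: set of gene labels detected in a genome.
    Returns None when the first step of the pathway is present.
    """
    salvaged = None
    for step_genes, product in PATHWAY:
        if any(gene in genes_present for gene in step_genes):
            break  # an upstream gene is present, so stop here
        salvaged = product
    return salvaged
```

Under these assumptions, a genome carrying only hemB, hemC, and hemD would be classified as a 5-aminolevulinic acid salvager, and a genome carrying hemA would not be assigned a salvager type.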
Shelton A.N., Seth E.C., Mok K.C., Han A.W., Jackson S.N., Haft D.R., & Taga M.E. (2018). Uneven distribution of cobamide biosynthesis and dependence in bacteria predicted by comparative genomics. The ISME Journal, 13(3), 789-804.