To address these challenges, we aimed to build a manually curated and validated database for screening environmental metagenomic and metatranscriptomic sequence datasets for functional genes. We focused on biochemical functions and metabolic pathways important in environmental microbial ecology, including the global carbon and nitrogen cycles, by manually selecting and organizing functional gene information into a database here called ‘FOAM’ (Functional Ontology Assignments for Metagenomes).
First, KEGG orthologs (KOs) (12) were retrieved and arranged in a hierarchical organization running from general features to specific pathways (such as denitrification and methanogenesis). KEGG KO, a reference set of homologous genes with consistent known functions, benefits from stability, good maintenance, curation, and third-party annotation. The KEGG KO was chosen as the FOAM ‘unit’ because it is a high-quality, dynamically maintained knowledge base associated with a rich tool environment that is available within or outside of KEGG. Additionally, using KEGG KO permits the use of all KEGG visualization tools and of third-party software that has been released [e.g. Cytoscape (14), Glamm (15), Voronto (16), iPATH (17), Bioconductor Pathview package (18)]. KEGG KO lists the genes defined in KEGG that belong to each functional and homologous family and, as a consequence, these can be multi-domain and multi-functional. Here, to provide accurate functional annotation, each FOAM module was constructed to ideally target one function.
The reduced size of the resultant FOAM database, compared to non-specific sequence databases, was a first step towards significant improvement in the speed and specificity of similarity searches. In addition, to improve upon the sensitivity of conventional heuristic alignment programs, we turned each KO set into Hidden Markov Models (HMMs; 19) by fetching their corresponding protein family (Pfam) profiles (20) as described in Figure 1. This step generated a sizeable number of conflicts (several Pfams per KO and vice versa) that were automatically resolved by functional assignments to KO. For the few remaining unresolved assignments, the corresponding set of sequences was manually split according to the topology of their phylogenetic trees. At this point the HMMs were re-trained from the new pool of sequences.
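The kind of conflict described above (several Pfams per KO and vice versa) can be illustrated with a minimal sketch. It only detects many-to-many relationships in a list of KO-to-Pfam pairs; it does not reproduce the resolution procedure used for FOAM. The function name and the identifiers `KO_A`, `PF_1`, etc. are hypothetical placeholders, not real KEGG or Pfam accessions.

```python
from collections import defaultdict


def find_mapping_conflicts(pairs):
    """Report KOs linked to several Pfams and Pfams linked to several KOs.

    `pairs` is an iterable of (ko_id, pfam_id) tuples. This is an
    illustrative helper, not part of the FOAM pipeline.
    """
    ko_to_pfams = defaultdict(set)
    pfam_to_kos = defaultdict(set)
    for ko, pfam in pairs:
        ko_to_pfams[ko].add(pfam)
        pfam_to_kos[pfam].add(ko)

    # Keep only the entries that are not one-to-one.
    kos_with_many_pfams = {
        ko: sorted(pfams) for ko, pfams in ko_to_pfams.items() if len(pfams) > 1
    }
    pfams_with_many_kos = {
        pfam: sorted(kos) for pfam, kos in pfam_to_kos.items() if len(kos) > 1
    }
    return kos_with_many_pfams, pfams_with_many_kos


# Hypothetical placeholder identifiers, for illustration only.
example_pairs = [
    ("KO_A", "PF_1"),
    ("KO_A", "PF_2"),
    ("KO_B", "PF_2"),
    ("KO_C", "PF_3"),
]
many_pfams, many_kos = find_mapping_conflicts(example_pairs)
print("KOs with several Pfams:", many_pfams)
print("Pfams with several KOs:", many_kos)
```

In this toy input, `KO_C` and `PF_3` form a one-to-one pair and are not reported, whereas `KO_A` and `PF_2` are each flagged because they take part in a one-to-many relationship.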
By retrieving the sequences of the corresponding Pfam of each selected KO, in addition to the sequences already present in the FOAM database, we ensured precise detection of functions from potentially distant homologs. With this method, ∼74 000 peptide sequence profiles were specifically tailored and trained to predict functions as defined in KEGG KO. This profile-based searching approach enabled identification of less conserved regions along sequence alignments. Thus this method is applicable for searching for more distant homologs, similar to the approach used by Pfam (20) and TIGRFAM (21). However, we found that most Pfam and TIGRFAM models provide multiple KEGG KO assignments and did not serve our needs for retrieval of functionally specific annotations from metagenomes. Also, Pfam and TIGRFAM do not focus on environmental processes and cover only a few functions of interest for different environmental sources. Additionally, Pfam and TIGRFAM are based on a simplified alignment, called ‘SEED’, which is composed of a collection of sequences representative of a protein family, whereas our aim was a more comprehensive recruitment of more distant homologs. Recently, FunGene (22) was published as a new toolkit specialized to process amplicon data for functional genes, focusing on marker genes (∼100 currently available). FunGene provides users with HMMs for their marker genes of interest as a tool to test primers and probes. Moreover, FunGene allows users to build and submit new HMMs. FOAM is complementary to FunGene: it includes ∼3000 custom protein models obtained by enriching Pfams relevant to environmental microbiology with more protein sequences. An additional attribute of FOAM is that KO assignments were screened during the manual calibration to ensure that the Pfam alignments all targeted the same KO. If parts of the alignments targeted other KOs they were omitted from building the models or manually reassigned.
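The screening step, in which alignment members that target a different KO are left out of model building or set aside for manual reassignment, can be sketched as follows. This is a simplified illustration under stated assumptions, not the calibration procedure itself: the function name, the sequence identifiers, and the choice to set aside sequences with no KO annotation (`None`) are all hypothetical.

```python
def screen_alignment(members, target_ko):
    """Split alignment members by whether they match the target KO.

    `members` is a list of (sequence_id, ko_id_or_None) tuples.
    Returns (kept, set_aside): `kept` holds sequence IDs annotated with
    `target_ko`; `set_aside` holds (sequence_id, ko) pairs for everything
    else, which would then need omission or manual reassignment.
    Assumption: unannotated sequences (ko is None) are also set aside.
    """
    kept = []
    set_aside = []
    for seq_id, ko in members:
        if ko == target_ko:
            kept.append(seq_id)
        else:
            set_aside.append((seq_id, ko))
    return kept, set_aside


# Hypothetical placeholder identifiers, for illustration only.
example_members = [
    ("seq1", "KO_A"),
    ("seq2", "KO_A"),
    ("seq3", "KO_B"),
    ("seq4", None),
]
kept, set_aside = screen_alignment(example_members, "KO_A")
print("Used to build the model:", kept)
print("Omitted or flagged for manual review:", set_aside)
```

Only the `kept` sequences would feed into model training in this sketch, so that each resulting model targets a single KO.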
Importantly, FOAM is a database that can be complemented with input from the user community. The FOAM database is by no means complete and we encourage recommendations from future users for additional categories to input into FOAM.