A total of nine complete genomes with a range of GC contents, together with their annotations, were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/) (Table 1). This set of genomes does not overlap with the genomes we used for training. To test FragGeneScan systematically, reads of various lengths (100, 200, 400 and 700 bp) and sequencing error rates (0–3%) were simulated from these genomes using MetaSim (10). For each genome, up to 1-fold coverage of reads was sampled for each combination of read length and sequencing error rate. Based on current estimates of sequencing error rates (10), Sanger sequencing reads of 700 bp were simulated with error rates ranging from 0% to 1%, and 454 sequencing reads were simulated with error rates ranging from 0% to 3%.
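
To make the sampling scheme concrete, the following is a minimal Python sketch of how many reads correspond to 1-fold coverage and how fixed-length reads with errors might be drawn from a genome. It is an illustration only and does not reproduce MetaSim or its platform-specific error models: the function names `reads_for_coverage` and `simulate_reads` are hypothetical, reads are taken from the forward strand only, and errors are modeled as uniform random substitutions, which is a simplifying assumption.

```python
import random


def reads_for_coverage(genome_length, read_length, coverage=1.0):
    """Number of fixed-length reads needed to reach (at most) the given fold coverage."""
    return int(coverage * genome_length / read_length)


def simulate_reads(genome, read_length, error_rate, coverage=1.0, seed=0):
    """Sample reads uniformly from `genome` and add substitution errors.

    Assumption: each base is independently replaced by a different base
    with probability `error_rate`; insertions and deletions are not modeled.
    """
    rng = random.Random(seed)
    n_reads = reads_for_coverage(len(genome), read_length, coverage)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        bases = list(genome[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the three other nucleotides.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads
```

Under these assumptions, a 4.6 Mb genome sampled with 400 bp reads at 1-fold coverage corresponds to `reads_for_coverage(4_600_000, 400)`, that is, 11500 reads.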

Table 1. Genomes of microbial species that were used to evaluate the performance of FragGeneScan

| Species | GenBank Acc. | GC (%) | Genome size (Mb) | No. of genes |
|---|---|---|---|---|
| Buchnera aphidicola str. APS | NC_002528 | 26 | 0.6 | 564 |
| Burkholderia pseudomallei K96243 chr1 | NC_006350 | 67 | 4.1 | 3399 |
| Bacillus subtilis subsp. subtilis str. 168 | NC_000964 | 43 | 4.2 | 4105 |
| Corynebacterium jeikeium K411 | NC_007164 | 61 | 2.5 | 2104 |
| Chlorobium tepidum TLS | NC_002932 | 56 | 2.2 | 2252 |
| Escherichia coli str. K-12 substr. MG1655 | NC_000913 | 50 | 4.6 | 4132 |
| Helicobacter pylori J99 | NC_000921 | 39 | 1.6 | 1489 |
| Prochlorococcus marinus str. MIT 9312 | NC_007577 | 31 | 1.7 | 1810 |
| Wolbachia endosymbiont str. TRS | NC_006833 | 34 | 1.1 | 805 |

Three real metagenomes were used to evaluate gene prediction in metagenomic sequences (Supplementary Table S3). Two of them (TS28 and TS50), from the obese and lean twin study (14), were downloaded from the MG-RAST website (http://metagenomics.nmpdr.org). The third (SRX007415), from the rumen microbiota response study, was downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov). The three metagenomes were searched with BLASTX against a 98% non-redundant set of protein sequences from prokaryotic genomes, plasmids and phages collected from IMG 3.0 (http://img.jgi.doe.gov), using an E-value cutoff of 1.0e-3 for TS28 and TS50 and 1.0e-1 for SRX007415 (which has shorter reads). FragGeneScan gene predictions in these metagenomes were compared with the similarity search results.
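
One way such a comparison could be carried out is sketched below in Python. The text does not specify the exact procedure, so this is an assumption-laden illustration rather than the method actually used: it assumes BLASTX results in the standard 12-column tabular format (query ID in column 1, E-value in column 11), compares predictions and hits at the level of whole reads, and uses the hypothetical helper names `reads_with_hits` and `compare_to_hits`. The two example hit lines are made up for demonstration.

```python
import csv
import io


def reads_with_hits(blast_tabular, evalue_cutoff):
    """IDs of reads with at least one hit at or below the E-value cutoff.

    Assumes 12-column tabular BLAST output: column 1 is the query (read) ID
    and column 11 is the E-value.
    """
    hits = set()
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        if float(row[10]) <= evalue_cutoff:
            hits.add(row[0])
    return hits


def compare_to_hits(predicted, supported, all_reads):
    """Cross-tabulate reads by gene prediction and similarity-search support."""
    return {
        "predicted_and_hit": len(predicted & supported),
        "predicted_only": len(predicted - supported),
        "hit_only": len(supported - predicted),
        "neither": len(all_reads - predicted - supported),
    }


# Made-up example: read1 has a strong hit, read2 only a weak one.
example = io.StringIO(
    "read1\tprotA\t90.0\t60\t6\t0\t1\t180\t1\t60\t1e-20\t95.0\n"
    "read2\tprotB\t40.0\t30\t18\t0\t1\t90\t5\t34\t0.5\t25.0\n"
)
supported = reads_with_hits(example, evalue_cutoff=1.0e-3)
predicted = {"read1", "read3"}  # reads in which a gene was predicted
all_reads = {"read1", "read2", "read3", "read4"}
counts = compare_to_hits(predicted, supported, all_reads)
```

With the looser cutoff of 1.0e-1 applied to the shorter reads of SRX007415, the same function would be called with `evalue_cutoff=1.0e-1`.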