A total of 1,468,357 protein-coding sequences (CDS) from 501 Hungate isolate genomes were searched using LAST [77] against ~1.9 billion CDS predicted from 8,200 metagenomic samples stored in the IMG database. Hungate genomes were designated as “recruiters” if either of the following criteria was met: a minimum of 200 CDS with hits at >=90% amino acid identity over 70% of the alignment length to an individual metagenomic CDS, or capture of >=10% of the total CDS in the genome. The minimum hit count of 200 was chosen to ensure that the evidence included more than just housekeeping genes, which tend to be more highly conserved. In a few instances, the 200-CDS requirement was relaxed if at least 10% of the total CDS in the genome were captured. The 90% amino acid identity cutoff was chosen based on Luo et al. [78], who assert that organisms grouped at the ‘species’ level typically show >85% average amino acid identity (AAI) among themselves. We ascertained that >=90% identity was sufficiently discriminatory for species in the Hungate genome set by observing differences in the recruitment pattern (hit count or % CDS coverage) of different species of the same genus (e.g., Prevotella spp., Butyrivibrio spp., Bifidobacterium spp., Treponema spp.) from every phylum against the same metagenomic sample.
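The recruiter criteria described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Hit` record, the function names, and the reading of the 70% alignment criterion as "at least 70% of the query CDS" are assumptions made for illustration, and the input is taken to be hits that have already been parsed from the aligner output.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One isolate CDS hit against a metagenomic CDS (hypothetical record)."""
    cds_id: str              # identifier of the isolate CDS
    percent_identity: float  # amino acid identity, 0-100
    aligned_fraction: float  # fraction of the CDS covered by the alignment, 0-1


def count_recruited_cds(hits, min_identity=90.0, min_aligned_fraction=0.70):
    """Count distinct isolate CDS with at least one hit passing both cutoffs."""
    passing = {
        h.cds_id
        for h in hits
        if h.percent_identity >= min_identity
        and h.aligned_fraction >= min_aligned_fraction
    }
    return len(passing)


def is_recruiter(hits, total_cds, min_hits=200, min_fraction=0.10):
    """Apply the two alternative criteria from the text.

    A genome qualifies if it has >= min_hits passing CDS, or if the
    passing CDS make up >= min_fraction of all CDS in the genome.
    """
    n = count_recruited_cds(hits)
    if n >= min_hits:
        return True
    return total_cds > 0 and n / total_cds >= min_fraction
```

For example, under these assumptions a genome with 150 passing CDS out of 1,000 total would qualify through the 10% rule even though it falls short of 200 hits, while the same 150 hits in a 3,000-CDS genome would not.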
For nucleotide read recruitment, total reads from an individual metagenome were aligned against scaffolds from each of the 501 isolates using the BWA aligner [79]. The effective minimum nucleotide identity was ~75%, with a minimum alignment length of 50 bp. Alignment results were examined in terms of the total number of reads recruited to an isolate (at different % identity cutoffs, with >=97% identity proposed as species-level recruitment), the average read depth of total reads recruited to a given isolate genome, and the % coverage of the total nucleotide length of the genome.
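The three read-recruitment metrics named above (reads recruited at each identity cutoff, average read depth, and % genome coverage) can be sketched as follows. This is an illustrative calculation, not the authors' code. It assumes alignments have already been extracted from the BWA output as `(start, end, percent_identity)` tuples with 0-based, half-open coordinates on a single concatenated genome, and it assumes average depth means aligned bases divided by genome length. The function name and the intermediate 90% cutoff are hypothetical.

```python
def summarize_recruitment(alignments, genome_length,
                          identity_cutoffs=(75.0, 90.0, 97.0),
                          min_aln_len=50):
    """Summarize read recruitment to one isolate genome.

    alignments: iterable of (start, end, percent_identity), 0-based half-open.
    Returns reads per identity cutoff, mean depth, and percent coverage.
    """
    # Discard alignments shorter than the minimum alignment length.
    kept = [(s, e, pid) for s, e, pid in alignments if e - s >= min_aln_len]

    # Number of reads recruited at or above each identity cutoff.
    reads_at_cutoff = {
        cutoff: sum(1 for _, _, pid in kept if pid >= cutoff)
        for cutoff in identity_cutoffs
    }

    # Difference array: +1 where an alignment starts, -1 where it ends.
    diff = [0] * (genome_length + 1)
    for s, e, _ in kept:
        diff[s] += 1
        diff[e] -= 1

    # A running sum of the difference array gives per-base depth.
    depth = 0
    total_aligned_bases = 0
    covered_positions = 0
    for i in range(genome_length):
        depth += diff[i]
        total_aligned_bases += depth
        if depth > 0:
            covered_positions += 1

    return {
        "reads_at_cutoff": reads_at_cutoff,
        "mean_depth": total_aligned_bases / genome_length,
        "percent_coverage": 100.0 * covered_positions / genome_length,
    }
```

The difference array keeps the depth calculation linear in genome length plus the number of alignments, which matters when millions of reads map to one genome.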