We compiled all the prokaryotic genomes publicly available as of March 2019 that were sampled from the human gut. To retrieve isolate genomes, we surveyed the IMG (ref. 24), NCBI (ref. 22) and PATRIC (ref. 23) databases for genome sequences annotated as having been isolated from the human gastrointestinal tract. We complemented this set with bacterial genomes belonging to two recent culture collections: the HBC (ref. 19) and CGR (ref. 21). To avoid including duplicate entries due to redundancy between reference databases, we combined genomes obtained from the PATRIC and IMG repositories and added only those without an identical genome in the sets extracted from NCBI, HBC and CGR. This was determined by comparing isolate genomes between different databases using Mash v2.1 (ref. 26; ‘mash dist’ function) and only selecting one genome among those estimated to be identical (Mash distance of 0). MAGs (that is, uncultured genomes) were obtained from Pasolli et al. (ref. 20; CIBIO), Almeida et al. (ref. 18; EBI) and Nayfach et al. (ref. 16; HGM). For the CIBIO set, only genomes retrieved from samples collected from the intestinal tract were used.
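The deduplication step can be illustrated with a minimal sketch. It assumes the tab-separated output of ‘mash dist’ (reference ID, query ID, distance, P value, shared hashes) and groups genomes linked by a Mash distance of 0, keeping one genome per group. The function name, the file names and the rule of preferring genomes from a priority set (here standing in for NCBI, HBC and CGR) are hypothetical choices made for illustration and are not taken from the original pipeline.

```python
import csv
import io


def pick_representatives(mash_tsv, priority_ids):
    """Group genomes linked by Mash distance 0 and keep one per group.

    mash_tsv     : text in the tab-separated layout of `mash dist`
                   (reference, query, distance, p-value, shared hashes).
    priority_ids : genome IDs preferred as representatives (assumption:
                   these stand in for the NCBI, HBC and CGR sets).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for row in csv.reader(io.StringIO(mash_tsv), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        ref, query, dist = row[0], row[1], float(row[2])
        find(ref)    # register both genomes even if they are not identical
        find(query)
        if dist == 0.0:
            union(ref, query)

    clusters = {}
    for genome in list(parent):
        clusters.setdefault(find(genome), []).append(genome)

    representatives = []
    for members in clusters.values():
        # Priority-set genomes sort first; ties are broken alphabetically.
        members.sort(key=lambda g: (g not in priority_ids, g))
        representatives.append(members[0])
    return sorted(representatives)


# Toy input with hypothetical file names.
example = "\n".join([
    "ncbi_A.fa\tpatric_X.fa\t0\t0\t1000/1000",
    "ncbi_A.fa\timg_Y.fa\t0.0123\t0\t600/1000",
    "patric_X.fa\timg_Y.fa\t0.0123\t0\t600/1000",
    "img_Y.fa\timg_Z.fa\t0\t0\t1000/1000",
])
kept = pick_representatives(example, priority_ids={"ncbi_A.fa"})
```

In this toy input, patric_X.fa is dropped because it is identical to ncbi_A.fa, and only one of img_Y.fa and img_Z.fa is retained. Grouping by connected components treats a distance of 0 as transitive, which is a simplification introduced here.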
Metadata for each genome were first retrieved from the five large human gut studies (refs. 16, 18–21). These were further enriched with data obtained using the ENA API (https://www.ebi.ac.uk/ena/portal/api) and the NCBI E-utilities (http://eutils.ncbi.nlm.nih.gov/). Metadata on the isolate genomes from IMG and PATRIC were retrieved using the GOLD (ref. 52) system and the PATRIC FTP website (ftp://ftp.patricbrc.org/patric2/current_release/RELEASE_NOTES/genome_metadata), respectively. We only extracted metadata on the geographic origin of each genome, as other factors such as disease status and demographic information were missing from most of the samples.
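As an illustration of how geographic origin might be retrieved programmatically, the sketch below builds a query URL for the ENA API and parses a tab-separated response into a mapping from sample accession to country. The ‘search’ endpoint, its ‘result’, ‘query’, ‘fields’ and ‘format’ parameters, the ‘country’ field name and the accessions shown are assumptions made for this example; the exact queries used in the study are not described in the text. No network request is made.

```python
import csv
import io
from urllib.parse import urlencode

ENA_API = "https://www.ebi.ac.uk/ena/portal/api"


def ena_sample_query_url(sample_accession,
                         fields=("sample_accession", "country")):
    """Build a search URL for one sample (parameter names are assumptions)."""
    params = {
        "result": "sample",
        "query": f'sample_accession="{sample_accession}"',
        "fields": ",".join(fields),
        "format": "tsv",
    }
    return f"{ENA_API}/search?{urlencode(params)}"


def parse_country(tsv_text):
    """Map sample accession to country from a TSV response.

    Values such as 'China: Shenzhen' are reduced to the part before the
    colon; empty values are stored as None.
    """
    origins = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        country = (row.get("country") or "").split(":")[0].strip()
        origins[row["sample_accession"]] = country or None
    return origins


# Hypothetical accessions and response text, for illustration only.
url = ena_sample_query_url("SAMEA0000001")
mock_response = "sample_accession\tcountry\nSAMEA1\tChina: Shenzhen\nSAMEA2\t\n"
origins = parse_country(mock_response)
```

Storing missing values as None keeps samples without a recorded country distinguishable from those with one, which matters when only geographic origin is retained.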