The 16S rRNA-based phylometagenomic dataset of the normal (healthy) human microbiome was made available through the Human Microbiome Project [13]. It consists of 454 FLX Titanium sequences spanning the V3 to V5 variable regions, obtained for 301 samples from 24 healthy subjects (12 male, 12 female) enrolled at a single clinical site in Houston, TX. These samples cover 18 different body sites grouped into 6 main body site categories: the oral cavity (9 sites), the gut (1 site), the vagina (3 sites), the retroauricular crease (2 sites), the nasal cavity (1 site) and the skin (2 sites). Detailed protocols for enrollment, sampling, DNA extraction, 16S amplification and sequencing are available on the Human Microbiome Project Data Analysis and Coordination Center website [103], and are also described elsewhere [55,56]. In brief, genomic DNA was isolated using the Mo Bio PowerSoil kit [104] and subjected to 16S amplification using primers that incorporate the FLX Titanium adapters and a sample barcode sequence, allowing directional sequencing from variable region V5 to partial V3 (primers: 357F 5'-CCTACGGGAGGCAGCAG-3' and 926R 5'-CCGTCAATTCMTTTRAGT-3'). The resulting sequences were processed using a data curation pipeline implemented in mothur [41], which reduces the sequencing error rate to less than 0.06%, as validated on a mock community. To pass the initial quality control step of the pipeline, a sequence was allowed one unambiguous mismatch to the sample barcode and two mismatches to the PCR amplification primers. Sequences with an ambiguous base call or a homopolymer longer than eight nucleotides were removed from subsequent analyses, as suggested previously [105]. Based on the supplied quality scores, each sequence was trimmed at the first base call with a score below 20.
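The quality-control criteria above can be illustrated with a minimal Python sketch. It is not the mothur implementation: the function names are hypothetical, reads are assumed to be uppercase with the barcode already removed, quality scores are assumed to be given as a list of integers, and the order in which mothur applies these checks is not modeled.

```python
import re

# IUPAC degenerate base codes; needed because primer 926R contains M and R.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Primer sequences as given in the text.
PRIMER_357F = "CCTACGGGAGGCAGCAG"
PRIMER_926R = "CCGTCAATTCMTTTRAGT"


def primer_mismatches(primer, read_prefix):
    """Count positions where the read base is not permitted by the primer code.

    Hypothetical helper: assumes the read prefix starts exactly at the primer.
    """
    mismatches = sum(1 for p, b in zip(primer, read_prefix) if b not in IUPAC[p])
    # Primer positions not covered by a too-short read count as mismatches.
    return mismatches + max(0, len(primer) - len(read_prefix))


def trim_at_low_quality(seq, quals, min_q=20):
    """Cut the read just before the first base call scoring below min_q."""
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i]
    return seq


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated character."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def passes_filters(seq, max_homopolymer=8):
    """Reject reads with any ambiguous base or a homopolymer longer than 8 nt."""
    if any(base not in "ACGT" for base in seq):
        return False
    return longest_homopolymer(seq) <= max_homopolymer
```

Under these assumptions, a read would be kept only if `primer_mismatches` returns at most 2 for its primer and `passes_filters` returns `True`; the barcode check (at most one mismatch) would follow the same pattern.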
All sequences were aligned using a NAST-based sequence aligner to a custom reference based on the SILVA alignment [106,107]. Sequences that were shorter than 200 bp or that did not align to the anticipated region of the reference alignment were removed from further analysis. Chimeric sequences were identified using the mothur implementation of the ChimeraSlayer algorithm [108]. Unique reads were classified with the MSU RDP classifier v2.2 [58] using the taxonomy proposed by [109], maintained at the RDP (RDP 10 database, version 6). The 16S rRNA reads are available in the Sequence Read Archive at [110].
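As a rough illustration of the length filter and of collapsing reads to unique sequences before classification, the following Python sketch may help. The function name `dereplicate` is hypothetical, and the sketch covers only the 200 bp cutoff; the alignment-region check, chimera detection and classification steps are not modeled.

```python
from collections import Counter

MIN_LENGTH = 200  # reads shorter than 200 bp are discarded, per the text


def dereplicate(reads, min_length=MIN_LENGTH):
    """Drop short reads, then map each unique sequence to its copy number.

    Keeping the counts lets a classification assigned to one unique read
    be propagated back to every identical read (an assumption of this
    sketch, not a detail stated in the text).
    """
    counts = Counter(seq for seq in reads if len(seq) >= min_length)
    return dict(counts)
```

For example, given two identical 200 bp reads, one 199 bp read and one 250 bp read, `dereplicate` would return two unique sequences with counts 2 and 1.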