To achieve high prediction accuracy, a deep learning algorithm needs a large amount of training data. Although a large number of training sequences were obtained from RefSeq, the training dataset can potentially be enlarged by including viral sequences from metavirome sequencing data. Metavirome sequencing targets mainly viruses by removing prokaryotic cells from samples with physical 0.22 μm filters. Because it does not rely on culturing viruses in the lab, it can capture both cultivated and uncultivated viruses and so better represents the true viral diversity. A few studies have used this technique to extract viruses and sequence viral genomes in human gut and ocean samples [1,2,62,63]. Norman et al. sequenced the virome in human gut samples from IBD patients using Illumina sequencing technology [1]. Reyes et al. studied viruses in fecal samples from Malawian twins with Severe Acute Malnutrition (SAM) using Roche 454 sequencing technology [2]. Minot et al. and Kim et al. investigated the virome in the healthy human gut using Roche 454 [11,62]. For the marine virome, the Tara Ocean Virome project collected the largest number of virome samples from both surface- and deep-ocean sites around the world [63].
We collected the metavirome samples from those studies to add more viral diversity to the training data, especially viruses absent from or under-represented in RefSeq. We applied careful quality control because the samples can be contaminated by prokaryotic DNA: the physical filters may not exclude small prokaryotic cells. Details of the preparation and quality control of the metavirome data can be found in the Supplementary Materials and Supplementary Table S3. Up to 1.3 million sequences were generated from the metavirome data and combined with sequences derived from viral RefSeq before May 2015 for training. An equal number of prokaryotic sequences was paired with the viral sequences in the enlarged dataset for training. The new model was evaluated against the original model, trained on RefSeq only, using the test sequences from RefSeq after May 2015.
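The class balancing described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function name `build_balanced_training_set`, the 1/0 labels, and the use of random sampling without replacement to choose the prokaryotic sequences are all assumptions made for illustration, since the text states only that the number of prokaryotic sequences matched the number of viral sequences.

```python
import random


def build_balanced_training_set(refseq_viral, metavirome_viral, prokaryotic, seed=0):
    """Combine viral sequences and pair them with an equal number of prokaryotic ones.

    Hypothetical helper: labels (1 = viral, 0 = prokaryotic) and the random
    sampling strategy are illustrative assumptions, not the authors' method.
    """
    # Enlarged viral set: RefSeq-derived plus metavirome-derived sequences.
    viral = list(refseq_viral) + list(metavirome_viral)
    prokaryotic = list(prokaryotic)
    if len(prokaryotic) < len(viral):
        raise ValueError("not enough prokaryotic sequences to balance the viral set")

    rng = random.Random(seed)
    # Sample without replacement so each prokaryotic sequence is used at most once.
    sampled_prokaryotic = rng.sample(prokaryotic, len(viral))

    examples = [(seq, 1) for seq in viral] + [(seq, 0) for seq in sampled_prokaryotic]
    rng.shuffle(examples)  # mix the two classes before training
    return examples


# Toy usage with placeholder sequences (not real data).
examples = build_balanced_training_set(
    refseq_viral=["ACGT", "GGCA"],
    metavirome_viral=["TTAG"],
    prokaryotic=["CCCC", "AAAA", "GGGG", "TTTT"],
)
```

Fixing the seed makes the sampled prokaryotic subset reproducible, which helps when comparing a model trained on the enlarged dataset with one trained on RefSeq only.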