We collected 1562 RefSeq genomes of viruses infecting prokaryotes and 31,986 prokaryotic host RefSeq genomes from NCBI in May 2015. The NCBI accession numbers of the RefSeq sequences are provided in Additional file 2: Table S2. To mimic fragmented metagenomic sequences, for each fragment length L = 500, 1000, 3000, 5000, and 10000 bp, virus genomes were split into non-overlapping fragments of length L, and the same number of non-overlapping fragments of length L were randomly subsampled from the prokaryotic genomes. Fragments from virus genomes discovered before 1 January 2014 were used as the training set, and fragments from virus genomes discovered after 1 January 2014 were used as the testing set (Table 1). To generate evaluation datasets containing 10, 50, and 90% viral contigs, the viral contigs, in the numbers given in Table 1, were combined with nine times as many, an equal number of, or one-ninth as many randomly sampled host contigs, respectively.
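The fragmentation and mixing steps described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, sequences are assumed to be plain strings, the trailing remainder shorter than L is assumed to be discarded, and host fragments are pooled in memory, which would not scale to tens of thousands of genomes without streaming.

```python
import random

def split_non_overlapping(seq, length):
    """Cut seq into consecutive, non-overlapping fragments of exactly `length` bp.
    Assumption: a trailing remainder shorter than `length` is dropped."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]

def sample_host_fragments(host_genomes, length, n, rng):
    """Randomly subsample n non-overlapping fragments from the host genomes."""
    pool = [frag for genome in host_genomes
            for frag in split_non_overlapping(genome, length)]
    return rng.sample(pool, n)  # raises ValueError if fewer than n fragments exist

def host_count_for_viral_percent(n_viral, viral_percent):
    """Number of host contigs needed so that viral contigs make up
    viral_percent of the mixed dataset (e.g. 10, 50, or 90)."""
    return round(n_viral * (100 - viral_percent) / viral_percent)
```

For example, with 100 viral contigs, a 10% viral dataset needs `host_count_for_viral_percent(100, 10)` host contigs, i.e. nine times as many as the viral contigs.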
For the analyses in which viruses infecting a particular host taxon were excluded from the training of VirFinder, we selected highly represented host phyla (Actinobacteria, Cyanobacteria, Firmicutes, Proteobacteria) and genera (Mycobacterium, Escherichia, Pseudomonas, Staphylococcus, Bacillus, Vibrio, and Streptococcus). To evaluate each of the resulting VirFinder models, equal numbers of contigs from the excluded viruses and from all other viruses were selected and then combined with randomly selected host contigs such that the total numbers of virus and host contigs were equal.
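The construction of a balanced evaluation set for one excluded taxon can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: each viral contig is assumed to be a dictionary with a hypothetical `host_taxon` field, and the number of contigs per group is assumed to be capped by the smaller of the two groups.

```python
import random

def build_holdout_eval_set(viral_contigs, host_contigs, excluded_taxon, rng):
    """Pick equal numbers of contigs from excluded-taxon viruses and from all
    other viruses, plus as many host contigs as viral contigs in total."""
    excluded = [c for c in viral_contigs if c["host_taxon"] == excluded_taxon]
    others = [c for c in viral_contigs if c["host_taxon"] != excluded_taxon]
    n = min(len(excluded), len(others))        # assumption: cap at the smaller group
    chosen_excluded = rng.sample(excluded, n)
    chosen_others = rng.sample(others, n)
    hosts = rng.sample(host_contigs, 2 * n)    # total virus == total host
    return chosen_excluded, chosen_others, hosts
```

The call raises `ValueError` when fewer than 2n host contigs are available, which makes an undersized host pool visible instead of silently unbalancing the dataset.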
For the analysis of VirFinder trained on 14,722 prokaryotic genomes with or without proviruses removed, the genomes were downloaded from the database cited in [6]. Likewise, the positions of proviruses predicted by VirSorter in these 14,722 genomes were obtained from the published data of [6] and were used to remove these sequences from their corresponding host genomes.
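Removing predicted provirus regions from a host genome amounts to excising a set of coordinate intervals. The sketch below shows one way to do this, assuming 0-based, half-open `(start, end)` coordinates and assuming the flanking host sequence is simply concatenated. The source specifies neither the coordinate convention of the published predictions nor how the flanks were handled, so both are illustrative choices, and `remove_intervals` is a hypothetical helper.

```python
def remove_intervals(seq, intervals):
    """Return seq with all (start, end) intervals excised.
    Assumption: coordinates are 0-based and half-open; overlapping or
    adjacent intervals are merged before excision."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend the previous interval
        else:
            merged.append([start, end])
    kept, pos = [], 0
    for start, end in merged:
        kept.append(seq[pos:start])  # host sequence before this provirus
        pos = end
    kept.append(seq[pos:])           # host sequence after the last provirus
    return "".join(kept)
```

If the flanking segments should instead be kept as separate sequences, so that no fragment spans an excision site, the function could return the `kept` list rather than the joined string.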