Each metric is computed in sliding windows of 10 to 100 genes, starting at every gene along the sequence, and all significance scores greater than 2 are stored. Local maxima of the significance score are then identified, and the set of genes in each corresponding window is defined as a putative viral region. Overlapping predictions from the different metrics are then merged, extending each region to include all predicted windows, which yields a list of putative viral regions, each associated with one or more metrics. These regions are classified into three categories: (i) category 1 (“most confident” predictions) regions have significant enrichment in viral-like genes or non-Caudovirales genes over the whole region and at least one hallmark viral gene detected; (ii) category 2 (“likely” predictions) regions have either enrichment in viral-like or non-Caudovirales genes, or a viral hallmark gene detected, together with at least one other metric (depletion in PFAM affiliation, enrichment in uncharacterized genes, enrichment in short genes, or depletion in strand switching); and (iii) category 3 (“possible” predictions) regions have neither a viral hallmark gene nor enrichment in viral-like or non-Caudovirales genes, but display at least two of the other metrics, with at least one significance score greater than 4. Finally, if a predicted region spans more than 80% of the predicted genes on a contig, the entire contig is considered viral. A summary of VirSorter detection types is displayed in Fig. 1B.
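The merging and classification rules above can be illustrated with a minimal Python sketch. It is not taken from the VirSorter scripts: the metric labels, the `Region` class, and the function names are hypothetical, and the sketch reuses the best window score for each metric rather than rescoring the merged region, which the actual pipeline may do differently.

```python
from dataclasses import dataclass, field

# Illustrative metric labels (assumptions, not VirSorter's internal names).
PRIMARY = {"viral_like", "non_caudovirales"}  # gene-content enrichment
SECONDARY = {"pfam_depletion", "uncharacterized", "short_genes", "strand_switch"}


@dataclass
class Region:
    start: int  # index of the first gene (inclusive)
    end: int    # index of the last gene (inclusive)
    scores: dict = field(default_factory=dict)  # metric -> best significance score
    hallmark: bool = False  # True if a hallmark viral gene lies in the region


def merge_windows(windows):
    """Merge overlapping (start, end, metric, score) windows into regions."""
    regions = []
    for start, end, metric, score in sorted(windows):
        if regions and start <= regions[-1].end:
            # Overlap: extend the current region to cover this window.
            region = regions[-1]
            region.end = max(region.end, end)
        else:
            region = Region(start, end)
            regions.append(region)
        region.scores[metric] = max(score, region.scores.get(metric, 0.0))
    return regions


def classify(region):
    """Return category 1, 2 or 3, or None if no rule applies."""
    primary = any(metric in PRIMARY for metric in region.scores)
    secondary = [s for metric, s in region.scores.items() if metric in SECONDARY]
    if primary and region.hallmark:
        return 1
    if (primary or region.hallmark) and secondary:
        return 2
    if (not primary and not region.hallmark
            and len(secondary) >= 2 and max(secondary) > 4):
        return 3
    return None


def spans_whole_contig(region, n_genes, fraction=0.8):
    """True if the region covers more than `fraction` of the contig's genes."""
    return (region.end - region.start + 1) > fraction * n_genes


# Toy input: windows that already passed the score > 2 filter.
windows = [
    (0, 14, "viral_like", 3.1),
    (10, 29, "short_genes", 2.5),
    (50, 64, "pfam_depletion", 4.6),
    (55, 69, "strand_switch", 2.2),
]
regions = merge_windows(windows)
regions[0].hallmark = True  # pretend a hallmark gene was found in genes 0-29
categories = [classify(region) for region in regions]
```

In this toy input the first two windows overlap and merge into one region (genes 0 to 29) that carries a gene-content enrichment and a hallmark gene, while the last two merge into a region supported only by secondary metrics.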
Next, the higher-confidence predictions are used to refine the sequence space search. Specifically, sequences from all open reading frames in category 1 predictions that do not match a viral protein cluster are clustered and added to the reference database (RefSeqABVir or Viromes, depending on the initial user choice). This updated database is then used in another round of search by VirSorter. This refinement with category 1 sequences is repeated until no new genes are added to the database. The final VirSorter output is then provided to the user and includes the nucleotide sequences of all predicted viral sequences in FASTA files, an automatic annotation of each prediction in GenBank file format, and a summary table displaying, for each prediction, the associated category and the significance scores of all metrics. Because both the predictions and the underlying significance scores are provided, users can evaluate each prediction and apply custom thresholds on significance scores with a simple text-parsing script, even for large-scale datasets.
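The iterative refinement described above amounts to repeating the search until the database stops growing. The sketch below shows that control flow only. The `refine` function, the two callables it takes, the dictionary layout of a prediction, and the `max_rounds` safety cap are all assumptions made for illustration; they do not correspond to functions or options in the VirSorter scripts.

```python
def refine(contigs, database, search, cluster_unmatched, max_rounds=50):
    """Repeat the search until category 1 predictions add nothing new.

    search(contigs, database) -> list of prediction dicts (hypothetical).
    cluster_unmatched(category_1, database) -> set of new database entries
    built from category 1 genes matching no existing cluster (hypothetical).
    max_rounds is a safety cap added for this sketch.
    """
    database = set(database)
    for round_number in range(1, max_rounds + 1):
        predictions = search(contigs, database)
        category_1 = [p for p in predictions if p["category"] == 1]
        new_entries = cluster_unmatched(category_1, database) - database
        if not new_entries:
            # Converged: nothing was added, so these predictions are final.
            return predictions, database, round_number
        database |= new_entries
    raise RuntimeError("database still growing after max_rounds")


# Toy stand-ins: a contig is category 1 with two or more database genes,
# category 2 with exactly one, and not predicted otherwise.
def toy_search(contigs, database):
    predictions = []
    for name, genes in contigs.items():
        hits = sum(gene in database for gene in genes)
        if hits >= 2:
            predictions.append({"contig": name, "category": 1, "genes": genes})
        elif hits == 1:
            predictions.append({"contig": name, "category": 2, "genes": genes})
    return predictions


def toy_cluster_unmatched(category_1, database):
    return {g for p in category_1 for g in p["genes"] if g not in database}


contigs = {"c1": ["a", "b", "x"], "c2": ["x", "y", "b"], "c3": ["q", "r"]}
predictions, final_db, rounds = refine(
    contigs, {"a", "b"}, toy_search, toy_cluster_unmatched
)
```

In the toy data, contig `c2` starts as a category 2 prediction and is promoted only after gene `x` from `c1` enters the database, which is the effect the refinement step is meant to capture.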
VirSorter is available as an application (App) in the iPlant discovery environment (https://de.iplantcollaborative.org/de/) under Apps/Experimental/iVirus (see Fig. S1 for a step-by-step guide to the VirSorter App on iPlant). This application allows users to search any set of contigs for viral sequences using either the RefSeqABVir or the Viromes database. The reference values of VirSorter metrics are evaluated on the complete set of input sequences, so mixed datasets should be sorted (when possible) by type of bacteria or archaea to obtain the most accurate results. In addition to these reference databases, the VirSorter App on iPlant allows users to supply their own reference viral genome sequences, either already assembled or to be assembled with iPlant Apps before the VirSorter analysis. Assembled sequences are processed as follows: (i) genes are predicted with MetaGeneAnnotator (Noguchi, Taniguchi & Itoh, 2008), (ii) predicted proteins are clustered with sequences from the user-selected database (either RefSeqABVir or Viromes), and (iii) unclustered proteins are added to the “unclustered” pool. VirSorter scripts are also available through the GitHub repository https://github.com/simroux/VirSorter.git.