VirSorter: Comprehensive Viral Sequence Detection

Each metric is computed using sliding windows from 10 to 100 genes wide, starting at every gene along the sequence, and all significance scores greater than 2 are stored. Local maxima of the significance score are then identified, and the set of genes in each corresponding window is defined as a putative viral region. Overlapping predictions from the different metrics above are then merged (extending the regions to include all predicted windows), leading to a list of putative viral regions, each associated with one or more metrics. These regions are classified into three categories:
(i) category 1 (“most confident” predictions) regions have significant enrichment in viral-like genes or non-Caudovirales genes on the whole region and at least one hallmark viral gene detected;
(ii) category 2 (“likely” predictions) regions have either enrichment in viral-like or non-Caudovirales genes, or a viral hallmark gene detected, associated with at least one other metric (depletion in PFAM affiliation, enrichment in uncharacterized genes, enrichment in short genes, depletion in strand switching); and
(iii) category 3 (“possible” predictions) regions have neither a viral hallmark gene nor enrichment in viral-like or non-Caudovirales genes, but display at least two of the other metrics with at least one significance score greater than 4.
Finally, if a predicted region spans more than 80% of predicted genes on a contig, the entire contig is considered viral. A summary of VirSorter detection types is displayed in Fig. 1B.
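The merging and classification rules above can be illustrated with a minimal Python sketch. It is not VirSorter's own code: the class names, the metric labels, and the helper functions are hypothetical, the input windows are assumed to be the already-selected local maxima with scores greater than 2, and category 2 is read here as "(enrichment or hallmark gene) plus at least one other metric", which is one interpretation of the text.

```python
from dataclasses import dataclass, field

# Hypothetical metric labels; the real tool may name these differently.
PRIMARY = {"viral_like", "non_caudovirales"}
SECONDARY = {"pfam_depletion", "uncharacterized", "short_genes", "strand_switch"}


@dataclass
class Window:
    start: int    # index of the first gene (inclusive)
    end: int      # index of the last gene (inclusive)
    metric: str
    score: float  # significance score, assumed > 2


@dataclass
class Region:
    start: int
    end: int
    metrics: dict = field(default_factory=dict)  # metric -> best score


def merge_windows(windows):
    """Merge overlapping windows, extending regions to cover all of them."""
    regions = []
    for w in sorted(windows, key=lambda w: (w.start, w.end)):
        if regions and w.start <= regions[-1].end:
            region = regions[-1]
            region.end = max(region.end, w.end)
        else:
            region = Region(w.start, w.end)
            regions.append(region)
        best = region.metrics.get(w.metric, 0.0)
        region.metrics[w.metric] = max(best, w.score)
    return regions


def categorize(region, has_hallmark):
    """Return 1, 2 or 3, or None if the region fits no category."""
    primary = any(m in PRIMARY for m in region.metrics)
    secondary = [s for m, s in region.metrics.items() if m in SECONDARY]
    if primary and has_hallmark:
        return 1
    if (primary or has_hallmark) and secondary:
        return 2
    if not primary and not has_hallmark:
        if len(secondary) >= 2 and max(secondary) > 4:
            return 3
    return None


def spans_contig(region, n_genes, threshold=0.8):
    """True if the region covers more than 80% of the contig's genes."""
    return (region.end - region.start + 1) / n_genes > threshold


windows = [
    Window(0, 19, "viral_like", 5.0),
    Window(15, 40, "short_genes", 3.0),
    Window(60, 75, "pfam_depletion", 4.5),
    Window(70, 80, "strand_switch", 2.5),
]
regions = merge_windows(windows)
for region in regions:
    print(region.start, region.end, categorize(region, has_hallmark=False))
```

Sorting the windows by start position lets a single pass merge them, since any window that overlaps an existing region must overlap the most recently opened one.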
Next, higher-confidence predictions are used to refine the sequence space search. Specifically, sequences from all open reading frames from category 1 predictions that do not match a viral protein cluster are clustered and added to the reference database (RefSeqABVir or Viromes, depending on the initial user choice). This updated database is then used in another round of search by VirSorter. This iteration, in which category 1 sequences are used to refine the search, continues until no new genes are added to the database. The final VirSorter output is then provided to the user and includes nucleotide sequences of all predicted viral sequences in FASTA files, an automatic annotation of each prediction in GenBank file format, and a summary table displaying, for each prediction, the associated category and the significance scores of all metrics. Because both the predictions and the underlying significance scores are provided, users can evaluate each prediction and apply custom thresholds on significance scores through a simple text-parsing script, even for large-scale datasets.
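The iterative refinement described above is a loop that runs until the database stops growing. The sketch below shows only that control flow, under stated assumptions: `refine_database`, `run_search`, `cluster_new_proteins`, and the `max_rounds` safety limit are hypothetical names introduced for illustration, and the toy stand-ins at the bottom replace the real search and clustering steps, which the text does not specify at this level.

```python
def refine_database(contigs, database, run_search, cluster_new_proteins,
                    max_rounds=50):
    """Repeat the search until category 1 predictions add nothing new.

    run_search and cluster_new_proteins are caller-supplied callables
    (hypothetical stand-ins for the real search and clustering steps).
    """
    database = set(database)
    for round_number in range(1, max_rounds + 1):
        predictions = run_search(contigs, database)
        confident = [p for p in predictions if p["category"] == 1]
        new_entries = set(cluster_new_proteins(confident)) - database
        if not new_entries:
            # Fixed point reached: this round's output is final.
            return predictions, database, round_number
        database |= new_entries
    raise RuntimeError("database did not stabilize within max_rounds")


# Toy stand-ins: a contig is "category 1" if any of its genes is in the
# database, and every gene of a category 1 contig is a candidate entry.
def toy_search(contigs, database):
    return [
        {"id": name, "genes": genes,
         "category": 1 if any(g in database for g in genes) else 3}
        for name, genes in contigs.items()
    ]


def toy_cluster(confident):
    return {gene for p in confident for gene in p["genes"]}


contigs = {"c1": ["a", "b"], "c2": ["b", "c"], "c3": ["x"]}
predictions, final_db, rounds = refine_database(
    contigs, {"a"}, toy_search, toy_cluster
)
print(sorted(final_db), rounds)
```

In the toy run, contig c2 is only promoted to category 1 in the second round, after gene b has entered the database through c1, which is the kind of gain the iteration is meant to capture.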
VirSorter is available as an application (App) in the iPlant discovery environment (https://de.iplantcollaborative.org/de/) under Apps/Experimental/iVirus (see Fig. S1 for a step-by-step guide to the VirSorter App on iPlant). This application allows users to search any set of contigs for viral sequences using either the RefSeqABVir or the Viromes database. The reference values of VirSorter metrics are evaluated on the complete set of input sequences; mixed datasets should therefore be sorted (when possible) by type of bacteria or archaea to obtain the most accurate results. In addition to these reference databases, the VirSorter App on iPlant allows users to supply their own reference viral genome sequences, either already assembled or to be assembled with iPlant Apps prior to analysis with VirSorter. Assembled sequences are processed as follows: (i) genes are predicted with MetaGeneAnnotator (Noguchi, Taniguchi & Itoh, 2008), (ii) predicted proteins are clustered with sequences from the user-selected database (either RefSeqABVir or Viromes), and (iii) unclustered proteins are added to the “unclustered” pool. VirSorter scripts are also available through the GitHub repository https://github.com/simroux/VirSorter.git.

Roux S., Enault F., Hurwitz B.L., & Sullivan M.B. (2015). VirSorter: mining viral signal from microbial genomic data. PeerJ, 3, e985.

Publication 2015

Other organizations : University of Arizona, Centre National de la Recherche Scientifique, Laboratoire Microorganismes Génome et Environnement, Clermont Université, Institut Pascal

Protocol cited in 62 other protocols

Variable analysis

independent variables

Sliding window size (10 to 100 genes)
Starting position of the sliding window (every gene along the sequence)

dependent variables

Significance score (greater than 2)
Putative viral regions (local maxima of significance score)
Category of predicted viral regions (category 1, 2, or 3)

control variables

Not explicitly mentioned
