VSEARCH: Efficient Sequence Clustering

VSEARCH includes commands for de novo clustering based on a greedy, heuristic, centroid-based algorithm. The sequence similarity threshold is adjustable and is specified with the id option (e.g., 0.97). Input sequences are either processed in the user-supplied order (cluster_smallmem) or pre-sorted by length (cluster_fast) or by abundance (the new cluster_size option). Each input sequence is then used as a query in a search against an initially empty database of centroid sequences. The query is clustered with the first centroid found whose similarity is equal to or above the threshold. The search uses VSEARCH's heuristic approach (described earlier in the original paper), which generally finds the most similar sequences first. If no match is found, the query becomes the centroid of a new cluster and is added to the database.

If maxaccepts is higher than 1, several centroids with sufficient sequence similarity may be found and considered. By default, the query is clustered with the centroid having the highest sequence similarity (distance-based greedy clustering, DGC) or, if the sizeorder option is turned on, with the centroid having the highest abundance (abundance-based greedy clustering, AGC) (He et al., 2015; Westcott & Schloss, 2015; Schloss, 2016).

VSEARCH performs multi-threaded clustering by searching the database of centroid sequences with several query sequences in parallel. If any non-matching query sequences give rise to new centroids, the required comparisons among those query sequences are performed afterwards so that the results are correct. For each cluster, VSEARCH can create a simple multiple sequence alignment using the center star method (Gusfield, 1993), with the centroid as the center sequence, and then compute a consensus sequence and a sequence profile.
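The greedy procedure described here can be illustrated with a minimal Python sketch. It is not VSEARCH's implementation: the `identity` function is a toy ungapped comparison standing in for alignment-based similarity, the exhaustive scan replaces the heuristic search, and the function names, record layout, and example sequences are all hypothetical. It also assumes that "highest abundance" refers to the abundance of the centroid sequence itself.

```python
def identity(a, b):
    """Toy ungapped identity: matching positions over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold=0.97, maxaccepts=1, sizeorder=False):
    """Map each (label, sequence, abundance) record to a centroid label.

    Records are processed in the order given, so any pre-sorting by
    length or abundance is left to the caller.
    """
    centroids = []   # (label, sequence, abundance), in creation order
    assignment = {}  # record label -> centroid label
    for label, seq, abundance in seqs:
        # Assumption: an exhaustive scan ranked most-similar-first stands in
        # for the heuristic search, which only approximates that ordering.
        ranked = sorted(
            ((identity(seq, c[1]), i) for i, c in enumerate(centroids)),
            key=lambda hit: (-hit[0], hit[1]),
        )
        accepted = [hit for hit in ranked if hit[0] >= threshold][:maxaccepts]
        if not accepted:
            # No centroid is similar enough: the query founds a new cluster.
            centroids.append((label, seq, abundance))
            assignment[label] = label
            continue
        if sizeorder:
            # AGC: pick the most abundant centroid among the accepted hits.
            _, best = max(accepted, key=lambda hit: centroids[hit[1]][2])
        else:
            # DGC: accepted hits are already ordered by similarity.
            _, best = accepted[0]
        assignment[label] = centroids[best][0]
    return assignment


records = [
    ("A", "A" * 20, 10),
    ("B", "A" * 15 + "C" * 5, 50),
    ("Q", "A" * 18 + "C" * 2, 1),
]
dgc = greedy_cluster(records, threshold=0.8, maxaccepts=2)
agc = greedy_cluster(records, threshold=0.8, maxaccepts=2, sizeorder=True)
print(dgc["Q"], agc["Q"])
```

In this toy input, Q is within the threshold of both centroids, so the choice between DGC and AGC changes its cluster. With maxaccepts left at 1, only the first accepted centroid is considered and sizeorder has no effect.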

Rognes T., Flouri T., Nichols B., Quince C., & Mahé F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584.

Publication 2016

Consensus sequence Sequence alignment

Other organizations : University of Oslo, Oslo University Hospital, Heidelberg Institute for Theoretical Studies, Karlsruhe Institute of Technology, University of Glasgow, University of Warwick, Laboratoire des Symbioses Tropicales et Méditerranéennes, University of Kaiserslautern, Centre de Coopération Internationale en Recherche Agronomique pour le Développement

Protocol cited in 1 259 other protocols

Variable analysis

independent variables

Sequence similarity threshold specified with the id option
The choice between processing input sequences in the user supplied order (cluster_smallmem), pre-sorting based on length (cluster_fast), or pre-sorting based on abundance (cluster_size)
The value of maxaccepts, which determines whether several centroids with sufficient sequence similarity may be considered
The choice of using distance-based greedy clustering (DGC) or abundance-based greedy clustering (AGC)
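As a hedged illustration, the independent variables above map onto a VSEARCH command line roughly as follows. The helper function and the file names are hypothetical. The double-dash option spelling follows the VSEARCH command-line convention, and the `--centroids` and `--uc` output options are not mentioned in the text above. The sketch only assembles the argument list and does not run VSEARCH.

```python
def build_vsearch_cluster_cmd(
    fasta,
    mode="cluster_size",
    identity=0.97,
    maxaccepts=1,
    sizeorder=False,
    threads=1,
    centroids_out="centroids.fasta",  # hypothetical output file name
    uc_out="clusters.uc",             # hypothetical output file name
):
    """Assemble an argument list for a VSEARCH clustering run."""
    modes = {"cluster_fast", "cluster_size", "cluster_smallmem"}
    if mode not in modes:
        raise ValueError(f"mode must be one of {sorted(modes)}")
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must lie between 0.0 and 1.0")
    cmd = [
        "vsearch", f"--{mode}", fasta,
        "--id", str(identity),
        "--maxaccepts", str(maxaccepts),
        "--threads", str(threads),
        "--centroids", centroids_out,
        "--uc", uc_out,
    ]
    if sizeorder:
        # Selects abundance-based greedy clustering (AGC); per the text,
        # this only matters when maxaccepts is higher than 1.
        cmd.append("--sizeorder")
    return cmd


cmd = build_vsearch_cluster_cmd("reads.fasta", maxaccepts=4, sizeorder=True)
print(" ".join(cmd))
# To execute, the list could be passed to subprocess.run(cmd, check=True).
```

Building the list separately from running it makes the chosen settings easy to log and compare between clustering runs.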

dependent variables

The assignment of input sequences to clusters based on sequence similarity

control variables

The heuristic centroid-based algorithm used for de novo clustering
The use of multi-threaded clustering to search the database of centroid sequences with several query sequences in parallel
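The parallel search mentioned above can be sketched in Python under several assumptions. Queries are handled in batches, each batch is searched in parallel against a snapshot of the existing centroids, and queries without a match are then compared one after another against the centroids created earlier in the same batch. The batching scheme, function names, and toy identity measure are hypothetical and are not taken from VSEARCH's source code. The threads only illustrate the structure, since Python threads do not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def identity(a, b):
    """Toy ungapped identity: matching positions over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def best_centroid(query, centroids, threshold):
    """Index of the most similar centroid at or above threshold, else None."""
    scored = [(identity(query, c), -i) for i, c in enumerate(centroids)]
    if not scored:
        return None
    score, neg_index = max(scored)  # ties go to the earliest centroid
    return -neg_index if score >= threshold else None


def cluster_in_batches(seqs, threshold, batch_size=4, workers=4):
    """Return (centroids, assignment) with one centroid index per sequence."""
    centroids = []
    assignment = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(seqs), batch_size):
            batch = seqs[start:start + batch_size]
            snapshot = list(centroids)
            # Parallel step: every query in the batch searches the snapshot.
            hits = list(
                pool.map(lambda q: best_centroid(q, snapshot, threshold), batch)
            )
            # Sequential step: non-matching queries are compared with the
            # centroids that earlier queries of this batch have created.
            for query, hit in zip(batch, hits):
                if hit is None:
                    new_centroids = centroids[len(snapshot):]
                    local = best_centroid(query, new_centroids, threshold)
                    if local is None:
                        centroids.append(query)
                        hit = len(centroids) - 1
                    else:
                        hit = len(snapshot) + local
                assignment.append(hit)
    return centroids, assignment


reads = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "CCCCCCCCCA"]
centroids, assignment = cluster_in_batches(reads, threshold=0.9)
print(centroids, assignment)
```

Without the sequential step, the second and fourth reads in this example would each found their own cluster, because the snapshot they searched was still empty.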
