VSEARCH includes commands to perform de novo clustering using a greedy and heuristic centroid-based algorithm with an adjustable sequence similarity threshold specified with the id option (e.g., 0.97). The input sequences are either processed in the user supplied order (cluster_smallmem) or pre-sorted based on length (cluster_fast) or abundance (the new cluster_size option). Each input sequence is then used as a query in a search against an initially empty database of centroid sequences. The query sequence is clustered with the first centroid sequence found with similarity equal to or above the threshold. The search is performed using the heuristic approach described above which generally finds the most similar sequences first. If no matches are found, the query sequence becomes the centroid of a new cluster and is added to the database. If maxaccepts is higher than 1, several centroids with sufficient sequence similarity may be found and considered. By default, the query is clustered with the centroid presenting the highest sequence similarity (distance-based greedy clustering, DGC), or, if the sizeorder option is turned on, the centroid with the highest abundance (abundance-based greedy clustering, AGC) (He et al., 2015 (link); Westcott & Schloss, 2015 (link); Schloss, 2016 ). VSEARCH performs multi-threaded clustering by searching the database of centroid sequences with several query sequences in parallel. If there are any non-matching query sequences giving rise to new centroids, the required internal comparisons between the query sequences are subsequently performed to achieve correct results. For each cluster, VSEARCH can create a simple multiple sequence alignment using the center star method (Gusfield, 1993 (link)) with the centroid as the center sequence, and then compute a consensus sequence and a sequence profile.
Free full text: Click here