Genomic differences between related strains often result in “bulges” and “tips” in the de Bruijn graph that resemble the artifacts caused by sequencing errors in genome assembly (Pevzner et al. 2004; Zerbino and Birney 2008). For example, a sequencing error often results in a bulge formed by two short alternative paths between the same vertices in the de Bruijn graph: a “correct” path with high coverage and an “erroneous” path with low coverage. Similarly, a substitution or a small indel in a rare strain (compared with an abundant strain) often results in a bulge formed by a high-coverage path corresponding to the abundant strain and an alternative low-coverage path corresponding to the rare strain.
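The coverage asymmetry described above can be illustrated with a toy example. The following is a minimal sketch, not the metaSPAdes implementation: the sequences, the 9:1 strain ratio, the value of k, and the helper names (`kmer_coverage`, `branch_kmers`, `mean_coverage`) are all assumptions made for illustration. Here k-mers play the role of edges and (k-1)-mers the role of vertices.

```python
from collections import Counter

def kmers(seq, k):
    """Return the list of k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_coverage(reads, k):
    """Count occurrences of each k-mer (an edge of the de Bruijn graph)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

def branch_kmers(seq_a, seq_b, k):
    """k-mers private to each sequence: the two branches of the bulge."""
    a, b = set(kmers(seq_a, k)), set(kmers(seq_b, k))
    return a - b, b - a

def mean_coverage(branch, counts):
    """Average coverage over the k-mers of one branch."""
    return sum(counts[km] for km in branch) / len(branch)

# Hypothetical strains differing by one substitution (G -> C at index 7).
abundant = "ACGTACGGATCCATG"
rare     = "ACGTACGCATCCATG"

# Hypothetical error-free reads: 9 from the abundant strain, 1 from the rare one.
reads = [abundant] * 9 + [rare] * 1
k = 4

counts = kmer_coverage(reads, k)
abundant_branch, rare_branch = branch_kmers(abundant, rare, k)

# Both branches leave vertex "ACG" and rejoin at vertex "ATC".
high = mean_coverage(abundant_branch, counts)
low = mean_coverage(rare_branch, counts)
print(sorted(abundant_branch), high)
print(sorted(rare_branch), low)
```

In this toy setting the rare-strain branch is indistinguishable, by coverage alone, from the branch a sequencing error would produce at the same position, which is why the same masking procedures can be applied to both.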
Aiming at the consensus assembly of a strain mixture, metaSPAdes masks the majority of strain differences using a modification of the SPAdes procedures for masking sequencing errors (the algorithms for removal of tips, “simple” bulges [Bankevich et al. 2012], and “complex” bulges [Nurk et al. 2013]). metaSPAdes uses more aggressive settings than the ones used in assemblies of isolates; for example, it collapses larger bulges and removes longer tips than SPAdes. We note that the bulge projection approach in SPAdes improves on the originally proposed bulge removal approach (Pevzner et al. 2004; Zerbino and Birney 2008) used in most existing assemblers, because it stores valuable information about the processed bulges (see “Bulge Projection Approach” in the Supplemental Material). This feature is important for the repeat resolution approach in metaSPAdes described below.
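The difference between removing a bulge and projecting it can be sketched as follows. This is a simplified illustration under stated assumptions, not the SPAdes or metaSPAdes procedure (which is described in the Supplemental Material): the `Path` type, the `project_bulge` function, the edge labels, and both threshold sets are hypothetical and do not reproduce the actual parameter values of either tool.

```python
from dataclasses import dataclass

@dataclass
class Path:
    edges: tuple      # edge identifiers along the path
    length: int       # path length in nucleotides
    coverage: float   # mean coverage of the path

def project_bulge(p, q, max_length, max_coverage_ratio, projections):
    """Collapse the weaker of two alternative paths onto the stronger one.

    Returns the retained path, or None if the bulge is left untouched.
    Unlike plain removal, each collapsed edge is recorded in `projections`
    so that later stages can still look up where it was mapped.
    """
    weak, strong = sorted((p, q), key=lambda path: path.coverage)
    if weak.length > max_length:
        return None  # bulge too long for the current settings
    if weak.coverage > max_coverage_ratio * strong.coverage:
        return None  # alternative path too well supported to mask
    for edge in weak.edges:
        projections[edge] = strong.edges
    return strong

# Hypothetical bulge: an abundant-strain path and a rare-strain path.
abundant_path = Path(edges=("e1", "e2"), length=120, coverage=40.0)
rare_path = Path(edges=("e3",), length=120, coverage=3.0)

# Hypothetical thresholds; the "metagenome" set is simply more permissive.
isolate_settings = dict(max_length=50, max_coverage_ratio=0.2)
metagenome_settings = dict(max_length=200, max_coverage_ratio=0.2)

isolate_projections = {}
kept_isolate = project_bulge(abundant_path, rare_path,
                             projections=isolate_projections,
                             **isolate_settings)

meta_projections = {}
kept_meta = project_bulge(abundant_path, rare_path,
                          projections=meta_projections,
                          **metagenome_settings)

print(kept_isolate, isolate_projections)
print(kept_meta, meta_projections)
```

With these illustrative values the 120-nucleotide bulge is left in place under the stricter settings and collapsed under the more permissive ones, and in the latter case the mapping from the rare-strain edge to the retained path is preserved rather than discarded.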