In addition to single-nucleotide variants and small indels, strain variation often manifests as highly diverged regions, insertions of mobile elements, rearrangements, large deletions, and horizontal gene transfer. The green edges in the assembly graph shown in Figure 3 result from an additional copy of a mobile element in a rare strain2 (compared with the abundant strain1), while the blue edge corresponds to a horizontally transferred gene (or a highly diverged genomic region) in a rare strain3 (compared with the abundant strain1). Such edges fragment contigs corresponding to the abundant strain1; for example, the green edges in Figure 3 (bottom right) break the edge c into three shorter edges. We note that the edges in the assembly graph are condensed; that is, they represent nonbranching paths formed by k-mers.
We refer to edges originating from rare strain variants within the assembly graph of a strain mixture as filigree edges. Traditional genome assemblers use a global threshold on read coverage to remove the low-coverage edges (that typically result from sequencing errors) from the assembly graph during the graph simplification step. However, this approach does not work well for metagenomic assemblies, since there is no global threshold that (1) removes edges corresponding to rare strains and (2) preserves edges corresponding to rare species. Similarly to IDBA-UD and MEGAHIT, metaSPAdes analyzes the coverage ratios between adjacent edges in the assembly graph, classifying edges with low-coverage ratios as potential filigree edges.
We denote the coverage of an edge e in the assembly graph as cov(e) and define the coverage cov(v) of a vertex v as the maximum of cov(e) over all edges e incident to v. Given an edge e incident to a vertex v and a threshold ratio (the default value is 10), a vertex v predominates an edge e if its coverage is significantly higher than the coverage of the edge e; that is, if ratio · cov(e) < cov(v). An edge (v,w) is weak if it is predominated by either v or w. Note that filigree edges are often classified as weak since their coverage is much lower than the coverage of adjacent edges resulting from abundant strains.
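The definitions above can be illustrated with a minimal sketch. It assumes a toy representation in which the assembly graph is a list of (v, w, cov) tuples; the function names `vertex_coverage` and `weak_edges` are hypothetical and are not taken from the metaSPAdes code base.

```python
def vertex_coverage(edges):
    """cov(v): the maximum cov(e) over all edges e incident to v."""
    cov = {}
    for v, w, c in edges:
        cov[v] = max(cov.get(v, 0.0), c)
        cov[w] = max(cov.get(w, 0.0), c)
    return cov


def weak_edges(edges, ratio=10.0):
    """Map the index of each weak edge to the vertices that predominate it.

    A vertex u predominates an edge e if ratio * cov(e) < cov(u).
    An edge (v, w) is weak if it is predominated by v or by w.
    """
    cov_v = vertex_coverage(edges)
    weak = {}
    for i, (v, w, c) in enumerate(edges):
        # dict.fromkeys keeps order and collapses v == w (a self-loop).
        endpoints = list(dict.fromkeys((v, w)))
        dominating = [u for u in endpoints if ratio * c < cov_v[u]]
        if dominating:
            weak[i] = dominating
    return weak
```

For example, with the edges `[("a", "b", 100.0), ("b", "c", 95.0), ("b", "x", 4.0), ("x", "c", 4.0)]`, the two edges of coverage 4 are weak: the first is predominated by b (10 · 4 < 100) and the second by c (10 · 4 < 95), while neither is predominated by x, whose coverage is only 4.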
metaSPAdes disconnects all weak edges from their predominating vertices in the assembly graph. Disconnection of a weak edge (v,w) in the assembly graph from its starting vertex v (ending vertex w) is simply a removal of its first (last) k-mer rather than removal of the entire condensed edge. We emphasize that, in contrast to IDBA-UD and MEGAHIT, we disconnect rather than remove weak edges in the assembly graph since our goal is to preserve the information about rare strains whenever possible, that is, when it does not lead to a deterioration of the consensus backbone.
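The disconnection step can be sketched as follows, under the assumption that a condensed edge is stored as its nucleotide sequence, so that dropping the first (last) k-mer amounts to trimming one nucleotide from the start (end) of that sequence. The `Edge` class, the `disconnect` function, and the `tip…` vertex names are hypothetical illustrations, not the metaSPAdes data structures.

```python
from dataclasses import dataclass, replace
from itertools import count

_fresh_ids = count()  # source of names for newly created (hypothetical) vertices


@dataclass(frozen=True)
class Edge:
    start: str  # starting vertex
    end: str    # ending vertex
    seq: str    # sequence spelled by the nonbranching path of k-mers


def kmers(seq, k):
    """All k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def disconnect(edge, k, from_start):
    """Detach the edge from one endpoint by dropping its first or last k-mer.

    The rest of the condensed edge is kept, so the rare-strain sequence
    survives; only its attachment to the predominating vertex is lost.
    """
    if len(edge.seq) <= k:
        # Choice made for this sketch only: refuse to trim a one-k-mer edge.
        raise ValueError("edge holds a single k-mer; trimming would delete it")
    new_vertex = f"tip{next(_fresh_ids)}"
    if from_start:
        return replace(edge, start=new_vertex, seq=edge.seq[1:])
    return replace(edge, end=new_vertex, seq=edge.seq[:-1])
```

For instance, disconnecting `Edge("v", "w", "ACGTAC")` from its starting vertex with k = 3 leaves the sequence `CGTAC`, still ending at w but no longer attached to v.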