Like BLAST, both BLAT and SSAHA2 report all significant alignments or typically tens of top-scoring alignments, but this is not the most desired output in read mapping. We are typically more interested in the best alignment or best few alignments, covering each region of the query sequence. For example, suppose a 1000 bp query sequence consists of a 900 bp segment from one chromosome and a 100 bp segment from another chromosome; 400 bp out of the 900 bp segment is a highly repetitive sequence. For BLAST, to know this is a chimeric read we would need to ask it to report all the alignments of the 400 bp repeat, which is costly and wasteful because in general we are not interested in alignments of short repetitive sequences contained in a longer unique sequence. On this example, a useful output would be to report one alignment each for the 900 bp and the 100 bp segment, and to indicate if the two segments have good suboptimal alignments that may render the best alignment unreliable. Such output simplifies downstream analyses and saves time on reconstructing the detailed alignments of the repetitive sequence.
In BWA-SW, we say two alignments are distinct if the length of the overlapping region on the query is less than half of the length of the shorter query segment. We aim to find a set of distinct alignments which maximizes the sum of scores of each alignment in the set. This problem can be solved by dynamic programming, but as in our case a read is usually aligned entirely, a greedy approximation would work well. In the practical implementation, we sort the local alignments based on their alignment scores, scan the sorted list from the best one and keep an alignment if it is distinct from all the kept alignments with larger scores; if alignment a2 is rejected because it is not distinctive from a1, we regard a2 to be a suboptimal alignment to a1 and use this information to approximate the mapping quality (Section 2.7 ).
Because we only retain alignments largely non-overlapping on the query sequence, we might as well discard seeds that do not contribute to the final alignments. Detecting such seeds can be done with another heuristic before the Smith–Waterman extension and time spent on unnecessary extension can thus be saved. To identify these seeds, we chain seeds that are contained in a band (default band width 50 bp). If on the query sequence a short chain is fully contained in a long chain and the number of seeds in the short chain is below one-tenth of the number of seeds in the long chain, we discard all the seeds in the short chain, based on the observation that the short chain can rarely lead to a better alignment than the long chain in this case. Unlike the Z-best strategy, this heuristic does not have a noticeable effect on alignment accuracy. On 1000 10 kb simulated data, it halves the running time with no reduction in accuracy.
In BWA-SW, we say two alignments are distinct if the length of the overlapping region on the query is less than half of the length of the shorter query segment. We aim to find a set of distinct alignments which maximizes the sum of scores of each alignment in the set. This problem can be solved by dynamic programming, but as in our case a read is usually aligned entirely, a greedy approximation would work well. In the practical implementation, we sort the local alignments based on their alignment scores, scan the sorted list from the best one and keep an alignment if it is distinct from all the kept alignments with larger scores; if alignment a2 is rejected because it is not distinctive from a1, we regard a2 to be a suboptimal alignment to a1 and use this information to approximate the mapping quality (
Because we only retain alignments largely non-overlapping on the query sequence, we might as well discard seeds that do not contribute to the final alignments. Detecting such seeds can be done with another heuristic before the Smith–Waterman extension and time spent on unnecessary extension can thus be saved. To identify these seeds, we chain seeds that are contained in a band (default band width 50 bp). If on the query sequence a short chain is fully contained in a long chain and the number of seeds in the short chain is below one-tenth of the number of seeds in the long chain, we discard all the seeds in the short chain, based on the observation that the short chain can rarely lead to a better alignment than the long chain in this case. Unlike the Z-best strategy, this heuristic does not have a noticeable effect on alignment accuracy. On 1000 10 kb simulated data, it halves the running time with no reduction in accuracy.