Progressive Cactus requires input genomes to be soft-masked, but repetitive sequence often goes unmasked because of poor masking or incomplete repeat libraries for newly sequenced species. Unmasked repeats increase alignment runtime (alignments must be enumerated to and from every copy of a repetitive sequence) and can degrade alignment quality. For this reason, we mask overabundant sequence before alignment, using a strategy based not on alignment to repeat consensus libraries but on over-representation of alignments. We first divide each genome into small, mutually overlapping chunks. We then align each chunk to itself and to a configurable number of other randomly sampled chunks (currently 20% of the total pool). To avoid combinatorial explosion due to unmasked repetitive sequence, we use a special mode of LASTZ that stops exploring alignments from a region early once a maximum depth is reached (the flag --queryhsplimit=keep,nowarn:1500 stops after a high-scoring-pair depth of 1,500). Finally, we soft-mask any region covered by more than a configurable number of these alignments (currently 50). Further details can be found in the src/cactus/preprocessor section of the Progressive Cactus codebase. Although the preprocessing step runs automatically as part of the pipeline, we also provide a cactus_preprocessor utility that runs only the preprocessor without producing a full genome alignment.
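The chunking and coverage-based masking steps can be illustrated with a minimal Python sketch. This is not the Cactus implementation: the function names, the chunk size and overlap parameters, and the representation of alignments as half-open (start, end) intervals in chunk coordinates are all assumptions made for illustration, and the LASTZ alignment step itself is omitted. Only the masking threshold of 50 is taken from the text.

```python
from typing import List, Tuple


def overlapping_chunks(seq: str, chunk_size: int, overlap: int) -> List[Tuple[int, str]]:
    """Split seq into chunks of chunk_size that overlap by `overlap` bases.

    Returns (offset, chunk) pairs. Both parameters are hypothetical;
    the text does not state the values used by the preprocessor.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, seq[start:start + chunk_size]))
        if start + chunk_size >= len(seq):
            break  # this chunk already reaches the end of the sequence
        start += step
    return chunks


def soft_mask_by_coverage(seq: str,
                          alignments: List[Tuple[int, int]],
                          max_depth: int = 50) -> str:
    """Lowercase every base covered by more than max_depth alignments.

    alignments are assumed to be half-open (start, end) intervals on seq.
    """
    n = len(seq)
    # Difference array: +1 where an alignment starts, -1 just past its end.
    delta = [0] * (n + 1)
    for start, end in alignments:
        start, end = max(0, start), min(n, end)
        if start < end:
            delta[start] += 1
            delta[end] -= 1
    masked = []
    depth = 0
    for i, base in enumerate(seq):
        depth += delta[i]  # running sum gives the coverage depth at base i
        masked.append(base.lower() if depth > max_depth else base)
    return "".join(masked)
```

For example, soft_mask_by_coverage("ACGTACGT", [(2, 6)] * 3, max_depth=2) lowercases only positions 2 to 5, because those bases are covered by three alignments while the threshold is two. The difference-array approach keeps the cost linear in the sequence length plus the number of alignments, which matters when a repetitive region attracts many hits.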