Efficient Genome Preprocessing for Progressive Cactus

Progressive Cactus requires input genomes to be soft-masked, but repetitive sequence often goes unmasked because of poor masking or incomplete repeat libraries for newly sequenced species. Unmasked repeats can increase alignment runtime (alignments must be enumerated to and from every copy of a repetitive sequence) and reduce alignment quality. For this reason, we mask overabundant sequence before alignment, using a strategy based not on alignment to repeat consensus libraries but on over-representation of alignments. We first divide each genome into small, mutually overlapping chunks. We align each chunk to itself and to a configurable fraction of the other chunks, sampled at random (currently 20% of the total pool). To avoid combinatorial explosion due to unmasked repetitive sequence, we use a special mode of LASTZ that stops exploring alignments from any region once a maximum depth is reached (the flag --queryhsplimit=keep,nowarn:1500 stops at a high-scoring-pair depth of 1,500). We then soft-mask any region covered by more than a configurable number of these alignments (currently 50). Further details can be found in the src/cactus/preprocessor section of the Progressive Cactus codebase. Although the preprocessing step runs automatically as part of the pipeline, we also provide a cactus_preprocessor utility that runs only the preprocessor without producing a full genome alignment.
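The chunking, sampling, and coverage-based masking steps described above can be illustrated with a minimal Python sketch. This is not the Progressive Cactus implementation: the function names (chunk_genome, sample_partners, soft_mask_by_coverage), the chunk size and overlap values, and the representation of alignments as half-open (start, end) intervals are all assumptions made for illustration. The LASTZ alignment step itself is omitted, so the intervals are taken as given.

```python
import random


def chunk_genome(seq, chunk_size, overlap):
    """Split seq into mutually overlapping chunks (hypothetical helper).

    Returns half-open (start, end) coordinates; consecutive chunks
    share `overlap` bases.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not seq:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, len(seq))
        chunks.append((start, end))
        if end == len(seq):
            break
        start += step
    return chunks


def sample_partners(chunk_index, n_chunks, fraction=0.2, rng=None):
    """Pick a random fraction of the *other* chunks to align against.

    The 20% default mirrors the value quoted in the text; the rounding
    rule is an assumption.
    """
    rng = rng or random.Random()
    others = [i for i in range(n_chunks) if i != chunk_index]
    k = min(len(others), round(fraction * len(others)))
    return rng.sample(others, k)


def soft_mask_by_coverage(seq, intervals, max_coverage=50):
    """Lowercase every base covered by more than max_coverage intervals.

    `intervals` stands in for alignment hits on seq (assumed format:
    half-open (start, end) pairs). Coverage is computed with a
    difference array, then accumulated in a single pass.
    """
    diff = [0] * (len(seq) + 1)
    for start, end in intervals:
        start, end = max(0, start), min(len(seq), end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    masked = []
    depth = 0
    for i, base in enumerate(seq):
        depth += diff[i]
        masked.append(base.lower() if depth > max_coverage else base)
    return "".join(masked)
```

In this sketch a region is masked only when its coverage strictly exceeds the threshold, matching the "covered by more than" wording above. A real run would obtain the intervals from LASTZ output rather than constructing them by hand.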


Armstrong J., Hickey G., Diekhans M., Fiddes I.T., Novak A.M., Deran A., Fang Q., Xie D., Feng S., Stiller J., Genereux D., Johnson J., Marinescu V.D., Alföldi J., Harris R.S., Lindblad-Toh K., Haussler D., Karlsson E., Jarvis E.D., Zhang G., & Paten B. (2020). Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature, 587(7833), 246-251.

Publication 2020



Other organizations : University of California, Santa Cruz, University of Copenhagen, Broad Institute, Massachusetts Institute of Technology, Science for Life Laboratory, Uppsala University, Pennsylvania State University, Howard Hughes Medical Institute, Kunming Institute of Zoology, Chinese Academy of Sciences


Variable analysis

independent variables

Fraction of other randomly sampled chunks aligned against each chunk (configurable; currently 20% of the total pool)

dependent variables

Runtime of alignments
Quality of alignments

control variables

Maximum high-scoring-pair depth (set to 1,500 with the flag --queryhsplimit=keep,nowarn:1500)
Coverage threshold for soft-masking: the configurable number of alignments covering a region (currently 50)

