The CHT’s modest memory requirements, and the additional savings yielded by minimizer-based subsampling, allow more reference genomic data to be included in Kraken 2’s standard reference library. Whereas Kraken 1’s default database had data from archaeal, bacterial, and viral genomes, Kraken 2’s default database additionally includes the GRCh38 assembly of the human genome [29] and the “UniVec_Core” subset of the UniVec database [30]. We include these in Kraken 2’s default database to allow for easier classification of human microbiome reads and more accurate classification of reads containing vector sequences.
Additionally, we have implemented masking of low-complexity regions in reference sequences in Kraken 2, using the “dustmasker” [31] (for nucleotide sequences) and “segmasker” [32] (for protein sequences) tools from NCBI. Using the tools’ default settings, nucleotide and protein sequences are checked for low-complexity regions, and the regions identified are masked and not processed further by the Kraken 2 database building process. In this manner, we seek to reduce false positives resulting from these low-complexity sequences, similar to the build process for Centrifuge [1].
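To illustrate how masking removes low-complexity regions from downstream processing, the following minimal Python sketch enumerates only those k-mers that contain no masked base. It is not Kraken 2's implementation: the function name `unmasked_kmers` is hypothetical, and the sketch assumes that masked bases are marked as any character other than uppercase A, C, G, or T (for example, lowercase soft-masking), which is an assumption made here rather than a detail stated above.

```python
from typing import Iterator, Tuple

UNMASKED_BASES = frozenset("ACGT")  # assumption: everything else counts as masked


def unmasked_kmers(seq: str, k: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, k-mer) for every k-mer made only of unmasked bases.

    Hypothetical helper for illustration; masked characters (assumed to be
    anything outside uppercase A/C/G/T) break the sequence into runs, and
    only k-mers lying entirely inside one run are produced.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    run_length = 0  # number of consecutive unmasked bases ending at position i
    for i, base in enumerate(seq):
        if base in UNMASKED_BASES:
            run_length += 1
        else:
            run_length = 0  # a masked base ends the current run
        if run_length >= k:
            start = i - k + 1
            yield start, seq[start:i + 1]


if __name__ == "__main__":
    # The lowercase stretch stands in for a masked low-complexity repeat.
    for start, kmer in unmasked_kmers("ACGTacacacacGGCA", 3):
        print(start, kmer)
```

In this sketch, no k-mer that overlaps the masked stretch is emitted, so such a region would contribute nothing to a database built from the output. This mirrors the stated goal of reducing false positives caused by low-complexity sequence.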