The total number and frequency of kmers (k = 17) were counted in unassembled Illumina 180 bp insert paired-end sequence data for white clover and its progenitors using Jellyfish v2.2.0 (Marçais and Kingsford, 2011) with default parameters. The resulting frequency distribution was plotted to determine the peak depth, that is, the depth at which the largest number of distinct 17-mers occurred, and genome size was estimated as the total number of 17-mers divided by the peak depth.
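The genome size calculation can be illustrated with a minimal Python sketch. It assumes a two-column histogram of the kind Jellyfish can produce (depth, then the number of distinct kmers at that depth). The function names and the `min_depth` cutoff, which skips the low-depth error peak when locating the main peak, are hypothetical and are not taken from the original analysis.

```python
from typing import Iterable, List, Tuple


def parse_histogram(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse whitespace-separated 'depth count' rows into (depth, count) pairs."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        depth, count = line.split()[:2]
        rows.append((int(depth), int(count)))
    return rows


def estimate_genome_size(rows: List[Tuple[int, int]], min_depth: int = 5) -> Tuple[int, int]:
    """Return (estimated genome size, peak depth).

    Genome size = total number of kmers / peak depth.
    min_depth is an illustrative assumption: rows below it are ignored
    when searching for the peak, because sequencing errors typically
    produce a large spike of kmers at very low depth.
    """
    # Total kmer occurrences: each distinct kmer contributes its depth.
    total_kmers = sum(depth * count for depth, count in rows)
    candidates = [(d, c) for d, c in rows if d >= min_depth]
    if not candidates:
        raise ValueError("no histogram rows at or above min_depth")
    # Peak depth: the depth shared by the largest number of distinct kmers.
    peak_depth = max(candidates, key=lambda row: row[1])[0]
    return round(total_kmers / peak_depth), peak_depth


if __name__ == "__main__":
    toy = ["1 1000", "2 100", "10 50", "20 200", "30 40"]
    size, peak = estimate_genome_size(parse_histogram(toy))
    print(f"peak depth = {peak}, estimated genome size = {size} bp")
```

Whether error-derived kmers should be excluded from the total is a separate choice; this sketch keeps them, following the plain "total divided by peak depth" description.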
Kmer abundance graphs for all paired-end libraries were drawn with the KmerGenie v1.7051 software package (Chikhi and Medvedev, 2013; Crusoe et al., 2015). After examination of the resulting graphs, two libraries (DRX016491 and DRX028980) were selected from the whole genome sequence (WGS) sets for assembly, as these two libraries showed the cleanest distributions of kmer abundance and provided sufficient coverage for assembly.
The khmer package (Crusoe et al., 2015) was then used for in silico digital normalization of WGS reads based on kmer abundance. We employed a workflow different from those recommended by the software authors. The general pipeline described by the authors removes both high-coverage and low-coverage kmers, which can lead to under-representation of repeat sequences in the final assembly. Although de Bruijn graph assemblers tend to collapse repeats into high-coverage contigs, many of these repeats can be properly resolved. Thus, khmer was used only to filter out reads with low kmer coverage, in order to reduce noise. The normalize-by-median script was used to create a hash of 31 bp kmer abundances in the paired-end and single-end Illumina WGS libraries, and this hash was subsequently used with the filter-abund script to exclude reads with a median kmer coverage of two or less. This adapted method reduces the complexity of the assembly graph without reducing the representation of high-coverage sequences, such as those from transposable elements, duplicated genomic regions, or closely related paralogs. All scripts and non-default parameters are available in the GitHub repository described above.
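The low-abundance filtering step can be sketched conceptually in plain Python. This is not the khmer implementation and does not use the khmer API. It counts kmers exactly in memory, ignores reverse complements, and uses hypothetical function names. Only the kmer size (31) and the cutoff (median coverage of two or less is excluded) come from the text above.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping kmers of length k in seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Build a table of kmer abundances across all reads.

    Simplification: exact counts, forward strand only.
    """
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def filter_low_abundance(reads: List[str], k: int = 31, cutoff: int = 2) -> List[str]:
    """Keep reads whose median kmer abundance is greater than cutoff.

    Reads with a median of cutoff or less are excluded. No upper bound is
    applied, so high-coverage (for example, repeat-derived) reads are kept.
    Reads shorter than k have no kmers and are dropped.
    """
    counts = count_kmers(reads, k)
    kept = []
    for read in reads:
        abundances = [counts[km] for km in kmers(read, k)]
        if abundances and median(abundances) > cutoff:
            kept.append(read)
    return kept


if __name__ == "__main__":
    # Toy data with k=3: three copies of one read and a single rare read.
    toy_reads = ["ACGTAC", "ACGTAC", "ACGTAC", "TTTTGG"]
    print(filter_low_abundance(toy_reads, k=3, cutoff=2))
```

Because only a lower cutoff is applied, the sketch mirrors the intent described above: rare, likely erroneous reads are removed while abundant sequences pass through unchanged.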