Optimized Illumina Data Curation and Genome Estimation

Previously generated Illumina single-end, paired-end, and mate-pair data (Hirakawa et al., 2016; Supplementary Table S1; Figure 1) were assessed with FastQC v0.10.1 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/; Andrews, 2010) to obtain sequencing quality metrics, which informed the subsequent data curation steps. Data were then trimmed to remove adapters. For data filtering we used Skewer (Jiang et al., 2014), a read trimming and filtering tool designed specifically for Illumina data. All Illumina short-read datasets were filtered as follows: reads with a mean quality lower than 20 were discarded, read ends were trimmed to an end quality of 30, and reads shorter than 54 bp were discarded.
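The three filtering criteria can be illustrated with a minimal Python sketch. This is not Skewer's implementation: the function names are hypothetical, Phred+33 quality encoding is assumed, trimming is applied to the 3' end only, and the steps run in the order in which they are listed above, which may differ from the order Skewer uses internally.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, quals, end_quality=30):
    """Drop bases from the 3' end until the last base reaches end_quality."""
    end = len(quals)
    while end > 0 and quals[end - 1] < end_quality:
        end -= 1
    return seq[:end], quals[:end]


def filter_read(seq, quality_string, min_mean=20, end_quality=30, min_length=54):
    """Return the trimmed (sequence, scores) pair, or None if the read is discarded.

    Assumed order (illustrative only): mean-quality check, end trimming,
    then the minimum-length check.
    """
    quals = phred_scores(quality_string)
    if not quals or sum(quals) / len(quals) < min_mean:
        return None  # mean quality lower than 20
    seq, quals = trim_3prime(seq, quals, end_quality)
    if len(seq) < min_length:
        return None  # shorter than 54 bp after trimming
    return seq, quals
```

For example, a 70 bp read whose last 10 bases have Phred score 2 is trimmed to 60 bp and kept, whereas a 60 bp read with the same low-quality tail is trimmed to 50 bp and discarded.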
The total number and frequency of 17-mers were counted in unassembled Illumina 180 bp insert paired-end sequence data for white clover and its progenitors using Jellyfish v2.2.0 (Marçais and Kingsford, 2011) with default parameters. The 17-mer frequency distribution was plotted to determine the peak depth, and genome size was estimated as the total number of 17-mers divided by the peak depth.
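The genome size calculation can be sketched in Python as follows, assuming the 17-mer histogram is available as two-column "depth count" text. The function names and the `min_peak_depth` cutoff, which keeps the low-depth error kmers from being picked as the peak, are assumptions made for this illustration and are not taken from the original analysis.

```python
def parse_histogram(text):
    """Parse 'depth count' lines into a list of (depth, count) tuples."""
    histo = []
    for line in text.strip().splitlines():
        depth, count = line.split()
        histo.append((int(depth), int(count)))
    return histo


def estimate_genome_size(histo, min_peak_depth=3):
    """Estimate genome size as total kmer number divided by peak depth.

    min_peak_depth is a hypothetical cutoff: depths below it are ignored
    when searching for the peak, but still count towards the total.
    """
    total_kmers = sum(depth * count for depth, count in histo)
    candidates = [(count, depth) for depth, count in histo if depth >= min_peak_depth]
    if not candidates:
        raise ValueError("no histogram entries at or above min_peak_depth")
    _, peak_depth = max(candidates)  # depth with the most distinct kmers
    return total_kmers / peak_depth, peak_depth


# Toy histogram, not real data: the peak is at depth 10.
toy = parse_histogram("1 100\n2 20\n3 5\n9 10\n10 50\n11 10")
size, peak = estimate_genome_size(toy)
```

On the toy histogram the total is 855 kmers and the peak depth is 10, giving an estimate of 85.5.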
Kmer abundance graphs for all paired-end libraries were drawn with the KmerGenie v1.7051 software package (Chikhi and Medvedev, 2013; Crusoe et al., 2015). After examination of the resulting graphs, two libraries (DRX016491 and DRX028980) were selected from the whole genome sequence (WGS) sets for assembly, as these two libraries showed the cleanest distributions of kmer abundances and provided sufficient coverage for assembly.
The khmer package (Crusoe et al., 2015) was then used for in silico digital normalization of WGS reads based on kmer abundance. We employed a different workflow from the ones recommended by the software authors. The general pipeline described by the authors removes high-coverage kmers as well as low-coverage kmers, which can lead to under-representation of repeat sequences in the final assembly. Although de Bruijn graph assemblers tend to collapse repeats into high-coverage contigs, many of these repeats can be properly resolved. Thus, khmer was used only to remove reads with low kmer coverage, in order to reduce noise. The normalize-by-median script was used to create a hash of 31 bp kmer abundances in the paired-end and single-end Illumina WGS libraries, and this hash was subsequently used with the filter-abund script to exclude reads with a median kmer coverage of two or less. This adapted method reduces the complexity of the assembly graph without reducing the representation of high-coverage sequences, such as those from transposable elements, duplicated genomic regions, or closely related paralogs. All scripts and parameters used beyond default settings are located in the GitHub repository described above.
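The low-coverage filter can be illustrated with a small pure-Python sketch. It does not use the khmer API and is not the authors' script: it counts kmers exactly with a `Counter` instead of khmer's probabilistic counting structure, it does not merge reverse-complement kmers, and the function names and the handling of reads shorter than k (which are dropped) are assumptions of this example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping kmers of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k=31):
    """Build an exact kmer abundance table across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def filter_low_coverage(reads, k=31, max_low=2):
    """Keep reads whose median kmer abundance is greater than max_low.

    With max_low=2 this excludes reads with a median kmer coverage of
    two or less. High-coverage reads are never removed.
    """
    counts = count_kmers(reads, k)
    kept = []
    for read in reads:
        abundances = [counts[km] for km in kmers(read, k)]
        if abundances and median(abundances) > max_low:
            kept.append(read)
    return kept


# Toy example with k=4: one read seen three times, one seen once.
toy_reads = ["ACGTACGA", "ACGTACGA", "ACGTACGA", "TTTTGGGC"]
kept_reads = filter_low_coverage(toy_reads, k=4)
```

In the toy example the three copies of the first read have a median kmer abundance of 3 and are kept, while the singleton read has a median of 1 and is excluded.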


Shirasawa K., Moraga R., Ghelfi A., Hirakawa H., Nagasaki H., Ghamkhar K., Barrett B.A., Griffiths A.G., & Isobe S.N. (2023). An improved reference genome for Trifolium subterraneum L. provides insight into molecular diversity and intra-specific phylogeny. Frontiers in Plant Science, 14, 1103857.

Publication 2023

Clover Collapse Genome libraries Genomic Hash Transposable elements


Other organizations : Kazusa DNA Research Institute, AgResearch, National Institute of Genetics


Variable analysis

independent variables

Sequencing data types (single-end, paired-end, and mate-pair data)

dependent variables

Sequencing quality metrics
Genome size estimate

control variables

Kmer abundance distribution in the unassembled Illumina 180 bp insert paired-end sequence data
Sequencing data filtering parameters (Discard reads with a mean quality lower than 20, trim ends to end quality of 30, discard reads shorter than 54bp)
Parameters used for digital normalization of whole genome sequencing (WGS) reads (exclude reads with median kmer coverage of two or less)
