Comprehensive Overrepresented Sequence Analysis

Some sequences, or even entire reads, can be overrepresented in FASTQ data. Analysis of these overrepresented sequences provides an overview of certain sequencing artifacts such as PCR over-duplication, polyG tails and adapter contamination. FASTQC offers an overrepresented sequence analysis module; however, according to its author’s introduction, FASTQC only tracks the first 1 M reads of the input file to conserve memory. We suggest that inferring the overall distribution from the first 1 M reads is unreliable, because the initial reads in Illumina FASTQ data usually originate from the edges of flowcell lanes and may have lower quality and different patterns than the data as a whole.
Unlike FASTQC, fastp samples all reads evenly to evaluate overrepresented sequences, which avoids the bias of examining only part of the file. To implement this feature efficiently, we designed a two-step method. In the first step, fastp completely analyzes the first 1.5 M base pairs of the input FASTQ to obtain a list of sequences of different sizes that occur relatively often. In the second step, fastp samples the entire file and counts the occurrences of each listed sequence. Finally, the sequences with high occurrence frequency are reported.
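The two-step method can be illustrated with a minimal Python sketch. This is not fastp's implementation: the function names, the substring sizes, the `min_count` threshold and the sampling `step` are all hypothetical placeholders, and reads are assumed to be available as plain sequence strings.

```python
from collections import Counter
from itertools import islice


def candidate_sequences(seqs, base_budget=1_500_000, sizes=(10, 20, 40), min_count=2):
    """Step 1: fully scan reads until `base_budget` bases have been seen,
    count every substring of the given sizes, and keep the frequent ones.
    `sizes` and `min_count` are illustrative assumptions."""
    counts = Counter()
    seen = 0
    for seq in seqs:
        if seen >= base_budget:
            break
        seen += len(seq)
        for k in sizes:
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return {s for s, c in counts.items() if c >= min_count}


def count_by_sampling(seqs, candidates, sizes=(10, 20, 40), step=20):
    """Step 2: visit every `step`-th read across the whole input and count
    occurrences of the candidate sequences only (`step` is an assumption)."""
    counts = Counter()
    sampled = 0
    for seq in islice(seqs, 0, None, step):
        sampled += 1
        for k in sizes:
            for i in range(len(seq) - k + 1):
                sub = seq[i:i + k]
                if sub in candidates:
                    counts[sub] += 1
    return counts, sampled
```

Restricting the second pass to a fixed candidate set keeps memory bounded by the size of that set, while the even sampling lets every region of the file contribute to the counts.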
Besides the occurrence frequency, fastp also records the positions of overrepresented sequences. This information is quite useful for diagnosing sequence quality issues. Some sequences tend to appear in the read head whereas others appear more often in the read tail. The distribution of overrepresented sequences is visualized in the HTML report. Figure 5 shows a demonstration of overrepresented sequence analysis results.
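One way to record where overrepresented sequences occur is a per-sequence histogram over relative read position, sketched below. This is an assumption-laden illustration rather than fastp's own code: `position_histogram`, the `bins` parameter and the binning by relative start position are all hypothetical choices.

```python
from collections import defaultdict


def position_histogram(seqs, candidates, sizes, bins=10):
    """For each candidate sequence, count how often it starts in each
    relative-position bin of a read (bin 0 = read head, last bin = read tail).
    The binning scheme is an illustrative assumption."""
    hist = defaultdict(lambda: [0] * bins)
    for seq in seqs:
        for k in sizes:
            last = len(seq) - k  # last valid start index for a k-length substring
            for i in range(last + 1):
                sub = seq[i:i + k]
                if sub in candidates:
                    rel = i / last if last > 0 else 0.0
                    b = min(int(rel * bins), bins - 1)
                    hist[sub][b] += 1
    return dict(hist)
```

A sequence whose counts pile up in the last bins, such as a run of G bases, would point to a tail artifact, whereas counts concentrated in the first bins would point to an issue at the read head.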


Chen S., Zhou Y., Chen Y., & Gu J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884-i890.

Publication 2018


Corresponding Organization: Shenzhen Institutes of Advanced Technology


Protocol cited in 2,458 other protocols

Variable analysis

independent variables

None specified

dependent variables

Occurrence frequency of overrepresented sequences
Positions of overrepresented sequences

control variables

None specified

Annotations

Based on most similar protocols


Protocol 2 mentions that the FastQC and MultiQC tools were used to assess the quality of the FASTQ files; the results showed that the reads were of very good quality and that no further corrections were needed at this stage.

Protocol 3 describes a stringent quality control pipeline that includes trimming of adapter sequences, filtering out reads with low-quality scores, and removing potential PCR duplicates. It also mentions using the interquartile range to identify and remove outliers, and removing samples with an average coverage below 1000X, as all variant callers exhibited power below 80% at coverage below 1000X.

Protocol 4 mentions trimming the final 29 bp to discard lower-quality base calls, filtering reads with at least a single base call with a Phred quality score below 10 (90% call accuracy), and/or more than 5% below a Phred score of 20 (99% call accuracy). It also mentions discarding reads that did not contain an SbfI cut site or a unique P1 barcode in the 5' end, as well as those with adapter contamination in the 3' end.

Protocol 5 mentions that FASTQs with low read counts were discarded from further analysis because an entire flowcell showed cluster generation problems, leaving 134 FASTQs for the 95 samples.


