Some sequences, or even entire reads, can be overrepresented in FASTQ data. Analysis of these overrepresented sequences provides an overview of certain sequencing artifacts such as PCR over-duplication, polyG tails and adapter contamination. FASTQC offers an overrepresented sequence analysis module, however, according to the author’s introduction, FASTQC only tracks the first 1 M reads of the input file to conserve memory. We suggest that inferring the overall distribution from the first 1 M reads is not a reliable solution as the initial reads in Illumina FASTQ data usually originate from the edges of flowcell lanes, which may have lower quality and different patterns than the overall distribution.
Unlike FASTQC, fastp samples all reads evenly to evaluate overrepresented sequences and eliminate partial distribution bias. To achieve an efficient implementation of this feature, we designed a two-step method. In the first step, fastp completely analyzes the first 1.5 M base pairs of the input FASTQ to obtain a list of sequences with relatively high occurrence frequency in different sizes. In the second step, fastp samples the entire file and counts the occurrence of each sequence. Finally, the sequences with high occurrence frequency are reported.
Besides the occurrence frequency, fastp also records the positions of overrepresented sequences. This information is quite useful for diagnosing sequence quality issues. Some sequences tend to appear in the read head whereas others appear more often in the read tail. The distribution of overrepresented sequences is visualized in the HTML report. Figure 5 shows a demonstration of overrepresented sequence analysis results.