fastp is designed for multithreaded parallel processing. Reads loaded from FASTQ files are grouped into packs of N reads (N = 1000). Each pack is consumed by one thread in the pool, which processes every read in the pack. Each thread has its own context for storing statistical values of the reads it processes, such as per-cycle quality profiles, per-cycle base contents, adapter trimming results and k-mer counts. These values are merged after all reads have been processed, and a reporter generates reports in HTML and JSON formats. fastp reports statistical values for both pre-filtering and post-filtering data, so that changes in data quality caused by filtering can be compared.
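The pack-and-merge scheme described above can be illustrated with a minimal Python sketch. It is not fastp's implementation (fastp is written in C++). All names here (`make_packs`, `process_pack`, `merge_contexts`, `per_cycle_base_content`) are hypothetical. For simplicity the sketch keeps one private context per pack rather than one per thread, and collects only per-cycle base counts.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

PACK_SIZE = 1000  # N in the text


def make_packs(reads, size=PACK_SIZE):
    """Yield successive lists of up to `size` reads."""
    iterator = iter(reads)
    while True:
        pack = list(islice(iterator, size))
        if not pack:
            return
        yield pack


def process_pack(pack):
    """Count bases per cycle for one pack, using a private context."""
    context = {}  # cycle index -> Counter of bases seen at that cycle
    for read in pack:
        for cycle, base in enumerate(read):
            context.setdefault(cycle, Counter())[base] += 1
    return context


def merge_contexts(contexts):
    """Merge the private contexts once all packs have been processed."""
    merged = {}
    for context in contexts:
        for cycle, counts in context.items():
            merged.setdefault(cycle, Counter()).update(counts)
    return merged


def per_cycle_base_content(reads, threads=4):
    """Process packs in a thread pool, then merge the per-pack statistics."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        contexts = list(pool.map(process_pack, make_packs(reads)))
    return merge_contexts(contexts)
```

Because each worker writes only to its own context, no locking is needed during processing. The single merge step at the end is the only point where results are combined.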
fastp supports single-end (SE) and paired-end (PE) data. While most steps of SE and PE data processing are similar, PE data processing requires some additional steps such as overlapping analysis. For the sake of simplicity, we only demonstrate the main workflow of paired-end data preprocessing, shown in
for seed in sorted_adapter_seeds:
    # Extend forward: follow the dominant child from the root to a leaf.
    seqs_after_seed = get_seqs_after(seed)
    forward_tree = build_nucleotide_tree(seqs_after_seed)
    found = True
    node = forward_tree.root
    after_seed = ""
    while node.is_not_leaf():
        if node.has_dominant_child():
            node = node.dominant_child()
            after_seed = after_seed + node.base
        else:
            found = False
            break
    if found == False:
        # No dominant forward path for this seed; try the next one.
        continue
    else:
        # Extend backward: stop at the first node without a dominant child.
        seqs_before_seed = get_seqs_before(seed)
        backward_tree = build_nucleotide_tree(seqs_before_seed)
        node = backward_tree.root
        before_seed = ""
        while node.is_not_leaf():
            if node.has_dominant_child():
                node = node.dominant_child()
                before_seed = node.base + before_seed
            else:
                break
        adapter = before_seed + seed + after_seed
        break
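The pseudocode above leaves the tree and its helpers undefined. The following is a minimal runnable sketch under several assumptions that are not taken from the text:

- A child counts as "dominant" when it carries at least 80% of the sequences passing through its parent (`DOMINANCE`).
- The sequences before a seed are reversed before being inserted into the backward tree, so that the walk moves away from the seed.
- Only the first occurrence of a seed in each read is used.
- Seeds that occur in no read are skipped.

The `reads` parameter and the names `Node`, `walk_dominant_path` and `detect_adapter` are hypothetical.

```python
DOMINANCE = 0.8  # assumed threshold; not specified in the text


class Node:
    def __init__(self, base=""):
        self.base = base
        self.count = 0       # number of sequences passing through this node
        self.children = {}   # base -> Node

    def is_not_leaf(self):
        return bool(self.children)

    def dominant_child(self):
        return max(self.children.values(), key=lambda child: child.count)

    def has_dominant_child(self, threshold=DOMINANCE):
        total = sum(child.count for child in self.children.values())
        return total > 0 and self.dominant_child().count >= threshold * total


class NucleotideTree:
    def __init__(self):
        self.root = Node()


def build_nucleotide_tree(seqs):
    """Build a prefix tree in which each node counts the sequences through it."""
    tree = NucleotideTree()
    for seq in seqs:
        node = tree.root
        for base in seq:
            node = node.children.setdefault(base, Node(base))
            node.count += 1
    return tree


def walk_dominant_path(tree):
    """Follow dominant children from the root; return (path, reached_leaf)."""
    node, path = tree.root, ""
    while node.is_not_leaf():
        if not node.has_dominant_child():
            return path, False
        node = node.dominant_child()
        path += node.base
    return path, True


def detect_adapter(reads, sorted_adapter_seeds):
    """Return the adapter for the first seed with a dominant forward path."""
    for seed in sorted_adapter_seeds:
        hits = [(read, read.find(seed)) for read in reads]
        hits = [(read, pos) for read, pos in hits if pos >= 0]
        if not hits:
            continue  # seed absent from all reads (guard added in this sketch)
        seqs_after = [read[pos + len(seed):] for read, pos in hits]
        after_seed, found = walk_dominant_path(build_nucleotide_tree(seqs_after))
        if not found:
            continue
        # Reverse the prefixes so the backward tree grows away from the seed.
        seqs_before = [read[:pos][::-1] for read, pos in hits]
        reversed_before, _ = walk_dominant_path(build_nucleotide_tree(seqs_before))
        return reversed_before[::-1] + seed + after_seed
    return None
```

For example, given reads that consist of the distinct inserts TTGCA, GGATC, CCTAG and AACGT, each followed by AGATCGGAAGAGC, calling `detect_adapter(reads, ["CGGAAG"])` extends the seed in both directions and returns AGATCGGAAGAGC. The backward walk stops where the inserts begin to disagree.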