Initially, the sequencing reads underwent a quality control check using the scripts fastq_quality_filter.pl and fastq_quality_trimmer.pl from FASTX-Toolkit [12]. A Phred score of 20 was chosen as the minimum threshold for base quality. Reads in which more than 80% of the bases were of low quality were removed, and reads whose 3′ end bases did not reach the minimum threshold had those bases trimmed. Next, the reads were aligned to the human reference genome hg19/GRCh37 using BWA-MEM [13] with default parameters, and Picard [14] was applied for post-alignment procedures such as sorting, indexing, and marking duplicates. The alignments then underwent local realignment around INDELs and base quality score recalibration (BQSR) using the Genome Analysis Toolkit (GATK) version 3.0 [15].
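The read-level quality rules described above can be illustrated with a minimal Python sketch. It is not the FASTX-Toolkit implementation: the function names, the Phred+33 quality encoding, and the choice to treat the 80% limit as a strict "more than" comparison are assumptions made for illustration only.

```python
# Illustrative sketch of the quality rules; not the FASTX-Toolkit code.
# Assumption: qualities are encoded as Phred+33 ASCII characters.

def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_line]

def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end until one reaches the minimum quality."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def low_quality_fraction(quals, min_q=20):
    """Fraction of bases whose Phred score is below the threshold."""
    if not quals:
        return 1.0  # an empty read is treated as entirely low quality
    return sum(q < min_q for q in quals) / len(quals)

def keep_read(quals, min_q=20, max_low_fraction=0.8):
    """Keep a read unless more than 80% of its bases are low quality."""
    return low_quality_fraction(quals, min_q) <= max_low_fraction

# Example: the last two bases ('#', Phred 2) are trimmed from the 3' end.
quals = phred_scores("IIII##")
trimmed_seq, trimmed_quals = trim_3prime("ACGTAC", quals)
```

The order in which trimming and read removal are applied is not stated in the text, so the two steps are kept as separate functions here.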
MuTect [16] and GATK (HaplotypeCaller) were used for single nucleotide variant calling. GATK variants were filtered with the Variant Quality Score Recalibration tool, following the best practices on the GATK website. Because GATK performs variant calling and filtration in the normal and tumor samples independently, we subtracted the normal variants from the tumor variants to obtain our first set of candidate somatic variants.
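The tumor-minus-normal subtraction can be sketched as a set difference. The representation of a variant as a (chromosome, position, reference allele, alternate allele) tuple and the function name are assumptions made for this example; the text does not specify how variants were matched between samples.

```python
# Illustrative sketch; a variant is assumed to be identified by
# (chromosome, position, reference allele, alternate allele).

def subtract_normal(tumor_variants, normal_variants):
    """Return tumor variants that were not called in the matched normal."""
    normal = set(normal_variants)
    return [v for v in tumor_variants if v not in normal]

tumor = [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "T")]
normal = [("chr2", 200, "G", "C")]
candidates = subtract_normal(tumor, normal)
```

Matching on the full tuple means that a different alternate allele at the same position is still retained as a candidate.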
To ensure the somatic classification of the SNVs called by GATK, we adapted the MuTect algorithm and applied its LODN classifier after the GATK variant calling and filtering. The LODN is a Bayesian classifier that compares the likelihood of two models: (1) the mutation does not exist in the normal sample and all non-reference bases are explained by sequencing noise, and (2) the mutation truly exists in the normal sample as a germline heterozygous variant. The logarithm of the ratio of these two likelihoods is called the LOD (log odds) score; when it exceeds a decision threshold, the mutation can be classified as somatic. For this filtering, we considered only sites with a total read depth greater than or equal to 8 in the normal sample and greater than or equal to 14 in the tumor sample. Our final candidate list consisted of the union of the MuTect and GATK-LODN results.
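A minimal sketch of a normal-sample LOD score in the spirit of the two models above is given here. It is not the adapted MuTect code: the per-base likelihood follows the form published for MuTect, the pileup is simplified to (base, error probability) pairs with error probabilities strictly between 0 and 1, and the default threshold of 2.2 is an illustrative placeholder, since the text does not state the value used.

```python
import math

# Illustrative sketch; not the authors' adapted MuTect implementation.
# A pileup is assumed to be a list of (base, error_probability) pairs,
# with 0 < error_probability < 1.

def base_likelihood(base, error, ref, alt, f):
    """Probability of one observed base given alt-allele fraction f."""
    if base == ref:
        return f * error / 3 + (1 - f) * (1 - error)
    if base == alt:
        return f * (1 - error) + (1 - f) * error / 3
    return error / 3  # a third allele is explained by sequencing error only

def log10_likelihood(pileup, ref, alt, f):
    """Log10 likelihood of the whole pileup under allele fraction f."""
    return sum(math.log10(base_likelihood(b, e, ref, alt, f)) for b, e in pileup)

def normal_lod(pileup, ref, alt):
    """Model 1 (f = 0, noise only) versus model 2 (f = 0.5, germline het)."""
    return (log10_likelihood(pileup, ref, alt, 0.0)
            - log10_likelihood(pileup, ref, alt, 0.5))

def is_somatic(normal_pileup, tumor_depth, ref, alt, threshold=2.2):
    """Apply the depth cut-offs from the text, then the LOD threshold.

    threshold=2.2 is a placeholder; the source does not give its value.
    """
    if len(normal_pileup) < 8 or tumor_depth < 14:
        return False
    return normal_lod(normal_pileup, ref, alt) >= threshold

clean_normal = [("A", 0.001)] * 10                    # only reference bases
het_normal = [("A", 0.001)] * 5 + [("T", 0.001)] * 5  # looks heterozygous
```

With these inputs, `normal_lod(clean_normal, "A", "T")` is positive (the noise-only model is favored) and `normal_lod(het_normal, "A", "T")` is negative (the germline model is favored), so only the first site would be retained as a somatic candidate.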
The variants were annotated with ANNOVAR [17], using the Ensembl Gene annotation database for human genome build 37 (http://www.ensembl.org/), and searched for matches in the dbSNP138 and 1000 Genomes data. We selected exonic single nucleotide variants (SNVs) that were non-synonymous or that caused the gain or loss of a stop codon. Variants present in dbSNP138 and 1000 Genomes with a minor allele frequency (MAF) greater than 0.05 were removed. Figure 1 summarizes the pipeline steps. The scripts for running the main pipeline steps are available at https://bitbucket.org/BBDA-UNIBO/wes-pipeline.
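The selection of annotated variants can be sketched as a simple filter. The dictionary field names (`region`, `function`, `dbsnp_maf`, `kg_maf`) are hypothetical, the function labels are written in the style of ANNOVAR output but may differ between versions, and removing a variant when either database reports a MAF above 0.05 is one possible reading of the text.

```python
# Illustrative sketch; field names are hypothetical and the labels are
# assumed to resemble ANNOVAR's exonic function annotations.

KEPT_FUNCTIONS = {"nonsynonymous SNV", "stopgain", "stoploss"}

def keep_variant(variant, max_maf=0.05):
    """Keep exonic non-synonymous or stop gain/loss SNVs that are not common."""
    if variant["region"] != "exonic":
        return False
    if variant["function"] not in KEPT_FUNCTIONS:
        return False
    # Assumption: a MAF above the cut-off in either database removes the variant.
    for key in ("dbsnp_maf", "kg_maf"):
        maf = variant.get(key)
        if maf is not None and maf > max_maf:
            return False
    return True

variants = [
    {"id": "v1", "region": "exonic", "function": "nonsynonymous SNV",
     "dbsnp_maf": None, "kg_maf": None},
    {"id": "v2", "region": "exonic", "function": "synonymous SNV",
     "dbsnp_maf": None, "kg_maf": None},
    {"id": "v3", "region": "exonic", "function": "stopgain",
     "dbsnp_maf": 0.12, "kg_maf": 0.10},
    {"id": "v4", "region": "intronic", "function": "",
     "dbsnp_maf": None, "kg_maf": None},
]
selected = [v["id"] for v in variants if keep_variant(v)]
```

Variants absent from both databases have no MAF value and therefore pass the frequency check.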
A subset of variants from the MuTect, GATK and GATK-LODN calls was selected for validation. Variants with allelic frequency higher than 0.2 were validated by Sanger sequencing, and those with allelic frequency lower than 0.2 were validated using the Illumina TruSight Myeloid Sequencing Panel and Illumina MiSeq sequencing. Data were analyzed with the VariantStudio software (Illumina), according to the manufacturer's instructions.
Figure 1. Pipeline of SNV detection in sequencing data of cancer samples. Summary of the steps, and their respective tools, for detecting SNVs in paired normal-cancer sequencing data.