The first part of the IntOGen-mutations pipeline assesses the potential functional impact of somatic mutations detected across the cohort of tumor samples. The Ensembl Variant Effect Predictor (VEP, v.70; ref. 10) script and precomputed cache files, downloaded from the Ensembl FTP site (ftp://ftp.ensembl.org/pub/), are used to determine the consequences of somatic mutations in annotated functional elements. The pipeline obtains SIFT (ref. 11) and PolyPhen2 (ref. 12) functional impact from VEP. Precomputed MutationAssessor (ref. 13) functional impacts are obtained from the MutationAssessor Web server (http://www.mutationassessor.org/) during the installation of the pipeline and are queried locally during execution. The transformation of functional impact scores to account for the baseline tolerance of genes to germline mutation (transFIC), described elsewhere (ref. 14), has been reimplemented in Python as a module of the IntOGen-mutations pipeline.
The pipeline implements an expression filter to disregard genes that are not expressed across the tumor samples in the cohort. This list of expressed genes is an optional input to the pipeline, which excludes all genes outside the list from the foreground of both OncodriveFM and OncodriveCLUST (see below) while keeping their mutations in the background. In the current release of the IntOGen-mutations Web discovery tool, we have employed as a filter the list of genes expressed across any of the 12 pan-cancer data sets (ref. syn1734155).
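The effect of the expression filter can be illustrated with a minimal Python sketch. The function name, the `(gene, sample)` record layout and the gene symbols are hypothetical and are not taken from the pipeline code. The sketch only shows the behavior described above: unexpressed genes leave the foreground, but their mutations remain in the background.

```python
def split_foreground(mutations, expressed_genes=None):
    """Return (foreground, background) lists of mutations.

    mutations: iterable of (gene, sample) tuples (hypothetical layout).
    expressed_genes: optional set of gene identifiers; None disables the filter.
    """
    # Every mutation stays in the background, whether or not its gene is expressed.
    background = list(mutations)
    if expressed_genes is None:
        # No list supplied: all genes are tested.
        foreground = list(background)
    else:
        # Only mutations in expressed genes are tested by the driver methods.
        foreground = [m for m in background if m[0] in expressed_genes]
    return foreground, background


mutations = [("TP53", "s1"), ("OR4F5", "s2"), ("KRAS", "s3")]
foreground, background = split_foreground(mutations, {"TP53", "KRAS"})
```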
The OncodriveFM and OncodriveCLUST approaches, also described elsewhere (refs. 7, 9), have been reimplemented as IntOGen-mutations pipeline modules and are available as independent programs from two Git-controlled repositories at https://bitbucket.org/bbglab/. Briefly, OncodriveFM receives as input the list of synonymous, nonsynonymous and frameshift-indel mutations and their corresponding SIFT, PolyPhen2 and MutationAssessor scores. It then assesses whether any gene shows a trend toward the accumulation of mutations with high functional impact as compared to the background distribution of these functional impact scores in all mutations detected across the cohort of tumor samples (FM bias). For each functional impact score included in the pipeline, the method produces an empirical P value that evaluates this FM bias. These three P values are subsequently combined using Fisher's approach to produce one integrated P value for each gene. To account for possible nonindependence between the three P values included in the combination, the IntOGen-mutations Web discovery tool considers as significant those genes with a false discovery rate (FDR) below 0.05.
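The combination and thresholding steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline's own code. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption. The gene names and P values are invented, and the floor applied to zero-valued P values is a convenience of this sketch.

```python
import math


def fisher_combine(p_values, floor=1e-300):
    """Combine P values with Fisher's method.

    X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees
    of freedom under the null. For even degrees of freedom the survival
    function has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one P value")
    # Assumption of this sketch: clamp zeros so that log() is defined.
    half_x = -sum(math.log(max(p, floor)) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return min(1.0, math.exp(-half_x) * total)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented example: three per-score P values for each of two genes.
gene_p = {"GENE_A": (0.001, 0.004, 0.002), "GENE_B": (0.40, 0.55, 0.30)}
combined = {gene: fisher_combine(ps) for gene, ps in gene_p.items()}
q_values = dict(zip(combined, benjamini_hochberg(list(combined.values()))))
significant = [gene for gene, q in q_values.items() if q < 0.05]
```

Equivalent routines are available as `scipy.stats.combine_pvalues` (with `method='fisher'`) and `statsmodels.stats.multitest.multipletests` (with `method='fdr_bh'`). The text does not state whether the pipeline calls them.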
OncodriveFM also computes an FM bias for pathways. Three z scores are computed in this case to assess the trend of pathways to accumulate mutations with high functional impact. The z scores are combined using Stouffer's approach, and the combined z score is transformed into an integrated P value.
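A minimal sketch of the pathway-level combination follows. It assumes the unweighted form of Stouffer's method and a one-sided (upper-tail) conversion to a P value. Neither detail is specified in the text, and the function name is hypothetical.

```python
import math


def stouffer_combine(z_scores):
    """Combine z scores with unweighted Stouffer's method.

    Returns (combined_z, p_value), where p_value is the upper-tail
    probability of the standard normal distribution (an assumption here).
    """
    k = len(z_scores)
    if k == 0:
        raise ValueError("need at least one z score")
    combined_z = sum(z_scores) / math.sqrt(k)
    # Upper-tail normal probability: 1 - Phi(z) = 0.5 * erfc(z / sqrt(2)).
    p_value = 0.5 * math.erfc(combined_z / math.sqrt(2))
    return combined_z, p_value


# Invented example: one z score per functional impact score.
combined_z, p_value = stouffer_combine([2.0, 2.0, 2.0])
```

With three equal z scores of 2.0, the combined z score is 2.0 times the square root of 3, which is larger than any single input. Consistent evidence across the three scores therefore strengthens the pathway-level signal.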
OncodriveCLUST, on the other hand, receives as input two separate lists of mutations: potentially protein-affecting mutations (nonsynonymous, stop and splice site) and silent mutations (synonymous), with their corresponding locations across the proteins' sequences. It then assesses the significance of the trend of potentially protein-affecting mutations to be clustered with respect to a background represented by the homologous trend for silent mutations.
Genes mutated in fewer than 1% of the samples in projects whose median number of mutations per sample was below 100 were not analyzed by OncodriveFM. In projects with a higher median number of mutations per sample, this threshold was set to 5 samples with mutations. For OncodriveCLUST, the thresholds were 3 and 5 mutated samples, respectively. These and many other parameters of the pipeline are configurable by the user, as explained in its documentation.
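One possible reading of these recurrence thresholds is sketched below. The function name and method labels are hypothetical. Two points are interpretations rather than statements from the text: a median of exactly 100 is treated as "higher", and the fixed thresholds are taken to mean "at least this many mutated samples".

```python
def passes_recurrence_filter(method, mutated_samples, n_samples, median_mutations):
    """Decide whether a gene is analyzed, given how many samples carry mutations in it.

    method: "oncodrivefm" or "oncodriveclust" (hypothetical labels).
    mutated_samples: number of samples with a mutation in the gene.
    n_samples: number of samples in the project.
    median_mutations: median number of mutations per sample in the project.
    """
    # Assumption: a median of exactly 100 counts as a high-mutation project.
    high_mutation_project = median_mutations >= 100
    if method == "oncodrivefm":
        if high_mutation_project:
            return mutated_samples >= 5
        # At least 1% of samples; integer arithmetic avoids float rounding.
        return mutated_samples * 100 >= n_samples
    if method == "oncodriveclust":
        return mutated_samples >= (5 if high_mutation_project else 3)
    raise ValueError("unknown method: %r" % (method,))


# Invented example: a 300-sample project with a median of 50 mutations per sample.
fm_ok = passes_recurrence_filter("oncodrivefm", 3, 300, 50)
clust_ok = passes_recurrence_filter("oncodriveclust", 2, 300, 50)
```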
In addition to third-party (and in-house) software and data, IntOGen-mutations pipeline installation requires some Python libraries. The most important of these are the numpy and scipy scientific computing libraries and the statsmodels Python statistical library.
The pipeline also relies on other external data files. During pipeline installation, all of the needed external and third-party data files are downloaded and correctly placed, and external libraries are downloaded and compiled, thereby creating a Python environment where the pipeline executes.
The analysis of the 4,623 tumor samples currently included in the IntOGen-mutations Web discovery tool takes approximately 5 h on an eight-core, 12 GB RAM computer.
Gonzalez-Perez A., Perez-Llamas C., Deu-Pons J., Tamborero D., Schroeder M.P., Jene-Sanz A., Santos A., & Lopez-Bigas N. (2013). IntOGen-mutations identifies cancer drivers across tumor types. Nature Methods, 10(11), 1081-1082.