To identify cancer gene expression datasets with corresponding patient outcome data we queried NCBI Gene Expression Omnibus (GEO), EBI ArrayExpress, NCI caArray, and Stanford Microarray Database for the terms, survival, prognosis, prognostic, or outcome. Perl scripts were implemented to download processed and raw data, and associated annotation. For data within NCBI, the array platform was determined from the SOFT format file, and the corresponding annotation file was retrieved from GEO. From these, the Probe ID, Genbank accession, HUGO gene symbol and gene description were extracted based on the internal headers of the SOFT annotation file. The desired fields were specified manually if this automated procedure failed. For older platforms, such as cDNA microarrays, where annotations had not been recently updated, we re-mapped the probe sequences to HUGO gene symbols via the Genbank or Refseq accession number through the NCBI Entrez gene identifier. In cases without available accessions, but with the DNA sequence of the probe, we performed the mapping using BLAT to compare probes to a Refseq reference and look for unique highest-scoring hits.
Scripts were written to extract sample annotation information from GEO SOFT format files and parse them into tables. Since the contents of annotation fields are not semantically enforced, sample data can be contained in various fields, including Sample_title, Sample_characteristics, Sample_description, and Sample_source. Moreover, not all fields are specified for every sample. To parse this information into tabular format, we attempted to estimate the correct variable name (column header) by searching for common substrings across samples. In some cases, a dataset clearly had survival information, but was not deposited with the genomic data. In such cases, we first searched supplementary information of corresponding literature for the missing information. Failing this, we contacted corresponding and first authors, of which roughly half supplied the requested data.
All tabulations of clinical annotations were further checked and manually curated. This process included verification of results in selected studies by direct comparison of Kaplan-Meier plots and time scales with those in the corresponding primary publications, as well as consistency of prognostic genes across studies. Separately, errors due to technical issues or the curation process were estimated by comparing annotated gender to the ratio of RPS4Y1 to XIST (male:female) expression levels56 after microarray normalization, as detailed below (Supplementary Fig. 1a–c). Furthermore, identical samples present in more than one dataset were identified using MD5 checksums for Affymetrix data, and by cross-correlation analysis of expression vectors, and redundant samples were accordingly eliminated.
We applied the following gene expression normalization strategy to allow unification of data from diverse microarray platforms within PRECOG. For Affymetrix GeneChip data, raw CEL files were obtained when possible, and were normalized with the MAS5 algorithm (affy package v. 1.26 of Bioconductor v. 1.8 in R 2.15.1), using a custom CDF (Chip Definition File) for probeset summarization, which updates and maps array oligonucleotides to Entrez gene identifiers57 -59 (http://brainarray.mbni.med.umich.edu/Brainarray/). Each dataset, regardless of platform, was quantile normalized separately. Moreover, each gene was log2 transformed if not already in log space, and was then unit mean/variance standardized across samples within a given dataset. While alternative microarray normalization methods have been proposed (e.g., RMA60 , gcRMA61 , fRMA62 , SCAN-UPC63 ), for survival analysis we did not observe any significant benefit in comparing Affymetrix data normalized as described above to alternate normalization strategies (data not shown). TCGA RNA-seq and clinical data were downloaded from the TCGA Data Coordinating Center using TCGA-assembler64 . The gene-level RNA-seq data were pre-processed using TCGA-assembler's ProcessRNASeqData function. RNA-seq and clinical data were matched via the patient barcode provided by TCGA.
For each study, the association of each probe on an array platform with survival outcomes was assessed via Cox proportional hazards regression using the coxph function of the R survival package (v. 2.37). Cox coefficients, hazard ratios with 95% confidence intervals, P values, and z-scores were obtained for each array probe. For datasets that had not been processed with Custom CDF, which yields a unique per-gene expression value, survival z-scores for probes were collapsed to the gene level by averaging z-scores of probes that matched to the same HUGO gene symbol. Z-scores for each gene were summarized across all datasets in each malignancy using Lipták's weighted meta-z test65 ,66 , with weights set to the square roots of sample sizes67 . To identify genes with cancer-wide prognostic significance, and avoid bias due to cancers with different sample sizes, we further combined weighted meta-z-scores into a single global meta-z-score for each gene using Stouffer's method (unweighted)66 .
Gentles A.J., Newman A.M., Liu C.L., Bratman S.V., Feng W., Kim D., Nair V.S., Xu Y., Khuong A., Hoang C.D., Diehn M., West R.B., Plevritis S.K, & Alizadeh A.A. (2015). The prognostic landscape of genes and infiltrating immune cells across human cancers. Nature medicine, 21(8), 938-945.