Datasets were obtained mainly from GEO (http://www.ncbi.nlm.nih.gov/geo/) and TCGA (https://tcga-data.nci.nih.gov) after searching for keywords related to cancer, survival, and gene expression technologies. A few additional datasets were obtained from authors' websites and from ArrayExpress (http://www.ebi.ac.uk/arrayexpress/). The source of each dataset is shown in the web interface. We favored cancer types represented by more than two cohorts, and datasets with survival data for more than 30 samples in which a censoring indicator and the time to death, recurrence, relapse, or metastasis were provided. When clinical data were not available online in the corresponding repositories, they were provided by the dataset authors by personal email. Datasets were annotated from provider files as found up to September 2012, and were quantile normalized and log2 transformed when needed. All TCGA datasets were obtained at the gene level (level 3). RNA-Seq count data were log2 transformed. For some cancer types in which many datasets were found for the same gene expression platform, we also provide a merged meta-base. In meta-bases, datasets were quantile normalized, probeset means were equalized across cohorts while conserving the standard deviation within each cohort, and datasets were merged by probeset id. At the moment we provide meta-bases for breast, lung, and ovarian cancer. To facilitate gene searches and conversions between gene identifiers, human gene information was obtained from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz). To simplify the user interface, datasets were grouped by related organ or tissue using disease ontologies [10].
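The meta-base construction described above can be illustrated with a minimal Python sketch. It is not the code used to build the database. It assumes that each cohort is already log2 transformed and stored as a mapping from probeset id to a list of per-sample values, and it reads "equalizing probeset means while conserving the standard deviation" as shifting each probeset in each cohort to a common mean without rescaling. The function names `quantile_normalize` and `merge_cohorts`, the choice of the average cohort mean as the common target, and the toy values are all hypothetical.

```python
from statistics import mean


def quantile_normalize(cohort):
    """Give every sample (column) the same value distribution.

    cohort maps probeset id -> list of log2 values, one per sample.
    Ties are broken by probeset order, which is a simplification.
    """
    probes = list(cohort)
    n_samples = len(cohort[probes[0]])
    columns = [[cohort[p][j] for p in probes] for j in range(n_samples)]
    sorted_columns = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [mean(col[r] for col in sorted_columns)
                 for r in range(len(probes))]
    normalized = {p: [0.0] * n_samples for p in probes}
    for j, col in enumerate(columns):
        order = sorted(range(len(probes)), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[probes[i]][j] = reference[rank]
    return normalized


def merge_cohorts(cohorts):
    """Merge cohorts by probeset id after equalizing probeset means.

    Assumption: each probeset is shifted (not rescaled) so that its mean is
    the same in every cohort, which leaves the within-cohort standard
    deviation unchanged. Only probesets present in all cohorts are kept.
    """
    shared = set(cohorts[0]).intersection(*cohorts[1:])
    merged = {}
    for probe in sorted(shared):
        cohort_means = [mean(c[probe]) for c in cohorts]
        target = mean(cohort_means)  # hypothetical choice of common mean
        merged[probe] = []
        for c, m in zip(cohorts, cohort_means):
            merged[probe].extend(v - m + target for v in c[probe])
    return merged


# Toy cohorts measured on the same (hypothetical) platform.
cohort_a = {"probe_1": [2.0, 4.0], "probe_2": [4.0, 8.0], "probe_3": [6.0, 6.0]}
cohort_b = {"probe_1": [10.0, 12.0, 14.0], "probe_2": [11.0, 13.0, 12.0],
            "probe_4": [1.0, 2.0, 3.0]}

merged = merge_cohorts([quantile_normalize(cohort_a),
                        quantile_normalize(cohort_b)])
```

In this sketch, `merged` holds one row per probeset shared by both cohorts, with the samples of the second cohort appended after those of the first. Shifting rather than rescaling is only one reading of the procedure; a variant that also standardizes the variance would alter the within-cohort spread.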