Method stability was examined using 500 TCGA breast cancer samples for which both RNA-seq and microarray data were available (sample IDs in Additional file 2: Table S1), sub-sampled to vary the number of samples and genes present in each evaluation. To examine the effect of sample size on the score of a given sample, s_i, two data sets were created, one from the RNA-seq data and one from the microarray data, each containing s_i and n − 1 other randomly selected samples. The score for s_i was then computed with all listed methods, and this process was repeated for each of the 500 samples at a given sample size, yielding 500 matched pairs of scores from the microarray and RNA-seq data. Spearman’s rank correlation coefficient and the concordance index were then calculated between the microarray-derived and RNA-seq-derived sample scores. We note that, for some methods, sampling data in this manner changes the background samples for the sample of interest, so the results reflect the influence of overall sample composition on the final scores. A similar analysis was performed by varying the number of genes, sub-sampling genes from the gene set of interest.
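The sub-sampling procedure for sample size can be sketched as follows. This is a minimal illustration on synthetic data, not the scoring methods evaluated in the study: `zscore_signature_score` is a hypothetical cohort-dependent score (mean per-gene z-score) chosen only because its value depends on the background samples, and the sketch assumes that the same n − 1 background samples are drawn for both platforms. The Spearman helper ranks by sorting and does not handle ties.

```python
import numpy as np

def zscore_signature_score(expr, gene_idx, pos):
    """Hypothetical cohort-dependent score: z-score each signature gene
    across the sub-sampled cohort, then average for the sample at `pos`."""
    sub = expr[gene_idx, :]                      # signature genes x cohort samples
    mu = sub.mean(axis=1, keepdims=True)
    sd = sub.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0                            # avoid division by zero
    return ((sub - mu) / sd)[:, pos].mean()

def matched_scores(rnaseq, array, gene_idx, n, rng):
    """Score every sample once per platform, each time against
    n - 1 randomly chosen background samples."""
    n_samples = rnaseq.shape[1]
    out = np.empty((n_samples, 2))
    for i in range(n_samples):
        candidates = np.delete(np.arange(n_samples), i)
        others = rng.choice(candidates, size=n - 1, replace=False)
        cohort = np.concatenate(([i], others))   # sample of interest is at position 0
        # Assumption: the same background cohort is used for both platforms.
        out[i, 0] = zscore_signature_score(rnaseq[:, cohort], gene_idx, 0)
        out[i, 1] = zscore_signature_score(array[:, cohort], gene_idx, 0)
    return out

def spearman(x, y):
    """Spearman correlation as Pearson correlation of ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Synthetic stand-in for matched platforms: shared signal plus platform noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 60))              # 200 genes x 60 samples
rnaseq = latent + 0.3 * rng.normal(size=latent.shape)
array = latent + 0.3 * rng.normal(size=latent.shape)
gene_idx = np.arange(20)                         # hypothetical 20-gene signature

scores = matched_scores(rnaseq, array, gene_idx, n=25, rng=rng)
rho = spearman(scores[:, 0], scores[:, 1])
```

Repeating the call to `matched_scores` with different values of `n` gives one cross-platform correlation per sample size, which is the quantity compared across methods.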
We performed this analysis with both epithelial and mesenchymal gene sets (expected up-regulated gene sets) [17], and the bidirectional TGFβ-EMT signature [8], varying the number of samples, N_S = (2, 5, 25, 50, 500), and genes, N_G = (1000, 3000, 5000, 10000, ALLGENES). All permutations were repeated 20 times to estimate error margins.
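The repetition scheme can be sketched as a loop over the listed settings with 20 repeats each. The names here are hypothetical: `toy_evaluate` is a placeholder for one full sub-sampling run that returns a cross-platform correlation, and summarising the repeats by mean and standard deviation is an assumption, since the text does not state how the error margins were computed.

```python
import random
import statistics

SAMPLE_SIZES = (2, 5, 25, 50, 500)
GENE_COUNTS = (1000, 3000, 5000, 10000, None)   # None stands in for ALLGENES
N_REPEATS = 20

def summarise(evaluate, settings, n_repeats=N_REPEATS, seed=0):
    """Run `evaluate` n_repeats times per setting and return
    {setting: (mean, standard deviation)}."""
    rng = random.Random(seed)
    summary = {}
    for setting in settings:
        values = [evaluate(setting, rng) for _ in range(n_repeats)]
        summary[setting] = (statistics.mean(values), statistics.stdev(values))
    return summary

def toy_evaluate(n_samples, rng):
    """Placeholder for one sub-sampling run; returns a made-up
    correlation whose spread shrinks as the cohort grows."""
    return 0.9 + rng.gauss(0, 0.1 / n_samples ** 0.5)

sample_summary = summarise(toy_evaluate, SAMPLE_SIZES)
```

The same `summarise` call could be applied to `GENE_COUNTS` with an evaluation function that sub-samples genes instead of samples.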