We used single-cell data sets from 10 published studies (Bi et al., 2021; Chung et al., 2017; Darmanis et al., 2017; He et al., 2021; Lee et al., 2020; Ma et al., 2019; Neftel et al., 2019; Puram et al., 2017; Tirosh et al., 2016; Venteicher et al., 2017) to compare the number of expressed genes in tumor versus normal cells and to identify significant heterogeneous patterns between the two phenotypes. Annotations of cell identity were also downloaded from each publication. We filtered all data sets by removing non-expressed genes and then normalized them with the regularized negative binomial regression implemented in Seurat. We used the C2 (n = 6226), C3 (n = 3556) and Hallmark (n = 50) modules from MSigDB v7.2 (Subramanian et al., 2005) to calculate the ratio of signature genes across all data sets and for subsequent signature scoring. We tested five tools for signature score calculation: SCSE (Pont et al., 2019), AUCell (Aibar et al., 2017), ssGSEA, GSVA (Hänzelmann et al., 2013), and JASMINE. GSVA was included in the tumor-normal comparisons but was dropped from the gold standard tests and down-sampling experiments because it ran slowly and its outputs were highly correlated with those of ssGSEA. We used the GSVA and ssGSEA methods implemented in the GSVA Bioconductor package (Hänzelmann et al., 2013) and the AUCell method from the AUCell Bioconductor package (Aibar et al., 2017), all with default parameters. We implemented SCSE in the R environment (v4.0) according to the equation reported in the original paper (Pont et al., 2019). The output scores were used as is in the tumor/normal cell comparisons and simulation analyses.
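The filtering and per-cell summaries described above can be illustrated with a minimal sketch. It is written in Python for brevity, although the analyses themselves were run in R, and it is not the code used in the study. The toy count matrix, gene names, cell labels, and all function names are hypothetical. The last function is a generic count-fraction score (signature counts divided by total counts per cell); it is included as an assumption and does not necessarily match the SCSE equation of Pont et al. (2019), which is not reproduced in this section.

```python
# Toy raw count matrix: gene -> counts per cell (hypothetical values).
counts = {
    "GENE_A": [5, 0, 2, 0],
    "GENE_B": [0, 0, 0, 0],  # never expressed; removed by the filter
    "GENE_C": [1, 3, 0, 0],
    "GENE_D": [0, 4, 4, 1],
}
labels = ["tumor", "tumor", "normal", "normal"]  # hypothetical annotations
signature = {"GENE_A", "GENE_C", "GENE_X"}  # GENE_X is absent from the data


def drop_unexpressed(counts):
    """Keep only genes with a nonzero count in at least one cell."""
    return {g: v for g, v in counts.items() if any(c > 0 for c in v)}


def n_cells(counts):
    return len(next(iter(counts.values())))


def expressed_genes_per_cell(counts):
    """Number of genes with a nonzero count in each cell."""
    return [sum(1 for v in counts.values() if v[j] > 0)
            for j in range(n_cells(counts))]


def signature_detection_ratio(counts, signature):
    """Fraction of the signature genes present in the data that each cell expresses."""
    present = [g for g in signature if g in counts]
    return [sum(1 for g in present if counts[g][j] > 0) / len(present)
            for j in range(n_cells(counts))]


def count_fraction_score(counts, signature):
    """Assumed generic score: signature counts / total counts, per cell."""
    scores = []
    for j in range(n_cells(counts)):
        total = sum(v[j] for v in counts.values())
        sig = sum(counts[g][j] for g in signature if g in counts)
        scores.append(sig / total if total > 0 else 0.0)
    return scores


def mean_by_label(values, labels):
    """Average a per-cell value within each annotation group."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}


filtered = drop_unexpressed(counts)
n_genes = expressed_genes_per_cell(filtered)
ratio = signature_detection_ratio(filtered, signature)
score = count_fraction_score(filtered, signature)
genes_by_group = mean_by_label(n_genes, labels)
```

In this toy example, `genes_by_group` holds the mean number of expressed genes for tumor and normal cells, which is the quantity compared between the two phenotypes. A real analysis would operate on the normalized Seurat objects and the MSigDB gene sets, not on hand-written dictionaries.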