A number of computational methods have been proposed to infer cell type abundance, cell type-specific GEPs, or both from bulk tissue expression profiles2–8. These methods generally assume that a biological mixture sample can be modeled as a system of linear equations, in which a single mixture transcriptome m with n genes is represented as the product of H and f (that is, m = H × f). Here, H is an n × c cell type expression matrix consisting of expression profiles for the same n genes across c distinct cell types, and f is a vector of size c consisting of cell type mixing proportions.
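As an illustration of this linear model, the following minimal sketch builds a toy mixture transcriptome from a random expression matrix. The matrix sizes, the random values, and the proportions in f are assumptions chosen only for the example and do not come from any data set.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes, n_cell_types = 6, 3  # toy sizes for n and c (assumed)

# H: n x c matrix of cell type expression profiles in non-log space.
H = rng.uniform(low=1.0, high=100.0, size=(n_genes, n_cell_types))

# f: mixing proportions for one sample (assumed values, non-negative, sum to one).
f = np.array([0.5, 0.3, 0.2])

# m: the mixture transcriptome, modeled as the product of H and f.
m = H @ f

# The same quantity written as a proportion-weighted sum of the columns of H.
m_explicit = sum(f[i] * H[:, i] for i in range(n_cell_types))
```

Each entry of m is therefore a weighted average of that gene's expression across the c cell types, with the weights given by f.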
To infer cell type abundance using this linear model within CIBERSORTx, let M be an n × k matrix with n genes and k mixture GEPs, let matrix B be a subset of H containing discriminatory marker genes for each of the c cell subsets (i.e., a signature or basis matrix15,74,75), and let M’ be the subset of M that contains the same marker genes as B. Given M’ and B, the following equation can then be used to impute F, a c × k fractional abundance matrix with columns [f1,f2,…,fk]:
$$B \times F_{\bullet,j} = M'_{\bullet,j}, \qquad 1 \le j \le k$$
where Fi,j ≥ 0 for all i, j, the system is overdetermined (i.e., n > c), and expression data in M’ and B are represented in non-log linear space76. (Note that Mi,• and M•,j denote row i and column j of matrix M, respectively.) Many methods either normalize F or impose an additional constraint on F such that, for each mixture sample, the inferred mixing coefficients sum to one, allowing F to be directly interpreted as cell type proportions (with respect to the cell subsets in B)3. We previously introduced CIBERSORT as a method to estimate F using an implementation of ν-support vector regression, a machine learning technique that is robust to noise, unknown mixture content, and collinearity among cell type reference profiles15. CIBERSORT was used to impute F in this work, and within this imputation workflow, the batch correction scheme described below was used for all cross-platform analyses, unless stated otherwise (Supplementary Table 1).
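The column-wise estimation of F can be sketched as follows. This is not the CIBERSORT implementation: it is a minimal illustration that assumes scikit-learn's `NuSVR` with a linear kernel, a fixed ν of 0.5, and a simple post hoc step that sets negative coefficients to zero and rescales each column to sum to one. The function name `estimate_fractions`, the toy matrices, and the noise level are hypothetical, and real expression data would likely require scaling that is omitted here.

```python
import numpy as np
from sklearn.svm import NuSVR


def estimate_fractions(B, M_sub, nu=0.5):
    """Estimate a c x k matrix F so that B @ F[:, j] approximates M_sub[:, j].

    B     : genes x c signature matrix (non-log space)
    M_sub : genes x k mixture matrix restricted to the same genes as B
    nu    : nu parameter of the support vector regression (assumed value)
    """
    n_cell_types = B.shape[1]
    n_mixtures = M_sub.shape[1]
    F = np.zeros((n_cell_types, n_mixtures))

    for j in range(n_mixtures):
        # Genes act as observations and cell types as features.
        model = NuSVR(kernel="linear", nu=nu, C=1.0)
        model.fit(B, M_sub[:, j])

        # With a linear kernel, coef_ holds one regression weight per cell type.
        weights = model.coef_.ravel()

        # Enforce non-negativity, then normalize so the column sums to one.
        weights = np.clip(weights, 0.0, None)
        total = weights.sum()
        if total > 0:
            F[:, j] = weights / total

    return F


# Toy data (assumed): 40 marker genes, 3 cell types, 2 mixtures.
rng = np.random.default_rng(0)
B = rng.uniform(0.0, 10.0, size=(40, 3))
F_true = np.array([[0.6, 0.1],
                   [0.3, 0.2],
                   [0.1, 0.7]])
M_sub = B @ F_true + rng.normal(0.0, 0.05, size=(40, 2))

F_hat = estimate_fractions(B, M_sub)
```

Because each column of F is fitted independently, the k mixture samples can be processed in any order or in parallel, and the normalization step is what allows each column to be read as proportions relative to the cell subsets in B.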