The fractional abundance matrix F can be determined for the bulk GEP matrix M, either through expression deconvolution as described above (refs. 2, 3) or with prior empirical knowledge of the compositional representation of cell types within the bulk specimen (e.g., by an automated hematology analyzer, or by flow cytometry; ref. 25). Once F is determined for a given M, a representative imputed GEP for each cell type in F can be estimated by solving the following system of linear equations:
$$H_{i,\bullet} \times F = M_{i,\bullet}, \quad 1 \le i \le n$$
where H is an n × c expression matrix of n genes and c cell types, $H_{i,j} \ge 0$ for all i, j, and F is defined as above with the constraint that relative cell fractions sum to one for each mixture sample. Like Equation 1 above, the system should be overdetermined, that is, it should contain more mixture samples than cell types (k > c), with a greater difference between k and c generally leading to improved GEP estimation (Fig. 3e, Supplementary Fig. 6). To ensure biologically realistic estimates of gene expression, we employ non-negative least squares regression (NNLS), an optimization framework that solves the least squares problem subject to non-negativity constraints. Although NNLS is robust on simple mixtures and toy examples, its performance on the more complex mixtures inherent in real tissue samples can be affected by noise, imprecision, and missing data in the linear system (ref. 15). We therefore developed a series of novel data normalization and filtering techniques to help mitigate these issues (Fig. 3b-d, see ‘Imputation of group-mode expression profiles’ in Supplementary Note 1).
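As an illustration of the per-gene NNLS step only, the following minimal sketch solves the linear system above on a noise-free synthetic mixture. It assumes M is stored as an n × k array and F as a c × k array, and it uses `scipy.optimize.nnls` as a generic NNLS solver. The function name `impute_cell_type_gep` and the simulated data are hypothetical, and the normalization and filtering steps described in Supplementary Note 1 are not reproduced here.

```python
import numpy as np
from scipy.optimize import nnls


def impute_cell_type_gep(M, F):
    """Estimate H (n genes x c cell types) from M (n x k) and F (c x k).

    For each gene i, solves H[i, :] @ F = M[i, :] in the least squares
    sense, subject to H[i, :] >= 0. Hypothetical helper for illustration.
    """
    M = np.asarray(M, dtype=float)
    F = np.asarray(F, dtype=float)
    n_genes, k = M.shape
    c, k_f = F.shape
    if k != k_f:
        raise ValueError("M and F must share the same number of mixture samples")
    if k <= c:
        raise ValueError("the system must be overdetermined (k > c)")

    A = F.T  # k x c design matrix, shared by every gene
    H = np.empty((n_genes, c))
    for i in range(n_genes):
        # nnls returns (solution, residual norm); only the solution is kept
        H[i], _ = nnls(A, M[i])
    return H


# Synthetic, noise-free example (assumed data, not from the study)
rng = np.random.default_rng(0)
n_genes, c, k = 50, 3, 12
H_true = rng.gamma(shape=2.0, scale=50.0, size=(n_genes, c))  # non-negative GEPs
F = rng.dirichlet(np.ones(c), size=k).T  # c x k, each column sums to one
M = H_true @ F  # bulk mixtures

H_est = impute_cell_type_gep(M, F)
max_abs_error = float(np.max(np.abs(H_est - H_true)))
```

Because each gene is fitted independently against the same design matrix, the loop can be parallelized across genes. With noisy or incomplete real data, the recovered profiles would be expected to deviate from the truth, which is the situation the additional normalization and filtering steps are intended to address.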