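As described previously, for a given cell, a gene can be defined as on (i.e. a positive et value is recorded) or as off (i.e. the gene is undetected and its et value is zero). To simplify our model, we denote by $v_{ig}$ the indicator variable equal to one if gene $g$ is expressed in well $i$ and zero otherwise, and by $y_{ig}$ the corresponding expression level. Following classical statistical conventions, we use upper case to denote random variables and lower case to denote the values taken by these random variables. Using this notation, we introduce the following model of single-cell expression:

$$
(Y_{ig} \mid V_{ig} = v_{ig}) \sim
\begin{cases}
\delta_0 & \text{if } v_{ig} = 0, \\
\text{log-Normal}(\mu_g, \sigma_g^2) & \text{if } v_{ig} = 1,
\end{cases}
\qquad V_{ig} \sim \text{Bernoulli}(\pi_g),
$$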
where $\delta_0$ denotes a point mass at zero, $\mu_g$ and $\sigma_g^2$ are the log-scale mean and variance expression-level parameters conditional on the gene being expressed (i.e. the indicator variable equals one), and $\pi_g$ is the frequency of expression of gene $g$ across all cells. In the datasets considered here, the frequency of expression varies greatly across genes, from 0 to 0.99, with a median value of ∼0.1 (see Supplementary Fig. S1). Assuming a log-normal model for the expression level is equivalent to modeling its logarithm as normally distributed. The empirical distribution of the data (Fig. 1 and Supplementary Figs S8–S10) motivates our selection of a log-normal distribution and follows the observations of previous authors (Bengtsson et al., 2005).
Thus, for a particular gene $g$, three parameters characterize the expression distribution: $\mu_g$ and $\sigma_g$, the mean and standard deviation of the log-scale expression when the gene is expressed, and $\pi_g$, the Bernoulli probability of expression.
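To make the three-parameter description concrete, the following is a minimal sketch, not the authors' implementation. It simulates one gene from a point mass at zero mixed with a log-normal distribution, then recovers the Bernoulli probability of expression and the mean and standard deviation of the log expression. The function names `simulate_gene` and `estimate_parameters`, the use of the natural logarithm, and the parameter values are illustrative assumptions; only the expression frequency of 0.1 echoes the median reported in the text.

```python
import math
import random
import statistics


def simulate_gene(n_cells, pi, mu, sigma, rng):
    """Draw expression values for one gene (hypothetical helper).

    Each cell is "on" with probability pi; when on, the value is
    log-normal with log-scale mean mu and standard deviation sigma,
    otherwise it is exactly zero (the point mass at zero).
    """
    values = []
    for _ in range(n_cells):
        expressed = rng.random() < pi  # Bernoulli(pi) indicator
        if expressed:
            values.append(rng.lognormvariate(mu, sigma))
        else:
            values.append(0.0)
    return values


def estimate_parameters(values):
    """Estimate (pi, mu, sigma) for one gene (hypothetical helper).

    pi is the fraction of positive values; mu and sigma are the sample
    mean and sample standard deviation of the natural log of the
    positive values. Returns NaN for mu and sigma when fewer than two
    cells express the gene.
    """
    positive = [y for y in values if y > 0]
    pi_hat = len(positive) / len(values)
    if len(positive) < 2:
        return pi_hat, float("nan"), float("nan")
    logs = [math.log(y) for y in positive]
    return pi_hat, statistics.mean(logs), statistics.stdev(logs)


# Illustrative values only (assumptions, not taken from the source data).
rng = random.Random(0)
values = simulate_gene(5000, pi=0.1, mu=3.0, sigma=0.5, rng=rng)
pi_hat, mu_hat, sigma_hat = estimate_parameters(values)
print(f"pi={pi_hat:.3f} mu={mu_hat:.3f} sigma={sigma_hat:.3f}")
```

Estimating the frequency from all cells and the location and scale from the expressing cells only mirrors the conditional structure of the model: the log-normal component is defined only when the gene is on.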
Histogram and theoretical (normal) distribution of