where Yl, Sl, S, are the transcript counts of endogenous RNA in cell lysate, spike-in transcript counts in cell lysate and the spike-in transcript counts added into the cell lysate. The first subscript in all variables (here and below) corresponds to cell while the second subscript corresponds to gene index. Note that we are not able to directly observe
, the true transcript counts for gene j in cell i and thus α is an unknown variable.
The RNA molecules and spike-in transcripts will then be subjected to reverse transcription and amplified to make a cDNA library. The expected number of cDNA molecules generated from each RNA molecules is denoted by θ. The cDNA counts can be written:
where Yd, Sd, are the cDNA counts of endogenous RNA, spike-in cDNA counts successfully converted from the corresponding transcript counts Yl, Sl in cell lysate under a uniform capture rate θ, which for current protocols is less than 1.
Our model generates sequencing reads from the cDNA. The relative cDNA abundances are calculated as
for endogenous RNA, or
for spike-in RNA.
The model then generates γ reads per cDNA molecule on average; with sufficient sequencing, γ will be larger than 1; we expect each cDNA molecule to generate at least one sequencing read. This process can be regarded as a multinomial sampling of R reads
from the distribution of relative cDNA abundances mentioned above which can be represented as:
where
, denotes the reads sampled for cDNA from the endogenous RNA or spike- in RNA in cell i,
denotes the reads sampled for cDNA from the endogenous RNA j or spike-in RNA j in cell i.
The model described here is essentially a special case of the model in Kim et al., and differs mainly in that their model describes transcript-level capture rates and sequencing rates with beta and gamma distributions, respectively. In contrast, we simply use global constants for these rates. As Census does not make use of variance estimates from the generative model, this simpler model is sufficient for calculating key statistics (e.g. mode of the transcript counts) needed to convert relative to absolute abundances.