Census is motivated by a generative model of single-cell (sc) RNA-Seq similar to the one developed by Kim et al.47 . When performing sc-RNA-seq, each individual cell is lysed to recover its endogenous RNA molecules, some fraction of which may be degraded or lost. Lysis thus involves an RNA recovery rate α. Spike-in transcripts are then added into the cell lysate. Note that spike-in transcripts are added to the lysate as naked RNA, and thus may be degraded at different rates from the endogenous RNA. We denote the ladder recovery rate as β. The RNA counts in the lysate can be written:
Cell lysate:{YijlαiYijcSijlβiS.j, where Yl, Sl, S, are the transcript counts of endogenous RNA in cell lysate, spike-in transcript counts in cell lysate and the spike-in transcript counts added into the cell lysate. The first subscript in all variables (here and below) corresponds to cell while the second subscript corresponds to gene index. Note that we are not able to directly observe
Yijc , the true transcript counts for gene j in cell i and thus α is an unknown variable.
The RNA molecules and spike-in transcripts will then be subjected to reverse transcription and amplified to make a cDNA library. The expected number of cDNA molecules generated from each RNA molecules is denoted by θ. The cDNA counts can be written:
cDNA:{Yijd=YijlθiSijd=Sijlθi, where Yd, Sd, are the cDNA counts of endogenous RNA, spike-in cDNA counts successfully converted from the corresponding transcript counts Yl, Sl in cell lysate under a uniform capture rate θ, which for current protocols is less than 1.
Our model generates sequencing reads from the cDNA. The relative cDNA abundances are calculated as
Yijdj=1n(Yijd+Sijd) for endogenous RNA, or
Sijdj=1n(Yijd+Sijd) for spike-in RNA.
The model then generates γ reads per cDNA molecule on average; with sufficient sequencing, γ will be larger than 1; we expect each cDNA molecule to generate at least one sequencing read. This process can be regarded as a multinomial sampling of R reads
(Ri=γj=1n(Yijd+Sijd)) from the distribution of relative cDNA abundances mentioned above which can be represented as:
Read counts:{Yijrmultinomial(Yijdj=1n(Yijd+Sijd),Rie)S.jrmultinomial(Sijdj=1n(Yijd+Sijd),Ris), where
Rie,Ris , denotes the reads sampled for cDNA from the endogenous RNA or spike- in RNA in cell i,
Yijr,S.jr denotes the reads sampled for cDNA from the endogenous RNA j or spike-in RNA j in cell i.
The model described here is essentially a special case of the model in Kim et al., and differs mainly in that their model describes transcript-level capture rates and sequencing rates with beta and gamma distributions, respectively. In contrast, we simply use global constants for these rates. As Census does not make use of variance estimates from the generative model, this simpler model is sufficient for calculating key statistics (e.g. mode of the transcript counts) needed to convert relative to absolute abundances.