In sequencing-based ribosome footprinting, the RF read count is naturally confounded by mRNA abundance (Fig. 1A). We seek a strategy to compare RF measurements taking mRNA abundance into account in order to accurately discern the translation effect in case–control experiments. We model the vector of RNA-Seq and RF read counts ymRNAi and yRFi , respectively, for gene i with Negative Binomial (NB) distributions, as described before (for instance, Love et al., 2014 (link); Drewe et al., 2013 (link); Robinson et al., 2010 (link)): yiNB(μi,κi), where μi is the expected count and κi is the estimated dispersion across biological replicates. Here yi denotes the observed counts normalized by the library size factor (Supplementary Section A). Formulating the problem as a generalized linear model (GLM) with the logarithm as link function, we can express expectations on read counts as a function of latent quantities related to mRNA abundance βC in the two conditions ( C={0,1} ), a quantity βRNA that relates mRNA abundance to RNA-Seq read counts, a quantity βRF that relates mRNA abundance to RF read counts and a quantity βΔ,C that captures the effect of the treatment on translation. In particular, the expected RNA-Seq read count μmRNA,Ci is given by the equation log(μmRNA,Ci)=βCi+βRNAi .
We assume that transcription and translation are successive cellular processing steps and that abundances are linearly related. The expected RF read count, μRF,Ci , is given by log(μRF,Ci)=βCi+βRFi+βΔ,Ci . A key point to note is that βCi is revealed to be a shared parameter between the expressions governing the expected RNA-Seq and RF counts. It can be considered to be a proxy for shared transcriptional/translation activity under condition C in this context. Then, βΔ,Ci indicates the deviation from that activity under condition C, with βΔ,Ci=0 for C = 0 and free otherwise (See Supplementary Section B for more details).
Fitting the GLM consists of learning the parameters βi and dispersions κi given mRNA and RF counts for the two conditions C={0,1} . We perform alternating optimization of the parameters βi given dispersions κi and the dispersion parameters κi given βi, similar to the EM algorithm (Supplementary Sections B and C):
βi=argmaxβiglm(βi|yi,κi)andκi=argmaxκiNB(κi|yi,μi).
As experimental procedures for measuring mRNA counts and RF counts differ, we enable the estimating of separate dispersion parameters for the data sources of RNA-Seq and RF profiling to account for different characteristics (Supplementary Section E).
As in Anders et al. (2012) (link), with raw dispersions estimated from previous steps, we regress all κi given the mean counts to obtain a mean-dispersion relationship f(μ)=λ1/μ+λ0 . We perform empirical Bayes shrinkage (Love et al., 2014 (link)) to shrink κi towards f(μ) to stabilize estimates (see Supplementary Section D). The proposed model in RiboDiff with a joint dispersion estimate is conceptually identical to using the following GLM design matrix protocol+condition+condition:protocol (for instance, in conjunction with edgeR or DESeq1/2).
In a treatment/control setting, we can then evaluate whether a treatment (C = 1) has a significant differential effect on translation efficiency compared to the control (C = 0). This is equivalent to determining whether the parameter βΔ,1 differs significantly from 0 and whether the relationship denoted by the dashed arrow in Figure 1A is needed or not. We can compute significance levels based on the χ2 distribution by analyzing log -likelihood ratios of the Null model ( βΔ,1i=0 ) and the alternative model ( βΔ,1i=0 ).
Free full text: Click here