In sequencing-based ribosome footprinting, the RF read count is naturally confounded by mRNA abundance (
Fig. 1A). We seek a strategy to compare RF measurements taking mRNA abundance into account in order to accurately discern the translation effect in case–control experiments. We model the vector of RNA-Seq and RF read counts
and
, respectively, for gene
i with Negative Binomial (NB) distributions, as described before (for instance, Love
et al., 2014 (
link); Drewe
et al., 2013 (
link); Robinson
et al., 2010 (
link)):
where μ
i is the expected count and κ
i is the estimated dispersion across biological replicates. Here
yi denotes the observed counts normalized by the library size factor (
Supplementary Section A). Formulating the problem as a generalized linear model (GLM) with the logarithm as link function, we can express expectations on read counts as a function of latent quantities related to
mRNA abundance β
C in the two conditions (
), a quantity
that relates mRNA abundance to RNA-Seq read counts, a quantity
that relates mRNA abundance to RF read counts and a quantity
that captures the effect of the treatment on translation. In particular, the expected RNA-Seq read count
is given by the equation
.
We assume that transcription and translation are successive cellular processing steps and that abundances are linearly related. The expected RF read count,
, is given by
. A key point to note is that
is revealed to be a shared parameter between the expressions governing the expected RNA-Seq and RF counts. It can be considered to be a proxy for shared transcriptional/translation activity under condition
C in this context. Then,
indicates the deviation from that activity under condition
C, with
for
C = 0 and free otherwise (See
Supplementary Section B for more details).
Fitting the GLM consists of learning the parameters β
i and dispersions κ
i given mRNA and RF counts for the two conditions
. We perform alternating optimization of the parameters β
i given dispersions κ
i and the dispersion parameters κ
i given β
i, similar to the EM algorithm (
Supplementary Sections B and C):
As experimental procedures for measuring mRNA counts and RF counts differ, we enable the estimating of separate dispersion parameters for the data sources of RNA-Seq and RF profiling to account for different characteristics (
Supplementary Section E).
As in Anders
et al. (2012) (
link), with raw dispersions estimated from previous steps, we regress all κ
i given the mean counts to obtain a mean-dispersion relationship
. We perform empirical Bayes shrinkage (Love
et al., 2014 (
link)) to shrink κ
i towards
to stabilize estimates (see
Supplementary Section D). The proposed model in RiboDiff with a joint dispersion estimate is conceptually identical to using the following GLM design matrix
(for instance, in conjunction with edgeR or DESeq1/2).
In a treatment/control setting, we can then evaluate whether a treatment (
C = 1) has a significant differential effect on translation efficiency compared to the control (
C = 0). This is equivalent to determining whether the parameter
differs significantly from 0 and whether the relationship denoted by the dashed arrow in
Figure 1A is needed or not. We can compute significance levels based on the
distribution by analyzing
-likelihood ratios of the Null model (
) and the alternative model (
).