To find regions exhibiting consistent differences between groups of samples, taking biological variation into account, we compute a signal-to-noise statistic similar to the
t-test. Specifically, we denote individuals with
i and use
Xi do denote group; for example,
Xi=0 if the
ith sample is a control and
Xi=1 if a case. The number of controls is denoted
n1 and the number of cases
n2. We assume that the samples are biological replicates within a group. Similar to the previous section, we denote the number of reads for the
ith sample associated with the
jth CpG being methylated and unmethylated with
Mi,j and
Ui,j, respectively. We assume that
Yi,j follows a binomial distribution with
Mi,j+
Ui,j trials and success probability π
i,j, which we assume is a sample-specific smooth function of genomic location
lj: π
i,j=
fi(
lj). Furthermore, we assume that
fi has the form
fi(
lj)=α(
lj)+β(
lj)
Xi+ε
i,j. Here α(
lj) represents the baseline methylation profile and β(
lj) the true difference between the two groups. The latter is the function of interest, with non-zero values associated with DMRs. The ε
i,j s represent biological variability with the location-dependent variance var(ε
i,j)≡σ
2(
j) assumed to be a smooth function. Note that increasing coverage does not reduce the variability introduced by ε; for this we need to increase the number of biological replicates.
We use the smoothed methylation profiles described in the previous section as estimates for the fi, denoted
. We estimate α and β as empirical averages and difference of averages:
and
. To estimate the smooth location-dependent standard deviation, we first compute the empirical standard deviation across the two groups. To improve precision, we used an approach similar to [30 (
link)]: we floored these standard deviations at their 75th percentile. To further improve precision, we smoothed the resulting floored values using a running mean with a window size of 101. We denote this final estimate of local variation with
. We then formed signal-to-noise statistics:
. To find DMRs, that is, regions for which β(
lj)≠0, we defined groups of consecutive CpGs for which all
t(
lj)>
c or
t(
lj)<-
c with
c>0 a cutoff selected based on the marginal empirical distribution of
t. We adapted our algorithm so that CpGs further than 300 bp apart were not permitted to be in the same DMR.
We recommend including in the procedure only CpGs that have some coverage in most or all samples. Furthermore, we recommend filtering the set of DMRs by requiring each DMR to contain at least three CpGs, have an average β of 0.1 or greater, and have at least one CpG every 300 bp.
Hansen K.D., Langmead B, & Irizarry R.A. (2012). BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 13(10), R83.