Emulsion PCR and sequencing was performed using the standard SOLiD 2 system for 35 bp reads. Raw color-space data was mapped to the human genome (hg18) using corona-light (ABI). The sequence tag start and strand was imported into a custom R-language data structure for analysis. To correct for minor differences in the sequencing depth between specimens, the total number of tags was normalized to 106 for each specimen by dividing each tag weight by the total number of tags and multiplying this by 106. To calculate the STAMP signal, each tag was extended to a distribution of lengths modelling the DNA fragmentation pattern (Supporting Information S1). It is necessary to track Watson and Crick DNA mapped strands to determine the direction in which the mapped end should be extended during STAMP analysis. These tag densities were then summed to generate a methylation signal (Fig. 2). We used this approach to calculate a STAMP signal surrounding all TSS, TTS, at CpGs interrogated by the Illumina HumanMethylation27 microarray and for 15,000 randomly selected genomic loci. The composite methylation profiles at the TSS and TTS are determined by the superposition of all enriched fragments mapping near the TSS. Individual fragments contribute little to this compound signal and the profiles are insensitive to the fragmentation profile of the DNA.
To assess noise, we calculated a composite STAMP signal from the sequence tags within random 15 kb windows for all samples. We defined the mean STAMP signal density within these windows as the noise floor (NF). Using this approach, we found that NF was uniformly 0.114. The NF estimate was independent of the sample and is approximately equal to one sequence tag per kb when the total number of sequence tags per data set is scaled to 106.
To identify DMEs, we first chose all regions with STAMP signal greater than a threshold value. Then the flanks of those regions were extended until the signal declined to 4× NF. To minimize false discovery of DMEs, the detection threshold value was chosen to ensure that the number of DMEs identified in unenriched DNA was less than 5% of that identified in an His-MBD enriched sample from the same source (Supporting Information S1).
Comparison of DNA methylation across genomic regions with varying CG density was performed by classifying each region as either methylated or unmethylated (lowest 10% of sequence tags for regions with similar sequence content) prior to the generation of contingency tables. Fisher's exact test for count data was used to assess the likelihood of any genic element being methylated if any other element is methylated. This approach largely uncouples the analysis from CG density because the element class assignment is insensitive to the magnitude of the STAMP signal. We also calculated the CpG density (CGf), GC fraction (GCf) and CGoe ratio (CGoe = CGf/(Gf * Cf) in sliding windows of 200 bp tiled every 10 bp across the entire human genome. We then identified the fraction of each genic element that could be classified as HCP, ICP or LCP as defined by Weber et al [37] (link), and the fraction of the element detectable (CGf>0.1) by STAMP. Analyses performed using subsets of the genome restricted by these various sequence classes had minimal effect on the results and did not alter the interpretation. Data presented in Figs. 2 and 5 was analysed for the STAMP detectable portion of each genic element.
To generate density plots of CGoe vs GC fraction (Fig. 4), we first analyzed the sequence characteristics of a 200 bp window surrounding each of the ∼28 million CGs in the human genome. We then performed similar analyses for each CG within an annotated CGI and within each of the DMEs we identified. This analysis was performed using custom written tools written in R, utilizing Bioconductor packages BSgenome and IRanges [60] .
Free full text: Click here