ChIP–chip dataset. A number of recently developed assays use immunoprecipitation-based enrichment of cellular DNA to identify binding or other chemical events and the genomic locations at which they occur. Location analysis, also known as ChIP–chip, is a technique that maps transcription factor (TF) binding events to the genomic locations at which they occur [1,54]. The output of the assay is a fluorescence dye ratio at each spot of the array. If spots are taken to represent genomic regions, then the ratio and p-value associated with each spot can be regarded as an indication of TF binding in the corresponding genomic region. We applied DRIM to S. cerevisiae genome-wide location data reported in Harbison et al. [25] and Lee et al. [28]. The first dataset consists of the genomic occupancy of 203 putative TFs in rich media conditions (YPD). In addition, the genomic occupancy of 84 of these TFs was measured in at least one other condition (OC). In each experiment, the genomic sequences were ranked according to the TF binding p-value. Surprisingly, we observed that in 69 of the 203 ranked sequence lists of YPD, the sequences at the top of the list (first 300 sequences) were significantly longer than those in the rest of the list (t-test p-value ≤ 10⁻³). We observed a similar phenomenon in 76 of the 148 ranked sequence lists of OC experiments (see Figure S1). In other words, for some TFs, longer sequences are biased toward stronger binding signals. This observation is unexpected: although longer probes hybridize more labeled material than shorter probes, the increase should be proportional in both channels. Because this type of length bias may cause spurious results under our model assumptions, the final dataset, termed “Harbison filtered dataset,” comprises the remaining 207 experiments (135 YPD and 72 OC), covering 162 unique TFs, that did not show length bias (Table S1).
An additional ChIP–chip dataset was constructed from the data reported in Lee et al. [28], which contain 113 experiments in rich media. These data only partially overlap with those of Harbison et al. [25]. The same filtering procedure was applied, resulting in a set of 65 experiments, termed “Lee filtered dataset.”
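The length-bias filter applied to both datasets can be illustrated with a minimal sketch. It assumes each experiment is available as a list of sequence lengths ordered by TF binding p-value (strongest binding first). The function names, the use of Welch's unequal-variance t statistic, the one-sided alternative, and the normal approximation to the t distribution are all assumptions of this sketch; the text specifies only a t-test on the first 300 sequences versus the rest with a threshold of 10⁻³.

```python
import math
from statistics import mean, variance


def length_bias_pvalue(ranked_lengths, top_n=300):
    """One-sided p-value that the top_n sequences are longer than the rest.

    ranked_lengths: sequence lengths ordered by TF binding p-value,
    strongest binding first (hypothetical input format).
    Assumption: Welch's t statistic with a normal approximation to the
    t distribution, which is reasonable only when both groups are large.
    """
    top, rest = ranked_lengths[:top_n], ranked_lengths[top_n:]
    if len(top) < 2 or len(rest) < 2:
        raise ValueError("need at least two sequences in each group")
    # Standard error of the difference in means (unequal variances).
    se = math.sqrt(variance(top) / len(top) + variance(rest) / len(rest))
    if se == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(top) - mean(rest)) / se
    # Upper-tail probability of a standard normal at t.
    return 0.5 * math.erfc(t / math.sqrt(2))


def has_length_bias(ranked_lengths, top_n=300, alpha=1e-3):
    """Flag an experiment whose top-ranked sequences are significantly longer."""
    return length_bias_pvalue(ranked_lengths, top_n) <= alpha


# Synthetic, deterministic examples (not data from the paper).
unbiased = [400 + (i % 7) * 30 for i in range(3000)]
biased = ([1000 + (i % 7) * 30 for i in range(300)]
          + [400 + (i % 7) * 30 for i in range(2700)])

print("unbiased flagged:", has_length_bias(unbiased))
print("biased flagged:", has_length_bias(biased))
```

Experiments flagged by such a check would be excluded, and the remainder would form the filtered dataset. With an exact t distribution the p-values would differ slightly from the normal approximation used here, although for groups of several hundred sequences the difference is small.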
Methylated CpG dataset. A technique similar to ChIP–chip, termed methyl-DNA immunoprecipitation (mDIP), enables the measurement of methylated CpG island patterns [2,55]. The third dataset contains the CpG island methylation patterns of four different human cancer cell lines (Caco-2, Polyp, Carcinoma, PC3), with several replicate experiments performed for each cell line. In each of these experiments, the CpG methylation signal was measured in ∼13,000 gene promoters, as reported in [2].