Iterative mapping and error correction of the chromatin interaction data were performed as previously described (ref. 29). Supplementary Table 1 summarizes the mapping results and lists the different categories of DNA molecules encountered in the libraries. We obtained around 70 million valid pairs representing chromatin interactions per replicate. The frequency of redundant read pairs, attributable to PCR amplification, was found to be below about 5%, and these redundant read pairs were removed. The number of Hi-C interactions mapped to sequences belonging to homologous chromosomes (comprising both intra-chromosomal [cis] and inter-homolog [trans] interactions) was much higher than the number mapped to non-homologous chromosomes (inter-chromosomal [trans] interactions). Assuming that inter-homolog interactions (trans) are as frequent as non-homologous inter-chromosomal interactions (trans), we estimate that 80–90% of the interactions mapped to the same chromosome are intra-chromosomal (cis) interactions, with the estimate higher for the DC mutant (90%) than for wild-type (> 85%). Whether this difference reflects a biological phenomenon or technical differences is currently not known. Conversion of the interaction data into Z-scores eliminates this difference (see below).
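The cis estimate above can be illustrated with a minimal sketch. It is not the authors' code. The function name, its arguments and in particular the diploid bookkeeping (that the count for a non-homologous chromosome pair pools four homolog-homolog pairings, whereas the inter-homolog class corresponds to one) are assumptions introduced for illustration; differences in chromosome length are ignored.

```python
def estimate_cis_fraction(same_chrom_pairs, between_chrom_pairs, homolog_pairings=4):
    """Estimate the fraction of same-chromosome read pairs that are true cis contacts.

    same_chrom_pairs    -- read pairs with both ends mapped to one chromosome
                           (cis plus inter-homolog trans, pooled together).
    between_chrom_pairs -- read-pair counts between that chromosome and each
                           non-homologous partner chromosome.
    homolog_pairings    -- ASSUMPTION (not from the source): in a diploid, a
                           non-homologous chromosome pair pools 4 homolog
                           pairings, while inter-homolog contacts are 1 pairing.
    """
    if same_chrom_pairs <= 0 or not between_chrom_pairs:
        raise ValueError("need a positive same-chromosome count and at least one partner")

    # Average trans signal per non-homologous chromosome pair.
    mean_between = sum(between_chrom_pairs) / len(between_chrom_pairs)

    # Source assumption: inter-homolog contacts are as frequent as
    # non-homologous inter-chromosomal contacts, per homolog pairing.
    est_inter_homolog = min(mean_between / homolog_pairings, same_chrom_pairs)

    return 1.0 - est_inter_homolog / same_chrom_pairs


# Hypothetical counts, chosen only to show the arithmetic.
print(estimate_cis_fraction(1000, [400, 400]))
```

With these made-up counts, 100 of the 1,000 same-chromosome read pairs would be attributed to inter-homolog contacts and the remainder to cis contacts.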
The data were binned into non-overlapping genomic intervals of both 10 kb and 50 kb. Binned data were normalized for intrinsic biases, such as differences in the number of restriction fragments within bins, using the previously developed ICE method (ref. 29). To normalize for differences in read depth between datasets, we summed the entire genome-wide binned ICE-corrected interaction matrix, excluding the diagonal (x = y) bins. We then expressed each interaction as a fraction of this matrix sum (minus the diagonal x = y bins) and multiplied each fraction by 10^6. Biological replicates were highly correlated (Pearson's correlation coefficients > 0.98 for 50 kb binned data, excluding short-range interactions up to 50 kb). The correlations between biological replicates were higher than those between wild-type and DC mutant. Overall, these numbers indicate that the modified Hi-C procedure was reproducible and performed as expected. For most analyses, sequence reads obtained for biological replicates were pooled and ICE-corrected as described above to create a combined replicate dataset.
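The read-depth normalization and the replicate comparison described above can be sketched as follows. This is a minimal pure-Python illustration on small nested lists, not the pipeline used in the study. The function names are hypothetical, and the `min_bin_separation` parameter is an assumed way of excluding short-range interactions; the source does not state how that exclusion was implemented.

```python
import math


def normalize_read_depth(matrix, scale=1e6):
    """Scale a square, ICE-corrected matrix so off-diagonal entries sum to `scale`."""
    n = len(matrix)
    # Sum of the whole matrix, excluding the diagonal (x = y) bins.
    off_diag_sum = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    if off_diag_sum == 0:
        raise ValueError("matrix has no off-diagonal signal")
    # Each interaction becomes a fraction of that sum, multiplied by `scale`.
    return [[value / off_diag_sum * scale for value in row] for row in matrix]


def pearson_between_replicates(rep_a, rep_b, min_bin_separation=2):
    """Pearson correlation over upper-triangle bins at least
    `min_bin_separation` bins apart (ASSUMED definition of 'short-range')."""
    n = len(rep_a)
    xs, ys = [], []
    for i in range(n):
        for j in range(i + min_bin_separation, n):
            xs.append(rep_a[i][j])
            ys.append(rep_b[i][j])
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy symmetric matrix (hypothetical values).
toy = [[5, 1, 3], [1, 7, 2], [3, 2, 9]]
normalized = normalize_read_depth(toy)
```

Because every matrix is rescaled to the same off-diagonal total, datasets sequenced to different depths become directly comparable before replicates are correlated.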
At 10 kb resolution, very long-range interactions are not sampled deeply enough to provide reliable data. Therefore, we truncated the 10 kb binned data to include only cis interaction pairs separated by 4 Mb or less in linear genomic distance. This distance cutoff was chosen based on the observation that, beyond this point, both the wild-type and DC mutant datasets have no observed reads in more than 50% of bin-bin interactions. In addition to limiting the dynamic range of interaction counts at these large distances, this high frequency of unsampled interactions beyond 4 Mb causes a dramatic collapse in the standard deviation of the overall chromatin interaction decay over distance, making the LOWESS expected-value and Z-score calculations beyond 4 Mb unreliable. For 50 kb bins, all distances were included in the analyses, because the coverage of cis interaction pairs never dropped below 50% at any distance at this resolution.
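The coverage criterion behind the distance cutoff can be sketched as below: for each genomic distance (matrix diagonal), compute the fraction of bin-bin interactions with no observed reads, and stop at the first distance where that fraction exceeds 50%. The function names and the toy single-chromosome matrix are illustrative assumptions, not the code used in the study.

```python
def unsampled_fraction_by_distance(matrix):
    """For each diagonal offset d, return the fraction of bin pairs with zero reads."""
    n = len(matrix)
    fractions = []
    for d in range(n):
        values = [matrix[i][i + d] for i in range(n - d)]
        zeros = sum(1 for v in values if v == 0)
        fractions.append(zeros / len(values))
    return fractions


def max_reliable_distance(matrix, bin_size_bp, max_unsampled=0.5):
    """Largest distance (bp) before the first diagonal where more than
    `max_unsampled` of the bin pairs are unsampled. Returns None if even
    the main diagonal fails the criterion."""
    fractions = unsampled_fraction_by_distance(matrix)
    for d, fraction in enumerate(fractions):
        if fraction > max_unsampled:
            return (d - 1) * bin_size_bp if d > 0 else None
    # Coverage never dropped below the threshold: keep all distances.
    return (len(fractions) - 1) * bin_size_bp


# Toy cis matrix for one chromosome at 10 kb bins (hypothetical counts).
toy_cis = [
    [9, 5, 2, 0, 0],
    [5, 9, 5, 1, 0],
    [2, 5, 9, 5, 2],
    [0, 1, 5, 9, 5],
    [0, 0, 2, 5, 9],
]
cutoff_bp = max_reliable_distance(toy_cis, bin_size_bp=10_000)
```

Applied to a matrix in which coverage stays above the threshold at every distance, as reported here for the 50 kb bins, the same function returns the full distance range.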