For each input dataset (or merged input for human) from ENCODE, the number of reads per mappable base and the number of multimapping reads per million reads are calculated for each 1 kb bin, with 100 bp overlap, across all chromosomes. The values across bins are then quantile normalized, and the value at the 50% quantile (the median across datasets) is selected as the standard value representing each bin. This threshold was chosen to avoid high-signal outliers from individual cell types (for example, from copy number variants) and to avoid low signal from failed or incorrectly labeled input datasets. A standard value is then flagged if it is in the top 0.1% of signal across the genome for either read depth or mappability. Neighboring regions are merged if they maintain a signal in the top 1% of all signal or if they have no signal because that part of the genome is not mappable, and any flagged regions within 20 kb of each other are combined. This generates contiguous regions of abnormal signal across the genome.
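The steps above can be sketched in Python with NumPy. This is an illustrative sketch, not the authors' implementation. The function names (`quantile_normalize`, `standard_values`, `flag_regions`), the per-dataset direction of the quantile normalization, the strict `>` comparison against each percentile cutoff, and the rule that a merged run must contain at least one top-0.1% bin are all assumptions made for the example. Bin coordinates are passed in explicitly so that no particular bin step is assumed.

```python
import numpy as np


def quantile_normalize(x):
    """Force every dataset (row) of x onto one shared distribution.

    x has shape (n_datasets, n_bins). The reference distribution is the
    mean of the sorted rows; ties are broken arbitrarily by argsort.
    """
    ranks = np.argsort(np.argsort(x, axis=1), axis=1)
    reference = np.sort(x, axis=1).mean(axis=0)
    return reference[ranks]


def standard_values(x):
    """Median (50% quantile) across datasets of the normalized signal."""
    return np.median(quantile_normalize(x), axis=0)


def flag_regions(depth, multimap, unmappable, starts, ends, max_gap=20_000):
    """Return merged [start, end] regions of abnormal signal.

    depth, multimap: per-bin standard values for one chromosome or genome.
    unmappable: boolean array, True where a bin has no mappable bases.
    starts, ends: bin coordinates in bp, sorted by position.
    """
    # Seeds: bins in the top 0.1% of either signal.
    seed = (depth > np.quantile(depth, 0.999)) | (
        multimap > np.quantile(multimap, 0.999)
    )
    # Bins that may extend a region: top 1% of either signal, or unmappable.
    bridge = (
        (depth > np.quantile(depth, 0.99))
        | (multimap > np.quantile(multimap, 0.99))
        | unmappable
    )

    # Collect runs of consecutive bridge bins that contain at least one seed.
    regions = []
    i, n = 0, len(depth)
    while i < n:
        if not bridge[i]:
            i += 1
            continue
        j, has_seed = i, False
        while j < n and bridge[j]:
            has_seed = has_seed or bool(seed[j])
            j += 1
        if has_seed:
            regions.append([int(starts[i]), int(ends[j - 1])])
        i = j

    # Combine regions separated by max_gap bp or less.
    merged = []
    for start, end in regions:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged
```

In this sketch, `standard_values` would be applied once to the read-depth matrix and once to the multimapping matrix, and the two resulting per-bin vectors passed to `flag_regions` together with the bin coordinates.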