Before applying ChromHMM, we binarized the normalized signal tracks at 200-bp resolution. Each mark was represented in each 200-bp interval by its maximum signal within that interval. The binarization threshold for each mark was the larger of 4.0 and the signal value corresponding to a Poisson tail probability of 0.0001. Requiring this fold threshold in addition to the tail-probability threshold enabled more meaningful binarization of some of the most deeply sequenced datasets. We excluded regions associated with repetitive elements, such as α- and β-satellite repeats and ribosomal and mitochondrial DNA (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityRegionsExcludable.bed.gz).
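The binarization rule can be illustrated with a minimal sketch, which is not the code used in the analysis. Several details are assumptions made for illustration: the Poisson background mean (`lam`) is a hypothetical parameter, the tail value is taken as the smallest integer k with P(X ≥ k) ≤ p, the signal and the Poisson value are treated as being on the same scale, and an interval is called present when its maximum is at or above the threshold. All function and variable names are hypothetical.

```python
import math


def poisson_tail_threshold(lam, p=1e-4):
    """Smallest integer k with P(X >= k) <= p for X ~ Poisson(lam).

    Assumption: the tail is defined inclusively (X >= k).
    """
    pmf = math.exp(-lam)   # P(X = 0)
    cdf_below = 0.0        # P(X < k), starting at k = 0
    k = 0
    while 1.0 - cdf_below > p:
        cdf_below += pmf   # now P(X < k + 1)
        k += 1
        pmf *= lam / k     # P(X = k) from P(X = k - 1)
    return k


def binarize(signal, lam, bin_size=200, fold_floor=4.0, p=1e-4):
    """Binarize a per-base signal track at `bin_size` resolution.

    Each interval is represented by its maximum signal, and the threshold
    is the larger of `fold_floor` and the Poisson tail value.
    Assumption: a maximum equal to the threshold counts as present.
    """
    threshold = max(fold_floor, poisson_tail_threshold(lam, p))
    calls = []
    for start in range(0, len(signal), bin_size):
        bin_max = max(signal[start:start + bin_size])
        calls.append(1 if bin_max >= threshold else 0)
    return calls


# Toy track of three 200-bp intervals with maxima 0.0, 8.0 and 5.0.
signal = [0.0] * 200 + [1.0] * 199 + [8.0] + [5.0] * 200

# With lam = 1.0 the Poisson value (7) exceeds 4.0 and sets the threshold.
calls_poisson = binarize(signal, lam=1.0)

# With lam = 0.01 the Poisson value (2) falls below 4.0, so 4.0 applies.
calls_floor = binarize(signal, lam=0.01)
```

In this toy example the third interval (maximum 5.0) is called present only when the 4.0 floor is the binding threshold, which shows how the two criteria interact.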
For Segway, we excluded the ENCODE Data Analysis Consortium Blacklisted Regions (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz), a comprehensive set of regions in the human genome that exhibit anomalous or unstructured read counts in next-generation sequencing experiments, independent of cell line and type of experiment. To identify these regions, we used 80 open chromatin tracks (DNase and FAIRE datasets) and 20 ChIP-seq input/control tracks, together spanning ∼60 human tissue types/cell lines. The regions tend to have a very high ratio of multi-mapping to uniquely mapping reads and high variance in mappability. Some of these regions overlap pathological repeat elements such as satellite, centromeric and telomeric repeats. However, simple mappability-based filters do not account for most of these regions.
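Excluding a set of BED regions amounts to subtracting intervals from each chromosome. The following minimal sketch shows one way to do this and is not the procedure used by Segway or in the analysis. It assumes standard BED coordinates (0-based, half-open). The toy coordinates, region names, chromosome length and function names are all hypothetical.

```python
def parse_bed(lines):
    """Collect (start, end) intervals per chromosome from BED-formatted lines."""
    regions = {}
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def retained_intervals(excluded, chrom_length):
    """Return the parts of [0, chrom_length) not covered by any excluded interval."""
    kept = []
    cursor = 0  # first position not yet classified as kept or excluded
    for start, end in sorted(excluded):
        # Clip each excluded interval to the chromosome bounds.
        start = min(max(start, 0), chrom_length)
        end = min(end, chrom_length)
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)  # also handles overlapping exclusions
    if cursor < chrom_length:
        kept.append((cursor, chrom_length))
    return kept


# Toy exclusion list; coordinates and names are made up for illustration.
bed_lines = [
    "chr1\t100\t300\tSatellite_repeat",
    "chr1\t250\t400\tLow_mappability_island",
    "chr1\t900\t1000\ttelomeric",
]
regions = parse_bed(bed_lines)

# Hypothetical chromosome length of 1000 bp.
kept = retained_intervals(regions["chr1"], chrom_length=1000)
```

Sorting the excluded intervals and advancing a single cursor merges overlapping exclusions implicitly, so the input file does not need to be pre-merged.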