Enhancers were mined from four sources:

Ensembl enhancers and promoter flanks from the version 82 regulatory build (19), based on datasets from ENCODE (10) and Roadmap Epigenomics (40).

The FANTOM5 ‘permissive enhancers’ dataset from the Transcribed Enhancer Atlas (22).

Human enhancers from the VISTA Enhancer Browser, accessed on 7 April 2016. This set includes elements that show consistent cross-tissue reporter expression patterns in replicates (positive enhancers), as well as elements with weaker evidence (negative enhancers) (15). The latter are non-coding regions whose sequence or epigenome signatures suggest functionality, but which failed in vivo validation in mouse. Because of their small count (846), their inclusion has only a negligible effect on our analyses. Moreover, these sequences may well be active at embryonic time points other than those examined by VISTA, and are hence worthy of inclusion.

ENCODE proximal and distal enhancer regions (46 datasets), provided to ENCODE by the Zhiping Weng Lab, UMass (Supplementary Table S5) (10). Here, enhancer prediction relied on identifying DNase hypersensitivity regions and histone H3K27 acetylation signals (http://zlab-annotations.umassmed.edu/enhancers/methods).

Data were processed differently for each source. All datasets were converted to BED format and, apart from the Ensembl dataset (which was already in the latest genome build), subsequently converted to hg38 with CrossMap (41) and the UCSC Genome Browser (42) chain file. In some cases, an enhancer was split into several sequences in the new genome build. If the total length of the intervals between the split sequences was 2% or less of the total length of all the sequences combined, the sequences were treated as a single enhancer. Otherwise, the original enhancer was excluded from further analyses.

For Ensembl, FANTOM5 and VISTA enhancers, we used data that had already been unified by the sources across all tissues and cell lines. For the ENCODE dataset, enhancer elements were reported only separately for 46 cell lines and tissue types, and these data often showed strong overlaps (e.g. Supplementary Figure S1). To use all sources uniformly, we pre-processed the ENCODE data by performing an across-tissue unification similar to that done by the other sources. The coverage of each nucleotide was computed with BEDtools version 2.25.0 (43). Every contiguous region with coverage of at least 2 was defined as an ENCODE enhancer, giving a redundancy level comparable to that of the other sources (Table 1).
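
The two interval rules described above (the 2% gap rule for enhancers split during conversion to hg38, and the coverage threshold used to unify the ENCODE data) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study, which relied on CrossMap and BEDtools. The function names, the tuple layout, and the choice to discard enhancers whose pieces land on different chromosomes are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (chrom, start, end) with BED-style 0-based, half-open coordinates.
Interval = Tuple[str, int, int]


def rejoin_split_enhancer(pieces: List[Interval],
                          max_gap_fraction: float = 0.02) -> Optional[Interval]:
    """Rejoin the pieces of one enhancer that was split by genome conversion.

    Returns a single spanning interval if the summed gaps between pieces are
    at most `max_gap_fraction` of the summed piece lengths, otherwise None
    (meaning the enhancer is dropped).
    """
    if not pieces:
        return None
    if len({chrom for chrom, _, _ in pieces}) != 1:
        return None  # assumption: pieces on different chromosomes are discarded
    ordered = sorted(pieces, key=lambda p: p[1])
    total_len = sum(end - start for _, start, end in ordered)
    gap_len = sum(max(0, nxt[1] - cur[2]) for cur, nxt in zip(ordered, ordered[1:]))
    if gap_len <= max_gap_fraction * total_len:
        return (ordered[0][0], ordered[0][1], max(end for _, _, end in ordered))
    return None


def regions_with_min_coverage(intervals: List[Interval],
                              min_coverage: int = 2) -> List[Interval]:
    """Return contiguous regions covered by at least `min_coverage` intervals."""
    # Net change in depth at each position, per chromosome.
    deltas: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in intervals:
        deltas[chrom][start] += 1
        deltas[chrom][end] -= 1

    regions: List[Interval] = []
    for chrom in sorted(deltas):
        depth = 0
        region_start: Optional[int] = None
        for pos in sorted(deltas[chrom]):
            depth += deltas[chrom][pos]
            if depth >= min_coverage and region_start is None:
                region_start = pos
            elif depth < min_coverage and region_start is not None:
                regions.append((chrom, region_start, pos))
                region_start = None
    return regions
```

Summing the depth changes per position before scanning means that two covered regions that touch end to start are reported as one contiguous region rather than being split at the shared boundary.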

GeneHancer content

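| Enhancer source | Total number of elements | Mean length (bp) | SD length | Total genome coverage (bp) | Total genome coverage (%) | PMID |
| --- | --- | --- | --- | --- | --- | --- |
| Ensembl | 213 260 | 1080 | 1337 | 2.30E+08 | 7.18 | 25887522 |
| FANTOM | 42 979 | 289 | 163 | 1.24E+07 | 0.387 | 24670763 |
| VISTA | 1746 | 1784 | 1002 | 3.09E+06 | 0.0964 | 17130149 |
| ENCODE (a) | 176 154 | 1644 | 2071 | 2.90E+08 | 9.02 | 22955616 |
| All sources combined | 434 139 | 1233 | 1672 | 3.98E+08 | 12.4 | This study |
| GeneHancer | 284 834 | 1397 | 1934 | 3.98E+08 | 12.4 | This study |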

Basic statistics of the enhancer entities mined by GeneHancer from four sources, along with the integrated candidate enhancers. The ‘All sources combined’ row describes the combination of all mined enhancer elements before applying the GeneHancer unification algorithm.

(a) Data in the ENCODE row represent 1 742 514 original enhancer elements, which underwent pre-processing (see Materials and methods).

For the clustering procedure, enhancer elements from all of the above sources were used to define candidate enhancers. Overlaps between any number of enhancer elements from different sources were identified using BEDtools. Groups of overlapping enhancer elements were then defined as candidate enhancers; a candidate enhancer’s start and end positions are the lowest start and the highest end positions within its group of enhancer elements. A similar procedure was used for comparison to a validation dataset from EnhancerAtlas (39). EnhancerAtlas data, ∼2.5 M enhancer elements reported separately in 105 tissues/cells, were downloaded from the EnhancerAtlas website, accessed on 12 January 2017. DENdb data, ∼3.5 M enhancer elements reported separately in 15 cell lines, were downloaded from the DENdb website, accessed on 15 December 2016.
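
The grouping step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BEDtools-based procedure used here: the `cluster_elements` name, the dictionary layout, and the rule that two elements overlap when they share at least one base (half-open coordinates) are all hypothetical choices.

```python
from typing import Dict, List, Tuple

# (chrom, start, end, source) with BED-style 0-based, half-open coordinates.
Element = Tuple[str, int, int, str]


def cluster_elements(elements: List[Element]) -> List[Dict]:
    """Group overlapping elements into candidate enhancers.

    Each candidate spans from the lowest start to the highest end of its
    group and records which sources contributed to it.
    """
    clusters: List[Dict] = []
    for chrom, start, end, source in sorted(elements, key=lambda e: (e[0], e[1], e[2])):
        last = clusters[-1] if clusters else None
        # Assumption: overlap means sharing at least one base on the same chromosome.
        if last is not None and last["chrom"] == chrom and start < last["end"]:
            last["end"] = max(last["end"], end)
            last["sources"].add(source)
        else:
            clusters.append({"chrom": chrom, "start": start, "end": end,
                             "sources": {source}})
    return clusters


example = [
    ("chr1", 100, 200, "Ensembl"),
    ("chr1", 150, 300, "ENCODE"),
    ("chr1", 290, 350, "FANTOM5"),
    ("chr1", 500, 600, "VISTA"),
    ("chr2", 100, 200, "Ensembl"),
]
candidates = cluster_elements(example)
```

Because the elements are sorted by start position, chains of elements that overlap only through an intermediate element (the first and third entries in `example`) end up in the same candidate enhancer.
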
To estimate the significance of the pairwise overlaps among enhancer sources, the numbers of overlapping and non-overlapping regions were computed for each source pair with the BEDtools fisher function, taking into account the size of the human genome. A two-sided P-value was then calculated using the Fisher's Exact Test Calculator for 2x2 Contingency Tables (http://research.microsoft.com/en-us/um/redmond/projects/mscompbio/fisherexacttest/). As the P-value was very low, the reported value is an upper bound of the true value. Additionally, we used the same methodology to test whether our clustered enhancers overlapped significantly with conserved regions from UCNE (a database of ultra-conserved non-coding elements) (44). All other analyses estimating the significance of pairwise overlaps were performed similarly.
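
For readers unfamiliar with the test, the following is a minimal standard-library sketch of a two-sided Fisher's exact test on a 2x2 contingency table. It is not the calculator used in this study, and it does not reproduce how BEDtools derives the four counts from the interval files and the genome size; the function name and argument order are assumptions. The two-sided P-value here sums the probabilities of all tables, with the same margins, that are no more likely than the observed one.

```python
from math import exp, lgamma


def _log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Hypothetical reading for enhancer overlaps: a = regions in both sources,
    b = only in the first, c = only in the second, d = in neither.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x: int) -> float:
        # Hypergeometric log-probability of seeing x in the top-left cell.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(n, col1)

    observed = log_p(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = 0.0
    for x in range(lowest, highest + 1):
        lp = log_p(x)
        if lp <= observed + 1e-7:  # small tolerance for floating-point ties
            total += exp(lp)
    return min(1.0, total)


p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

Working in log space avoids overflow in the binomial coefficients, but with genome-scale counts the sum can still underflow to 0.0 in double precision, which is consistent with reporting only an upper bound on the true P-value.
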