Integrative Enhancer Annotation and Clustering Protocol

Enhancers were mined from four sources:

Ensembl enhancers and promoter flanks from the version 82 regulatory build (19), based on datasets from ENCODE (10) and Roadmap Epigenomics (40).

FANTOM5 ‘permissive enhancers’ dataset from the Transcribed Enhancer Atlas (22).

Human enhancers from the VISTA Enhancer Browser, accessed on 7 April 2016. This set includes elements that show consistent cross-tissue reporter expression patterns in replicates (positive enhancers), as well as elements with weaker evidence (negative enhancers) (15). The latter are non-coding regions whose sequence or epigenomic signatures suggest functionality but that failed in vivo validation in mouse. Because there are only 846 of them, their inclusion has a negligible effect on our analyses. These sequences may also be active at embryonic time points other than those examined by VISTA, which makes them worth including.

ENCODE proximal and distal enhancer regions (46 datasets) provided to ENCODE by the Zhiping Weng Lab, UMass (Supplementary Table S5) (10). Here, enhancer prediction relied on identifying DNase hypersensitivity regions and histone H3K27 acetylation signals (http://zlab-annotations.umassmed.edu/enhancers/methods).

Data were processed differently for each source. All datasets were converted to BED format and, apart from the Ensembl dataset (which was already in the latest genome build), subsequently mapped to hg38 with CrossMap (41) and the UCSC Genome Browser (42) chain file. In some cases, an enhancer was split into several sequences in the new genome build. If the total length of the intervals between the split sequences was 2% or less of the total length of all the sequences combined, the sequences were treated as a single enhancer. Otherwise, the original enhancer was not used in further analyses.

For Ensembl, FANTOM5 and VISTA enhancers, we used data that had already been unified by the sources across all tissues and cell lines. For the ENCODE dataset, enhancer elements were reported only separately for 46 cell lines and tissue types, and such data often showed strong overlaps (e.g. Supplementary Figure S1). To make the use of sources uniform, we pre-processed the ENCODE data by performing an across-tissue unification similar to that done by the other sources. The coverage for each nucleotide was computed with BEDtools version 2.25.0 (43). Every contiguous region with coverage of at least 2 was defined as an ENCODE enhancer, giving a redundancy level comparable to that of the other sources (Table 1).
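
The 2% rule for enhancers that were split during conversion to hg38 can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `rejoin_split_enhancer`, the BED-style half-open coordinates, and the choice to discard enhancers whose fragments land on different chromosomes are assumptions made for illustration, and the fragments are assumed not to overlap one another.

```python
from typing import List, Optional, Tuple

# (chromosome, start, end) in BED-style half-open coordinates (assumption).
Interval = Tuple[str, int, int]


def rejoin_split_enhancer(
    fragments: List[Interval], max_gap_fraction: float = 0.02
) -> Optional[Interval]:
    """Rejoin the fragments of one enhancer after genome-build conversion.

    Returns a single spanning interval when the summed gaps between
    fragments are at most `max_gap_fraction` of the summed fragment
    lengths, and None when the enhancer should be dropped.
    """
    if not fragments:
        return None
    chroms = {chrom for chrom, _, _ in fragments}
    if len(chroms) != 1:
        # Assumption: fragments mapped to different chromosomes are dropped.
        return None

    ordered = sorted(fragments, key=lambda f: (f[1], f[2]))
    total_length = sum(end - start for _, start, end in ordered)
    # Sum of the intervals between consecutive fragments.
    gap_length = sum(
        max(0, nxt[1] - cur[2]) for cur, nxt in zip(ordered, ordered[1:])
    )

    if gap_length <= max_gap_fraction * total_length:
        chrom = ordered[0][0]
        return (chrom, ordered[0][1], max(end for _, _, end in ordered))
    return None
```

For example, two 500 bp fragments separated by 5 bp (0.5% of the combined length) would be rejoined, whereas the same fragments separated by 100 bp (10%) would be discarded.
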
Table 1

GeneHancer content

Enhancer source	Total number of elements	Mean length (bp)	SD length (bp)	Total genome coverage (bp)	Total genome coverage (%)	PMID
Ensembl	213 260	1080	1337	2.30E+08	7.18	25887522
FANTOM	42 979	289	163	1.24E+07	0.387	24670763
VISTA	1746	1784	1002	3.09E+06	0.0964	17130149
ENCODE^a	176 154	1644	2071	2.90E+08	9.02	22955616
All sources combined	434 139	1233	1672	3.98E+08	12.4	This study
GeneHancer	284 834	1397	1934	3.98E+08	12.4	This study

Basic statistics of the enhancer entities mined by GeneHancer from four sources, along with the integrated candidate enhancers. The ‘All sources combined’ row describes the combination of all mined enhancer elements before applying the GeneHancer unification algorithm.

^a Data in the ENCODE row represent 1 742 514 original enhancer elements, which underwent pre-processing (see Materials and methods).
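
The ENCODE pre-processing step (per-nucleotide coverage, then keeping every contiguous region covered at least twice) can be sketched in plain Python. The authors used BEDtools for this; the sweep-line function below is only a standard-library stand-in, and the name `regions_with_min_coverage` and the half-open coordinate convention are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (chromosome, start, end) in BED-style half-open coordinates (assumption).
Interval = Tuple[str, int, int]


def regions_with_min_coverage(
    intervals: Iterable[Interval], min_coverage: int = 2
) -> List[Interval]:
    """Return contiguous regions covered by at least `min_coverage` intervals."""
    # Net change in coverage at each position, per chromosome.
    deltas: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in intervals:
        deltas[chrom][start] += 1
        deltas[chrom][end] -= 1

    regions: List[Interval] = []
    for chrom in sorted(deltas):
        depth = 0
        region_start = None
        for pos in sorted(deltas[chrom]):
            depth += deltas[chrom][pos]
            if depth >= min_coverage and region_start is None:
                region_start = pos  # coverage just reached the threshold
            elif depth < min_coverage and region_start is not None:
                regions.append((chrom, region_start, pos))
                region_start = None
    return regions
```

Summing the coverage changes per position before scanning keeps a region contiguous when one interval ends exactly where another begins while a third spans both.
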

For the clustering procedure, enhancer elements from all of the above sources were used to define candidate enhancers. Overlaps between any number of enhancers from different sources were examined using BEDtools. Groups of overlapping enhancer elements were then defined as candidate enhancers; a candidate enhancer’s start and end positions are the lowest start and highest end positions within its group of enhancer elements. A similar procedure was used for comparison with a validation dataset from EnhancerAtlas (39). EnhancerAtlas data, ∼2.5 M enhancer elements reported separately in 105 tissues/cells, were downloaded from the EnhancerAtlas website, accessed on 12 January 2017. DENdb data, ∼3.5 M enhancer elements reported separately in 15 cell lines, were downloaded from the DENdb website, accessed on 15 December 2016.
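
The clustering step can be sketched as a single pass over sorted elements. This is a hedged illustration, not the BEDtools-based procedure the authors ran: the function name `cluster_elements`, the dictionary layout, and the requirement of at least 1 bp of overlap (so that merely adjacent elements stay separate) are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

# (chromosome, start, end, source) in half-open coordinates (assumption).
Element = Tuple[str, int, int, str]


def cluster_elements(elements: Iterable[Element]) -> List[Dict]:
    """Group overlapping elements into candidate enhancers.

    Each candidate spans the lowest start to the highest end of its
    group and records which sources contributed to it.
    """
    clusters: List[Dict] = []
    for chrom, start, end, source in sorted(elements):
        last = clusters[-1] if clusters else None
        if last is not None and last["chrom"] == chrom and start < last["end"]:
            # Overlaps the current group: extend it and record the source.
            last["end"] = max(last["end"], end)
            last["sources"].add(source)
        else:
            clusters.append(
                {"chrom": chrom, "start": start, "end": end, "sources": {source}}
            )
    return clusters
```

Because the elements are sorted by chromosome and start, the first element of each group supplies the lowest start, and the running maximum supplies the highest end.
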
To estimate the significance of the pairwise overlaps among enhancer sources, the numbers of overlapping and non-overlapping regions were computed for each source pair, taking into account the size of the human genome. We used the BEDtools fisher function. A two-sided P-value was calculated using Fisher's Exact Test Calculator for 2x2 Contingency Tables (http://research.microsoft.com/en-us/um/redmond/projects/mscompbio/fisherexacttest/). As the P-value was very low, the reported value is an upper bound on the true value. Additionally, we used the same methodology to test whether our clustered enhancers overlapped significantly with conserved regions from UCNE (a database of ultra-conserved non-coding elements) (44). All other analyses estimating the significance of pairwise overlaps were performed similarly.
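
For readers unfamiliar with the test itself, the following is a minimal standard-library sketch of a two-sided Fisher's exact test on a 2x2 contingency table. It is a stand-in for the tools named above, not a reproduction of them: the function name `fisher_exact_two_sided` is hypothetical, and how the four cells are derived from interval counts and genome size is not shown.

```python
from math import exp, lgamma


def _log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_prob(x: int) -> float:
        # Log probability of observing x in the top-left cell.
        return (
            _log_comb(row1, x)
            + _log_comb(row2, col1 - x)
            - _log_comb(total, col1)
        )

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = log_prob(a)
    tolerance = 1e-7  # guards against floating-point ties

    p_value = 0.0
    for x in range(lowest, highest + 1):
        lp = log_prob(x)
        if lp <= observed + tolerance:
            p_value += exp(lp)
    return min(1.0, p_value)
```

Working in log space avoids overflow in the binomial coefficients, but for extremely small P-values `exp` can still underflow to zero, which is one practical reason a tool may report only an upper bound.
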


Fishilevich S., Nudel R., Rappaport N., Hadar R., Plaschkes I., Iny Stein T., Rosen N., Kohn A., Twik M., Safran M., Lancet D., & Cohen D. (2017). GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database: The Journal of Biological Databases and Curation, 2017, bax028.

Publication 2017

Corresponding Organization : Weizmann Institute of Science

Protocol cited in 25 other protocols

Variable analysis

independent variables

Sources of enhancers mined: Ensembl, FANTOM5, VISTA, ENCODE

dependent variables

Not explicitly mentioned

control variables

Not explicitly mentioned

controls

Positive controls: None mentioned
Negative controls: None mentioned
