We built a deep convolutional neural network, referred to below as the deep learning model (DLM), to predict tissue-specific enhancer activity directly from the enhancer DNA sequence. The DLM comprises five convolution layers with 320, 320, 240, 240, and 480 kernels, respectively (table S14). Higher-level convolution layers receive input from larger genomic ranges and can therefore represent more complex patterns than the lower layers. The convolutional layers are followed by a fully connected layer with 180 neurons, which integrates information across the full 1000-bp input sequence. In total, the DLM has 3,631,401 trainable parameters. We implemented the model with the Python library Keras, version 2.4.0 (https://github.com/keras-team/keras).
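To illustrate why higher convolution layers see larger genomic ranges, the following minimal sketch computes the receptive field after each layer of a stacked 1D network. The kernel widths and pooling sizes are hypothetical placeholders chosen only for illustration; the actual values are in table S14 and are not reproduced here.

```python
from typing import List, Tuple


def receptive_fields(layers: List[Tuple[int, int]]) -> List[int]:
    """Return the receptive field (in bp) after each layer.

    Each layer is given as (kernel_width, stride). A pooling layer is
    expressed the same way, e.g. (4, 4) for a window of 4 with stride 4.
    """
    field = 1  # a single input position covers 1 bp
    jump = 1   # distance in bp between adjacent outputs of the current layer
    fields = []
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
        fields.append(field)
    return fields


# HYPOTHETICAL layout: five convolutions of width 8 with two pooling layers
# of width 4 in between. These numbers are assumptions, not the paper's.
layers = [(8, 1), (8, 1), (4, 4), (8, 1), (8, 1), (4, 4), (8, 1)]
print(receptive_fields(layers))  # prints [8, 15, 18, 46, 74, 86, 198]
```

Under these assumed sizes, a unit in the first convolution layer covers 8 bp, whereas a unit in the last one covers 198 bp, which is the sense in which deeper layers can represent longer and more complex sequence patterns.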
The model was trained separately for each of the four temporal-spatial groups of enhancers (CS16, CS23, F2F, and F2O). The positive sets contain the human embryonic enhancers of each group. The negative training set of the DLM consisted of DNase I hypersensitive sites (DHS) from non–CNS-related and nonembryonic tissues of the Roadmap Epigenomics project (55) that do not overlap the positive sets. We used DHS sites that do not overlap embryonic neocortex H3K27ac peaks as negative control regions because our aim is to identify tissue-specific enhancers of the embryonic neocortex, and DHS sites are a good representation of active chromatin. Because DHS sites in general overlap H3K27ac peaks, this is a stringent control. Our choice is analogous to that of DeepSEA, whose negative set consists of genomic regions that do not overlap the positive set and have at least one TF binding event; such regions broadly overlap DHS regions.
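The selection of negative regions can be sketched as a simple interval filter: keep each DHS site only if it overlaps no positive enhancer. This is a minimal illustration, assuming BED-style half-open coordinates; the function names are hypothetical, and a real pipeline would more likely use an interval tool such as `bedtools intersect -v`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open as in BED


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def filter_negatives(dhs: List[Interval],
                     positives: List[Interval]) -> List[Interval]:
    """Keep only the DHS sites that overlap no positive interval."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in positives:
        by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in dhs:
        hits = by_chrom.get(chrom, [])
        if not any(overlaps(start, end, s, e) for s, e in hits):
            kept.append((chrom, start, end))
    return kept


# Toy coordinates, invented for illustration only.
positives = [("chr1", 100, 200)]
dhs = [("chr1", 150, 250), ("chr1", 200, 300), ("chr2", 100, 200)]
negatives = filter_negatives(dhs, positives)
```

In the toy data, the first DHS site is dropped because it overlaps the positive region, while the site starting exactly at position 200 is kept because half-open intervals that merely touch do not overlap.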
Training and testing sets were split by chromosome. Chromosomes 8 and 9 were held out from training and used to test prediction performance. Chromosome 6 was used as the validation set, and the remaining autosomes were used for training. Each training sample consists of a 1000-bp sequence and its reverse complement from the human GRCh37 (hg19) reference genome. A larger DLM score for a genomic sequence corresponds to a higher propensity to be an active enhancer. Genomic sequences with a DLM score ≥ 0.197 (FPR ≤ 0.1) are predicted to be active enhancers. We used the change in DLM score induced by a human-macaque single-nucleotide mutation to estimate the impact of that mutation on enhancer activity.
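The chromosome-based split, the reverse-complement augmentation, and the score cutoff described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, reverse complements are added to the training split only, and sequences on non-autosomal chromosomes are skipped, since the text does not say how they were handled.

```python
from typing import Dict, List, Optional, Tuple

AUTOSOMES = {f"chr{i}" for i in range(1, 23)}
TEST_CHROMS = {"chr8", "chr9"}   # held out for testing (from the text)
VALID_CHROMS = {"chr6"}          # validation set (from the text)
THRESHOLD = 0.197                # DLM score cutoff at FPR <= 0.1 (from the text)

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def assign_split(chrom: str) -> Optional[str]:
    """Map a chromosome name to 'train', 'valid', 'test', or None."""
    if chrom not in AUTOSOMES:
        return None  # ASSUMPTION: non-autosomal sequences are skipped
    if chrom in TEST_CHROMS:
        return "test"
    if chrom in VALID_CHROMS:
        return "valid"
    return "train"


def build_splits(
    samples: List[Tuple[str, str, int]]
) -> Dict[str, List[Tuple[str, int]]]:
    """Split (chrom, sequence, label) samples by chromosome."""
    splits: Dict[str, List[Tuple[str, int]]] = {
        "train": [], "valid": [], "test": []
    }
    for chrom, seq, label in samples:
        split = assign_split(chrom)
        if split is None:
            continue
        splits[split].append((seq, label))
        if split == "train":
            # ASSUMPTION: reverse complements are added for training only
            splits[split].append((reverse_complement(seq), label))
    return splits


def is_predicted_active(score: float) -> bool:
    """Apply the DLM score cutoff for calling an active enhancer."""
    return score >= THRESHOLD
```

Splitting by whole chromosomes rather than by random sampling keeps overlapping or neighboring regions from appearing in both the training and the test set.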
Given a human (hg19) or macaque (rheMac2) enhancer, we used liftOver (56) to identify its ortholog. Only reciprocal counterparts whose lengths differ by no more than 50 bp were considered ortholog pairs. For a human sequence with n mutations relative to its macaque ortholog, we scored the impact of combinations of m (m < n) mutations on enhancer activity as follows. If the total number of combinations (n choose m) is no more than 10,000, all possible combinations of m human alleles at the human-macaque mutation sites were introduced into the macaque ortholog; otherwise, we randomly sampled 10,000 combinations of m human alleles from the human-macaque mutation sites and introduced them into the macaque ortholog. The change in DLM score caused by each set of introduced human mutations was used to estimate its impact on enhancer activity.
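The enumerate-or-sample procedure for mutation combinations can be sketched as below. Several things here are assumptions made for illustration: only single-nucleotide substitutions at positions already aligned to the macaque sequence are handled (no indels), sampled combinations are drawn without repeats, and `gc_fraction` is a toy stand-in for the trained DLM. All function names are hypothetical.

```python
import itertools
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

MAX_COMBINATIONS = 10_000  # cap stated in the text


def mutation_subsets(n: int, m: int, rng: random.Random,
                     cap: int = MAX_COMBINATIONS) -> List[Tuple[int, ...]]:
    """All m-subsets of n sites if C(n, m) <= cap, else `cap` random ones."""
    if math.comb(n, m) <= cap:
        return list(itertools.combinations(range(n), m))
    # ASSUMPTION: sampled subsets are distinct; the text does not specify.
    seen = set()
    while len(seen) < cap:
        seen.add(tuple(sorted(rng.sample(range(n), m))))
    return sorted(seen)


def introduce_alleles(macaque_seq: str,
                      sites: Sequence[Tuple[int, str]],
                      subset: Sequence[int]) -> str:
    """Place the human base at each chosen site of the macaque sequence."""
    seq = list(macaque_seq)
    for i in subset:
        pos, human_base = sites[i]
        seq[pos] = human_base
    return "".join(seq)


def score_changes(macaque_seq: str,
                  sites: Sequence[Tuple[int, str]],
                  m: int,
                  score_fn: Callable[[str], float],
                  seed: int = 0) -> Dict[Tuple[int, ...], float]:
    """Score change (mutated minus original) for each subset of m sites."""
    rng = random.Random(seed)
    base = score_fn(macaque_seq)
    return {
        subset: score_fn(introduce_alleles(macaque_seq, sites, subset)) - base
        for subset in mutation_subsets(len(sites), m, rng)
    }


def gc_fraction(seq: str) -> float:
    """Toy scoring function; a stand-in for the trained model's output."""
    return sum(base in "GCgc" for base in seq) / len(seq)


# Toy example: three substitution sites, scored in pairs (m = 2).
sites = [(0, "G"), (2, "C"), (3, "T")]  # (position, human base)
deltas = score_changes("AAAA", sites, 2, gc_fraction)
```

With n = 3 and m = 2 there are only three combinations, so all of them are enumerated; the sampling branch is taken only when n choose m exceeds the cap.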
We applied the same convolutional neural network architecture to build a classifier of HepG2 enhancers (H3K27ac peaks centered on DNase peaks). We then used the HepG2 DLM to evaluate allele-specific effects on enhancer activity using raQTLs (52).