A model was trained separately for each of the four temporal-spatial groups of enhancers (CS16, CS23, F2F, and F2O). The positive set for each group contains the human embryonic enhancers of that group. As the negative training set of the DL model, we collected DHS sites from non–CNS-related, nonembryonic tissues in the Roadmap Epigenomics project (55) that do not overlap the positive sets. We used DHS sites that do not overlap embryonic neocortex H3K27ac peaks as negative control regions because our aim is to identify tissue-specific enhancers of the embryonic neocortex, and DHS sites are a good representation of active chromatin. Because DHS sites generally overlap H3K27ac peaks, they make a stringent control. This choice is analogous to that of DeepSEA, whose negative set consists of genomic regions that do not overlap the positive set and have at least one TF binding event; those regions broadly overlap DHS regions.
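The negative-set construction described above amounts to an interval filter: keep a DHS site only if it overlaps no positive enhancer. The following is a minimal sketch of that filter, not the authors' pipeline. The function name `filter_negatives`, the `(chrom, start, end)` tuple layout, and the half-open BED-style coordinates are assumptions made for illustration.

```python
from collections import defaultdict


def filter_negatives(dhs_sites, positive_enhancers):
    """Keep DHS sites that overlap no positive enhancer.

    Both inputs are iterables of (chrom, start, end) tuples using
    half-open coordinates (an assumption; the text does not specify).
    """
    # Group positive intervals by chromosome so each DHS site is only
    # compared against intervals on its own chromosome.
    positives_by_chrom = defaultdict(list)
    for chrom, start, end in positive_enhancers:
        positives_by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in dhs_sites:
        # Two half-open intervals overlap when each starts before the
        # other ends.
        overlaps = any(
            start < p_end and p_start < end
            for p_start, p_end in positives_by_chrom.get(chrom, ())
        )
        if not overlaps:
            kept.append((chrom, start, end))
    return kept
```

The pairwise comparison is quadratic in the worst case. For genome-scale inputs, a sorted sweep or an interval-tree library would be the more practical choice.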
Training and testing sets were split by chromosome. Chromosomes 8 and 9 were held out from training and used to test prediction performance. Chromosome 6 was used as the validation set, and the remaining autosomes were used for training. Each training sample consists of a 1000-bp sequence (and its reverse complement) from the human GRCh37 (hg19) reference genome. A larger DLM score for a genomic sequence corresponds to a higher propensity to be an active enhancer. Genomic sequences with a DLM score ≥ 0.197 (FPR ≤ 0.1) are predicted to be active enhancers. We used the difference in DLM score induced by a human-macaque single-nucleotide mutation to estimate its impact on enhancer activity.
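The chromosome split, the reverse-complement augmentation, the activity threshold, and the single-mutation score difference can be sketched as follows. This is an illustrative sketch under stated assumptions, not the published code. All function names are hypothetical, `score_fn` stands in for the trained model, and introducing the human allele into the macaque background is an assumption carried over from the combinatorial analysis described in the next paragraph. Chromosomes other than autosomes return `None` here because the text does not say how they were handled.

```python
TEST_CHROMS = {"chr8", "chr9"}          # held out for testing
VALIDATION_CHROMS = {"chr6"}            # used for validation
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}
ACTIVE_THRESHOLD = 0.197                # DLM score cutoff (FPR <= 0.1)

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def assign_split(chrom):
    """Map a chromosome name to 'test', 'validation', 'train', or None."""
    if chrom in TEST_CHROMS:
        return "test"
    if chrom in VALIDATION_CHROMS:
        return "validation"
    if chrom in AUTOSOMES:
        return "train"
    return None  # assumption: non-autosomes are left unassigned


def reverse_complement(seq):
    """Reverse complement of a DNA string; other symbols (e.g. N) pass through."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_predicted_active(score):
    """Apply the score threshold stated in the text."""
    return score >= ACTIVE_THRESHOLD


def mutation_effect(score_fn, background_seq, pos, allele):
    """Score change caused by substituting one allele at a 0-based position."""
    mutated = background_seq[:pos] + allele + background_seq[pos + 1:]
    return score_fn(mutated) - score_fn(background_seq)
```

A positive value from `mutation_effect` would indicate that the substitution raises the predicted propensity to be an active enhancer.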
Given a human (hg19) or macaque (rheMac2) enhancer, we used liftOver (56) to identify its ortholog. Only reciprocal counterparts whose lengths differ by no more than 50 bp were considered ortholog pairs. For a human sequence with n mutations relative to its macaque ortholog, we scored the impact of combinations of m (m < n) mutations on enhancer activity as follows. If the total number of combinations (n choose m) is no more than 10,000, all possible combinations of m human alleles at the human-macaque mutation sites were introduced into the macaque ortholog; otherwise, we randomly sampled 10,000 combinations of m human alleles from the human-macaque mutation sites and introduced them into the macaque ortholog. The change in DLM score caused by each set of introduced human mutations was used to estimate its impact on enhancer activity.
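The enumerate-or-sample rule for mutation combinations can be sketched as below. This is a minimal illustration, not the authors' implementation. It assumes that mutations are single-nucleotide substitutions given as `(position, human_allele)` pairs in the coordinates of the macaque sequence, that sampled combinations are distinct (the text does not say whether sampling was with replacement), and that `score_fn` stands in for the trained model. All names are hypothetical.

```python
import itertools
import math
import random

MAX_COMBINATIONS = 10_000  # cap stated in the text


def mutation_subsets(sites, m, rng=None):
    """Return all m-subsets of sites, or 10,000 random distinct ones."""
    rng = rng or random.Random(0)
    if math.comb(len(sites), m) <= MAX_COMBINATIONS:
        return list(itertools.combinations(sites, m))
    # Too many subsets to enumerate: draw distinct subsets at random.
    # Sorting makes each subset canonical so duplicates are detected.
    seen = set()
    while len(seen) < MAX_COMBINATIONS:
        seen.add(tuple(sorted(rng.sample(sites, m))))
    return list(seen)


def apply_alleles(macaque_seq, subset):
    """Introduce the human alleles in subset into the macaque sequence."""
    seq = list(macaque_seq)
    for pos, allele in subset:
        seq[pos] = allele
    return "".join(seq)


def combination_effects(score_fn, macaque_seq, sites, m, rng=None):
    """Map each tested subset to its score change versus the macaque baseline."""
    baseline = score_fn(macaque_seq)
    return {
        subset: score_fn(apply_alleles(macaque_seq, subset)) - baseline
        for subset in mutation_subsets(sites, m, rng)
    }
```

Passing a seeded `random.Random` instance as `rng` keeps the sampled combinations reproducible across runs.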
We applied the same convolutional neural network architecture to build a HepG2 enhancer classifier, with enhancers defined as H3K27ac peaks centered on DNase peaks. We then used the HepG2 DLM to evaluate allele-specific effects on enhancer activity using raQTLs (52).