TBA [19] alignments of the human genome (hg18) to 43 other vertebrate species were obtained from the UCSC genome browser [20], [21], together with a phylogenetic tree with the generally accepted topology (Fig S1) and neutral branch lengths estimated from 4-fold degenerate sites. Both the tree and the alignments were projected to the 34 mammalian species. The alignment was compressed to remove gaps in the human sequence, and GERP++ scores were computed for every position with at least 3 ungapped species present, or approximately 88.9% of the 3.08 billion positions on the 22 autosomes and the X and Y chromosomes. We used the HKY85 [13] model of evolution with the transition/transversion ratio set to 2.0 and nucleotide frequencies estimated from the multiple alignment.
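For illustration, the substitution model described above can be sketched as a rate matrix. This is a minimal sketch and not the GERP++ implementation: it assumes that the stated transition/transversion ratio of 2.0 plays the role of the HKY85 parameter usually written as kappa, and the function name `hky85_rate_matrix` and the example frequencies are hypothetical.

```python
def hky85_rate_matrix(freqs, kappa=2.0):
    """Build a 4x4 HKY85 rate matrix for bases in the order A, C, G, T.

    freqs: equilibrium nucleotide frequencies (A, C, G, T), summing to 1.
    kappa: relative rate of transitions to transversions (assumed to
           correspond to the ratio of 2.0 quoted in the text).
    The matrix is scaled to one expected substitution per unit time.
    """
    bases = "ACGT"
    purines = {"A", "G"}
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(bases):
        for j, y in enumerate(bases):
            if i == j:
                continue
            # A<->G and C<->T are transitions; everything else is a transversion.
            is_transition = (x in purines) == (y in purines)
            q[i][j] = freqs[j] * (kappa if is_transition else 1.0)
        # Diagonal entry makes each row sum to zero.
        q[i][i] = -sum(q[i])
    # Normalise so that the mean substitution rate at equilibrium is 1.
    rate = -sum(freqs[i] * q[i][i] for i in range(4))
    return [[value / rate for value in row] for row in q]


# Hypothetical frequencies; in the analysis they are estimated from the alignment.
example_q = hky85_rate_matrix([0.3, 0.2, 0.2, 0.3], kappa=2.0)
```

With this scaling, branch lengths can be read as expected substitutions per site, which matches how neutral branch lengths are usually expressed.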
To limit memory requirements and allow parallelization of the constrained element computation, each chromosome was broken up into regions of approximately 2 megabases, with long segments where no RS score was computed chosen as boundaries. These boundary segments contain no information usable by GERP++, and because the algorithm never annotates constrained elements spanning them, excluding them did not sacrifice any predictive ability. The boundary regions made up approximately 6.8% of the human genome, including a 30.2 megabase region that accounted for more than half of chromosome Y. Constrained element predictions were generated using default parameters and a 5% false positive cutoff measured in terms of number of predictions; the estimated nucleotide-level false positive rate was under 1%. As additional validation, we computed the overlap between our predictions and a set of ancestral repeats (L2) annotated by RepeatMasker. The overlap was in line with what we expected given our estimated false positive rates: about 5% of the repeats overlap a predicted CE, with around 1.6% nucleotide-level overlap.
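The partitioning step can be sketched as a greedy pass over the unscored segments of a chromosome. The text does not specify how boundaries were chosen beyond "long segments where no RS score was computed", so the greedy rule, the function name `split_at_unscored_gaps`, and the `min_gap` threshold below are assumptions made purely for illustration.

```python
def split_at_unscored_gaps(chrom_len, gaps, target=2_000_000, min_gap=1_000):
    """Split a chromosome into regions of roughly `target` bases.

    chrom_len: chromosome length in bases.
    gaps: sorted, non-overlapping half-open (start, end) runs of positions
          with no RS score.
    min_gap: hypothetical minimum length for a run to qualify as a boundary.
    Returns half-open (start, end) regions; the chosen boundary runs are
    excluded, since no constrained element can span them.
    """
    regions = []
    region_start = 0
    for gap_start, gap_end in gaps:
        long_enough = gap_end - gap_start >= min_gap
        region_big_enough = gap_start - region_start >= target
        if long_enough and region_big_enough:
            # Close the current region at the gap and resume after it.
            regions.append((region_start, gap_start))
            region_start = gap_end
    if region_start < chrom_len:
        regions.append((region_start, chrom_len))
    return regions


# Toy example with made-up coordinates.
toy_gaps = [(500_000, 500_100), (2_100_000, 2_105_000),
            (2_200_000, 2_210_000), (4_300_000, 4_302_000)]
toy_regions = split_at_unscored_gaps(5_000_000, toy_gaps)
```

Because each region ends at an unscored run, the regions can be processed independently and in parallel without changing which elements are found.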
Gene, noncoding RNA, and phastCons conserved element annotations were obtained from the UCSC genome browser's [20], [21] Known Genes [22], RNA Genes, and Conservation [4] tracks, respectively. To avoid skewed statistics due to alternative splicing, gene annotations were resolved to a consistent nonoverlapping set in which any segment belonging to multiple conflicting annotations was assigned a single annotation in the following order of priority: coding exon, 5′ UTR, 3′ UTR, intron. For meaningful comparison against phastCons, separate GERP++ scores and constrained elements were generated according to the same procedure as above but using only placental mammal data (ignoring platypus and opossum in the alignment and projecting them out of the phylogenetic tree).
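The priority rule for conflicting gene annotations can be sketched as follows. This is one possible way to implement it, not the procedure actually used: intervals are assumed to be half-open and on a single chromosome, and the function name `resolve_annotations` and the label strings are hypothetical.

```python
# Priority order taken from the text: earlier labels win over later ones.
PRIORITY = ["coding exon", "5' UTR", "3' UTR", "intron"]


def resolve_annotations(features):
    """Collapse overlapping (start, end, label) features into a
    nonoverlapping list, keeping the highest-priority label per segment."""
    rank = {label: i for i, label in enumerate(PRIORITY)}
    # Every feature start or end is a potential change point.
    points = sorted({p for start, end, _ in features for p in (start, end)})
    resolved = []
    for left, right in zip(points, points[1:]):
        labels = [lab for start, end, lab in features
                  if start <= left and right <= end]
        if not labels:
            continue  # no annotation covers this segment
        best = min(labels, key=rank.__getitem__)
        if resolved and resolved[-1][1] == left and resolved[-1][2] == best:
            # Extend the previous segment instead of emitting a fragment.
            resolved[-1] = (resolved[-1][0], right, best)
        else:
            resolved.append((left, right, best))
    return resolved


# Toy example: an intron overlapped by an exon and a UTR from other transcripts.
toy = resolve_annotations([(100, 500, "intron"),
                           (200, 300, "coding exon"),
                           (250, 400, "3' UTR")])
```

The scan over all features for each segment is quadratic in the worst case; a sweep with an active set would scale better on genome-wide input, but the simple form keeps the rule easy to check.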
PolII binding regions were defined as 50 bp upstream and downstream of PolII binding ‘peaks’ as identified from ChIP-seq experiments performed by the ENCODE Consortium [3]. The resulting 100 bp window captures the likely PolII binding site and its flanking sequence. We obtained data from nine ChIP-seq experiments conducted in two labs (the Snyder lab at Yale and the Myers lab at Hudson Alpha) on six cell types. Data were downloaded through the DCC at UCSC (ftp://encodeftp.cse.ucsc.edu). All data have passed publication embargo periods. Overlap statistics were calculated as described above for other annotation sets and averaged across all nine experiments.
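The window construction and the averaging across experiments can be sketched as below. The exact overlap statistic is not restated in this section, so the sketch assumes a simple one (the fraction of windows sharing at least one base with a constrained element); all function names and coordinates are hypothetical, and intervals are half-open on a single chromosome.

```python
def peak_windows(peaks, flank=50):
    """Turn peak positions into windows extending `flank` bp on each side."""
    return [(max(0, p - flank), p + flank) for p in peaks]


def fraction_overlapping(windows, elements):
    """Fraction of windows that share at least one base with any element."""
    if not windows:
        return 0.0
    hits = sum(
        any(w_start < e_end and e_start < w_end for e_start, e_end in elements)
        for w_start, w_end in windows
    )
    return hits / len(windows)


def mean_overlap(experiments, elements, flank=50):
    """Average the per-experiment overlap fraction across experiments."""
    fractions = [fraction_overlapping(peak_windows(peaks, flank), elements)
                 for peaks in experiments]
    return sum(fractions) / len(fractions)


# Toy data: two experiments and two constrained elements (made-up coordinates).
toy_elements = [(1000, 1100), (5000, 5050)]
toy_experiments = [[1040, 3000], [5060, 1120]]
toy_mean = mean_overlap(toy_experiments, toy_elements)
```

Averaging per-experiment fractions gives each experiment equal weight regardless of how many peaks it reports; pooling all peaks first would instead weight experiments by peak count.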