Short read libraries were downloaded from the Short Read Archive [39] (SRX020777, SRX020781-6). Reads from the deep sequencing libraries were first stripped of the 3' adapter sequence using the FASTX toolkit [40]. Reads that were shorter than 13 nucleotides or contained an ambiguous nucleotide were discarded. The remaining reads were aligned to the human genome (hg19) with Bowtie [41], allowing up to two mismatches. Mapped locations were reported only for the optimal mismatch stratum of each read, up to a maximum of ten different locations. All T-to-C mismatches between a read and the genomic sequence were subtracted from the mismatch count at each mapped location. Only reads that mapped to a single genomic location with no mismatches after conversion subtraction were used for further analysis.
The location of a read relative to known transcripts was determined from the Ensembl database (release 57) [42]. If a read mapped to a location that could be placed in multiple categories, it was assigned according to the following order of preference: 3' UTR, coding sequence, 5' UTR, miRNA, intron, intergenic. Reads that overlapped by at least a single nucleotide were grouped together to form read groups. The location of a read group relative to known transcripts was determined in the same way as for individual reads. Original clusters and CCRs were obtained from Hafner et al. [7] and converted to hg19 coordinates using the liftOver tool from the UCSC genome browser [43].
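The filtering, conversion subtraction, category assignment and read grouping steps described above can be illustrated with a minimal Python sketch. This is not the code used in the study. All function names are hypothetical, and the sketch assumes ungapped alignments given as read and reference strings of equal length on the transcribed strand, closed (inclusive) coordinates, and grouping restricted to reads on the same chromosome and strand (the strand restriction is an assumption that the text does not state).

```python
from typing import Iterable, List, Tuple

MIN_READ_LENGTH = 13  # reads shorter than this are discarded

# Order of preference used when a position fits several categories.
CATEGORY_PRIORITY = [
    "3' UTR", "coding sequence", "5' UTR", "miRNA", "intron", "intergenic",
]

# (chromosome, strand, start, end), closed coordinates (assumption).
Interval = Tuple[str, str, int, int]


def passes_read_filter(read: str) -> bool:
    """Keep reads of at least 13 nt that contain no ambiguous nucleotide."""
    return len(read) >= MIN_READ_LENGTH and set(read.upper()) <= set("ACGT")


def mismatches_after_conversion(read: str, reference: str) -> int:
    """Count mismatches, ignoring positions where a reference T is read as C."""
    if len(read) != len(reference):
        raise ValueError("read and reference must have the same length")
    count = 0
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base == read_base:
            continue
        if ref_base == "T" and read_base == "C":
            continue  # T-to-C conversion, subtracted from the mismatch count
        count += 1
    return count


def assign_category(overlapping: Iterable[str]) -> str:
    """Return the highest-priority category among those a read overlaps."""
    found = set(overlapping)
    for category in CATEGORY_PRIORITY:
        if category in found:
            return category
    return "intergenic"  # no annotated feature overlaps the read


def group_reads(reads: Iterable[Interval]) -> List[Interval]:
    """Merge reads that share at least one nucleotide into read groups."""
    groups: List[Interval] = []
    for chrom, strand, start, end in sorted(reads):
        if groups:
            g_chrom, g_strand, g_start, g_end = groups[-1]
            same_sequence = (chrom, strand) == (g_chrom, g_strand)
            if same_sequence and start <= g_end:
                groups[-1] = (g_chrom, g_strand, g_start, max(g_end, end))
                continue
        groups.append((chrom, strand, start, end))
    return groups
```

In this sketch, a read would be retained for further analysis only if it has a single mapped location and `mismatches_after_conversion` returns 0 at that location.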
Repetitive sequence regions were identified by RepeatMasker [44] and the specific locations were downloaded from the UCSC genome browser [43]. The following repeat types were collected for this analysis: low complexity repeat family (low complexity), long interspersed nuclear elements (LINE), short interspersed nuclear elements (SINE), DNA transposons (DNA), RNA repeat families (RNA), satellite repeat family (Satellite), rolling circle (RC), unknown repeat family (Unknown), long terminal repeats (LTR) and other repeats (Other).