For our analysis, we used publicly available read data (Supplementary Methods) from Saccharomyces cerevisiae (9), Arabidopsis thaliana (28) and Mus musculus (11), from the same Homo sapiens sample sequenced with two different RNA-Seq protocols, i.e. the flowcell RT-Seq (FRT) and the standard hydrolysis (STD) protocol (17), and from RNA control sequences spiked in at high concentrations (29). In a first step, we mapped and split-mapped all reads non-redundantly to the respective reference genome sequence using the GEM library (http://sourceforge.net/projects/gemlibrary); for the cress data set, which is comparatively small, we also considered additional read mappings with long indels obtained with BLAT (30).
Subsequently, we focused on the distribution of reads that map to transcripts without alternatively processed forms. To define such transcripts, we used a standard reference annotation of each transcriptome, i.e. the SGD annotation for yeast (31), the TAIR annotation for cress (32) and the RefSeq annotations for mouse and human (33). This procedure provided us with mappings for 6 606 768 reads (47%) from yeast, 351 336 reads (65%) from cress and 21 359 481 reads (68%) from mouse, as well as 530 996 reads that map in proper pairs to the spike-in control sequences. Because the data set sizes of the human FRT- and STD-Seq experiments differ substantially (90 million versus 13 million reads), we extracted subsets of reads of suitable size before mapping to ensure comparability (Supplementary Table S1).
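The two selection steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. It assumes that "transcripts without alternatively processed forms" can be approximated as genes with exactly one annotated transcript, and that read subsets are drawn uniformly at random without replacement; the actual extraction procedure is given only in Supplementary Table S1. The function names and the (gene, transcript) pair representation of an annotation are hypothetical.

```python
import random
from collections import defaultdict


def single_isoform_transcripts(annotation):
    """Return transcript IDs of genes with exactly one annotated transcript.

    `annotation` is an iterable of (gene_id, transcript_id) pairs, a
    hypothetical simplification of a reference annotation such as SGD,
    TAIR or RefSeq.
    """
    transcripts_by_gene = defaultdict(set)
    for gene_id, transcript_id in annotation:
        transcripts_by_gene[gene_id].add(transcript_id)
    # Keep only genes whose transcript set has a single member.
    return {
        next(iter(transcripts))
        for transcripts in transcripts_by_gene.values()
        if len(transcripts) == 1
    }


def subsample_reads(reads, n, seed=0):
    """Draw n reads uniformly at random without replacement.

    Assumed strategy for bringing two data sets to a comparable size;
    a fixed seed makes the subset reproducible.
    """
    reads = list(reads)
    if n >= len(reads):
        return reads
    return random.Random(seed).sample(reads, n)


if __name__ == "__main__":
    # Toy annotation: g2 has two isoforms and is therefore excluded.
    toy_annotation = [
        ("g1", "t1"),
        ("g2", "t2a"),
        ("g2", "t2b"),
        ("g3", "t3"),
    ]
    print(sorted(single_isoform_transcripts(toy_annotation)))
    print(len(subsample_reads(range(100), 10)))
```

In practice the read sets are far too large to hold in a Python list, so a streaming method such as reservoir sampling would be preferable; the sketch only makes the logic of the selection explicit.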