Current gene annotations for S. pombe were downloaded as file ‘pombe_290110.gff’ from GeneDB (http://old.genedb.org/genedb/pombe/). RefSeq transcript annotations for mouse were downloaded in BED format from the UCSC mouse genome browser gateway (http://genome.ucsc.edu/cgi-bin/hgGateway?db=mm9). Protein-coding nucleotide sequences were extracted from the genome sequences based on the gene annotations using custom Perl scripts. The mouse reference coding sequences were further filtered to remove entirely identical sequences, which correspond to isoforms encoding identical proteins and to paralogous sequences: the original 19,947 genes encoding 23,881 transcripts were reduced to 19,857 genes encoding 22,717 non-identical coding transcripts.
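The removal of identical coding sequences can be illustrated with a minimal Python sketch. It is not the custom script used here; the function name `collapse_identical_cds`, the record layout, and the choice to keep the first occurrence of each distinct sequence are assumptions made for illustration.

```python
def collapse_identical_cds(records):
    """Keep one representative per distinct coding sequence.

    records: iterable of (gene_id, transcript_id, sequence) tuples
    (hypothetical layout). The first record seen for each sequence is
    kept; later exact duplicates are dropped.
    """
    seen = set()
    kept = []
    for gene_id, transcript_id, seq in records:
        key = seq.upper()  # compare case-insensitively (assumption)
        if key in seen:
            continue
        seen.add(key)
        kept.append((gene_id, transcript_id, seq))
    return kept


# Toy input: two isoforms of geneA share a coding sequence, and geneC
# is a paralog whose coding sequence is identical to geneB's.
toy_records = [
    ("geneA", "tx1", "ATGAAATAG"),
    ("geneA", "tx2", "ATGAAATAG"),
    ("geneB", "tx3", "ATGCCCTAG"),
    ("geneC", "tx4", "ATGCCCTAG"),
]

kept_records = collapse_identical_cds(toy_records)
genes_before = {gene for gene, _, _ in toy_records}
genes_after = {gene for gene, _, _ in kept_records}
print(len(toy_records), "->", len(kept_records), "transcripts")
print(len(genes_before), "->", len(genes_after), "genes")
```

Under this assumption, a gene disappears from the set only when every one of its coding sequences duplicates a sequence already retained, which is how both the transcript count and the gene count can decrease.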
Reconstructed transcript sequences (via de novo assembly, Scripture, or Cufflinks) were mapped to the reference coding sequences using BLAT (ref. 35). Full-length reference annotation mappings were defined as having at least 95% sequence identity covering the entire reference coding sequence and containing at most 5% insertions or deletions (cumulative gap content). In evaluating methods that leverage the strand-specific data (Trinity and Cufflinks), proper sense-strand mapping of sequences was required. Transcripts reconstructed by the alternative methods (Scripture, ABySS, and SOAPdenovo) were allowed to map to either strand. Fusion transcripts were identified as individual reconstructed transcripts that mapped as full-length to multiple reference coding sequences and lacked overlap among the matching regions within the reconstructed transcript. One-to-one mappings were required between reconstructed transcripts and reference transcripts, including alternatively spliced isoforms, with the exception of fusion transcripts.
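The full-length and fusion criteria can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used here: the `Hit` record and its field names are hypothetical, identity is taken as matches over aligned (non-gap) bases, gap content is normalized by reference length, and `"+"` is assumed to denote a sense-strand mapping.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Hit:
    """One alignment of a reconstructed transcript to a reference CDS.

    Hypothetical layout; coordinates are 0-based, half-open.
    """
    ref_id: str
    ref_length: int
    ref_start: int
    ref_end: int
    query_start: int  # position on the reconstructed transcript
    query_end: int
    matches: int
    mismatches: int
    gap_bases: int    # cumulative inserted plus deleted bases
    strand: str = "+"


def is_full_length(hit, require_sense=False,
                   min_identity=0.95, max_gap_fraction=0.05):
    """Apply the 95% identity / full coverage / 5% gap criteria."""
    if require_sense and hit.strand != "+":
        return False
    covers_ref = hit.ref_start == 0 and hit.ref_end == hit.ref_length
    aligned = hit.matches + hit.mismatches
    if not covers_ref or aligned == 0:
        return False
    identity = hit.matches / aligned                 # assumption
    gap_fraction = hit.gap_bases / hit.ref_length    # assumption
    return identity >= min_identity and gap_fraction <= max_gap_fraction


def is_fusion(hits, require_sense=False):
    """True if two full-length hits to different references occupy
    non-overlapping regions of the same reconstructed transcript."""
    full = [h for h in hits if is_full_length(h, require_sense)]
    for a, b in combinations(full, 2):
        if a.ref_id == b.ref_id:
            continue
        if a.query_end <= b.query_start or b.query_end <= a.query_start:
            return True
    return False


# Toy hits for a single reconstructed transcript.
hit_a = Hit("refA", 1000, 0, 1000, 0, 1000, 990, 10, 0)
hit_b = Hit("refB", 500, 0, 500, 1200, 1700, 495, 5, 0, strand="-")
gappy = Hit("refC", 1000, 0, 1000, 0, 1060, 990, 10, 60)

print(is_full_length(hit_a), is_full_length(gappy))
print(is_fusion([hit_a, hit_b]), is_fusion([hit_a, hit_b], require_sense=True))
```

Setting `require_sense=True` corresponds to the evaluation of the strand-specific methods, while the default corresponds to methods whose transcripts may map to either strand. Two isoforms of one gene matched by the same reconstructed region overlap on the query and are therefore not flagged as a fusion in this sketch.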