To refine exon/intron junction locations, 305,000 teleost protein sequences from UniProt (ref. 37) and Ensembl (ref. 38) were aligned to the genome sequence using the BLAT algorithm (ref. 39). For each protein, the best match was selected, together with any match scoring above 0.8 times the best match, and each matched protein was then realigned with Genewise (ref. 40) on the same trout genomic region. In total, 93% of these teleost proteins matched at 41,300 different genomic loci in the rainbow trout genome assembly.
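The match-selection rule described above can be illustrated with a minimal sketch. It assumes each BLAT hit has already been reduced to a single numeric score per protein/locus pair; the `Hit` record, the `select_hits` function and the `ratio` parameter are hypothetical names introduced here for illustration and are not taken from the original pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one protein-to-genome match."""
    protein_id: str
    scaffold: str
    start: int
    end: int
    score: float  # assumed: one alignment score per hit


def select_hits(hits: Iterable[Hit], ratio: float = 0.8) -> Dict[str, List[Hit]]:
    """Keep, for each protein, the best hit plus hits scoring above ratio * best."""
    by_protein: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        by_protein[hit.protein_id].append(hit)

    selected: Dict[str, List[Hit]] = {}
    for protein_id, protein_hits in by_protein.items():
        best = max(h.score for h in protein_hits)
        kept = [h for h in protein_hits if h.score == best or h.score > ratio * best]
        # Highest-scoring hits first, so the best match is always kept[0].
        selected[protein_id] = sorted(kept, key=lambda h: h.score, reverse=True)
    return selected


example_hits = [
    Hit("P1", "scaffold_1", 100, 900, 100.0),
    Hit("P1", "scaffold_7", 500, 1300, 85.0),
    Hit("P1", "scaffold_9", 10, 700, 79.0),
    Hit("P1", "scaffold_3", 40, 400, 50.0),
    Hit("P2", "scaffold_2", 0, 600, 60.0),
]
selected = select_hits(example_hits)
```

In this toy input, protein `P1` retains its best hit and the one hit scoring above 80% of it, and each retained region would then be passed on for realignment.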
To build gene models, rainbow trout GenBank mRNA sequences were aligned to the genome assembly using BLAT (ref. 39) and est2genome (ref. 41); 93% of the 421,414 mRNA sequences were mapped. Only the best matches with at least 90% nucleotide identity were kept. The average similarity was 97.8%, and half of these alignments provided evidence of splicing, with an average of 2.5 exons per mRNA. We also used publicly available rainbow trout Roche 454 EST sequences from SRA (accession number
Final gene models were built using Gaze (ref. 43), yielding 55,735 gene models with an average of 6 exons per gene (median = 4). At the genome level, coding bases cover 3% of the assembly. Because 3,088 exons overlapped gaps in the assembly, we inserted in-frame introns to avoid long stretches of N letters in the corresponding protein sequences. We also tagged 585 genes that still contained transposable elements despite repeated cleaning procedures. In summary, the final gene set can be categorized into four classes of decreasing confidence level: (i) 46,585 protein-coding gene models with supporting protein evidence from other vertebrates (