We constructed 43,627 transcript assemblies from about 727 million reads of paired-end Illumina RNA-seq data using PERTRAN (S.S., unpublished data). We built 47,464 transcript assemblies using PASA (ref. 52) from 79,630 P. vulgaris Sanger ESTs and the RNA-seq transcript assemblies. Loci were identified by transcript assembly alignments and/or EXONERATE alignments of peptides from Arabidopsis, poplar, Medicago truncatula, grape (Vitis vinifera) and rice (Oryza sativa) to the genome, which was soft-masked for repeats using RepeatMasker (ref. 53) on the basis of a transposon database developed as part of this project (see URLs). Each locus was extended by up to 2,000 bp on both ends unless the extension ran into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (ref. 53), FGENESH_EST (similar to FGENESH+, with ESTs used as splice-site and intron input instead of peptides or translated ORFs) and GenomeScan (ref. 54). The highest-scoring predictions for each locus were selected using multiple positive factors, including EST and peptide support, and one negative factor: overlap with repeats. Selected gene predictions were improved by PASA, including by adding UTRs, correcting splicing and adding alternative transcripts. PASA-improved gene model peptides were subjected to peptide homology analysis with the above-mentioned proteomes to obtain Cscore values and peptide coverage. Cscore is the ratio of the peptide BLASTP score to the mutual best hit BLASTP score, and peptide coverage is the highest percentage of peptide aligned to the best homolog. A transcript was selected if its Cscore value was greater than or equal to 0.5 and its peptide coverage was greater than or equal to 0.5 or if it had EST coverage but the proportion of its coding sequence overlapping repeats was less than 20%. For gene models where greater than 20% of the coding sequence overlapped with repeats, the Cscore value was required to be at least 0.9 and homology coverage was required to be at least 70% to be selected. Selected gene models were subjected to Pfam analysis, and gene models whose encoded peptide contained more than 30% Pfam transposon element domains were removed. The final gene set consisted of 27,197 protein-coding genes and 31,638 protein-coding transcripts.
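The selection thresholds described above can be sketched as a small filter. This is an illustrative reading of the text, not the pipeline's actual code: the `GeneModel` class, its field names and the function names are all hypothetical. The sketch assumes that the 0.5/0.5 homology rule and the EST rule both apply only to models with low repeat overlap, that a model at exactly 20% repeat overlap falls under the lenient branch, and that "more than 30% Pfam transposon element domains" means a fraction of the peptide.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Evidence for one transcript. All field names are hypothetical."""
    blastp_score: float           # BLASTP score of the peptide against its best homolog
    mutual_best_hit_score: float  # BLASTP score of the mutual best hit
    peptide_coverage: float       # fraction (0-1) of the peptide aligned to the best homolog
    has_est_support: bool         # True if ESTs cover the transcript
    repeat_overlap: float         # fraction (0-1) of the CDS overlapping repeats
    pfam_te_fraction: float       # fraction (0-1) of the peptide in Pfam TE domains (assumed meaning)


def cscore(model: GeneModel) -> float:
    """Ratio of the peptide BLASTP score to the mutual best hit BLASTP score."""
    if model.mutual_best_hit_score <= 0:
        return 0.0  # assumption: no usable homolog means no homology support
    return model.blastp_score / model.mutual_best_hit_score


def passes_homology_filter(model: GeneModel) -> bool:
    """Apply the Cscore, coverage and EST thresholds from the text."""
    score = cscore(model)
    if model.repeat_overlap > 0.20:
        # Repeat-rich models need stronger homology evidence.
        return score >= 0.9 and model.peptide_coverage >= 0.70
    # Low repeat overlap: homology support or EST support is enough.
    return (score >= 0.5 and model.peptide_coverage >= 0.5) or model.has_est_support


def keep_gene_model(model: GeneModel) -> bool:
    """Homology/EST filter followed by the Pfam transposon-domain filter."""
    return passes_homology_filter(model) and model.pfam_te_fraction <= 0.30


if __name__ == "__main__":
    candidates = {
        "homology_supported": GeneModel(80, 100, 0.6, False, 0.05, 0.0),
        "est_only": GeneModel(40, 100, 0.9, True, 0.10, 0.0),
        "repeat_rich_weak": GeneModel(80, 100, 0.9, True, 0.50, 0.0),
        "te_like": GeneModel(95, 100, 0.8, False, 0.05, 0.5),
    }
    for name, model in candidates.items():
        print(name, keep_gene_model(model))
```

Keeping the repeat-overlap branch first makes the stricter thresholds explicit. A different reading of the original sentence, for example one in which the repeat condition applies only to EST-supported models, would change only `passes_homology_filter`.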
Schmutz J., McClean P.E., Mamidi S., Wu G.A., Cannon S.B., Grimwood J., Jenkins J., Shu S., Song Q., Chavarro C., Torres-Torres M., Geffroy V., Moghaddam S.M., Gao D., Abernathy B., Barry K., Blair M., Brick M.A., Chovatia M., Gepts P., Goodstein D.M., Gonzales M., Hellsten U., Hyten D.L., Jia G., Kelly J.D., Kudrna D., Lee R., Richard M.M., Miklas P.N., Osorno J.M., Rodrigues J., Thareau V., Urrea C.A., Wang M., Yu Y., Zhang M., Wing R.A., Cregan P.B., Rokhsar D.S., & Jackson S.A. (2014). A reference genome for common bean and genome-wide analysis of dual domestications. Nature Genetics, 46(7), 707-713.