The P. dulcis genome assembly was annotated by combining transcript alignments, protein alignments and ab initio gene predictions. A flowchart of the annotation process is shown in Figure S13. Scripts are available at https://github.com/jesgomez/annotation_pipeline.
First, almond RNA‐seq reads were downloaded from NCBI (accession number SRR1251980) and aligned to the genome with STAR (v.2.5.3a) (Dobin et al., 2013). Transcript models were then generated using StringTie (v.1.0.4) (Pertea et al., 2015). These models, together with the P. persica transcriptome (annotation Pp2.0a) and 4509 almond expressed sequence tags downloaded from NCBI in July 2015, were assembled into a non‐redundant set by PASA (v.2.3.3) (Haas et al., 2008). The TransDecoder program, which is part of the PASA package, was run on the PASA assemblies to detect coding regions in the transcripts. Second, the complete Rosaceae proteome was downloaded from UniProt in July 2015 and aligned to the genome using Exonerate (v.2.4.7) (Slater and Birney, 2005). Third, ab initio gene predictions were performed on the repeat‐masked pdulcis26 assembly with three different programs: GeneID v.1.4 (Alioto et al., 2018), Augustus v.3.2.3 (Stanke et al., 2015) and GeneMark‐ES v.2.3e (Lomsadze et al., 2014), with and without incorporating evidence from the RNA‐seq data. Finally, all the evidence was combined into consensus coding sequence models using EvidenceModeler (EVM, v.1.1.1) (Haas et al., 2008). Untranslated regions and alternatively spliced isoforms were then annotated through two rounds of PASA annotation updates.
Non‐coding RNAs were annotated as follows. The program cmsearch v.1.1 (Cui et al., 2016) from the INFERNAL package (Nawrocki and Eddy, 2013) was run against the Rfam database of RNA families (v.12.0) (Nawrocki et al., 2015). In addition, tRNAscan‐SE v.1.23 (Lowe, 1997) was run to detect the transfer RNA genes present in the genome assembly. To annotate long non‐coding RNAs (lncRNAs), we selected the PASA assemblies that had not been included in the annotation of protein‐coding genes. Assemblies longer than 200 bp with less than 80% of their length covered by a small ncRNA were incorporated into the ncRNA annotation as lncRNAs. The resulting transcripts were clustered into genes, with shared splice sites or significant sequence overlap as the criteria for assigning transcripts to the same gene.
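The lncRNA selection rule described above can be illustrated with a minimal Python sketch. It is not taken from the published pipeline: the function names, the half‐open interval representation, and the choice to measure length as the summed exon length of each PASA assembly are assumptions made here for illustration. The input is assumed to contain only assemblies already excluded from the protein‐coding annotation.

```python
from typing import List, Tuple

# Assumed representation: half-open (start, end) intervals on one scaffold.
Interval = Tuple[int, int]


def covered_length(exons: List[Interval], features: List[Interval]) -> int:
    """Count exonic bases overlapped by at least one feature interval.

    Exons are assumed not to overlap each other.
    """
    # Merge overlapping features so that no base is counted twice.
    merged: List[Interval] = []
    for start, end in sorted(features):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    total = 0
    for exon_start, exon_end in exons:
        for feat_start, feat_end in merged:
            total += max(0, min(exon_end, feat_end) - max(exon_start, feat_start))
    return total


def is_lncrna(
    exons: List[Interval],
    small_ncrnas: List[Interval],
    min_length: int = 200,      # "longer than 200 bp"
    max_coverage: float = 0.8,  # rejected when covered to at least 80%
) -> bool:
    """Apply the length and small-ncRNA coverage criteria to one assembly."""
    # Assumption: transcript length is the summed exon length.
    length = sum(end - start for start, end in exons)
    if length <= min_length:
        return False
    return covered_length(exons, small_ncrnas) / length < max_coverage
```

For example, under these assumptions `is_lncrna([(0, 150), (300, 400)], [(0, 100)])` accepts a 250 bp assembly of which 40% is covered by a small ncRNA, whereas the same assembly with small ncRNAs at `[(0, 150), (300, 360)]` (84% covered) is rejected.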