Nanopore sequence data were filtered to remove the control lambda-phage and sequences shorter than 1,000 bases using the nanopack tool suite [v1.0.1] (De Coster et al. 2018). Trimmomatic [v0.32] (Bolger et al. 2014) was used to remove adapters, trim low-quality bases, and filter out reads shorter than 85 bp. The filtered nanopore data were assembled into contigs using wtdbg2 [v2.4] (Ruan and Li 2020). The contigs were polished using two iterations of racon [v1.4.0] (Vaser et al. 2017) with minimap2 [v2.17] (Li 2018) mapping the nanopore reads. The contigs were further polished with Illumina paired-end read data using pilon [v1.23] (Walker et al. 2014) with bwa [v0.7.10] (Li 2013) mapping the Illumina paired reads. The resulting contigs were scaffolded using Bionano Solve [Solve3.4.1_09262019] with the optical mapping data generated from the Saphyr run. SALSA [v2.3] (Ghurye et al. 2019) was used to produce super-scaffolds from the Hi-C library and the Bionano scaffolded sequences. Scaffolds larger than 10 Mb were linked and oriented using RagTag [v1.1.1] (Alonge et al. 2022) based on the Onychostoma macrolepis genome (Sun et al. 2020), the chromosome assembly available on NCBI most similar to L. rohita.
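The length criterion in the first filtering step can be illustrated with a minimal Python sketch. This is not how the nanopack suite is implemented; the function names `parse_fastq` and `filter_by_length` are hypothetical, and the sketch assumes plain four-line FASTQ records. Only the 1,000-base threshold comes from the text; removal of the lambda-phage control is not covered.

```python
from typing import Iterable, Iterator, Tuple

# (header, sequence, quality); a hypothetical record type for this sketch
FastqRecord = Tuple[str, str, str]

MIN_READ_LENGTH = 1000  # threshold stated in the text


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ input (assumed, not validated).

    An incomplete trailing record is silently dropped.
    """
    it = iter(lines)
    # zip over the same iterator four times consumes one record per step
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def filter_by_length(
    records: Iterable[FastqRecord], min_length: int = MIN_READ_LENGTH
) -> Iterator[FastqRecord]:
    """Keep reads of at least min_length bases; shorter reads are removed."""
    for header, seq, qual in records:
        if len(seq) >= min_length:
            yield header, seq, qual
```

Both functions are generators, so a large read set could be streamed from an open file handle without loading it into memory.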
RepeatModeler [v2.0.1] (Flynn et al. 2020) and RepeatMasker [v4.1.1] (Smit et al. 2013) were used to create a species-specific repeat database, and this database was subsequently used by RepeatMasker to mask those repeats in the genome. All available RNA-seq libraries for L. rohita (comprising brain, pituitary, gonad, liver, pooled, and whole body tissues for both sexes; Supplementary Table 1) were downloaded from NCBI and mapped to the masked genome using hisat2 [v2.1.0] (Kim et al. 2019). These alignments were used in both the mikado [v2.0rc2] (Venturini et al. 2018) and braker2 [v2.1.5] (Brůna et al. 2021) pipelines. Mikado uses putative transcripts assembled from the RNA-seq alignments generated via stringtie [v2.1.2] (Kovaka et al. 2019), cufflinks [v2.2.1] (Trapnell et al. 2012), and trinity [v2.11.0] (Grabherr et al. 2011), along with the junction site prediction from portcullis [v1.2.2] (Mapleson et al. 2018), the alignments of the putative transcripts with UniProtKB Swiss-Prot [v2021.03] (The UniProt Consortium 2021), and the ORFs from prodigal [v2.6.3] (Hyatt et al. 2010), to select the best representative transcript for each locus. Braker2 uses those RNA-seq alignments and the gene prediction from GeneMark-ES [v4.61] (Borodovsky and Lomsadze 2011) to train a species-specific Augustus [v3.3.3] (Stanke et al. 2006) model. Maker2 [v2.31.10] (Holt and Yandell 2011) predicts genes based on the new Augustus, GeneMark, and SNAP models derived from Braker2, along with the Mikado predicted transcripts as an external ab-initio source, modifying the predictions based on the available RNA and protein evidence from the Cyprinidae family in the NCBI RefSeq database. Any predicted genes with an annotation edit distance (AED) above 0.47 were removed from further analysis. The remaining genes were functionally annotated using InterProScan [v5.47-82.0] (Jones et al. 2014) and BLAST+ [v2.9.0] (Camacho et al. 2009) alignments against the UniProtKB Swiss-Prot database. BUSCO [v5.2.2] (Manni et al. 2021) was used to verify the completeness of both the genome and annotations against the actinopterygii_odb10 database. Lastly, genes spanning large gaps or completely contained within another gene on the opposite strand were removed using a custom Perl script (https://github.com/IGBB/rohu-genome/).
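Two of the gene-level filters described above, the AED cutoff and the removal of genes nested within a gene on the opposite strand, can be sketched in Python as follows. This is an illustration only and does not reproduce the custom Perl script in the repository. The `Gene` record and all function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and requiring the containing gene to be strictly longer is an assumption made here so that two genes with identical spans do not remove each other. The filter for genes spanning large gaps is omitted because the text gives no gap-size threshold.

```python
from dataclasses import dataclass
from typing import List

AED_CUTOFF = 0.47  # genes with AED above this value are removed (from the text)


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record used only in this sketch."""
    gene_id: str
    scaffold: str
    start: int   # 1-based, inclusive (assumed)
    end: int     # 1-based, inclusive (assumed)
    strand: str  # "+" or "-"
    aed: float


def passes_aed(gene: Gene, cutoff: float = AED_CUTOFF) -> bool:
    """Return True when the gene's AED does not exceed the cutoff."""
    return gene.aed <= cutoff


def is_nested_on_opposite_strand(gene: Gene, others: List[Gene]) -> bool:
    """Return True if a longer gene on the other strand fully contains gene."""
    for other in others:
        if (
            other.scaffold == gene.scaffold
            and other.strand != gene.strand
            and other.start <= gene.start
            and gene.end <= other.end
            # Assumption: the container must be strictly longer
            and (other.end - other.start) > (gene.end - gene.start)
        ):
            return True
    return False


def filter_genes(genes: List[Gene]) -> List[Gene]:
    """Apply the AED cutoff first, then drop opposite-strand nested genes."""
    kept = [g for g in genes if passes_aed(g)]
    return [g for g in kept if not is_nested_on_opposite_strand(g, kept)]
```

The pairwise comparison is quadratic in the number of genes; for a full annotation, an interval index per scaffold would be a more practical choice.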