Repetitive elements were identified de novo using RepeatModeler v2.0.1 (Flynn et al. 2020) with the “LTRStruct” option. RepeatMasker v4.1.1 (Tempel 2012) was used to screen known repetitive elements with two inputs: (1) the RepeatModeler output and (2) the vertebrata library of Dfam v3.3 (Storer et al. 2021). The resulting output files were validated and merged before redundancy was removed using GenomeTools v1.6.1 (Gremme et al. 2013). To identify and annotate candidate gene models, BRAKER v2.1.6 (Brůna et al. 2021) was used with mRNA and protein evidence. For annotation with BRAKER, the chromosome sequences were soft masked using the maskfasta function of BEDTools v2.30.0 (Quinlan 2014) with the “soft” option. Protein evidence consisted of protein records from UniProtKB/Swiss-Prot (UniProt Consortium 2021) as of 2021 January 11 (563,972 sequences) as well as selected fish proteomes from the NCBI database (
A. ocellaris: 48,668, Danio rerio: 88,631, Acanthochromis polyacanthus: 36,648, Oreochromis niloticus: 63,760, Oryzias latipes: 47,623, Poecilia reticulata: 45,692, Stegastes partitus: 31,760, Takifugu rubripes: 49,529, and Salmo salar: 112,302). Transcriptomic reads from 13 tissues were used as mRNA evidence. These Illumina short reads were trimmed with Trimmomatic v0.39 (Bolger et al. 2014) as described above and mapped to the chromosome sequences with HISAT2 v2.2.1 (Kim et al. 2019). The resulting SAM files were converted to BAM format with SAMtools v1.10 (Li et al. 2009) and used as input for BRAKER. Of the resulting gene models, only those with supporting evidence (mRNA or protein hints) or with homology to the Swiss-Prot protein database (UniProt Consortium 2021) or Pfam domains (Mistry et al. 2021) were selected as final gene models. Homology to the Swiss-Prot protein database and to Pfam domains was identified using Diamond v2.0.9 (Buchfink et al. 2015) and InterProScan v5.48.83.0 (Zdobnov and Apweiler 2001), respectively. Functional annotation of the final gene models was completed using NCBI BLAST v2.10.0 (Altschul et al. 1990) with the NCBI non-redundant (nr) protein database. Gene Ontology (GO) terms were assigned to
A. clarkii genes using the BLAST output and the “gene2go” and “gene2accession” files from the NCBI FTP site (https://ftp.ncbi.nlm.nih.gov/gene/DATA/). Completeness of the gene annotation was assessed with BUSCO v4.1.4 (actinopterygii_odb10) (Simão et al. 2015).
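
The selection rule for final gene models and the GO assignment step can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the gene models with hint support, with Swiss-Prot hits, and with Pfam domains have already been collected into sets, that one best BLAST hit is used per gene model, and that the “gene2accession” and “gene2go” files have been loaded into dictionaries. All function names, gene IDs, accessions, and GO IDs below are hypothetical placeholders.

```python
def select_final_models(models, hint_supported, swissprot_hits, pfam_hits):
    """Keep a gene model if it has hint support OR a Swiss-Prot hit OR a Pfam domain."""
    keep = set(hint_supported) | set(swissprot_hits) | set(pfam_hits)
    # Preserve the original order of the candidate models.
    return [m for m in models if m in keep]


def assign_go_terms(best_hits, accession_to_gene, gene_to_go):
    """Transfer GO terms from the best BLAST hit to each gene model.

    best_hits:         gene model ID -> protein accession of its best hit (assumed)
    accession_to_gene: protein accession -> NCBI GeneID (from gene2accession)
    gene_to_go:        NCBI GeneID -> set of GO IDs (from gene2go)
    """
    annotations = {}
    for model, accession in best_hits.items():
        gene_id = accession_to_gene.get(accession)
        terms = gene_to_go.get(gene_id, set())
        if terms:  # skip models whose hit has no GeneID or no GO terms
            annotations[model] = sorted(terms)
    return annotations


# Toy data (all identifiers are made up for illustration).
models = ["g1", "g2", "g3", "g4"]
final = select_final_models(
    models,
    hint_supported={"g1"},
    swissprot_hits={"g2"},
    pfam_hits={"g2", "g3"},
)

best_hits = {"g1": "XP_000001.1", "g2": "XP_000002.1", "g3": "XP_999999.1"}
accession_to_gene = {"XP_000001.1": "101", "XP_000002.1": "102"}
gene_to_go = {"101": {"GO:0000002", "GO:0000001"}}

go = assign_go_terms(
    {m: best_hits[m] for m in final if m in best_hits},
    accession_to_gene,
    gene_to_go,
)
print(final)
print(go)
```

In this toy example, g4 is dropped because it has neither hint support nor homology, and only g1 receives GO terms because its best hit is the only one that maps to a GeneID with GO annotations.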
Moore B., Herrera M., Gairin E., Li C., Miura S., Jolly J., Mercader M., Izumiyama M., Kawai E., Ravasi T., Laudet V., & Ryu T. (2023). The chromosome-scale genome assembly of the yellowtail clownfish Amphiprion clarkii provides insights into the melanic pigmentation of anemonefish. G3: Genes|Genomes|Genetics, 13(3), jkad002.