The sratoolkit (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software) in combination with Trinity (Grabherr et al., 2011) was used to search insect gonad transcriptome short read archives (SRAs) for transcripts encoding peptides with some similarity to insulin. The method consists of using the tblastn_vdb command from the sratoolkit to recover individual reads from transcriptome SRAs that show possible sequence homology with insulin-like molecules. Since insulin-like peptides have highly variable sequences, the command is run with the -seg no and -evalue 100 options. The reads identified are then collected using the vdb-dump command from the sratoolkit. The total number of reads recovered is much smaller than the number typically present in an SRA, which allows one to use Trinity on a normal desktop computer to assemble a mini-transcriptome from those reads. This transcriptome is then searched using the BLAST+ program (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download) for possible insulin transcripts. This first round usually yields numerous false positives and perhaps a few partial transcripts that look interesting. These promising but partial transcripts are then used as queries with the blastn_vdb command from the sratoolkit on the same SRAs; reads are collected anew and Trinity is used to make another transcriptome, which is again searched for insulin-like transcripts. In order to obtain a complete transcript, the blastn_vdb search may need to be repeated several times. Alternatively, genes coding for such transcripts were identified in genome assemblies using the BLAST+ program and Artemis (Rutherford et al., 2000). Once such transcripts had been found, it was often possible to find orthologs from related species. For example, once the honeybee gonadulin was found, it was much easier to find it in other Hymenoptera.
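The read-recovery step described above can be sketched in Python. This is a minimal illustration, not the authors' script. The -seg no and -evalue 100 options come from the text. The use of tabular output (-outfmt 6), the accession and file names, and the helper names build_search_command and collect_read_ids are assumptions made for this example. The sketch only assembles the command and parses its output; it does not call vdb-dump or Trinity.

```python
from typing import Iterable, List


def build_search_command(sra_accession: str, query_fasta: str) -> List[str]:
    """Assemble a tblastn_vdb call with the permissive settings from the text.

    The accession and query file are placeholders; tabular output is an
    assumption made here so that the hits are easy to parse.
    """
    return [
        "tblastn_vdb",
        "-db", sra_accession,
        "-query", query_fasta,
        "-seg", "no",       # from the text: no low-complexity filtering
        "-evalue", "100",   # from the text: very permissive threshold
        "-outfmt", "6",     # assumption: tab-separated hit table
    ]


def collect_read_ids(tabular_lines: Iterable[str]) -> List[str]:
    """Return unique subject (read) identifiers in first-seen order.

    Expects tab-separated BLAST hits where the second column is the
    subject identifier; blank lines and comment lines are skipped.
    """
    seen = set()
    ordered = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        read_id = fields[1]
        if read_id not in seen:
            seen.add(read_id)
            ordered.append(read_id)
    return ordered
```

The deduplicated identifiers are what would be handed to the read-collection step before assembly; a read hit by several queries is collected only once.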
The same methods were used to identify relaxin and C-terminally extended ilps, which have much better conserved primary amino acid sequences and are consequently more easily identified, as well as their putative receptors. Whenever possible, all sequences were confirmed in both genome assemblies and transcriptome SRAs. In many cases transcripts for the various ilps and receptors were already present in GenBank, although they were not always correctly identified. All these sequences are listed in Spreadsheet S1.
Expression was estimated by counting how many RNAseq reads in each SRA contained coding sequence for each of the genes. Untranslated sequences of complete transcripts sometimes share homologous stretches with transcripts from other genes and can cause false positives; to avoid this, only the coding sequences were used as queries in the blastn_vdb command from the sratoolkit. This yielded the blue numbers in Spreadsheet S2. In order to compare the different SRAs more easily, these numbers were then expressed per million spots in each particular SRA. These are the bold black numbers in Spreadsheet S2.
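The normalization is a simple scaling of the raw read count by the size of the SRA. A minimal sketch follows; the function name and the example numbers are illustrative, not taken from Spreadsheet S2.

```python
def reads_per_million_spots(read_count: int, total_spots: int) -> float:
    """Scale a raw read count to reads per million spots in one SRA."""
    if total_spots <= 0:
        raise ValueError("total_spots must be positive")
    return read_count * 1_000_000 / total_spots


# Illustrative values only: 50 matching reads in an SRA of 25 million spots.
example = reads_per_million_spots(50, 25_000_000)
```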
For the expression of alternative aIGF (arthropod insulin-like growth factor) splice forms, reads for each splice variant were first identified separately. The unique read identifiers across the two sets were then counted to obtain the total number of reads for aIGF. Identifiers present in the initial sets for both splice forms were counted separately, and this number was subtracted from the initial count of each splice variant to obtain the number of reads specific to each isoform.
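This bookkeeping amounts to a set union and a set intersection on the read identifiers. The sketch below assumes each splice variant's hits are available as a list of read identifiers; the function name and the identifiers in the usage example are hypothetical.

```python
from typing import Dict, Iterable


def isoform_read_counts(ids_a: Iterable[str], ids_b: Iterable[str]) -> Dict[str, int]:
    """Count total, shared and isoform-specific reads for two splice variants."""
    a, b = set(ids_a), set(ids_b)
    shared = a & b  # reads that matched both splice forms
    return {
        "total": len(a | b),                  # unique reads for aIGF overall
        "shared": len(shared),
        "specific_a": len(a) - len(shared),   # reads matching only variant A
        "specific_b": len(b) - len(shared),   # reads matching only variant B
    }


# Hypothetical identifiers: one read ("r3") matches both variants.
counts = isoform_read_counts(["r1", "r2", "r3"], ["r3", "r4"])
```

With these inputs there are four unique reads in total, one shared read, two reads specific to the first variant and one specific to the second.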
The various SRAs that were used are listed in the supplementary pdf file and were downloaded from https://www.ncbi.nlm.nih.gov/sra/. The following genome assemblies were also analyzed: Aedes aegypti (Matthews et al., 2018), Blattella germanica (Harrison et al., 2018), Bombyx mori (Kawamoto et al., 2019), Galleria mellonella (Lange et al., 2018), Glossina morsitans (Attardo et al., 2019), Hermetia illucens (Zhan et al., 2020), Latrodectus hesperus (https://www.ncbi.nlm.nih.gov/genome/?term=Latrodectus+hesperus), Mesobuthus martensii (Cao et al., 2013), Oncopeltus fasciatus (Panfilio et al., 2019), Parasteatoda tepidariorum (Schwager et al., 2017), Pardosa pseudoannulata (Yu et al., 2019), Periplaneta americana (Li et al., 2018), Stegodyphus dumicola (Liu et al., 2019), Tetranychus urticae (Grbic et al., 2011), Timema cristinae (Riesch et al., 2017), Tribolium castaneum (Herndon et al., 2020) and Zootermopsis nevadensis (Terrapon et al., 2014). All genomes were downloaded from https://www.ncbi.nlm.nih.gov/genome/.