Read quality was assessed with FastQC ([58]). Low-quality bases at the 5’ and 3’ ends were trimmed with Trimmomatic ([59]). Low-quality nucleotides were trimmed from the read ends (first 8 bases), with the minimum per-base quality set to a Phred score of 20 and the minimum and maximum read lengths after cleaning set to 25 bp and 240 bp, respectively. Cleaned reads were assembled into transcript sequences using Trinity v.2.11.0 ([60]) with in silico read normalization and the --min_kmer_cov parameter set to 2. To remove redundancy, the transcriptome was clustered with CD-HIT-EST (v. 4.6.8, [61]) at a 90% identity threshold. The whole transcriptome was aligned with BLASTx ([62]) against the UniProt Swiss-Prot database (downloaded in July 2020), with an e-value threshold of 1e−3. At this stage, all transcripts matching bacterial sequences were filtered out of the transcriptome.
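The trimming criteria can be illustrated with a minimal Python sketch. It is not the Trimmomatic implementation and does not reproduce the command line used in this work. It assumes Phred+33 quality encoding, reads "first 8 bases" as an unconditional removal of the first 8 bases, and assumes that reads longer than 240 bp are truncated rather than discarded. The function names `phred_scores` and `clean_read` are hypothetical.

```python
# Illustrative sketch only; names and interpretation are assumptions, not the authors' code.
HEAD_CROP = 8    # bases removed from the start of every read (assumed interpretation)
MIN_QUAL = 20    # minimum Phred score per base
MIN_LEN = 25     # minimum read length after cleaning (bp)
MAX_LEN = 240    # maximum read length after cleaning (bp)


def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def clean_read(seq, qual):
    """Return the cleaned (sequence, quality) pair, or None if the read is too short."""
    # Remove the first HEAD_CROP bases.
    seq, qual = seq[HEAD_CROP:], qual[HEAD_CROP:]

    # Trim bases below MIN_QUAL from both ends.
    scores = phred_scores(qual)
    start, end = 0, len(scores)
    while start < end and scores[start] < MIN_QUAL:
        start += 1
    while end > start and scores[end - 1] < MIN_QUAL:
        end -= 1
    seq, qual = seq[start:end], qual[start:end]

    # Truncate to the maximum length (assumption: long reads are cut, not dropped).
    seq, qual = seq[:MAX_LEN], qual[:MAX_LEN]

    # Discard reads shorter than the minimum length.
    if len(seq) < MIN_LEN:
        return None
    return seq, qual
```

For example, a 40 bp read whose bases 9-10 and 39-40 fall below Q20 is reduced to the 28 high-quality bases in between, while a 20 bp read is discarded because only 12 bases remain after the first 8 are removed.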
Proteins encoded by the assembled transcripts were predicted with TransDecoder v5.3.0 (https://github.com/TransDecoder/TransDecoder/releases). The software identifies coding sequences based on: 1) a minimum open reading frame (ORF) length (100 amino acids by default, to minimize the number of false positives); 2) an internal scoring system; 3) containment: if a candidate ORF lies entirely within the coordinates of another candidate ORF, the longer one is reported. The predicted proteins were functionally annotated with InterProScan (version 5.33) ([63]).
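Criteria 1 and 3 can be sketched in a few lines of Python. This is a simplified illustration, not TransDecoder's code: the internal scoring step (criterion 2) is omitted, the 100 threshold is taken as amino acids, and the `Orf` type and `select_orfs` function are hypothetical names introduced here.

```python
from typing import List, NamedTuple


class Orf(NamedTuple):
    """A candidate ORF on a transcript (hypothetical type; 0-based, end-exclusive)."""
    transcript: str
    start: int
    end: int

    @property
    def aa_length(self) -> int:
        # Three nucleotides per encoded amino acid.
        return (self.end - self.start) // 3


def select_orfs(orfs: List[Orf], min_aa: int = 100) -> List[Orf]:
    """Keep ORFs of at least min_aa residues that are not nested in a longer kept ORF."""
    # Criterion 1: minimum ORF length.
    candidates = [o for o in orfs if o.aa_length >= min_aa]

    # Visit the longest ORFs first so that a containing ORF is always kept
    # before any ORF nested inside it is examined.
    candidates.sort(key=lambda o: o.end - o.start, reverse=True)

    kept: List[Orf] = []
    for orf in candidates:
        # Criterion 3: drop an ORF fully contained in an already kept ORF
        # on the same transcript.
        nested = any(
            k.transcript == orf.transcript and k.start <= orf.start and orf.end <= k.end
            for k in kept
        )
        if not nested:
            kept.append(orf)
    return kept
```

Sorting by length before the containment check avoids comparing every pair of ORFs twice; the check itself is quadratic in the number of candidates, which is acceptable for an illustration but not tuned for whole transcriptomes.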