The image data produced by the sequencing apparatus were converted by base calling into raw reads and stored in FASTQ format. The raw reads were cleaned with the filter_fq software, which discarded low-quality reads, reads containing more than 5 % unknown bases and reads containing adapter sequences.
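The cleaning criteria above can be illustrated with a minimal Python sketch. It is not the filter_fq implementation: the function names, the exact-substring adapter check and the adapter string in the usage note are hypothetical, and the low-quality criterion is omitted because the text gives no threshold for it.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(rows) - 3, 4):
        # rows[i + 2] is the '+' separator line and is skipped.
        yield rows[i], rows[i + 1], rows[i + 3]


def unknown_fraction(seq: str) -> float:
    """Fraction of bases called as N (unknown)."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def keep_read(seq: str, adapter: Optional[str] = None,
              max_unknown: float = 0.05) -> bool:
    """Apply the two criteria stated in the text.

    Assumption: adapters are detected by exact substring match, which is
    simpler than what a real cleaning tool is likely to do.
    """
    if unknown_fraction(seq) > max_unknown:
        return False
    if adapter and adapter.upper() in seq.upper():
        return False
    return True


def filter_reads(records: Iterable[Record], adapter: Optional[str] = None,
                 max_unknown: float = 0.05) -> List[Record]:
    """Return only the records whose sequence passes keep_read."""
    return [r for r in records if keep_read(r[1], adapter, max_unknown)]
```

For example, `filter_reads(parse_fastq(open("reads.fq")), adapter="AGATCGGAAGAGC")` would return the retained records; both the file name and the adapter string are placeholders, not values taken from the study.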
De novo transcriptome assembly was conducted with the short-read assembly program Trinity [34, 77] (minimal contig_length: 100; group_pairs distance: 250; minimal kmer_cov: 2). Trinity first combined overlapping reads into contigs of at least 100 bp, requiring a minimum of 2 reads for assembly. The contigs were then assembled again into longer sequences that could not be extended further; these are the unigenes. During this process, and before the final unigenes were obtained, the reads were mapped against the contigs to confirm the assembly procedure. When multiple samples from the same species are sequenced (biological replicates or different tissues), the unigenes from each sample can be pooled for a further assembly step. This step detects sequence splicing and redundancy in order to retain the longest sequences and group them into clusters. Each cluster is formed by several unigenes with more than 70 % similarity. To simplify the terminology employed in this study, all the non-redundant sequences will be called “transcripts”, regardless of whether they are unique unigenes or belong to a cluster. The completeness of the mussel transcriptome was confirmed with the CEGMA package (http://korflab.ucdavis.edu/datasets/cegma/).
A total of 151,320 transcripts were obtained following this protocol. This number represents all the detectable variability in the mRNAs from the four studied tissues, including splicing variants, non-overlapping fragments of the same mRNA, UTRs or mRNAs in different splicing stages.
The transcripts were first annotated using BLASTx and BLASTn (with an e-value threshold of 10e−5) against the NCBI nr, Swiss-Prot, KEGG and COG protein databases and the NCBI nt nucleotide database. For each transcript, the annotation step also identified the species harboring the matching sequence, which is useful for detecting possible contaminants in our samples. The KEGG database information was used to obtain and present the metabolic pathways and functions of the annotated transcripts.
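The e-value filtering described above can be sketched as follows. This is a sketch under stated assumptions, not the pipeline used in the study: it assumes the BLAST results were written in the standard 12-column tabular format (e-value in column 11, bit score in column 12), and it keeps one best hit per transcript. The names `Hit` and `best_hits` are hypothetical, and the default cutoff of 1e-5 is one reading of the stated threshold, so it is exposed as a parameter.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str      # transcript identifier (column 1)
    subject: str    # matching database sequence (column 2)
    evalue: float   # column 11 of the tabular output
    bitscore: float  # column 12 of the tabular output


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Keep, for each query, the hit with the lowest e-value.

    Hits with an e-value above max_evalue are discarded; ties on e-value
    are broken in favour of the higher bit score.
    """
    best: Dict[str, Hit] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column record
        hit = Hit(fields[0], fields[1], float(fields[10]), float(fields[11]))
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or (hit.evalue, -hit.bitscore) < (current.evalue, -current.bitscore):
            best[hit.query] = hit
    return best
```

The subject identifiers of the retained hits could then be mapped to their source species to flag transcripts whose matches come from unexpected taxa.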
The oyster proteome was downloaded from http://www.oysterdb.com/FrontDownloadAction.do?method=download and compared with the translated mussel transcripts.