RNA-Seq reads were pooled and assembled de novo into a transcriptome using the Trinity software package with default settings [24]. This produced 48,833 transcripts with an average length of 921.6 nucleotides. We discarded any transcripts encoding open reading frames (ORFs) <100 amino acids, leaving 21,021 transcripts with an average length of 1,004.6 nucleotides. These transcripts were then compared to the NCBI NR protein database using BLASTX. In cases where BLAST e-values were >1 × 10⁻⁵, we determined transcript direction based on the longest predicted ORF.
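The ORF-length filter and the ORF-based choice of transcript direction can be illustrated with a minimal sketch. The text does not say how ORFs were predicted, so this sketch assumes an ORF is the longest stop-free stretch of codons in any of the six reading frames (no start codon is required, since assembled transcripts may be partial). The function names `longest_orf` and `filter_transcripts` are hypothetical and are not part of Trinity or BLAST.

```python
# Assumption: an "ORF" is the longest run of codons without a stop codon,
# searched over all six reading frames. The original ORF caller is not named.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_orf(seq):
    """Return (length in codons, strand) of the longest stop-free stretch."""
    best_len, best_strand = 0, "+"
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            run = 0
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    run = 0  # a stop codon ends the current stretch
                else:
                    run += 1
                    if run > best_len:
                        best_len, best_strand = run, strand
    return best_len, best_strand


def filter_transcripts(transcripts, min_aa=100):
    """Keep transcripts whose longest ORF has at least min_aa codons."""
    return {
        name: seq
        for name, seq in transcripts.items()
        if longest_orf(seq)[0] >= min_aa
    }
```

The strand returned by `longest_orf` is what a fallback of this kind would use to orient a transcript that has no BLASTX hit below the e-value cutoff.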
For gene expression quantification in the three species, all reads were trimmed to 50 nucleotides, and only the forward read was used when paired-end data existed. Reads were then mapped to the transcriptome using bowtie [25] with the -m 1 -v 2 parameters (requiring unique mapping and two or fewer mismatches across the full alignment). Correction for multiple mapping was done as follows: every 50-basepair window in each transcript (one per position) was mapped back against the whole transcriptome using bowtie with the same parameters. If a window mapped anywhere else in addition to its own position, it was discarded and discounted from the transcript's effective length, which starts at length − 49 (the number of 50-basepair windows). The raw read count per million mapped reads for each gene was then divided by this “effective length” to obtain corrected RPKM (cRPKM) values.
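The effective-length correction can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: window uniqueness is decided here by exact string matching on the forward strand, whereas the method described above used bowtie and tolerated up to two mismatches. The per-kilobase scaling in `crpkm` is also an assumption based on the usual RPKM definition. The function names are hypothetical.

```python
from collections import Counter

WINDOW = 50  # read length used for mapping


def effective_lengths(transcripts, window=WINDOW):
    """Count, per transcript, the windows that occur exactly once overall.

    Simplification: exact matches only, forward strand only. A transcript of
    length L has L - window + 1 windows (L - 49 for 50-nt windows).
    """
    counts = Counter()
    for seq in transcripts.values():
        for i in range(len(seq) - window + 1):
            counts[seq[i:i + window]] += 1
    return {
        name: sum(
            1
            for i in range(len(seq) - window + 1)
            if counts[seq[i:i + window]] == 1
        )
        for name, seq in transcripts.items()
    }


def crpkm(read_count, effective_length, total_mapped_reads):
    """Reads per kilobase of effective length per million mapped reads.

    Returns None when the value is undefined (no uniquely mappable window
    or no mapped reads).
    """
    if effective_length <= 0 or total_mapped_reads <= 0:
        return None
    return read_count * 1e9 / (effective_length * total_mapped_reads)
```

For example, a gene with 100 uniquely mapped reads, an effective length of 1,000 and one million mapped reads in the sample would get `crpkm(100, 1000, 1_000_000)`, i.e. 100 reads per kilobase per million.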
For human and mouse, only the canonical transcripts as defined by Ensembl (via Biomart) were used for gene expression quantification. Our cRPKM values with canonical transcripts correlate very well with those obtained by Brawand et al. [21] (r > 0.92 for all correlations between log2-scaled data), who used a much more sophisticated method that cannot be used in planarians due to the lack of similar annotations (repeat and pseudogene annotation, constitutively spliced exons, etc.).
To determine the 1:1:1 orthologs of planarian genes in mammals, we ran pairwise BLASTX comparisons of all transcripts from each species using -F F (not filtering out low-complexity regions). Only best reciprocal hits in all three pairwise comparisons were used for conservation analyses. Cluster analysis was performed using the R function hclust with default settings (“complete” method) and a Euclidean distance matrix of log2-scaled cRPKM values. Only 1:1:1 orthologs with a ≥2-fold cRPKM ratio in any of the three pairwise comparisons between planarian samples were used (1,691 genes in total).
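The ortholog selection and the fold-change filter can be sketched as below. The sketch assumes the best BLAST hit of every gene has already been extracted into one dictionary per direction. The data layout, the function names, and the handling of zero cRPKM values are illustrative assumptions and are not taken from the original analysis.

```python
from itertools import combinations


def reciprocal_best_hits(best_ab, best_ba):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    return {a: b for a, b in best_ab.items() if best_ba.get(b) == a}


def one_to_one_to_one(best):
    """Return (planarian, mouse, human) triples that are reciprocal best
    hits in all three pairwise comparisons.

    best[(x, y)] maps each gene of species x to its best hit in species y.
    """
    pm = reciprocal_best_hits(best[("planarian", "mouse")],
                              best[("mouse", "planarian")])
    ph = reciprocal_best_hits(best[("planarian", "human")],
                              best[("human", "planarian")])
    mh = reciprocal_best_hits(best[("mouse", "human")],
                              best[("human", "mouse")])
    triples = []
    for p, m in pm.items():
        h = ph.get(p)
        if h is not None and mh.get(m) == h:
            triples.append((p, m, h))
    return triples


def passes_fold_filter(values, min_fold=2.0):
    """True if any pair of samples differs by at least min_fold.

    Assumption: a pair with one zero and one positive value passes, and a
    pair of two zeros does not.
    """
    for x, y in combinations(values, 2):
        lo, hi = min(x, y), max(x, y)
        if lo == 0:
            if hi > 0:
                return True
        elif hi / lo >= min_fold:
            return True
    return False
```

Requiring agreement in all three comparisons is stricter than chaining two pairwise results: a planarian gene with a reciprocal best hit in both mouse and human is still dropped if those two mammalian genes are not each other's best hits.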
Labbé R.M., Irimia M., Currie K.W., Lin A., Zhu S.J., Brown D.D., Ross E.J., Voisin V., Bader G.D., Blencowe B.J., & Pearson B.J. (2012). A Comparative Transcriptomic Analysis Reveals Conserved Features of Stem Cell Pluripotency in Planarians and Mammals. Stem Cells (Dayton, Ohio), 30(8), 1734-1745.