The RSEQtools suite contains a set of modules to perform a large variety of tasks including the quantification of expression values, manipulation of gene annotation sets, visualization of the mapped reads, generation of signal tracks, the identification of transcriptional active regions and several auxiliary utilities (Supplementary Table S1).
Genome annotation tools: to generate a splice junction library from any annotation set, we extract the genomic sequences of all the exons and synthetically create all splice junctions specified in the annotation set. This splice junction library can be used in combination with the reference sequences. A second tool is particularly useful when estimating expression levels. In order to capture the information of the various transcript isoforms, a ‘gene model’ is required. The module mergeTranscripts collapses the transcript isoforms into a single gene model by either taking the union or intersection of the exonic nucleotides.
Quantification of gene expression: one of the key features of RNA-Seq is the quantification of expression at different levels. Hence, a key module calculates the gene expression values for a given annotation set and a collection of mapped reads in MRF format. The annotation set specifies which ‘elements’ will be quantified. The program mrfQuantifier calculates RPKM (reads per kilobase per million mapped reads) values at the nucleotide level (Mortazavi et al., 2008 ). Briefly, for a given entry in the annotation set (typically an exon or gene model), the number of nucleotides from all the reads that overlap with this annotation entry are added up and then this count is normalized by sequence length of the annotation entry (per killobase) and by the total number of mapped nucleotides (per million). This calculation is not performed at the transcript level, which requires a more sophisticated analysis (Guttman et al., 2010 (link); Trapnell et al., 2010 (link)).
Visualization of mapped reads: the RSEQtools package also contains various tools for visualizing the results in genome browsers, by means of wiggle (WIG) and bedGraph files, which are commonly used to represent a signal track of mapped reads. Also, a GFF file can be generated from MRF files to visualize splice junction reads (example in Fig. 1).
Identification of transcriptionally active regions (TARs): transcribed regions can be identified de novo by performing a maxGap/minRun segmentation (Kampa et al., 2004 (link); Royce et al., 2005 (link)) from the signal files using the wigSegmenter program. Briefly, the signal is first thresholded to identify transcribed elements. Contiguous elements whose distance is less than ‘maxGap’ are joined together and then filtered if the final size is less than ‘minRun’. This type of analysis is particularly useful in discovering novel TARs such as small RNAs, etc.
MRF selection and auxiliary utilities: lastly, RSEQtools includes a set of utilities to easily manipulate MRF files and a collection of format conversion tools allowing for rapid pipeline development.
Implementation and run time: the modules of the RSEQtools suite were implemented in C and the code was optimized in order to efficiently handle large datasets. The importance of code scalability cannot be overemphasized in a time where datasets become increasingly large and easily exceed several gigabytes. For example, the conversion of an ELAND export file (uncompressed file size: ∼4 GB; total number of reads: ∼20 million; number of mapped reads: ∼12 million) to MRF takes ∼2 min and the resulting MRF file is significantly smaller (∼400 MB uncompressed, ∼130 MB compressed with gzip). Converting the same ELAND export file to SAM generates a file of ∼3.1 GB (uncompressed) and the corresponding BAM file has a size of ∼1.2 GB. The subsequent quantification of gene expression using mrfQuantifier requires 45 s to calculate estimates for about 20 000 genes.
In addition, the modularity of RSEQtools also enables the development of additional programs in any programming language and their seamless integration into this framework. Finally, most modules use STDIN and STDOUT to process the data, making them suitable to be integrated into an automated pipeline.