SARTools (Statistical Analysis of RNA-Seq data Tools) addresses these limitations by proposing a comprehensive, easy-to-use, DESeq2- and edgeR-based R pipeline that covers all the steps of a differential analysis, from the quality control of raw count data to the detection of differentially expressed genes. It applies to experimental designs involving one biological factor with two or more levels, such as time series or KO vs. WT experiments. When more than two levels are included in the design, all pairwise comparisons are performed. A blocking factor can be specified to take into account data pairing or the presence of a batch effect (e.g. a day-of-preparation effect). However, SARTools does not handle complex experimental designs with interactions, since these require a careful definition of the design formula and of the contrasts to be tested according to the biological question under study; automating this part of the analysis is neither desirable nor safe. Users who need to analyse complex experimental designs are encouraged to use DESeq2 or edgeR directly, both of which provide extensive documentation on this kind of experiment.
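To make the pairwise rule mentioned above concrete, the following minimal Python sketch enumerates the comparisons implied by a single factor with k levels, which gives k(k-1)/2 of them. It is an illustration only and not SARTools code: the function name `pairwise_comparisons`, the `reference` argument and the "numerator vs denominator" ordering are assumptions made for this example.

```python
from itertools import combinations


def pairwise_comparisons(levels, reference=None):
    """Return every (numerator, denominator) pair of distinct condition levels.

    `levels` may contain repeats (one entry per sample); `reference` is a
    hypothetical argument that moves one level to the front so that it is
    always used as the denominator in the comparisons involving it.
    """
    ordered = list(dict.fromkeys(levels))  # unique levels, input order preserved
    if reference is not None:
        if reference not in ordered:
            raise ValueError(f"unknown reference level: {reference}")
        ordered.remove(reference)
        ordered.insert(0, reference)
    # combinations() yields (a, b) with a listed before b; report "b vs a".
    return [(b, a) for a, b in combinations(ordered, 2)]


# Example: a three-point time series with two replicates per time point.
comparisons = pairwise_comparisons(
    ["T0", "T4", "T8", "T0", "T4", "T8"], reference="T0"
)
for numerator, denominator in comparisons:
    print(f"{numerator} vs {denominator}")
```

With three levels the sketch lists three comparisons; with four levels it would list six, which shows how quickly the number of tests grows with the number of conditions.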
SARTools is composed of an R package and two R script templates that allow the analysis to be run with either DESeq2 or edgeR. Both scripts rely on the functions of the corresponding package as often as possible, and on SARTools functions to export figures and tables and to generate the HTML report. Each script starts with a section of about 15 parameters that refer to (i) paths to input files and the working directory where the analysis will be performed, (ii) project identification, (iii) experimental design, (iv) normalization and statistical test, (v) filtering process and (vi) plotting. Parameters (i) to (iii) have to be adapted to each analysis. The other parameters have default values and can be left unchanged, but remain accessible to advanced users who wish to tune the analysis or the reporting more finely.
SARTools requires two types of input files: count data files containing raw counts and a target file that describes the experimental design [13]. Count data files are sample-specific and are composed of two columns (a unique feature identifier and a raw feature count) with no header. Note that the alignment and counting steps are outside the scope of SARTools and have to be carried out beforehand using dedicated tools. For instance, HTSeq-count output files can be used as input [14]. The target file contains one row per sample and at least three columns with headers: a unique sample identifier or label, the name of the associated raw counts file and the biological condition of the sample (see Table 1). If a blocking factor has to be accounted for (e.g. in case of batch effect or paired samples), it is reported in a fourth column. These input files are read by SARTools to build a matrix of integer values in which the entry at the intersection of the i-th row and the j-th column is the number of reads mapped to feature i in sample j. This matrix is then used as input for DESeq2 or edgeR.
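The way the target file and the per-sample count files combine into a features-by-samples matrix can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SARTools implementation (which is written in R): the column names `label`, `files` and `group`, the tab delimiter, the helper names, and the choice to fill a feature missing from a sample with 0 are all assumptions made for this example.

```python
import csv
import io


def read_counts(handle):
    """Parse a header-less, two-column (feature, raw count) tab-separated file."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # tolerate blank lines
            continue
        feature, count = row
        if feature in counts:
            raise ValueError(f"duplicated feature identifier: {feature}")
        counts[feature] = int(count)
    return counts


def build_count_matrix(target_rows, open_counts):
    """Return (features, labels, matrix) with matrix[i][j] = count of feature i in sample j.

    `target_rows` holds one dict per sample; `open_counts` maps a file name
    to a readable handle. Features absent from a sample are set to 0
    (an assumption of this sketch).
    """
    labels = [row["label"] for row in target_rows]
    per_sample = [read_counts(open_counts(row["files"])) for row in target_rows]
    features = sorted(set().union(*per_sample))
    matrix = [[sample.get(f, 0) for sample in per_sample] for f in features]
    return features, labels, matrix


# Toy inputs held in memory so that the example is self-contained.
TARGET = "label\tfiles\tgroup\nWT1\twt1.txt\tWT\nKO1\tko1.txt\tKO\n"
FILES = {
    "wt1.txt": "geneA\t10\ngeneB\t0\n",
    "ko1.txt": "geneA\t3\ngeneB\t7\ngeneC\t1\n",
}

target_rows = list(csv.DictReader(io.StringIO(TARGET), delimiter="\t"))
features, labels, matrix = build_count_matrix(
    target_rows, lambda name: io.StringIO(FILES[name])
)

print("feature", *labels, sep="\t")
for feature, row in zip(features, matrix):
    print(feature, *row, sep="\t")
```

Passing a function that opens the count files, rather than hard-coded paths, keeps the sketch independent of any directory layout; with real data it could be replaced by a call to `open` on the directory that contains the count files.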
The source code of the package and instructions to quickly install it are available on GitHub (https://github.com/PF2-pasteur-fr/SARTools). Fig 1 describes the successive steps of the workflow and provides the names of the scripts and R functions corresponding to each step. Furthermore, Galaxy wrappers to integrate SARTools into a Galaxy instance [15–17] are available on the Galaxy Tool Shed of the Institut Français de Bioinformatique at http://toolshed.france-bioinformatique.fr/view/lgueguen/sartools_1_1_0. Galaxy is known to be very user-friendly for biologists and allows them to create workflows to process RNA-Seq data. Many tools were already available for the cleaning, mapping and counting steps, and SARTools now offers the possibility to run the differential analysis step within the Galaxy environment.