piPipes comprises five pipelines designed to analyze small RNA-seq, RNA-seq, degradome- and CAGE-seq, ChIP-seq or genome-seq data. The small RNA-seq pipeline reports the abundance, length distribution, nucleotide composition and 5′-to-5′ distance (‘Ping-Pong’ signature) of piRNAs assigned to genomic annotations, including individual transposon families and piRNA clusters, the initial sources of piRNA precursor transcripts. The RNA-seq pipeline reports the normalized abundance of transcripts from both genes and transposons. The degradome-seq pipeline offers methods to identify piRNA-directed cleavage products. This pipeline can also be used to analyze any long RNA sequencing method designed to define RNA 5′ ends, e.g. CAGE-seq. The ChIP-seq pipeline uses the widely used peak-calling algorithm MACS2 (Zhang et al., 2008 (link)), focusing on piRNA clusters and transposons. The genome-seq pipeline detects novel transposition events as well as structural variation. Supplementary Figure S1 illustrates the general piPipes workflow, using the small RNA-seq pipeline as an example. First, all reads aligned to ribosomal RNA (rRNA) sequences are removed. The remaining reads are then mapped to microRNA (miRNA) hairpin sequences to quantify the abundance and 5′- and 3′-end heterogeneity of mature miRNAs. Reads that do not match rRNAs or miRNAs are then mapped to the reference genome. piPipes next assigns reads to different genomic features (e.g., transposons, piRNA clusters and genes) by their coordinates. To achieve maximal speed, piPipes parallelizes this step on multiple threads using ParaFly software from the Trinity package (Grabherr et al., 2011 (link)). For the reads assigned to each genomic feature, piPipes draws publication-quality graphs of length distribution and nucleotide composition, as well as the distance between the 5′ ends of two small RNAs from opposite strands of the same locus, a standard method for detecting piRNA ‘Ping-Pong’ amplification or siRNA phasing (Fig. 1A and Supplementary Fig. S1B). Furthermore, piPipes generates a table that summarizes the number of unique and multiple mappers counted as species (distinct sequences) or reads. The RNA-seq pipeline also starts with rRNA removal. The remaining reads are then mapped to the genome using STAR (Dobin et al., 2013 (link)). piPipes quantifies transcript abundance from genomic alignment by both Cufflinks (Trapnell et al., 2010 (link)) and HTSeq-count (Anders et al., 2014 (link)). In addition, direct mapping of the reads to the transcriptome is performed using Bowtie2 followed by eXpress quantification (Roberts and Pachter, 2013 (link)). Degradome-seq and CAGE-seq share the same pipeline because both methods aim to characterize the 5′ ends of RNAs. This pipeline discards reads that can only be mapped to the genome via soft clipping of their 5′ ends (i.e. the prefixes of these reads do not map to the genome). The alignment procedure is otherwise similar to that used for RNA-seq data. The nucleotide composition for each genomic feature is calculated as in the small RNA pipeline (Supplementary Fig. S3B). The ChIP-seq pipeline aligns the ChIP and input libraries to the genome using Bowtie2. piPipes calls peaks using MACS2 (Zhang et al., 2008 (link)), which supports both narrow (such as transcription factors) and broad (such as histone 3 trimethyl lysine 9, H3K9me3) peaks. Transcription start site, transcription end site and metagene analyses of different genomic features are implemented by bwtool (Pohl and Beato, 2014 (link)). The genome-seq pipeline applies different algorithms, including BreakDancer (Chen et al., 2009 (link)), RetroSeq (Hormozdiari et al., 2010 (link); Keane et al., 2013 (link)) and TEMP (Zhuang et al., 2014 (link)), to discover transposon insertion, deletion and other structural variation events (Supplementary Fig. S5). piPipes uses a Circos plot (Zhang et al., 2013 (link)) to represent the variant loci discovered by each algorithm across different chromosomes (Fig. 1D).

Gallery of piPipes Figures (A) Barplot representing length distribution of Drosophila w1 ovary small RNAs assigned to sense (blue) and antisense (red) strands of transposons. (B) Scatterplot comparing w1 to aubHN2/QC42 Drosophila ovary RNA-seq reads assigned to mRNA (NM; red), non-coding RNA (NR; green) and transposons (blue). (C) Metagene plot of H3K9me3 ChIP-seq of piRNA clusters from flies in which piwi mRNA was depleted by double-stranded RNA-triggered RNA driven by a triple Gal4 driver (SRX215630). (D) Circos plot representing the locations of, from the periphery to the center, cytological position, piRNA clusters, SV discovered by TEMP (tiles), retroSeq (tiles) and VariationHunter (links) using genomic sequencing of 2–4-day-old ovaries

The small RNA-seq, RNA-seq and ChIP-seq pipelines can each be run in two modes, allowing analysis of a single sample or a pair of samples. The dual-sample mode uses the output from the single-sample mode and performs pair-wise comparison as illustrated by balloonplots and scatterplots (Supplementary Fig. S1C and D). The comparison can be performed on miRNA, piRNA or mRNA. Figure 1B illustrates a scatterplot showing the mRNA abundance in an RNA-seq dataset analyzed by the RNA-seq pipeline in the dual-sample mode. The dual-sample mode of the RNA-seq pipeline also uses Cuffdiff (Trapnell et al., 2013 (link)) to perform differential analysis on genic transcripts. In the dual-sample mode, the ChIP-seq pipeline uses MACS2 to identify differentially enriched loci (Supplementary Fig. S4).