sRNA-seq libraries from Arabidopsis thaliana, Oryza sativa, and Zea mays were obtained from the NCBI Sequence Read Archive (SRA) (Supplemental Material, Table S1). Libraries were selected if they had >5 million raw reads, were available in an unprocessed format, and were derived from an Illumina instrument. 3′-Adapter sequences were identified using find_3p_adapter.pl (available at http://sites.psu.edu/axtell/) and removed using ShortStack’s internal adapter trimming protocol. Simulated sRNA-seq libraries were produced to closely emulate real sRNA-seq data, using a custom Python script and wrapper run under default settings: sRNA-simulator.py (File S1). This script uses a real sRNA-seq library as the basis for each simulated library. Real sRNA-seq libraries were aligned using bowtie (Langmead et al. 2009), reporting all alignments. Regions of the genome that had no alignments were removed from consideration as simulated loci, while genomic regions prone to alignments with certain length classes of sRNAs became candidate regions for simulated heterochromatic siRNA (hc-siRNA; 23–24 nt) and trans-acting siRNA (tasiRNA; 21 nt) loci. miRNA candidate regions were picked based on previously annotated loci available through miRBase (Kozomara and Griffiths-Jones 2014). Simulated loci were chosen from these candidate regions at random. Five million reads were then generated from these simulated loci, yielding roughly 3.25 M hc-siRNA, 1.5 M miRNA, and 250 k tasiRNA reads. Loci were made to approximate real loci in size and pattern: hc-siRNA as primarily 24-nt RNAs from 200- to 1000-nt loci, from both genomic strands; miRNA as 21-nt RNAs from 125-nt loci with a miRNA and miRNA* pattern; tasiRNA as 21-nt RNAs from 140-nt loci producing a number of phased reads, from both genomic strands. All three locus types produced a realistic distribution of differently sized or shifted reads to simulate misprocessing.
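The read budget and random locus selection described above can be illustrated with a minimal Python sketch. It is not the code of sRNA-simulator.py: the function names, the seed, and the percentage table are assumptions, with the percentages derived from the stated totals (3.25 M, 1.5 M, and 250 k out of 5 M reads).

```python
import random

# Assumed class shares, derived from the totals given in the text:
# 3.25 M / 5 M = 65%, 1.5 M / 5 M = 30%, 250 k / 5 M = 5%.
CLASS_PERCENT = {"hc-siRNA": 65, "miRNA": 30, "tasiRNA": 5}


def allocate_reads(total_reads=5_000_000, class_percent=CLASS_PERCENT):
    """Split a total read count among sRNA classes using integer percentages."""
    return {name: total_reads * pct // 100 for name, pct in class_percent.items()}


def choose_loci(candidate_regions, n_loci, seed=0):
    """Pick n_loci candidate regions uniformly at random, without replacement.

    The seed is a hypothetical parameter, added only for reproducibility.
    """
    rng = random.Random(seed)
    return rng.sample(candidate_regions, n_loci)


counts = allocate_reads()
# Hypothetical candidate regions as (chromosome, start, end) tuples.
candidates = [("Chr1", start, start + 500) for start in range(0, 50_000, 1_000)]
loci = choose_loci(candidates, n_loci=10)
```

Integer percentages are used instead of floating-point fractions so that the per-class counts sum exactly to the requested total.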
Sequencing errors were simulated at a rate of one mis-sequenced base per 10,000 reads. Unlike real data, simulated reads are traceable to their loci of origin and can therefore be used to distinguish correct placements from incorrect ones. PolyA+ mRNA-seq data were obtained from SRA (Table S1). Reference genome versions were TAIR10 (A. thaliana), IRGSP7 (O. sativa), and B73v3 (Z. mays).
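The stated error rate of one mis-sequenced base per 10,000 reads can be illustrated with a short Python sketch. This is one possible interpretation, not the implementation in sRNA-simulator.py: it assumes that each read independently receives a single base substitution with probability 1/10,000, and all function and parameter names are hypothetical.

```python
import random

BASES = "ACGT"


def mutate_one_base(read, rng):
    """Return a copy of `read` with one randomly chosen base substituted."""
    pos = rng.randrange(len(read))
    # Only bases that differ from the original are allowed as replacements.
    alternatives = [base for base in BASES if base != read[pos]]
    return read[:pos] + rng.choice(alternatives) + read[pos + 1:]


def add_sequencing_errors(reads, per_read_rate=1e-4, seed=0):
    """Substitute one base in each read with probability `per_read_rate`.

    The default of 1e-4 corresponds to one error per 10,000 reads (assumed
    to mean an independent per-read probability).
    """
    rng = random.Random(seed)
    return [
        mutate_one_base(read, rng) if rng.random() < per_read_rate else read
        for read in reads
    ]


# Hypothetical input: 100 identical 21-nt reads.
example_reads = ["ACGTACGTACGTACGTACGTA"] * 100
all_mutated = add_sequencing_errors(example_reads, per_read_rate=1.0)
unchanged = add_sequencing_errors(example_reads, per_read_rate=0.0)
```

Under this assumption, a library of five million reads would be expected to contain about 500 reads with an error.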