After the reads have been aligned to the genome, read regions are defined. A read region is a contiguous span of overlapping reads. Only reads with fewer than five hits to the genome are considered when defining read regions. Read regions that are shorter than 160 nucleotides and do not overlap a repeat region or a tRNA are then used as candidate loci to be tested as possible miRs.
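The region-building step described above can be sketched as follows. This is a minimal illustration, not the program's actual code: the `Interval` type, the function names, the half-open coordinate convention, and the decision to ignore strand are all assumptions made for the example. Reads that merely touch without sharing a base are treated as non-overlapping here, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical genomic interval: 0-based start, exclusive end."""
    chrom: str
    start: int
    end: int


def overlaps(a, b):
    """True if the two intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def build_read_regions(reads, max_hits=5):
    """Merge overlapping reads into read regions.

    reads: iterable of (Interval, genome_hit_count) pairs.
    Only reads with fewer than max_hits genomic hits are used.
    """
    kept = sorted(
        (iv for iv, hits in reads if hits < max_hits),
        key=lambda iv: (iv.chrom, iv.start, iv.end),
    )
    regions = []
    for iv in kept:
        last = regions[-1] if regions else None
        if last is not None and last.chrom == iv.chrom and iv.start < last.end:
            # Extend the current region to cover this overlapping read.
            regions[-1] = Interval(last.chrom, last.start, max(last.end, iv.end))
        else:
            regions.append(iv)
    return regions


def candidate_loci(regions, masked, max_len=160):
    """Keep regions shorter than max_len that avoid repeat/tRNA annotations.

    masked: list of Interval objects for repeat regions and tRNAs.
    """
    return [
        r for r in regions
        if (r.end - r.start) < max_len
        and not any(overlaps(r, m) for m in masked)
    ]
```

Keeping region construction separate from candidate filtering means the full set of read regions stays available for later steps that need the regions excluded here.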
Our approach to identifying microRNAs from high-throughput sequencing reads is to compute a set of quantities for each candidate locus; by applying a threshold to each quantity, we define a space of values that contains the microRNA loci.
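The thresholding idea can be illustrated with a short sketch. The quantity names, cutoff values, and comparison directions below are placeholders invented for the example and are not the defaults used in this analysis; a locus is accepted only if every quantity falls on the permitted side of its cutoff.

```python
import operator

# Hypothetical thresholds: quantity name -> (comparison, cutoff).
# Names, directions, and values are illustrative assumptions only.
EXAMPLE_THRESHOLDS = {
    "five_prime_heterogeneity": (operator.ge, 0.5),
    "aapd": (operator.le, 2.0),
    "average_genome_hits": (operator.le, 3.0),
}


def passes_thresholds(quantities, thresholds):
    """Return True if every thresholded quantity satisfies its cutoff.

    quantities: dict mapping quantity name -> computed value for one locus.
    """
    return all(
        compare(quantities[name], cutoff)
        for name, (compare, cutoff) in thresholds.items()
    )


def predict_positive(loci, thresholds):
    """Return the names of loci whose quantities lie inside the accepted space.

    loci: dict mapping locus name -> dict of computed quantities.
    """
    return [name for name, q in loci.items() if passes_thresholds(q, thresholds)]
```

Because each quantity carries its own comparison operator, tightening or relaxing a single cutoff changes the stringency without touching the rest of the filter.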
A key challenge for the program is to designate all read products on a potential hairpin as corresponding to miR/miR*, moR/moR*, and/or loops, because the program relies on this information to test whether the products are consistent with miRNA biogenesis. Once candidate loci are folded, all reads that overlap the locus are grouped to define 'products', and these products are then identified as miR, moR, or loop products according to Figure S1 in Additional file
Many of the quantities we consider pertain to the structure of the hairpin and the positions of reads. Three are used to evaluate the spacing and distribution of products: the distance between a miR and a moR on the same arm of the hairpin, the offset of the 5' positions of products that overlap by at least 2 nucleotides on the same arm of the hairpin, and the offset of overlapping products on opposite arms of the hairpin. The 5' heterogeneity, defined as the fraction of reads within the miR product with the same 5' position as the predominant splice variant of this product, is evaluated for the most abundant miR product. Furthermore, we define the AAPD as the average distance between overlapping sense and antisense products, and apply this measure across all sense products that overlap antisense products. Additionally, the minimum number of base pairs per nucleotide for either a miR or miR* product is used to evaluate the locus.
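As one worked example of these quantities, the 5' heterogeneity of a product could be computed along the following lines. The input representation (a list of distinct read variants with their 5' position, 3' position, and read count) and the function name are assumptions for illustration; the predominant variant is taken here to be the variant with the highest read count.

```python
def five_prime_heterogeneity(variants):
    """Fraction of reads sharing the 5' position of the predominant variant.

    variants: iterable of (five_prime_pos, three_prime_pos, read_count)
    tuples, one per distinct read sequence within the product.
    """
    variants = list(variants)
    total = sum(count for _, _, count in variants)
    if total == 0:
        raise ValueError("product contains no reads")
    # The predominant variant is assumed to be the most abundant one.
    top_five_prime, _, _ = max(variants, key=lambda v: v[2])
    same_start = sum(count for start, _, count in variants if start == top_five_prime)
    return same_start / total
```

For a product with variants `(10, 32, 60)`, `(10, 31, 20)`, `(11, 32, 15)`, and `(9, 32, 5)`, 80 of the 100 reads start at position 10, so the function returns 0.8.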
Two additional quantities take into account information from the sequencing data outside the candidate locus under consideration. The average number of hits to the genome for reads within the most abundant miR product is evaluated as an additional level of repeat filtering. Finally, after producing a list of predicted positive loci using the above measures, we define the non-miR-neighbor-count as the number of read regions within a ±1-kb window surrounding the locus in question that do not overlap a predicted positive locus. All read regions, including those overlapping repeat regions, tRNAs, and those longer than 160 nucleotides, are considered for this calculation.
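The non-miR-neighbor-count can be sketched as below. The tuple representation `(chrom, start, end)` with half-open coordinates, the function names, and the choice to extend the window 1 kb beyond each edge of the locus (rather than from its midpoint) are assumptions for this example. The locus in question is assumed to be included in the list of predicted positive loci, so its own read region is not counted.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def non_mir_neighbor_count(locus, read_regions, positive_loci, window=1000):
    """Count read regions near the locus that overlap no predicted positive locus.

    read_regions: all read regions, including those overlapping repeats or
    tRNAs and those longer than 160 nucleotides.
    """
    chrom, start, end = locus
    neighborhood = (chrom, max(0, start - window), end + window)
    return sum(
        1
        for region in read_regions
        if overlaps(region, neighborhood)
        and not any(overlaps(region, positive) for positive in positive_loci)
    )
```

A high count suggests that the locus sits in a region of widespread, non-miR read coverage, which is why the unfiltered set of read regions is used as input.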
Each of these quantities has user-defined thresholds that can be adjusted to meet the desired level of stringency of the predictions. The default values used in this analysis are summarized in Table S1 in Additional file