WR approach (24), previously exploited in the
SFmap web server for mapping SF binding sites (23). The mapping algorithm considers the clustering propensity of the
binding sites and the overall tendency of regulatory regions to be conserved (24). In RBPmap we have improved the algorithm
by adding new features, including the ability to map PSSM motifs,
conservation-based filtering to reduce the rate of false-positive predictions and a
new background model that is specific to different genomic regions, namely intronic
regions flanking the splice sites, internal exons, exons in 5′ and
3′ UTRs, non-coding RNAs and mid-intron/intergenic regions (a
detailed description of the RBPmap algorithm is given in Supplementary file 1). A
pipeline summarizing the RBPmap algorithm is shown in Figure
either a consensus sequence or a PSSM) and a query sequence (Figure
motif at each position in the sequence in overlapping windows (Figure
background that is calculated specifically for each motif, filtering out all matches
below a significance threshold (default P-value < 0.005)
(Figure
is employed to calculate the multiplicity score, which reflects the propensity of
suboptimal motifs (default P-value < 0.01) to cluster around
the significant motif in a window of 50 nt, weighted by their match to the motif of
interest (24) (Figure
scores are compared to a background model that is calculated independently for each
motif for the relevant genomic region. A Z-score is calculated for each WR score and
converted to a P-value, which represents the probability of obtaining
a Z-score at least this high under a one-tailed normal distribution. RBPmap requires
that the final WR score of a site be significantly greater (with
P-value < 0.05) than the mean score calculated for the
background in order to consider this site a predicted binding site (Figure
provides more accurate and specific thresholds for the different regulatory regions
on the RNA (see above). For sequences from genomes other than human, mouse or
Drosophila, the WR scores are compared to a theoretical
threshold instead of the genome-specific background model, which cannot be obtained
for these genomes (see Supplementary file 1). This threshold is calculated for each motif separately,
according to its length and complexity (23).
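
The significance test described above can be illustrated with a minimal sketch. It assumes that the background model for a given motif and genomic region is summarized by a mean and a standard deviation of WR scores. The function names and the example numbers are hypothetical and are not taken from the RBPmap implementation; only the one-tailed normal test and the default cutoff of 0.05 come from the text.

```python
import math


def z_score(wr_score, bg_mean, bg_std):
    """Standardize a WR score against the background for one motif and region.

    bg_mean and bg_std are assumed summaries of the background model
    (hypothetical inputs, not RBPmap's actual data structures).
    """
    if bg_std <= 0:
        raise ValueError("background standard deviation must be positive")
    return (wr_score - bg_mean) / bg_std


def one_tailed_p_value(z):
    """Upper-tail probability P(Z >= z) under a standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def is_predicted_site(wr_score, bg_mean, bg_std, alpha=0.05):
    """Return True if the WR score is significantly above the background mean."""
    z = z_score(wr_score, bg_mean, bg_std)
    return one_tailed_p_value(z) < alpha


# Illustrative values only: a score two standard deviations above the mean.
print(is_predicted_site(wr_score=2.0, bg_mean=1.0, bg_std=0.5))
```

Because the test is one-tailed, only scores above the background mean can reach significance; a score equal to the mean gives a P-value of 0.5.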
As a final stage, we added conservation-based filtering to the WR approach,
which exploits the tendency of regulatory regions to be evolutionarily conserved. The
conservation filter is optional and is applied only to sites that are mapped to
mid-intron/intergenic regions on the query sequence. These sites are removed
from the results if the mean conservation score of their environment is lower than
the mean conservation score calculated for intronic regulatory regions (Figure
conservation information is retrieved from the UCSC phyloP conservation table (28), based on the conservation of all placental
mammals. For Drosophila sequences we use the phastCons insect
conservation table (28). Both the
position-specific background model and the conservation filtering are applied only
to motifs that are searched in human, mouse or Drosophila sequences.
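
The conservation filter can be sketched as follows, under stated assumptions. Per-position conservation scores are assumed to be available as a plain list, each site is a dictionary with a region label and coordinates, and the "environment" of a site is taken to be the site plus a fixed flank on each side. The flank size, the region label string and the threshold value are hypothetical; the text specifies only that the reference is the mean conservation of intronic regulatory regions.

```python
from statistics import mean


def mean_conservation(scores, start, end, flank):
    """Mean conservation over [start, end) extended by `flank` positions per side."""
    lo = max(0, start - flank)
    hi = min(len(scores), end + flank)
    return mean(scores[lo:hi])


def conservation_filter(sites, scores, threshold, flank=10):
    """Drop mid-intron/intergenic sites whose environment is poorly conserved.

    Sites in any other region are kept regardless of conservation.
    `threshold` stands in for the mean conservation score of intronic
    regulatory regions (an assumed, precomputed value).
    """
    kept = []
    for site in sites:
        if site["region"] == "mid_intron_intergenic":
            env = mean_conservation(scores, site["start"], site["end"], flank)
            if env < threshold:
                continue  # below the reference conservation: filtered out
        kept.append(site)
    return kept


# Illustrative data: a poorly conserved stretch followed by a conserved one.
scores = [0.1] * 20 + [0.9] * 20
sites = [
    {"region": "mid_intron_intergenic", "start": 5, "end": 10},
    {"region": "mid_intron_intergenic", "start": 25, "end": 30},
    {"region": "internal_exon", "start": 5, "end": 10},
]
filtered = conservation_filter(sites, scores, threshold=0.5, flank=5)
print(len(filtered))
```

In this toy input, only the first site is removed: it lies in a mid-intron/intergenic region and its environment falls below the threshold, whereas the exonic site at the same coordinates is exempt from the filter.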