Thirty gene sets from FactorBook were selected for the comparison of motif discovery tools (Fig. 2D, Table S1). These gene sets were chosen because the motif of the ChIP'ped TF was detected as the top enriched motif in the top 500 peaks in FactorBook. For each set, we extracted the 200 genes with the highest peaks in the 20 kb region around their TSS. The tools were compared on TF and motif recovery using the parameters indicated in Table S3. Parameters were left at their defaults; where a tool allowed it, we adjusted only the parameters needed to allow for larger upstream regions (choosing TSS ±10 kb when possible). iRegulon was compared to eight other publicly available motif enrichment tools, namely OPOSSUM [117] (link), DIRE [80] (link), [112] (link), PASTAA [32] (link), [113] (link), PSCAN [114] (link), Clover [16] (link), AME [118] (link), Allegro [115] (link) and HOMER2 [116] (link) (in the case of HOMER2, de novo and known motif discovery are performed simultaneously, but we consider them as different approaches and validate them separately). We selected these tools because most of them take a set of human co-expressed genes as input, and all of them return, at least to some extent, information on which TF could be regulating the input genes. For this reason, it is not feasible to compare iRegulon with classical de novo motif discovery methods (e.g., MEME-like methods): such methods are intractable on large human gene sets (e.g., 200 genes × 20 kb × 10 species represents a sequence set of 40 Mb), and they return new motifs rather than candidate TFs. We also attempted to use SMART [119] (link) but did not succeed in running the software. For tools that require regulatory sequences as input (AME and Clover), we used the same sequences as used by iRegulon. For some tools, such as Clover, it is theoretically possible to use a large search space, but a single run on one dataset takes too long (∼17 hours); we therefore limited the analysis to 500 bp promoter sequences.
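The gene-selection step described above (the 200 genes with the highest peaks in the 20 kb region around the TSS) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Peak` and `Gene` records, the function name `top_genes_by_peak`, the use of peak summits, and the toy coordinates are all assumptions made for illustration, and "highest peak" is interpreted here as the largest peak height whose summit falls within ±10 kb of the TSS.

```python
from collections import namedtuple

# Hypothetical record types; the field names are assumptions for this sketch.
Peak = namedtuple("Peak", "chrom summit height")
Gene = namedtuple("Gene", "name chrom tss")


def top_genes_by_peak(genes, peaks, flank=10_000, n_top=200):
    """Rank genes by the highest peak whose summit lies within +/- flank bp of the TSS."""
    # Group peaks per chromosome so each gene only scans its own chromosome.
    by_chrom = {}
    for peak in peaks:
        by_chrom.setdefault(peak.chrom, []).append(peak)

    best = {}
    for gene in genes:
        heights = [
            peak.height
            for peak in by_chrom.get(gene.chrom, [])
            if abs(peak.summit - gene.tss) <= flank
        ]
        if heights:  # genes without any peak in the window are skipped
            best[gene.name] = max(heights)

    # Sort by decreasing peak height; ties are broken alphabetically by gene name.
    ranked = sorted(best, key=lambda name: (-best[name], name))
    return ranked[:n_top]


# Toy data, invented for illustration only.
genes = [
    Gene("A", "chr1", 50_000),
    Gene("B", "chr1", 200_000),
    Gene("C", "chr2", 10_000),
]
peaks = [
    Peak("chr1", 55_000, 12.0),
    Peak("chr1", 195_000, 30.0),
    Peak("chr1", 500_000, 99.0),  # far from every TSS, so it is ignored
    Peak("chr2", 30_000, 50.0),   # 20 kb from gene C, outside the default window
]
selected = top_genes_by_peak(genes, peaks, n_top=2)
print(selected)
```

With the default ±10 kb window, gene C is excluded because its only peak lies 20 kb from the TSS; widening `flank` to 20 000 bp would bring it in.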
In the case of AME, we found no positive results with a large search space (data not shown), so we show the results obtained with the default search space. As the comparison measure, we used the number of gene sets for which the expected motif/TF was found at the top position (top 1) and within the top 5 positions. The total number of detected motifs was not used for comparison, because some tools apply more stringent thresholds than others. All these tools rely on the available motif annotation, such as Jaspar (J) or Transfac (T), to identify the candidate TF. In addition, we manually re-associated the detected motifs with candidate TFs, mainly by comparing each detected motif with the FactorBook motif (see column “USING SIMILARITY” in Table S3). For HOMER2, 14 motifs that are derived from ENCODE ChIP-Seq data matching the actual FactorBook ChIP-Seq data were discarded from its in-house PWM collection to avoid over-fitting (indeed, iRegulon does not include FactorBook PWMs either, nor do any of the other tools). Note that for the other large-scale analyses (e.g., the full ENCODE analysis), we use a command-line version of iRegulon.
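The top 1 / top 5 recovery measure described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name `recovery_counts`, the dictionary layout, and the example rankings are assumptions, and the ranked lists are invented rather than taken from Table S3.

```python
def recovery_counts(rankings, expected, cutoffs=(1, 5)):
    """Count gene sets whose expected TF appears within the top k ranked candidates."""
    counts = {k: 0 for k in cutoffs}
    for gene_set, true_tf in expected.items():
        # A tool that returned nothing for a gene set contributes an empty ranking.
        ranked = rankings.get(gene_set, [])
        for k in cutoffs:
            if true_tf in ranked[:k]:
                counts[k] += 1
    return counts


# Invented example: the ChIP'ped TF per gene set, and one tool's ranked candidates.
expected = {"set1": "CTCF", "set2": "GATA1", "set3": "SRF"}
rankings = {
    "set1": ["CTCF", "YY1"],           # recovered at position 1
    "set2": ["TAL1", "SP1", "GATA1"],  # recovered at position 3
    "set3": ["ELK1"],                  # not recovered
}
counts = recovery_counts(rankings, expected)
print(counts)  # {1: 1, 5: 2}
```

Counting at fixed rank cutoffs, rather than over all reported motifs, keeps the comparison independent of each tool's own significance threshold.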