Sequences were aligned to the MG1655 genome (NC_000913.2) using the CLC Genomics Workbench. Mapped reads were piled up and written to a .gff file using a custom Python script and viewed in SignalMap (Nimblegen). All ChIP-seq images presented in this study were captured from SignalMap and edited in the image editing software GIMP to highlight baselines (zero reads) and fill gaps in the data resulting from image artifacts.
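The pile-up step can be illustrated with a minimal sketch. This is not the script used in the study: the function names, the per-position output, the `coverage` feature type, and the `pileup` source label are all assumptions made for illustration, and the exact GFF layout expected by SignalMap is not specified here. Reads are assumed to be given as 1-based, inclusive (start, end, strand) tuples.

```python
def pile_up(reads, genome_length):
    """Count reads covering each position, separately for each strand.

    `reads` is an iterable of (start, end, strand) tuples with 1-based,
    inclusive coordinates (an assumed input format). Index 0 of each
    returned list is unused so that list index equals genome position.
    """
    coverage = {"+": [0] * (genome_length + 1), "-": [0] * (genome_length + 1)}
    for start, end, strand in reads:
        for pos in range(start, end + 1):
            coverage[strand][pos] += 1
    return coverage


def gff_lines(coverage, seqid="NC_000913.2", source="pileup"):
    """Yield one nine-column GFF line per covered position.

    The read depth is stored in the score column. The feature type
    "coverage" and the source label are hypothetical placeholders.
    """
    for strand, depth in coverage.items():
        for pos in range(1, len(depth)):
            if depth[pos] > 0:
                fields = [seqid, source, "coverage", str(pos), str(pos),
                          str(depth[pos]), strand, ".", "."]
                yield "\t".join(fields)


# Toy example: three short reads on a 5-bp "genome".
toy_reads = [(1, 3, "+"), (2, 4, "+"), (3, 3, "-")]
for line in gff_lines(pile_up(toy_reads, 5)):
    print(line)
```

For a full bacterial genome, an array-based implementation would be considerably faster than the nested loops shown here; the loops are kept for readability.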
Almost all ChIP-seq analysis programs have been designed and optimized for eukaryotic ChIP-seq data and, in our experience, do not perform well with bacterial ChIP-seq data. We have generated custom Python scripts to identify peaks in bacterial ChIP-seq data. First, all datasets were normalized to 100 million reads. Pairs of replicate datasets were considered together. For each replicate dataset in the pair, an appropriate threshold was determined. The plus and minus strands were considered separately. For the first replicate, for a given strand, a value T1 was selected as the threshold. For the second replicate, a value T2 was selected as the threshold. Values for T1 and T2 were considered between 1 and 1000. For each combination of values for T1 and T2, the number of genome positions with values ≥ T1 in the first replicate and with values ≥ T2 in the second replicate was determined. The false discovery rate was estimated using the null hypothesis that no regions are enriched. The combination of thresholds yielding the highest number of true positive positions, with an estimated false discovery rate of less than 0.01, was selected. Once T1 and T2 were chosen, peak calling was performed as previously described (Supplementary Material of [54]). Briefly, a region was identified as a peak if both replicates showed enrichment above the corresponding thresholds for each strand. For a peak to be called there must be a peak on the plus strand within a threshold distance of a peak on the minus strand, as previously described (Supplementary Material of [54]). To identify regions of artifactual enrichment, peaks identified in tagged strains were compared to those called in a control ChIP-seq experiment using an untagged strain (DMF35). For each factor, the calculated T values were adjusted to reflect the total number of reads in control experiment replicates and then applied for peak calling in the controls. Any regions for which a peak was called in the true ChIP-seq experiment and in the untagged control experiment within 50 bp of each other were considered potential artifacts and excluded from further analysis.
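The normalization, threshold search, and control filtering described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the scripts used in the study. In particular, the text does not specify how the false discovery rate was computed; the sketch assumes that, under the null hypothesis of no enrichment, the two replicates are independent, so the expected number of positions passing both thresholds by chance is N × p1 × p2, where p1 and p2 are the fractions of positions passing in each replicate. All function names are hypothetical, and peaks are represented as single genome coordinates for simplicity.

```python
import numpy as np


def normalize(depth, total_reads, target=100_000_000):
    """Scale per-position read depth to a dataset of `target` total reads."""
    return np.asarray(depth, dtype=float) * (target / total_reads)


def select_thresholds(rep1, rep2, candidates=range(1, 1001), max_fdr=0.01):
    """Grid-search (T1, T2) for one strand of a replicate pair.

    ASSUMPTION: the FDR is estimated as expected / observed, where the
    expected count treats the replicates as independent under the null.
    `candidates` must be in ascending order.
    Returns (T1, T2, estimated true positives, estimated FDR), or None.
    """
    rep1, rep2 = np.asarray(rep1), np.asarray(rep2)
    n = rep1.size
    best = None
    for t1 in candidates:
        pass1 = rep1 >= t1
        p1 = pass1.mean()
        if p1 == 0:
            break  # no position passes; larger T1 values cannot pass either
        for t2 in candidates:
            pass2 = rep2 >= t2
            observed = np.count_nonzero(pass1 & pass2)
            if observed == 0:
                break  # larger T2 values cannot recover any positions
            expected = n * p1 * pass2.mean()
            fdr = expected / observed
            true_pos = observed - expected
            if fdr < max_fdr and (best is None or true_pos > best[2]):
                best = (t1, t2, true_pos, fdr)
    return best


def drop_artifacts(peaks, control_peaks, window=50):
    """Remove peaks lying within `window` bp of any untagged-control peak."""
    control = np.asarray(sorted(control_peaks))
    return [p for p in peaks
            if control.size == 0 or np.min(np.abs(control - p)) > window]
```

The exhaustive search recomputes a full-genome comparison for every threshold pair, which is slow for a genome of several million positions; caching the per-threshold masks or counts would be an obvious optimization. Ties between threshold pairs are resolved in favor of the first pair encountered, which is an arbitrary choice made for this sketch.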
Fitzgerald D.M., Bonocora R.P., & Wade J.T. (2014). Comprehensive Mapping of the Escherichia coli Flagellar Regulatory Network. PLoS Genetics, 10(10), e1004649.