The data in these public databases were produced by numerous laboratories, and the processed results were derived using a variety of algorithms. To improve the consistency of Cistrome DB data, raw DNA sequence data for each sample was downloaded and uniformly processed by the ChiLin pipeline (22 (link)), which uses BWA (23 (link)) to map reads to the hg38 or mm10 genomes and MACS2 (24 (link)) to identify statistically significant peaks. The raw data of SRA file was downloaded from NCBI at ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/. We obtained FASTQ files from SRA files using the fastq-dump software (https://ncbi.github.io/sra-tools/fastq-dump.html). Motif scanning was also performed on transcription factor or chromatin regulator ChIP-seq samples based on enrichment of the motif sequence relative to the center of the peaks (25 (link)). Target genes were predicted from ChIP-seq peaks using the regulatory potential model which weighs the impact of each peak by exponential decay of distance to gene transcription start site (TSS) (26 (link)). Additional information about these data can be found on the Cistrome DB document page at http://cistrome.org/db/#/documents.
Cistrome DB data quality controls include six metrics, representing DNA sequencing quality, ChIP quality, and genomic distribution characteristics. Read quality is based on the median FASTQ read quality, mapping quality is measured by the percentage of reads that each map to a unique genomic locus, and the PCR bottleneck coefficient (PBC) is used to estimate the rate of read duplication through PCR amplification (27 (link),28 (link)). The fraction of non-mitochondrial reads in peak regions (FRiP) and the number of peaks with 10-fold enrichment are used to reflect the quality of the ChIP experiment (27 (link),28 (link)). A union of DNase hypersensitive sites (Union DHS) was summarized using a large collection of DNase-seq samples from the Cistrome DB (19 (link),29 (link)). The percentage of peaks that overlap with the union of DHS sites is used to characterize the data quality based on the genomic distribution of the peaks. Although most TFs and chromatin associated factors tend to bind at DHS sites, some histone marks and factors do not follow this trend. Cutoffs were determined based on the distribution of these quality control metrics in the Cistrome DB (22 (link)), and a red dot indicates data with lower quality on a metric while a green dot indicates higher quality of a sample (Figure 1). These QC measures are meant to guide users in their appraisal of data, instead of being used strictly to categorize samples as pass or fail. Although the Cistrome DB includes some samples which appear to be of poor quality by several metrics, these samples may nevertheless hold valuable clues to some aspect of regulatory biology not represented by other samples in the database.