Uniformly Processed Cistrome DB Data

The data in these public databases were produced by numerous laboratories, and the processed results were derived using a variety of algorithms. To improve the consistency of Cistrome DB data, raw DNA sequence data for each sample were downloaded and uniformly processed by the ChiLin pipeline (22), which uses BWA (23) to map reads to the hg38 or mm10 genome and MACS2 (24) to identify statistically significant peaks. Raw data were downloaded as SRA files from NCBI at ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/, and FASTQ files were extracted from the SRA files using the fastq-dump software (https://ncbi.github.io/sra-tools/fastq-dump.html). Motif scanning was also performed on transcription factor and chromatin regulator ChIP-seq samples, based on enrichment of the motif sequence relative to the center of the peaks (25). Target genes were predicted from ChIP-seq peaks using the regulatory potential model, which weights the impact of each peak by an exponential decay of its distance to the gene transcription start site (TSS) (26). Additional information about these data can be found on the Cistrome DB document page at http://cistrome.org/db/#/documents.
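The order of the processing steps described above (SRA to FASTQ, alignment, peak calling) can be sketched as a list of command lines. This is only an illustration under stated assumptions: ChiLin's actual invocations and parameters are not given in the text, the run ID `SRR_EXAMPLE`, the index name `hg38.fa`, and the `work` directory are hypothetical placeholders, single-end reads are assumed, and `bwa mem` is one of several BWA modes that could have been used. Nothing is executed; the function only builds the argument lists.

```python
from pathlib import Path


def build_pipeline_commands(run_id, genome_index, macs_genome="hs", workdir="work"):
    """Return the steps for one run as (argv, stdout_path) pairs.

    Nothing is executed here; the caller decides how to run each step.
    All file names are hypothetical and assume single-end reads.
    """
    work = Path(workdir)
    sra = work / f"{run_id}.sra"
    fastq = work / f"{run_id}.fastq"
    sam = work / f"{run_id}.sam"
    return [
        # Convert the downloaded SRA archive to FASTQ.
        (["fastq-dump", "--outdir", str(work), str(sra)], None),
        # Align reads; bwa mem writes SAM to stdout, so record the target file.
        (["bwa", "mem", genome_index, str(fastq)], str(sam)),
        # Call peaks; "hs" and "mm" are MACS2's genome-size shortcuts.
        (["macs2", "callpeak", "-t", str(sam), "-f", "SAM",
          "-g", macs_genome, "-n", run_id, "--outdir", str(work)], None),
    ]


steps = build_pipeline_commands("SRR_EXAMPLE", "hg38.fa")
```

Each pair could then be passed to `subprocess.run`, redirecting standard output to the recorded file when one is given. For a mouse sample, `macs_genome="mm"` with an mm10 index would be the corresponding choice.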
Cistrome DB data quality controls include six metrics, representing DNA sequencing quality, ChIP quality, and genomic distribution characteristics. Read quality is based on the median FASTQ read quality, mapping quality is measured by the percentage of reads that map to a unique genomic locus, and the PCR bottleneck coefficient (PBC) is used to estimate the rate of read duplication through PCR amplification (27,28). The fraction of non-mitochondrial reads in peak regions (FRiP) and the number of peaks with 10-fold enrichment are used to reflect the quality of the ChIP experiment (27,28). A union of DNase hypersensitive sites (Union DHS) was compiled from a large collection of DNase-seq samples in the Cistrome DB (19,29). The percentage of peaks that overlap with the Union DHS is used to characterize data quality based on the genomic distribution of the peaks. Although most TFs and chromatin-associated factors tend to bind at DHS sites, some histone marks and factors do not follow this trend. Cutoffs were determined based on the distribution of these quality control metrics in the Cistrome DB (22); for each metric, a red dot marks a sample as lower quality and a green dot marks it as higher quality (Figure 1). These QC measures are meant to guide users in their appraisal of data, rather than to categorize samples strictly as pass or fail. Although the Cistrome DB includes some samples that appear to be of poor quality by several metrics, these samples may nevertheless hold valuable clues to some aspect of regulatory biology not represented by other samples in the database.
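Two of these metrics reduce to simple ratios, sketched below. The text does not spell out the formulas, so this assumes the commonly used ENCODE-style definition of PBC (genomic locations covered by exactly one uniquely mapped read, divided by locations covered by at least one) and treats a read as "in a peak" when its position falls inside a peak interval. The input layouts and the mitochondrial chromosome names `chrM` and `MT` are hypothetical choices for illustration.

```python
from collections import Counter


def pcr_bottleneck_coefficient(read_positions):
    """PBC = locations covered by exactly one read / locations covered by any read.

    Each element identifies where a uniquely mapped read starts,
    e.g. a (chrom, position, strand) tuple.
    """
    counts = Counter(read_positions)
    if not counts:
        raise ValueError("no reads supplied")
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


def frip(reads, peaks, mito_names=("chrM", "MT")):
    """Fraction of non-mitochondrial reads whose position lies inside any peak.

    reads: iterable of (chrom, position); peaks: list of (chrom, start, end),
    with half-open intervals [start, end).
    """
    kept = [(c, p) for c, p in reads if c not in mito_names]
    if not kept:
        raise ValueError("no non-mitochondrial reads")
    in_peaks = sum(
        1 for c, p in kept
        if any(pc == c and start <= p < end for pc, start, end in peaks)
    )
    return in_peaks / len(kept)
```

The linear scan over peaks keeps the example short; a real data set would need an indexed lookup. A PBC close to 1 indicates few duplicated positions, and a lower value suggests more PCR duplication.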


Zheng R., Wan C., Mei S., Qin Q., Wu Q., Sun H., Chen C.H., Brown M., Zhang X., Meyer C.A., & Liu X.S. (2018). Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Research, 47(Database issue), D729-D735.

Publication 2018

Chip, Chip seq, Chromatin, Dnase, Dump, Gene transcription start site, Genes, Genomic, Histone marks, Hypersensitive, Mitochondrial, Samples collection, Transcription factor

Corresponding Organization : Dana-Farber Cancer Institute

Other organizations : Harvard University

Protocol cited in 121 other protocols

Variable analysis

independent variables

Raw DNA sequence data for each sample was downloaded and uniformly processed by the ChiLin pipeline, which uses BWA to map reads to the hg38 or mm10 genomes and MACS2 to identify statistically significant peaks.
Motif scanning was also performed on transcription factor or chromatin regulator ChIP-seq samples based on enrichment of the motif sequence relative to the center of the peaks.
Target genes were predicted from ChIP-seq peaks using the regulatory potential model which weighs the impact of each peak by exponential decay of distance to gene transcription start site (TSS).
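The distance weighting in the regulatory potential model can be illustrated with a small function. This is a sketch of the general idea only: the exact formula and decay constant of the published model are not given in the text, so the form exp(-d / decay_distance) and the default `decay_distance` of 10,000 bp are assumptions chosen for illustration.

```python
import math


def regulatory_potential(tss, peak_centers, decay_distance=10_000.0):
    """Sum of per-peak weights that decay exponentially with distance to the TSS.

    tss and peak_centers are genomic coordinates on the same chromosome.
    decay_distance is a hypothetical scale parameter in base pairs.
    """
    return sum(math.exp(-abs(p - tss) / decay_distance) for p in peak_centers)
```

Under this form a peak at the TSS contributes a weight of 1, and a peak one decay distance away contributes about 0.37, so nearby peaks dominate a gene's score.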

dependent variables

Statistically significant peaks identified by MACS2.
Enrichment of motif sequence relative to the center of the peaks.
Impact of each peak on target gene prediction based on distance to gene transcription start site (TSS).

control variables

DNA sequencing quality metrics (median FASTQ read quality, percentage of reads that map to a unique genomic locus, PCR bottleneck coefficient (PBC)).
ChIP quality metrics (fraction of non-mitochondrial reads in peak regions (FRiP), number of peaks with 10-fold enrichment).
Genomic distribution characteristics (percentage of peaks that overlap with the union of DNase hypersensitive sites (Union DHS)).
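The Union DHS overlap metric can be sketched as an interval-overlap count. The source does not say how overlap is defined, so this example assumes half-open intervals [start, end) and counts a peak as overlapping if it shares at least one base with any DHS site; the input layout of (chrom, start, end) tuples is likewise hypothetical.

```python
from bisect import bisect_left


def percent_peaks_in_dhs(peaks, dhs_sites):
    """Percentage of peaks sharing at least one base with any DHS interval."""
    if not peaks:
        raise ValueError("no peaks supplied")
    # Group DHS intervals by chromosome.
    by_chrom = {}
    for chrom, start, end in dhs_sites:
        by_chrom.setdefault(chrom, []).append((start, end))
    # Merge overlapping or touching intervals so each list is sorted and disjoint.
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    hits = 0
    for chrom, start, end in peaks:
        starts, ends = merged.get(chrom, ([], []))
        # Last merged interval that begins before the peak ends.
        i = bisect_left(starts, end) - 1
        if i >= 0 and ends[i] > start:
            hits += 1
    return 100.0 * hits / len(peaks)
```

Merging the DHS intervals first means a single binary search per peak is enough, because among disjoint sorted intervals the last one starting before the peak's end is the only candidate that can reach the peak.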
