DNA motifs are short, recurring patterns within DNA sequences that play crucial roles in gene regulation, chromatin organization, and other biological processes.
These sequence elements, often just a few base pairs in length, can serve as binding sites for transcription factors, insulators, or other regulatory proteins, influencing gene expression and chromatin structure.
Analyzing and identifying DNA motifs is a key step in understanding the complex mechanisms underlying gene regulation and cellular function.
Researchers can leverage AI-powered platforms like PubCompare.ai to streamline DNA motif analysis, locate relevant protocols, and enhance the reproducibility and accuracy of their scientific studies.
With a user-friendly interface and the power of artificial intelligence, PubCompare.ai empowers researchers to experience the future of DNA motif analysis today.
Our goals were to produce a resource that (i) contains a comprehensive collection of relevant motifs for each factor; (ii) avoids repetitive, weakly enriched motifs that do not contribute to the in vivo specificity of the factor or its partners; and (iii) excludes variants of the same motif, particularly among the discovered motifs. With this in mind, we conducted motif discovery separately on each data set using five motif discovery tools and manually placed all its data sets into ‘factor groups’ on the basis of known motifs and homology (Figure 2). Known motifs from the literature and the top 10 most enriched discovered motifs (excluding duplicates) were collected for each factor group (see Supplementary Methods) and named as TF_known# for known motifs and TF_disc# for discovered motifs, where TF denotes the factor group (e.g. FOXA, CTCF, etc.). Known motifs were ordered arbitrarily, whereas the discovered motifs were ordered in descending order of the enrichment value that was used for their selection.
Outline of motif discovery pipeline. Input regions for each data set are randomly partitioned into two groups. The top 250 regions of one of the partitions are scanned for motifs using five de novo motif discovery tools. These motifs are evaluated using the peaks from the other partition and pooled across data sets for a factor group to produce the final list of discovered motifs for each factor group.
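The partitioning step in this outline can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `split_and_select`, the `(region_id, score)` tuple layout, the equal-sized halves, and the use of a per-region score to define the "top" regions are all assumptions made for illustration.

```python
import random


def split_and_select(regions, top_n=250, seed=0):
    """Randomly split regions in two; return (discovery set, evaluation set).

    regions: list of (region_id, score) tuples. Ranking by `score` is an
    assumption; the excerpt does not say how the top regions are chosen.
    """
    rng = random.Random(seed)
    shuffled = list(regions)
    rng.shuffle(shuffled)

    # Assumed: two partitions of (nearly) equal size.
    half = len(shuffled) // 2
    part_a, part_b = shuffled[:half], shuffled[half:]

    # The highest-scoring regions of one partition go to motif discovery.
    discovery = sorted(part_a, key=lambda r: r[1], reverse=True)[:top_n]

    # The other partition is held out to evaluate the discovered motifs.
    return discovery, part_b
```

Keeping the evaluation partition disjoint from the discovery partition means enrichment is measured on regions the discovery tools never saw.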
The 427 ENCODE experiments analyzed correspond to 123 TFs, which we place into 84 factor groups (Figure 3a). We failed to discover an enriched motif for only 12 of the 84 factor groups, of which 9 lack DNA binding domains (BRF, CTBP2, HDAC8, KAT2A, NELFE, SUPT20H, SUZ12, WRNIP1 and XRCC4) as identified by UniProt (27), and 6 have all their data sets flagged as unreliable based on various quality metrics [BRF, KAT2A, NELFE, NR4A, SUPT20H and ZZZ3; see (A. Kundaje, L.Y. Jung, P.V. Kharchenko, B. Wold, A. Sidow, S. Batzoglou and P.J. Park, in preparation)]. Of these factor groups, only NR4A has a previously identified known motif.
(a) Summary of input data used. The outside ring indicates the experimental data sets (one tick for each of 427), which are separated into 123 transcription factors (second ring). The TFs are further grouped into 84 factor groups (third ring). We are able to find a matching discovered motif for 41 of the 56 factor groups with a known motif; 29 of these 41 factor groups have additional discovered motifs that may be associated with cofactors. For all but 1 of the 15 factor groups where the known motif is not recovered we still find enriched discovered motifs. We also discovered enriched motifs for 17 of the 28 factor groups without a known motif. (b) Recovery of known motifs by each of the discovery tools. Performance of discovery in terms of number of factor groups for which the known motif was recovered. A motif is considered a match if it matches any of the known motifs for a factor group (see Supplementary Methods for details on how matches are computed). The number of additional factors that have a match is shown with each additional motif (only three motifs are taken from each individual method, whereas we have up to 10 for the pipeline). The number of factor groups with no motif match is shown in parenthesis. When multiple data sets exist for a factor group, the fraction that matches is used in computing its contribution for computing the performance of the individual tools.
We exclude from the discussion below motifs that we consider unlikely to be relevant to our analysis, while maintaining them as part of the overall resource where they may be useful. These include 46 discovered motifs that are either low-complexity (e.g. dinucleotide repeats) or consistently have weak enrichment (<2) and do not match known motifs (Supplementary Table S1). These are likely a consequence of slight biases in the discovery pipeline, or are due to real, but relatively weak, specificity for the factor. We also exclude an additional 36 motifs that have a weak similarity to the known motif for the factor but for which a better matching and enriched motif is also found (Supplementary Table S2). These are most frequently seen for longer motifs that can be broken up into recognizable, but globally dissimilar, patterns that are not captured by our automatic exclusion criteria (see Supplementary Methods). Together, these represent 28% of the 293 discovered motifs.
Kheradpour P, & Kellis M. (2013). Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Research, 42(5), 2976-2987.
The motif affinity function, f(Sg, M, tm) (Eqn. 2), is used in MEA to assign a motif affinity score, Xg, to a DNA sequence, Sg. The score represents the affinity for the sequence of a DNA-binding molecule with binding motif M. The most commonly used motif affinity functions either count the number of "matches" to a motif in the DNA sequence or compute some function that represents the total binding of the TF or microRNA to the sequence. We study both of these types of affinity function and, in both cases, we represent the motif, M, by a log likelihood ratio PWM [11]. All motif PWMs were generated using a uniform background model in the denominator of the likelihood ratio. When counting matches, we use FIMO [7], which scores each position in a sequence, Sg, (on both strands) using the PWM, M, and computes the p-value of each score. (The p-value is based on a zero-order Markov model of the input sequences.) The value of the affinity function, f(Sg, M, tm), is the number of positions in the sequence with p-value less than or equal to tm, the motif score threshold. We refer to this motif affinity function as "MC" (for "match-count"). For our other motif affinity function, which estimates the total binding of the TF or microRNA represented by the motif, we use the AMA algorithm [12] to compute the average motif affinity (AMA) score of the sequence, Sg, to the motif, M [13]. The AMA score is equal to the average likelihood ratio (not the log likelihood ratio) of the sequence (on both strands). We use a minor variant of the AMA score, which we call RMA (for relative motif affinity), when computing the linear regression association function (see below). To compute RMA, we divide the AMA score by the maximum possible AMA score of a single position in any sequence. This ensures that the range of the binding affinity function is [0, 1]. No motif match threshold (tm) is required when using AMA as the motif affinity function.
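The two affinity functions described above can be sketched in plain Python. This is an illustrative approximation, not FIMO or the AMA program: all function names are hypothetical, the pseudocount and the base-2 logarithm are assumptions, and the match count uses a raw log-odds score threshold as a stand-in for the p-value threshold tm, since computing FIMO-style p-values is beyond a short example.

```python
import math

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds_pwm(freqs, background=0.25, pseudocount=1e-3):
    """Build a log2 likelihood-ratio PWM against a uniform background.

    freqs: one dict per motif position mapping each base to its frequency.
    The pseudocount (an assumption) avoids log(0) for unseen bases.
    """
    pwm = []
    for col in freqs:
        total = sum(col[b] + pseudocount for b in BASES)
        pwm.append({b: math.log2(((col[b] + pseudocount) / total) / background)
                    for b in BASES})
    return pwm


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def site_scores(seq, pwm):
    """Log-odds score of every motif-length window on both strands."""
    width = len(pwm)
    scores = []
    for strand in (seq, reverse_complement(seq)):
        for start in range(len(strand) - width + 1):
            scores.append(sum(pwm[j][strand[start + j]] for j in range(width)))
    return scores


def match_count(seq, pwm, score_threshold):
    """MC-style count; a score cutoff stands in for the p-value cutoff tm."""
    return sum(score >= score_threshold for score in site_scores(seq, pwm))


def ama(seq, pwm):
    """Average likelihood ratio (not log ratio) over all sites, both strands."""
    scores = site_scores(seq, pwm)
    if not scores:  # sequence shorter than the motif
        return 0.0
    return sum(2.0 ** score for score in scores) / len(scores)


def rma(seq, pwm):
    """AMA divided by the best possible single-site likelihood ratio."""
    max_ratio = 2.0 ** sum(max(col.values()) for col in pwm)
    return ama(seq, pwm) / max_ratio
```

Because the average over sites can never exceed the best single-site ratio, `rma` stays within [0, 1], which matches the range stated in the text.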
McLeay R.C, & Bailey T.L. (2010). Motif Enrichment Analysis: a unified framework and an evaluation on ChIP data. BMC Bioinformatics, 11, 165.
RcisTarget is a new R/Bioconductor implementation of the motif enrichment framework of i-cisTarget and iRegulon. RcisTarget identifies enriched transcription factor binding motifs and candidate transcription factors for a gene list. In brief, RcisTarget is based on two steps. First, it selects DNA motifs that are significantly over-represented in the surroundings of the transcription start site (TSS) of the genes in the gene-set. This is achieved by applying a recovery-based method on a database that contains genome-wide cross-species rankings for each motif. The motifs that are annotated to the corresponding TF and obtain a Normalized Enrichment Score (NES) > 3.0 are retained. Next, for each motif and gene-set, RcisTarget predicts candidate target genes (i.e. genes in the gene-set that are ranked above the leading edge). This method is based on the approach described by Aerts et al. (32), which is also implemented in i-cisTarget (web interface) (33) and iRegulon (Cytoscape plug-in) (34). Therefore, when using the same parameters and databases, RcisTarget provides the same results as i-cisTarget or iRegulon, benchmarked against other TFBS-enrichment tools in Janky et al. (34). More details about the method and its implementation in R are given in the package documentation. To build the final regulons, we merge the predicted target genes of each TF-module that show enrichment of any motif of the given TF. To detect repression, it is theoretically possible to follow the same approach with the negatively correlated TF modules. However, in the datasets we analyzed, these modules were less numerous and showed very low motif enrichment, suggesting that these are lower quality modules. For this reason, we finally decided to exclude the detection of direct repression from the workflow, and continue only with the positively correlated targets.
The databases used for the analyses presented in this paper are the "18k motif collection" from iRegulon (gene-based motif rankings) for human and mouse. For each species, we used two gene-motif rankings (10 kb around the TSS or 500 bp upstream of the TSS), which determine the search space around the transcription start site.
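The NES filter mentioned above can be illustrated with a short Python sketch. It assumes that a recovery-curve AUC has already been computed for every motif in the database and that the NES is the AUC standardized across all motifs; the AUC computation itself is not shown, and the function names and input dictionary are hypothetical rather than part of the RcisTarget API.

```python
import statistics


def normalized_enrichment_scores(auc_by_motif):
    """Standardize each motif's AUC against the AUCs of all motifs.

    auc_by_motif: dict mapping motif name -> recovery-curve AUC
    (hypothetical input; needs at least two motifs with differing AUCs).
    """
    values = list(auc_by_motif.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return {motif: (auc - mean) / sd for motif, auc in auc_by_motif.items()}


def enriched_motifs(auc_by_motif, nes_threshold=3.0):
    """Keep motifs whose NES exceeds the threshold (3.0 in the text)."""
    nes = normalized_enrichment_scores(auc_by_motif)
    return {motif: score for motif, score in nes.items() if score > nes_threshold}
```

A threshold of 3.0 on a standardized score keeps only motifs whose AUC lies more than three standard deviations above the database-wide mean.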
Aibar S., González-Blas C.B., Moerman T., Huynh-Thu V.A., Imrichova H., Hulselmans G., Rambow F., Marine J.C., Geurts P., Aerts J., van den Oord J., Atak Z.K., Wouters J, & Aerts S. (2017). SCENIC: Single-cell regulatory network inference and clustering. Nature methods, 14(11), 1083-1086.
The FIET [10] is an analytical computation of the Pearson χ2 P value. In particular, this calculation is important when marginal frequencies are small, which is often the case in position frequency matrices. The marginal P value of the contingency table for DNA motifs (Table 3) follows the multiple hypergeometric distribution [24]:
The formula for protein motifs is similar. The two-sided P value for the table is the sum of probabilities of all tables that are at least as extreme. This P value is computed using the algorithm described by Mehta and Patel [25 ]. As with the χ2 test, this P value is used as an additive score.
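As a rough illustration of the exact test described here, the following Python sketch computes a two-sided P value for a 2 × 4 table by brute-force enumeration. Several things are assumptions: the table layout (two rows of base counts over A, C, G, T, one row per motif column being compared), since Table 3 is not reproduced here; the reading of "at least as extreme" as "no more probable than the observed table"; and the enumeration itself, which is a simple substitute for the Mehta and Patel network algorithm and is only practical for small counts. All function names are hypothetical.

```python
from math import comb, prod


def table_probability(row1, col_totals):
    """Multivariate hypergeometric probability of a 2 x k table.

    With all margins fixed, the table is determined by its first row.
    """
    return (prod(comb(c, x) for c, x in zip(col_totals, row1))
            / comb(sum(col_totals), sum(row1)))


def enumerate_rows(col_totals, row_total):
    """Yield every first row compatible with the fixed margins."""
    if len(col_totals) == 1:
        if 0 <= row_total <= col_totals[0]:
            yield (row_total,)
        return
    for x in range(min(col_totals[0], row_total) + 1):
        for rest in enumerate_rows(col_totals[1:], row_total - x):
            yield (x,) + rest


def exact_test_p_value(row1, row2):
    """Two-sided P value: total probability of tables no more likely than observed."""
    col_totals = [a + b for a, b in zip(row1, row2)]
    p_observed = table_probability(row1, col_totals)
    total = 0.0
    for candidate in enumerate_rows(col_totals, sum(row1)):
        p = table_probability(candidate, col_totals)
        # Small relative tolerance so ties are not lost to rounding.
        if p <= p_observed * (1 + 1e-12):
            total += p
    return min(1.0, total)
```

For example, `exact_test_p_value((9, 1, 0, 0), (8, 2, 0, 0))` compares two columns that both strongly prefer A; the empty C and G columns contribute nothing to the enumeration.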
We obtained six cancer cell stemness scores calculated from mRNA expression (RNA expression-based stemness scores [RNAss], epigenetically regulated RNA expression-based stemness scores [EREG-EXPss]) and DNA methylation signatures (DNA methylation-based stemness scores [DNAss], epigenetically regulated DNA methylation-based stemness scores [EREG-METHss], differentially methylated probes-based stemness scores [DMPss], and enhancer elements/DNA methylation-based stemness scores [ENHss]) from previous studies (25), and integrated the stemness scores and gene expression data of the samples for correlation analysis. The role of ITGA8 in regulating cancer cell stemness was evaluated with Gene Ontology (GO) (26) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (27) enrichment analysis of the genes that were highly correlated with ITGA8 in Pearson's correlation analysis (r > 0.7, p < 0.05), using the "clusterProfiler" package in R.
Li X., Zhu G., Li Y., Huang H., Chen C., Wu D., Cao P., Shi R., Su L., Zhang R., Liu H, & Chen J. (2023). LINC01798/miR-17-5p axis regulates ITGA8 and causes changes in tumor microenvironment and stemness in lung adenocarcinoma. Frontiers in Immunology, 14, 1096818.
The DNA methylation data and corresponding clinical data of TGCT patients were obtained from the Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/) database using the R TCGAbiolinks package (14). All DNA methylation data were generated on the Illumina Infinium Human Methylation 450 platform, and the levels of DNA methylation were expressed as β values, calculated as M/(M + U + 100), where M and U represent the signal from methylated beads and unmethylated beads at the target CpG sites, respectively. The methylomic data from patients with complete clinicopathological information were selected. The most recent clinicopathological and follow-up information was obtained from the TCGA database on 6 January 2023. Clinical information and methylation data for a total of 128 TGCT samples were downloaded and analyzed in this study, and the samples were randomly divided into a training cohort (89 samples) and a validation cohort (39 samples) at a ratio of 7:3. The prognostic DNA methylation signature was identified from the training cohort data, and its predictive ability was evaluated on the validation cohort data. Progression-free survival was specified as the primary clinical endpoint, defined as the time between the date of diagnosis and the date of a new cancer-associated event, such as progression, local recurrence, distant metastases or death.
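The β-value formula and the 7:3 split can be made concrete with a small Python sketch. The study itself worked in R; the function names, the fixed seed, and the choice to truncate the training-set size are assumptions made here for illustration (truncating 128 × 0.7 happens to give the 89/39 split reported above, but the original rounding rule is not stated).

```python
import random


def beta_value(methylated, unmethylated, offset=100):
    """Beta value M / (M + U + 100) from methylated and unmethylated signals."""
    return methylated / (methylated + unmethylated + offset)


def split_cohort(sample_ids, train_fraction=0.7, seed=0):
    """Randomly split samples into (training, validation) cohorts."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    n_train = int(len(ids) * train_fraction)  # truncation is an assumption
    return ids[:n_train], ids[n_train:]
```

The constant offset in the denominator keeps β stable when both signals are low, at the cost of pulling values slightly away from 1 even for fully methylated sites.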
Gao F., Xu Q., Jiang Y, & Lu B. (2023). A novel DNA methylation signature to improve survival prediction of progression-free survival for testicular germ cell tumors. Scientific Reports, 13, 3759.
We searched for the short tandem repeat (STR)-signature of the employed cell product within isolated baboon DNA. Three SUR-sampled pieces of lung, liver, and spleen tissue per animal were obtained prior to isolation of genomic DNA using column-based techniques (Qiagen) and pooling of equal amounts of DNA per animal. The same procedure was performed for frozen arterial blood pellets obtained during, and 24 or 72 hours after, cell therapy and for the administered cell product. We then ran 100 ng genomic DNA per animal and sample using a highly sensitive, 17 loci-extended forensic STR analysis kit (AmpFLSTR NGM SElect) on an Applied Biosystems 3500 forensic genetic analyzer. Data were analyzed using the GeneMapper ID-X software package (all from Thermo Fisher). The STR pattern of the employed cell product, as well as baboon-specific STRs, were identified and separated from the signatures of other human DNA (laboratory, medical and veterinarian staff) contaminating the samples.
Möbius M.A., Seidner S.R., McCurnin D.C., Menschner L., Fürböter-Behnert I., Schönfeld J., Marzahn J., Freund D., Münch N., Hering S., Mustafa S.B., Anzueto D.G., Winter L.A., Blanco C.L., Hanes M.A., Rüdiger M, & Thébaud B. (2023). Prophylactic Administration of Mesenchymal Stromal Cells Does Not Prevent Arrested Lung Development in Extremely Premature-Born Non-Human Primates. Stem Cells Translational Medicine, 12(2), 97-111.
Motif analysis was mainly based on the chromVAR (38) R package. In brief, we ran the AddMotifs function to add the DNA sequence motif information required for motif analyses. We then calculated a per-cell motif activity score by running chromVAR and identified differential activity scores between cell types. Motif activity scores were normalized as z-scores, and the differential activity scores between cell types were replaced with "avg_diff." TF footprinting information was gathered with the Footprint function and plotted with the PlotFootprint function.
Yu Z., Lv Y., Su C., Lu W., Zhang R., Li J., Guo B., Yan H., Liu D., Yang Z., Mi H., Mo L., Guo Y., Feng W., Xu H., Peng W., Cheng J., Nan A, & Mo Z. (2023). Integrative Single-Cell Analysis Reveals Transcriptional and Epigenetic Regulatory Features of Clear Cell Renal Cell Carcinoma. Cancer Research, 83(5), 700-719.
The ForenSeq™ DNA Signature Prep Kit is a laboratory product designed for sample preparation prior to DNA sequencing. It enables the simultaneous amplification of multiple genetic markers for forensic DNA profiling.
The MiSeq FGx is a benchtop sequencing system designed for forensic and human identification applications. It utilizes Illumina's proprietary sequencing-by-synthesis technology to generate high-quality sequencing data. The system is capable of analyzing a variety of sample types and is suitable for use in accredited forensic laboratories.
The LightShift Chemiluminescent EMSA Kit is a laboratory tool designed to detect and analyze protein-DNA interactions. It uses chemiluminescent detection to visualize and quantify the binding of proteins to specific DNA sequences.
Sourced in United States, China, Germany, United Kingdom, Canada, Switzerland, Sweden, Japan, Australia, France, India, Hong Kong, Spain, Cameroon, Austria, Denmark, Italy, Singapore, Brazil, Finland, Norway, Netherlands, Belgium, Israel
The HiSeq 2500 is a high-throughput DNA sequencing system designed for a wide range of applications, including whole-genome sequencing, targeted sequencing, and transcriptome analysis. The system utilizes Illumina's proprietary sequencing-by-synthesis technology to generate high-quality sequencing data with speed and accuracy.
The MiSeq FGx Reagent Kit is a sequencing reagent designed for use with Illumina's MiSeq FGx forensic genomics system. It provides the necessary reagents and consumables required to perform DNA sequencing on the MiSeq FGx platform.
Sourced in United States, Germany, China, Canada, Italy, United Kingdom, Australia, Netherlands
The EZ DNA Methylation-Gold Kit is a product offered by Zymo Research for bisulfite conversion of DNA samples. It is designed to convert unmethylated cytosine residues to uracil, while leaving methylated cytosines unchanged, enabling the detection and analysis of DNA methylation patterns.
Sourced in United States, Lithuania, United Kingdom, Germany, India
The GeneJET Genomic DNA Purification Kit is a lab equipment product designed for the rapid and efficient extraction of high-quality genomic DNA from a variety of sample types. The kit uses a simple and reliable spin-column-based method to isolate DNA, which can then be used in downstream applications such as PCR, sequencing, and other molecular biology procedures.
The PyroMark Q24 2.0.6 Software is a software package designed for the analysis and interpretation of pyrosequencing data generated by the PyroMark Q24 system. It provides tools for sequence analysis, quality control, and data management.
LentiCas9-Blast is a lentiviral vector that expresses the Cas9 endonuclease from Streptococcus pyogenes and a blasticidin resistance marker. It is designed for the delivery and expression of Cas9 in target cells.
DNA motifs can exist in a variety of forms, including transcription factor binding sites, insulator elements, silencer sequences, and enhancer motifs. These different types of motifs play distinct roles in gene regulation, chromatin organization, and other biological processes. Transcription factor binding sites, for example, allow regulatory proteins to bind to specific DNA sequences and modulate gene expression, while insulator elements can block the influence of enhancers or silencers, partitioning the genome into independent regulatory domains.
Analyzing and identifying DNA motifs is crucial for understanding the complex mechanisms underlying gene regulation and cellular function. Researchers can use DNA motif information to predict gene expression patterns, identify regulatory networks, and design targeted genetic engineering strategies. For instance, knowing the specific motifs that bind to transcription factors involved in a disease pathway can help develop more effective therapies by disrupting or enhancing those regulatory interactions.
One of the key challenges in DNA motif analysis is the sheer volume of genomic data and the computational power required to sift through it. Manually searching for and comparing motifs across large datasets can be time-consuming and error-prone. Additionally, the short length and degeneracy of many motifs can make them difficult to identify with high confidence, leading to false positives or missed discoveries.
PubCompare.ai's AI-powered platform can streamline the DNA motif analysis process by helping researchers more efficiently screen protocol literature, leveraging artificial intelligence to pinpoint critical insights. The platform's AI-driven analysis can highlight key differences in protocol effectiveness, enabling researchers to choose the best options for their specific research goals and enhancing the reproducibility and accuracy of their studies. By harnessing the power of AI, PubCompare.ai empowers researchers to experience the future of DNA motif analysis today.