The dataset published by Koch et al. [4] served as the starting point for the Sensbio database. It contains a 2018 collection of TF-ligand interactions gathered from different databases and literature resources. To expand and update this dataset, data dumps detailing aTFs and their triggering compounds were collected, cleaned and formatted from the following databases: BioNemo [5], RegulonDB [6], RegPrecise [7], RegTransBase [8], Sigmol [9] and GroovDB [10].
Custom Python 3 scripts (using common libraries such as pandas and NumPy) were used to populate, clean, format and analyze the database, and to build a web application with the Streamlit framework (https://streamlit.io/). Molecular fingerprints were extracted, analyzed and compared using the RDKit Python library [11]. The NetworkX Python module was used to describe and produce the molecular network. A local BLAST+ installation was used to score and rank the protein sequences. The ETE 3 Python toolkit [12] was used to produce the phylogenetic trees of the TF sequences. The predictive model was built with deep learning techniques using the TensorFlow and Keras Python libraries.
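The fingerprint comparison and molecular network steps can be illustrated with a minimal, library-free sketch. It is not the Sensbio code: the study used RDKit and NetworkX, whereas this example represents each fingerprint as a plain set of on-bit indices and the network as a nested dictionary. The function names, the toy fingerprints and the 0.5 similarity threshold are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_network(
    fingerprints: Dict[str, Set[int]], threshold: float
) -> Dict[str, Dict[str, float]]:
    """Connect every pair of molecules whose similarity reaches the threshold."""
    network: Dict[str, Dict[str, float]] = {name: {} for name in fingerprints}
    for a, b in combinations(fingerprints, 2):
        score = tanimoto(fingerprints[a], fingerprints[b])
        if score >= threshold:
            # Undirected edge: store the score in both directions.
            network[a][b] = score
            network[b][a] = score
    return network


# Toy fingerprints with made-up bit indices (not real molecules).
toy = {
    "ligand_A": {1, 2, 3, 4},
    "ligand_B": {2, 3, 4, 5},
    "ligand_C": {10, 11},
}
net = similarity_network(toy, threshold=0.5)  # hypothetical cutoff
```

Here `ligand_A` and `ligand_B` share three of five distinct bits and are linked, while `ligand_C` remains an isolated node. With RDKit and NetworkX, the sets would be replaced by fingerprint objects and the nested dictionary by a graph object.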
The external web applications ClassyFire [13] and iFragment [14] were used to classify the molecules into chemical and metabolic categories, respectively. ClassyFire produces a hierarchical list of ontologies; the parent ontology was kept as the representative category for each molecule. iFragment, on the other hand, produces a list of KEGG [15] metabolic pathways ranked by the probability that the input compound belongs to each pathway. The three pathways with the lowest p-values were selected. Using the KEGG REST API (https://www.kegg.jp/kegg/rest/keggapi.html), the parent ontology was extracted for each pathway and assigned as the final metabolic category.
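The pathway selection step can be sketched as follows. This is only an illustration under stated assumptions: the p-values are invented, the function names are hypothetical, and the parent category lookup is a small hand-written table standing in for the KEGG REST API query. Returning the distinct categories of the three selected pathways is also an assumption, since the text does not say how multiple categories per molecule were handled.

```python
from typing import Dict, List, Tuple


def top_pathways(predictions: List[Tuple[str, float]], n: int = 3) -> List[str]:
    """Return the n pathway identifiers with the lowest p-values."""
    ranked = sorted(predictions, key=lambda item: item[1])
    return [pathway_id for pathway_id, _ in ranked[:n]]


def metabolic_categories(
    pathway_ids: List[str], parent_of: Dict[str, str]
) -> List[str]:
    """Map pathways to parent categories, keeping rank order and dropping repeats."""
    categories: List[str] = []
    for pathway_id in pathway_ids:
        parent = parent_of.get(pathway_id, "unclassified")
        if parent not in categories:
            categories.append(parent)
    return categories


# Hand-written stand-in for the parent category retrieved from KEGG.
PARENT_OF = {
    "map00010": "Carbohydrate metabolism",
    "map00020": "Carbohydrate metabolism",
    "map00230": "Nucleotide metabolism",
    "map00250": "Amino acid metabolism",
}

# Invented p-values for one hypothetical compound.
predictions = [
    ("map00230", 0.04),
    ("map00010", 0.001),
    ("map00250", 0.2),
    ("map00020", 0.01),
]

selected = top_pathways(predictions)
categories = metabolic_categories(selected, PARENT_OF)
```

In this toy case the three selected pathways collapse into two parent categories, because two of them fall under the same parent.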