The fourth module either runs a BLAST search of the input contigs/singlets against the Swiss-Prot database to retrieve the corresponding UniProt accession numbers for the organism of interest, or searches the input lists of proteins/genes, metabolites and drugs that are already linked to their own or related UniProt accession numbers. In both cases the identifiers are used as queries against our Global Protein-Metabolite-Gene-Drug Interaction Database (GPMGDID) to build the networks (Figure 1.4). GPMGDID was built as a MySQL database by pooling more than 1 million interactions from nine publicly available databases: BioGRID [9], IntAct [10], DIP [11], MINT [13], HPRD [14], DrugBank [15], HMDB [17], YMDB [18] and ECMDB [19], all of which are queried monthly for updates. There are five parameter classes to select in this module: the organism, the network configuration, the score cutoff, the two-hybrid parameters and the expression analysis. IIS works with diverse organism datasets that can be chosen independently for the input dataset (project) and the GPMGDID, which also enables the construction of networks with interactions between different organisms (e.g. host-pathogen interactions) or based on orthology relationships. The network configuration parameters set the level of expansion (from first to third neighbors), whether metabolites and drugs from GPMGDID are added during the expansion, whether nodes with a connectivity degree of 0 or 1 are deleted (yielding a more connected network), and the background organism for the enrichment analysis. The score cutoff parameters filter the network for higher-confidence interactions using three types of score: the Class score, the FSW score and the p-value, which are described in more detail in the following sections.
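The neighbor expansion and the degree-based pruning of the network configuration can be illustrated with a minimal Python sketch. It is not the IIS implementation: the function names, the in-memory edge list (standing in for a GPMGDID query), the choice to keep only edges incident to the nodes reached at each level, and the single pruning pass are all assumptions made for illustration.

```python
from collections import defaultdict


def expand_neighbors(edges, seeds, level):
    """Collect edges reached within `level` steps of the seed nodes.

    `edges` is an iterable of (node_a, node_b) pairs standing in for an
    interaction database query (hypothetical input, not the GPMGDID schema).
    Returns a set of frozenset({a, b}) undirected edges.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    visited = set(seeds)
    frontier = set(seeds)
    kept = set()
    for _ in range(level):
        next_frontier = set()
        for node in frontier:
            for neighbor in adjacency[node]:
                kept.add(frozenset((node, neighbor)))
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return kept


def drop_low_degree(nodes, edges):
    """Remove nodes with connectivity degree 0 or 1 in a single pass.

    Assumption: one pass only; nodes whose degree falls below 2 as a
    result of the removal are not pruned again.
    """
    degree = {n: 0 for n in nodes}
    for edge in edges:
        for n in edge:
            degree[n] = degree.get(n, 0) + 1
    keep = {n for n, d in degree.items() if d >= 2}
    kept_edges = {e for e in edges if e <= keep}
    return keep, kept_edges


if __name__ == "__main__":
    toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C"), ("D", "E")]
    second = expand_neighbors(toy_edges, {"A"}, level=2)
    nodes = set().union(*second)
    print(drop_low_degree(nodes, second))
```

In this toy network, expanding seed A to the second neighbors reaches D through a single edge, so D is removed by the pruning step while the A-B-C triangle is kept.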
The filters that reduce the network size are applied in the following order: (i) Class score, (ii) p-value, (iii) deletion of nodes with connectivity degree of 0 and 1, and (iv) FSW score. The two-hybrid parameters let a user working with two-hybrid or immunoprecipitation techniques connect a bait of interest to the newly identified preys. Finally, for omics datasets, the expression analysis parameters let the user set cutoff values to color the input nodes as up- or down-regulated and to scale the node sizes according to their fold change in expression/concentration levels. For the enrichment analysis, the program calculates the enrichment of GO biological processes and KEGG pathways in the generated network using the hypergeometric distribution [45]. The exact and approximated hypergeometric distributions were implemented in the interactome algorithm using the gamma and log-gamma functions, respectively, to compute the factorials. The approximated version was necessary to avoid the stack overflow caused by large factorials [46]; empirical tests showed that the transition from the exact to the approximated function occurs for GO terms or KEGG pathways with more than 1,800 related proteins in the GPMGDID database.
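To make the log-gamma route concrete, the following Python sketch computes a hypergeometric enrichment p-value entirely in log space, so that no large factorial is ever formed. It is an illustration under stated assumptions rather than the IIS code: the function names are hypothetical, the p-value is taken as the upper tail P(X >= k), and only the log-gamma variant is shown (the switch between exact and approximated forms is not reproduced).

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k) via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_pmf(k, N, K, n):
    """P(X = k): k annotated proteins among n drawn from N, K annotated overall."""
    return exp(log_choose(K, k) + log_choose(N - K, n - k) - log_choose(N, n))


def enrichment_p_value(k, N, K, n):
    """Upper-tail p-value P(X >= k) for a term (assumed tail, see lead-in).

    N: proteins in the background organism
    K: background proteins annotated with the GO term or KEGG pathway
    n: proteins in the network
    k: network proteins annotated with the term
    """
    lower = max(k, n - (N - K), 0)  # smallest overlap that is possible
    upper = min(K, n)               # largest overlap that is possible
    total = sum(hypergeom_pmf(i, N, K, n) for i in range(lower, upper + 1))
    return min(1.0, total)


if __name__ == "__main__":
    # Small case that can be checked by hand: (C(4,2)*C(6,1) + C(4,3)) / C(10,3)
    print(enrichment_p_value(2, N=10, K=4, n=3))
    # Term with 2,000 annotated proteins, beyond the 1,800 mentioned above
    print(enrichment_p_value(50, N=20000, K=2000, n=100))
```

For the small case the result is 40/120, i.e. 1/3. The second call involves factorials of numbers up to 20,000, which stay representable because only their logarithms are added and subtracted before the final exponentiation.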
This module generates an XGMML file containing all the annotations and metrics described below. The file can be visualized directly on our web server using Cytoscape Web [47] (Figure 1.4) or imported into the Cytoscape platform [48]. Cytoscape is open-source software that enables the visualization of all interactions (or defined subgroups of interactions) and the analysis and correlation of node and edge properties with topological network statistics, using a set of core modules and external plugins. The information in the XGMML file has been standardized so that it can be read by these plugins.
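As a rough picture of what such a file contains, the Python sketch below serializes a tiny network to XGMML with the standard library. The attribute names (`fold_change`, `fsw_score`), the accession-like node identifiers and the `build_xgmml` function are hypothetical; the actual set of annotations written by IIS is not reproduced here.

```python
import xml.etree.ElementTree as ET

XGMML_NS = "http://www.cs.rpi.edu/XGMML"


def _add_attributes(parent, attrs):
    """Attach <att> children; floats are typed 'real', everything else 'string'."""
    for name, value in attrs.items():
        att_type = "real" if isinstance(value, float) else "string"
        ET.SubElement(parent, "att",
                      {"name": name, "type": att_type, "value": str(value)})


def build_xgmml(label, nodes, edges):
    """Return an XGMML document as a string.

    nodes: dict mapping node id -> dict of node attributes (hypothetical names)
    edges: iterable of (source_id, target_id, attribute_dict)
    """
    graph = ET.Element("graph",
                       {"label": label, "directed": "0", "xmlns": XGMML_NS})
    for node_id, attrs in nodes.items():
        node = ET.SubElement(graph, "node", {"id": node_id, "label": node_id})
        _add_attributes(node, attrs)
    for source, target, attrs in edges:
        edge = ET.SubElement(graph, "edge",
                             {"source": source, "target": target,
                              "label": f"{source} - {target}"})
        _add_attributes(edge, attrs)
    return ET.tostring(graph, encoding="unicode")


if __name__ == "__main__":
    xml_text = build_xgmml(
        "demo network",
        {"P12345": {"fold_change": 2.5}, "Q67890": {"fold_change": -1.2}},
        [("P12345", "Q67890", {"fsw_score": 0.8})],
    )
    print(xml_text)
```

Storing every annotation as a typed `att` element on the node or edge is what lets Cytoscape and its plugins map values such as fold change to visual properties like node color or size.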