The goal of (unsupervised) gene regulatory network inference is to recover the network solely from measurements of the expression of the genes in various conditions. Given the dynamic and combinatorial nature of genetic regulation, measurements of different kinds can be obtained, including steady-state expression profiles resulting from the systematic knockout or knockdown of genes, or time series measurements resulting from random perturbations. In this paper, we focus on multifactorial perturbation data as generated for the DREAM4 In Silico Size 100 Multifactorial subchallenge. Multifactorial expression data are static steady-state measurements obtained by (slightly) perturbing all genes simultaneously. Such data might correspond, for example, to expression profiles obtained from different patients or biological replicates. They are easier and less expensive to obtain than knockout/knockdown or time series data and are thus more common in practice. They are, however, also less informative for the prediction of edge directionality [3], [26], [27] and therefore make the regulatory network inference task more challenging.
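In what follows, we define a (multifactorial) learning sample (LS) from which to infer the network as a sample of N measurements:
$$LS = \{x_1, x_2, \ldots, x_N\},$$
where $x_k \in \mathbb{R}^p$, $k = 1, \ldots, N$, is a vector of expression values of all p genes in the kth experiment:
$$x_k = (x_k^1, x_k^2, \ldots, x_k^p)^T.$$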
From this learning sample, the goal of network inference algorithms is to predict the underlying regulatory links between genes. Most network inference algorithms first produce a ranking of the potential regulatory links, from the most to the least significant. A practical network prediction is then obtained by setting a threshold on this ranking. In this paper, we focus only on the first task, which is also the one targeted by the evaluation procedure of the DREAM4 challenge. The choice of an optimal confidence threshold, although important, is left open.
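The two steps described above, ranking followed by thresholding, can be illustrated with a minimal sketch. The function names `rank_links` and `threshold_links`, the 3-gene weight matrix, and the threshold value are all hypothetical and chosen only for illustration; they are not taken from the DREAM4 data or from any particular inference method.

```python
from typing import List, Tuple

Link = Tuple[int, int, float]  # (regulator i, target j, weight)


def rank_links(weights: List[List[float]]) -> List[Link]:
    """Rank all directed links i -> j (i != j) by decreasing weight."""
    p = len(weights)
    links = [
        (i, j, weights[i][j])
        for i in range(p)
        for j in range(p)
        if i != j  # self-links are not considered here (an assumption)
    ]
    return sorted(links, key=lambda link: link[2], reverse=True)


def threshold_links(ranking: List[Link], threshold: float) -> List[Tuple[int, int]]:
    """Keep only the links whose weight reaches the chosen threshold."""
    return [(i, j) for i, j, w in ranking if w >= threshold]


# Hypothetical weights for a 3-gene network: weights[i][j] scores link i -> j.
weights = [
    [0.0, 0.9, 0.1],
    [0.2, 0.0, 0.7],
    [0.4, 0.05, 0.0],
]

ranking = rank_links(weights)
predicted = threshold_links(ranking, threshold=0.4)  # arbitrary threshold
```

With these illustrative values, the ranking starts with the link from gene 0 to gene 1, and the thresholded network keeps three directed links. Only the ranking step is evaluated in what follows; the threshold here is arbitrary.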
A network inference algorithm is thus defined in this paper as a procedure that exploits a learning sample to assign weights to putative regulatory links from any gene i to any gene j, with the aim of yielding large weights for the links that correspond to actual regulatory interactions.
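To make this definition concrete, the sketch below shows the shape of such a procedure: a function that maps a learning sample (N rows of p expression values) to a p × p matrix of link weights. The scorer used here, absolute Pearson correlation, is only a simple stand-in chosen for illustration and is not the method proposed in this paper; the function name and the toy data are hypothetical. Because correlation is symmetric, this particular stand-in cannot distinguish the link from i to j from the link from j to i.

```python
import math
from typing import List


def abs_correlation_weights(ls: List[List[float]]) -> List[List[float]]:
    """Map a learning sample (N experiments x p genes) to p x p link weights.

    Stand-in scorer: weights[i][j] = |Pearson correlation of genes i and j|.
    """
    n, p = len(ls), len(ls[0])
    # One list of N expression values per gene.
    columns = [[row[g] for row in ls] for g in range(p)]
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]

    weights = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            # Self-links and constant genes keep a weight of zero.
            if i == j or norms[i] == 0.0 or norms[j] == 0.0:
                continue
            cov = sum(a * b for a, b in zip(centered[i], centered[j]))
            weights[i][j] = abs(cov / (norms[i] * norms[j]))
    return weights


# Hypothetical learning sample: N = 4 experiments, p = 3 genes.
# Gene 1 is exactly twice gene 0; gene 2 is uncorrelated with gene 0.
ls = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 6.0],
    [4.0, 8.0, 4.0],
]
w = abs_correlation_weights(ls)
```

On this toy sample the pair of genes 0 and 1 receives the largest weight, while the pair of genes 0 and 2 receives a weight of zero. Any other scoring rule with the same input and output shapes would fit the definition equally well.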