We compared five gene-set activation metrics. Given a gene $g$, let $X_{tg}$ be the expression value (log10 fold change, relative to background) for gene $g$ in tissue $t$. Let $S$ be the set of genes in a pathway. For tissue $t$, if $\langle X_{tS} \rangle$ and $\langle X_t \rangle$ are the means of $X_{tg}$ over the genes in $S$ and over all the genes on the microarray, respectively, and $\sigma_t$ is the standard deviation of $X_{tg}$ over all the genes on the microarray, then the Z-score activation metric used to measure the relative expression level of pathway $S$ in tissue $t$ is:

$$Z_t = \frac{\left( \langle X_{tS} \rangle - \langle X_t \rangle \right) \sqrt{|S|}}{\sigma_t}$$

where $|S|$ is the number of genes in $S$. The value of $Z_t$ is expressed in units of standard deviation and measures the violation of the null hypothesis that the genes in $S$ are independently sampled from a distribution similar to that of all the genes on the microarray. If the null hypothesis is valid, then $Z_t$ will have approximately a standard normal distribution, and so a large positive value of $Z_t$ suggests collective upregulation of the genes in $S$ (which we consider to represent 'activation' of $S$) in tissue $t$; a large negative value suggests collective downregulation. The normalization by $\sqrt{|S|}$ makes comparison of different-sized gene sets possible and reflects the fact that, for larger gene sets, even a slight collective shift in fold change can be significant.
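
The Z-score metric described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `z_score_activation`, the dictionary input format, the use of the population standard deviation, and the toy values are all assumptions made for the example.

```python
import math
import statistics


def z_score_activation(expr, gene_set):
    """Z-score activation of one gene set in one tissue.

    expr: dict mapping gene name -> log10 fold change in the tissue
          (all genes on the array).
    gene_set: iterable of gene names in the pathway S.
    """
    all_values = list(expr.values())
    mean_all = statistics.fmean(all_values)
    # Assumption: population SD; the text does not say which estimator was used.
    sd_all = statistics.pstdev(all_values)
    # Assumption: genes of S that are absent from the array are ignored.
    set_values = [expr[g] for g in gene_set if g in expr]
    mean_set = statistics.fmean(set_values)
    # Mean shift of the set, in array-wide SD units, scaled by sqrt(|S|).
    return (mean_set - mean_all) * math.sqrt(len(set_values)) / sd_all


# Toy tissue with four genes (made-up values, not data from the study).
toy_expr = {"g1": 1.0, "g2": 1.0, "g3": -1.0, "g4": -1.0}
print(z_score_activation(toy_expr, ["g1", "g2"]))  # positive: induced set
print(z_score_activation(toy_expr, ["g3", "g4"]))  # negative: repressed set
```

In the toy tissue the array-wide mean is 0 and the standard deviation is 1, so a two-gene set with mean 1 scores $\sqrt{2} \approx 1.41$; the same mean shift in a larger set would score proportionally higher, which is the effect of the $\sqrt{|S|}$ factor.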
Because the Z-statistic essentially measures a shift in location (mean expression) for the genes in $S$, we compared its sensitivity to several other possible signed measures of location shift, which were created by modifying standard statistics, where necessary, with a sign to indicate the direction of expression change. The Wilcoxon Z statistic is a well-known statistic that is calculated according to a similar formula, but using the ranks of the $X_{tg}$ among all genes in tissue $t$ rather than the actual fold changes. To calculate a signed Kolmogorov-Smirnov (KS) statistic, we computed each of the two one-sided KS statistics, comparing the distribution of the expression values in $S$ with the distribution of the expression values of the genes on the microarray as a whole, and took the larger of the two statistics, with the appropriate sign. To calculate a hypergeometric (HG) p value, we used a threshold of two-fold differential expression (other threshold values showed qualitatively similar results, data not shown) to define an induced or repressed gene, and then used the hypergeometric distribution to calculate the probability that the relative enrichment of differentially expressed genes observed in a gene set in a particular tissue could have arisen by chance. To provide a sign for the hypergeometric p value, the calculation was done separately for the induced and repressed genes in each set, and the smaller of the two p values was used, together with its 'sign' (negative if repressed genes were more enriched in the gene set than induced genes, positive otherwise). The relative insensitivity of the HG metric was little changed by varying the differential expression threshold. Finally, for the PCA statistic, we calculated PC1, the first principal component of the expression values of the genes in $S$ across all tissues, and used the projection (scalar product) of the expression values in a tissue with PC1 as a measure of activation of the gene set in that tissue.
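
Three of the signed measures described above (the rank-based Z, the signed KS statistic and the signed hypergeometric p value) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. All function names are hypothetical. The rank-based score applies the same Z-type formula to average ranks, which is one reading of "a similar formula". The hypergeometric p value is taken as the upper tail probability, and its sign follows whichever direction gives the smaller p value, with ties counted as positive.

```python
import bisect
import math
import statistics

LOG10_TWO_FOLD = math.log10(2.0)  # two-fold threshold on the log10 scale


def rank_z(expr, gene_set):
    """Z-type score on ranks (assumption: average ranks for ties, population SD)."""
    genes = sorted(expr, key=expr.get)
    ranks = {}
    i = 0
    while i < len(genes):
        j = i
        while j + 1 < len(genes) and expr[genes[j + 1]] == expr[genes[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[genes[k]] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        i = j + 1
    in_set = [ranks[g] for g in gene_set if g in ranks]
    all_ranks = list(ranks.values())
    shift = statistics.fmean(in_set) - statistics.fmean(all_ranks)
    return shift * math.sqrt(len(in_set)) / statistics.pstdev(all_ranks)


def signed_ks(expr, gene_set):
    """Larger of the two one-sided KS statistics, signed by direction of shift."""
    all_vals = sorted(expr.values())
    set_vals = sorted(expr[g] for g in gene_set if g in expr)
    d_up = d_down = 0.0
    for x in all_vals:  # both empirical CDFs only change at these points
        f_all = bisect.bisect_right(all_vals, x) / len(all_vals)
        f_set = bisect.bisect_right(set_vals, x) / len(set_vals)
        d_up = max(d_up, f_all - f_set)      # set shifted towards higher values
        d_down = max(d_down, f_set - f_all)  # set shifted towards lower values
    return d_up if d_up >= d_down else -d_down


def hypergeom_tail(k, n_total, n_marked, n_drawn):
    """P(X >= k) when n_drawn genes are drawn from n_total, n_marked of them marked."""
    upper = min(n_marked, n_drawn)
    favourable = sum(
        math.comb(n_marked, i) * math.comb(n_total - n_marked, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / math.comb(n_total, n_drawn)


def signed_hypergeom_p(expr, gene_set, threshold=LOG10_TWO_FOLD):
    """Smaller of the induced/repressed enrichment p values, negative if repressed."""
    in_set = [g for g in gene_set if g in expr]
    p_values = []
    for direction in (+1, -1):  # induced genes first, then repressed genes
        marked = {g for g, x in expr.items() if direction * x >= threshold}
        hits = sum(g in marked for g in in_set)
        p_values.append(hypergeom_tail(hits, len(expr), len(marked), len(in_set)))
    p_up, p_down = p_values
    return p_up if p_up <= p_down else -p_down


# Toy tissue with six genes (made-up log10 fold changes, not data from the study).
toy_expr = {"a": 0.9, "b": 0.5, "c": 0.1, "d": -0.1, "e": -0.4, "f": -0.8}
for name, fn in [("rank Z", rank_z), ("signed KS", signed_ks),
                 ("signed HG p", signed_hypergeom_p)]:
    print(name, fn(toy_expr, ["a", "b"]), fn(toy_expr, ["e", "f"]))
```

Note that the hypergeometric measure looks only at which genes cross the threshold, so a set whose genes all shift slightly in the same direction can score strongly on the Z-type measures while leaving the hypergeometric p value unchanged.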

Levine D.M., Haynor D.R., Castle J.C., Stepaniants S.B., Pellegrini M., Mao M., & Johnson J.M. (2006). Pathway and gene-set activation measurement from mRNA expression data: the tissue distribution of human pathways. Genome Biology, 7(10), R93.