Our approach uses machine learning within a novel iterative framework to predict genes with cell-lineage–specific expression on the whole-genome scale from gene expression data of tissue homogenates. This problem is especially challenging because, to work for cell lineages that cannot feasibly be microdissected experimentally, such as podocytes, our approach must function without example expression profiles of the lineage of interest.
Intuitively, our method leverages expression patterns of cell-lineage–specific genes that it discovers from whole-genome expression compendia not resolved to the cell lineage of interest. These patterns are specific to each cell lineage and are generally found in only a small subset of experimental conditions, which may include genetic, physiological, pathophysiological, environmental, or experimental states or perturbations (e.g., biopsy specimens from different patients). To discover these cell-lineage–specific expression patterns, as well as the subsets of conditions that are informative for a given cell lineage, our approach applies machine learning within an iterative probabilistic framework to combine an expert-provided standard of known cell-lineage–specific genes (positives) with example genes that are expressed in other cell lineages (negatives). However, most solid-tissue cell lineages cannot be studied experimentally in high throughput, and thus often only a few cell-lineage–specific genes are known with high accuracy (e.g., from IHC). An additional challenge is that these standards are often limited in size (especially for cell lineages not amenable to experimental microdissection) and can be of varying specificity (e.g., specific to the cell lineage within the immediate structure or within the whole organ, or defined by different experimental approaches).
Because it is experimentally infeasible to obtain pure example expression profiles for cell lineages from solid human tissues, our method must perform well even when the available standards are very limited in size and of highly varying specificity. This paucity of high-quality standards and the need to effectively leverage lower-quality or less specific examples severely limit the direct application of traditional machine learning approaches (e.g., SVM performance outside of the iterative framework is shown in Supplemental Fig. 5).
To address these challenges, we developed an iterative classification approach that continually refines both the predictive cell-lineage–specific patterns and the informative conditions based on statistical scoring and refinement (through informative subset selection) of the provided standard. This iterative approach allows the user to provide tiered standards, i.e., the investigator identifies only the relative specificity of evidence tiers (e.g., low-throughput, high-specificity approaches are more reliable than high-throughput experimental platforms with lower specificity). The in silico nanodissection method is then able to make high-accuracy predictions of cell-lineage–specific genes on the whole-genome scale and, within the tiered standard constraint, is robust to variable specificity of example cell-lineage–specific genes. The iterative strategy is necessary to allow investigators to add standards of questionable quality without dramatically compromising the quality of cell-lineage predictions. A linear SVM without this iterative approach fails when standards of lower quality are added to high-quality standards (Supplemental Fig. 5).
The researcher defines standards within tiers. Tiers represent levels of specificity (e.g., in descending order: double immunofluorescence, annotation in a literature-curated database, high-throughput protein expression). For each tier, nanodissection calculates the sum of the ranks of genes from the classifier (for the case of SVM, this is the ranked distance from the SVM hyperplane) for the positive examples (here podocyte genes) against each of the negative standards, $N_i$ (e.g., glomerular, mesangial, tubular), as $R_i = \sum_{j=1}^{n_p} r_{ij}$, where $n_p$ represents the number of positives and the ranks $r_{ij}$ are calculated from only the positive examples and the negative examples from standard $N_i$. It then computes a test statistic for this individual separation, $U_i$, for each negative standard in the standard Wilcoxon rank-sum (Mann-Whitney) form, $U_i = R_i - n_p(n_p + 1)/2$. This is normalized by converting it to a z-score using the mean and standard deviation of $U_i$ through $z_i = (U_i - \mu_i)/\sigma_i$, where $\mu_i = n_p n_i/2$, $\sigma_i = \sqrt{n_p n_i (n_p + n_i + 1)/12}$, and $n_i$ is the number of negative examples in standard $N_i$. The scores for the individual separations are then combined to provide a final score, $p$, for this tier of standards. Nanodissection automatically selects the standards resulting in the lowest $p$ (which ranges from zero to one), i.e., the set that corresponds to the best separation of positives from each negative standard.
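As an illustration of the per-tier scoring, the following minimal Python sketch computes the rank sum, the U statistic, and the z-score of the positives against each negative standard. It assumes the standard Mann-Whitney normal approximation without a tie correction, and it assumes that a larger classifier score means a higher rank. The function names (`average_ranks`, `separation_z`, `tier_z_scores`) and the toy scores are hypothetical and are not taken from the nanodissection implementation. The rule that combines the per-standard z-scores into the final tier score is not reproduced, so the sketch stops at the z-scores.

```python
import math


def average_ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    first, count = {}, {}
    for position, value in enumerate(sorted(values), start=1):
        first.setdefault(value, position)
        count[value] = count.get(value, 0) + 1
    return [first[v] + (count[v] - 1) / 2 for v in values]


def separation_z(pos_scores, neg_scores):
    """z-score of the positives' rank sum against one negative standard."""
    n_p, n_i = len(pos_scores), len(neg_scores)
    # Rank only the positives and the negatives of this one standard.
    ranks = average_ranks(list(pos_scores) + list(neg_scores))
    rank_sum = sum(ranks[:n_p])                 # R_i
    u = rank_sum - n_p * (n_p + 1) / 2          # U_i
    mean_u = n_p * n_i / 2
    # Assumption: plain normal approximation, no correction for ties.
    sd_u = math.sqrt(n_p * n_i * (n_p + n_i + 1) / 12)
    return (u - mean_u) / sd_u


def tier_z_scores(pos_scores, negative_standards):
    """One z-score per negative standard (dict: name -> classifier scores)."""
    return {name: separation_z(pos_scores, scores)
            for name, scores in negative_standards.items()}


# Toy classifier outputs (made-up numbers, not data from the study).
podocyte = [2.1, 1.8, 1.5]
negatives = {"glomerular": [0.2, -0.4, 1.6, -1.0],
             "tubular": [-0.3, -1.2, 0.1]}
z_by_standard = tier_z_scores(podocyte, negatives)
```

A larger z-score here means that the positives rank further above that negative standard; in the toy data the tubular standard is separated perfectly and the glomerular one is not.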
In certain cases, an additional (and optional) external validation gene set may be available. Because nanodissection can be applied where experimental microdissection was insufficient, these standards may contain both positives and negatives (e.g., in this case, additional microarray measurements of the renal glomerulus were available as validation). We termed genes in this standard “high-throughput-validating” genes and all other genes “nonvalidating” genes. Nanodissection can use this validation set to identify the set of standards providing the best separation of validating genes by calculating $R_v = \sum_{j=1}^{n_v} r_j$, where $r_j$ is the rank of the absolute value of the distance to the hyperplane of validating gene $j$ in a list containing the $n_v$ validating genes and the $n_{nv}$ nonvalidating genes. It then calculates $U_v$ in the standard Wilcoxon rank-sum (Mann-Whitney) form, $U_v = R_v - n_v(n_v + 1)/2$, which is then converted to a z-score, $z_v = (U_v - \mu_v)/\sigma_v$, where $\mu_v = n_v n_{nv}/2$ and $\sigma_v = \sqrt{n_v n_{nv} (n_v + n_{nv} + 1)/12}$. Finally, $p$ for validating versus nonvalidating genes is calculated from $z_v$ as a one-sided tail probability of the standard normal distribution. Selecting the standard tier that provided the lowest $p$ results in the standard where validating genes were most extreme (i.e., best separated from each other). Our results demonstrate that this approach enables us to use a non-cell-lineage–specific validation (i.e., glomerular) gene set to grade our separation of putative cell-lineage–specific (podocyte) genes by selecting the standard that places example genes at the extremes (in our example, potential podocyte genes at the top of the list and potential nonpodocyte glomerular genes at the bottom). In the case where a high-quality validation standard specific to our cell lineage of interest exists, we instead rank the distance to the hyperplane directly rather than its absolute value. In that case, $p$ would represent the one-sided Wilcoxon rank-sum p-value for a comparison of validating and nonvalidating genes. Because this iterative nanodissection approach relies on genome-scale data obtained from the surrounding compartment and because this evaluation was used to identify the optimum standards, this provides a quality measure for the resulting standard. Thus nanodissection allows us to obtain cell-lineage–specific signal from in vivo human data.
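The validation score can be sketched in the same way. The Python example below ranks genes by the absolute value of their distance to the hyperplane (or by the signed distance when `use_absolute=False`) and converts the rank sum of the validating genes into a one-sided p-value. It assumes the normal approximation to the Mann-Whitney statistic without a tie correction and an upper-tail p-value, so that validating genes at the extremes of the list give a low p. The function names, the gene labels A to F, and the distances are all hypothetical.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    first, count = {}, {}
    for position, value in enumerate(sorted(values), start=1):
        first.setdefault(value, position)
        count[value] = count.get(value, 0) + 1
    return [first[v] + (count[v] - 1) / 2 for v in values]


def validation_p(distances, validating, use_absolute=True):
    """One-sided p-value for validating versus nonvalidating genes.

    distances: dict gene -> signed distance to the hyperplane.
    validating: set of gene names in the validation standard.
    """
    genes = sorted(distances)
    values = [abs(distances[g]) if use_absolute else distances[g]
              for g in genes]
    ranks = dict(zip(genes, average_ranks(values)))
    n_v = sum(1 for g in genes if g in validating)
    n_nv = len(genes) - n_v
    rank_sum = sum(ranks[g] for g in genes if g in validating)   # R_v
    u = rank_sum - n_v * (n_v + 1) / 2                           # U_v
    mean_u = n_v * n_nv / 2
    sd_u = math.sqrt(n_v * n_nv * (n_v + n_nv + 1) / 12)
    z = (u - mean_u) / sd_u
    # Assumption: upper tail, so large ranks for validating genes give low p.
    return NormalDist().cdf(-z)


# Made-up distances: A sits at the top of the list and B at the bottom.
toy_distances = {"A": 2.0, "B": -1.8, "C": 0.1, "D": -0.2, "E": 0.3, "F": 0.05}
p_abs = validation_p(toy_distances, {"A", "B"})
p_signed = validation_p(toy_distances, {"A", "B"}, use_absolute=False)
```

With absolute distances, both extremes count as evidence, so `p_abs` is small. With signed distances, one validating gene at each end of the list cancels out and `p_signed` is 0.5, which illustrates why the absolute value suits a validation set that mixes positives and negatives.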
The nanodissection algorithm therefore proceeds as follows (for pseudocode, see Supplemental Fig. 7). Given user-supplied standards in tiers of increasing specificity, for each standard level, $k$, combine standards of that level with all standards of higher specificity levels. Apply the selected classification algorithm (here we applied SVM from the SVMperf package [Joachims 2006] using the Sleipnir library [Huttenhower et al. 2008]) and generate a ranked list of predictions. Score the predictions for $k$ as described above to calculate $p$ for the $k$th level of specificity. Select the level of specificity providing the lowest $p$.
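The selection loop described above can be sketched as follows. This is a minimal illustration, not the pseudocode of Supplemental Fig. 7. It assumes that tiers are passed from most specific to least specific and that the training, ranking, and scoring steps are wrapped in a caller-supplied function, `score_standard`, which returns $p$ for a combined standard. The names `select_tier`, `score_standard`, and `toy_score`, the placeholder gene `hypothetical_gene`, and the p-values in the toy scorer are hypothetical stand-ins, not results.

```python
from typing import Callable, List, Sequence, Set, Tuple


def select_tier(
    tiers: Sequence[Set[str]],
    score_standard: Callable[[Set[str]], float],
) -> Tuple[int, float, Set[str], List[Tuple[int, float]]]:
    """Pick the specificity level whose combined standard scores lowest.

    tiers[0] is assumed to be the most specific tier. Level k is scored
    on the union of tiers[0..k], i.e. that level plus all more specific ones.
    """
    combined: Set[str] = set()
    history: List[Tuple[int, float]] = []
    best_level, best_p, best_standard = -1, float("inf"), set()
    for level, tier in enumerate(tiers):
        combined = combined | tier
        # score_standard is expected to train the classifier on `combined`,
        # rank all genes, and return the score p described in the text.
        p = score_standard(combined)
        history.append((level, p))
        if p < best_p:  # strict '<' keeps the more specific level on ties
            best_level, best_p, best_standard = level, p, set(combined)
    return best_level, best_p, best_standard, history


def toy_score(standard: Set[str]) -> float:
    """Placeholder scorer: invented p-values keyed by standard size."""
    return {1: 0.04, 3: 0.01, 4: 0.20}[len(standard)]


toy_tiers = [{"nephrin"}, {"SYNPO", "CD2AP"}, {"hypothetical_gene"}]
level, p, standard, history = select_tier(toy_tiers, toy_score)
```

In the toy run, adding the second tier lowers the score and adding the third raises it, so the loop returns the first two tiers combined.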
In this work, standards were obtained from expert literature review. The positive podocyte-specific standard genes were required to have at least one of the following levels of evidence: immunofluorescence staining, in situ hybridization, or electron microscopy images of immunogold staining of podocytes in vivo. Two levels of specificity were evaluated. The most stringent level contained genes expressed specifically in podocytes and in no other cell types in the human kidney, referred to as podocyte-specific in kidney (as an example, see the nephrin staining pattern in Fig. 3A, I). The less stringent level contained all of the above, as well as genes that were expressed in podocytes and in no other cell types in glomeruli but that were also detected in extraglomerular cells of the kidney (synaptopodin [SYNPO] and CD2AP staining in Fig. 3A, II and III). For the majority of selected genes, evidence for disease association in human glomerular failure or murine model systems was also available. Application of nanodissection resulted in the use of both tiers of standards, which corresponded to a total of 46 genes that were both podocyte-specific and present in the gene expression data set.
Ju W., Greene C.S., Eichinger F., Nair V., Hodgin J.B., Bitzer M., Lee Y.S., Zhu Q., Kehata M., Li M., Jiang S., Rastaldi M.P., Cohen C.D., Troyanskaya O.G., & Kretzler M. (2013). Defining cell-type specificity at the transcriptional level in human disease. Genome Research, 23(11), 1862-1873.