Since some aspects of transcriptional heterogeneity can be driven by genes that are poorly represented, or not described at all, by the annotated pathways, PAGODA incorporates into the overall analysis de novo gene sets that group genes showing correlated patterns of expression across the cells measured in a particular dataset. By default, PAGODA implements a straightforward clustering procedure: hierarchical clustering is performed using the Ward method (as implemented by the hclust function in R) with a Pearson correlation distance on the normalized expression matrix (the same matrix that is used for the weighted PCA step described above). The resulting dendrogram is cut to obtain a predefined number of de novo gene clusters (the results shown use 150 clusters). As there are many alternative methods for clustering co-expressed genes, the PAGODA implementation provides parameters for using alternative clustering procedures.
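The default clustering step can be sketched as follows. This is a minimal illustration in Python with SciPy, not PAGODA's own R code: the function name `de_novo_gene_clusters`, the toy matrix and the choice of three clusters are assumptions made for the example. SciPy's `ward` linkage is used as a stand-in for R's hclust, and the two may differ in which Ward variant they apply to a correlation distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def de_novo_gene_clusters(expr, n_clusters):
    """Group genes (rows of expr) into at most n_clusters co-expression clusters."""
    # Pearson correlation distance between genes: 1 - r
    dist = pdist(expr, metric="correlation")
    # Ward linkage on the precomputed (condensed) distance matrix
    tree = linkage(dist, method="ward")
    # Cut the dendrogram to obtain a predefined number of clusters
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy data (hypothetical): 3 co-expression modules of 10 genes each, 50 cells
rng = np.random.default_rng(0)
modules = rng.normal(size=(3, 50))
expr = np.repeat(modules, 10, axis=0) + 0.5 * rng.normal(size=(30, 50))

labels = de_novo_gene_clusters(expr, n_clusters=3)
```

Here `labels[i]` is the cluster assigned to gene `i`. In the analysis described above, the same cut would be made at 150 clusters on the normalized expression matrix.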
Since de novo gene clusters are purposefully selected to contain genes with correlated expression profiles, the amount of variance explained by the first principal component (the magnitude of λ1) will be higher than expected from random matrices, and cannot be modeled by the same Tracy-Widom F1 distribution as the previously annotated gene sets. To evaluate the statistical significance of overdispersion, a background distribution of λ1 was generated by performing the same hierarchical clustering and weighted PCA procedure on randomized matrices (where cell order was randomized for each gene independently, 100 randomizations). The λ1 values were normalized relative to the Tracy-Widom F1 expectation using the mean and variance of λ1 predicted by the Tracy-Widom F1 distribution, with coefficients a and b determined by a linear model. The resulting standardized residual was modeled using a Gumbel extreme value distribution, the parameters of which were fit using the extRemes package in R. The overdispersion P value for each de novo gene set was determined from the tails of that distribution. The subsequent procedures treated de novo gene sets and annotated gene sets in the same way.
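The randomization test can be illustrated with a simplified sketch. Several parts are assumptions rather than the published procedure: the PCA here is unweighted, the Tracy-Widom-based standardization is replaced by the fraction of variance on the first principal component, SciPy's `gumbel_r` stands in for extRemes, and all function names, the toy data and the counts (6 clusters, 20 randomizations) are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from scipy.stats import gumbel_r


def cluster_genes(expr, n_clusters):
    """Ward clustering of genes (rows) on a Pearson correlation distance."""
    tree = linkage(pdist(expr, metric="correlation"), method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def pc1_fraction(expr, labels):
    """Fraction of variance on PC1 for each cluster with at least two genes."""
    out = []
    for lab in np.unique(labels):
        sub = expr[labels == lab]
        if sub.shape[0] < 2:
            continue
        # Standardize each gene, then take eigenvalues of the gene covariance
        z = (sub - sub.mean(axis=1, keepdims=True)) / sub.std(axis=1, keepdims=True)
        eigvals = np.linalg.eigvalsh(np.cov(z))  # ascending order
        out.append(eigvals[-1] / eigvals.sum())
    return np.array(out)


def background(expr, n_clusters, n_rand, rng):
    """Repeat clustering and PCA on matrices with cell order shuffled per gene."""
    vals = []
    for _ in range(n_rand):
        shuffled = rng.permuted(expr, axis=1)  # each row shuffled independently
        vals.append(pc1_fraction(shuffled, cluster_genes(shuffled, n_clusters)))
    return np.concatenate(vals)


# Toy data (hypothetical): 60 genes, 40 cells, one planted 10-gene module
rng = np.random.default_rng(0)
expr = rng.normal(size=(60, 40))
expr[:10] += 2.0 * rng.normal(size=40)

observed = pc1_fraction(expr, cluster_genes(expr, 6))
null = background(expr, n_clusters=6, n_rand=20, rng=rng)

# Fit a Gumbel distribution to the background and read P values from its upper tail
loc, scale = gumbel_r.fit(null)
p_values = gumbel_r.sf(observed, loc=loc, scale=scale)
```

Because the randomized matrices pass through the same clustering step, the background already reflects the upward bias that selecting correlated genes introduces. That bias is the reason an unadjusted Tracy-Widom reference is not appropriate for de novo gene sets.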