All linkage and distance calculations were performed after transformation.
The starting point of the dendrogram construction was the 265 clusters. For each gene, we computed average expression, trinarization with f = 0.2, trinarization with f = 0.05 and enrichment score. For each cluster we also know the number of cells, annotations, tissue distribution and samples of origin.
We defined major classes of cell types based on prior knowledge: neurons, astroependymal, oligodendrocytes, vascular (without VLMC), immune cells and neural crest-like. For each class, we defined pan-enriched genes based on the trinarization score with f = 0.05. Each class (except neurons) was tested against neurons to find all genes for which the fraction of clusters with trinarization score = 1 in the class was greater than the fraction of clusters with trinarization score > 0.9 among neurons.
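The class-versus-neuron comparison can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the original implementation: the function name `pan_enriched_genes`, the genes-by-clusters array layout and the toy values are all assumptions made for the example.

```python
import numpy as np

def pan_enriched_genes(trin, is_class, is_neuron):
    """Return row indices of genes pan-enriched in a class relative to neurons.

    trin      : (n_genes, n_clusters) trinarization scores in [0, 1] (assumed layout)
    is_class  : boolean mask over clusters belonging to the tested class
    is_neuron : boolean mask over neuronal clusters
    """
    # Fraction of class clusters in which the gene has score exactly 1
    frac_class = (trin[:, is_class] == 1).mean(axis=1)
    # Fraction of neuronal clusters in which the gene has score above 0.9
    frac_neuron = (trin[:, is_neuron] > 0.9).mean(axis=1)
    return np.flatnonzero(frac_class > frac_neuron)

# Toy data (hypothetical): 3 genes, 2 class clusters followed by 2 neuronal clusters
trin = np.array([
    [1.0, 1.0, 0.0, 0.10],  # on in the whole class, off in neurons
    [1.0, 0.5, 1.0, 0.95],  # on in half the class, on in all neurons
    [0.0, 0.0, 0.0, 0.00],  # off everywhere
])
is_class = np.array([True, True, False, False])
is_neuron = np.array([False, False, True, True])
selected = pan_enriched_genes(trin, is_class, is_neuron)
```

In this toy case only the first gene passes the test, because the comparison is a strict inequality and a gene that is off everywhere has equal fractions (0 and 0).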
To suppress batch effects (mainly due to ambient oligodendrocyte RNA in hindbrain and spinal cord samples), we collected the unique set of genes pan-enriched in the non-neuronal clusters, together with a set of non-neuronal genes that we believe tend to appear in floating RNA (Trf, Plp1, Mog, Mobp, Mfge8, Mbp, Hbb-bs, H2-DMb2) and a set of immediate early genes (Fos, Jun, Junb, Egr1). These genes were set to zero within the neuronal clusters to avoid any batch effect when clustering the neuronal clusters. We further removed sex-specific genes (Xist, Tsix, Eif2s3y, Ddx3y, Uty and Kdm5d) and the immediate early genes Egr1 and Jun from all clusters.
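A minimal sketch of this masking step follows. It assumes a genes-by-clusters matrix of mean expression and interprets "removed from all clusters" as dropping the gene rows entirely; zeroing them instead would be an equally plausible reading. The function name `mask_genes` and the toy values are hypothetical.

```python
import numpy as np

def mask_genes(expr, genes, is_neuron, zero_in_neurons, drop_everywhere):
    """Zero selected genes in neuronal clusters and drop others from all clusters.

    expr  : (n_genes, n_clusters) mean expression per cluster (assumed layout)
    genes : gene names, one per row of expr
    """
    expr = np.array(expr, dtype=float)  # work on a copy
    genes = np.asarray(genes)
    # Rows to silence, restricted to the neuronal columns
    zero_rows = np.isin(genes, list(zero_in_neurons))
    expr[np.ix_(zero_rows, is_neuron)] = 0.0
    # Rows removed from every cluster (assumption: dropped, not zeroed)
    keep = ~np.isin(genes, list(drop_everywhere))
    return expr[keep], genes[keep]

# Toy example (values are made up): clusters 0 and 2 are neuronal
genes = ["Plp1", "Xist", "Snap25"]
expr = np.array([[5.0, 5.0, 5.0],
                 [2.0, 2.0, 2.0],
                 [9.0, 9.0, 9.0]])
is_neuron = np.array([True, False, True])
masked, kept = mask_genes(expr, genes, is_neuron,
                          zero_in_neurons={"Plp1"},
                          drop_everywhere={"Xist"})
```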
We limited each cluster profile to its 5000 most highly expressed genes and then scaled the total sum of each cluster profile to 10,000.
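One way to read this step is that, within each cluster profile, every gene outside the 5000 most highly expressed is set to zero before the profile is rescaled. The sketch below implements that reading. The function name `bound_and_scale` and the tie handling (whatever `argsort` returns) are assumptions, not details given in the text.

```python
import numpy as np

def bound_and_scale(expr, n_top=5000, total=10_000):
    """Keep the n_top highest genes per cluster, zero the rest, rescale columns.

    expr : (n_genes, n_clusters) mean expression per cluster (assumed layout)
    """
    expr = np.array(expr, dtype=float)  # work on a copy
    n_genes = expr.shape[0]
    if n_top < n_genes:
        # Ascending sort per column; the first rows index the lowest genes
        order = np.argsort(expr, axis=0)
        lowest = order[: n_genes - n_top, :]
        np.put_along_axis(expr, lowest, 0.0, axis=0)
    sums = expr.sum(axis=0, keepdims=True)
    sums[sums == 0] = 1.0  # guard against empty profiles
    return expr / sums * total

# Toy example with 4 genes and 2 clusters, keeping the top 2 genes
toy = np.array([[1.0, 4.0],
                [2.0, 3.0],
                [3.0, 2.0],
                [4.0, 1.0]])
scaled = bound_and_scale(toy, n_top=2, total=10)
```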
Next, we selected genes for linkage analysis. From each cluster, we selected the top N = 28 enriched genes (based on the pre-calculated enrichment score), performed an initial clustering using linkage (Euclidean distance, Ward method, in MATLAB) and cut the tree at a distance criterion of 50. This clustering aimed to capture the coarse structure of the hierarchy. For each of the resulting clusters, we calculated an enrichment score as the mean over the cluster divided by the total sum and selected the top 1.5N genes. These were added to the previously selected genes.
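The gene selection can be approximated with SciPy as follows. This is a sketch under several assumptions rather than a port of the MATLAB code: SciPy's Ward linkage stands in for MATLAB's `linkage`, the input is assumed to be already transformed, and "total sum" is read as each gene's sum over all clusters. The function name `select_linkage_genes` and the random toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def select_linkage_genes(expr, enrich, n=28, cut=50.0):
    """Select genes for linkage analysis; returns (gene indices, coarse labels).

    expr   : (n_genes, n_clusters) transformed expression (assumed layout)
    enrich : (n_genes, n_clusters) pre-calculated enrichment scores
    """
    # Step 1: top-n enriched genes from every cluster, merged into one set
    top = np.argsort(-enrich, axis=0)[:n, :]
    genes = np.unique(top)
    # Step 2: coarse Ward/Euclidean clustering of the clusters on those genes
    Z = linkage(expr[genes, :].T, method="ward", metric="euclidean")
    coarse = fcluster(Z, t=cut, criterion="distance")
    # Step 3: per coarse group, score = group mean / total sum (assumed: sum
    # over all clusters), then take the top 1.5 * n genes
    total = expr.sum(axis=1)
    k = int(round(1.5 * n))
    extra = []
    for g in np.unique(coarse):
        mean = expr[:, coarse == g].mean(axis=1)
        score = np.divide(mean, total, out=np.zeros_like(mean), where=total > 0)
        extra.append(np.argsort(-score)[:k])
    return np.union1d(genes, np.concatenate(extra)), coarse

# Toy data (random, for illustration only): 200 genes, 12 clusters
rng = np.random.default_rng(0)
toy_expr = rng.random((200, 12)) * 10
toy_enrich = toy_expr / toy_expr.sum(axis=1, keepdims=True)
selected_genes, coarse_labels = select_linkage_genes(toy_expr, toy_enrich, n=3)
```

The cut height of 50 is only meaningful on the scale of the original data, so a toy matrix like this one will generally produce a different number of coarse groups.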
Finally, we built the dendrogram using linkage (correlation distance and Ward method).
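A SciPy sketch of this final step is shown below, with the caveat that it is not the original MATLAB code. Ward linkage is formally defined for Euclidean distances, so passing correlation distances means the Ward update is applied to a non-Euclidean metric; SciPy accepts a precomputed condensed distance vector for this purpose. The function name `build_dendrogram` and the random input are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

def build_dendrogram(expr_selected):
    """Ward linkage on correlation distances between cluster profiles.

    expr_selected : (n_selected_genes, n_clusters) transformed expression
                    restricted to the selected genes (assumed layout)
    """
    # Pairwise correlation distance (1 - Pearson r) between clusters
    d = pdist(expr_selected.T, metric="correlation")
    # Ward update applied to the precomputed, non-Euclidean distances
    return linkage(d, method="ward")

# Toy data (random, for illustration only): 50 genes, 10 clusters
rng = np.random.default_rng(1)
toy_selected = rng.random((50, 10))
Z = build_dendrogram(toy_selected)
# Leaf order of the tree, computed without plotting
leaf_order = dendrogram(Z, no_plot=True)["leaves"]
```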