We extensively mined clusters obtained in preliminary analyses and found that they largely corresponded to known and putative cell types, broadly consistent with previous data. Some clusters were also clearly derived from doublets, expressing contradictory markers (e.g., markers of both neurons and vascular cells).
With any type of clustering, the choice of feature space is crucial. For preliminary clustering, we used genes informative across the entire set of cells, projected by PCA. This would be expected to be suitable for finding major cell types, but not optimal for finding finer subdivisions among cells of the same kind (e.g., interneurons in a dataset containing neurons, vascular cells and glia). For example, running Louvain clustering on the full dataset resulted in only 44 clusters, compared to the 265 found by the multi-level, iterative approach described below.
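The effect of the feature space can be illustrated with a toy example. The sketch below is not the gene selection procedure used in this study; it assumes a simple variance ranking, and the function name, gene names, and data layout are hypothetical. It shows only that the most informative gene across all cells can differ from the most informative gene within a single class.

```python
from statistics import pvariance


def top_variable_genes(expr, cells, k):
    """Return the k gene names with the highest variance over the given cells.

    expr maps gene name -> {cell id -> expression value} (hypothetical layout).
    Variance ranking is an assumption made for illustration only.
    """
    scores = {
        gene: pvariance([values[c] for c in cells])
        for gene, values in expr.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: four neurons (n1-n4) and four glial cells (g1-g4).
neurons = ["n1", "n2", "n3", "n4"]
glia = ["g1", "g2", "g3", "g4"]
all_cells = neurons + glia

expr = {
    # Separates neurons from glia, but is flat within each class.
    "class_marker": {**{c: 10 for c in neurons}, **{c: 0 for c in glia}},
    # Separates two neuronal subtypes and is absent from glia.
    "subtype_marker": {"n1": 4, "n2": 4, "n3": 0, "n4": 0, **{c: 0 for c in glia}},
    # Uninformative: identical in every cell.
    "housekeeping": {c: 5 for c in all_cells},
}

# Ranking over the whole dataset favours the gene that splits major classes.
global_choice = top_variable_genes(expr, all_cells, 1)

# Ranking within neurons alone favours the gene that splits subtypes.
neuron_choice = top_variable_genes(expr, neurons, 1)
```

In this toy case the class marker dominates the global ranking, while the subtype marker only rises to the top once the cells have been restricted to neurons.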
We decided to first split cells by major class. In order to split the data, and to reject many doublets, we trained a classifier to automatically detect the major class of each single cell, as well as classes representing doublets. We first manually annotated clusters to indicate major classes of cells: Neurons, Oligodendrocytes, Astrocytes, Bergmann glia, Olfactory ensheathing cells, Satellite glia, Schwann cells, Ependymal, Choroid, Immune, and Vascular. For some of these classes, we distinguished proliferating cells (e.g., Cycling oligodendrocytes, i.e., OPCs). We also manually identified clusters that were clearly doublets between these major classes (e.g., Vascular-Neurons) as well as clusters that were of poor quality.
We then trained a support vector classifier to discriminate all of these labels, using the training set of preliminary clusters manually annotated with class labels. We sampled 100 cells per cluster and used 80% of this dataset to optimize the classifier, and the remaining 20% to assess performance. On average, the classification accuracy was 93% for non-cycling cells. The precision and recall for neurons were 93% and 99%, respectively. That is, 99% of all neurons were classified correctly, and 93% of all cells classified as neurons were actually neurons. The classifier struggled to distinguish cycling cells, presumably because they shared most gene expression with their non-cycling counterparts. For this reason, we always pooled cycling and non-cycling cells after classification. The table below shows the precision and recall for all major classes of interest:
Class                     Precision   Recall
astrocyte                 87%         96%
astrocyte, cycling        59%         38%
bergmann-glia             100%        97%
blood                     77%         65%
ependymal                 98%         97%
immune                    96%         98%
neurons                   93%         99%
neurons, cycling          63%         54%
oec                       100%        95%
oligos                    91%         97%
oligos, cycling           39%         19%
satellite-glia            90%         95%
satellite-glia, cycling   91%         88%
schwann                   100%        100%
choroid                   100%        80%
vascular                  87%         97%
vascular, cycling         100%        25%

average (non-cycling)     93%         93%
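The sampling scheme and the per-class metrics reported above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names and data layout are hypothetical, the shuffled 80/20 split is an assumption about how the held-out set was drawn, and the classifier itself is omitted.

```python
import random
from collections import Counter


def sample_and_split(annotated_clusters, per_cluster=100, train_fraction=0.8, seed=0):
    """Sample up to per_cluster cells from each cluster, then split train/test.

    annotated_clusters maps cluster id -> (class label, list of cell ids).
    The layout and the simple shuffled split are assumptions.
    """
    rng = random.Random(seed)
    sampled = []
    for class_label, cells in annotated_clusters.values():
        chosen = rng.sample(cells, min(per_cluster, len(cells)))
        sampled.extend((cell, class_label) for cell in chosen)
    rng.shuffle(sampled)
    cut = round(len(sampled) * train_fraction)
    return sampled[:cut], sampled[cut:]


def precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} computed from paired labels."""
    actual, predicted, correct = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, predicted_labels):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            correct[t] += 1
    scores = {}
    for label in set(actual) | set(predicted):
        # Precision: fraction of cells assigned this label that truly have it.
        precision = correct[label] / predicted[label] if predicted[label] else float("nan")
        # Recall: fraction of cells truly of this label that were assigned it.
        recall = correct[label] / actual[label] if actual[label] else float("nan")
        scores[label] = (precision, recall)
    return scores


# Toy clusters: 150 and 120 cells, so 100 are sampled from each.
clusters = {
    "cluster_a": ("neurons", [f"a{i}" for i in range(150)]),
    "cluster_b": ("vascular", [f"b{i}" for i in range(120)]),
}
train, test = sample_and_split(clusters)

# Toy predictions: one vascular cell is mistaken for a neuron.
true = ["neurons"] * 4 + ["vascular"] * 4
pred = ["neurons"] * 4 + ["neurons", "vascular", "vascular", "vascular"]
scores = precision_recall(true, pred)
```

In the toy predictions, every neuron is recovered (recall 100%) but one of the five cells labelled as neurons is vascular (precision 80%), which mirrors how the neuron row of the table should be read.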

We used this classifier to individually assess the class identity of each cell in each dataset, and to pool cells by major class into new files (with neurons further separated by tissue).
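The pooling step can be sketched as below, assuming predicted labels follow the naming in the table (a ", cycling" suffix for proliferating cells). The names of the rejected doublet and poor-quality labels are hypothetical, and the further separation of neurons by tissue is not shown.

```python
from collections import defaultdict

CYCLING_SUFFIX = ", cycling"

# Hypothetical label names for classes that are discarded rather than pooled.
REJECTED_LABELS = {"vascular-neurons", "poor-quality"}


def pool_by_major_class(predictions):
    """Group cell ids by major class, merging cycling with non-cycling cells.

    predictions maps cell id -> predicted label (hypothetical layout).
    Cells with a rejected label (doublets, poor quality) are dropped.
    """
    pools = defaultdict(list)
    for cell_id, label in predictions.items():
        if label in REJECTED_LABELS:
            continue
        # Strip the cycling suffix so both states land in the same pool.
        if label.endswith(CYCLING_SUFFIX):
            label = label[: -len(CYCLING_SUFFIX)]
        pools[label].append(cell_id)
    return dict(pools)


predictions = {
    "c1": "neurons",
    "c2": "neurons, cycling",
    "c3": "oligos, cycling",
    "c4": "vascular-neurons",
    "c5": "poor-quality",
}
pools = pool_by_major_class(predictions)
```

Each resulting pool would then be written to its own file for the class-specific analysis.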