Starting from BCL files obtained from Illumina sequencing, we ran cellranger mkfastq to extract sequence reads in FASTQ format, followed by cellranger count to generate gene-count matrices from the FASTQ files. Since our data are from single nuclei, we built genome references with pre-mRNA annotations, which account for both exons and introns, and aligned reads to them. Pre-mRNA annotations significantly increase the number of detected genes compared to a reference with only exon annotations [15]. For human and mouse data, we used the GRCh38 and mm10 genome references, respectively. To compare samples of interest (e.g., different loading concentrations), we pooled their gene-count matrices and filtered out low-quality nuclei that met any one of the following criteria: (1) a total number of expressed genes < 200; (2) a total number of expressed genes ≥ 6,000; or (3) a percentage of RNA UMIs from mitochondrial genes ≥ 10%. We then normalized the filtered count matrix and transformed it to natural log space as follows: (1) selected genes that were expressed in at least 0.05% of all remaining nuclei; (2) normalized the count vector of each nucleus such that the total sum of normalized counts from selected genes equals 100,000 (transcripts per 100 K, TP100K); (3) transformed the normalized matrix into natural log space by replacing each normalized count c with log(c+1), giving log(TP100K+1). We performed dimensionality reduction, clustering and visualization on the log-transformed matrix using a standard procedure [16, 17]. Specifically, we selected highly variable genes [18] with a z-score cutoff of 0.5, performed PCA on the standardized sub-matrix consisting of only highly variable genes, selected the top 50 principal components (PCs) [19], and clustered the data based on the 50 selected PCs using the Louvain community detection algorithm [20] with a resolution of 1.3.
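The filtering and normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names (`filter_nuclei`, `log_tp100k`), the list-of-lists layout with one row per nucleus, and the `is_mito` flag vector are all hypothetical, and a real dataset would typically be held in a sparse matrix.

```python
import math

def filter_nuclei(counts, is_mito, min_genes=200, max_genes=6000, max_mito_frac=0.10):
    """Return row indices of nuclei that pass all three QC criteria.

    counts  : list of rows, one per nucleus, each a list of UMI counts per gene
    is_mito : list of booleans, True where the gene is mitochondrial (assumed input)
    """
    kept = []
    for i, row in enumerate(counts):
        n_genes = sum(1 for c in row if c > 0)           # expressed genes
        total = sum(row)                                  # total UMIs
        mito = sum(c for c, m in zip(row, is_mito) if m)  # mitochondrial UMIs
        mito_frac = mito / total if total > 0 else 0.0
        # Drop the nucleus if it meets ANY one of the exclusion criteria.
        if n_genes < min_genes or n_genes >= max_genes or mito_frac >= max_mito_frac:
            continue
        kept.append(i)
    return kept

def log_tp100k(counts, min_frac=0.0005, scale=100_000):
    """Select genes, scale each nucleus to `scale` total counts, apply log(c + 1)."""
    n_nuclei = len(counts)
    n_genes = len(counts[0])
    # (1) keep genes expressed in at least min_frac (0.05%) of the nuclei
    selected = [
        g for g in range(n_genes)
        if sum(1 for row in counts if row[g] > 0) >= min_frac * n_nuclei
    ]
    out = []
    for row in counts:
        total = sum(row[g] for g in selected)
        if total == 0:
            out.append([0.0] * len(selected))
            continue
        # (2) scale so the selected genes sum to `scale`, (3) natural log of (c + 1)
        out.append([math.log(row[g] * scale / total + 1) for g in selected])
    return selected, out
```

The thresholds are exposed as keyword arguments so that the same sketch can be exercised on toy matrices; the defaults mirror the cutoffs stated in the text.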
We identified cluster-specific gene expression by differential expression analyses between nuclei within the cluster and nuclei outside of the cluster [16] using Welch's t-test and Fisher's exact test, controlled the false discovery rate (FDR) at 5% using the Benjamini–Hochberg procedure [21], and annotated putative cell types based on legacy signatures of human and mouse brain cells. We visualized the reduced-dimensionality data using tSNE [22] with a perplexity of 30. Note that in experiments 1 and 4 (Supplementary Data 1), we identified one cluster that did not express any known cell-type markers and had the lowest median number of RNA UMIs among all clusters. We removed it from further analysis and repeated the above analysis workflow, omitting the low-quality nucleus filtering step.
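To make the statistical steps above concrete, the following is a minimal Python sketch of the Welch t statistic for one gene (in-cluster versus out-of-cluster expression) and of the Benjamini–Hochberg adjustment applied to a list of p-values. It is an illustration only: the function names `welch_t` and `benjamini_hochberg` are hypothetical, and converting the t statistic to a p-value (which requires the t distribution) is left out to keep the example within the standard library.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    a, b : expression values for nuclei inside and outside the cluster.
    """
    va = statistics.variance(a) / len(a)  # sample variance over n, group a
    vb = statistics.variance(b) / len(b)  # sample variance over n, group b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are non-decreasing in the original p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Genes whose adjusted p-value is below 0.05 are called at a 5% FDR.
example_p = [0.001, 0.2, 0.03, 0.04]
significant = [q < 0.05 for q in benjamini_hochberg(example_p)]
```

Note that a raw p-value below 0.05 does not guarantee significance after adjustment: the adjusted value depends on the rank of the p-value among all tested genes.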