Treatment-naïve and neoadjuvant-treated specimens were aggregated into a single dataset, and a log2(TP10K+1) expression matrix was constructed for downstream analyses. We identified the top 2,000 highly variable genes across the entire dataset using the highly_variable_genes function in Scanpy v1.7.2 [91], with sample ID as the batch variable. We then performed principal component analysis (PCA) on these 2,000 genes and retained the top 40 principal components (PCs), beyond which negligible additional variance was explained. Subsequently, we performed batch correction using Harmony-Pytorch v0.1.7 [92], built a k-nearest neighbors graph of nucleus profiles (k = 10) from the top 40 batch-corrected components, and performed community detection on this graph using the Leiden graph clustering method [93] with resolution set to 1 to identify distinct cell population clusters. Individual nucleus profiles were visualized using Uniform Manifold Approximation and Projection (UMAP) [94]. Doublets were identified and removed in part using Scrublet v0.2.3. The cell populations identified in the previous steps were annotated using known cell type-specific gene expression signatures and representative marker genes [19, 26, 95–97]. We used the adjusted mutual information (AMI) score to quantify the similarity in single-cell assignments between the partition imposed by the Leiden clustering labels and the partition imposed by patient ID labels. The AMI was computed using the adjusted_mutual_info_score function in scikit-learn v0.22.2.
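To make the log2(TP10K+1) transform concrete, the following minimal sketch assumes that TP10K denotes raw counts rescaled so that each nucleus sums to 10,000 transcripts. The function name `log2_tp10k` and the toy counts are hypothetical and are not taken from the study.

```python
import math


def log2_tp10k(counts):
    """Return log2(TP10K + 1) values for a single nucleus.

    counts: raw per-gene counts for one nucleus (hypothetical input).
    Assumes TP10K = count / total_count * 10,000.
    """
    total = sum(counts)
    if total == 0:
        # An empty profile has no transcripts to rescale.
        return [0.0 for _ in counts]
    return [math.log2(c / total * 10_000 + 1) for c in counts]


# Illustrative values only; not data from the study.
toy_counts = [0, 3, 7, 10]
normalized = log2_tp10k(toy_counts)
```

Genes with zero counts map to exactly zero under this transform, which keeps the matrix sparse before highly variable gene selection.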
Earlier scRNA-seq studies in PDAC did not fully capture the stromal milieu and required enrichment strategies for CAFs, such as fluorescence-activated cell sorting [41, 72, 98, 99]; in contrast, these stromal populations were well represented in our samples. Specifically, our snRNA-seq had a higher yield of high-quality nuclei per patient in the untreated group (6,054 ± 1,529) than a recent scRNA-seq study of primary untreated PDAC [72] (1,718 ± 773; p = 1.92 × 10⁻⁹, Mann-Whitney U test; Extended Data Figure 10), despite comparable quantities of loaded cells/nuclei. It also recovered six additional cell types that were absent from the scRNA-seq data and captured significantly higher proportions of CAFs, pericytes, and endocrine cells and lower proportions of vascular smooth muscle cells, myeloid cells, lymphoid cells, and endothelial cells (p < 0.05, Mann-Whitney U test; comparable results using Dirichlet-multinomial regression; Extended Data Figure 10).
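As an illustration of how a per-patient yield comparison of this kind can be tested, the sketch below computes a two-sided Mann-Whitney U test using the normal approximation. It is not the authors' implementation: the text does not state which software was used, the function `mann_whitney_u` is hypothetical, no tie or continuity correction is applied, and the yields are made-up values rather than study data.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for x, approximate two-sided p-value).

    U counts the pairs (a, b) with a from x and b from y where a > b,
    with ties counted as 0.5. The p-value uses the normal approximation
    without tie or continuity correction.
    """
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n, m = len(x), len(y)
    mean_u = n * m / 2
    sd_u = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u_x - mean_u) / sd_u
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value


# Hypothetical per-patient yields; NOT the values reported in the study.
snrna_yield = [5200, 6100, 7400, 4800, 6900]
scrna_yield = [1500, 2100, 900, 1800, 2500]
u_stat, p_value = mann_whitney_u(snrna_yield, scrna_yield)
```

For real analyses, an established routine such as `scipy.stats.mannwhitneyu` would typically be preferable, since it handles ties and exact p-values for small samples.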