Prior to dataset integration, single-cell data from individual studies were filtered using TILPRED-1.0 (https://github.com/carmonalab/TILPRED), which removes cells not enriched in T cell markers (e.g., Cd2, Cd3d, Cd3e, Cd3g, Cd4, Cd8a, Cd8b1) and cells enriched in non-T cell genes (e.g., Spi1, Fcer1g, Csf1r, Cd19). Dataset integration was performed using STACAS [14] (https://github.com/carmonalab/STACAS), a batch-correction algorithm based on Seurat [12]. For the TIL reference map, we specified 800 variable genes per dataset, excluding cell cycle, mitochondrial, ribosomal, and non-coding genes, as well as genes expressed in <0.1% or >90% of the cells of a given dataset. For integration, a consensus set of 800 variable genes was derived from the 800 variable genes of the individual datasets, prioritizing genes found in multiple datasets and, in case of ties, those derived from the largest datasets. We calculated pairwise dataset anchors using STACAS with default parameters, and filtered anchors using an anchor score threshold of 0.8. Integration was performed using the IntegrateData function in Seurat, providing the anchor set identified by STACAS and a custom integration tree to initiate alignment from the largest and most heterogeneous datasets.
Similarly, to construct the LCMV reference map, we split the datasets into five batches that displayed strong technical differences, and applied STACAS to mitigate their confounding effects. We computed 800 variable genes per batch, excluding cell cycle, ribosomal, and mitochondrial genes, and computed pairwise anchors using 200 integration genes and otherwise default STACAS parameters. Anchors were filtered at the default threshold (0.8 quantile), and integration was performed using the IntegrateData function in Seurat with the guide tree suggested by STACAS.
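The selection of integration genes described above (prioritizing genes found in multiple datasets, then genes from the largest datasets) can be illustrated with a minimal sketch. This is not the STACAS implementation: the function name, the toy gene lists, the dataset sizes, and the exact tie-breaking rule (size of the largest dataset in which a gene is variable, then gene name) are assumptions made for illustration only.

```python
from collections import Counter


def consensus_variable_genes(variable_genes, dataset_sizes, n_genes):
    """Select n_genes from per-dataset variable gene lists (illustrative only).

    variable_genes: dict mapping dataset name -> list of variable genes
    dataset_sizes:  dict mapping dataset name -> number of cells
    Ranking (assumed): genes variable in more datasets come first; ties are
    broken by the size of the largest dataset listing the gene, then by name.
    """
    n_datasets = Counter()  # in how many datasets is each gene variable
    largest = {}            # size of the largest dataset listing each gene
    for name, genes in variable_genes.items():
        for gene in set(genes):
            n_datasets[gene] += 1
            largest[gene] = max(largest.get(gene, 0), dataset_sizes[name])
    ranked = sorted(n_datasets, key=lambda g: (-n_datasets[g], -largest[g], g))
    return ranked[:n_genes]


# Hypothetical toy input: three studies of different sizes.
variable_genes = {
    "study_A": ["Cd8a", "Pdcd1", "Tox", "Gzmb"],
    "study_B": ["Cd8a", "Pdcd1", "Tcf7"],
    "study_C": ["Cd8a", "Sell", "Tox"],
}
dataset_sizes = {"study_A": 5000, "study_B": 1200, "study_C": 300}

selected = consensus_variable_genes(variable_genes, dataset_sizes, n_genes=5)
print(selected)
```

In this toy example, the gene shared by all three studies ranks first, and among genes found in a single study, the one from the largest study is preferred.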
For both the TIL and LCMV atlases, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method [64] implemented in Seurat, with parameters {resolution = 0.6, reduction = “umap”, k.param = 20} for the TIL atlas and {resolution = 0.4, reduction = “pca”, k.param = 20} for the LCMV atlas. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: (i) average expression of key marker genes in individual clusters; (ii) gradients of gene expression over the UMAP representation of the reference map; and (iii) gene-set enrichment analysis to identify over- and under-expressed genes per cluster using MAST [65]. To gain access to predictive methods for UMAP (that is, methods for projecting new cells into an existing embedding), we recomputed the PCA and UMAP embeddings independently of Seurat, using the prcomp function from the base R package “stats” and the “umap” R package (https://github.com/tkonopka/umap), respectively.
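The idea behind the shared-nearest-neighbor graph used in the clustering step above can be sketched as follows: two cells are connected with a weight equal to the Jaccard overlap of their k-nearest-neighbor sets, and weak edges are pruned before community detection. This is a toy illustration and not Seurat's implementation. The function names, the choice to count each cell as its own neighbor, and the pruning cutoff are assumptions, and the community-detection step controlled by the resolution parameter is not shown.

```python
import math


def knn_sets(points, k):
    """For each point, return the index set of its k nearest points.

    Assumption: a point counts as its own nearest neighbor.
    """
    neighbors = []
    for p in points:
        order = sorted(range(len(points)),
                       key=lambda j: (math.dist(p, points[j]), j))
        neighbors.append(set(order[:k]))
    return neighbors


def snn_graph(points, k, prune=0.0):
    """Build SNN edges weighted by Jaccard overlap of neighbor sets.

    Edges with weight <= prune are dropped.
    """
    nn = knn_sets(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            jaccard = len(nn[i] & nn[j]) / len(nn[i] | nn[j])
            if jaccard > prune:
                edges[(i, j)] = jaccard
    return edges


# Hypothetical 2-D embedding with two well-separated groups of three cells.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = snn_graph(points, k=3)
for (i, j), weight in sorted(edges.items()):
    print(i, j, weight)
```

With k = 3, cells within a group share all of their neighbors and cells in different groups share none, so the resulting graph splits into two disconnected components, one per group.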