Classification of tumor and normal cells was performed in two steps. We assumed that the major genetic distance among the cell populations is the difference between diploid and aneuploid genomes, and therefore forced the single cells into two major clusters using hierarchical clustering with Ward linkage and Euclidean distance. To determine the identity of each cluster, we integrated the clustering results with the predefined ‘confident normal cells’, which are identified using very stringent criteria (see the Online Methods section on Estimating Copy Number Baseline Values in Diploid Cells). The cluster that has significantly higher enrichment of predefined normal cells is defined as the normal diploid cell cluster. In cases where the enrichment test shows no significant difference, we switch to the ‘GMM definition’ approach to determine whether the consensus profile of each cluster passes the ‘normal cell criteria’, under which at least 95% of the regions fall into the neutral distribution. In some challenging samples whose aneuploidy is too close to 2N, we use an alternative, slower approach that predicts the cells one by one using the ‘GMM definition’ approach and the ‘normal cell criteria’.
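The cluster-assignment logic described above can be illustrated with a minimal Python sketch. It is not the CopyKAT implementation: the text does not name the enrichment test, so a one-sided hypergeometric (Fisher-type) test and a significance level of 0.05 are assumed here, and the function names, the input layout and the ‘neutral’ state label are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n_draw, n_success, n_total):
    """P(X >= k) when n_draw cells are drawn from n_total cells,
    n_success of which are predefined 'confident normal' cells."""
    upper = min(n_draw, n_success)
    numerator = sum(
        comb(n_success, i) * comb(n_total - n_success, n_draw - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_draw)


def pick_normal_cluster(labels, confident_normal, alpha=0.05):
    """labels maps cell id -> cluster (1 or 2); confident_normal is a set of cell ids.

    Returns the cluster significantly enriched for confident normal cells,
    or None when neither cluster is enriched (the fallback case in the text).
    The test and alpha are assumptions, not taken from the source.
    """
    n_total = len(labels)
    n_success = sum(1 for cell in labels if cell in confident_normal)
    best_cluster, best_p = None, alpha
    for cluster in (1, 2):
        members = [cell for cell, lab in labels.items() if lab == cluster]
        k = sum(1 for cell in members if cell in confident_normal)
        p = hypergeom_upper_tail(k, len(members), n_success, n_total)
        if p < best_p:
            best_cluster, best_p = cluster, p
    return best_cluster


def passes_normal_criteria(region_states, min_neutral_fraction=0.95):
    """Fallback check: at least 95% of regions in a consensus profile are neutral.
    region_states is a list of per-region calls such as 'neutral', 'gain', 'loss'."""
    neutral = sum(1 for state in region_states if state == "neutral")
    return neutral / len(region_states) >= min_neutral_fraction
```

In this sketch, a return value of None from pick_normal_cluster corresponds to the case where the enrichment test is not significant, at which point passes_normal_criteria would be applied to the consensus profile of each cluster.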
To evaluate the accuracy of this copy number-based classification of tumor and normal cells, we applied an empirical approach to identify tumor and normal cells based on clustering and expression of cancer-specific marker genes. We first clustered all single cells within a tumor using the ‘SNN’ method in the R package ‘Seurat’42. Next, we obtained the expression levels of a panel of four epithelial markers (EPCAM, KRT19, KRT18, and KRT8). We calculated the average expression value of this epithelial marker panel as a consolidated epithelial score for each cell. Single-cell gene expression clusters with high epithelial scores (kernel density center above 0) were labeled as putative tumor cell clusters. In tumors that had both normal epithelial and tumor epithelial cell clusters, we further evaluated cancer type-specific markers, including KRT19 for PDAC tumor epithelial cells, KRT8 for ATC, EPCAM for TNBC and IBC, and EGFR for GBM cancer cells. Furthermore, expression clusters that expressed immune cell markers (CD45, CD3, CD4, CD8) or fibroblast markers (ACTA2, FN1) were classified as normal cells. Single cells that were predicted to be aneuploid by CopyKAT and that also belonged to gene expression clusters with high epithelial scores were considered to be tumor cells. The prediction accuracy of CopyKAT using aneuploid copy number profiles alone was then calculated as the number of cells with the correct prediction divided by the total number of single cells in the analysis.
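The epithelial score and the accuracy calculation can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: expression values are assumed to be already scaled so that 0 is a meaningful cutoff, the cluster median is used as a simple stand-in for the kernel density center, and all function names and input layouts are hypothetical.

```python
from statistics import mean, median

# The four-gene epithelial panel named in the text.
EPITHELIAL_MARKERS = ("EPCAM", "KRT19", "KRT18", "KRT8")


def epithelial_score(expression):
    """Average expression of the epithelial panel for one cell.
    expression maps gene symbol -> scaled expression value."""
    return mean(expression[gene] for gene in EPITHELIAL_MARKERS)


def label_clusters(scores, clusters):
    """Label each expression cluster 'tumor' or 'normal'.

    scores maps cell id -> epithelial score; clusters maps cell id -> cluster id.
    Assumption: the cluster median replaces the kernel density center.
    """
    grouped = {}
    for cell, cluster in clusters.items():
        grouped.setdefault(cluster, []).append(scores[cell])
    return {
        cluster: "tumor" if median(values) > 0 else "normal"
        for cluster, values in grouped.items()
    }


def prediction_accuracy(predicted, reference):
    """Number of cells with the correct prediction divided by the total
    number of cells; both arguments map cell id -> 'tumor' or 'normal'."""
    correct = sum(1 for cell in reference if predicted[cell] == reference[cell])
    return correct / len(reference)
```

Here the reference labels would come from the expression-based clusters and the predicted labels from the copy number-based calls, so that prediction_accuracy returns the fraction of cells on which the two agree.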