The datasets were retrieved from published and unpublished studies of multiple human tissues, including airways26,27, cornea (personal communication; Lako lab, Newcastle), skeletal muscle (personal communication; Teichmann lab, Wellcome Sanger Institute, and Zhang lab, Sun Yat-sen University, Guangzhou, China), ileum28, colon29, pancreas30, liver31, gallbladder (personal communication; Vallier lab, University of Cambridge), heart (Teichmann lab, Hubner lab/Berlin, Seidmanns/Harvard, and Noseda lab/Imperial College London), kidney32, placenta/decidua33, testis34, prostate gland35, brain36, skin37, retina38, spleen39, esophagus39, and fetal tissues40,41. Raw expression values were normalized and log transformed. We retained the cell clustering from the original studies when available.
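As an illustration of the normalization and log transformation mentioned above, the following is a minimal, dependency-free Python sketch. The text does not state the scaling target, so the per-cell target of 10,000 counts, the function name `normalize_log1p`, and the toy count matrix are all assumptions made for illustration, not the parameters used in the analysis.

```python
import math


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x).

    `target_sum` is an assumed value; the source does not specify it.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with zero total counts stay all-zero to avoid dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized


# Toy matrix (hypothetical): rows are cells, columns are genes.
raw = [
    [0, 3, 1],
    [10, 0, 30],
    [0, 0, 0],
]
norm = normalize_log1p(raw)
```

After this step, every non-empty cell has the same total on the un-logged scale, so expression values are comparable across cells with different sequencing depth.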
For each dataset without per-cell annotation, we re-processed the data from the raw or normalized quantification matrix (whichever was deposited alongside the original publication), following the standard scanpy (version 1.4.3) clustering procedure. When batch information was available, the harmony package was used to correct batch effects in principal component (PC) space, and the corrected PCs were used to compute nearest-neighbour graphs. To re-annotate the cells, we generated multiple clusterings at different resolutions, picked the one that best matched the published clustering, and annotated it manually using the marker genes described in the original publication. Full details can be found in the analysis notebooks available at github.com/Teichlab/covid19_MS1.
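The selection of a clustering resolution described above can be sketched as follows. The text does not say how agreement with the published clustering was judged, so scoring each candidate with the adjusted Rand index (ARI) is an assumption made here for illustration; the function names, resolutions, and toy labels are likewise hypothetical. In practice the candidate labelings would come from running the clustering step at each resolution.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same cells")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: trivially identical
    # Number of cell pairs grouped together in both, in a, and in b.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are all singletons
    return (together_both - expected) / (maximum - expected)


def pick_best_resolution(candidates, published):
    """Return the resolution whose labels agree best with `published`, plus all scores."""
    scores = {
        resolution: adjusted_rand_index(labels, published)
        for resolution, labels in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy example (hypothetical labels for six cells).
published = ["AT1", "AT1", "AT2", "AT2", "ciliated", "ciliated"]
candidates = {
    0.2: [0, 0, 0, 0, 1, 1],  # under-clustered
    0.5: [0, 0, 1, 1, 2, 2],  # same partition as the published labels
    1.0: [0, 1, 2, 2, 3, 4],  # over-clustered
}
best_resolution, scores = pick_best_resolution(candidates, published)
```

The ARI ignores cluster names and compares only how cells are grouped, which is why integer cluster IDs can be scored directly against published cell-type labels before any manual annotation.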
Figures were generated using scanpy and Seurat (version 3.1). For correlation analysis with ACE2, we computed Spearman’s correlation with statistical tests using the R Hmisc package (version 4.3-1) on the Vieira Braga, Kar et al. airway epithelial dataset and the Deprez et al. airway dataset, and adjusted the p values with the Benjamini-Hochberg method using the R stats package (version 3.6.1). To compare correlation results, we also tested several additional approaches, including Kendall’s correlation, data transformation with the sctransform function in the Seurat package, and data imputation with the Markov Affinity-based Graph Imputation of Cells (MAGIC) algorithm. While imputation substantially increased the correlations, the top genes correlated with ACE2 were largely the same as in the analysis of un-imputed data. Because it is uncertain to what extent imputation artificially distorts the data, we report the results without imputation even though the correlations are low. The correlation coefficients for all genes are included as Supplementary Data 1. The top 50 genes in each dataset were characterized based on Gene Ontology classes from the Gene Ontology (GO) database and associated pathways in PathCards from the Pathway Unification database.
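The rank correlation and multiple-testing adjustment described above can be illustrated with a short, dependency-free Python sketch. This is not the R Hmisc/stats code used in the analysis: the function names and toy expression vectors are hypothetical, the computation of p values is omitted, and the p values passed to the adjustment are placeholders rather than results.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant vector has no defined rank correlation
    return cov / math.sqrt(var_x * var_y)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in input order."""
    m = len(pvalues)
    # Walk from the largest p value down, enforcing monotonicity with a running minimum.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position  # 1-based rank among ascending p values
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy per-cell expression vectors (hypothetical values).
ace2 = [0.0, 0.5, 1.0, 2.0]
gene_up = [1.0, 2.0, 3.0, 10.0]   # rises with ACE2
gene_down = [9.0, 7.0, 4.0, 0.0]  # falls as ACE2 rises
rho_up = spearman_rho(ace2, gene_up)
rho_down = spearman_rho(ace2, gene_down)

# Placeholder p values, one per tested gene.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Because Spearman's correlation works on ranks, it is insensitive to the monotone log transformation applied to the expression values, and the tie-averaged ranks handle the many zero counts typical of single-cell data.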