Two public gene expression repositories (NCBI GEO, EMBL-EBI ArrayExpress) were searched for all clinical gene expression microarray or next-generation sequencing (NGS/RNAseq) datasets that matched any of the following search terms: sepsis, SIRS, trauma, shock, surgery, infection, pneumonia, critical, ICU, inflammatory, nosocomial. Clinical studies of acute infection and/or sepsis using whole blood were retained. Datasets that utilized endotoxin or lipopolysaccharide infusion as a model for inflammation or sepsis were excluded. Datasets derived from sorted cells (e.g., monocytes, neutrophils) were also excluded.
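The screening criteria above can be sketched as a simple filter over dataset metadata. This is an illustrative sketch only: the record fields (`description`, `tissue`, `endotoxin_model`, `sorted_cells`), the accession labels, and the example records are hypothetical. They do not reflect the GEO or ArrayExpress schemas or the exact procedure used in this study, and judging whether a dataset is a clinical study of acute infection and/or sepsis would still require manual review.

```python
import re
from dataclasses import dataclass

# Search terms listed in the text, lower-cased for matching.
SEARCH_TERMS = {"sepsis", "sirs", "trauma", "shock", "surgery", "infection",
                "pneumonia", "critical", "icu", "inflammatory", "nosocomial"}


@dataclass
class DatasetRecord:
    # All fields are hypothetical, for illustration only.
    accession: str
    description: str
    tissue: str
    endotoxin_model: bool
    sorted_cells: bool


def matches_search(record: DatasetRecord) -> bool:
    """True if any search term appears as a whole word in the description."""
    words = set(re.findall(r"[a-z]+", record.description.lower()))
    return bool(words & SEARCH_TERMS)


def is_eligible(record: DatasetRecord) -> bool:
    """Apply the automatable criteria: keyword match, whole blood,
    no endotoxin/LPS model, no sorted cells."""
    return (matches_search(record)
            and record.tissue == "whole blood"
            and not record.endotoxin_model
            and not record.sorted_cells)


# Hypothetical example records.
records = [
    DatasetRecord("EX-001", "Whole blood profiling in adult sepsis", "whole blood", False, False),
    DatasetRecord("EX-002", "LPS infusion model of inflammatory response", "whole blood", True, False),
    DatasetRecord("EX-003", "Neutrophils in pediatric septic shock", "neutrophils", False, True),
    DatasetRecord("EX-004", "Healthy volunteer diurnal variation", "whole blood", False, False),
]
eligible = [r.accession for r in records if is_eligible(r)]
```

Whole-word matching is used here so that a short term such as "ICU" does not match inside unrelated words.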
Overall, 16 studies containing 17 different cohorts were included (Table 1a, b). These 16 studies include expression profiles from both adult (refs. 15, 17, 19, 43–52) and pediatric (refs. 48, 53–56) cohorts. In these cases, the gene expression data were publicly available. When mortality and severity phenotypes were unavailable in the public data, the data contributors were contacted for this information. This included datasets E-MTAB-1548 (refs. 13, 57), GSE10474 (ref. 44), GSE21802 (ref. 50), GSE32707 (ref. 47), GSE33341 (ref. 51), GSE63042 (ref. 19), GSE63990 (ref. 52), GSE66099 (ref. 56), and GSE66890 (ref. 49). Furthermore, where longitudinal data were available for patients admitted with sepsis, we only included data derived from the first 48 h after admission. The E-MTAB-4421 and E-MTAB-4451 cohorts both came from the GAinS study (ref. 15), used the same inclusion/exclusion criteria, and were processed on the same microarray type. Thus, after re-normalizing from raw data, we used ComBat normalization (ref. 58) to co-normalize these two cohorts into a single cohort, which we refer to as E-MTAB-4421.51. For this study, data were included only for patients sampled on the day of hospital admission. In addition to the above 17 datasets, we identified four additional privately held datasets (Table 1c) representing patients with HAI. In-depth summaries of each HAI cohort can be found in the supplementary text.
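To give a sense of what co-normalizing two cohorts involves, the sketch below aligns the per-batch mean and spread of a single gene across batches. It is a simplified stand-in and not the ComBat procedure used here: ComBat additionally applies empirical Bayes shrinkage of the batch-effect estimates across genes and can preserve covariates of interest, both of which are omitted. The function name and the toy values are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import fmean


def location_scale_adjust(values, batches):
    """Shift and rescale one gene's values so that every batch shares the
    overall mean and the pooled within-batch standard deviation.

    Simplified illustration only (no empirical Bayes, no covariates).
    """
    # Group sample indices by batch label.
    index_by_batch = defaultdict(list)
    for i, batch in enumerate(batches):
        index_by_batch[batch].append(i)

    grand_mean = fmean(values)

    # Per-batch mean and within-batch sum of squares.
    batch_stats = {}
    ss_within = 0.0
    for batch, idx in index_by_batch.items():
        batch_mean = fmean(values[i] for i in idx)
        ss = sum((values[i] - batch_mean) ** 2 for i in idx)
        batch_stats[batch] = (batch_mean, ss)
        ss_within += ss
    pooled_sd = sqrt(ss_within / len(values))

    adjusted = [0.0] * len(values)
    for batch, idx in index_by_batch.items():
        batch_mean, ss = batch_stats[batch]
        batch_sd = sqrt(ss / len(idx))
        # Leave the spread unchanged if a batch has zero variance.
        scale = pooled_sd / batch_sd if batch_sd > 0 else 1.0
        for i in idx:
            adjusted[i] = grand_mean + (values[i] - batch_mean) * scale
    return adjusted


# Toy example: one gene measured in two batches with a constant offset.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["a", "a", "a", "b", "b", "b"]
adjusted = location_scale_adjust(values, batches)
```

In the toy example the two batches differ only by an offset, so after adjustment both are centered on the overall mean and their within-batch differences are preserved.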
We assigned cohorts to discovery or validation based on the availability of their outcome data. Studies for which outcome data were readily available were included as discovery cohorts. Only GSE54514 (ref. 17) was initially held out for validation, given its large size and representative patient characteristics. After we had trained the models, some outcome data became newly available, and the corresponding cohorts were added as validation cohorts (refs. 15, 50–52). Additionally, given the known differences in pathophysiology and gene expression profiles between patients with HAI and those with community-acquired sepsis (refs. 56, 59), the HAI datasets were set aside as a second validation cohort. The validation cohorts were not matched to the discovery cohort on any particular criteria but rather provide a validation opportunity across a heterogeneous range of clinical scenarios.
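The assignment rules described above can be summarized as a small decision function. This is a sketch under stated assumptions: the `Cohort` fields and all cohort names other than GSE54514 are hypothetical placeholders, and the flags are illustrative rather than a record of the actual cohort metadata.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    # Hypothetical fields, for illustration only.
    name: str
    outcomes_at_training: bool  # outcome data available when models were trained
    hai: bool = False           # privately held HAI dataset
    held_out: bool = False      # deliberately reserved for validation


def assign_role(cohort: Cohort) -> str:
    """Return the role of a cohort following the rules in the text."""
    if cohort.hai:
        # HAI datasets form a separate, second validation group.
        return "validation (HAI)"
    if cohort.held_out or not cohort.outcomes_at_training:
        # Held out up front, or outcomes only obtained after training.
        return "validation"
    return "discovery"


cohorts = [
    Cohort("GSE54514", outcomes_at_training=True, held_out=True),
    Cohort("cohort_A", outcomes_at_training=True),    # placeholder name
    Cohort("cohort_B", outcomes_at_training=False),   # placeholder name
    Cohort("hai_cohort_1", outcomes_at_training=True, hai=True),  # placeholder name
]
roles = {c.name: assign_role(c) for c in cohorts}
```

Checking the HAI flag first keeps those datasets in their own validation group regardless of when their outcome data became available.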