We performed a systematic search of the Gene Expression Omnibus (GEO) database (Edgar et al. 2002) for cohorts satisfying the following inclusion criteria: (i) cohorts with bacterial or viral infections; (ii) cohorts with hospitalization and clinical information; and (iii) cohorts with whole blood samples. Samples were excluded for (i) sarcoid or cancer; (ii) unknown pathogens; and (iii) coinfection with bacteria and virus (Supplementary Fig. S1). Of the 3203 samples across 16 cohorts, we retained 2680 samples for subsequent analysis (Supplementary Fig. S1 and Table 1). Patients with Escherichia coli, methicillin-resistant Staphylococcus aureus, tuberculosis, influenza A virus subtype H1N1, and other infections were included in the current study to identify bacterial and viral infection.
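The sample-level criteria above can be illustrated with a minimal sketch. The `Sample` record, its field names, and the label strings are hypothetical and introduced only for illustration; the study's actual metadata fields and curation procedure are not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    # Hypothetical metadata record; field names are assumptions.
    sample_id: str
    tissue: str
    infection: Optional[str]    # "bacterial", "viral", "coinfection", or None if unknown
    comorbidity: Optional[str]  # e.g. "sarcoid", "cancer", or None


EXCLUDED_COMORBIDITIES = {"sarcoid", "cancer"}
ELIGIBLE_INFECTIONS = {"bacterial", "viral"}


def is_eligible(sample: Sample) -> bool:
    """Apply the inclusion and exclusion criteria to one sample."""
    if sample.tissue != "whole blood":
        return False
    # Unknown pathogens (None) and coinfections are both excluded here.
    if sample.infection not in ELIGIBLE_INFECTIONS:
        return False
    return sample.comorbidity not in EXCLUDED_COMORBIDITIES


samples = [
    Sample("s1", "whole blood", "bacterial", None),
    Sample("s2", "whole blood", "coinfection", None),
    Sample("s3", "whole blood", None, None),
    Sample("s4", "whole blood", "viral", "cancer"),
    Sample("s5", "whole blood", "viral", None),
]
retained = [s.sample_id for s in samples if is_eligible(s)]
```

In this toy input only `s1` and `s5` pass, since the others are a coinfection, an unknown pathogen, and a cancer case respectively.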
The samples of the 14 cohorts in the discovery set were divided into 70% (1876) for training and 30% (804) for testing (Table 1). The training set was used to extract biomarkers and train the classifiers, while the test set was used to evaluate performance and determine the hyperparameters of bvnGPS. To verify the generalization ability of bvnGPS, 147 patients in GSE21802 and GSE57065 were used for external validation. To demonstrate the robustness and simplicity of the GPS procedure, no additional preprocessing was performed on the raw expression cohorts.
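As a rough illustration of the 70/30 partition, the sketch below splits a list of placeholder sample IDs with a seeded random shuffle. The function name, the seed, and the use of a simple unstratified shuffle are assumptions; the text does not state whether the split was stratified by cohort or by infection type.

```python
import random


def split_discovery_set(sample_ids, train_fraction=0.7, seed=0):
    """Randomly partition sample IDs into training and test subsets."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


# Placeholder IDs standing in for the 2680 discovery-set samples.
sample_ids = [f"sample_{i}" for i in range(2680)]
train_ids, test_ids = split_discovery_set(sample_ids)
print(len(train_ids), len(test_ids))  # 1876 804
```

With 2680 samples, a 70% share rounds to 1876 training samples and leaves 804 for testing, matching the counts reported above.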