Reference strains (n = 91) for all serotypes except 6D were acquired from Statens Serum Institut. The reference strain for 6D was kindly provided by the National Institute for Health and Welfare (THL), Finland. A total of 926 clinical isolates were selected from the archives of the Public Health England (PHE) National Reference Laboratory as a test cohort; for serotypes found to belong to a genogroup, at least 10 isolates were selected where available. Data cleansing after genome sequencing (to remove repeat isolates from the same patient, mixed cultures, other species, and isolates with partial or failed MLST profiles) left 871 isolates (Development Set in Table 1). In addition, 2079 prospective or research-related isolates were sequenced as part of the UK validation cohort. This cohort covers 72 of the commonly circulating serotypes (including all vaccine serotypes) and includes prospective isolates received by PHE during 2015, isolates selected as part of research projects and epidemiological investigations (15A (n = 196) and 19A (n = 249), respectively), and archived isolates for rarer serotypes. The same data cleansing applied to this dataset left 2065 isolates (Validation Set in Table 1). All isolates were serotyped on receipt as part of the PHE enhanced surveillance programme using slide agglutination with Statens Serum Institut typing sera.
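The data cleansing step described above can be illustrated with a minimal sketch. This is not the pipeline used in the study: the `Isolate` record, its field names, and the rule of keeping the first passing isolate per patient are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Isolate:
    # Hypothetical record layout; the study does not describe its data model.
    isolate_id: str
    patient_id: str
    species: str
    mixed_culture: bool
    mlst_complete: bool  # False for partial MLST profiles or MLST failures


def cleanse(isolates):
    """Return isolates that pass the four exclusion criteria.

    Quality filters run before the per-patient check, so a failed
    isolate does not block a later good isolate from the same patient.
    Keeping the first passing isolate per patient is an assumption.
    """
    seen_patients = set()
    kept = []
    for iso in isolates:
        if iso.species != "Streptococcus pneumoniae":
            continue  # other species
        if iso.mixed_culture:
            continue  # mixed culture
        if not iso.mlst_complete:
            continue  # partial MLST profile or MLST failure
        if iso.patient_id in seen_patients:
            continue  # repeat isolate from the same patient
        seen_patients.add(iso.patient_id)
        kept.append(iso)
    return kept
```

Applied to a cohort list, `len(cleanse(cohort))` would give the post-cleansing count corresponding to the figures reported for the Development and Validation Sets.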
Genomic data for non-UK isolates were obtained from the Streptococcus pneumoniae isolate database hosted in BIGSdb (http://pubmlst.org/software/database/bigsdb/) (Jolley & Maiden, 2010) and from the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena). Specifically, three collections were used: a set of 2531 isolates from Thailand initially described by Chewapreecha et al. (2014), an Icelandic panel of 252 serogroup 6 isolates described in Van Tonder et al. (2015), and a USA panel of 181 invasive isolates available in ENA as study SRP059723.