We obtained a total of 2600 Kp genomes (2021 publicly available and 579 novel genomes from a diverse set collected in Australia). Sequence reads were generated locally or obtained from the European Nucleotide Archive (accession numbers are listed in Table S1, available with the online Supplementary Material); 916 genomes that were publicly available as assembled contigs only were downloaded from PATRIC (Wattam et al., 2014) and the NCTC3000 Project (Wellcome Trust Sanger Institute – http://www.sanger.ac.uk/resources/downloads/bacteria/nctc/). For isolates sequenced in this study (n=579), DNA was extracted, libraries were prepared using the Nextera XT 96 barcode DNA kit, and 125 bp paired-end reads were generated on the Illumina HiSeq 2500 platform.
All paired-end read sets were filtered for a mean Phred quality score ≥30, then assembled de novo using SPAdes v3 (Bankevich et al., 2012). Genomes were excluded from the study if they were duplicate samples, or if there was evidence of contamination or mixed culture, or of a poor-quality assembly, as measured by: (i) <50 % reads mapping to the NTUH-K2044 reference chromosome (accession number: AP006725.1); (ii) the ratio of heterozygous/homozygous single nucleotide polymorphism (SNP) calls compared to the reference chromosome exceeding 20 %; (iii) the total assembly length being >6.5 Mb, or >6.0 Mb with evidence of >1 % non-Klebsiella read contamination as determined by MetaPhlAn (Segata et al., 2012); or (iv) the assembly being low quality, i.e. total length <5 Mb.
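The exclusion criteria above can be written as a simple rule set. The following is a minimal sketch, not the authors' pipeline: the `AssemblyQC` record, its field names and the `exclusion_reasons` function are hypothetical, and the metrics are assumed to have been computed beforehand (read mapping, SNP calling, MetaPhlAn profiling). It also assumes that read-based checks are skipped for genomes available only as assembled contigs, and it reads the length criterion as ">6.5 Mb, or >6.0 Mb together with >1 % non-Klebsiella reads".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AssemblyQC:
    """Per-genome QC metrics. All field names are hypothetical."""
    assembly_length_mb: float                      # total assembly length (Mb)
    pct_reads_mapped: Optional[float] = None       # % reads mapping to NTUH-K2044
    het_hom_snp_ratio_pct: Optional[float] = None  # het/hom SNP call ratio (%)
    pct_non_klebsiella: Optional[float] = None     # % non-Klebsiella reads
    is_duplicate: bool = False                     # flagged as a duplicate sample


def exclusion_reasons(qc: AssemblyQC) -> List[str]:
    """Return the criteria a genome fails; an empty list means it is kept."""
    reasons = []
    if qc.is_duplicate:
        reasons.append("duplicate sample")
    # (i) fewer than 50 % of reads map to the reference chromosome.
    # Assumption: skipped when no reads are available (value is None).
    if qc.pct_reads_mapped is not None and qc.pct_reads_mapped < 50:
        reasons.append("(i) <50 % reads mapped to reference")
    # (ii) heterozygous/homozygous SNP call ratio exceeds 20 %.
    if qc.het_hom_snp_ratio_pct is not None and qc.het_hom_snp_ratio_pct > 20:
        reasons.append("(ii) het/hom SNP ratio >20 %")
    # (iii) assembly too long, or moderately long with read contamination.
    contaminated = (qc.pct_non_klebsiella is not None
                    and qc.pct_non_klebsiella > 1)
    if qc.assembly_length_mb > 6.5 or (qc.assembly_length_mb > 6.0
                                       and contaminated):
        reasons.append("(iii) assembly length/contamination")
    # (iv) low-quality assembly: total length below 5 Mb.
    if qc.assembly_length_mb < 5:
        reasons.append("(iv) assembly length <5 Mb")
    return reasons
```

For example, `exclusion_reasons(AssemblyQC(assembly_length_mb=6.2, pct_non_klebsiella=3.0))` flags the genome under criterion (iii), whereas the same assembly length with 0.5 % non-Klebsiella reads passes.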