The main dataset consisted of 27,476 genomes from NCBI RefSeq (April 2018) belonging to the 53 genera defined as Enterobacteriaceae by the UK Standards for Microbiology Investigations [64]:
Arsenophonus,
Biostraticola,
Brenneria,
Buchnera,
Budvicia,
Buttiauxella,
Calymmatobacterium,
Cedecea,
Citrobacter,
Cosenzaea,
Cronobacter,
Dickeya,
Edwardsiella,
Enterobacter,
Erwinia,
Escherichia,
Ewingella,
Gibbsiella,
Hafnia,
Klebsiella,
Kluyvera,
Leclercia,
Leminorella,
Levinea,
Lonsdalea,
Mangrovibacter,
Moellerella,
Morganella,
Obesumbacterium,
Pantoea,
Pectobacterium,
Phaseolibacter,
Photorhabdus,
Plesiomonas,
Pragia,
Proteus,
Providencia,
Rahnella,
Raoultella,
Saccharobacter,
Salmonella,
Samsonia,
Serratia,
Shigella,
Shimwellia,
Sodalis,
Tatumella,
Thorsellia,
Trabulsiella,
Wigglesworthia,
Xenorhabdus,
Yersinia and
Yokenella. Note that the definition of Enterobacteriaceae has since been updated to include only a subset of these genera, with the rest assigned to new families within the order Enterobacteriales [38]; hence our analysis can be considered a screen of Enterobacteriales, which uncovered SP loci in the related families Enterobacteriaceae, Erwiniaceae and Yersiniaceae. Incorrect species assignments were corrected using BacSort (github.com/rrwick/Bacsort), a method in which a neighbour-joining tree of all isolates is constructed and monophyletic clades are manually curated at the species level. Genetic distance was calculated as one minus average nucleotide identity (1 − ANI) for all pairs of genomes, where ANI was estimated using kmer-db [65] with the ‘-f 0.02’ option (which, for a genome size of 5 Mb, corresponds to Mash [66] with sketch size 10^5), followed by neighbour-joining tree construction using rapidNJ [67]. We removed (i) isolates belonging to the genera
Arsenophonus and
Sodalis as these genera were rare and did not form monophyletic clades (
n = 6); (ii) isolates with a temporary genus name
Candidatus that could not be curated using BacSort (
n = 6); and (iii) isolates that could not be assigned to any of the 53 genera. The resulting 27,383 isolates were classified into 45 genera and assigned to 39 monophyletic genus-groups, with the following joint groups:
Buchnera/Wigglesworthia,
Erwinia/Pantoea,
Escherichia/Shigella,
Klebsiella/Raoultella,
Proteus/Cosenzaea and
Serratia/Gibbsiella. For some isolates, especially those from species that were rare in the dataset, reconciling species names was problematic. Hence, new species categories (i.e., operational taxonomic units) were defined based on the structure of the distance tree. Monophyletic species groups retained the original species names (e.g.,
K. pneumoniae), while polyphyletic groups within a genus were split into monophyletic clades with a new unique name (e.g.,
Citrobacter unknown C1; in this paper we refer to these as species groups, and genome assignments are given in Supplementary Table S1). The remaining isolates (
n = 52) were assigned to the category ‘Other’.
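
The arithmetic behind the distance calculation above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names `equivalent_sketch_size` and `ani_to_distance` and the toy ANI values are hypothetical, and the sketch assumes that a genome of G bp contains roughly G distinct k-mers, so that sampling a fraction f of them retains about f × G k-mers.

```python
from typing import List


def equivalent_sketch_size(genome_size_bp: int, fraction: float) -> int:
    """Approximate number of k-mers retained when a fixed fraction of a
    genome's k-mers is sampled (assumes ~1 distinct k-mer per base)."""
    return round(genome_size_bp * fraction)


def ani_to_distance(ani: List[List[float]]) -> List[List[float]]:
    """Convert a square ANI matrix (values in [0, 1]) into a genetic
    distance matrix defined as 1 - ANI, with zeros on the diagonal."""
    n = len(ani)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist[i][j] = 1.0 - ani[i][j]
    return dist


# A 5 Mb genome sampled at a fraction of 0.02 keeps about 10^5 k-mers.
sketch = equivalent_sketch_size(5_000_000, 0.02)

# Hypothetical ANI values for three genomes (illustrative only).
toy_ani = [
    [1.00, 0.98, 0.90],
    [0.98, 1.00, 0.91],
    [0.90, 0.91, 1.00],
]
toy_dist = ani_to_distance(toy_ani)
```

A distance matrix of this form is the kind of input from which a neighbour-joining tree can be built; the tree construction itself is left to a dedicated tool.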