To cover a large portion of the known bacterial diversity within this species (Table S2), a total of 462 E. coli strains from multiple healthy and diseased sources were investigated. We scored as pathogenic those bacteria that were isolated from diseased hosts or that carried known virulence determinants (see bottom of Table S2), and all others as non-pathogens. One focus of the collection consisted of pathogens from both humans and domesticated animals that had been classified as EHEC (41 isolates), EPEC (20), EAEC (9), or ETEC (20) on the basis of virulence determinants (Nataro and Kaper, 1998), or as APEC (13) on the basis of typical disease in domesticated animals. To add geographical as well as host diversity, and to expand the number of non-pathogens, the collection included all 72 isolates from the ECOR collection (Ochman and Selander, 1984), 15 isolates that represent the known diversity of E. coli from healthy wild mammals in Australia (Gordon et al., 2002), and 114 isolates from patients with diarrhoea in Ghana and from their close contacts, including food handlers. We also included 61 Shigella isolates from all known serotypes and species, 38 EIEC isolates of different serotypes, and 46 isolates from a variety of clonal groupings that express the K1 capsular polysaccharide (Achtman and Pluschke, 1986). Additional details, including geographic origin, are in Table S2.
Sequence-based phylogenetic analysis showed that two E. coli isolates (RL325/96 and Z205, from a dog and a parrot, respectively) differed markedly from the remaining isolates (Fig. 2). These strains clearly belong to E. coli according to biochemical, serological and metabolic typing schemes and by 16S rDNA sequences. Based on the MLST data, they represent the deepest known evolutionary lineages in this species. Because of their extensive sequence divergence from the vast majority of E. coli strains, they were excluded from subsequent analyses.