The expanded H. pylori dataset consists of 3,406 bp of unique, concatenated sequences of fragments of atpA, efp, mutY, ppa, trpC, ureI and yphC from 769 H. pylori isolates (Table S1). The dataset includes 347 novel isolates in addition to data from 422 other strains that have been described previously2,3,23. The new bacteria were isolated from 25 additional ethnic sources in Asia (8 countries), Europe (4, including Basques), Africa and the Middle East (9) and South America (2), for a total of 51 ethnic sources (Table S1). The forward and reverse strands were sequenced as described1. Almost half (1,522 sites, 45%) of the nucleotides are polymorphic, resulting in a nucleotide diversity (π) of 4.2% for the entire dataset.
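For readers who want to see how the two summary statistics above are commonly obtained, the following is a minimal sketch. It assumes the standard definitions (a site is polymorphic if more than one nucleotide occurs in its alignment column; π is the mean proportion of differing sites over all pairs of sequences) and gap-free, equal-length sequences. The text does not state which software or estimator was used, so the function names and the toy alignment are purely illustrative.

```python
from itertools import combinations


def polymorphic_sites(seqs):
    """Count alignment columns that contain more than one distinct nucleotide."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """Mean proportion of differing sites over all pairs of sequences.

    Assumes equal-length sequences with no gaps or ambiguity codes.
    """
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


# Toy alignment (invented for illustration, not taken from Table S1).
toy = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "ACGTACGT"]

n_poly = polymorphic_sites(toy)
pi = nucleotide_diversity(toy)
print(f"polymorphic sites: {n_poly} of {len(toy[0])}")
print(f"nucleotide diversity: {pi:.4f}")
```

Applied to the real alignment, the first function would give the count reported as 1,522 of 3,406 sites only if the same definition of a polymorphic site was used; the pairwise loop is quadratic in the number of sequences, which is still tractable for 769 isolates.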
The non-migrant dataset excluded bacteria that were isolated from the following migrant human populations: Europeans and Cape Coloureds from Cape Town; Mestizos from Colombia and Venezuela; Whites and African Americans from the USA; and isolates from Thailand that were obtained from ethnic Chinese or lacked an ethnic association. hpAfrica2 isolates from Xhosas near Pretoria were excluded because they were a selective subset rather than a population-wide sample. Isolates from the Philippines were also removed because almost all bacterial populations were found there, probably owing to the country's colonial history. For isolates from Native Americans, only hspAmerind strains were considered non-migrant. The dataset was further restricted to geographic samples with at least four isolates, to reduce statistical noise, which resulted in the elimination of all Jewish and Russian isolates and of singletons from locations in China and Japan.
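The two-stage restriction described above (exclude flagged populations first, then drop geographic samples with fewer than four remaining isolates) can be sketched as follows. The record fields `group` and `location`, the label `migrant` and the toy records are hypothetical; the actual exclusions were made on the criteria listed in the text, not on a single label.

```python
from collections import Counter


def non_migrant_subset(isolates, excluded_groups, min_isolates=4):
    """Drop excluded groups, then drop locations with too few remaining isolates."""
    # Step 1: remove isolates from populations flagged for exclusion.
    kept = [iso for iso in isolates if iso["group"] not in excluded_groups]
    # Step 2: count the remaining isolates per geographic sample.
    counts = Counter(iso["location"] for iso in kept)
    # Step 3: keep only geographic samples that meet the minimum size.
    return [iso for iso in kept if counts[iso["location"]] >= min_isolates]


# Toy records with made-up labels, not rows from Table S1.
toy_isolates = (
    [{"id": f"a{i}", "group": "resident", "location": "site_A"} for i in range(5)]
    + [{"id": f"b{i}", "group": "resident", "location": "site_B"} for i in range(3)]
    + [{"id": f"c{i}", "group": "migrant", "location": "site_A"} for i in range(2)]
)

subset = non_migrant_subset(toy_isolates, excluded_groups={"migrant"})
print(len(subset), "isolates retained")
```

The size threshold is applied after the exclusions, so a location that only reaches four isolates by counting excluded ones is still dropped, which matches the order in which the restrictions are described.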