Raw read data for 8,136 strains were downloaded from the SRA (see SRA accessions in Supplementary Table 1). Reads were mapped onto a reference strain of H37Rv (GenBank accession number CP003248.2) using BWA version 0.7.10 [68]. Variants were identified using Pilon version 1.11 as described [67]. The global M. tuberculosis lineage designations used in our analysis, as well as each strain’s spoligotype, were predicted using digital spoligotyping as in Cohen et al., 2015 [6].
We eliminated 824 strains that did not pass our quality control filters. To pass, a strain was required to have: average sequencing depth of coverage >20X; fraction of long insertions <0.2; ambiguity rate <0.5 (to remove samples that were suspected to represent mixes of different genotypes); number of low-coverage bases <250,000; and a single match to one lineage in our lineage-prediction algorithm. We also eliminated strains for which Pilon failed three times. Of the remaining 7,312 samples, we removed 1,970 strains with no “country” metadata or description in a publication; 19 strains with substantial non-tuberculous mycobacteria contamination; and 13 additional duplicate patient samples. These filters resulted in a final set of 5,310 strains for analysis.
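The quality control criteria and the sample bookkeeping above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in the study: the `StrainQC` record, its field names, and the `passes_qc` function are hypothetical, and the metrics are assumed to be expressed in the same units as the thresholds quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class StrainQC:
    """Hypothetical per-strain QC record; field names are illustrative."""
    mean_depth: float               # average sequencing depth of coverage
    long_insertion_fraction: float  # fraction of long insertions
    ambiguity_rate: float           # rate of ambiguous calls
    low_coverage_bases: int         # number of low-coverage bases
    lineage_matches: int            # lineages matched by the lineage predictor


def passes_qc(s: StrainQC) -> bool:
    """Return True only if the strain satisfies every threshold in the text."""
    return (
        s.mean_depth > 20
        and s.long_insertion_fraction < 0.2
        and s.ambiguity_rate < 0.5
        and s.low_coverage_bases < 250_000
        and s.lineage_matches == 1
    )


# Bookkeeping check using only the counts reported in the text.
downloaded = 8136
after_qc = downloaded - 824            # strains remaining after QC
final_set = after_qc - 1970 - 19 - 13  # metadata, contamination, duplicates
```

With the reported counts, `after_qc` is 7,312 and `final_set` is 5,310, matching the numbers stated above.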
Emu [69] was run to canonicalize variants. We conducted phylogenetic analyses for the entire set of 5,310 strains, as well as for a subset corresponding to each lineage and each United Nations geographical subregion [23] with >30 strains (Supplementary Table 3). For each set, all sites with unambiguous single nucleotide polymorphisms (SNPs) in at least one strain were combined into a concatenated alignment. Ambiguous positions were treated as missing data. The concatenated alignment was then used to generate a midpoint-rooted phylogenetic tree using FastTree [70] version 2.1.8.
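The construction of the concatenated alignment can be illustrated with a minimal Python sketch. It is not the code used in the study and rests on several assumptions: the function name `build_snp_alignment` and the input layout (a mapping from strain to position-indexed base calls) are hypothetical, a position absent from a strain's calls is assumed to carry the reference base, and missing data is encoded as `N`.

```python
from typing import Dict

UNAMBIGUOUS = set("ACGT")


def build_snp_alignment(
    reference: Dict[int, str],
    calls: Dict[str, Dict[int, str]],
    missing: str = "N",
) -> Dict[str, str]:
    """Concatenate all sites with an unambiguous SNP in at least one strain.

    reference: position -> reference base (must cover every called position).
    calls: strain -> {position -> called base}; any non-ACGT call is
           treated as ambiguous. Absent positions are assumed to be reference.
    """
    # Keep a site if any strain has an unambiguous non-reference base there.
    variable_sites = sorted({
        pos
        for strain_calls in calls.values()
        for pos, base in strain_calls.items()
        if base in UNAMBIGUOUS and base != reference[pos]
    })

    alignment = {}
    for strain, strain_calls in calls.items():
        row = []
        for pos in variable_sites:
            base = strain_calls.get(pos, reference[pos])
            # Ambiguous calls become missing data in the alignment.
            row.append(base if base in UNAMBIGUOUS else missing)
        alignment[strain] = "".join(row)
    return alignment
```

Each returned string is one row of the alignment, with columns ordered by genome position; writing the rows out in FASTA format would give an input suitable for a tree-building program.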