After consensus genomes were combined, we used snp-sites v2.5.1 to extract the variant positions, and then generated a neighbour-joining tree of all 6037 samples with IQ-TREE v2.1.4-beta [34 (
link)]. The tb-profiler results were combined with the neighbour-joining tree and the L4 genomes identified. A maximum-likelihood phylogenetic tree of the L4 genomes was then derived using IQ-TREE with built-in model selection, and the inclusion of the number of invariant sites, as identified using snp-sites. TreeBreaker v1.1 [35 (
link)] was used to identify internal nodes of the tree where there was a change in the distribution of phenotypes of interest at the tips that descended from that internal node. The TreeBreaker command line used was ‘treeBreaker -x 5000000 -y 5000000 -z 10 000 input.tree phenotype.txt output_prefix’. The phenotype of interest was the geographical location. To enable easy interpretation, separate TreeBreaker runs were carried out for Vietnam, Indonesia, China and Thailand, and all the preceding countries combined into a single category (i.e. a single ‘phenotype’ of belonging to either Vietnam, Indonesia, China or Thailand). TreeBreaker outputs a text file that, on the last line of the file, has a newick format phylogenetic tree with the results annotated onto the internal nodes. This newick tree was extracted from the text file and saved as a tree file. It was then converted to a nexus format tree using FigTree (ensuring to include annotations) for reading into dendropy v4.5.2 [36 (
link)]. The nexus format tree was then parsed using a script (
https://gist.github.com/flashton2003/50d645a60219c0e381874a1dd4355646) to produce sub-trees and summary information for nodes above the 0.5 posterior probability threshold. Example input and output files for TreeBreaker analysis can be accessed from
https://doi.org/10.6084/m9.figshare.21378312. As TreeBreaker produces results annotated onto the nodes of the input phylogenetic tree, and we used the same input tree for all analyses, we could combine the results from these different runs based on the identifiers of the internal nodes. As we were using TreeBreaker as a screening tool, to identify nodes for further analysis using SIMMAP, we filtered for nodes with a posterior probability threshold of 0.5 and at least five descendent leaves. All SIMMAP analysis [37 (
link)] was carried out using the make.simmap function from PhyTools [38 (
link)] in the R statistical language [39 ] using RStudio [40 ]. The fit of each model type (all rates different, symmetrical and equal rates) was assessed using the fitMk function, and the model with the best fit was used for the SIMMAP analysis. We ran 1000 simulations within SIMMAP. Nodes that were identified as being associated with changes by TreeBreaker were targeted for investigation in the output of SIMMAP. Trees (newick format) were visualized with iTOL [41 (
link)], and graphs drawn with ggplot2 [42 ]. The files for replicating the iTOL trees can be downloaded from
https://doi.org/10.6084/m9.figshare.21378330.v1. Phylotemporal analysis was carried out using TreeTime v0.9.0 [43 (
link)] with a substitution rate and standard deviation of 0.000000061643 and 0.0000000385, respectively. These values were obtained from the estimates of the ‘BEAST constant population size, uniform prior on clock rate’ analysis of Menardo
et al. [44 (
link)]. The command line used was ‘treetime –clock-rate 0.000000061643 –tree input.tree –dates input_dates.csv –outdir my_analysis –sequence-length 4411532 –confidence –clock-std-dev 0.0000000385’. Input data for TreeTime analysis can be found at
https://doi.org/10.6084/m9.figshare.21401307.v1.