The Global Initiative on Sharing Avian Influenza Data (GISAID) was founded in 2006 and, since 2010, has been hosted by the German Federal Ministry of Food, Agriculture and Consumer Protection. Since December 2019, GISAID has also served as a coronavirus repository. As of 4 March 2020, the cutoff point for our phylogenetic analysis, the GISAID database (https://www.gisaid.org/) had compiled 254 coronavirus genomes, isolated from 244 humans, nine Chinese pangolins, and one bat, Rhinolophus affinis (BatCoVRaTG13 from Yunnan Province, China). The sequences were deposited by 82 laboratories, listed in Dataset S1, Supplementary Table 1. Although SARS-CoV-2 is an RNA virus, the deposited sequences are, by convention, in DNA format. Our initial alignment confirmed an earlier report by Zhou et al. (7) that the pangolin coronavirus sequences are poorly conserved with respect to the human SARS-CoV-2 virus, while the bat coronavirus yielded a sequence similarity of 96.2% in our analysis, in agreement with the 96.2% published by Zhou et al. We discarded partial sequences and used only the most complete genomes, which we aligned to the full reference genome of Wu et al. (8), comprising 29,903 nucleotides. Finally, to ensure comparability, we truncated the flanks of all sequences to the consensus range 56 to 29,797, with nucleotide positions numbered according to the Wuhan 1 reference sequence (8). The laboratory codes of the resulting 160 sequences and of the bat coronavirus sequence are listed in Dataset S1, Supplementary Table 2 (Coronavirus Isolate Labels).
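The flank truncation and the similarity comparison described above can be illustrated with a minimal Python sketch. It is not the procedure used for the analysis: the function names are hypothetical, the alignment is assumed to be a dictionary mapping sequence labels to equal-length aligned strings with "-" for gaps, and percent identity over columns that are ungapped in both sequences is only one of several possible definitions of sequence similarity.

```python
def reference_columns(aligned_ref, start, end):
    """Map 1-based, inclusive reference positions start..end to a
    half-open range of alignment columns, skipping gaps in the reference."""
    pos = 0
    first = last = None
    for col, base in enumerate(aligned_ref):
        if base == "-":
            continue  # gap columns do not advance the reference position
        pos += 1
        if pos == start:
            first = col
        if pos == end:
            last = col
            break
    if first is None or last is None:
        raise ValueError("reference is shorter than the requested range")
    return first, last + 1


def truncate_alignment(alignment, ref_name, start=56, end=29797):
    """Trim every aligned sequence to the columns spanning the given
    reference positions (defaults follow the range stated in the text)."""
    lo, hi = reference_columns(alignment[ref_name], start, end)
    return {name: seq[lo:hi] for name, seq in alignment.items()}


def percent_identity(a, b):
    """Percentage of matching bases over columns ungapped in both sequences
    (an assumed definition of similarity, chosen for illustration)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)
```

Mapping reference positions to alignment columns, rather than slicing by position directly, keeps the numbering tied to the reference even when the alignment introduces gap columns into it.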
The 160 human coronavirus sequences comprised exactly 100 different types. We added the bat coronavirus to the data as an outgroup to determine the root within the phylogeny. Phylogenetic network analyses were performed with the Network 5011CS package, which includes, among other algorithms, the median joining network algorithm (3) and a Steiner tree algorithm to identify most-parsimonious trees within complex networks (9). We coded gaps of adjacent nucleotides as single deletion events (these deletions being rare, up to 24 nucleotides long, and mostly in the amino acid reading frame). We ran the data with the epsilon parameter set to zero and also performed an exploratory run with the epsilon parameter set to 10. Both settings yielded a low-complexity network. The Steiner tree algorithm was then run on both networks and gave the same result for each: the most-parsimonious trees within the network had a length of 229 mutations. The structures of the two networks were very similar, with the epsilon 10 setting producing an additional rectangle between the A and B clusters. The network output was annotated using the Network Publisher option to indicate geographic regions, sample collection times, and cluster nomenclature.
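Two of the data preparation steps above, grouping sequences into distinct types and treating a run of adjacent gap characters as one deletion event, can be illustrated with a short Python sketch. It is not the internal logic of the Network package: the function names are hypothetical, and a "type" is assumed here to mean a set of sequences that are identical over the aligned range, with no special handling of ambiguous bases.

```python
from collections import defaultdict


def deletion_events(aligned_seq):
    """Return (start_column, length) for each maximal run of '-' so that
    adjacent gap characters count as a single deletion event."""
    events = []
    run_start = None
    for col, base in enumerate(aligned_seq):
        if base == "-":
            if run_start is None:
                run_start = col  # a new run of gaps begins here
        elif run_start is not None:
            events.append((run_start, col - run_start))
            run_start = None
    if run_start is not None:  # a run that extends to the end of the sequence
        events.append((run_start, len(aligned_seq) - run_start))
    return events


def group_into_types(alignment):
    """Group sequence labels by identical aligned sequence (case-insensitive).
    Each key is one distinct type; the value lists the isolates carrying it."""
    groups = defaultdict(list)
    for name, seq in alignment.items():
        groups[seq.upper()].append(name)
    return dict(groups)
```

Applied to the truncated alignment, the number of keys returned by group_into_types would correspond to the number of distinct types under this assumed definition.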