The DDH benchmark data set was extended relative to previous studies in order to increase the precision and significance of the ranking of the genome-to-genome distance methods and of the models for conversion to DDH values. The data set used here (henceforth called “DS1”) comprised 156 unique genome pairs along with their respective DDH values: 62 from Goris et al. [6], 31 from the GOLD database [25], and 63 from Richter et al. [7]. Only the first two sources had been considered in a previous publication on GBDP as a DDH replacement [8].
If several DDH/ANIb/ANIm/Tetra values were present for a single genome pair, they were averaged. A single genome pair showed a DDH value above 100% similarity (i.e., 100.9% between Escherichia coli O157:H7 EDL933 and Escherichia coli O157:H7 Sakai). Because such a value is not biologically meaningful, it was set to 100% to maintain proper input data for some of the statistical models (see below). Another genome pair (Thermotoga maritima MSB8 and Thermotoga petrophila RKU-1) showed a contradiction between its DDH value (16.9%) on the one hand and the genome-based distance/similarity measures (GBDP, ANI, ANIb, ANIm and Tetra) on the other hand [7]. Following [7], this questionable data point was excluded from the correlation analyses. The full list of genome pairs used in this study is found in Additional file 1.
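The three preprocessing rules described above (averaging replicate values, capping at 100%, and excluding a questionable pair) can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name `preprocess_ddh`, the record layout, and the "Genome X"/"Genome Y" replicate values are hypothetical, and applying the cap after averaging is an assumption, since the order of the two steps is not stated.

```python
from collections import defaultdict
from statistics import mean


def preprocess_ddh(records, excluded_pairs=(), cap=100.0):
    """Average replicate DDH values per genome pair, cap them, drop exclusions.

    `records` is an iterable of (genome_a, genome_b, ddh_percent) tuples.
    This layout is an assumption made for illustration.
    """
    grouped = defaultdict(list)
    for genome_a, genome_b, ddh in records:
        # A pair is unordered: (A, B) and (B, A) are the same comparison.
        grouped[frozenset((genome_a, genome_b))].append(ddh)

    excluded = {frozenset(pair) for pair in excluded_pairs}
    cleaned = {}
    for pair, values in grouped.items():
        if pair in excluded:
            continue  # questionable data point, left out of the analysis
        # Assumption: replicates are averaged first, then capped at `cap`.
        cleaned[pair] = min(mean(values), cap)
    return cleaned


# Illustrative input; only the 100.9 and 16.9 values come from the text.
records = [
    ("E. coli O157:H7 EDL933", "E. coli O157:H7 Sakai", 100.9),
    ("T. maritima MSB8", "T. petrophila RKU-1", 16.9),
    ("Genome X", "Genome Y", 70.0),  # hypothetical replicate 1
    ("Genome Y", "Genome X", 74.0),  # hypothetical replicate 2
]
cleaned = preprocess_ddh(
    records,
    excluded_pairs=[("T. maritima MSB8", "T. petrophila RKU-1")],
)
```

Keying each pair by a `frozenset` makes the lookup independent of the order in which the two genomes are listed, which matters when values for the same pair are collected from different sources.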
To detect significant deviations, if any, between the new and the previous GBDP implementation, the data subset “DS2” was created, containing only the previously available data points [8]. For comparing GBDP with the first ANI implementation, data subset “DS3” comprised the 62 data points in common between [6, 8]; for comparison with the JSpecies study, subset “DS4” contained only the 98 DDH values in common between [7, 8].
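Subsets such as DS3 and DS4 are defined as the data points shared between studies, which amounts to a set intersection over genome pairs. The following Python sketch shows that idea under stated assumptions: the helper names `pair_key` and `common_subset`, the genome labels, and the DDH values are all hypothetical and do not reproduce the actual subsets.

```python
def pair_key(genome_a, genome_b):
    """Return an order-independent key for a genome pair."""
    return frozenset((genome_a, genome_b))


def common_subset(benchmark, *studies):
    """Keep the benchmark entries whose pair occurs in every given study.

    `benchmark` maps pair keys to DDH values; each study is a collection
    of pair keys. At least one study must be supplied.
    """
    shared = set.intersection(*(set(study) for study in studies))
    return {pair: ddh for pair, ddh in benchmark.items() if pair in shared}


# Hypothetical toy data, not taken from any of the cited studies.
benchmark = {
    pair_key("A", "B"): 85.0,
    pair_key("A", "C"): 40.0,
    pair_key("B", "C"): 62.5,
}
study_one = {pair_key("A", "B"), pair_key("A", "C")}
study_two = {pair_key("B", "A"), pair_key("B", "C")}

subset = common_subset(benchmark, study_one, study_two)
```

Here only the pair A/B occurs in both toy studies, so `subset` retains that single entry.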