To identify systematic changes between datasets, for each locus the allele sizes of one dataset were translated by a constant, and the G test statistic of independence between allele frequencies and dataset (older HGDP–CEPH dataset versus newly genotyped dataset) was then computed [23]. Among all possible constants for translation of allele sizes, the one that minimized the G statistic was determined. In implementing the G test, two groups of comparisons were performed. In the first group of comparisons, the constant of translation was determined by comparing 80 Jewish individuals genotyped simultaneously with the Native Americans to all 255 individuals from Europe and the Middle East in the HGDP–CEPH H1048 dataset [109], excluding Mozabites. The second group of comparisons involved 346 Native American individuals from Central and South America in the newer dataset (all 336 sampled Central and South Americans excluding Ache, and ten additional individuals who were later excluded) and 63 Native American individuals from the Maya, Pima, and Piapoco populations in the older H1048 dataset (the Piapoco population is described as “Colombian” in previous analyses of these data). The constants obtained from the two G tests (labeled S1 for the comparison of the Jewish populations to European and Middle Eastern populations and S2 for the Native American comparison) were then compared with the constant of translation expected from three additional sources of information available for the two datasets: the genotypes of a Mammalian Genotyping Service size standard (S3), a code letter provided by the Mammalian Genotyping Service indicating the nature of the change in primers (S4), and the locations of the primers themselves in the human genome sequence (S5).
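The search for the optimal constant of translation at a single locus can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names `g_statistic` and `best_shift`, the choice to translate the newer dataset rather than the older one, the candidate range, and the toy allele sizes are all assumptions made for illustration. The G statistic is computed in its standard form, G = 2 Σ O ln(O/E), over the 2 × k table of dataset by allele size.

```python
from collections import Counter
from math import log


def g_statistic(counts_a, counts_b):
    """G statistic of independence for a 2 x k table (dataset x allele size)."""
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    n = n_a + n_b
    g = 0.0
    for allele in set(counts_a) | set(counts_b):
        obs_a = counts_a.get(allele, 0)
        obs_b = counts_b.get(allele, 0)
        column_total = obs_a + obs_b
        for observed, row_total in ((obs_a, n_a), (obs_b, n_b)):
            if observed > 0:  # empty cells contribute nothing to G
                expected = row_total * column_total / n
                g += 2.0 * observed * log(observed / expected)
    return g


def best_shift(sizes_old, sizes_new, candidates):
    """Return (constant, G) for the candidate constant that minimizes G.

    Assumption: the constant is added to the allele sizes of the newer
    dataset. Ties are resolved in favor of the first candidate examined.
    """
    counts_old = Counter(sizes_old)
    best = None
    for constant in candidates:
        counts_new = Counter(size + constant for size in sizes_new)
        g = g_statistic(counts_old, counts_new)
        if best is None or g < best[1]:
            best = (constant, g)
    return best


# Toy data (hypothetical): the newer dataset is offset by 2 bp from the older.
old_sizes = [100] * 10 + [104] * 20 + [108] * 5
new_sizes = [98] * 9 + [102] * 22 + [106] * 4
shift, g_min = best_shift(old_sizes, new_sizes, range(-4, 5))
print(shift, round(g_min, 3))
```

In this toy example the minimizing constant is 2, the only candidate that aligns the allele sizes of the two datasets. Under this scheme, S1 and S2 would each be obtained by one such search per locus, applied to the corresponding pair of samples.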
Among the 693 markers, 687 had the same optimal constant of translation (that is, the constant that minimizes the G statistic) in the two different sets of population comparisons (S1 = S2). The remaining six markers with different optimal constants of translation in the two G tests were compared with the value expected from the locations of the old and new primers in the human genome (S5). In all six cases, the optimal constant for the comparison of the Jewish and European/Middle Eastern datasets agreed with the value based on the primer locations (S1 = S5). As real population differences between datasets are more likely in Native Americans due to the larger overall level of genetic differentiation in the Americas, we used the constant obtained based on the Jewish and European/Middle Eastern comparison (S1) for allele size calibration.
Of the remaining 687 markers, 638 had an optimal constant of translation that agreed with the value expected based on the code letter provided by the Mammalian Genotyping Service (S1 = S2 = S4). Thus, there were 49 markers for which the code letter was either uninformative or produced a constant of translation that disagreed with S1 and S2. For 35 of these markers, the constant of translation based on the size standard (S3) agreed with S1 and S2. For eight of the remaining 14 markers, the constant of translation based on the primer sequences (S5) agreed with S1 and S2. The final six markers (AAT263P, ATT070, D15S128, D6S1021, D7S817, and TTTAT002Z), for which S1 ≠ S5, were then discarded. Of the 687 markers that were not discarded, 685 had G < 48 in both G tests, while the other two markers (D14S587 and D15S822) had G > 91 in the Jewish versus European/Middle Eastern comparison. These two extreme outliers, which also had the highest G values for the Native American comparison, were then excluded (
To further eliminate loci with extreme genotyping errors, we performed Hardy-Weinberg tests [110] within individual populations for the 685 remaining markers. This analysis, performed using PowerMarker [111], used only the 44 populations in which all 685 markers were polymorphic. We calculated the fraction of populations with a significant p-value (<0.05) for the Hardy-Weinberg test (