For the 390k analysis, we restricted to reads that not only mapped to the
human reference genome hg19 but also overlapped the
354,212 autosomal SNPs genotyped on the Human Origins array (ref. 4). We trimmed the last two nucleotides from
each sequence because we found that these are highly enriched in ancient DNA
damage even for UDG-treated libraries. We further restricted analyses to sites
with base quality ≥ 30.
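The trimming and base-quality filter described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `usable_bases`, its parameters, and the representation of a read as a string plus a list of Phred scores are assumptions made for the example.

```python
def usable_bases(seq, quals, trim=2, min_qual=30):
    """Return (offset, base) pairs that survive trimming and quality filtering.

    Hypothetical helper: drops the last `trim` nucleotides of the read, then
    keeps only positions whose base quality is at least `min_qual`.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality arrays differ in length")
    end = max(len(seq) - trim, 0)  # exclude the last `trim` positions
    return [(i, seq[i]) for i in range(end) if quals[i] >= min_qual]


if __name__ == "__main__":
    # Six-base toy read; the last two bases are trimmed, offset 1 fails quality.
    print(usable_bases("ACGTAC", [35, 10, 30, 40, 40, 40]))
```

Only the surviving offsets would then be compared against the target SNP positions.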
We made no attempt to determine a diploid genotype at each SNP in each
sample. Instead, we used a single allele – randomly drawn from the two
alleles in the individual – to represent the individual at that
site (refs 20, 39). Specifically, we made an allele call at
each target SNP using majority rule over all sequences overlapping the SNP. When
each of the possible alleles was supported by an equal number of sequences, we
picked an allele at random. We set the allele to “no call” for
SNPs at which there was no read coverage.
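The calling rule above (majority rule, random choice on ties, "no call" without coverage) can be sketched as below. The function name `call_allele` and the decision to ignore bases that match neither known allele are assumptions of this example; the text does not say how such bases were handled.

```python
import random
from collections import Counter


def call_allele(observed, ref, alt, rng=random):
    """Return one allele for a SNP, or None for "no call".

    observed: bases from all sequences overlapping the SNP.
    ref, alt: the two known alleles at the SNP.
    Assumption: bases matching neither allele are ignored.
    """
    counts = Counter(b for b in observed if b in (ref, alt))
    n_ref, n_alt = counts[ref], counts[alt]
    if n_ref == 0 and n_alt == 0:
        return None  # no read coverage at this SNP
    if n_ref > n_alt:
        return ref
    if n_alt > n_ref:
        return alt
    return rng.choice((ref, alt))  # equal support: pick at random


if __name__ == "__main__":
    print(call_allele(["A", "A", "G"], "A", "G"))
    print(call_allele([], "A", "G"))
```

Passing a seeded `random.Random` instance as `rng` would make the tie-breaking reproducible.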
We restricted population genetic analysis to libraries with a minimum of
0.06-fold average coverage on the 390k SNP targets, and for which there was an
unambiguous sex determination based on the ratio of X to Y chromosome reads
(SI4) (Online Table 1).
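As a rough illustration of the coverage threshold, the sketch below assumes that "average coverage" means the mean number of sequences overlapping each target SNP. That definition, the function names, and the dictionary input are assumptions of the example; the sex-determination criterion is described in SI4 and is not reproduced here.

```python
def mean_target_coverage(depth_by_snp, n_targets):
    """Mean number of overlapping sequences per target SNP.

    depth_by_snp: hypothetical mapping from SNP id to sequence count;
    targets without any sequence may simply be absent from the mapping.
    """
    if n_targets <= 0:
        raise ValueError("n_targets must be positive")
    return sum(depth_by_snp.values()) / n_targets


def passes_coverage(depth_by_snp, n_targets, min_cov=0.06):
    """True if the library reaches the minimum average coverage."""
    return mean_target_coverage(depth_by_snp, n_targets) >= min_cov


if __name__ == "__main__":
    toy = {"snp1": 1, "snp2": 2}
    print(mean_target_coverage(toy, 10), passes_coverage(toy, 10))
```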
For individuals for whom there were multiple libraries per sample, we performed
a series of quality control analyses. First, we used the ADMIXTURE
software (refs 40, 41) in supervised mode, using Kharia, Onge,
Karitiana, Han, French, Mbuti, Ulchi and Eskimo as reference populations. We
visually inspected the inferred ancestry components in each individual, and
removed individuals with evidence of heterogeneity in inferred ancestry
components across libraries. For all possible pairs of libraries for each
sample, we also computed statistics of the form D(Library1, Library2; Probe, Mbuti),
where Probe is any of the same set of eight reference
populations, to determine whether there was significant evidence of the
Probe population being more closely related to one library
from an ancient individual than to another library from that same individual. None
of the individuals that we used had strong evidence of ancestry heterogeneity
across libraries. For samples passing quality control for which there were
multiple libraries per sample, we merged the sequences into a single BAM.
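For orientation, a D-statistic of the form D(W, X; Y, Z) can be sketched with one commonly used definition based on per-site allele frequencies. This is an illustrative assumption rather than the exact software or settings used here; the function name `d_statistic` and its input format are hypothetical, and the significance assessment (typically a block jackknife giving a Z-score) is omitted.

```python
def d_statistic(w, x, y, z):
    """D(W, X; Y, Z) from per-site frequencies of the same reference allele.

    w, x, y, z: equal-length sequences of frequencies in [0, 1]; for a
    library represented by a single allele call the value is 0 or 1.
    None marks a missing value, and such sites are skipped.
    """
    num = 0.0
    den = 0.0
    for wi, xi, yi, zi in zip(w, x, y, z):
        if None in (wi, xi, yi, zi):
            continue
        num += (wi - xi) * (yi - zi)
        den += (wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
    # Undefined when no site is informative (e.g. W and X never differ).
    return num / den if den else float("nan")


if __name__ == "__main__":
    lib1 = [1, 0, 1, None]
    lib2 = [0, 0, 1, 1]
    probe = [1.0, 0.5, 0.2, 0.9]
    mbuti = [0.0, 0.5, 0.2, 0.1]
    print(d_statistic(lib1, lib2, probe, mbuti))
```

Under this definition, values near zero for every Probe are what one would expect if the two libraries derive from the same individual.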
We called alleles on each merged BAM using the same procedure as for the
individual libraries. We used ADMIXTURE (ref. 41) as well as PCA as implemented in EIGENSOFT (ref. 42) (using the lsqproject: YES
option to project the ancient samples) to visualize the genetic
relationships of each set of samples with the same culture label with respect to
777 diverse present-day West Eurasians (ref. 4). We visually identified outlier individuals, and renamed
them for analysis either as outliers or by the name of the site at which they
were sampled (Extended Data Table 1). We
also identified two pairs of related individuals based on the proportion of
sites covered in pairs of ancient samples from the same population that had
identical allele calls using PLINK (ref. 43). From each pair of related individuals, we kept the one
with the most SNPs.
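The pairwise comparison above can be illustrated in plain Python. This is only a sketch of the idea, not the PLINK procedure itself; the function names and the list-of-calls representation are assumptions, and no relatedness threshold is given in the text, so none is encoded here.

```python
def match_rate(calls_a, calls_b):
    """Proportion of identical allele calls at sites covered in both samples.

    calls_a, calls_b: per-SNP allele calls in the same SNP order, with None
    for "no call". Returns None if the samples share no covered site.
    """
    shared = [(a, b) for a, b in zip(calls_a, calls_b)
              if a is not None and b is not None]
    if not shared:
        return None
    return sum(a == b for a, b in shared) / len(shared)


def keep_from_pair(name_a, calls_a, name_b, calls_b):
    """Of two related individuals, return the name of the one with more SNPs called."""
    n_a = sum(c is not None for c in calls_a)
    n_b = sum(c is not None for c in calls_b)
    return name_a if n_a >= n_b else name_b


if __name__ == "__main__":
    a = ["A", None, "G", "T"]
    b = ["A", "C", "A", "T"]
    print(match_rate(a, b), keep_from_pair("ind_a", a, "ind_b", b))
```

Pairs whose match rate stands out from the rest of the same population would be the candidates for relatedness.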