Two EHH-derived statistics, the intra-population Integrated Haplotype Score (iHS) [26] and the inter-population Rsb [27], were applied using the rehh package [28] for R. In the iHS analysis, the natural log of the ratio between the integrated EHH for the ancestral ($iHH_A$) and derived ($iHH_D$) alleles was calculated for each genotyped SNP with MAF ≥ 0.5% in EASZ. As the standardised iHS values are normally distributed (Supplementary Fig. S1), a two-tailed Z-test was applied to identify statistically significant SNPs under selection, carrying an unusually extended haplotype of either the ancestral allele (positive iHS value) or the derived allele (negative iHS value). Two-sided P-values were expressed on a $-\log_{10}$ scale as $-\log_{10}(1 - 2|\Phi(iHS) - 0.5|)$, where $\Phi(iHS)$ is the Gaussian cumulative distribution function evaluated at the standardised iHS value. The ancestral and derived alleles of each SNP were inferred in two ways: (i) the ancestral allele was inferred as the most common allele within a dataset of 13 Bovinae species [29]; (ii) for SNPs with no information available in Decker et al. [29], the ancestral allele was inferred as the most common allele in the complete dataset (EASZ and reference populations), consistent with the observation in humans that the SNP allele with the higher frequency is likely to be the ancestral allele [30].
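The iHS log-ratio and its two-sided P-value transform can be illustrated with a minimal Python sketch. This is not the rehh implementation: the function names are hypothetical, and the input to `ihs_neg_log10_p` is assumed to be an already standardised iHS value (the standardisation step itself is not reproduced here).

```python
import math


def unstandardised_ihs(ihh_ancestral, ihh_derived):
    """Natural log of the ratio iHH_A / iHH_D for one SNP."""
    return math.log(ihh_ancestral / ihh_derived)


def ihs_neg_log10_p(ihs):
    """-log10 of the two-sided P-value for a standardised iHS value.

    Uses the identity 1 - 2*|Phi(x) - 0.5| = erfc(|x| / sqrt(2)), which
    avoids the precision loss of computing 1 - Phi(x) in the far tail.
    """
    p = math.erfc(abs(ihs) / math.sqrt(2.0))
    # erfc underflows to 0.0 for extremely large |iHS|; report infinity then.
    return math.inf if p == 0.0 else -math.log10(p)
```

Because the test is two-tailed, positive and negative iHS values of the same magnitude give the same score; a standardised iHS of roughly ±3.89 corresponds to a score of 4.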
Inter-population Rsb analyses were conducted between EASZ and each continental reference population (European (Holstein-Friesian and Jersey), African (N’Dama) and Asian (Nellore)), as well as with all the reference populations combined. The integrated EHHS (site-specific EHH) for each SNP in each population (iES) was calculated, and the Rsb statistic between two populations was defined as the natural log of the ratio between $iES_{pop1}$ and $iES_{pop2}$. As the standardised Rsb values are normally distributed (Supplementary Fig. S1), a Z-test was applied to identify statistically significant SNPs under selection in EASZ (positive Rsb value). One-sided P-values were expressed on a $-\log_{10}$ scale as $-\log_{10}(1 - \Phi(Rsb))$, where $\Phi(Rsb)$ is the Gaussian cumulative distribution function evaluated at the standardised Rsb value. A Z-test was not applied to BTA X Rsb values because of their non-normal distribution (Shapiro-Wilk test; P-value < $2.2 \times 10^{-16}$, Supplementary Fig. S1). For both iHS and Rsb, $-\log_{10}(P\text{-value}) = 4$, equivalent to a P-value of 0.0001, was used as the threshold to define significant values. Candidate regions were retained if two SNPs separated by ≤1 Mb passed this threshold. For the Rsb analysis, the combined-reference analysis was used to define the candidate regions. A distance of 0.5 Mb in both directions from the most significant SNP within each iHS and Rsb candidate region was used to define the candidate genome region interval. This distance was chosen based on the rate of change in the mean pairwise linkage disequilibrium statistic ($r^2$), calculated with the r2fast function of the GenABEL package and binned over distance across the EASZ autosomes (Supplementary Fig. S2); at larger distances, $r^2$ reaches a plateau. This extent of LD has been confirmed in eight cattle breeds (taurine and zebu) in a previous study [31].
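The one-sided Rsb transform and the candidate-region rules can be sketched in Python as follows. The text does not say how significant SNPs are grouped, so chaining consecutive significant SNPs that lie within 1 Mb of each other is an assumption of this sketch; the function names and the `(position_bp, score)` input layout are also hypothetical.

```python
import math

THRESHOLD = 4.0          # -log10(P-value) cut-off from the text
MAX_GAP_BP = 1_000_000   # two significant SNPs must lie within 1 Mb
FLANK_BP = 500_000       # 0.5 Mb on each side of the top SNP


def rsb_neg_log10_p(rsb):
    """-log10 of the one-sided P-value 1 - Phi(Rsb) for a standardised Rsb."""
    # 1 - Phi(x) = 0.5 * erfc(x / sqrt(2)); erfc stays accurate in the tail.
    p = 0.5 * math.erfc(rsb / math.sqrt(2.0))
    return math.inf if p == 0.0 else -math.log10(p)


def candidate_intervals(snps):
    """Return (start, end) intervals for one chromosome.

    snps: iterable of (position_bp, neg_log10_p) tuples.
    Assumption: consecutive significant SNPs no more than MAX_GAP_BP apart
    are chained into one cluster; clusters with a single SNP are dropped.
    """
    hits = sorted((pos, score) for pos, score in snps if score >= THRESHOLD)
    clusters = []
    for hit in hits:
        if clusters and hit[0] - clusters[-1][-1][0] <= MAX_GAP_BP:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])

    intervals = []
    for cluster in clusters:
        if len(cluster) < 2:
            continue
        peak_pos, _ = max(cluster, key=lambda h: h[1])
        intervals.append((max(0, peak_pos - FLANK_BP), peak_pos + FLANK_BP))
    return intervals
```

Only large positive Rsb values score highly under the one-sided transform, which matches the aim of detecting selection in EASZ rather than in the reference populations.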
As a prerequisite for these two statistics, haplotypes were reconstructed by phasing the genotyped SNPs with fastPHASE version 1.4 [32], using the settings K10 and T10, as in Utsunomiya et al. [33], to reduce computation time. Population label information was used to estimate the population background of the phased haplotypes.
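A phasing run with these settings could be assembled as in the Python sketch below. The flag spellings (`-T` for the number of random starts, `-K` for the number of haplotype clusters, `-u` for a subpopulation-label file, `-o` for the output prefix) follow our reading of the fastPHASE manual and should be checked against the installed version; the helper name, file names and executable name are hypothetical, and the exact command line used in the study is not given in the text.

```python
def build_fastphase_command(input_path, output_prefix, labels_path,
                            n_clusters=10, n_starts=10,
                            executable="fastPHASE"):
    """Build an argument list for one fastPHASE run (K10, T10 by default).

    All flag spellings are assumptions to verify against the fastPHASE
    manual; nothing is executed here.
    """
    return [
        executable,
        f"-T{n_starts}",       # number of random starts of the EM algorithm
        f"-K{n_clusters}",     # number of haplotype clusters
        f"-u{labels_path}",    # population labels, one per individual
        f"-o{output_prefix}",  # prefix for output files
        input_path,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` once per chromosome; keeping the command as a list avoids shell-quoting problems with file paths.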
Bahbahani H., Clifford H., Wragg D., Mbole-Kariuki M.N., Van Tassell C., Sonstegard T., Woolhouse M., & Hanotte O. (2015). Signatures of positive selection in East African Shorthorn Zebu: A genome-wide single nucleotide polymorphism analysis. Scientific Reports, 5, 11729.