Let X be an n × p genetic data matrix with n individuals in rows and p relative allele frequencies in columns. For example, at a locus with three alleles (A1, A2, A3), a homozygous genotype A1/A1 is coded as [1, 0, 0], while a heterozygous genotype A2/A3 is coded as [0, 0.5, 0.5]. We denote by $X_j$ the jth allele column of X. Missing data are replaced with the mean frequency of the corresponding allele, which avoids adding artefactual between-group differentiation. Without loss of generality, we assume that each column of X is centred to mean zero. Classical (linear) discriminant analysis seeks linear combinations of alleles of the form:
$$Xv = \sum_{j=1}^{p} v_j X_j \qquad (5)$$
($v = [v_1, \ldots, v_p]^T$ being a vector of p allele loadings, known as 'discriminant coefficients'), that show the separation between groups as well as possible, as measured by the F statistic (Equation 3). That is, the aim of DA is to choose v so that F(Xv) is maximal.
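The coding and preprocessing steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names `encode_locus` and `impute_and_centre` are hypothetical, individuals are assumed to carry equal weights, and the toy genotypes are made up.

```python
import numpy as np

def encode_locus(genotypes, alleles):
    """Code genotypes at one locus as relative allele frequencies.

    genotypes: list of allele tuples, or None for a missing genotype.
    alleles:   ordered list of allele names at this locus.
    """
    index = {a: k for k, a in enumerate(alleles)}
    X = np.full((len(genotypes), len(alleles)), np.nan)
    for i, g in enumerate(genotypes):
        if g is None:          # missing genotype: keep the row as NaN for now
            continue
        X[i] = 0.0
        for a in g:            # each allele copy contributes 1 / ploidy
            X[i, index[a]] += 1.0 / len(g)
    return X

def impute_and_centre(X):
    """Replace missing values by the column mean, then centre each column."""
    col_means = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_means, X)
    return X - X.mean(axis=0)

# Toy data (assumed): four individuals, one locus with alleles A1, A2, A3.
genotypes = [("A1", "A1"), ("A2", "A3"), None, ("A1", "A2")]
X_raw = encode_locus(genotypes, ["A1", "A2", "A3"])
X = impute_and_centre(X_raw)
```

Because the missing genotype is replaced by the column means, its row is exactly zero after centring, so it does not pull the individual towards any group.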
Linear combinations of alleles (Equation 5) optimizing this criterion are called principal components; in the case of discriminant analysis, they are also called discriminant functions. Discriminant functions are found by the eigenanalysis of the D-symmetric matrix [51]:
where P is the previously defined projector onto the dummy vectors of H, and W is the matrix of within-group covariances, computed as:
$$W = (X - PX)^T D (X - PX)$$
This solution requires W to be invertible, which is not the case when the number of alleles p is greater than the number of individuals n. Moreover, this inverse is numerically unstable ('ill-conditioned') whenever variables are correlated, which is always the case in allele frequencies and can be worsened by the presence of linkage disequilibrium.
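The invertibility problem is easy to reproduce numerically. The sketch below uses simulated data and assumes uniform individual weights and three hypothetical groups of equal size; it only illustrates that W loses rank when there are more alleles than individuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 12, 40, 3                        # more alleles than individuals
groups = np.repeat(np.arange(k), n // k)   # hypothetical group labels
X = rng.random((n, p))
X -= X.mean(axis=0)

# Residuals from the group means (uniform individual weights assumed).
R = X.copy()
for g in range(k):
    R[groups == g] -= X[groups == g].mean(axis=0)
W = R.T @ R / n                            # p x p within-group covariance

# The residuals span at most n - k dimensions, so W cannot have full rank p.
rank = np.linalg.matrix_rank(W)
print(rank, "<", p)
```

Since the residuals lie in a space of dimension at most n − k, W is singular whenever p exceeds n − k, which is a slightly stronger condition than p > n.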
To circumvent this issue, DAPC uses a data transformation based on PCA prior to DA. Rather than analyzing X directly, we first compute the principal components of PCA, XU, satisfying:
$$X^T D X U = U \Lambda$$
where U is a p × r matrix of eigenvectors (in columns) of $X^T D X$, and Λ is the diagonal matrix of the corresponding non-null eigenvalues. Note that when the number of alleles (p) is larger than the number of individuals (n), we can instead perform the eigenanalysis of $X X^T D$ to obtain U and Λ [55], which can save considerable computational time. By definition, the number of principal components (r) cannot exceed the number of individuals or alleles (r ≤ min(n, p)), which solves the issue relating to the number of variables used in DA. Moreover, principal components are, by construction, uncorrelated, which solves the other issue pertaining to the presence of collinearity among allele frequencies.
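The equivalence between the two eigenanalyses can be checked on a small simulated example. The sketch assumes uniform weights, D = I/n, which makes the n × n matrix symmetric so that `numpy.linalg.eigh` applies; with general weights a different solver or a rescaling by the square root of D would be needed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 30
X = rng.random((n, p))
X -= X.mean(axis=0)
D = np.eye(n) / n                          # uniform weights: an assumption

# Primal problem: p x p matrix X^T D X.
lam_primal, U_primal = np.linalg.eigh(X.T @ D @ X)

# Dual problem: n x n matrix X X^T D (symmetric here because D = I/n).
lam_dual, A = np.linalg.eigh(X @ X.T @ D)

# Keep the non-null eigenvalues, largest first.
keep = lam_dual > 1e-10
lam = lam_dual[keep][::-1]
A = A[:, keep][:, ::-1]

# Map dual eigenvectors back to allele space and normalise the columns.
U = X.T @ D @ A
U /= np.linalg.norm(U, axis=0)

r = lam.size                               # number of principal components
```

The dual route only ever decomposes an n × n matrix, which is where the saving comes from when p is much larger than n. Centring removes one dimension, so r is n − 1 here.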
DA is then performed on the matrix of principal components. At this step, less informative principal components may be discarded, although this is not mandatory. Replacing X with XU in Equation 6, the solution of DAPC is given by the eigenanalysis of the D-symmetric matrix:
The first eigenvector obtained, v, maximizes b(XUv) under the constraint that w(XUv) = 1, which amounts to maximizing the F-statistic of XUv. This maximum is attained for the eigenvalue γ associated with v (i.e., F(XUv) = γ). In other words, the loadings stored in the vector v can be used to compute the linear combinations of principal components of PCA (XU) which best discriminate the populations in the sense of the F-statistic.
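The two-step procedure can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: individuals have uniform weights, F is taken as the ratio b/w of between-group to within-group variance, the function name `dapc` and the argument `n_pca` are hypothetical, and the discriminant step is solved through the symmetric matrix $W^{-1/2} B W^{-1/2}$ instead of the D-symmetric matrix of the text. Fewer principal components than n minus the number of groups are retained so that the within-group covariance of the components stays invertible.

```python
import numpy as np

def dapc(X, groups, n_pca):
    """Sketch of DAPC with uniform individual weights (an assumption).

    X: centred n x p matrix; groups: length-n integer labels;
    n_pca: number of principal components retained before DA.
    Returns allele loadings (p x d) and eigenvalues gamma (d,).
    """
    n = X.shape[0]
    labels = np.unique(groups)

    # Step 1: PCA. With D = I/n the right singular vectors of X are the
    # eigenvectors of X^T D X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U = Vt[:n_pca].T                       # p x n_pca
    Z = X @ U                              # principal components XU

    # Step 2: within- and between-group covariances of the components.
    means = np.vstack([Z[groups == g].mean(axis=0) for g in labels])
    fitted = means[np.searchsorted(labels, groups)]   # group mean per row
    W = (Z - fitted).T @ (Z - fitted) / n
    B = fitted.T @ fitted / n              # Z is centred, so no offset

    # Step 3: solve B v = gamma W v through W^{-1/2} B W^{-1/2};
    # the resulting vectors satisfy v^T W v = 1.
    w_val, w_vec = np.linalg.eigh(W)
    W_inv_sqrt = w_vec @ np.diag(w_val ** -0.5) @ w_vec.T
    gamma, Q = np.linalg.eigh(W_inv_sqrt @ B @ W_inv_sqrt)
    order = np.argsort(gamma)[::-1][: len(labels) - 1]
    V = W_inv_sqrt @ Q[:, order]
    return U @ V, gamma[order]

# Simulated data (assumed): three groups with shifted allele frequencies.
rng = np.random.default_rng(2)
n, p, k = 60, 200, 3
groups = np.repeat(np.arange(k), n // k)
X = rng.random((n, p)) + 0.3 * rng.random((k, p))[groups]
X -= X.mean(axis=0)

loadings, gamma = dapc(X, groups, n_pca=10)
f = X @ loadings[:, 0]                     # first discriminant function
```

With this normalisation the within-group variance of `f` is one and its between-group variance equals `gamma[0]`, which is the property F(XUv) = γ stated above.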
Note, however, that these linear combinations of principal components ((XU)v) can also be interpreted as linear combinations of alleles (X(Uv)), in which the allele loadings are the entries of the vector Uv. This has the advantage of allowing one to quantify the contribution of a given allele to a particular structure. Denoting by $z_j$ the loading of the jth allele (j = 1,...,p) for the discriminant function XUv, the contribution of this allele can be computed as:
$$\frac{z_j^2}{\sum_{k=1}^{p} z_k^2}$$
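As a minimal sketch, the contribution of an allele is taken here to be its squared loading as a share of the sum of squared loadings. The function name `allele_contributions` is hypothetical and the numbers are toy values chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def allele_contributions(U, v):
    """Contribution of each allele to the discriminant function X(Uv).

    Assumes contributions are squared loadings rescaled to sum to one.
    """
    z = U @ v                  # allele loadings, one per allele
    return z ** 2 / np.sum(z ** 2)

# Toy values (assumed): 4 alleles, 2 retained principal components.
U = np.array([[0.5,  0.5],
              [0.5, -0.5],
              [0.5,  0.5],
              [0.5, -0.5]])
v = np.array([2.0, 1.0])
contrib = allele_contributions(U, v)
```

Here the loadings are z = [1.5, 0.5, 1.5, 0.5], so the first and third alleles each account for 45% of the discriminant function and the other two for 5% each.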
Jombart, T., Devillard, S., & Balloux, F. (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics, 11, 94.