Discriminant Analysis of Principal Components

Let X be an n × p genetic data matrix with n individuals in rows and p relative allele frequencies in columns. For example, for a locus with three alleles (A₁, A₂, A₃), a homozygote genotype A₁/A₁ is coded as [1, 0, 0], while a heterozygote A₂/A₃ is coded as [0, 0.5, 0.5]. We denote by X^j the j^th allele-column of X. Missing data are replaced with the mean frequency of the corresponding allele, which avoids adding artefactual between-group differentiation. Without loss of generality, we assume that each column of X is centred to mean zero. Classical (linear) discriminant analysis seeks linear combinations of alleles of the form:

f(v) = \sum_{j=1}^{p} X^{j} v_{j} = X v

(v = [v₁...v_p]^T being a vector of p allele loadings, known as 'discriminant coefficients'), showing as well as possible the separation between groups as measured by the F statistic (Equation 3). That is, the aim of DA is to choose v so that F(Xv) is maximum.
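The coding, imputation and centring steps described above, and the linear combination f(v) = Xv, can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `code_locus` and `impute_and_centre`, the use of allele indices 0, 1, 2 for A₁, A₂, A₃, and the loading vector `v` are all assumptions made for the example.

```python
import numpy as np

def code_locus(genotypes, n_alleles):
    """Code genotypes at one locus as relative allele frequencies.

    Each genotype is a tuple of allele indices, or None when missing.
    (Hypothetical helper, named for this example only.)
    """
    X = np.full((len(genotypes), n_alleles), np.nan)
    for i, genotype in enumerate(genotypes):
        if genotype is None:
            continue  # leave the row as NaN for later imputation
        X[i] = 0.0
        for allele in genotype:
            X[i, allele] += 1.0 / len(genotype)
    return X

def impute_and_centre(X):
    """Replace missing entries by the column mean, then centre each column."""
    col_means = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_means, X)
    return X - X.mean(axis=0)

# Hypothetical locus with three alleles; the third individual is missing.
genotypes = [(0, 0), (1, 2), None, (0, 1)]
X_raw = code_locus(genotypes, n_alleles=3)  # rows [1, 0, 0], [0, 0.5, 0.5], ...
X = impute_and_centre(X_raw)

# An arbitrary (not optimised) loading vector, only to illustrate f(v) = X v.
v = np.array([1.0, -0.5, 0.25])
f = X @ v
```

Because the missing individual is given the column means, its centred row is zero, so it does not contribute to any difference between groups.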
Linear combinations of alleles (Equation 5) optimizing this criterion are called principal components, which in the case of discriminant analysis are also called discriminant functions. Discriminant functions are found by the eigenanalysis of the D-symmetric matrix [51]:

P X W^{-1} X^{T} P^{T} D

where P is the previously defined projector onto the dummy vectors of H, and W is the matrix of covariances within groups, computed as:

W = X^{T} (I - P)^{T} D (I - P) X

This solution requires W to be invertible, which is not the case when the number of alleles p is greater than the number of individuals n. Moreover, the inverse is numerically unstable ('ill-conditioned') whenever variables are correlated, which is always the case for allele frequencies and can be worsened by linkage disequilibrium.
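The rank deficiency of W when p > n can be checked numerically. The sketch below rests on several assumptions that are not stated in this section: D is taken to be the diagonal matrix of uniform individual weights (1/n), H is taken to be the n × k matrix of group indicator ('dummy') vectors, and P is built as the D-orthogonal projector H(H^T D H)^{-1} H^T D. The function names and the simulated data are hypothetical.

```python
import numpy as np

def group_projector(groups, D):
    """Projector onto the group dummy vectors (assumed form, see lead-in)."""
    groups = np.asarray(groups)
    labels = np.unique(groups)
    H = (groups[:, None] == labels[None, :]).astype(float)  # n x k indicators
    return H @ np.linalg.inv(H.T @ D @ H) @ H.T @ D

def within_cov(X, P, D):
    """W = X^T (I - P)^T D (I - P) X, computed from within-group residuals."""
    R = X - P @ X  # (I - P) X
    return R.T @ D @ R

# Simulated data with more variables than individuals (p > n).
rng = np.random.default_rng(0)
n, p = 6, 10
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
groups = [0, 0, 0, 1, 1, 1]

D = np.eye(n) / n  # uniform weights, an assumption
P = group_projector(groups, D)
W = within_cov(X, P, D)

rank_W = np.linalg.matrix_rank(W)
print(f"W is {W.shape[0]} x {W.shape[1]} with rank {rank_W}")
```

With two groups of three individuals, the residuals span at most n − 2 = 4 dimensions, so the 10 × 10 matrix W cannot be inverted.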
To circumvent this issue, DAPC uses a data transformation based on PCA prior to DA. Rather than analyzing X directly, we first compute the principal components of PCA, XU, where U satisfies:

X^{T} D X U = U Λ

where U is a p × r matrix of eigenvectors (in columns) of X^TDX, and Λ the diagonal matrix of corresponding non-null eigenvalues. Note that when the number of alleles (p) is larger than the number of individuals (n), we can alternatively proceed to the eigenanalysis of XX^TD to obtain U and Λ [55], which can save considerable computational time. By definition, the number of principal components (r) cannot exceed the number of individuals or alleles (r ≤ min(n, p)), which solves the issue relating to the number of variables used in DA. Moreover, principal components are, by construction, uncorrelated, which solves the other issue pertaining to the presence of collinearity among allele frequencies.
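Both routes to U and Λ can be compared on simulated data. The sketch below assumes that D is a diagonal matrix of positive individual weights (uniform here). For the route through the n × n matrix, it diagonalises the symmetric matrix D^{1/2} X X^T D^{1/2}, which has the same non-null eigenvalues as XX^TD, and recovers U as X^T D^{1/2} B Λ^{-1/2}, where B holds the eigenvectors; this symmetrisation is a choice made for the example, not a step described in the source. The function names and the tolerance `tol` are hypothetical.

```python
import numpy as np

def pca_primal(X, D, tol=1e-10):
    """Eigenanalysis of the p x p matrix X^T D X, keeping non-null eigenvalues."""
    evals, evecs = np.linalg.eigh(X.T @ D @ X)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > tol * evals[0]
    return evecs[:, keep], evals[keep]

def pca_dual(X, D, tol=1e-10):
    """Same U and eigenvalues from an n x n eigenanalysis (useful when p > n)."""
    Y = np.sqrt(np.diag(D))[:, None] * X  # D^{1/2} X
    evals, B = np.linalg.eigh(Y @ Y.T)    # symmetric form of X X^T D
    order = np.argsort(evals)[::-1]
    evals, B = evals[order], B[:, order]
    keep = evals > tol * evals[0]
    evals, B = evals[keep], B[:, keep]
    U = Y.T @ B / np.sqrt(evals)          # columns have unit norm
    return U, evals

# Simulated centred data with p > n.
rng = np.random.default_rng(0)
n, p = 8, 30
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
D = np.eye(n) / n  # uniform weights, an assumption

U_primal, lam_primal = pca_primal(X, D)
U_dual, lam_dual = pca_dual(X, D)
scores = X @ U_dual  # principal components XU
```

In this example centring removes one dimension, so r = n − 1 = 7 components are retained, and `scores.T @ D @ scores` is diagonal, which is the sense in which the principal components are uncorrelated.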
DA is then performed on the matrix of principal components. At this step, less-informative principal components may be discarded, although this is not mandatory. Replacing X with XU in Equation 6, the solution of DAPC is given by the eigenanalysis of the D-symmetric matrix:

P X U (U^{T} W U)^{-1} U^{T} X^{T} P^{T} D

The first obtained eigenvector v maximizes b(XUv) under the constraint that w(XUv) = 1, which amounts to maximizing the F-statistic of XUv. This maximum is attained for the eigenvalue γ associated with v (i.e., F(XUv) = γ). In other words, the loadings stored in the vector v can be used to compute the linear combinations of principal components of PCA (XU) which best discriminate the populations in the sense of the F-statistic.
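The property F(XUv) = γ can be verified on simulated data. The sketch below makes several assumptions: uniform weights D = I/n, a projector P built from group indicator vectors as H(H^T D H)^{-1} H^T D, and F taken as the ratio b/w of between-group to within-group variance (Equation 3 is not reproduced in this section). Instead of diagonalising the n × n matrix above, it solves the r × r generalized eigenproblem Bv = γ(U^T W U)v through a Cholesky factor, which yields the same non-null eigenvalues; this is a substitution made for convenience. All function names and the simulated data are hypothetical.

```python
import numpy as np

def group_projector(groups, D):
    """Projector onto the group dummy vectors (assumed form, see lead-in)."""
    groups = np.asarray(groups)
    H = (groups[:, None] == np.unique(groups)[None, :]).astype(float)
    return H @ np.linalg.inv(H.T @ D @ H) @ H.T @ D

def f_statistic(f, P, D):
    """Ratio of between-group to within-group variance of f (assumed F)."""
    fitted = P @ f
    resid = f - fitted
    return (fitted @ D @ fitted) / (resid @ D @ resid)

def dapc(X, groups, n_pc):
    """PCA on centred X, then DA on the first n_pc principal components."""
    n = X.shape[0]
    D = np.eye(n) / n  # uniform weights, an assumption
    P = group_projector(groups, D)
    evals, evecs = np.linalg.eigh(X.T @ D @ X)
    U = evecs[:, np.argsort(evals)[::-1][:n_pc]]
    Z = X @ U                          # principal components XU
    fitted = P @ Z
    resid = Z - fitted
    B = fitted.T @ D @ fitted          # between-group covariance of XU
    W = resid.T @ D @ resid            # U^T W U in the notation of the text
    L_inv = np.linalg.inv(np.linalg.cholesky(W))
    gamma, Q = np.linalg.eigh(L_inv @ B @ L_inv.T)
    order = np.argsort(gamma)[::-1]
    V = L_inv.T @ Q[:, order]          # columns v scaled so that w(XUv) = 1
    return U, V, gamma[order], P, D

# Simulated data: three groups of 15 individuals, 12 variables.
rng = np.random.default_rng(1)
groups = np.repeat([0, 1, 2], 15)
centres = rng.normal(scale=2.0, size=(3, 12))
X = centres[groups] + rng.normal(size=(45, 12))
X -= X.mean(axis=0)

# n_pc is kept well below n minus the number of groups so that the
# within-group covariance of the retained components stays invertible.
U, V, gamma, P, D = dapc(X, groups, n_pc=5)
f = X @ U @ V[:, 0]                    # first discriminant function
print(f_statistic(f, P, D), gamma[0])  # the two values should agree
```

With three groups, at most two eigenvalues are non-null, so only the first two columns of `V` define discriminant functions in this example.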
However, these linear combinations of principal components ((XU)v) can also be interpreted as linear combinations of alleles (X(Uv)), in which the allele loadings are the entries of the vector Uv. This allows one to quantify the contribution of a given allele to a particular structure. Denoting by z_j the loading of the j^th allele (j = 1,...,p) for the discriminant function XUv, the contribution of this allele can be computed as:

\frac{z_{j}^{2}}{\sum_{j = 1}^{p} z_{j}^{2}}
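As a small worked example of this formula, the sketch below computes the allele loadings z = Uv and their contributions. The function name `allele_contributions` and the toy values of U and v are assumptions chosen so that the arithmetic is easy to follow.

```python
import numpy as np

def allele_contributions(U, v):
    """Contribution of each allele: z_j^2 / sum_j z_j^2, with z = U v."""
    z = U @ v
    return z**2 / np.sum(z**2)

# Toy example: p = 3 alleles, r = 2 retained principal axes.
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
v = np.array([3.0, 4.0])

contributions = allele_contributions(U, v)
# Here z = [3, 4, 0], so the contributions are 9/25, 16/25 and 0.
```

The contributions are non-negative and sum to one, so they can be read as the share of each allele in the discriminant function.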

Jombart, T., Devillard, S., & Balloux, F. (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics, 11, 94.

Publication 2010

Other organizations : Imperial College London, Université Claude Bernard Lyon 1, Laboratoire de Biométrie et Biologie Evolutive

Variable analysis

independent variables

X
Allele frequencies

dependent variables

Linear combinations of alleles (Equation 5)
Discriminant functions
F statistic (Equation 3)

control variables

Missing data are replaced with the mean frequency of the corresponding allele
Each column of X is centred to mean zero
