Genotype probabilities for repeats of size up to the read length are calculated using a model similar to the one used for SNPs (Li et al. 2009). Namely, P(G|R) = P(R|G) · P(G)/P(R), where the genotype G is a tuple of repeat sizes with the number of entries equal to the ploidy of the chromosome containing the repeat. The probability P(R|G) is expressed in terms of the probabilities P(r_i | H_i) for individual reads r_i and repeat alleles H_i, as described in Li et al. (2009).
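The Bayes step above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a uniform prior P(G) unless one is supplied, and it assumes that each read is drawn from the alleles of a genotype with equal probability, so that P(R|G) is a product over reads of the average of P(r | H) over the alleles H in G. The function name genotype_posteriors and the read_likelihood callback are hypothetical.

```python
from itertools import combinations_with_replacement
from math import exp, log


def genotype_posteriors(reads, allele_sizes, read_likelihood, ploidy=2, prior=None):
    """Return {genotype: P(G|R)} over all unordered genotypes of the given ploidy.

    read_likelihood(read, allele_size) must return P(r | H); it is supplied by
    the caller. Assumption: reads are independent and each read comes from any
    allele of the genotype with probability 1/ploidy.
    """
    genotypes = list(combinations_with_replacement(sorted(allele_sizes), ploidy))
    log_joint = {}
    for genotype in genotypes:
        # log P(G); a uniform prior contributes a constant, so 0.0 is used.
        total = log(prior(genotype)) if prior is not None else 0.0
        for read in reads:
            mixture = sum(read_likelihood(read, h) for h in genotype) / ploidy
            total += log(mixture) if mixture > 0 else float("-inf")
        log_joint[genotype] = total
    # Dividing by P(R) amounts to normalizing over all candidate genotypes.
    # At least one genotype must have a nonzero likelihood.
    peak = max(log_joint.values())
    unnormalized = {g: exp(v - peak) for g, v in log_joint.items()}
    norm = sum(unnormalized.values())
    return {g: v / norm for g, v in unnormalized.items()}
```

Working in log space keeps the product over many reads from underflowing. For a haploid chromosome, ploidy=1 reduces each genotype to a single repeat size.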
If r_i is a spanning read containing m repeat units, P(r_i | H_i = n) = π · f(m | p, n, s), where π is defined as above (“Repeat size estimation from IRRs”). The frequency function f is defined by f(m | p, n, s) ∼ p(1 − p)^d, where m, n, and s are non-negative integers bounded by u, the maximum number of repeat units in a read; p ∈ (0, 1) corresponds to the proportion of molecules with a repeat of the expected size; and d = |n − m| if |n − m| < s and d = s otherwise. Note that f is defined similarly to the geometric frequency function, with the parameter d representing the deviation from the expected repeat size n, capped at s. If r_i is a flanking or in-repeat read containing m repeat units, P(r_i | H_i = n) = π · Σ_{k=m}^{u} f(k | p, n, s). In all our analyses, the parameters p and s were set to 0.97 and 5, respectively. The values were chosen to maximize Mendelian consistency of genotype calls in Platinum Genomes pedigree samples (Eberle et al. 2017) on an unrelated set of repeats.
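The two read likelihoods can be sketched as follows. This is an illustration under stated assumptions rather than the published code: the text gives f only up to proportionality, so the sketch normalizes it to sum to 1 over m = 0, ..., u, and π is taken as an input because it is defined in another section. The function names are hypothetical.

```python
def deviation(n, m, s):
    """d = |n - m| if |n - m| < s, and d = s otherwise."""
    return min(abs(n - m), s)


def frequency(m, p, n, s, u):
    """f(m | p, n, s), proportional to p * (1 - p) ** d.

    Assumption: normalized over m = 0..u so the values sum to 1.
    """
    weights = [p * (1 - p) ** deviation(n, k, s) for k in range(u + 1)]
    return weights[m] / sum(weights)


def spanning_read_likelihood(m, n, pi, u, p=0.97, s=5):
    """P(r | H = n) for a spanning read containing exactly m repeat units."""
    return pi * frequency(m, p, n, s, u)


def flanking_or_irr_likelihood(m, n, pi, u, p=0.97, s=5):
    """P(r | H = n) for a flanking or in-repeat read containing m repeat units.

    Such a read shows at least m units, so f is summed from m up to u.
    """
    return pi * sum(frequency(k, p, n, s, u) for k in range(m, u + 1))
```

With p = 0.97, nearly all of the probability mass sits at m = n, and every size at least s units away from n receives the same small weight, which is what keeps a few outlying reads from dominating a genotype call.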
We use read-length-sized repeats as a stand-in for repeats longer than the read length. If only one allele is expanded, we estimate the full size of the repeat as described above. If both alleles are expanded, the size intervals are estimated similarly by assuming that between 0 and 50% of in-repeat reads come from the short allele and between 50% and 100% of in-repeat reads come from the long allele.
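The interval construction for two expanded alleles can be sketched as below. The size estimator itself is described in another section and is not reproduced here, so the sketch takes it as a caller-supplied function; size_from_irr_count and irr_split_intervals are hypothetical names.

```python
def irr_split_intervals(num_irrs, size_from_irr_count):
    """Return ((short_lo, short_hi), (long_lo, long_hi)) repeat-size intervals.

    Assumption: size_from_irr_count(count) maps a number of in-repeat reads
    attributed to one allele to an estimated repeat size for that allele.
    The short allele is assigned 0% to 50% of the in-repeat reads and the
    long allele 50% to 100%.
    """
    half = num_irrs / 2
    short_interval = (size_from_irr_count(0), size_from_irr_count(half))
    long_interval = (size_from_irr_count(half), size_from_irr_count(num_irrs))
    return short_interval, long_interval
```

The two intervals meet at the 50% split, which corresponds to both alleles having the same size.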