Because allele sequences may be only partially available (e.g., exons
only), HISAT-genotype first identifies two alleles based on the sequences
that are available for all alleles, e.g., exons. For example, the IMGT/HLA
database includes many sequences for some key exons of HLA genes, but it
contains far fewer complete sequences comprising all exons, introns, and UTRs of
the genes. So far 3,644 alleles have been classified for HLA-A. Although all
alleles of HLA-A have known sequences for exons 2 and 3, only 383 alleles have
full-length sequences available. The sequences for the remaining 3,261 alleles
include either all 8 exons or a subset of them. HLA-B has 4,454 alleles, of
which 416 have full sequences available. HLA-C has 3,290 alleles, with only 590
fully sequenced; HLA-DQA1 has 76 alleles, with 53 fully sequenced; HLA-DQB1 has
978 alleles, with 69 fully sequenced; and HLA-DRB1 has 1,972 alleles, with only
43 fully sequenced. During this first step, HISAT-genotype chooses
representative alleles from groups of alleles that have the same exon sequences.
Next, it identifies which of the representative alleles are highly likely
to be present in the sequenced sample. Then the other alleles from the groups with
the same exons as those representatives are selected for assessment during the
next step. In the second step, HISAT-genotype further identifies candidate alleles based on
both exons and introns. HISAT-genotype applies the following statistical model
in each of the two steps to find maximum likelihood estimates of abundance
through an Expectation-Maximization (EM) algorithm [39]. We previously implemented an EM
solution in our Centrifuge system [40], and we used a similar algorithm in HISAT-genotype, with
modifications to the variable definitions as follows.
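
Before the model itself, the grouping used in the first step can be illustrated with a minimal sketch. It is not HISAT-genotype's actual code: the function names, the toy allele names and sequences, and the rule for picking a representative (the alphabetically first member of each group) are all assumptions made for illustration. The text above states only that representatives are chosen from groups of alleles with the same exon sequences.

```python
from collections import defaultdict


def group_by_exon_sequence(exon_seqs):
    """Group allele names that share an identical exon sequence.

    exon_seqs maps allele name -> concatenated exon sequence (toy data).
    Returns a dict mapping each representative allele to the sorted
    list of all members of its group (the representative included).
    """
    groups = defaultdict(list)
    for name, seq in exon_seqs.items():
        groups[seq].append(name)

    representatives = {}
    for members in groups.values():
        members.sort()
        # Assumption: take the alphabetically first name as representative.
        representatives[members[0]] = members
    return representatives


def expand_candidates(representatives, likely_present):
    """Collect every allele whose representative was judged likely present."""
    candidates = []
    for rep in likely_present:
        candidates.extend(representatives[rep])
    return candidates


# Hypothetical alleles: a1 and a2 differ only outside the exons.
toy_exons = {
    "a1": "ACGTAC",
    "a2": "ACGTAC",
    "b1": "ACGTTC",
    "c1": "ACCTAC",
}
reps = group_by_exon_sequence(toy_exons)
second_step_input = expand_candidates(reps, ["a1", "c1"])
```

Here `a1` stands in for both `a1` and `a2` during the exon-only step. If `a1` is judged likely to be present, both alleles are passed on to the second step, where intron sequences can distinguish them.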