Software was developed to process SLAF-seq data; the procedure is shown in Figure S1. All SLAF paired-end reads with clear index information were clustered based on sequence similarity. To reduce computing requirements, identical reads were merged, and sequence similarity was then detected by one-to-one alignment with BLAT [23] (-tileSize=10 -stepSize=5). Sequences with over 90% identity were grouped into one SLAF locus.
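The clustering step can be sketched in Python as follows. This is a minimal illustration, not the software described here: the position-wise `identity` function is a stand-in for a BLAT alignment and only handles reads of equal length, the greedy seed-based grouping is an assumed strategy, and the names `identity` and `cluster_reads` are hypothetical. Only the merging of identical reads and the 90% identity threshold come from the text.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumed stand-in for BLAT identity)."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_reads(reads, min_identity=0.90):
    """Group reads into loci; each locus maps a distinct tag to its depth."""
    # Merge identical reads first so each distinct tag is compared only once.
    tag_depth = Counter(reads)
    loci = []
    # Visit the most abundant tags first so that they become the locus seeds.
    for tag, depth in tag_depth.most_common():
        for locus in loci:
            seed = next(iter(locus))  # first tag placed in this locus
            if identity(tag, seed) > min_identity:
                locus[tag] = depth
                break
        else:
            loci.append({tag: depth})
    return loci
```

Merging identical reads before comparison means the number of pairwise comparisons depends on the number of distinct tags rather than on the raw read count, which is the saving the text refers to.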
Alleles were defined in each SLAF using minor allele frequency (MAF) evaluation. To prevent false-positive results, the sequencing error rate was estimated using the rice data as a control; these data were obtained with the same sequencing scheme as that used for common carp (Figure S1B). True genotypes had markedly higher MAF values than genotypes containing sequencing errors. Tags with sequencing errors were corrected to the most similar genotype to improve data efficiency. In mapping populations of diploid species, one locus can contain at most 4 genotypes, so groups containing more than 4 tags were filtered out as repetitive SLAFs. SLAFs with a sequence depth of less than 213 were defined as low-depth SLAFs and were excluded from the following analysis. Only groups with suitable depth and no more than 4 seed tags were identified as high-quality SLAFs, and SLAFs with 2–4 tags were identified as polymorphic SLAFs.
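The correction and filtering rules above can be sketched in Python. This is an illustrative sketch under stated assumptions: the frequency threshold `min_fraction=0.05` separating true tags from error tags is hypothetical (the text says only that the error rate was estimated from the rice control), depth is taken to be the summed depth of all tags at a locus, and the function names are invented for the example. The limits of 4 tags and depth 213 are the values given in the text.

```python
MIN_DEPTH = 213  # low-depth threshold stated in the text
MAX_TAGS = 4     # a diploid mapping population has at most 4 genotypes per locus


def mismatches(a: str, b: str) -> int:
    """Number of differing positions, counting any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def correct_errors(tag_depths, min_fraction=0.05):
    """Fold rare tags (assumed sequencing errors) into the most similar tag."""
    total = sum(tag_depths.values())
    # min_fraction is a hypothetical cutoff, not a value from the text.
    true_tags = {t: d for t, d in tag_depths.items() if d / total >= min_fraction}
    if not true_tags:
        return dict(tag_depths)
    corrected = dict(true_tags)
    for tag, depth in tag_depths.items():
        if tag not in true_tags:
            nearest = min(true_tags, key=lambda t: mismatches(t, tag))
            corrected[nearest] += depth
    return corrected


def classify_slaf(tag_depths):
    """Label one locus from its corrected tag depths."""
    if len(tag_depths) > MAX_TAGS:
        return "repetitive"
    if sum(tag_depths.values()) < MIN_DEPTH:
        return "low-depth"
    return "polymorphic" if len(tag_depths) >= 2 else "non-polymorphic"
```

For example, `classify_slaf(correct_errors({"AAAA": 150, "AAAT": 100, "AATT": 3}))` folds the rare tag into its nearest neighbour before the locus is labelled.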
To evaluate the accuracy of our genotyping objectively, we used a Bayesian approach. Using the coverage of each allele and the number of single-nucleotide polymorphisms, we calculated the posterior conditional probability that a given individual has a specific genotype at a given locus, as follows. Suppose there are $n$ alleles at a given locus, denoted $A_1, A_2, \ldots, A_n$. For a diploid species, the number of possible genotypes is $n(n+1)/2$, and $n$ is less than five regardless of the segregation type of the locus. We assigned an a priori probability to each genotype according to the theoretical frequency with which it would occur in this finite probability space: $1/n^2$ for a homozygous genotype and double that, $2/n^2$, for a heterozygous genotype. For a pair of distinct alleles $A_i$ and $A_j$, the probability of sequencing one allele as the other was calculated by a formula in $e$, $l$, and $m$, where $e$ is the average sequencing error rate (set to 0.015 for the Illumina sequencing platform), $l$ is the read length, and $m$ is the number of single-nucleotide polymorphisms. From this, we obtained the probability of allele $A_i$ conditioned on genotype $G$, denoted $P(A_i \mid G)$. With the observed depth of allele $A_i$ denoted $d_i$, we computed the conditional probability of the depth observation under each genotype, and from it the probability of the assigned genotype conditioned on the coverage observation. This probability was finally translated into a genotyping quality score, which indicates the confidence with which the genotype was called. In particular, when the difference in depth between the two alleles exceeded 1∶5, the score could be modified directly using formula (1) because of systematic bias. The upper bound of the score is 30.
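The scoring scheme can be illustrated with a short Python sketch. It is not the authors' implementation, and the displayed formulas are not reproduced here, so three forms are assumed: a misread probability of e^m (1 - e)^(l - m) between two alleles that differ at m positions in a read of length l, a multinomial likelihood for the allele depths, and a Phred-style score -10 log10(1 - P). The function names and the default read length of 80 are also hypothetical. From the text the sketch takes only the prior (a heterozygous genotype is twice as probable as a homozygous one, normalized to sum to one), the error rate of 0.015, and the upper bound of 30. The depth-ratio adjustment mentioned above is not modelled.

```python
import math
from itertools import combinations_with_replacement


def misread_prob(e=0.015, read_len=80, n_snps=1):
    """Assumed chance that a read from one allele is observed as another."""
    return (e ** n_snps) * ((1 - e) ** (read_len - n_snps))


def genotype_posteriors(depths, e=0.015, read_len=80, n_snps=1):
    """Posterior of each genotype (i, j), i <= j, given per-allele depths."""
    n = len(depths)
    swap = misread_prob(e, read_len, n_snps)  # requires e > 0
    log_post = {}
    for i, j in combinations_with_replacement(range(n), 2):
        # Prior: 1/n^2 for homozygous, 2/n^2 for heterozygous genotypes.
        log_p = math.log((1 if i == j else 2) / n ** 2)
        for k, depth in enumerate(depths):
            # A read comes from either chromosome copy with probability 1/2.
            p_k = 0.5 * sum((1 - (n - 1) * swap) if a == k else swap
                            for a in (i, j))
            log_p += depth * math.log(p_k)  # multinomial term (assumed)
        log_post[(i, j)] = log_p
    # Normalize in log space to avoid underflow at high depth.
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def quality_score(posterior, cap=30.0):
    """Assumed Phred-style score, capped at the upper bound from the text."""
    if posterior >= 1.0:
        return cap
    return min(cap, -10 * math.log10(1 - posterior))
```

Under these assumptions, a locus with depths `[20, 0]` is called homozygous with a score at the cap, while a shallow, unbalanced observation such as `[3, 1]` receives a lower score, which is the behaviour the quality score is meant to capture.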
This genotyping quality score was used to select qualified markers and individuals for subsequent analysis through a dynamic optimization process. Briefly, we counted the low-quality genotype calls for each SLAF marker and for each individual and deleted the worst marker or individual. We repeated this process, deleting one individual or marker at a time, and stopped when the average genotyping quality score of all SLAF markers reached the cutoff value of 13.
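The iterative selection can be sketched in Python as below. Several details are assumptions rather than facts from the text: a single call is treated as low quality when its score is below `low_quality` (set here to the same value as the cutoff), markers and individuals are compared by their fraction of low-quality calls so that the two axes are on the same scale, and the score table is assumed to be complete. The function name `prune` is hypothetical. The one-at-a-time deletion and the stopping cutoff of 13 follow the description above.

```python
def prune(scores, cutoff=13.0, low_quality=13.0):
    """scores maps (individual, marker) -> genotyping quality score."""
    individuals = sorted({i for i, _ in scores})
    markers = sorted({m for _, m in scores})

    def mean_score():
        vals = [scores[(i, m)] for i in individuals for m in markers]
        return sum(vals) / len(vals)

    # Delete one individual or marker per round until the mean reaches cutoff.
    while len(individuals) > 1 and len(markers) > 1 and mean_score() < cutoff:
        bad_ind = {i: sum(scores[(i, m)] < low_quality for m in markers)
                   / len(markers) for i in individuals}
        bad_mar = {m: sum(scores[(i, m)] < low_quality for i in individuals)
                   / len(individuals) for m in markers}
        worst_i = max(bad_ind, key=bad_ind.get)
        worst_m = max(bad_mar, key=bad_mar.get)
        # Remove whichever has the larger share of low-quality calls.
        if bad_ind[worst_i] >= bad_mar[worst_m]:
            individuals.remove(worst_i)
        else:
            markers.remove(worst_m)
    return individuals, markers
```

Recomputing the counts after every deletion matters because removing a poor marker changes the low-quality count of every individual, and vice versa.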