Although the parameter C facilitates the identification of clusters representing tandemly repeated genomic sequences, it does not efficiently discriminate clusters derived from satellite DNA (satDNA) from those representing other types of tandem repeats. Therefore, an additional cluster characteristic, the proportion of broken read pairs, is calculated. A typical feature of satellite repeats is that they occur in long contiguous arrays of monomers, ranging up to megabases in length, whereas other tandem repeats form arrays of hundreds to thousands of bp. Consequently, clusters of satDNA contain low proportions of broken read pairs, because most sequenced DNA fragments consist entirely of the same repeat. In contrast, the proportion of broken pairs is much higher for tandem repeats that are scattered across the genome in a large number of short arrays, because many sequenced fragments span the junctions between a tandem repeat array and its neighboring genomic sequences. This is evaluated as the pair completeness index P using the formula:
$$P = \frac{N_C}{N_C + N_I}$$
where $N_C$ is the number of complete read pairs in the cluster and $N_I$ is the number of broken pairs. Both criteria, C and P, are then used simultaneously to detect putative satellite repeats, for which both values are expected to be close to 1. Threshold values of C and P suitable for sensitive yet reliable identification of putative satellite repeats were estimated by re-analyzing 2968 manually annotated clusters from 11 plant species selected from the dataset published by Macas et al. (30). The estimation used discriminant analysis based on a Gaussian finite mixture model (37), as implemented in the R package mclust.
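As an illustration only, the pair completeness index can be sketched in Python as the fraction of complete read pairs among all read pairs assigned to a cluster. The names `ClusterPairCounts`, `pair_completeness` and `is_putative_satellite` are hypothetical and do not come from the published pipeline. The sketch assumes that a broken pair is one with only a single mate in the cluster, and that the two criteria are combined as simple lower bounds. The thresholds are deliberately left as required arguments, because the values estimated with mclust are not given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPairCounts:
    """Read-pair counts for one cluster (hypothetical container)."""

    complete_pairs: int  # N_C: pairs with both mates in the cluster
    broken_pairs: int    # N_I: assumed to be pairs with only one mate in the cluster


def pair_completeness(counts: ClusterPairCounts) -> float:
    """Return P = N_C / (N_C + N_I) for one cluster."""
    if counts.complete_pairs < 0 or counts.broken_pairs < 0:
        raise ValueError("pair counts must be non-negative")
    total = counts.complete_pairs + counts.broken_pairs
    if total == 0:
        raise ValueError("P is undefined for a cluster without read pairs")
    return counts.complete_pairs / total


def is_putative_satellite(
    c_index: float,
    p_index: float,
    c_threshold: float,
    p_threshold: float,
) -> bool:
    """Flag a cluster when both C and P reach caller-supplied thresholds.

    Assumption: a simple rule with two lower bounds. The thresholds must be
    supplied by the caller; no default values are implied here.
    """
    return c_index >= c_threshold and p_index >= p_threshold


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    long_array = ClusterPairCounts(complete_pairs=950, broken_pairs=50)
    short_arrays = ClusterPairCounts(complete_pairs=400, broken_pairs=600)
    print(pair_completeness(long_array))
    print(pair_completeness(short_arrays))
```

With these made-up counts, the first cluster yields a P much closer to 1 than the second. This mirrors the expectation that long satellite arrays produce few broken pairs, whereas short scattered arrays produce many.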