Evaluating SIFT's Performance on Mutation Databases
The human variation (HumVar) and human divergence (HumDiv) data sets used to assess SIFT’s performance were obtained from UniProtKB (20). Adzhubei et al. (20) compiled the HumDiv deleterious list from mutations annotated as causing Mendelian diseases in humans. They created the HumDiv neutral data set by comparing human proteins with their homologs in closely related mammals and identifying the amino acids that differ. For the HumVar deleterious data set, the authors included any mutation annotated as causing human disease, regardless of whether the disease is Mendelian in origin. The HumVar neutral data set consists of nonsynonymous polymorphisms not annotated as disease causing.
We mapped the HumVar and HumDiv data to Ensembl, RefSeq and UCSC Known IDs using the UniProtKB ID mapping tool (http://www.uniprot.org/help/uniprotkb). Not all mutations in the data sets could be mapped, so the final number of mutations used is smaller than in the original data sets (Table 1).
True positives (TP) are defined as disease-causing mutations correctly predicted to affect protein function, and false negatives (FN) are disease-causing mutations incorrectly predicted to be tolerated. True negatives (TN) are neutral variations correctly predicted as tolerated, and false positives (FP) are neutral variations incorrectly predicted to affect protein function.
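The TP/FN/TN/FP definitions above can be illustrated with a minimal sketch. It assumes each data point is a pair of a disease label and a SIFT score, and that a substitution is called deleterious when its score falls below a cutoff. The function name `confusion_counts`, the record layout, and the default cutoff of 0.05 are illustrative assumptions; this section does not state the threshold that was used.

```python
from collections import Counter

def confusion_counts(records, threshold=0.05):
    """Tally TP, FN, TN and FP for (is_disease, sift_score) pairs.

    Assumption: scores below `threshold` are treated as
    'predicted to affect protein function'; the 0.05 default is
    illustrative and not taken from this section.
    """
    counts = Counter(TP=0, FN=0, TN=0, FP=0)
    for is_disease, sift_score in records:
        predicted_deleterious = sift_score < threshold
        if is_disease:
            # Disease-causing mutation: correct call is "deleterious".
            counts["TP" if predicted_deleterious else "FN"] += 1
        else:
            # Neutral variation: correct call is "tolerated".
            counts["FP" if predicted_deleterious else "TN"] += 1
    return counts

# Made-up records for illustration only (not data from the paper).
example = [(True, 0.01), (True, 0.20), (False, 0.50), (False, 0.02)]
print(dict(confusion_counts(example)))
```

Data points without a prediction would need to be filtered out before this step; they contribute to coverage rather than to the four counts.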
Table 1. Number of HumDiv and HumVar data points used to assess SIFT’s performance
*Lookups to the SIFT database required Ensembl, RefSeq and UCSC Known protein identifiers and the chromosome associated with the given identifier. Not all data points could be mapped to these types of protein identifiers using UniProtKB’s ID mapping tool. Furthermore, we were not able to map some proteins to their chromosomes.
**Coverage = (Number with predictions/Number of data points tested)
The various statistics are computed as follows:
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Negative predictive value (NPV) = TN / (TN + FN)
Matthews correlation coefficient (MCC) = X / Y
where X = (TP × TN) – (FP × FN) and Y = SQRT[(TP + FP)(TP + FN)(TN + FP)(TN + FN)].
We generated receiver operating characteristic (ROC) curves for each protein database by computing the SIFT score for each substitution and categorizing the substitution as tolerated or deleterious at a range of threshold values. For each threshold, the true positive rate (sensitivity) and false positive rate (1 – specificity) were computed and plotted in Figure 1.
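The statistics defined above follow directly from the four confusion counts. The sketch below is one way to compute them; the function name `performance_metrics` and the choice to return NaN when a denominator is zero are assumptions made for illustration, and the example counts are invented rather than taken from Table 1.

```python
import math

def _ratio(numerator, denominator):
    # Return NaN instead of raising when a denominator is zero
    # (an illustrative choice, not specified in the text).
    return numerator / denominator if denominator else float("nan")

def performance_metrics(tp, fn, tn, fp):
    """Compute the listed statistics from confusion counts."""
    mcc_x = (tp * tn) - (fp * fn)
    mcc_y = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "mcc": _ratio(mcc_x, mcc_y),
    }

# Invented counts for illustration only.
for name, value in performance_metrics(tp=80, fn=20, tn=70, fp=30).items():
    print(f"{name}: {value:.3f}")
```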
Figure 1. Performance statistics of SIFT predictions on PolyPhen-2’s (a) HumVar and (b) HumDiv data sets when using various protein databases. ROC curves on the (c) HumVar and (d) HumDiv data sets. Although UniRef-100 shows slightly better performance than UniRef-90, it has lower coverage.
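The threshold sweep behind the ROC curves can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `roc_points`, the record layout of (disease label, SIFT score) pairs, and the rule that a score below the threshold counts as deleterious are all assumptions, and the example records are invented.

```python
def roc_points(records, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    Assumption: a substitution is called deleterious when its
    SIFT score is below the threshold being tested.
    """
    points = []
    for threshold in thresholds:
        tp = fn = tn = fp = 0
        for is_disease, sift_score in records:
            predicted_deleterious = sift_score < threshold
            if is_disease and predicted_deleterious:
                tp += 1
            elif is_disease:
                fn += 1
            elif predicted_deleterious:
                fp += 1
            else:
                tn += 1
        # True positive rate is sensitivity; false positive rate is
        # 1 - specificity. NaN marks an empty class.
        tpr = tp / (tp + fn) if (tp + fn) else float("nan")
        fpr = fp / (fp + tn) if (fp + tn) else float("nan")
        points.append((threshold, fpr, tpr))
    return points

# Invented records for illustration only.
example = [(True, 0.01), (True, 0.20), (False, 0.50), (False, 0.02)]
for threshold, fpr, tpr in roc_points(example, [0.0, 0.05, 1.01]):
    print(threshold, fpr, tpr)
```

Plotting the false positive rate against the true positive rate for each threshold gives one curve per protein database.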
Sim N.L., Kumar P., Hu J., Henikoff S., Schneider G., & Ng P.C. (2012). SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Research, 40(Web Server issue), W452-W457.
Publication 2012
Other organizations :
Genome Institute of Singapore, J. Craig Venter Institute, Franklin & Marshall College, Howard Hughes Medical Institute, Fred Hutch Cancer Center, Bioinformatics Institute