Evaluating SIFT's Performance on Mutation Databases

The human variation (HumVar) and human divergence (HumDiv) data sets used to assess SIFT’s performance were obtained from UniProtKB (20). Adzhubei et al. (20) compiled the HumDiv deleterious list from mutations annotated as causing Mendelian diseases in humans. They created the HumDiv neutral data set by comparing human proteins with their homologs in closely related mammals and identifying the amino acids that differ. For the HumVar deleterious data set, the authors included any mutation annotated as causing human disease, regardless of whether the disease is Mendelian in origin. The HumVar neutral data set consists of nonsynonymous polymorphisms not annotated as disease causing.

We mapped the HumVar and HumDiv data to Ensembl, RefSeq and UCSC Known ids using the UniProtKB id mapping tool (http://www.uniprot.org/help/uniprotkb). Not all mutations from the data sets could be mapped, so the final number of mutations used is smaller than in the original data sets (Table 1).

True positives (TP) are disease-causing mutations correctly predicted to affect protein function, and false negatives (FN) are disease-causing mutations incorrectly predicted to be tolerated. True negatives (TN) are neutral variations correctly predicted to be tolerated, and false positives (FP) are neutral variations incorrectly predicted to affect protein function.
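The four outcome classes defined above can be tallied mechanically once each mutation carries an annotation and a prediction. The following is a minimal sketch, not the authors' code: the function name `tally_outcomes` and the input layout (pairs of booleans) are assumptions made for illustration.

```python
def tally_outcomes(records):
    """Count TP, FN, TN and FP.

    records: iterable of (is_disease_causing, predicted_to_affect_function)
    boolean pairs. This layout is hypothetical, chosen for the example.
    """
    counts = {"TP": 0, "FN": 0, "TN": 0, "FP": 0}
    for is_disease_causing, predicted_to_affect_function in records:
        if is_disease_causing:
            # Disease-causing mutation: correct if predicted to affect function.
            key = "TP" if predicted_to_affect_function else "FN"
        else:
            # Neutral variation: correct if predicted to be tolerated.
            key = "FP" if predicted_to_affect_function else "TN"
        counts[key] += 1
    return counts


# Illustrative input only; these are not data from HumDiv or HumVar.
example = [(True, True), (True, False), (False, False), (False, True), (False, False)]
print(tally_outcomes(example))
```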
Table 1. Number of HumDiv and HumVar data points used to assess SIFT’s performance

Data set	From original data set (20)	Used in evaluating SIFT*	With SIFT predictions	Coverage** (%)
HumDiv neutral	6027	5816	5582	96.0
HumDiv deleterious	3055	2893	2791	96.5
HumVar neutral	8638	7475	7178	96.0
HumVar deleterious	12598	11982	11561	96.5

*Lookups to the SIFT database required Ensembl, RefSeq and UCSC Known protein identifiers and the chromosome associated with the given identifier. Not all data points could be mapped to these types of protein identifiers using UniProtKB’s ID mapping tool. Furthermore, we were not able to map some proteins to their chromosomes.

**Coverage (%) = 100 × (Number with SIFT predictions / Number used in evaluating SIFT)
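As a quick arithmetic check, the coverage column of Table 1 can be recomputed from the other two columns. The sketch below assumes coverage is expressed as a percentage of the data points used in evaluating SIFT; the function name `coverage_percent` is introduced here for illustration.

```python
def coverage_percent(n_with_predictions, n_tested):
    """Percentage of tested data points for which a prediction exists."""
    return 100.0 * n_with_predictions / n_tested


# (used in evaluating SIFT, with SIFT predictions), copied from Table 1.
table1 = {
    "HumDiv neutral": (5816, 5582),
    "HumDiv deleterious": (2893, 2791),
    "HumVar neutral": (7475, 7178),
    "HumVar deleterious": (11982, 11561),
}

for name, (n_tested, n_predicted) in table1.items():
    print(f"{name}: {coverage_percent(n_predicted, n_tested):.1f}%")
```

Rounded to one decimal place, the recomputed values match the coverage figures reported in the table.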

The various statistics are computed as follows:

Sensitivity = TP/(TP + FN)

Specificity = TN/(TN + FP)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Negative predictive value (NPV) = TN / (TN + FN)

Matthews correlation coefficient (MCC) = X / Y

where X = [(TP × TN) – (FP × FN)] and Y = SQRT[(TP + FP) (TP + FN) (TN + FP) (TN + FN)].
We generated receiver operating characteristic (ROC) curves for each protein database by computing the SIFT score for each substitution and categorizing each substitution as tolerated or deleterious at a range of threshold values. For each threshold, the true positive rate (sensitivity) and the false positive rate (1 – specificity) were computed and plotted in Figure 1.
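The threshold sweep behind an ROC curve can be sketched as follows. This is an illustration, not the authors' implementation. It assumes that lower SIFT scores indicate a deleterious substitution, so a substitution is called deleterious when its score is below the threshold; the function name `roc_points`, the thresholds and the toy scores are all hypothetical.

```python
def roc_points(scores, is_disease_causing, thresholds):
    """Return (threshold, false positive rate, true positive rate) tuples.

    Assumption: a substitution is predicted deleterious when score < threshold.
    Both classes must be present in the input.
    """
    points = []
    for threshold in thresholds:
        tp = fn = tn = fp = 0
        for score, disease in zip(scores, is_disease_causing):
            predicted_deleterious = score < threshold
            if disease:
                if predicted_deleterious:
                    tp += 1
                else:
                    fn += 1
            else:
                if predicted_deleterious:
                    fp += 1
                else:
                    tn += 1
        tpr = tp / (tp + fn)   # sensitivity
        fpr = fp / (fp + tn)   # 1 - specificity
        points.append((threshold, fpr, tpr))
    return points


# Toy scores and labels, invented for illustration.
scores = [0.00, 0.02, 0.30, 0.04, 0.60, 1.00]
labels = [True, True, True, False, False, False]
for threshold, fpr, tpr in roc_points(scores, labels, [0.0, 0.05, 0.5, 1.01]):
    print(f"threshold={threshold:.2f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Plotting the false positive rate on the x-axis against the true positive rate on the y-axis, one point per threshold, gives the ROC curve.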
Figure 1. Performance statistics of SIFT predictions on PolyPhen-2’s (a) HumVar and (b) HumDiv data sets when using various protein databases. ROC curves on the (c) HumVar and (d) HumDiv data sets. Although UniRef-100 shows slightly better performance than UniRef-90, it has lower coverage.

Sim N.L., Kumar P., Hu J., Henikoff S., Schneider G., & Ng P.C. (2012). SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Research, 40(Web Server issue), W452-W457.

Publication 2012

Other organizations : Genome Institute of Singapore, J. Craig Venter Institute, Franklin & Marshall College, Howard Hughes Medical Institute, Fred Hutch Cancer Center, Bioinformatics Institute
