For a given positive integer N ≥ 10, we consider all the possible confusion matrices for a dataset with N samples and, for each matrix, compute the accuracy, MCC, and F1 score, and then the Pearson correlation coefficient between each pair of the three sets of values. MCC and accuracy turn out to be strongly correlated, while the Pearson coefficient is below 0.8 for the correlation of F1 with each of the other two measures (see the table below; a brief code sketch of this enumeration follows the table).
Correlation between MCC, accuracy, and F1 score values
| N | PCC (MCC, F1 score) | PCC (MCC, accuracy) | PCC (accuracy, F1 score) |
|---|---|---|---|
| 10 | 0.742162 | 0.869778 | 0.744323 |
| 25 | 0.757044 | 0.893572 | 0.760708 |
| 50 | 0.766501 | 0.907654 | 0.769752 |
| 75 | 0.769883 | 0.912530 | 0.772917 |
| 100 | 0.771571 | 0.914926 | 0.774495 |
| 200 | 0.774060 | 0.918401 | 0.776830 |
| 300 | 0.774870 | 0.919515 | 0.777595 |
| 400 | 0.775270 | 0.920063 | 0.777976 |
| 500 | 0.775509 | 0.920388 | 0.778201 |
| 1000 | 0.775982 | 0.921030 | 0.778652 |
Pearson correlation coefficient (PCC) between accuracy, MCC, and F1 score, computed on all confusion matrices with a given number of samples N
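The enumeration just described can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' original code: the function name `enumerate_metrics` and the choice N = 50 are ours, and matrices for which F1 or MCC has a zero denominator are skipped here; since the treatment of such degenerate matrices is not specified above, the resulting coefficients may differ slightly from the table.

```python
# Illustrative sketch: enumerate all confusion matrices with N samples,
# compute accuracy, F1, and MCC for each, then the Pearson correlation
# between the three vectors of metric values.
from itertools import product
from math import sqrt

import numpy as np

def enumerate_metrics(n):
    acc_vals, f1_vals, mcc_vals = [], [], []
    for tp, fn, fp in product(range(n + 1), repeat=3):
        tn = n - tp - fn - fp
        if tn < 0:
            continue
        f1_den = 2 * tp + fn + fp
        mcc_den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        if f1_den == 0 or mcc_den == 0:
            continue  # degenerate matrix: F1 or MCC undefined (assumption: skip)
        acc_vals.append((tp + tn) / n)
        f1_vals.append(2 * tp / f1_den)
        mcc_vals.append((tp * tn - fp * fn) / sqrt(mcc_den))
    return np.array(acc_vals), np.array(f1_vals), np.array(mcc_vals)

acc, f1, mcc = enumerate_metrics(50)
print("PCC(MCC, F1)       =", np.corrcoef(mcc, f1)[0, 1])
print("PCC(MCC, accuracy) =", np.corrcoef(mcc, acc)[0, 1])
print("PCC(accuracy, F1)  =", np.corrcoef(acc, f1)[0, 1])
```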
Relationship between MCC and F1 score. Scatterplot of all the 21 084 251 possible confusion matrices for a dataset with 500 samples on the MCC/F1 plane. In red, the point (−0.04, 0.95) corresponding to use case A1
Note that a large portion of the above variability is due to the fact that F1 is independent of TN: in general, all matrices with TP = a, FN = b, FP = c, and TN = x have the same value $F_1 = \frac{2a}{2a+b+c}$ regardless of the value of x, while the corresponding MCC values range from $-\sqrt{\frac{bc}{(a+b)(a+c)}}$ for $x=0$ to the asymptotic $\frac{a}{\sqrt{(a+b)(a+c)}}$ for $x\to\infty$. For example, if we consider only the 63 001 confusion matrices of datasets of size 500 where TP=TN, the Pearson correlation coefficient between F1 and MCC increases to 0.9542254.
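To make this effect concrete, the following sketch fixes (TP, FN, FP) = (a, b, c) and lets TN = x grow; the specific values a = 45, b = 5, c = 10 are arbitrary illustrative choices, not taken from any use case above.

```python
# Illustrative sketch: F1 stays at 2a/(2a+b+c) while MCC climbs from
# -sqrt(bc/((a+b)(a+c))) at x = 0 toward a/sqrt((a+b)(a+c)) as x grows.
from math import sqrt

a, b, c = 45, 5, 10           # TP, FN, FP (illustrative values)
f1 = 2 * a / (2 * a + b + c)  # independent of TN
for x in (0, 10, 100, 1000, 10**6):  # TN
    mcc = (a * x - b * c) / sqrt((a + c) * (a + b) * (x + c) * (x + b))
    print(f"TN = {x:>7}: F1 = {f1:.4f}, MCC = {mcc:+.4f}")
print("asymptotic MCC:", a / sqrt((a + b) * (a + c)))
```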
Overall, accuracy, F1, and MCC yield reliably concordant scores for predictions that correctly classify both positives and negatives (and therefore have many TP and TN), and for predictions that incorrectly classify both positives and negatives (and therefore have few TP and TN); however, the three measures diverge when the prediction performs well on only one of the two binary classes. In fact, when a prediction yields many true positives but few true negatives (or many true negatives but few true positives), we will show that F1 and accuracy can provide misleading information, while MCC always produces scores that reflect the deficiencies of the overall prediction.
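As a purely illustrative preview (the numbers below are ours, not one of the use cases examined later), a prediction with many true positives and no true negatives scores highly on accuracy and F1 while MCC stays slightly negative:

```python
# Illustrative numbers: many true positives, no true negatives.
from math import sqrt

tp, fn, fp, tn = 90, 5, 5, 0
n = tp + fn + fp + tn
accuracy = (tp + tn) / n                           # 0.90: looks good
f1 = 2 * tp / (2 * tp + fn + fp)                   # ~0.95: looks excellent
mcc = (tp * tn - fp * fn) / sqrt(                  # ~-0.05: flags the failure
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(f"accuracy = {accuracy:.2f}, F1 = {f1:.2f}, MCC = {mcc:.2f}")
```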