After having introduced the statistical background of the Matthews correlation coefficient and the other two measures to which we compare it (accuracy and F1 score), we explore here the correlation between these three measures. To explore these statistical correlations, we take advantage of the Pearson correlation coefficient (PCC) [100], a measure particularly suitable for evaluating the linear relationship between two continuous variables [101]. We avoid the usage of rank correlation coefficients (such as Spearman’s ρ and Kendall’s τ [102]) because we are not focusing on the ranks of the two lists.
For a given positive integer N≥10, we consider all the possible $\binom{N+3}{3}$ confusion matrices for a dataset with N samples and, for each matrix, compute the accuracy, MCC, and F1 score, and then the Pearson correlation coefficient for each pair of the three sets of values. MCC and accuracy turned out to be strongly correlated, while the Pearson coefficient is less than 0.8 for the correlation of F1 with each of the other two measures (Table 3). Interestingly, the correlation grows with N, but the increments are limited.

Correlation between MCC, accuracy, and F1 score values

N       PCC (MCC, F1 score)   PCC (MCC, accuracy)   PCC (accuracy, F1 score)
10      0.742162              0.869778              0.744323
25      0.757044              0.893572              0.760708
50      0.766501              0.907654              0.769752
75      0.769883              0.912530              0.772917
100     0.771571              0.914926              0.774495
200     0.774060              0.918401              0.776830
300     0.774870              0.919515              0.777595
400     0.775270              0.920063              0.777976
500     0.775509              0.920388              0.778201
1000    0.775982              0.921030              0.778652

Pearson correlation coefficient (PCC) between accuracy, MCC, and F1 score, computed on all confusion matrices with a given number of samples N

Similarly to what Flach and colleagues did for their isometrics strategy [66], we depict a scatterplot of the MCCs and F1 scores for all the 21 084 251 possible confusion matrices for a toy dataset with 500 samples (Fig. 1). We take advantage of this scatterplot to give an overview of the mutual relationship between MCC and F1 score.

Relationship between MCC and F1 score. Scatterplot of all the 21 084 251 possible confusion matrices for a dataset with 500 samples on the MCC/F1 plane. In red, the (−0.04, 0.95) point corresponding to use case A1

The two measures are reasonably concordant, but the scatterplot cloud is wide, implying that for each value of the F1 score there is a corresponding range of values of MCC and vice versa, although with different widths. In fact, for any value F1=ϕ, the MCC varies approximately within [ϕ−1,ϕ], so that the width of the variability range is 1, independent of the value of ϕ. On the other hand, for a given value MCC=μ, the F1 score can range in [0,μ+1] if μ≤0 and in [μ,1] if μ>0, so that the width of the range is 1−|μ|; that is, it depends on the MCC value μ.
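The range claim can be checked empirically. The following sketch (illustrative, not from the paper) enumerates all confusion matrices for N=100 and records, for each F1 value rounded to one decimal, the minimum and maximum MCC observed; each bucket spans a range of width roughly 1:

```python
# Illustrative check: for F1 near a fixed value, MCC spans a range of
# width roughly 1, as claimed; degenerate matrices are skipped.
from math import sqrt

N = 100
spans = {}  # F1 rounded to 1 decimal -> (min MCC, max MCC)
for tp in range(N + 1):
    for fn in range(N + 1 - tp):
        for fp in range(N + 1 - tp - fn):
            tn = N - tp - fn - fp
            if 2 * tp + fp + fn == 0:
                continue  # F1 undefined
            denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
            if denom == 0:
                continue  # MCC undefined
            f1 = 2 * tp / (2 * tp + fp + fn)
            mcc = (tp * tn - fp * fn) / denom
            key = round(f1, 1)
            lo, hi = spans.get(key, (mcc, mcc))
            spans[key] = (min(lo, mcc), max(hi, mcc))

for phi in sorted(spans):
    lo, hi = spans[phi]
    print(f"F1 ~ {phi:.1f}: MCC in [{lo:+.3f}, {hi:+.3f}]")
```

For example, the bucket around F1 ≈ 0.5 contains MCC values close to both −0.5 and +0.5, consistent with the interval [ϕ−1, ϕ].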
Note that a large portion of the above variability is due to the fact that F1 is independent of TN: in general, all matrices $M=\begin{pmatrix}\alpha & \beta\\ \gamma & x\end{pmatrix}$ have the same value $F_1=\frac{2\alpha}{2\alpha+\beta+\gamma}$ regardless of the value of x, while the corresponding MCC values range from $-\sqrt{\frac{\beta\gamma}{(\alpha+\beta)(\alpha+\gamma)}}$ for $x=0$ to the asymptotic $\sqrt{\frac{\alpha^2}{(\alpha+\beta)(\alpha+\gamma)}}$ for $x\to\infty$. For example, if we consider only the 63 001 confusion matrices of datasets of size 500 where TP=TN, the Pearson correlation coefficient between F1 and MCC increases to 0.9542254.
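This TN-invariance is easy to observe numerically. In the following sketch the entries α=45, β=5, γ=10 are illustrative values, not taken from the paper: F1 stays constant as TN=x grows, while MCC sweeps from its lower endpoint at x=0 toward its asymptotic upper endpoint:

```python
# Illustrative sketch: F1 ignores TN, while MCC moves across its full
# range for the matrix M = [[alpha, beta], [gamma, x]] as x = TN grows.
from math import sqrt

alpha, beta, gamma = 45, 5, 10  # TP, FN, FP (illustrative values)


def mcc(tp, fn, fp, tn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


# F1 depends only on alpha, beta, gamma -- not on TN.
f1 = 2 * alpha / (2 * alpha + beta + gamma)
# Endpoints of the MCC range derived in the text.
mcc_lo = -sqrt(beta * gamma / ((alpha + beta) * (alpha + gamma)))
mcc_hi = alpha / sqrt((alpha + beta) * (alpha + gamma))

for x in (0, 10, 100, 10_000):
    print(f"TN={x}: F1={f1:.3f}, MCC={mcc(alpha, beta, gamma, x):+.3f}")
```

With these values F1 is fixed at 90/105 ≈ 0.857, while the printed MCC climbs from the lower endpoint toward the asymptotic upper one.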
Overall, accuracy, F1, and MCC show reliably concordant scores for predictions that correctly classify both positives and negatives (and therefore have many TP and TN), and for predictions that incorrectly classify both positives and negatives (and therefore have few TP and TN); however, these measures show discordant behaviors when the prediction performs well on only one of the two binary classes. In fact, when a prediction displays many true positives but few true negatives (or many true negatives but few true positives), we will show that F1 and accuracy can provide misleading information, while MCC always generates results that reflect the overall prediction issues.