In all our experiments we evaluate U-Sleep as described in ref. 11. The model scores the full PSG, but predictions on segments whose label is not one of the five sleep stages (e.g., segments labeled 'UNKNOWN' or 'MOVEMENT') are excluded from the evaluation. For each PSG, the final prediction is the result of all the possible combinations of the available EEG and EOG channels. Hence, we use the majority vote, i.e., the ensemble of the predictions given by the multiple combinations of input channels.
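The ensembling and masking steps described above can be sketched as follows. This is a minimal illustration, not the U-Sleep implementation: the function names, the string encoding of the stages, the toy predictions, and the tie-breaking rule (the first channel combination that reaches the top vote count wins) are all assumptions made for the example.

```python
from collections import Counter

# The five sleep stages; the string encoding is an assumption for this sketch.
STAGES = {"W", "N1", "N2", "N3", "REM"}

def majority_vote(predictions):
    """Combine per-channel-combination predictions segment by segment.

    predictions: list of equally long stage sequences, one per channel combination.
    """
    ensemble = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Tie-break (assumption): keep the first vote that reaches the top count.
        ensemble.append(next(v for v in votes if counts[v] == top))
    return ensemble

def drop_unscored(y_true, y_pred):
    """Discard segments whose label is not one of the five sleep stages."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in STAGES]
    return [t for t, _ in pairs], [p for _, p in pairs]

# Hypothetical predictions from three channel combinations on six segments.
preds = [
    ["W", "N1", "N2", "N2", "REM", "N3"],
    ["W", "N2", "N2", "N2", "REM", "N3"],
    ["W", "N1", "N2", "N3", "N1", "W"],
]
labels = ["W", "N1", "N2", "UNKNOWN", "REM", "N3"]

ensemble = majority_vote(preds)
y_true, y_pred = drop_unscored(labels, ensemble)
```

In this toy case the fourth segment is scored by the model but does not enter the evaluation, because its label is 'UNKNOWN'.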
The unweighted F1-score metric (ref. 59) is computed on all the testing sets to evaluate the performance of the model in all the experiments. We compute the F1-score for each of the five classes and then combine them by calculating the unweighted mean. Note that the unweighted F1-score reduces the absolute scores because of the lower performance on less abundant classes such as sleep stage N1. For this reason, we also report in Supplementary Table 10, Supplementary Table 11, and Supplementary Table 12 the results achieved in terms of weighted F1-score, i.e., the per-class scores are weighted by the number of true instances for each label, so as to account for the high imbalance between the sleep stages. In that case, the absolute scores increase significantly in all the experiments. In Supplementary Table 10, Supplementary Table 11, and Supplementary Table 12 we also report the Cohen's kappa metric, given its valuable property of correcting for chance agreement between the automatic sleep scoring algorithm, i.e., the overall predicted sleep stages, and the ground truth, i.e., the sleep labels given by the physicians.
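To make the difference between the three metrics concrete, the following sketch computes the unweighted (macro) F1-score, the weighted F1-score, and Cohen's kappa in plain Python on a hypothetical ten-segment hypnogram. The function names and the toy labels are assumptions for illustration, as is the convention of assigning an F1 of 0 to a class that is never predicted correctly; this is not the evaluation code used in the study.

```python
from collections import Counter

CLASSES = ["W", "N1", "N2", "N3", "REM"]

def f1_per_class(y_true, y_pred, classes=CLASSES):
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 if the denominator is zero."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of the per-class F1-scores."""
    scores = f1_per_class(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)

def weighted_f1(y_true, y_pred, classes=CLASSES):
    """Per-class F1-scores weighted by the number of true instances."""
    scores = f1_per_class(y_true, y_pred, classes)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in classes) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """kappa = (p_o - p_e) / (1 - p_e), with p_e the chance agreement."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical hypnogram: the single N1 segment is misclassified as N2.
y_true = ["W"] * 3 + ["N1"] + ["N2"] * 4 + ["N3", "REM"]
y_pred = ["W"] * 3 + ["N2"] + ["N2"] * 4 + ["N3", "REM"]

macro = macro_f1(y_true, y_pred)        # about 0.778
weighted = weighted_f1(y_true, y_pred)  # about 0.856
kappa = cohen_kappa(y_true, y_pred)     # about 0.855
```

Here a single missed N1 segment sets the N1 F1-score to 0, which lowers the unweighted mean far more than the weighted one, since N1 accounts for only one of the ten true labels.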
* The use of the Bern Sleep Data Base (BSDB) registry was ethically approved in the framework of the E12034 - SPAS (Sleep Physician Assistant System) Eurostar-Horizon 2020 program (Kantonale Ethikkommission Bern, 2020-01094).