We evaluated model performance with several measures, following previously published recommendations for reporting external validation studies.10 These were: calibration, assessed with a calibration plot, calibration-in-the-large (model intercept) and the calibration slope; discrimination, assessed with the concordance statistic; and clinical usefulness, assessed with decision curve analysis.
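For readers who want to see how these measures are computed, the following is a minimal Python sketch; it is not the R-based validation software used in this study. The function names (`c_statistic`, `calibration_intercept_slope`, `net_benefit`) are hypothetical. The sketch assumes binary outcomes coded 0/1 and predicted risks strictly between 0 and 1. The intercept returned here comes from a joint fit with the slope; calibration-in-the-large is often estimated separately with the slope fixed at 1, which this sketch does not do.

```python
import math


def c_statistic(y, p):
    """Share of (event, non-event) pairs in which the event has the higher
    predicted risk; tied pairs count one half."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    score = 0.0
    for e in events:
        for n in nonevents:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / (len(events) * len(nonevents))


def calibration_intercept_slope(y, p, iterations=25):
    """Fit logit(P(y=1)) = a + b * logit(p) by Newton-Raphson.
    Assumes 0 < p < 1 and outcomes that are not perfectly separated."""
    lp = [math.log(pi / (1 - pi)) for pi in p]  # linear predictor
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(iterations):
        mu = [1 / (1 + math.exp(-(a + b * x))) for x in lp]
        w = [m * (1 - m) for m in mu]
        # Gradient of the log-likelihood
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * x for yi, m, x in zip(y, mu, lp))
        # Information matrix (2 x 2)
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, lp))
        h11 = sum(wi * x * x for wi, x in zip(w, lp))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def net_benefit(y, p, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold,
    the quantity plotted against the threshold in a decision curve."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy data for illustration only; not study data.
y_toy = [0, 0, 1, 0, 1, 1, 0, 1]
p_toy = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
print(c_statistic(y_toy, p_toy))
print(calibration_intercept_slope(y_toy, p_toy))
print(net_benefit(y_toy, p_toy, 0.5))
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing it with the treat-all and treat-none strategies.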
As recommended by Steyerberg et al,12 we used the scaled Brier score as a combined measure of model discrimination and calibration instead of the goodness-of-fit (Hosmer-Lemeshow) test.17 18
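As an illustration, the scaled Brier score can be sketched in Python as below. The function name `scaled_brier` is hypothetical. The sketch assumes that the reference (maximum) Brier score is that of a non-informative model predicting the observed event rate for every patient; whether the validation software uses exactly this reference is an assumption.

```python
def scaled_brier(y, p):
    """Scaled Brier score: 1 - Brier / Brier_max.
    Roughly, 0 means no better than predicting the event rate for everyone,
    1 means perfect predictions. Assumes y contains both events and non-events."""
    n = len(y)
    brier = sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / n
    event_rate = sum(y) / n
    # Assumption: reference model predicts the observed event rate for all.
    brier_max = event_rate * (1 - event_rate)
    return 1 - brier / brier_max


# Toy data for illustration only; not study data.
y_toy = [0, 0, 1, 0, 1, 1, 0, 1]
p_toy = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
print(scaled_brier(y_toy, p_toy))
```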
Sensitivity and specificity were calculated for all models. Because negative and positive predictive values depend strongly on delirium incidence, they were not reported.
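The dependence of predictive values on incidence can be made concrete with a short Python sketch. The function names and the risk threshold of 0.5 are hypothetical choices for illustration, not the cut-offs used in this study. Sensitivity and specificity are computed from the data, and predictive values are then derived from them with Bayes' theorem for two assumed incidences.

```python
def sensitivity_specificity(y, p, threshold):
    """Classify as positive when predicted risk >= threshold.
    Assumes y contains at least one event and one non-event."""
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= threshold)
    fn = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi < threshold)
    tn = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi < threshold)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def predictive_values(sens, spec, incidence):
    """Positive and negative predictive values from Bayes' theorem."""
    ppv = sens * incidence / (sens * incidence + (1 - spec) * (1 - incidence))
    npv = spec * (1 - incidence) / (
        spec * (1 - incidence) + (1 - sens) * incidence
    )
    return ppv, npv


# Toy data for illustration only; not study data.
y_toy = [0, 0, 1, 0, 1, 1, 0, 1]
p_toy = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
sens, spec = sensitivity_specificity(y_toy, p_toy, 0.5)

# Same sensitivity and specificity, two assumed incidences.
for incidence in (0.1, 0.3):
    print(incidence, predictive_values(sens, spec, incidence))
```

With sensitivity and specificity held fixed, the positive predictive value rises and the negative predictive value falls as incidence increases, which is why these values do not transfer between populations with different delirium incidence.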
Calculations were performed semi-automatically using R-based validation software V.2.18 (available at