Figure 1 visualises the simulation design. The simulation procedure generated both normally distributed continuous predictors and Bernoulli-distributed binary predictors, each within clusters of serially correlated variables, to represent multiple risk factors that measure similar patient characteristics. These data were row-partitioned into M = 5 distinct subsets of size nexist = 5000, representing five “existing populations”, and one subset of size nlocal, representing the “local population”. Each of the M = 5 existing populations was used to fit an existing logistic regression CPM, representing those available from the literature, with each CPM including a potentially overlapping subset of risk predictors (see Additional file 1: Table S1 for details of predictor selection for the existing CPMs).

The single local population was randomly split into a training set and a validation set, of sizes ntrain and nvalidate, respectively (i.e. nlocal = ntrain + nvalidate). The training set was used for model aggregation using SR, PCA and PLS, in addition to redevelopment using AIC and ridge regression. Because real datasets frequently collect only a subset of the potential risk factors, only those predictors included in at least one of the five existing CPMs were considered candidates during redevelopment. Across simulations, ntrain was varied over (150, 250, 500, 1000, 5000, 10000); the validation set was reserved solely for validating the models, with nvalidate fixed at 5000 observations. Whilst local populations are unlikely to have access to such a large validation set, this size was chosen to give sufficient event numbers for an accurate assessment of model performance [21–23]. Additionally, although bootstrapping methods are preferable for assessing model performance in real-world datasets, the split-sample method was employed here for simplicity and clear illustration of the methods [24].
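The data-generation and partitioning steps above can be sketched as follows. This is a minimal illustration in Python (the study itself was implemented in R), and the function names, cluster sizes, and the AR(1) correlation parameter `rho` are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_predictors(n, n_cont=5, n_bin=5, rho=0.5, rng=rng):
    """Simulate one cluster of serially correlated continuous predictors
    and one of binary predictors.

    Continuous predictors are drawn from a multivariate normal with an
    AR(1)-style correlation structure; binary predictors are obtained by
    thresholding a second correlated normal cluster at 0, giving
    Bernoulli(0.5) margins.
    """
    def ar1(p):
        # AR(1) correlation matrix: corr(X_i, X_j) = rho^|i - j|
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])

    x_cont = rng.multivariate_normal(np.zeros(n_cont), ar1(n_cont), size=n)
    latent = rng.multivariate_normal(np.zeros(n_bin), ar1(n_bin), size=n)
    x_bin = (latent > 0).astype(float)
    return np.hstack([x_cont, x_bin])

# Row-partition into M = 5 existing populations of size n_exist = 5000,
# plus one local population split into training and validation sets
# (n_train = 500 shown here; the paper varied it across simulations).
M, n_exist, n_train, n_validate = 5, 5000, 500, 5000
n_total = M * n_exist + n_train + n_validate
X = simulate_predictors(n_total)

existing = [X[m * n_exist:(m + 1) * n_exist] for m in range(M)]
local = X[M * n_exist:]
train, validate = local[:n_train], local[n_train:]
```

Each array in `existing` would then serve as the development data for one existing CPM, while `train` and `validate` play the roles of the local training and validation sets.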
Binary responses were simulated in all populations, with probabilities calculated from a population-specific generating logistic regression model that included a subset of the simulated risk predictors. The coefficients of each population-specific generating model were sampled from a normal distribution with a common mean across populations and variance σ. Higher values of σ therefore induced greater differences in predictor effects across populations, representing increasing between-population heterogeneity. For each of the aforementioned values of ntrain, simulations were run with σ values of (0, 0.125, 0.25, 0.375, 0.5, 0.75, 1).
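The outcome-generation step can be sketched as below, again in Python rather than the authors' R. The common mean coefficient vector, the intercept, and the predictor matrix are illustrative assumptions; σ is treated as the variance of the coefficient distribution, matching the description above, so σ = 0 reproduces identical predictor effects in every population:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_population_coefficients(mean_beta, sigma, rng=rng):
    """Sample one population's generating coefficients.

    Coefficients are drawn from a normal distribution centred on a
    common mean, with variance sigma controlling between-population
    heterogeneity (sigma = 0 yields the common mean exactly).
    """
    return rng.normal(loc=mean_beta, scale=np.sqrt(sigma))

def simulate_outcomes(X, beta, intercept=-1.0, rng=rng):
    """Generate binary responses from a logistic generating model."""
    p = 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))
    return rng.binomial(1, p)

# Example: outcomes for one population under moderate heterogeneity
mean_beta = np.full(10, 0.3)                 # illustrative common mean effects
beta_pop = draw_population_coefficients(mean_beta, sigma=0.25)
X = rng.normal(size=(5000, 10))              # stand-in predictor matrix
y = simulate_outcomes(X, beta_pop)
```

Repeating `draw_population_coefficients` once per population gives each of the five existing populations and the local population its own generating model, as described above.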
Across every combination of σ and ntrain, the simulation was repeated over 1000 iterations as a compromise between estimator accuracy and computational time. The simulations were implemented using R version 3.2.5 [25]. The following packages were used in the simulation: “pROC” [26] to calculate the AUC of each model, “plsRglm” [27] to fit the PLS models, and the “cv.glmnet” function within the “glmnet” package to derive a new model by cross-validated ridge regression [28]. The authors wrote all other code, which is available in Additional file 1.
Fig. 1 Simulation procedure: a pictorial representation of the simulation procedure for a given value of the population heterogeneity, σ, and a given development sample size, ntrain. This process was repeated across all combinations of σ and ntrain