We divided our data into training (67%) and testing (33%) data sets by using the R package caTools17 and developed a random forest algorithm via the R package randomForest18 on the training data set for each of the 2 outcome variables (HAPIs ≥ stage 2 and HAPIs ≥ stage 1). We used the training data set to develop the random forest model and then evaluated the model's performance with the testing (held-out) data set.
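The original analysis performed this split in R with caTools; an analogous 67/33 stratified split can be sketched in Python with scikit-learn. The data here are synthetic placeholders (the variable names `X` and `y` are assumptions, not from the study).

```python
# Illustrative 67/33 train/test split, analogous to the caTools step in R.
# X and y are synthetic stand-ins for the study's predictors and HAPI outcome.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20))       # 20 candidate predictor variables
y = rng.integers(0, 2, size=300)     # binary outcome (synthetic)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # → 201 99
```

With 300 synthetic participants, a 0.33 test fraction holds out 99 observations and leaves 201 for training.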
We determined that 4 was the best number of features to be used for each tree (where M is the total number of features and m is the best number of features for each tree: m = √M, and √20 = 4.47, rounded to 4). We determined that the optimal number of iterations (or trees in the forest) was 500, because beyond that value the estimated "out-of-bag" error rate had sufficiently stabilized. We included all of the predictor variables except vasopressin and sampled participants with replacement. We set the cutoff value at 0.5 so that each tree "voted" and a simple majority won. After building the model with the training set, we applied the algorithm to the data in the testing data set. Next, we used the R package randomForest18 to rank the importance of each variable; we then constructed visual representations of the relationships between variables to assess directionality. Finally, we used the R package ROCR19 to assess receiver operating characteristic (ROC) curves and the area under the curve for each of our models by using the testing data set.
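The full workflow described above can be sketched in Python with scikit-learn as a stand-in for R's randomForest and ROCR: 500 trees, 4 features sampled per split (m = √20 ≈ 4.47, rounded to 4), bootstrap sampling with replacement, a 0.5 majority-vote cutoff, variable-importance ranking, and ROC/AUC evaluation on the held-out set. All data and variable names here are synthetic assumptions for illustration, not the study's data.

```python
# Hedged sketch of the random-forest workflow in scikit-learn, mirroring the
# R randomForest/ROCR steps described in the text. Data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))  # 20 predictors, as in m = sqrt(20) -> 4
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=500,   # 500 trees: where the OOB error had stabilized
    max_features=4,     # m = round(sqrt(M)) = round(sqrt(20)) = 4
    bootstrap=True,     # sample participants with replacement
    oob_score=True,     # track the out-of-bag error during fitting
    random_state=0,
).fit(X_train, y_train)

# 0.5 cutoff: the predicted class is the simple majority vote of the trees
proba = rf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.5).astype(int)

# Variable-importance ranking (mean decrease in impurity), most important first
ranking = np.argsort(rf.feature_importances_)[::-1]

# ROC curve and area under the curve on the held-out test set (ROCR's role)
auc = roc_auc_score(y_test, proba)
fpr, tpr, _ = roc_curve(y_test, proba)
print(f"OOB score: {rf.oob_score_:.2f}, test AUC: {auc:.2f}")
```

One design note: scikit-learn's `max_features` controls the number of features tried at each split, which is the closest analogue to randomForest's `mtry`; the √M heuristic is the common default for classification in both libraries.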