In each simulated data set, we estimated the propensity score using a logistic regression model regressing treatment status on the 10 baseline covariates. Propensity-score matching was used to construct a matched sample consisting of pairs of treated and untreated subjects. We used greedy nearest-neighbor matching on the logit of the propensity score with a caliper of width equal to $0.2\sqrt{(\sigma_1^2+\sigma_2^2)/2}$, where $\sigma_i^2$ is the variance of the logit of the propensity score in the $i$th treatment group. This caliper width was used because it has been shown to result in optimal estimation of risk differences in a variety of settings [10].
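The matching step above can be sketched as follows. This is an illustrative implementation, not the authors' code; the function name `greedy_caliper_match` and the use of sample variances to form the caliper are assumptions for the sketch.

```python
import numpy as np

def greedy_caliper_match(logit_ps_treated, logit_ps_control):
    """Greedy 1:1 nearest-neighbor matching on the logit of the propensity
    score, with caliper width 0.2 * sqrt((sigma_1^2 + sigma_2^2) / 2)."""
    caliper = 0.2 * np.sqrt((np.var(logit_ps_treated, ddof=1) +
                             np.var(logit_ps_control, ddof=1)) / 2.0)
    available = list(range(len(logit_ps_control)))
    pairs = []
    for i, lt in enumerate(logit_ps_treated):
        if not available:
            break
        # Nearest remaining control on the logit scale
        j = min(available, key=lambda k: abs(logit_ps_control[k] - lt))
        if abs(logit_ps_control[j] - lt) <= caliper:
            pairs.append((i, j))   # match formed within the caliper
            available.remove(j)    # match without replacement
        # Controls outside the caliper leave the treated subject unmatched
    return pairs
```

Because matching is greedy and without replacement, the result can depend on the order in which treated subjects are processed.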
In the propensity-score matched sample, the absolute risk reduction was estimated as the difference between the proportion of treated subjects in whom the outcome occurred and the proportion of untreated subjects in whom the outcome occurred. When the true absolute risk reduction was 0 (the null hypothesis), the statistical significance of the estimated risk difference was assessed using two different methods. First, using methods for independent samples, the Pearson chi-squared test was used to assess the statistical significance of the difference between treatment groups in the probability of the outcome occurring [13]. Second, using methods for paired samples, McNemar's test was used for this comparison.
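The two significance tests can be sketched from the matched-pair counts. This is an illustrative sketch, not the authors' code: it assumes the pair counts are labelled `a` (both members had the event), `b` (treated only), `c` (untreated only), and `d` (neither), and uses SciPy for the chi-squared computations.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

def matched_pair_tests(a, b, c, d):
    """P-values for the independent-samples Pearson chi-squared test and
    the paired McNemar test, given matched-pair counts a, b, c, d."""
    # Collapse the pairs into an independent-samples 2x2 table:
    # rows are treatment groups, columns are event / no event
    table = np.array([[a + b, c + d],    # treated
                      [a + c, b + d]])   # untreated
    chi2_stat, p_indep, _, _ = chi2_contingency(table, correction=False)
    # McNemar's test uses only the discordant pairs b and c
    mcnemar_stat = (b - c) ** 2 / (b + c)
    p_paired = chi2.sf(mcnemar_stat, df=1)
    return p_indep, p_paired
```

The independent-samples test ignores the pairing, whereas McNemar's statistic depends only on the discordant pairs; the two tests can therefore give different p-values on the same matched sample.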
The variance of the difference in proportions was estimated using two different methods. First, using methods for independent samples, let $p_T$ and $p_C$ denote the observed probability of the outcome occurring in treated and untreated subjects, respectively, in the propensity-score matched sample. Furthermore, assume that there are $N$ propensity-score matched pairs. Then the standard error of the estimated risk difference is given by $\sqrt{p_T(1-p_T)/N + p_C(1-p_C)/N}$ [13]. Second, using methods for paired samples, we assume that in the matched sample there were $a$ pairs in which both the treated and untreated subjects experienced the event; $b$ pairs in which the treated subject experienced the event while the untreated subject did not; and $c$ pairs in which the untreated subject experienced the event while the treated subject did not. Then, the variance of the difference in proportions was estimated by $((b+c)-(c-b)^2/n)/n^2$ [14]. In both cases, 95 per cent confidence intervals were estimated as $p_T - p_C \pm 1.96 \times \mathrm{se}(p_T - p_C)$, where $\mathrm{se}(p_T - p_C)$ denotes the estimated standard error of the risk difference.
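The two variance estimators and the resulting confidence intervals can be computed directly from the pair counts. This is a minimal sketch, not the authors' code; it assumes pair counts `a`, `b`, `c`, `d` as defined in the text (with `d` the pairs in which neither subject had the event) and takes the number of pairs as `n = a + b + c + d`.

```python
import math

def risk_difference_cis(a, b, c, d):
    """Risk difference and 95% CIs under the independent-samples and
    paired variance estimators, given matched-pair counts a, b, c, d."""
    n = a + b + c + d            # number of matched pairs
    p_t = (a + b) / n            # event probability among treated
    p_c = (a + c) / n            # event probability among untreated
    rd = p_t - p_c
    # Independent-samples standard error
    se_indep = math.sqrt(p_t * (1 - p_t) / n + p_c * (1 - p_c) / n)
    # Paired variance based on the discordant pairs b and c
    se_paired = math.sqrt(((b + c) - (c - b) ** 2 / n) / n ** 2)
    ci_indep = (rd - 1.96 * se_indep, rd + 1.96 * se_indep)
    ci_paired = (rd - 1.96 * se_paired, rd + 1.96 * se_paired)
    return rd, ci_indep, ci_paired
```

When outcomes within a pair are positively correlated, the paired estimator typically yields a smaller standard error, and hence a narrower interval, than the independent-samples estimator.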
For each of the 100 scenarios (2 treatment-selection models × 2 probabilities of outcome × 5 covariate scenarios × 5 absolute risk reductions), we simulated 1825 data sets. The above analyses were conducted in each of the 1825 simulated data sets. In the 20 scenarios in which the true risk difference was 0, we estimated the empirical type I error rate as the proportion of simulated data sets in which the null hypothesis of no treatment effect was rejected at a significance level of 0.05. Owing to our use of 1825 simulated data sets, an empirical type I error rate less than 0.04 or greater than 0.06 would be classified as statistically significantly different from 0.05. For each of the 100 scenarios, we determined the proportion of estimated 95 per cent confidence intervals that contained the true risk difference. As above, owing to the use of 1825 simulated data sets, empirical coverage rates less than 0.94 or greater than 0.96 are statistically significantly different from the advertised coverage rate of 0.95. We also determined the mean width of the estimated 95 per cent confidence intervals across the 1825 simulated data sets. Finally, we compared the standard deviation of the empirical sampling distribution of the estimated treatment effects (i.e. the standard deviation of the 1825 estimated risk differences across the simulated data sets) with the mean of the estimated standard errors of the estimated treatment effect.