A wide range of databases of health care utilization data (“claims”) is available for use in pharmacoepidemiology [3]. Each database is organized in its own way and uses a variety of classification systems to code diagnoses (e.g. International Classification of Diseases [ICD]-8 through ICD-10), procedures (e.g. Current Procedural Terminology, Canadian Classification of Procedures, ICD-9-Clinical Modification), or medications (e.g. National Drug Codes, American Hospital Formulary Service, Anatomical Therapeutic Chemical Classification). Beyond these basic data dimensions and coding systems, some databases provide additional dimensions such as laboratory results, other electronic medical record information, and accident registries.
We propose an algorithm that is independent of the specific data source as long as the source’s data dimensions can be identified. In Figure 2 we provide a flow diagram using a typical example of data dimensions available in US Medicare claims data linked to medication use data. First, a temporal window must be defined in which baseline covariates will be identified. A frequent choice is the 6 or 12 months preceding initiation of the study or comparison drug [2]. The recording of diagnoses and procedures is correlated with the frequency of health care encounters, so longer baseline periods capture more encounters and therefore yield more covariate information [2]. The most basic patient information, always available in typical databases, is age, sex, and calendar time. Given their ubiquity, we assume that these demographic covariates will always be adjusted for.
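The baseline window can be illustrated with a short sketch. This is not the authors' implementation: the `Claim` record, the `baseline_claims` function, the day-based approximation of 6 and 12 months, and the toy codes and dates are all assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import NamedTuple


class Claim(NamedTuple):
    """Hypothetical claim record; field names are illustrative."""
    patient_id: str
    service_date: date
    code: str  # e.g. a diagnosis, procedure, or drug code


def baseline_claims(claims, index_dates, window_days=365):
    """Return the claims that fall in each patient's baseline window.

    The window is [index_date - window_days, index_date), so the index date
    itself (treatment initiation) is excluded. Months are approximated in
    days, which is an assumption of this sketch.
    """
    selected = []
    for claim in claims:
        index_date = index_dates.get(claim.patient_id)
        if index_date is None:
            continue  # patient is not in the study cohort
        window_start = index_date - timedelta(days=window_days)
        if window_start <= claim.service_date < index_date:
            selected.append(claim)
    return selected


# Toy data, invented for illustration only.
claims = [
    Claim("p1", date(2004, 1, 10), "250.00"),
    Claim("p1", date(2004, 11, 2), "401.9"),
    Claim("p1", date(2005, 3, 1), "410.0"),  # after the index date
    Claim("p2", date(2003, 5, 5), "365.1"),  # more than 12 months before index
]
index_dates = {"p1": date(2005, 1, 1), "p2": date(2005, 1, 1)}

twelve_months = baseline_claims(claims, index_dates, window_days=365)
six_months = baseline_claims(claims, index_dates, window_days=182)
```

In this toy example the 12-month window retains two of patient p1's claims while the 6-month window retains only one, which mirrors the point that longer baseline periods yield more covariate information.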
Additional covariates can then be identified from the various data dimensions, but it is first necessary to identify variables that should not be part of covariate adjustment. While it is generally recommended to include many covariates in a propensity score regression model, in specific cases researchers may exclude variables from covariate adjustment [17]. Surrogates for the exposure (i.e. covariates that are strong correlates of the study exposure but not associated with the outcome) will not only increase standard errors but may also increase bias, and should therefore not be included in propensity score analyses [18,19]. Bias can also occur through the inclusion of so-called “collider” variables, although this bias is generally thought to be weak [20,21]. In our example study comparing statin initiation with glaucoma drug initiation, diagnostic codes for glaucoma should not be included in a propensity score because of their close correlation with treatment choice [22,23]. At this stage of the procedure, such codes can be identified and removed from the dimension data input to the algorithm. As part of the algorithm, we have developed a screening tool that helps investigators identify and remove such covariates (eAppendix 1, http://links.lww.com).
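The idea of screening for exposure surrogates can be sketched as follows. This is not the screening tool from eAppendix 1, whose details are not given here. It is a minimal illustration that flags binary covariates strongly associated with exposure but roughly unassociated with the outcome. The function names, the risk-ratio thresholds, the continuity correction, and the toy data are all assumptions, and any flagged code would still need review by an investigator with subject-matter knowledge.

```python
def risk_ratio(flag, target):
    """Risk ratio of `target` comparing rows with flag=1 to rows with flag=0.

    A 0.5 continuity correction is added to every cell of the 2x2 table so
    the ratio stays defined when a cell is empty.
    """
    a = sum(1 for f, t in zip(flag, target) if f and t) + 0.5
    b = sum(1 for f, t in zip(flag, target) if f and not t) + 0.5
    c = sum(1 for f, t in zip(flag, target) if not f and t) + 0.5
    d = sum(1 for f, t in zip(flag, target) if not f and not t) + 0.5
    return (a / (a + b)) / (c / (c + d))


def flag_exposure_surrogates(covariates, exposure, outcome,
                             min_exposure_rr=5.0, max_outcome_rr=1.25):
    """Return names of covariates that look like surrogates for exposure.

    Both thresholds are hypothetical. Associations are compared on a
    symmetric scale, so a risk ratio of 0.2 counts the same as 5.0.
    """
    flagged = []
    for name, values in covariates.items():
        rr_exposure = risk_ratio(values, exposure)
        rr_outcome = risk_ratio(values, outcome)
        strong_exposure = max(rr_exposure, 1 / rr_exposure) >= min_exposure_rr
        weak_outcome = max(rr_outcome, 1 / rr_outcome) <= max_outcome_rr
        if strong_exposure and weak_outcome:
            flagged.append(name)
    return flagged


# Toy cohort of 20 patients, invented for illustration only:
# exposure 1 = statin initiator, 0 = glaucoma drug initiator.
exposure = [1] * 10 + [0] * 10
outcome = [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
covariates = {
    "glaucoma_dx": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "diabetes_dx": [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
}

flagged = flag_exposure_surrogates(covariates, exposure, outcome)
```

In this toy cohort only `glaucoma_dx` is flagged: it almost perfectly separates the two treatment groups but carries no information about the outcome, whereas `diabetes_dx` is balanced across treatment groups and associated with the outcome.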
Schneeweiss S., Rassen J.A., Glynn R.J., Avorn J., Mogun H., & Brookhart M.A. (2009). High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology (Cambridge, Mass.), 20(4), 512-522.