We used a two-pronged approach to covariate selection: 1) covariates were selected based on prior knowledge [1,4,6,7,9,10,16–20,24,26,44,51,52]; and 2) in recognition that our knowledge of COVID-19 is evolving, we also employed an algorithmic approach to identify covariates in data domains consisting of diagnoses, medications, and laboratory test results. Pre-defined and algorithmically selected covariates were used in modeling and were assessed in the year prior to T0.
Pre-defined covariates consisted of age, race (white, black, and other), sex, ADI, body mass index, smoking status (current, former, and never), and measures of healthcare utilization (number of outpatient encounters as well as long-term care utilization) [1,16,18]. Additionally, several comorbidities, including cancer, cardiovascular disease, chronic kidney disease, chronic lung disease, diabetes, and hypertension, were used as pre-defined covariates. Laboratory values consisting of estimated glomerular filtration rate and systolic and diastolic blood pressure were also used as pre-defined covariates. Continuous variables were transformed into restricted cubic spline functions to account for possible non-linear relationships.
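To illustrate the spline transformation mentioned above, the following is a minimal sketch of a restricted cubic spline basis in the truncated-power form commonly attributed to Harrell. The function names `rcs_basis` and `default_knots`, the choice of four knots, and the quantile-based knot placement are assumptions made for illustration; the text does not state how many knots were used or where they were placed.

```python
import numpy as np


def default_knots(x, quantiles=(0.05, 0.35, 0.65, 0.95)):
    """Place knots at fixed quantiles of x (assumed placement, not from the paper)."""
    return np.quantile(np.asarray(x, dtype=float), quantiles)


def rcs_basis(x, knots):
    """Return a restricted cubic spline basis with len(knots) - 1 columns.

    Column 0 is x itself; the remaining columns are the non-linear terms.
    The fitted curve is cubic between knots and linear beyond the outer knots.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos_cubed(u):
        # Truncated power function (u)_+^3
        return np.maximum(u, 0.0) ** 3

    last_gap = t[k - 1] - t[k - 2]
    scale = (t[k - 1] - t[0]) ** 2  # keeps columns on a scale comparable to x
    columns = [x]
    for j in range(k - 2):
        term = (
            pos_cubed(x - t[j])
            - pos_cubed(x - t[k - 2]) * (t[k - 1] - t[j]) / last_gap
            + pos_cubed(x - t[k - 1]) * (t[k - 2] - t[j]) / last_gap
        )
        columns.append(term / scale)
    return np.column_stack(columns)
```

Under these assumptions, a continuous covariate such as age would be replaced in the model by the columns of `rcs_basis(age, default_knots(age))`, so that a regression on those columns can capture a non-linear relationship while remaining linear in the tails.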
To supplement our pre-defined covariates, we used algorithmically selected covariates from high-dimensional data domains consisting of diagnoses, medications, and laboratory test results [53]. Data from patient encounter, prescription, and laboratory domains collected in the year prior to T0 were organized into 540 diagnostic groups, 543 medication types, and 62 laboratory test abnormalities. From these three domains, we selected variables that occurred in at least 100 participants within each exposure group, because exceedingly rare variables (those occurring in fewer than 100 participants in these cohorts) may not substantially influence the examined associations. The univariate relative risk between each variable and exposure was estimated, and the 100 variables with the highest relative risks were selected for use in statistical analyses [54]. The algorithmic selection process described above was applied independently in each comparison (for example, the COVID-19 vs contemporary control and the COVID-19 vs historical control analyses to assess incident GERD).
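The selection steps described above (prevalence filter, univariate relative risk, top-100 ranking) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `select_hd_covariates`, the binary indicator-matrix input, and in particular the definition of the relative risk as the risk of being exposed among participants with the variable divided by the risk among those without it are all assumptions, since the text does not spell out the exact formula.

```python
import numpy as np


def select_hd_covariates(X, exposed, min_count=100, top_k=100):
    """Rank candidate binary covariates by univariate relative risk with exposure.

    X        : (n_participants, n_variables) 0/1 indicator matrix (assumed layout)
    exposed  : length-n 0/1 vector, 1 = exposed group, 0 = comparison group
    Returns (indices of selected variables, relative risk per variable), where
    ineligible variables have a relative risk of -inf.
    """
    X = np.asarray(X, dtype=bool)
    exposed = np.asarray(exposed, dtype=bool)

    # Step 1: require at least min_count occurrences within each exposure group.
    n_in_exposed = X[exposed].sum(axis=0)
    n_in_control = X[~exposed].sum(axis=0)
    eligible = (n_in_exposed >= min_count) & (n_in_control >= min_count)

    # Step 2: relative risk of exposure, variable present vs absent (assumed form).
    n_with = n_in_exposed + n_in_control
    n_without = X.shape[0] - n_with
    exposed_without = exposed.sum() - n_in_exposed
    with np.errstate(divide="ignore", invalid="ignore"):
        rr = (n_in_exposed / n_with) / (exposed_without / n_without)
    rr = np.where(eligible & np.isfinite(rr), rr, -np.inf)

    # Step 3: keep the top_k variables with the highest relative risks.
    order = np.argsort(-rr, kind="stable")
    n_keep = min(top_k, int(np.sum(rr > -np.inf)))
    return order[:n_keep], rr
```

Because the selection was run independently for each comparison, a function like this would be called once per comparison, for example once with the contemporary control group and once with the historical control group as the comparison arm, yielding a separate covariate set each time.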
Xu, E., Xie, Y., & Al-Aly, Z. (2023). Long-term gastrointestinal outcomes of COVID-19. Nature Communications, 14, 983.