Example 2
Analysis Populations
The “Discovery-Full Analysis Set” (“Discovery-FAS”) consisted of pilot study patients with clinical data and a CT-based designation of either Revascularization CAD case, Native CAD case, or Control (N=748 for the Discovery-FAS group).
The “Discovery-Native CAD Set” was the subset of the Discovery-FAS with Native CAD as verified by CT, who had analyte (metabolomic) data (N=366 for the Discovery-Native CAD Set). These were subjects without previous revascularization procedures, such as percutaneous coronary intervention (PCI) or coronary artery bypass grafting (CABG).
The “Discovery-Revasc CAD Set” was the subset of the Discovery-FAS who had undergone previous revascularization, such as percutaneous coronary intervention (PCI) or coronary artery bypass grafting (CABG), and who had analyte data (N=44).
The “Discovery-All CAD Set” was the union of the Discovery-Native CAD Set and the Discovery-Revasc CAD Set (N=410).
The “Discovery-Control Set” was the subset of Discovery-FAS who had a calcium score of zero and were designated a Control after inspection of CT data, and who had analyte data. (N=338 for the Discovery-Control Set.)
The “Validation-Full Analysis Set” (“Validation-FAS”) consisted of pilot study patients with clinical data and a CT-based designation of either Revascularization CAD case, Native CAD case, or Control (N=348 for the Validation-FAS group).
The “Validation-Native CAD Set” was the subset of the Validation-FAS with Native CAD as verified by CT, who had analyte (metabolomic) data (N=207 for the Validation-Native CAD Set). These were subjects without previous revascularization procedures, such as percutaneous coronary intervention (PCI) or coronary artery bypass grafting (CABG).
The “Validation-Revasc CAD Set” was the subset of the Validation-FAS who had undergone previous revascularization, such as percutaneous coronary intervention (PCI) or coronary artery bypass grafting (CABG), and who had analyte data (N=15).
The “Validation-All CAD Set” was the union of the Validation-Native CAD Set and the Validation-Revasc CAD Set (N=222).
The “Validation-Control Set” was the subset of Validation-FAS who had a calcium score of zero and were designated a Control after inspection of CT data, and who had analyte data. (N=126 for the Validation-Control Set.)
It is noted that by design, the only racial group represented in the study was White. Therefore, race-based sub-populations were not defined.
A. Study Endpoints
For the GLOBAL Pilot Discovery Cohort, there were four primary endpoints in the analysis: (1) Native CAD; (2) All CAD (Native or Revascularization); (3) 50% Stenosis without Revascularization; (4) 50% Stenosis or Revascularization. All analyses were applied to all primary endpoints.
B. Statistical Hypothesis
The null hypothesis of no association between the metabolite or lipid and the endpoint was tested against the two-sided alternative that an association exists.
C. Multiple Comparisons and Multiplicity
False discovery rate (FDR) q-values were calculated (Benjamini and Hochberg, 1995). Associations with FDR q<0.05 were considered preliminary associations. In some circumstances, test results with raw p<0.05 were reported as well.
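By way of illustration only, the Benjamini-Hochberg adjustment referred to above may be sketched as follows. The function name `bh_qvalues` and the example p-values are hypothetical and are not taken from the study data.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values are monotone in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical raw p-values for five analytes.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
q = bh_qvalues(raw_p)
# Indices of analytes that would count as preliminary associations (q < 0.05).
preliminary = [i for i, qv in enumerate(q) if qv < 0.05]
```

In this made-up example, the third and fourth analytes have raw p<0.05 but q>0.05, which is the situation in which a result might be reported on the basis of its raw p-value only.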
D. Missing Data
Endpoint data was not imputed. Potential covariates with more than 5% missing data were excluded. Potential covariates with less than 5% missing data were imputed to the mean.
Metabolites with more than 10% missing data were excluded from the main analyses. Missing values for metabolites and lipids with less than 10% missing were imputed to the observed minimum after normalization.
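The missing-data rules above may be sketched as follows, under stated assumptions. Missing values are represented by `None`, the function names are hypothetical, and a variable with exactly 5% (or 10%) missing data is retained, a boundary case the text does not specify.

```python
def impute_covariate(values, max_missing=0.05):
    """Mean-impute a covariate, or return None if it is excluded."""
    observed = [v for v in values if v is not None]
    n_missing = len(values) - len(observed)
    # Compare counts rather than fractions to avoid rounding at the boundary.
    if n_missing > max_missing * len(values):
        return None
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_analyte(values, max_missing=0.10):
    """Impute a normalized analyte to its observed minimum, or return None if excluded."""
    observed = [v for v in values if v is not None]
    n_missing = len(values) - len(observed)
    if n_missing > max_missing * len(values):
        return None
    floor = min(observed)
    return [floor if v is None else v for v in values]


# Hypothetical inputs.
covariate = [24.0] * 24 + [None]            # 4% missing: imputed to the mean
sparse = [1.0, None, None, 2.0]             # 50% missing: excluded
analyte = [3.2, 1.5, None] + [2.0] * 17     # 5% missing: imputed to the minimum

covariate_imputed = impute_covariate(covariate)
sparse_imputed = impute_covariate(sparse)
analyte_imputed = impute_analyte(analyte)
```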
E. Analysis of Subgroups
The first and third primary endpoints were addressed using a subset of the FAS. Specifically, the Native CAD Set and the Control Set were considered, to the exclusion of the Revasc CAD Set. For the purposes of discovery, further subsets were created on the basis of participants' fasting status: patients were categorized as ‘Fasting’ if they had not eaten for eight or more hours. The remainder, either known not to be fasted or with unknown fasting status, were categorized as ‘Non-Fasting’. See
I. Demographic and Baseline Characteristics
The baseline and demographic characteristics of patients in the pilot study were tabulated. Continuous variables were summarized by the mean and standard error; binary variables were summarized by the count and percentage.
Table 27 shows general patient characteristics for the Discovery Set by clinical group (Revasc CAD vs. Native CAD vs. Control). A Kruskal-Wallis test was performed to investigate homogeneity of continuous measures; a Pearson's chi-squared test was conducted for binary measures; unadjusted p-values are reported.
Table 28 shows general patient characteristics for the Validation Set by clinical group (Revasc CAD vs. Native CAD vs. Control). A Kruskal-Wallis test was performed to investigate homogeneity of continuous measures; a Pearson's chi-squared test was conducted for binary measures; unadjusted p-values are reported.
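The two homogeneity tests named above may be illustrated with the following minimal sketch. It assumes three clinical groups, so that both the Kruskal-Wallis statistic and the Pearson chi-squared statistic for a binary measure are referred to a chi-squared distribution with two degrees of freedom, whose upper tail is exp(-x/2). The tie correction for the Kruskal-Wallis statistic is omitted, and all function names and data are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [v for group in groups for v in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def pearson_chi2(table):
    """Pearson chi-squared statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_two_df(stat):
    """Upper tail of the chi-squared distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2)


# Hypothetical continuous measure in three groups, and a binary measure
# tabulated as [count with trait, count without trait] per group.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
chi2_stat = pearson_chi2([[10, 20], [20, 10], [15, 15]])
p_continuous = p_value_two_df(h_stat)
p_binary = p_value_two_df(chi2_stat)
```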
Sample preparation and mass spectrometry analyses were conducted by Metabolon, Inc. The raw data contained a total of 1088 analytes, measured for 1096 pilot study participants.
Of the 1088 analytes (including unnamed metabolites and complex lipids), 481 named metabolites had less than 10% missing data. All 1096 patients had less than 10% missing data for these metabolites. Statistical analyses were therefore applied to 481 analytes and 1096 patients. The data was normalized in advance of receipt. A logarithm (base 2) transformation was applied and histograms were created to show the distribution of expression by analyte (data not shown).
The metabolomics data were generated in multiple batches; however, a principal components analysis (PCA) showed no evidence of any systematic site effects.
III. Prediction Modeling for Primary Endpoints
Methods. Patients in the Discovery-FAS Set were categorized according to whether they had fasted for at least eight hours. By this criterion, a total of 377 participants were Fasted and 371 were Non-Fasted. Association testing, with adjustment for age and gender, was conducted for the four primary endpoints, and nominal associations were defined in three ways as follows:
- 1. Significant in Fasting and Non-Fasting combined
- 2. Significant in Fasting and Non-Fasting independently
- 3. Significant in Fasting alone
It is emphasized that, at this stage, ‘significant’ pertains to any association with raw, unadjusted p<0.05.
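The three definitions of nominal association may be sketched as follows. The sketch assumes that ‘independently’ means raw p<0.05 in the Fasting stratum and, separately, in the Non-Fasting stratum; the function and argument names are hypothetical.

```python
def nominal_definitions(p_combined, p_fasting, p_nonfasting, alpha=0.05):
    """Evaluate the three definitions of a nominal association for one analyte.

    Each argument is a raw, unadjusted p-value from an age- and
    gender-adjusted association test in the named group of patients.
    """
    return {
        1: p_combined < alpha,                          # Fasting and Non-Fasting combined
        2: p_fasting < alpha and p_nonfasting < alpha,  # each stratum on its own
        3: p_fasting < alpha,                           # Fasting alone
    }


# Hypothetical p-values for a single analyte and endpoint.
flags = nominal_definitions(p_combined=0.01, p_fasting=0.03, p_nonfasting=0.20)
```

Applying the three definitions to each of the four primary endpoints gives the twelve scenarios considered.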
In this way, twelve scenarios were considered as follows:
- a) Atherosclerosis in Native CAD—AnCAD
  - i. Significant in Fasting & Non-Fasting Combined—[Figure (not displayed)]
  - ii. Independently Significant in Fasting and Non-Fasting—[Figure (not displayed)]
  - iii. Significant in Fasting—[Figure (not displayed)]
- b) Atherosclerosis in All CAD (including revascularization)—AaCAD
  - i. Significant in Fasting & Non-Fasting Combined—[Figure (not displayed)]
  - ii. Independently Significant in Fasting and Non-Fasting—[Figure (not displayed)]
  - iii. Significant in Fasting—[Figure (not displayed)]
- c) 50% stenosis in Native CAD—SnCAD
  - i. Significant in Fasting & Non-Fasting Combined—[Figure (not displayed)]
  - ii. Independently Significant in Fasting and Non-Fasting—[Figure (not displayed)]
  - iii. Significant in Fasting—[Figure (not displayed)]
- d) 50% stenosis in All CAD (including revascularization)—SaCAD
  - i. Significant in Fasting & Non-Fasting Combined—[Figure (not displayed)]
  - ii. Independently Significant in Fasting and Non-Fasting—[Figure (not displayed)]
  - iii. Analytes Significant in Fasting—[Figure (not displayed)]
Twelve prediction models were obtained by generalized linear (logistic) regression as follows. When fewer than nine variables had p<0.05, Age and Gender were added to the variables, and the full model was fitted. When more than nine variables had p<0.05, Age and Gender were added to the variables, and gradient boosting (see below) was applied to select nine predictors; the nine variables selected by gradient boosting were then combined with Age and Gender in a generalized linear (logistic) model.
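The rule for assembling the predictors of each model may be sketched as follows. The sketch is illustrative only: `select_top` stands in for the gradient boosting selection step and is a hypothetical callable, and the case of exactly nine significant variables, which the text does not address, is treated here like the smaller case.

```python
def build_predictor_list(significant, select_top, limit=9, forced=("age", "gender")):
    """Return the predictors entering the logistic model for one scenario.

    significant: names of variables with raw p < 0.05
    select_top:  callable(candidates, k) returning k selected names
    """
    candidates = list(significant) + [f for f in forced if f not in significant]
    if len(significant) > limit:
        # Too many candidates: let the selection step pick `limit` of them.
        chosen = list(select_top(candidates, limit))
    else:
        # Few enough candidates: fit the full model.
        chosen = list(significant)
    # Age and gender always enter the final model.
    return chosen + [f for f in forced if f not in chosen]


def first_k(candidates, k):
    """Placeholder selector used only to make the example runnable."""
    return candidates[:k]


many = ["m%d" % i for i in range(1, 13)]   # 12 hypothetical analytes
few = ["m1", "m2", "m3", "m4"]             # 4 hypothetical analytes
model_many = build_predictor_list(many, first_k)
model_few = build_predictor_list(few, first_k)
```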
Gradient boosting is an approach to determine a regression function that minimizes the expectation of a loss function (Friedman J H (2001) and Friedman J H (2002)). It is an iterative method, in which the negative gradient of the loss function is calculated, a regression model is fitted, the gradient descent step size is selected, and the regression function is updated. The gradient is approximated by means of a regression tree, which makes use of covariate information, and at each iteration the gradient determines the direction in which the function needs to move, in order to improve the fit to the data.
The loss function was assumed Bernoulli, due to the binary nature of the primary endpoints. A learning rate (λ) was introduced to dampen proposed moves and to protect against over-fitting. The optimal number of iterations, given by T, was determined by 5-fold cross-validation. The minimum number of observations in each terminal node was 10. Two-way interactions were allowed. Random sub-sampling, without replacement, of half of the observations was applied to achieve variance reduction in gradient estimation.
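The iteration described above may be illustrated by the following simplified sketch. It is not the implementation used in the study: each tree is a single split (so no interactions are modeled), the step within each terminal node is the mean gradient rather than an optimized step size, the number of iterations is fixed rather than chosen by cross-validation, and all names and data are hypothetical. It does retain the Bernoulli loss, the learning rate, the minimum of 10 observations per terminal node, and the sub-sampling of half of the observations without replacement.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, grad, rows, min_node=10):
    """Least-squares one-split tree fitted to the gradient on the sampled rows.

    Returns (feature, cut, left_value, right_value, gain), or None if no
    split leaves at least min_node observations in each terminal node.
    """
    best = None
    total = sum(grad[i] for i in rows)
    for f in range(len(X[0])):
        ordered = sorted(rows, key=lambda i: X[i][f])
        left_sum = 0.0
        for k in range(1, len(ordered)):
            left_sum += grad[ordered[k - 1]]
            lo, hi = X[ordered[k - 1]][f], X[ordered[k]][f]
            if lo == hi or k < min_node or len(ordered) - k < min_node:
                continue
            right_sum = total - left_sum
            # Reduction in squared error achieved by this split.
            gain = (left_sum ** 2 / k + right_sum ** 2 / (len(ordered) - k)
                    - total ** 2 / len(ordered))
            if best is None or gain > best[4]:
                best = (f, (lo + hi) / 2, left_sum / k,
                        right_sum / (len(ordered) - k), gain)
    return best


def boost(X, y, n_iter=100, rate=0.1, seed=0):
    """Return the fitted stumps and the relative influence (%) of each feature."""
    rng = random.Random(seed)
    n = len(y)
    prior = sum(y) / n
    scores = [math.log(prior / (1 - prior))] * n   # start at the log-odds of the prevalence
    influence = [0.0] * len(X[0])
    stumps = []
    for _ in range(n_iter):
        # Negative gradient of the Bernoulli negative log-likelihood.
        grad = [y[i] - sigmoid(scores[i]) for i in range(n)]
        rows = rng.sample(range(n), n // 2)        # half the data, without replacement
        stump = fit_stump(X, grad, rows)
        if stump is None:
            break
        f, cut, left, right, gain = stump
        stumps.append(stump)
        influence[f] += gain
        for i in range(n):
            # The learning rate dampens each proposed move.
            scores[i] += rate * (left if X[i][f] <= cut else right)
    total = sum(influence) or 1.0
    return stumps, [100 * v / total for v in influence]


# Synthetic data: only the first of three features determines the outcome.
demo_rng = random.Random(1)
X_demo = [[demo_rng.random() for _ in range(3)] for _ in range(200)]
y_demo = [1 if row[0] > 0.5 else 0 for row in X_demo]
stumps_demo, influence_demo = boost(X_demo, y_demo, n_iter=20)
```

In this synthetic example the first feature receives the largest relative influence, since it alone carries information about the outcome.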
For current purposes, 50 rounds of gradient boosting were run for each scenario, and the nine variables most often showing highest estimated relative influence were taken forwards to generalized linear modeling.
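One way to carry out the tally across repeated rounds is sketched below. It assumes that a variable ‘shows highest relative influence’ in a round when it ranks among the top nine in that round; this reading, the function name, and the example values are assumptions rather than details stated above.

```python
from collections import Counter


def most_frequent_top(influence_runs, top_k=9):
    """Return the top_k variables that most often rank in the top_k by influence.

    influence_runs: one dict per boosting round, mapping variable name
                    to its estimated relative influence in that round.
    """
    counts = Counter()
    for influence in influence_runs:
        ranked = sorted(influence, key=influence.get, reverse=True)
        counts.update(ranked[:top_k])
    return [name for name, _ in counts.most_common(top_k)]


# Three hypothetical rounds with three variables, keeping the top two.
runs = [
    {"a": 5.0, "b": 3.0, "c": 1.0},
    {"a": 4.0, "c": 3.0, "b": 1.0},
    {"b": 6.0, "a": 2.0, "c": 1.0},
]
selected = most_frequent_top(runs, top_k=2)
```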
The twelve models were used to generate probability predictions for each patient in the Validation-FAS. For each model, the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were calculated for the range of predicted probability thresholds. A Receiver Operating Characteristic (ROC) curve was generated to plot sensitivity as a function of (1-specificity). The optimal classification threshold was determined on the basis of accuracy, defined as the proportion of correct predictions. In addition, the Area Under the Curve (AUC) and accuracy were estimated (Tables 27, 28, 29, 30 for the four primary endpoints, respectively).
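The performance measures described above may be computed as in the following minimal sketch. A patient is classified as a case when the predicted probability is at or above the threshold, and the AUC is computed as the proportion of case-control pairs in which the case has the higher predicted probability (ties counting one half), which equals the area under the ROC curve. The function names and data are hypothetical.

```python
def metrics(y_true, probs, threshold):
    """Sensitivity, specificity, PPV, NPV and accuracy at one threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        if p >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
        else:
            if y == 1:
                fn += 1
            else:
                tn += 1

    def ratio(a, b):
        return a / b if b else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, probs):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    cases = [p for y, p in zip(y_true, probs) if y == 1]
    controls = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def best_accuracy_threshold(y_true, probs):
    """Lowest observed probability threshold that maximizes accuracy."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: metrics(y_true, probs, t)["accuracy"])


# Hypothetical outcomes (1 = case) and predicted probabilities.
y_demo = [0, 0, 1, 1, 0, 1]
p_demo = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
threshold = best_accuracy_threshold(y_demo, p_demo)
summary = metrics(y_demo, p_demo, threshold)
area = auc(y_demo, p_demo)
```

Each point of the ROC curve corresponds to one threshold, with (1-specificity) on the horizontal axis and sensitivity on the vertical axis.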
The performance of model-based predictions was compared to the performance of probability predictions obtained by Diamond-Forrester scoring (Diamond and Forrester (1979)).
Detailed Results for Native CAD
The results show that the Diamond-Forrester score provides poor prediction of the GLOBAL phenotypes (
Metabolomics Model
I. Atherosclerosis in Native CAD—AnCAD
- a. Significant in Fasting & Non-Fasting Combined—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 83 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 29 provides a list of the 83 metabolomic variables for [Figure (not displayed)].
- b. Independently Significant in Fasting and Non-Fasting—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 4 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 31 provides a list of the 4 metabolomic variables for [Figure (not displayed)].
Of the 4 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of all four metabolomic variables were selected as best predictors; these were combined with age and gender in a prediction model for CAD. Table 32 provides the relative influence of the four metabolomic variables, in combination with age and gender, for the Metabolomics Model of [Figure (not displayed)].
- c. Significant in Fasting—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 34 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 33 provides a list of the 34 metabolomic variables for [Figure (not displayed)].
Of the 34 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of eight metabolomic variables were selected as best predictors; these were combined with age and gender in a prediction model for CAD. Table 34 provides the relative influence of the eight metabolomic variables, in combination with age and gender, for the Metabolomics Model of [Figure (not displayed)].
II. Atherosclerosis in All CAD (inc revasc)—AaCAD
- a. Significant in Fasting & Non-Fasting Combined—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 92 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 35 provides a filtered list of the 92 metabolomic variables for [Figure (not displayed)].
Of the 92 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of eight metabolomic variables were selected as best predictors; these were combined with age and gender in a prediction model for CAD. Table 36 provides the relative influence of the eight metabolomic variables combined with age and gender for the Metabolomics Model of [Figure (not displayed)].
- b. Independently Significant in Fasting and Non-Fasting—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 6 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 37 provides a list of the 6 metabolomic variables for [Figure (not displayed)].
Of the 6 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of all six metabolomic variables were selected as best predictors; these were combined with age and gender in a prediction model for CAD. Table 38 provides the relative influence of the six metabolomic variables, in combination with age and gender, for the Metabolomics Model of [Figure (not displayed)].
- c. Significant in Fasting—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 48 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 39 provides a list of the 48 metabolomic variables for [Figure (not displayed)].
Of the 48 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of seven metabolomic variables were selected as best predictors; these were combined with age and gender in a prediction model for CAD. Table 40 provides the relative influence of the seven metabolomic variables, in combination with age and gender, for the Metabolomics Model of [Figure (not displayed)].
III. 50% stenosis in Native CAD—SnCAD
- a. Significant in Fasting & Non-Fasting Combined—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 49 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 41 provides a list of the 49 metabolomic variables for [Figure (not displayed)].
Of the 49 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of eight metabolomic variables were selected as best predictors; these were combined with age and gender in a prediction model for CAD. Table 42 provides the relative influence of the eight metabolomic variables, in combination with age and gender, for the Metabolomics Model of [Figure (not displayed)].
- b. Independently Significant in Fasting and Non-Fasting—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 2 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 43 provides a list of the 2 metabolomic variables for [Figure (not displayed)].
Of the 2 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of both variables were selected as best predictors; these were combined with age and gender in a prediction model for CAD. Table 44 provides the relative influence of the two metabolomic variables in combination with age and gender for the Metabolomics Model of [Figure (not displayed)].
- c. Significant in Fasting—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 28 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 45 provides a filtered list of the 28 metabolomic variables for [Figure (not displayed)].
Of the 28 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of eight metabolomic variables were selected as best predictors; they were combined with age and gender in a prediction model for CAD. Table 46 provides the relative influence of the eight metabolomic variables, in combination with age and gender, for the Metabolomics Model of [Figure (not displayed)].
IV. 50% stenosis in ALL CAD (inc revasc)—SaCAD
- a. Significant in Fasting & Non-Fasting Combined—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 72 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 47 provides a list of the 72 metabolomic variables for [Figure (not displayed)].
Of the 72 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of eight metabolomic variables were selected as best predictors; these were combined with age and gender in a prediction model for CAD. Table 48 provides the relative influence of the eight metabolomic variables in combination with age and gender for the Metabolomics Model of [Figure (not displayed)].
- b. Independently Significant in Fasting and Non-Fasting—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 5 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 49 provides a filtered list of the 5 metabolomic variables for [Figure (not displayed)].
Of the 5 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of all five metabolomic variables were selected as best predictors; these were combined with age and gender in a prediction model for CAD. Table 50 provides the relative influence of the five metabolomic variables in combination with age and gender for the Metabolomics Model of [Figure (not displayed)].
- c. Analytes Significant in Fasting—[Figure (not displayed)]
  - i. Of the 481 analytes measured, 40 metabolomic variables exhibited a nominal univariate association (raw p<0.05) for [Figure (not displayed)]. Table 51 provides a filtered list of the 40 metabolomic variables for [Figure (not displayed)].
Of the 40 metabolomic variables exhibiting a nominal univariate association for [Figure (not displayed)], a panel of eight metabolomic variables were selected as best predictors; these were combined with age and gender in a prediction model for CAD. Table 52 provides the relative influence of the eight metabolomic variables in combination with age and gender for the Metabolomics Model of [Figure (not displayed)].
For each model below, the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were calculated for the range of predicted probability thresholds (Tables 53, 54, 55, 56). A Receiver Operating Characteristic (ROC) curve was generated to plot sensitivity as a function of (1-specificity). The optimal classification threshold was determined on the basis of accuracy, defined as the proportion of correct predictions. In addition, the Area Under the Curve (AUC) and accuracy were estimated (Tables 53, 54, 55, 56 for Native CAD, All CAD, 50% stenosis in Native CAD, and 50% stenosis in All CAD, respectively). The first row for each model indicates the performance at the maximum accuracy threshold, that is, the optimal balance between sensitivity and specificity. Those models with a second row were optimized for a high negative predictive value (NPV).
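The text does not state how the threshold for a high NPV was chosen. Purely as an illustration, one possible rule is sketched below: among the observed thresholds whose NPV meets a target, take the one with the highest specificity. The target value of 0.95, the selection rule, and all names and data are assumptions.

```python
def npv_and_specificity(y_true, probs, threshold):
    """NPV and specificity when patients below the threshold are called negative."""
    tn = sum(1 for y, p in zip(y_true, probs) if p < threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, probs) if p < threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    npv = tn / (tn + fn) if tn + fn else None
    spec = tn / (tn + fp) if tn + fp else None
    return npv, spec


def high_npv_threshold(y_true, probs, target_npv=0.95):
    """Return (threshold, specificity, npv) for the hypothetical rule, or None."""
    best = None
    for t in sorted(set(probs)):
        npv, spec = npv_and_specificity(y_true, probs, t)
        if npv is None or spec is None or npv < target_npv:
            continue
        # Among qualifying thresholds, prefer the one ruling out the most controls.
        if best is None or spec > best[1]:
            best = (t, spec, npv)
    return best


# Hypothetical outcomes (1 = case) and predicted probabilities.
y_demo = [0, 0, 1, 1, 0, 1]
p_demo = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
choice = high_npv_threshold(y_demo, p_demo)
```

A threshold chosen in this way will generally differ from the maximum accuracy threshold, because it trades accuracy for fewer missed cases among patients classified as negative.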