If the goal is to evaluate pesticide use and analyte levels in carpet dust, represented by the β parameters, then the Tobit regression of Equation 1 is sufficient and no imputation is required. For further analysis or for graphical display, however, it is useful to generate values for measurements below DLs. We consider several approaches, including inserting DL/2, inserting E[Z|Z < DL], and using single or multiple imputation (Little and Rubin 1987).
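For the simple substitution options, the conditional mean below the DL has a closed form under the log-normal model. The sketch below is illustrative, not part of the original analysis; the function name and parameter values are ours:

```python
import math
from statistics import NormalDist

def lognormal_mean_below_dl(mu, sigma, dl):
    """E[Z | Z < DL] for Z ~ LogNormal(mu, sigma^2).

    Standard closed form: exp(mu + sigma^2/2) * Phi(a - sigma) / Phi(a),
    where a = (log DL - mu) / sigma and Phi is the standard normal CDF.
    """
    a = (math.log(dl) - mu) / sigma
    phi = NormalDist().cdf
    return math.exp(mu + sigma ** 2 / 2) * phi(a - sigma) / phi(a)

# With mu = 0, sigma = 1, DL = 1 the conditional mean is about 0.52,
# slightly above the naive DL/2 substitute of 0.5 for this DL.
```

Unlike DL/2, this substitute adapts to the shape of the fitted distribution, but it still inserts a single deterministic value and so understates variability, which motivates the imputation procedure below.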
A multiple imputation procedure is carried out as follows. Using all data (measured concentrations, missing data types I–III, and covariates), we create the log-likelihood function of Equation 1, solve for the MLEs of β and σ² (denoted β̂ and σ̂²), and impute a value by randomly sampling from a log-normal distribution with the estimated parameters. However, in selecting fill-in values we cannot ignore that β̂ and σ̂² are themselves estimates with uncertainties. We therefore do not use β̂ and σ̂² for the imputation, but rather β̃ and σ̃², which are estimated from a bootstrap sample of the data (Efron 1979). Bootstrap data are generated as described below by sampling with replacement and represent a sample from the same universe as the original data. We repeat the process to create multiple data sets, which are then independently analyzed and combined in a way that accounts for the imputation. Differences in regression results across the multiple data sets reflect variability due to the imputation process.
This procedure, however, omits a source of variability: we have tacitly assumed that the bounds LB and UB are fixed and known in advance. When there are no interfering compounds (missing type I), the assumption is justified because the DL is determined before the GC/MS dust analysis. When there are interfering compounds (missing types II and III), the assumption cannot be fully justified because the bounds depend on the amount of interference and are therefore random. In the NHL data, we assume this uncertainty is small relative to the other uncertainties. The imputation proceeds as follows:
Step 1: Create a bootstrap sample and obtain estimates β̃ and σ̃² based on Equation 2. Bootstrap data are generated by sampling with replacement n times from the n subjects. Sampling "with replacement" selects one record at random, "puts it back," and then selects a second record. After n repetitions, some subjects are selected multiple times, whereas others are not selected at all. If wᵢ is the number of times the ith subject is sampled, then the log-likelihood function for the bootstrap data is the weighted sum of the individual subjects' contributions to Equation 1, ℓ(β, σ²) = Σᵢ wᵢ ℓᵢ(β, σ²) (Equation 2).
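Step 1's weighted log-likelihood can be sketched as follows, under the assumptions of the text (log-normal model; an interval {LB, UB} for each nondetect). The function and variable names are ours, and the numerical safeguards a production fit would need are omitted:

```python
import math
from statistics import NormalDist

def bootstrap_loglik(beta, sigma, X, y, lb, ub, detected, w):
    """Weighted Tobit log-likelihood for one bootstrap resample.

    w[i] is the number of times subject i was drawn (0 if never drawn).
    Detected subjects contribute the normal log-density of log(y_i);
    nondetects contribute log{F(UB_i) - F(LB_i)}, with F the log-normal
    CDF evaluated via the standard normal CDF on the log scale.
    """
    nd = NormalDist()
    ll = 0.0
    for i, wi in enumerate(w):
        mu = sum(b * x for b, x in zip(beta, X[i]))  # beta^t X_i
        if detected[i]:
            z = (math.log(y[i]) - mu) / sigma
            ll += wi * (-0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi)))
        else:
            f_lb = nd.cdf((math.log(lb[i]) - mu) / sigma) if lb[i] > 0 else 0.0
            f_ub = nd.cdf((math.log(ub[i]) - mu) / sigma)
            ll += wi * math.log(f_ub - f_lb)
    return ll
```

Maximizing this over (β, σ) for each resample yields the β̃ and σ̃² used in step 2; setting every wᵢ = 1 recovers the original-sample log-likelihood of Equation 1.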
Step 2: Impute analyte values based on sampling from LN(β̃ᵗX, σ̃²). For the ith subject, assign the value F⁻¹(uᵢ; β̃ᵗXᵢ, σ̃²), where uᵢ ~ Unif[F(LBᵢ; β̃ᵗXᵢ, σ̃²), F(UBᵢ; β̃ᵗXᵢ, σ̃²)].
This quantity consists of several elements. F(LBᵢ; β̃ᵗXᵢ, σ̃²) and F(UBᵢ; β̃ᵗXᵢ, σ̃²) are the cumulative probabilities at LBᵢ and UBᵢ, respectively, based on the parameters β̃ and σ̃²; both lie between zero and one. Select uᵢ randomly from a uniform distribution on the interval [a, b], denoted Unif[a, b], here the interval [F(LBᵢ; β̃ᵗXᵢ, σ̃²), F(UBᵢ; β̃ᵗXᵢ, σ̃²)]. The inverse cumulative distribution function F⁻¹(·) then yields the imputed value in original units between LBᵢ and UBᵢ. Repeat using the same β̃ and σ̃² for each missing value. Detected values are not altered.
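The inverse-CDF draw of step 2 can be sketched directly. The helper name and arguments below are illustrative, with mu standing for β̃ᵗXᵢ:

```python
import math
import random
from statistics import NormalDist

def impute_interval(mu, sigma, lb, ub, rng=random):
    """One imputed concentration between LB and UB, drawn from
    LN(mu, sigma^2) conditioned on the interval via inverse-CDF sampling.
    """
    nd = NormalDist(mu, sigma)           # distribution of log concentration
    p_lb = nd.cdf(math.log(lb)) if lb > 0 else 0.0
    p_ub = nd.cdf(math.log(ub))
    u = rng.uniform(p_lb, p_ub)          # Unif[F(LB), F(UB)]
    u = min(max(u, 1e-12), 1 - 1e-12)    # keep strictly inside (0, 1)
    return math.exp(nd.inv_cdf(u))       # F^-1(u), back to original units
```

By construction every draw lands between LB and UB, and the same β̃ and σ̃² are reused for all of a data set's missing values before moving on to the next bootstrap resample.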
Step 3: Repeat steps 1 and 2 to create M plausible (or "fill-in") data sets. Remarkably, M need not be large, and a recommended value is between 3 and 5, with larger values if greater proportions of data are missing (Little and Rubin 1987; Rubin 1987). We select M = 10 to fully account for the variance from the imputation.
Step 4: Fit a regression model to each of the M data sets and obtain M sets of parameter estimates and covariance matrices. Combine the M sets of estimates to account for the imputation (Little and Rubin 1987; Schafer 1997). The imputation procedure results in confidence intervals (CIs) that are wider than those from a single-imputation, fill-in approach.
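Step 4's combination follows Rubin's rules. A minimal sketch for a single coefficient, with illustrative numbers that are not from the NHL analysis:

```python
import math

def pool_rubin(estimates, variances):
    """Rubin's rules for M completed-data analyses of one parameter:
    pooled estimate = mean of the M estimates; total variance =
    within-imputation variance + (1 + 1/M) * between-imputation variance.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# M = 10 illustrative coefficient estimates and their sampling variances
est = [0.50, 0.48, 0.53, 0.47, 0.52, 0.49, 0.51, 0.50, 0.46, 0.54]
var = [0.010] * 10
q, t = pool_rubin(est, var)
# Normal-approximation 95% CI; Rubin's exact rule uses a t reference
# distribution with degrees of freedom driven by the within/between ratio.
ci = (q - 1.96 * math.sqrt(t), q + 1.96 * math.sqrt(t))
```

Because the between-imputation term is added to the within-imputation variance, the total variance exceeds the average completed-data variance, which is why these CIs are wider than those from a single fill-in.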