Single-value imputation replaces missing values with a constant or a randomly selected value. In microarray-based gene expression analyses, these simple replacement procedures have performed poorly compared with more advanced approaches;20 however, they may perform well when the missing values are largely left-censored and are therefore evaluated here. Left-censoring means that values are missing from the low-intensity end (i.e., the left tail) of the full distribution of possible measured intensity values. Data censored in this way are considered to be NMAR.
One approach to selecting a replacement value for a dataset is to use a minimal observed value as an estimate of the limit of detection (LOD). Half of the global minimum and half of the peptide minimum are common approaches currently used in the proteomics community to fill in missing values.40,41 Half of the global minimum (LOD1) is defined as half of the minimal observed intensity value (not on the log scale) among all peptides. The peptide minimum is the lowest intensity value observed for an individual peptide, and half of this value is referred to as LOD2.
Random tail imputation (RTI) is based on the assumption that the entire proteomics dataset can be modeled by a single distribution and that the majority of the missing data are left-censored and can be drawn from the tail of that distribution.42,43 RTI computes the global mean and standard deviation of all observed values within the proteomics dataset, μ and σ, respectively. Peptide intensities are plotted as frequency histograms, and the missing values are then drawn from a truncated normal distribution to obtain values that lie within the left tail of the distribution, N(μ,σ) – k. The parameter k is selected as the largest value that allows the imputed data to merge into the left tail of the base distribution N(μ,σ) without yielding a bimodal distribution. The value of k is chosen by iteratively visualizing the imputed data as histograms at various values of k until a suitable value is found.
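The three replacement rules can be illustrated with a minimal Python sketch. It is not the implementation used in the study. The function names, the list-of-lists layout (rows are peptides, columns are samples, `None` marks a missing value), and the `upper` and `seed` parameters are all hypothetical. In particular, the text does not spell out the exact form of the truncated distribution, so the sketch assumes one possible reading of N(μ,σ) – k: draws from a normal distribution with mean μ − k and standard deviation σ, truncated above at μ by default, with k ≥ 0.

```python
import random
import statistics
from typing import List, Optional

# Hypothetical layout: rows = peptides, columns = samples, None = missing.
Matrix = List[List[Optional[float]]]


def observed(data: Matrix) -> List[float]:
    """Return all non-missing values as a flat list."""
    return [v for row in data for v in row if v is not None]


def impute_lod1(data: Matrix) -> Matrix:
    """LOD1: replace every missing value with half of the global minimum."""
    fill = min(observed(data)) / 2.0
    return [[fill if v is None else v for v in row] for row in data]


def impute_lod2(data: Matrix) -> Matrix:
    """LOD2: replace missing values with half of each peptide's own minimum.

    Peptides with no observed value are left unchanged (an assumption;
    the text does not cover this case).
    """
    result = []
    for row in data:
        seen = [v for v in row if v is not None]
        if not seen:
            result.append(list(row))
            continue
        fill = min(seen) / 2.0
        result.append([fill if v is None else v for v in row])
    return result


def impute_rti(data: Matrix, k: float, upper: Optional[float] = None,
               seed: Optional[int] = None) -> Matrix:
    """RTI under the assumed reading: draws from N(mu - k, sigma), truncated above.

    mu and sigma are the global mean and standard deviation of all observed
    values. `upper` defaults to mu (an assumption of this sketch); k >= 0.
    """
    rng = random.Random(seed)
    values = observed(data)
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    bound = mu if upper is None else upper

    def draw() -> float:
        # Rejection sampling: redraw until the value is at or below the bound.
        while True:
            x = rng.gauss(mu - k, sigma)
            if x <= bound:
                return x

    return [[draw() if v is None else v for v in row] for row in data]


# Toy raw-scale intensities, invented for illustration only.
data = [
    [1000.0, None, 1200.0],
    [400.0, 500.0, None],
    [None, 80.0, 100.0],
]
lod1 = impute_lod1(data)           # every gap filled with 80 / 2 = 40.0
lod2 = impute_lod2(data)           # gaps filled with 500.0, 200.0, 40.0 per row
rti = impute_rti(data, k=100.0, seed=0)
```

In practice, `impute_rti` would be rerun at several values of `k`, with a histogram of the imputed data inspected each time, until the imputed values blend into the left tail without producing a second mode.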