Applying a threshold on the p-value will identify barcodes that have count profiles that are significantly different from the ambient pool of RNA. We assume that this will be the case for most cell-containing droplets, as the ambient pool is formed from many (lysed) cells and is unlikely to be representative of any single cell. However, it is possible for some cell-containing droplets to have ambient-like expression profiles. This can occur if the cell population is highly homogeneous or if one cell subpopulation contributes disproportionately to the ambient pool, e.g., if it is more prone to lysis. Sequencing errors in the cell barcodes may also bias the estimates of the ambient proportions, by misassigning counts from cell-containing droplets to barcodes with low UMI totals. This may result in spurious similarities between cells and the estimated ambient profile.
To avoid incorrectly calling ambient-like cells as empty droplets, we combine our procedure with a conventional threshold on the total UMI count. We rank all barcodes in order of decreasing $t_b$, and consider $\log(t_b)$ as a function $f(\cdot)$ of the log-transformed rank, i.e., $\log(t_b) = f(\log r_b)$ where $r_b$ is the rank of $b$ in the ordered sequence of barcodes. The first “knee” point in this function corresponds to a transition between a distinct subset of barcodes with large totals and the majority of barcodes with smaller totals. This is defined at the log-rank that minimizes the signed curvature
$$\frac{f''}{(1 + f'^2)^{1.5}},$$
and represents the point at which $f(\cdot)$ begins to drop rapidly, marking the start of the transition between large and small totals. In practice, we obtain $f(\cdot)$ by fitting a smooth spline to $\log(t_b)$ against the log-rank in the interval containing the knee point. The derivatives of $f(\cdot)$ are then obtained by differentiation of the spline basis functions. This avoids multiplication of errors during numerical differentiation, which would lead to instability in the curvature values and inaccurate estimates of the knee point.
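The knee-point search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `knee_point`, the parameter `s_per_point`, and the use of SciPy's `UnivariateSpline` are choices made here for illustration. For simplicity the spline is fitted to all barcodes rather than to a restricted interval around the knee, and the demonstration curve is synthetic and noise-free.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline


def knee_point(totals, s_per_point=1e-6):
    """Return (total at the knee, rank of the knee) for a vector of UMI totals.

    s_per_point is a hypothetical tuning parameter: the allowed mean squared
    residual of the spline fit on the log scale. It needs adjusting for real,
    noisy data.
    """
    totals = np.asarray(totals, dtype=float)
    totals = np.sort(totals[totals > 0])[::-1]        # decreasing totals
    log_rank = np.log(np.arange(1, totals.size + 1))  # log-transformed rank
    log_total = np.log(totals)

    # Smooth cubic spline for f(.), i.e. log-total as a function of log-rank.
    spline = UnivariateSpline(log_rank, log_total, k=3,
                              s=s_per_point * totals.size)

    # Derivatives come from the fitted spline, not from finite differences.
    d1 = spline.derivative(1)(log_rank)
    d2 = spline.derivative(2)(log_rank)

    # Signed curvature f'' / (1 + f'^2)^1.5; the knee minimizes it.
    curvature = d2 / (1.0 + d1 ** 2) ** 1.5
    knee = int(np.argmin(curvature))
    return totals[knee], knee + 1


# Synthetic example (assumed shape): a plateau of large totals that bends
# into a steady decline around rank 1000 on the log-log scale.
ranks = np.arange(1, 20001)
x0, w = np.log(1000.0), 0.3
log_t = 9.0 - 2.0 * w * np.logaddexp(0.0, (np.log(ranks) - x0) / w)
demo_totals = np.exp(log_t)

U, knee_rank = knee_point(demo_totals)
```

Because the curvature is divided by a power of the slope, the minimum tends to fall slightly before the steepest part of the drop, which matches the description of the knee as the point where the curve begins to fall rapidly.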
Our assumption is that any barcode with a large total count must represent a cell-containing droplet, regardless of whether its count profile resembles the ambient pool. This is based on the expectation that the distribution of the sizes of empty droplets should be unimodal, with a monotonically decreasing probability density as $t_b$ increases past the mode. A distinct peak of large totals would not be consistent with this expected distribution. We define the upper threshold $U$ as the $t_b$ at the knee point and retain all barcodes with $t_b \ge U$, irrespective of their $P_b$. This guarantees recovery of any barcodes with large total counts that potentially represent cell-containing droplets. We use the knee point rather than the inflection point as the $t_b$ of the former is larger, providing a more conservative threshold that avoids retention of empty droplets.
We stress that, despite the use of a threshold on $t_b$, our approach is different from existing methods due to the testing procedure. Barcodes with $t_b$ below the knee point can still be retained if the count profile is significantly different from the ambient pool. This is not possible with existing methods that would simply discard these barcodes. Users can also set $U$ manually if automatic detection of the knee point fails for complex $f(\cdot)$. Alternatively, this mechanism can be disabled completely in favor of detecting cells solely based on their p-values. This is more statistically rigorous as it avoids the selection of an ad hoc threshold, but may result in the failure to detect large cells.
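The way the two criteria combine can be illustrated with a short Python sketch. The function `call_cells` and its parameters `alpha` and `upper` are hypothetical names introduced here. `alpha` is a placeholder cutoff applied directly to the p-values, and any multiple-testing adjustment is left out of this sketch. Passing `upper=None` corresponds to disabling the total-count mechanism.

```python
import numpy as np


def call_cells(totals, pvalues, alpha=0.01, upper=None):
    """Flag barcodes as cell-containing droplets (illustrative sketch).

    A barcode is retained if its p-value is at or below alpha, or if its
    total count is at or above the upper threshold. With upper=None, only
    the p-values are used.
    """
    totals = np.asarray(totals, dtype=float)
    pvalues = np.asarray(pvalues, dtype=float)

    significant = pvalues <= alpha   # profile differs from the ambient pool
    if upper is None:
        return significant
    return significant | (totals >= upper)   # large totals always retained


# Made-up values: the first barcode is large but ambient-like, the third is
# small but clearly different from the ambient profile.
totals = [12000, 9000, 800, 150]
pvalues = [0.5, 0.0001, 0.001, 0.7]

with_threshold = call_cells(totals, pvalues, upper=5000)
pvalues_only = call_cells(totals, pvalues)
```

In this toy input, the large ambient-like barcode is kept only when the upper threshold is active, while the small but distinct barcode is kept in both modes.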