Threshold-Based Cell Identification in scRNA-seq

Applying a threshold on the p-value will identify barcodes that have count profiles that are significantly different from the ambient pool of RNA. We assume that this will be the case for most cell-containing droplets, as the ambient pool is formed from many (lysed) cells and is unlikely to be representative of any single cell. However, it is possible for some cell-containing droplets to have ambient-like expression profiles. This can occur if the cell population is highly homogeneous or if one cell subpopulation contributes disproportionately to the ambient pool, e.g., if it is more prone to lysis. Sequencing errors in the cell barcodes may also bias the estimates of the ambient proportions, by misassigning counts from cell-containing droplets to barcodes with low UMI totals. This may result in spurious similarities between cells and the estimated ambient profile.
To avoid incorrectly calling ambient-like cells as empty droplets, we combine our procedure with a conventional threshold on the total UMI count. We rank all barcodes in order of decreasing total count t_b and consider log(t_b) as a function f(.) of the log-transformed rank, i.e., log(t_b) = f(log r_b), where r_b is the rank of barcode b in the ordered sequence of barcodes. The first “knee” point in this function corresponds to a transition between a distinct subset of barcodes with large totals and the majority of barcodes with smaller totals. This is defined at the log-rank that minimizes the signed curvature

\frac{f''}{\left(1 + f'^{2}\right)^{1.5}},

and represents the point at which f(.) begins to drop rapidly, marking the start of the transition between large and small totals. In practice, we obtain f(.) by fitting a smooth spline to log(t_b) against the log-rank in the interval containing the knee point. The derivatives of f(.) are then obtained by differentiation of the spline basis functions. This avoids multiplication of errors during numerical differentiation, which would lead to instability in the curvature values and inaccurate estimates of the knee point.
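The knee-point search described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `knee_point`, the `lower` cutoff used to pick the fitting interval, and the `smooth_sd` smoothing setting are all assumptions introduced here, and SciPy's `UnivariateSpline` stands in for whichever spline routine was actually used. Tied totals are collapsed to a single point at their mean rank so that the log-ranks are strictly increasing.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline


def knee_point(totals, lower=100, smooth_sd=0.01):
    """Return (total, rank) at the knee of the log-total vs. log-rank curve.

    `lower` and `smooth_sd` are hypothetical tuning parameters:
    only totals above `lower` are used for the fit, and `smooth_sd` is the
    residual standard deviation (in log units) tolerated by the spline.
    """
    t = np.asarray(totals, dtype=float)
    t = t[t > 0]

    # Unique totals in decreasing order; ties share their mean rank.
    uniq, counts = np.unique(t, return_counts=True)
    uniq, counts = uniq[::-1], counts[::-1]
    mean_rank = np.cumsum(counts) - (counts - 1) / 2.0

    # Restrict the fit to an interval expected to contain the knee.
    keep = uniq > lower
    x = np.log(mean_rank[keep])
    y = np.log(uniq[keep])

    # Smoothing cubic spline; derivatives come from the spline itself,
    # not from finite differences of the raw values.
    spline = UnivariateSpline(x, y, k=3, s=len(x) * smooth_sd ** 2)
    d1 = spline.derivative(1)(x)
    d2 = spline.derivative(2)(x)

    # Signed curvature f'' / (1 + f'^2)^1.5; the knee minimizes it.
    curvature = d2 / (1.0 + d1 ** 2) ** 1.5
    i = int(np.argmin(curvature))
    return uniq[keep][i], mean_rank[keep][i]
```

Because both axes are log-transformed with the same base, changing the base rescales the curve uniformly and does not move the location of the minimum. The choice of `smooth_sd` does matter: too little smoothing lets noise dominate the second derivative, while too much flattens the knee.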
Our assumption is that any barcode with a large total count must represent a cell-containing droplet, regardless of whether its count profile resembles the ambient pool. This is based on the expectation that the distribution of the sizes of empty droplets should be unimodal, with a monotonically decreasing probability density as t_b increases past the mode. A distinct peak of large totals would not be consistent with this expected distribution. We define the upper threshold U as the t_b at the knee point and retain all barcodes with t_b ≥ U, irrespective of their p-values P_b. This guarantees recovery of any barcodes with large total counts that potentially represent cell-containing droplets. We use the knee point rather than the inflection point because the t_b of the former is larger, providing a more conservative threshold that avoids retention of empty droplets.
We stress that, despite the use of a threshold on t_b, our approach is different from existing methods due to the testing procedure. Barcodes with t_b below the knee point can still be retained if the count profile is significantly different from the ambient pool. This is not possible with existing methods that would simply discard these barcodes. Users can also set U manually if automatic detection of the knee point fails for complex f(.). Alternatively, this mechanism can be disabled completely in favor of detecting cells solely based on their p-values. This is more statistically rigorous as it avoids the selection of an ad hoc threshold, but may result in the failure to detect large cells.
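The combined rule can be written as a short sketch. The function name `call_cells`, the significance cutoff `alpha` and its default, and the `use_knee` switch are assumptions made for illustration. Whether the p-values have been corrected for multiple testing beforehand is left to the caller.

```python
import numpy as np


def call_cells(totals, pvalues, upper, alpha=0.01, use_knee=True):
    """Boolean mask of barcodes retained as cells.

    A barcode is kept if its p-value is at or below `alpha` (hypothetical
    cutoff), or, when `use_knee` is True, if its total count is at least
    the upper threshold `upper` (U in the text), whatever its p-value.
    """
    totals = np.asarray(totals)
    pvalues = np.asarray(pvalues, dtype=float)

    significant = pvalues <= alpha
    if not use_knee:
        # Detection based solely on the p-values.
        return significant
    return significant | (totals >= upper)
```

Passing a manually chosen `upper` corresponds to setting U by hand, and `use_knee=False` corresponds to disabling the threshold entirely.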

Lun A.T., Riesenfeld S., Andrews T., Dao T.P., Gomes T., & Marioni J.C. (2019). EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biology, 20, 63.

Publication 2019

Other organizations : Cancer Research UK, University of Cambridge, Broad Institute, Wellcome Sanger Institute, Memorial Sloan Kettering Cancer Center
