After standard quality control procedures, the first step of existing single-cell RNA-seq processing pipelines [1–3] is to extract cell barcode (CB) and UMI sequences and to add this information to the header of the sequenced read or save it in temporary files. This approach, while versatile, can create many intermediate files on disk for further processing, which can be time- and space-consuming.
Alevin begins with sample-demultiplexed FASTQ files. It quickly iterates over the file containing the barcode reads and tallies the frequency of every observed barcode, whether or not it may contain errors. We denote the collection of all observed barcodes as B. Whitelisting involves determining which of these barcodes may have derived from a valid cell. When the data have previously been processed by another pipeline, a whitelist may already be available for alevin to use. When no whitelist is available, alevin calculates one in two steps. First, an initial draft whitelist is produced using the procedure explained below, to select CBs for initial quantification. This list is then refined once per-cell quantification estimates are available (see the “Final whitelisting (optional)” section) to produce a final whitelist.
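The tallying pass described above can be sketched in a few lines of Python. This is an illustrative sketch, not alevin's implementation: the function names are hypothetical, and it assumes that the cell barcode occupies the first `cb_length` bases of each barcode read (the actual layout and length depend on the protocol).

```python
from collections import Counter
from typing import Iterable, Iterator


def iter_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (second line) of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def tally_barcodes(lines: Iterable[str], cb_length: int = 16) -> Counter:
    """Count every observed cell barcode, with no error correction applied."""
    counts: Counter = Counter()
    for seq in iter_fastq_sequences(lines):
        if len(seq) >= cb_length:  # skip reads too short to hold a barcode
            counts[seq[:cb_length]] += 1
    return counts


# Toy barcode reads: a 4-base barcode followed by a 3-base UMI.
# Both lengths are made up for illustration.
toy_fastq = [
    "@r1", "ACGTAAA", "+", "IIIIIII",
    "@r2", "ACGTCCC", "+", "IIIIIII",
    "@r3", "ACGAGGG", "+", "IIIIIII",
]
observed = tally_barcodes(toy_fastq, cb_length=4)
```

Here `observed` plays the role of B together with the frequency of each of its elements; in practice the same function could be fed an open file handle instead of a list of lines.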
To generate a putative whitelist, we follow the approach taken by other dscRNA-seq pipelines by analyzing the cumulative distribution of barcode frequencies and finding the knee in this curve [1, 2]. Those barcodes occurring after the knee constitute the whitelist, denoted W. We use a Gaussian kernel to estimate the probability density function for the barcode frequency and select the local minimum corresponding to the “knee.” In the case of a user-provided whitelist, the provided W is used as the fixed final whitelist.
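One way to picture the density-based knee selection is the following sketch. It is written under several assumptions that are not taken from the text: the density is estimated over log10-transformed frequencies, the bandwidth and grid size are arbitrary illustrative values, the knee is taken to be the local minimum closest to the high-frequency end, and barcodes at or above the resulting cutoff are kept. All function names are hypothetical.

```python
import math
from typing import Dict, List, Set


def gaussian_kde(points: List[float], grid: List[float], bandwidth: float) -> List[float]:
    """Evaluate a Gaussian kernel density estimate of `points` on `grid`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        for x in grid
    ]


def knee_threshold(counts: Dict[str, int], bandwidth: float = 0.2,
                   grid_size: int = 200) -> float:
    """Return the frequency at the rightmost interior local minimum of the density."""
    logs = [math.log10(c) for c in counts.values()]  # assumption: log scale
    lo, hi = min(logs), max(logs)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    density = gaussian_kde(logs, grid, bandwidth)
    # Scan from the high-frequency end for the first interior local minimum.
    for i in range(grid_size - 2, 0, -1):
        if density[i] < density[i - 1] and density[i] <= density[i + 1]:
            return 10 ** grid[i]
    raise ValueError("density has no interior local minimum; no knee found")


def draft_whitelist(counts: Dict[str, int], bandwidth: float = 0.2) -> Set[str]:
    """Keep the barcodes whose frequency is at or above the knee cutoff."""
    cutoff = knee_threshold(counts, bandwidth)
    return {cb for cb, c in counts.items() if c >= cutoff}


# Synthetic frequencies: 50 high-count "cells" and 500 low-count barcodes.
toy_counts = {f"cell{i}": 1000 + i for i in range(50)}
toy_counts.update({f"noise{i}": 1 + i % 10 for i in range(500)})
whitelist = draft_whitelist(toy_counts)
```

On real data the density may have several local minima, so the choice of which one corresponds to the knee, and the bandwidth used, would need more care than this sketch takes.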
Next, we consider the barcodes in H, the set of observed barcodes that are not in the whitelist, to determine, for each non-whitelisted barcode, whether (a) its corresponding reads should be assigned to some barcode in the whitelist W or (b) this barcode represents some other type of noise or error (e.g., ambient RNA, lysed cell) and its associated reads should be discarded. The approach of alevin is to determine, for each barcode h_j ∈ H, the set of whitelisted barcodes with which h_j could be associated. We call these the putative labels of h_j, denoted ℓ(h_j). Following the criteria used by previous pipelines [1], we consider a whitelisted barcode w_i to be a putative label for some erroneous barcode h_j if h_j can be obtained from w_i by a substitution, by a single insertion (and clipping of the terminal base), or by a single deletion (and the addition of a valid nucleotide to the end of h_j). Rather than applying traditional algorithms for computing the all-versus-all edit distances directly and then filtering for such occurrences, we exploit the fact that barcodes are relatively short. We can therefore explicitly iterate over all of the valid w_i ∈ W and enumerate all erroneous barcodes for which each might be a putative label. Let Q(w_i, H) be the set of barcodes from H that adhere to the conditions defined above; then, for each h_j ∈ Q(w_i, H), we append w_i as a putative label for the erroneous barcode h_j.
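The enumeration of Q(w_i, H) can be illustrated with a short Python sketch. It assumes fixed-length barcodes over the alphabet A, C, G, T; the function names and the choice to record the edit type alongside each label are illustrative and are not taken from alevin's source.

```python
from typing import Dict, Iterable, List, Tuple

NUCLEOTIDES = "ACGT"


def neighbors(w: str) -> Dict[str, str]:
    """Map each barcode reachable from w by one allowed edit to its edit type."""
    found: Dict[str, str] = {}
    n = len(w)
    # Substitution: replace one base with a different one.
    for i in range(n):
        for b in NUCLEOTIDES:
            if b != w[i]:
                found[w[:i] + b + w[i + 1:]] = "substitution"
    # Insertion: add one base, then clip the terminal base to restore the length.
    for i in range(n):
        for b in NUCLEOTIDES:
            cand = (w[:i] + b + w[i:])[:n]
            if cand != w:
                found.setdefault(cand, "insertion")
    # Deletion: drop one base, then append a nucleotide to restore the length.
    for i in range(n):
        for b in NUCLEOTIDES:
            cand = w[:i] + w[i + 1:] + b
            if cand != w:
                found.setdefault(cand, "deletion")
    return found


def putative_labels(whitelist: Iterable[str],
                    erroneous: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """For every erroneous barcode h, collect (w, edit type) for each w with h in Q(w, H)."""
    labels: Dict[str, List[Tuple[str, str]]] = {h: [] for h in erroneous}
    for w in sorted(whitelist):
        for cand, edit in neighbors(w).items():
            if cand in labels:  # cand is an observed, non-whitelisted barcode
                labels[cand].append((w, edit))
    return labels


# Toy 4-base barcodes, made up for illustration.
labels = putative_labels({"ACGT", "TTTT"}, {"ACGA", "CGTA", "GGGG"})
```

For a barcode of length n, each whitelisted barcode generates on the order of n candidate edits per nucleotide, so the total work grows with the size of the whitelist rather than with the number of barcode pairs, which is the advantage over an all-versus-all edit-distance computation.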
Once all whitelisted barcodes have been processed, each element in H will have zero or more putative labels. If an erroneous barcode has more than one putative label, we prioritize substitutions over insertions and deletions. If this does not yield a single label, ties are broken randomly. If no candidate is discovered for an erroneous barcode, then this barcode is considered “noise,” and its associated reads are simply discarded. Note that, although adopted from existing methods, the alevin initial whitelisting process is designed to output a larger number of CBs.
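The label-resolution rule can be sketched as follows. The input format (a mapping from each erroneous barcode to a list of whitelisted barcodes paired with an edit type) and all names are hypothetical, and the seeded random generator is only there to make the illustration reproducible.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


def resolve_label(candidates: List[Tuple[str, str]],
                  rng: random.Random) -> Optional[str]:
    """Pick one whitelisted barcode, or None if there is no candidate."""
    if not candidates:
        return None
    # Prefer labels reached by substitution over insertions and deletions.
    subs = [w for w, edit in candidates if edit == "substitution"]
    pool = subs if subs else [w for w, _ in candidates]
    return rng.choice(pool)  # remaining ties are broken randomly


def correct_barcodes(labels: Dict[str, List[Tuple[str, str]]],
                     seed: int = 0) -> Tuple[Dict[str, str], Set[str]]:
    """Return (erroneous -> whitelisted barcode, barcodes discarded as noise)."""
    rng = random.Random(seed)
    corrected: Dict[str, str] = {}
    noise: Set[str] = set()
    for h in sorted(labels):
        w = resolve_label(labels[h], rng)
        if w is None:
            noise.add(h)  # no putative label: the reads are discarded
        else:
            corrected[h] = w
    return corrected, noise


# Toy putative labels for 4-base barcodes, made up for illustration.
toy_labels = {
    "ACGA": [("ACGT", "substitution"), ("CGAA", "deletion")],
    "CGTA": [("ACGT", "deletion"), ("GTAC", "insertion")],
    "GGGG": [],
}
corrected, noise = correct_barcodes(toy_labels)
```

In this toy input, "ACGA" is always assigned to "ACGT" because a substitution is available, "CGTA" is assigned to one of its two candidates at random, and "GGGG" is treated as noise.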
Srivastava A., Malik L., Smith T., Sudbery I., & Patro R. (2019). Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biology, 20, 65.