Related reads from each sample were clustered using parameters chosen to best group reads arising from the same cell. To be considered part of the same cluster, reads were required to have the same V and J gene annotation, the same CDR3 length, and a similar CDR3 amino acid (AA) sequence, with one AA mismatch allowed per 12 AAs (1 mismatch for length < 12 AA, 2 mismatches for 12 AA ≤ length < 24 AA, etc.). The same D gene annotation was not required for inclusion in the same cluster, as somatic hypermutation in this region makes accurate assignment of a gene segment difficult. We focused on the CDR3 region of the sequence, as it is highly variable and has the dominant role in determining the antigenic specificity of the sequence.18 Clusters were defined iteratively, using an approach that identified the cluster centers giving the largest possible clusters. Briefly, clustering started with a set of unique sequences of the same length, U(L), and an empty set of clusters, C(L). The first cluster center was defined as the sequence x∈U(L) that had the most neighbors, |N(x)|; the set N(x) was then added to C(L), with cluster center x. N(x) was then removed from U(L), and the process was repeated until U(L) was empty, at which point every sequence had been assigned to a cluster in C(L). Clusters were then collapsed into single reads, with each cluster represented by its cluster center sequence, V, D, and J gene usage, isotype subclass, and average V gene mutation. Where a single cluster contained multiple reads with different isotype subclass or D gene usage, the most frequent usage within the cluster was used for cluster annotation. These collapsed clusters were used in all subsequent analyses and are referred to as “sequences”.
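The greedy procedure described above can be illustrated with a minimal Python sketch. It is not the code used in this study. The function names (`max_mismatches`, `hamming`, `greedy_cluster`) are hypothetical, and the input is assumed to be unique CDR3 AA sequences that already share the same V gene, J gene, and CDR3 length. Two further assumptions are not specified in the text: neighbors are recomputed among the sequences still remaining in U(L) at each iteration, and ties between candidate centers are broken alphabetically.

```python
def max_mismatches(length):
    """Allowed AA mismatches: 1 for length < 12, 2 for 12 <= length < 24, etc."""
    return length // 12 + 1


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(cdr3s):
    """Greedily cluster unique, equal-length CDR3 AA sequences.

    Returns a list of (center, members) tuples in the order the clusters
    were created. Assumes all sequences share V gene, J gene, and length.
    """
    unassigned = set(cdr3s)          # U(L)
    clusters = []                    # C(L)
    if not unassigned:
        return clusters
    limit = max_mismatches(len(next(iter(unassigned))))

    while unassigned:
        best_center, best_neighbors = None, set()
        # Sorted iteration makes tie-breaking deterministic (an assumption).
        for x in sorted(unassigned):
            # N(x): remaining sequences within the mismatch limit, x included.
            neighbors = {y for y in unassigned if hamming(x, y) <= limit}
            if len(neighbors) > len(best_neighbors):
                best_center, best_neighbors = x, neighbors
        clusters.append((best_center, best_neighbors))
        unassigned -= best_neighbors  # remove N(x) from U(L)
    return clusters


if __name__ == "__main__":
    # Toy sequences of length 4, so one mismatch is allowed.
    for center, members in greedy_cluster(["CARD", "CARE", "CAKD", "WWWW"]):
        print(center, sorted(members))
```

This sketch compares all pairs at every iteration, which is quadratic in the number of unique sequences per group; it is meant to make the logic explicit rather than to be efficient on full repertoires.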