By default, Mash uses 32-bit hashes for k-mers where |Σ|^k ≤ 2^32 and 64-bit hashes for |Σ|^k ≤ 2^64. Thus, to minimize the resulting size of the all-RefSeq sketches, k = 16 was chosen along with a sketch size s = 400. While not ideal for large genomes (due to the small k) or highly divergent genomes (due to the small sketch), these parameters are well suited for determining species-level relationships between the microbial genomes that currently constitute the majority of RefSeq. For similar genomes (e.g. ANI > 95%), sketches of a few hundred hashes are sufficient for basic clustering. As ANI drops further, the Jaccard index rapidly becomes very small and larger sketches are required for accurate estimates. Confidence bounds for the Jaccard estimate can be computed using the inverse cumulative distribution function for the hypergeometric or binomial distributions (Additional file 1: Figure S1). For example, with a sketch size of 400, two genomes with a true Jaccard index of 0.1 (x = 40) are very likely to have a Jaccard estimate between 0.075 and 0.125 (P > 0.9, binomial density for 30 ≤ x ≤ 50). For k = 16, this corresponds to a Mash distance between 0.12 and 0.09.
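The arithmetic in this paragraph can be checked with a short script. The following is a minimal sketch and not Mash's own code: it assumes a DNA alphabet of size 4, models the number of shared hashes as binomial, and uses the Mash distance D = -(1/k) ln(2j/(1+j)) as defined elsewhere in the Mash paper. All function names are illustrative.

```python
import math


def hash_bits(alphabet_size: int, k: int) -> int:
    """Return the hash width (32 or 64 bits) implied by the k-mer space |alphabet|^k."""
    kmer_space = alphabet_size ** k
    if kmer_space <= 2 ** 32:
        return 32
    if kmer_space <= 2 ** 64:
        return 64
    raise ValueError("k-mer space exceeds 2^64")


def binomial_interval_prob(s: int, j: float, lo: int, hi: int) -> float:
    """P(lo <= X <= hi) for X ~ Binomial(s, j), where X is the number of shared hashes."""
    return sum(
        math.comb(s, x) * j ** x * (1 - j) ** (s - x)
        for x in range(lo, hi + 1)
    )


def mash_distance(j: float, k: int) -> float:
    """Mash distance for Jaccard index j and k-mer size k (formula assumed from the paper)."""
    return -math.log(2 * j / (1 + j)) / k


if __name__ == "__main__":
    # k = 16 is the largest k for which 4^k still fits in 32-bit hashes.
    print("hash bits for k=16:", hash_bits(4, 16))
    print("hash bits for k=17:", hash_bits(4, 17))
    # Probability that 30..50 of 400 hashes are shared when the true Jaccard index is 0.1.
    print("P(30 <= x <= 50):", binomial_interval_prob(400, 0.1, 30, 50))
    # Distances at the two ends of the Jaccard interval [0.075, 0.125].
    print("D at j=0.075:", mash_distance(0.075, 16))
    print("D at j=0.125:", mash_distance(0.125, 16))
```

Because the distance decreases as the Jaccard index increases, the lower Jaccard bound maps to the larger distance, which is why the interval is quoted from 0.12 down to 0.09.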
RefSeq Complete release 70 was downloaded from NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov). Using FASTA and GenBank records, replicons and contigs were grouped by organism using a combination of two-letter accession prefix, taxonomy ID, BioProject, BioSample, assembly ID, plasmid ID, and organism name fields to ensure distinct genomes were not combined. In rare cases this strategy resulted in over-separation due to database mislabeling. Plasmids and organelles were grouped with their corresponding nuclear genomes when available; otherwise they were kept as separate entries. Sequences assigned to each resulting “organism” group were combined into multi-FASTA files and chunked for easy parallelization. Each chunk was sketched with:
mash sketch -s 400 -k 16 -f -o chunk *.fasta
This required 26.1 CPU h on a heterogeneous cluster of AMD processors. (Note: option -f is not required in Mash v1.1.) The resulting chunked sketch files were combined with the Mash paste function to create a single “refseq.msh” file containing all sketches. Each chunked sketch file was then compared against the combined sketch file, again in parallel, using:
mash dist -t refseq.msh chunk.msh
This required 6.9 CPU h to create pairwise distance tables for all chunks. The resulting chunk tables were concatenated and formatted to create a PHYLIP formatted distance table.
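The final concatenation and formatting step could look something like the following. This is a hedged sketch rather than the script used for the paper: it assumes each chunk table has already been parsed into a mapping from genome name to a row of distances ordered like the reference list, and it writes a square PHYLIP matrix with names padded to 10 characters. The names `assemble_matrix` and `write_phylip` are hypothetical.

```python
import io
from typing import Dict, Iterable, List, TextIO


def assemble_matrix(
    ref_names: List[str], chunk_rows: Iterable[Dict[str, List[float]]]
) -> List[List[float]]:
    """Merge per-chunk rows and order them to match ref_names (assumed input layout)."""
    rows: Dict[str, List[float]] = {}
    for chunk in chunk_rows:
        rows.update(chunk)
    return [rows[name] for name in ref_names]


def write_phylip(names: List[str], dist: List[List[float]], out: TextIO) -> None:
    """Write a square PHYLIP distance matrix: taxon count, then one row per taxon."""
    n = len(names)
    if len(dist) != n or any(len(row) != n for row in dist):
        raise ValueError("distance matrix must be square and match the name list")
    # Classic PHYLIP names are 10 characters wide; truncation may cause collisions.
    labels = [name[:10].ljust(10) for name in names]
    if len(set(labels)) != n:
        raise ValueError("names are not unique after truncation to 10 characters")
    out.write(f"{n}\n")
    for label, row in zip(labels, dist):
        out.write(label + " " + " ".join(f"{d:.6f}" for d in row) + "\n")


if __name__ == "__main__":
    names = ["A", "B", "C"]
    chunks = [
        {"A": [0.0, 0.1, 0.2], "B": [0.1, 0.0, 0.3]},
        {"C": [0.2, 0.3, 0.0]},
    ]
    buffer = io.StringIO()
    write_phylip(names, assemble_matrix(names, chunks), buffer)
    print(buffer.getvalue(), end="")
```

Real accessions are often longer than 10 characters, so in practice a name-mapping step (short identifiers in the matrix, full names kept in a side table) would likely be needed before passing the file to PHYLIP.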
For the ANI comparison, a subset of 500 Escherichia genomes was selected to present a range of distances yet bound the runtime of the comparatively expensive ANI computation. ANI was computed using the MUMmer v3.23 “dnadiff” program and extracting the 1-to-1 “AvgIdentity” field from the resulting report files [49]. The corresponding Mash distances were taken from the all-vs-all distance table as described above.
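Extracting the 1-to-1 “AvgIdentity” value from a report can be sketched as below. This is an illustration under assumptions, not the authors' script: it assumes the report lists a “1-to-1” row followed by an “AvgIdentity” row whose first numeric column is the reference value, with a later “M-to-M” block that must be skipped. The function names are hypothetical, and the conversion 1 − ANI/100 is included only as a convenient way to put ANI on the same scale as a Mash distance.

```python
def one_to_one_avg_identity(report_text: str) -> float:
    """Return the first AvgIdentity value found in the 1-to-1 alignment block."""
    in_one_to_one = False
    for line in report_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "1-to-1":
            in_one_to_one = True
        elif fields[0] == "M-to-M":
            # Leaving the 1-to-1 block; its AvgIdentity must not be confused with this one.
            in_one_to_one = False
        elif in_one_to_one and fields[0] == "AvgIdentity":
            return float(fields[1])
    raise ValueError("no 1-to-1 AvgIdentity field found in report")


def ani_to_distance(ani_percent: float) -> float:
    """Convert ANI in percent to a dissimilarity comparable to a Mash distance."""
    return 1.0 - ani_percent / 100.0


if __name__ == "__main__":
    # Synthetic excerpt in the assumed report layout; values are made up.
    sample = """\
[Alignments]
1-to-1                           12                   12
TotalLength                 4600000              4600100
AvgLength                 383333.33            383341.67
AvgIdentity                   98.75                98.75

M-to-M                           20                   20
TotalLength                 4700000              4700200
AvgLength                 235000.00            235010.00
AvgIdentity                   97.50                97.50
"""
    ani = one_to_one_avg_identity(sample)
    print("ANI:", ani, "-> 1 - ANI:", ani_to_distance(ani))
```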
For the primate phylogeny, the FASTA files were sketched separately, in parallel, taking an average time of 8.9 min each and a maximum time of 11 min (Intel Xeon E5-4620 2.2 GHz processor and solid-state drive). The sketches were combined with Mash paste and the combined sketch given to dist. These operations took insignificant amounts of time, and table output from dist was given to PHYLIP v3.695 [50] neighbor to produce the phylogeny. Accessions for all genomes used are given in Additional file 1: Table S1. The UCSC tree was downloaded from [51].
Ondov B.D., Treangen T.J., Melsted P., Mallonee A.B., Bergman N.H., Koren S., & Phillippy A.M. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17, 132.