As a Canu job progresses, summary statistics are updated in a set of plaintext and HTML reports. The primary data interchange between stages is FASTA or FASTQ files, but for efficiency, each stage stores its input reads in an indexed database, after which the original input is no longer needed. Each of the three stages begins by identifying overlaps between all pairs of input reads. Although the overlapping strategy varies by stage, each follows the same pattern: count k-mers in the reads, find overlaps between the reads, and create an indexed store of those overlaps. By default, the correction stage uses MHAP (Berlin et al. 2015), and the remaining stages use overlapInCore (Myers et al. 2000). From the input reads, the correction stage generates corrected reads; the trimming stage trims unsupported bases and detects hairpin adapters, chimeric sequences, and other anomalies; and the assembly stage constructs an assembly graph and contigs. The individual stages can be run independently or in series.
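To make the shared pattern concrete, the following is a minimal, hypothetical Python sketch of k-mer-seeded candidate overlap detection feeding an indexed overlap store. It is not Canu's internal API, and the actual overlappers (MHAP, overlapInCore) use far more sophisticated sketching, filtering, and alignment; the sketch only illustrates the "count k-mers, find overlaps, store overlaps" structure that each stage follows.

```python
from collections import defaultdict
from itertools import combinations

def kmers(seq, k):
    """Yield every k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def candidate_overlaps(reads, k=5, min_shared=2):
    """Return a toy 'overlap store' keyed by read id.

    reads: dict mapping read id -> sequence. Two reads become candidate
    overlap partners when they share at least min_shared k-mers, a
    simplified stand-in for the seeding used by real overlappers.
    """
    # k-mer index: which reads contain each k-mer
    index = defaultdict(set)
    for rid, seq in reads.items():
        for km in kmers(seq, k):
            index[km].add(rid)

    # count k-mers shared by each pair of reads
    shared = defaultdict(int)
    for rids in index.values():
        for a, b in combinations(sorted(rids), 2):
            shared[(a, b)] += 1

    # indexed store of candidate overlaps, keyed by read id
    store = defaultdict(list)
    for (a, b), n in shared.items():
        if n >= min_shared:
            store[a].append(b)
            store[b].append(a)
    return store

reads = {
    "r1": "ACGTACGTGGA",
    "r2": "TACGTGGACCT",
    "r3": "GGTTTTTTTTT",
}
print(dict(candidate_overlaps(reads)))  # {'r1': ['r2'], 'r2': ['r1']}
```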
For distributed jobs, local compute resources are polled to build a list of available hosts and their specifications. Next, based on the estimated genome size, Canu chooses an appropriate range of parameters for each algorithm (e.g., the number of compute threads to use when computing overlaps). Finally, Canu automatically picks specific parameters from each allowed range so that usage of the available resources is maximized. As an example, for a mammalian-sized genome, Canu will choose a range of one to eight compute threads and 4 to 16 GB of memory for each overlapping job. On a grid with 10 hosts, each with 18 cores and 32 GB of memory, Canu will keep all 180 cores busy by selecting six threads and 10 GB of memory per job: each host then runs three concurrent jobs, using all 18 of its cores within its 32 GB of memory. This process is repeated for each step and allows automated deployment across varied cluster and host configurations, simplifying usage and maximizing resource utilization.
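This resource-fitting step can be sketched in a few lines. The Python snippet below is a simplified, hypothetical illustration of the idea, not Canu's actual implementation: it enumerates per-job thread and memory settings within the allowed ranges and keeps the combination that occupies the most cores, breaking ties toward more memory per job. Under those assumptions it reproduces the worked example above (six threads, 10 GB, 180 cores); Canu's real heuristic weighs additional constraints.

```python
def choose_job_shape(hosts, thread_range, memory_range_gb):
    """Pick per-job threads and memory that maximize the number of cores
    kept busy across the grid, breaking ties toward more memory per job.

    hosts            -- list of (cores, memory_gb) tuples, one per host
    thread_range     -- (min, max) threads allowed per job
    memory_range_gb  -- (min, max) GB of memory allowed per job
    """
    best = None  # (cores_used, mem_gb, threads)
    for threads in range(thread_range[0], thread_range[1] + 1):
        for mem in range(memory_range_gb[0], memory_range_gb[1] + 1):
            cores_used = 0
            for host_cores, host_mem in hosts:
                # concurrent jobs this host can hold, limited by both
                # its core count and its memory
                jobs = min(host_cores // threads, host_mem // mem)
                cores_used += jobs * threads
            candidate = (cores_used, mem, threads)
            if best is None or candidate > best:
                best = candidate
    cores_used, mem, threads = best
    return threads, mem, cores_used

# Worked example from the text: ten 18-core/32-GB hosts and per-job
# ranges of 1-8 threads and 4-16 GB of memory.
hosts = [(18, 32)] * 10
print(choose_job_shape(hosts, (1, 8), (4, 16)))  # -> (6, 10, 180)
```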