The six bacterial isolates used in the real-read tests each belong to a different species: Acinetobacter baumannii, Citrobacter koseri, Enterobacter kobei, an unnamed Haemophilus species (given the placeholder name Haemophilus sp002998595 in GTDB R202 [28, 29]), Klebsiella oxytoca and Klebsiella variicola. Sequencing was previously described in Wick et al. (2021) [19]. Briefly, isolates were cultured overnight at 37°C in Luria-Bertani broth and DNA was extracted using GenFind v3 according to the manufacturer’s instructions (Beckman Coulter). The same DNA extract was used to sequence each isolate using three different approaches: ONT ligation, ONT rapid and Illumina (S11(A) Fig). For ONT ligation, we followed the protocol for the SQK-LSK109 ligation sequencing kit and EXP-NBD104 native barcoding expansion (Oxford Nanopore Technologies). For ONT rapid, we followed the protocol for the SQK-RBK004 rapid barcoding kit (Oxford Nanopore Technologies). All ONT libraries were sequenced on MinION R9.4.1 flow cells. ONT read sets were basecalled and demultiplexed with Guppy v5.0.7, using the super-accuracy model. For Illumina, we followed a modified Illumina DNA Prep protocol (catalogue number 20018705), whereby the reaction volumes were quartered to conserve reagents. Illumina libraries were sequenced on the NovaSeq 6000 using SP reagent kits v1.0 (300 cycles, Illumina Inc.), producing 150 bp paired-end reads with a mean insert size of 331 bp. The resulting Illumina read pairs were shuffled and evenly split into two separate read sets, which were combined with the ONT read sets to produce two independent hybrid read sets (S11(B) Fig). We repeated this process (from culture to sequencing) to generate another two hybrid read sets for a total of four hybrid read sets per isolate. All reads are available in the manuscript’s data repository (bridges.monash.edu/articles/dataset/Polypolish_paper_dataset/16727680).
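The shuffle-and-split step for the Illumina read pairs can be illustrated with a minimal sketch. It is not the script used in this study: the function name `split_read_pairs`, the in-memory list of pairs and the `seed` parameter are all assumptions made for illustration, and real FASTQ files would need to be parsed and streamed. The point it shows is that read pairs are shuffled as units, so mates are never separated, and then divided into two halves.

```python
import random


def split_read_pairs(pairs, seed=0):
    """Shuffle read pairs and split them into two halves.

    `pairs` is a list of (read_1, read_2) tuples. Each tuple is moved as a
    unit so that the two mates of a pair always end up in the same set.
    `seed` is a hypothetical parameter that makes the shuffle reproducible.
    """
    shuffled = list(pairs)                 # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # shuffle whole pairs, not single reads
    half = len(shuffled) // 2
    # With an odd number of pairs, the second set receives one extra pair.
    return shuffled[:half], shuffled[half:]


# Toy example with ten hypothetical read pairs.
toy_pairs = [(f"read{i}/1", f"read{i}/2") for i in range(10)]
set_a, set_b = split_read_pairs(toy_pairs)
```

Each of the two resulting sets would then be combined with one ONT read set to form a hybrid read set.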
For each hybrid read set, we performed a long-read-only assembly using Trycycler v0.5.0 [3] and Medaka v1.4.3 [8], following the instructions in Trycycler’s documentation (S11(C) Fig and S6 Table). One of the ONT read sets for K. oxytoca MSB1_2C had very low depth (10×) and was therefore not able to yield a high-quality long-read-only assembly, leaving only three assemblies for this genome. We were able to produce four complete (circularised) long-read-only assemblies for the other five genomes, giving a total of 23 assemblies which served as the ‘unpolished’ assemblies in our real-read tests.
Polished genome sequences were generated by running the short-read polishers as described above (S11(D) Fig). For the single-tool tests, each polisher was run three consecutive times on each assembly. For the greedy combination tests, each polisher (excluding the hybrid polishers and wtpoa, which performed poorly in the single-tool tests) was run once on each genome, and the best-performing polisher was defined as the one with the smallest total pairwise distance in its output assemblies. The best-performing polisher’s output was then used as input for another round of polishing, and this was repeated until no further improvement was observed. The greedy combination tests were then performed again with Polypolish excluded.
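The greedy combination procedure can be sketched as a simple loop. This is a simplified illustration under stated assumptions, not the code used in this study: the polishers are hypothetical stand-in functions that map a sequence to a sequence, the function `greedy_polish` operates on the assemblies of a single genome, and a Hamming-style toy distance replaces the alignment-based distance so that the example stays self-contained.

```python
from itertools import combinations


def greedy_polish(assemblies, polishers, total_distance):
    """Chain polishers greedily until the total distance stops improving.

    assemblies:     list of sequences (strings) for one genome
    polishers:      dict of name -> function(sequence) -> sequence
                    (hypothetical stand-ins for real polishing tools)
    total_distance: function(list of sequences) -> number, lower is better
    """
    current = list(assemblies)
    best_score = total_distance(current)
    chain = []  # names of the polishers chosen in each round
    while True:
        # Run every polisher once on the current assemblies.
        outputs = {name: [polish(seq) for seq in current]
                   for name, polish in polishers.items()}
        scores = {name: total_distance(out) for name, out in outputs.items()}
        winner = min(scores, key=scores.get)
        if scores[winner] >= best_score:
            break  # no polisher improves on the current assemblies
        current, best_score = outputs[winner], scores[winner]
        chain.append(winner)
    return current, chain


def toy_total_distance(sequences):
    """Sum of per-position mismatches over all pairs (equal lengths assumed)."""
    return sum(sum(x != y for x, y in zip(a, b))
               for a, b in combinations(sequences, 2))


# Two toy polishers, each able to repair only one kind of error.
toy_polishers = {
    "fix_x": lambda seq: seq.replace("X", "A"),
    "fix_y": lambda seq: seq.replace("Y", "C"),
}
final, chain = greedy_polish(["AXXGY", "AAAGC", "AAAGC"],
                             toy_polishers, toy_total_distance)
```

In this toy case the loop first selects `fix_x`, which removes the most disagreement, then `fix_y`, and stops when neither polisher reduces the distance further.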
To assess the quality of real-read assemblies, we used the edlib [26] library to perform a global alignment of the chromosome sequences for all pairwise combinations within each genome (S11(E) Fig). The total pairwise distance was used as a metric of assembly quality, with lower values being better and a value of zero indicating that all assemblies for the genome were identical. We also ran ALE [18] on each real-read assembly using short-read alignments from BWA-MEM [16].
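The total pairwise distance metric can be sketched as follows. To keep the example free of external dependencies, it uses a small pure-Python dynamic-programming edit distance, which is an assumption made for illustration and is far too slow for whole chromosomes. With the edlib Python bindings, the equivalent global-alignment distance would be obtained with `edlib.align(a, b, mode="NW", task="distance")["editDistance"]`. The function names `edit_distance` and `total_pairwise_distance` are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost global edit distance between two strings (two-row DP)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # match or mismatch
            ))
        previous = current
    return previous[-1]


def total_pairwise_distance(sequences):
    """Sum of edit distances over all unordered pairs of sequences."""
    return sum(edit_distance(a, b) for a, b in combinations(sequences, 2))


identical = total_pairwise_distance(["ACGT", "ACGT", "ACGT"])
differing = total_pairwise_distance(["ACGT", "ACGA", "ACG"])
```

For a genome with four assemblies this sums six pairwise distances, and for a genome with three assemblies it sums three. The sketch assumes the sequences being compared already share the same starting position and strand.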