Genomic DNA for E. coli EC958 was prepared using the Qiagen DNeasy Blood and Tissue kit, according to the manufacturer's instructions. The genome of E. coli EC958 was sequenced on a PacBio RS I sequencing instrument using an 8–12 kilobase (kb) insert library (GATC Biotech AG, Germany). Six SMRT cells yielded a total of 601,224 pre-filtered reads with an average length of 1,600 bp, corresponding to approximately 200-fold coverage.
De novo genome assemblies were produced using PacBio's SMRT Portal (v2.0.0) and the hierarchical genome assembly process (HGAP) [23], with default settings and a seed read cut-off length of 5,000 bp to ensure accurate assembly across E. coli rRNA operons. Assemblies were performed multiple times using different combinations of between one and six SMRT cells of read data. The best assembly results were obtained with six SMRT cells, which yielded approximately 547 Mb of sequence from 190,145 post-filtered reads (Table 1). The average read length was 2,875 bp, with an average single-pass accuracy of 86.5%. During the preassembly stage, the 190,145 long reads were converted into 23,772 high-quality preassembled reads with an average length of 4,573 bp. Assembly of these reads returned seven contigs, three of which were greater than 500 kb. The largest contig (∼3.8 Mb) was estimated to contain 74.5% of the chromosome of EC958. For all other assemblies, total contig numbers exceeded 10 (Table 1). However, for assemblies using two or three SMRT cells, assembly metrics could be improved more than 2-fold by reducing the seed read length (Table 1).
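The read and coverage figures quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis. It assumes the chromosome size can be approximated from the largest contig (∼3.8 Mb, stated to be 74.5% of the chromosome) and it ignores plasmid sequence, so the resulting coverage values are rough estimates only. All variable names are illustrative.

```python
# Figures quoted in the text.
post_filter_reads = 190_145
mean_read_len_bp = 2_875
preassembled_reads = 23_772
mean_preassembled_len_bp = 4_573
largest_contig_bp = 3.8e6        # "~3.8 Mb", a rounded value
largest_contig_fraction = 0.745  # "74.5% of the chromosome"

# Assumption: chromosome size inferred from the two rounded figures above.
# This is an approximation, not a value reported in the text.
chromosome_bp = largest_contig_bp / largest_contig_fraction

# Total bases = number of reads x mean read length.
total_bases = post_filter_reads * mean_read_len_bp
preassembled_bases = preassembled_reads * mean_preassembled_len_bp

# Fold coverage = total bases / genome size (plasmids ignored here).
post_filter_coverage = total_bases / chromosome_bp
preassembled_coverage = preassembled_bases / chromosome_bp

print(f"Post-filter yield: {total_bases / 1e6:.0f} Mb")
print(f"Approximate chromosome size: {chromosome_bp / 1e6:.2f} Mb")
print(f"Approximate post-filter coverage: {post_filter_coverage:.0f}-fold")
print(f"Approximate preassembled-read coverage: {preassembled_coverage:.0f}-fold")
```

The product of the post-filtered read count and the average read length reproduces the reported yield of approximately 547 Mb, which suggests the quoted figures are internally consistent.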
To determine their correct order and orientation, contigs from our six SMRT cell assembly were aligned to the complete genome of E. coli SE15 using Mauve v. 2.3.1 [24]. Contig ordering was confirmed by PCR. Overlapping but un-joined contigs, a characterised artefact of the HGAP assembly process [23], were manually trimmed based on sequence similarity and joined. All joins were manually inspected using ACT [25] and Contiguity (http://mjsull.github.io/Contiguity/).
A single contig representing the EC958 large plasmid pEC958 was identified and isolated by BLASTn comparison against the previous draft assembly of EC958 (NZ_CAFL00000000.1) [7]. Overlapping sequences on the 5′ and 3′ ends of the plasmid contig were then manually trimmed based on sequence similarity. Although the EC958 small plasmid (pEC958B) was too small to be assembled as part of the main assembly, 25 unassembled PacBio reads, with an average length of 2,031 bp, were found to align to the small 4,080 bp plasmid contig that had previously been assembled from 454 GS-FLX reads (emb|CAFL01000138).
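The idea behind trimming overlapping ends of a circular contig can be illustrated with a short sketch. This is not the procedure used here, which was manual and based on sequence similarity. The sketch assumes an exact match between the start and end of the contig, whereas real contig ends may differ by a few bases and would need an alignment-based comparison. The function names, the `min_len` threshold, and the example sequence are hypothetical.

```python
def find_end_overlap(seq: str, min_len: int = 20) -> int:
    """Return the length of the longest suffix of seq that exactly
    matches its prefix (at least min_len long), or 0 if none exists."""
    max_len = len(seq) // 2
    # Try the longest candidate first so the full overlap is reported.
    for k in range(max_len, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def trim_circular(seq: str, min_len: int = 20) -> str:
    """Remove the duplicated 3' end of a contig from a circular replicon."""
    k = find_end_overlap(seq, min_len)
    return seq[:-k] if k else seq


# Toy example: a made-up circular sequence whose first 25 bases were
# assembled a second time at the 3' end of the linear contig.
circle = (
    "ATGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTTGCATGCCTGCAGG"
    "TCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTC"
)
linear_contig = circle + circle[:25]

trimmed = trim_circular(linear_contig)
print(find_end_overlap(linear_contig), len(linear_contig), len(trimmed))
```

After trimming, the contig contains each base of the circular molecule once, so it can be closed without duplicated sequence at the junction.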
To determine whether reads containing unremoved adapter sequence had affected the assembly of EC958, we first screened the filtered subreads for adapter sequence using BBMap version 31.40 (http://sourceforge.net/projects/bbmap/). A high level of adapter contamination would likely pose some risk of misassembly. Additionally, to eliminate the possibility that aberrant reads had introduced assembly artefacts into the EC958 genome assembly, contig ends were screened for hairpin artefacts using MUMmer version 3.23 [26].
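A hairpin artefact at a contig end appears as terminal sequence that is the reverse complement of the sequence immediately preceding it. The sketch below illustrates that signature with an exact k-mer comparison. It is a simplified stand-in and not the MUMmer-based screen described above. The function names, the `window` and `k` parameters, and the example sequences are hypothetical, and exact k-mer matching would be less tolerant of sequence differences than an alignment.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def hairpin_score(contig: str, window: int = 500, k: int = 15) -> float:
    """Fraction of k-mers in the terminal window (3' end) that also occur in
    the reverse complement of the adjacent inner window. Values near 1.0
    suggest a hairpin-like end; values near 0.0 suggest a normal end."""
    tail = contig[-window:]
    inner_rc = revcomp(contig[-2 * window:-window])
    inner_kmers = {inner_rc[i:i + k] for i in range(len(inner_rc) - k + 1)}
    tail_kmers = [tail[i:i + k] for i in range(len(tail) - k + 1)]
    if not tail_kmers:
        return 0.0
    hits = sum(kmer in inner_kmers for kmer in tail_kmers)
    return hits / len(tail_kmers)


# Toy sequences (made up for illustration).
body = "CCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTC"
stem = "ATGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTTG"

normal_contig = body + stem
hairpin_contig = body + stem + revcomp(stem)  # stem folded back on itself

normal_score = hairpin_score(normal_contig, window=40, k=15)
hairpin_end_score = hairpin_score(hairpin_contig, window=40, k=15)
print(normal_score, hairpin_end_score)
```

To screen the 5′ end of a contig with the same function, one option is to pass `revcomp(contig)`, since a hairpin at the 5′ end becomes a hairpin at the 3′ end of the reverse-complemented sequence.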