For each strain, a list of raw reads covering tal gene regions was generated by using blasr (Chaisson & Tesler, 2012) to align reads to the BLS256 tal gene sequences, following the PacBio hgap Whitelisting protocol (PacBio, 2013a). Next, a modification of the RS_PreAssembler protocol included in SMRTAnalysis 2.0 was run on these reads. In this modification, which we designated the RS_PreAssembler_TALs protocol, the ‘whiteList’ parameter for the filtering step was set to the tal gene read list. The minimum read-length cut-off was set to 4000 bp, the seed read-length cut-off was set to 16000 bp to ensure that short-read to long-read alignments used for correction would be long enough to be unambiguous, and the maxLCPLength was set to 14, as recommended for data using the XL-C2 enzyme and chemistry (PacBio, 2013b). Specifically, the blasr options string was changed to ‘-minReadLength 4000 -maxScore -1000 -bestn 24 -maxLCPLength 14 -nCandidates 24’.
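As an illustration only, the blasr options string quoted above can be assembled from named settings, which makes the individual values easier to audit and change. This is a minimal sketch: the dictionary name `BLASR_OPTIONS` and the helper `format_blasr_options` are hypothetical and are not part of SMRTAnalysis or the pbx toolkit. Note that the negative score must be written with an ASCII hyphen when passed on a command line.

```python
# Hypothetical helper: builds the blasr options string reported in the text.
# The option names and values are taken from the text; the dictionary and
# function names are illustrative assumptions.
BLASR_OPTIONS = {
    "minReadLength": 4000,  # minimum read-length cut-off
    "maxScore": -1000,      # ASCII hyphen, not a typographic minus sign
    "bestn": 24,
    "maxLCPLength": 14,     # value recommended for XL-C2 data
    "nCandidates": 24,
}


def format_blasr_options(options):
    """Join option names and values into a single '-name value' string."""
    return " ".join(f"-{name} {value}" for name, value in options.items())


options_string = format_blasr_options(BLASR_OPTIONS)
print(options_string)
```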
After preassembly, corrected reads were trimmed to estimated QV50 windows and filtered to those longer than 4000 bp using the SMRTAnalysis 2.0 trimFastqByQVWindow.py utility. Based on comparison with the reference genomes, these reads are typically 97% accurate. Reads were assembled using the Minimo assembler of amos 3.1.0 (Treangen et al., 2011), using NUCmer 3.1 (Kurtz et al., 2004) for the overlap step, for all 16 combinations of minimum overlap lengths of 500, 1000, 2000 and 3000 bp and minimum overlap identities of 91, 93, 95 and 97 per cent. Contig sets generated by each of these assemblies were polished separately with the RS_Resequencing protocol included in SMRTAnalysis 2.0. This protocol aligns reads to the assembled regions and uses the Quiver algorithm to call the consensus, regularly achieving 99.999% accuracy in regions with ≥60× coverage (Chin et al., 2013). For this, read filtering settings were set to those used for preassembly, the ‘Place Repeats Randomly’ option was unchecked and all other settings were left at defaults.
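The 16 assembly settings described above form a full 4 × 4 grid of overlap length and overlap identity. The sketch below only enumerates that grid; it does not invoke Minimo or NUCmer, and the names `assembly_grid`, `min_overlap_length` and `min_overlap_identity` are hypothetical labels chosen for illustration.

```python
from itertools import product

# Values taken from the text; names are illustrative assumptions.
MIN_OVERLAP_LENGTHS = (500, 1000, 2000, 3000)  # bp
MIN_OVERLAP_IDENTITIES = (91, 93, 95, 97)      # per cent


def assembly_grid(lengths=MIN_OVERLAP_LENGTHS, identities=MIN_OVERLAP_IDENTITIES):
    """Return every (length, identity) pairing as a list of dictionaries."""
    return [
        {"min_overlap_length": length, "min_overlap_identity": identity}
        for length, identity in product(lengths, identities)
    ]


grid = assembly_grid()
for settings in grid:
    # Each entry corresponds to one assembly that is later polished separately.
    print(settings["min_overlap_length"], settings["min_overlap_identity"])
```

Enumerating the grid explicitly keeps each assembly, and its polished contig set, tied to the settings that produced it.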
RVD sequences were determined from the 16 polished tal gene assemblies using a consensus approach. For each contig across all polished assemblies, encoded TAL effector CRRs were extracted and split into RVD sequences by conserved boundaries. A sorted list of unique RVD sequences and the number of times each was encountered in the 16 assemblies was then inspected (e.g. File S1, available in the online Supplementary Material), and sequences ending in frameshifts or other anomalies that were prefixes of other, more frequently occurring sequences were discarded. The resulting list was retained as the correct RVD sequences. As an additional measure in case any tal genes were incompletely assembled before polishing, assemblies of the polished contigs in each set were carried out, again with Minimo, and the RVD sequence consensus process was repeated. In all cases the results were identical.
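The counting and prefix-discarding step of the consensus approach can be sketched as follows. This is a simplified illustration, not the pbx implementation: it assumes each RVD sequence has already been extracted as a tuple of RVD strings with any anomalous tail removed, it automates only the prefix-and-frequency rule (the text describes inspection of the sorted list), and the function names and toy data are hypothetical.

```python
from collections import Counter


def count_rvd_sequences(assemblies):
    """Count how often each RVD sequence occurs across all assemblies."""
    return Counter(tuple(seq) for assembly in assemblies for seq in assembly)


def is_proper_prefix(short, long):
    """True if `short` is a strictly shorter leading segment of `long`."""
    return len(short) < len(long) and long[:len(short)] == short


def discard_truncated(counts):
    """Drop sequences that are prefixes of a more frequent sequence."""
    kept = {}
    for seq, n in counts.items():
        truncated = any(
            is_proper_prefix(seq, other) and m > n
            for other, m in counts.items()
        )
        if not truncated:
            kept[seq] = n
    return kept


# Toy data: three assemblies, one of which contains a truncated sequence.
toy_assemblies = [
    [("NI", "HD", "NG", "NN"), ("HD", "HD", "NG")],
    [("NI", "HD", "NG", "NN"), ("HD", "HD", "NG")],
    [("NI", "HD"), ("HD", "HD", "NG")],
]
counts = count_rvd_sequences(toy_assemblies)
consensus = discard_truncated(counts)
for seq in sorted(consensus):
    print("-".join(seq), consensus[seq])
```

In the toy data, the two-RVD sequence is dropped because it is a prefix of a four-RVD sequence that occurs more often, while both complete sequences are retained.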
This workflow for assembly of tal genes and extraction of encoded RVD sequences, which we have named the pbx toolkit, is automated and available on GitHub (https://github.com/boglab/pbx). The only required input is the path to a folder containing bas.h5 and bax.h5 files of raw sequence reads. Additional options allow specifying the sequences to use for identifying tal gene reads and the conserved repeat boundaries to use for RVD sequence determination. This enables the workflow to be easily adapted for use with other Xanthomonas genomes.