Repeats in the genome assembly of P. sorghi were defined with RepeatModeler v1.73 (Smit and Hubley 2008) and masked with RepeatMasker v4.0.9 (Smit et al. 2013). The same library was used to identify repeats in the transcriptome assembly. Gene models were annotated in the genome assembly using MAKER (Cantarel et al. 2008), with additional putative effectors identified using hidden Markov models (HMMs) with HMMER (Eddy 2011) and regular expression string searches of ORFs (Fletcher and Michelmore 2018). The MAKER pipeline was provided with the RepeatModeler profile as well as assembled transcripts and translated ORFs from the transcriptome of
P. sorghi, all described above, plus ESTs (option: altest) and protein sequences of other oomycete species available from NCBI. MAKER was initially run without a SNAP HMM, inferring genes using est2genome and protein2genome. These predictions were used to train a SNAP HMM (Korf 2004) that was used for a subsequent run of MAKER with both est2genome and protein2genome set to 0. The predicted proteins were again used to train a new SNAP HMM (Campbell et al. 2014). This process was repeated twice to generate three SNAP HMMs, which were used sequentially in three independent runs of MAKER. The annotations produced were evaluated as previously described (Fletcher
et al. 2018) to select a single optimal run. This involved contrasting the number of gene models predicted, mean protein length, BLASTp hits to other oomycete annotations, and Pfam domains annotated by InterProScan (Finn et al. 2014; Jones et al. 2014). Annotation of genes encoding putative effectors was performed as previously described (Fletcher
et al. 2018). Briefly, the entire genome was translated into ORFs. These ORFs were surveyed for secretion signals using SignalP3.1 and SignalP4.0, and for crinkler (CRN) motifs or (L)WY domains using HMMs. For peptides with secretion signals, the 60 residues beyond the predicted cleavage site were surveyed for an RXLR or RXLR-like motif and subsequently for a downstream EER or EER-like motif. ORFs encoding peptides that were predicted to be secreted and contained an (L)WY domain or a CRN motif were considered high-confidence putative effectors (HCPEs). ORFs encoding peptides that were predicted to be secreted and contained an RXLR and EER motif but did not contain an (L)WY domain, or encoding peptides that were not predicted to be secreted but contained an (L)WY domain or a CRN motif, were considered low-confidence putative effectors (LCPEs). The putative effectors and MAKER annotations were reconciled so that annotations did not overlap on the same strand. This was performed so that (1) any HCPE or LCPE annotations that did not overlap a MAKER annotation were added to the master annotation; for
P. sorghi this was 12 HCPEs and 122 LCPEs. (2) HCPEs that overlapped MAKER annotations with the same start coordinates but earlier stop coordinates were discarded; for P. sorghi this was six peptides. (3) HCPEs that overlapped MAKER annotations with the same start coordinates but later stop coordinates replaced the model proposed by MAKER if they had a higher BLASTp score to the NCBI nr database than the overlapping MAKER model; for P. sorghi this was six peptides. (4) HCPEs that overlapped MAKER annotations but had different start coordinates and later or identical stop coordinates were retained over proposed MAKER models; for P. sorghi this was 27 peptides. (5) HCPEs that overlapped MAKER annotations but had different start coordinates and earlier stop coordinates were investigated to determine if the MAKER model should have a modified start coordinate; for P. sorghi this was six peptides. (6) Any LCPEs that overlapped MAKER annotations were discarded; for P. sorghi this was 142 ORFs. The same effector prediction workflow was then applied to the reconciled annotation set to determine the reported effector counts. Tracks for repeats, transcript coverage, annotation, and effector annotations were generated in 100 kb windows along each chromosome using Bedtools v2.29.2 and plotted using Circos (Krzywinski et al. 2009).
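
The six reconciliation rules above can be read as a small decision procedure. The Python sketch below is illustrative only and is not the authors' code. The `Model` class, its `blastp_score` field, the `reconcile` function, and the returned action labels are hypothetical names introduced here. Coordinates are assumed to be given in transcript orientation on a single strand, so that "earlier" and "later" are simple integer comparisons. Two cases are not spelled out in the text and are treated as assumptions: a rule 3 candidate that does not score higher than the MAKER model, and a putative effector identical to the MAKER model, both of which keep the MAKER model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    """Gene model in transcript orientation (hypothetical container)."""
    start: int
    stop: int
    blastp_score: float = 0.0  # assumed: best BLASTp score against NCBI nr


def reconcile(effector: Model, confidence: str, maker: Optional[Model]) -> str:
    """Return the action for one putative effector.

    `maker` is the overlapping MAKER model on the same strand,
    or None when the effector overlaps no MAKER annotation.
    """
    if confidence not in ("HCPE", "LCPE"):
        raise ValueError("confidence must be 'HCPE' or 'LCPE'")

    if maker is None:
        return "add"  # rule 1: no overlap, add to the master annotation
    if confidence == "LCPE":
        return "discard"  # rule 6: overlapping LCPEs are discarded

    if effector.start == maker.start:
        if effector.stop < maker.stop:
            return "discard"  # rule 2: same start, earlier stop
        if effector.stop > maker.stop:
            # rule 3: same start, later stop; replace only on a higher score
            if effector.blastp_score > maker.blastp_score:
                return "replace_maker"
            return "keep_maker"  # assumption: text does not state this branch
        return "keep_maker"  # assumption: identical models, nothing to change

    if effector.stop >= maker.stop:
        return "replace_maker"  # rule 4: different start, later or same stop
    return "review_start"  # rule 5: different start, earlier stop


if __name__ == "__main__":
    maker_model = Model(start=10, stop=200, blastp_score=40.0)
    print(reconcile(Model(10, 300, 50.0), "HCPE", maker_model))
    print(reconcile(Model(40, 150), "HCPE", maker_model))
```

Returning a label rather than editing an annotation in place keeps rule 5, which calls for manual investigation, separate from the rules that can be applied automatically.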
Fletcher K., Martin F., Isakeit T., Cavanaugh K., Magill C., & Michelmore R. (2023). The genome of the oomycete Peronosclerospora sorghi, a cosmopolitan pathogen of maize and sorghum, is inflated with dispersed pseudogenes. G3: Genes|Genomes|Genetics, 13(3), jkac340.