SNP and indel sequence variation was assessed across a panel of cultivars relative to the cv. Tanjil reference genome. NGS reads were aligned to the cv Tanjil reference genome via bowtie v 2.0.5 (–very‐sensitive) (Langmead and Salzberg, 2012), and variants were called via the Genome Analysis Tookit 3.4‐46 (McKenna et al., 2010). GATK was used to perform read deduplication via Markduplicates, then variant calling with HaplotypeCaller (–stand_call_conf 20 –stand_emit_conf 20 –min_pruning 5), producing variant data in VCF format (Danecek et al., 2011). Genome comparisons were visualized using Circos v0.67‐1 (Krzywinski et al., 2009).
Orthologous gene clusters were predicted via OrthoMCL (Li et al., 2003) comparing translated annotations of NLL to protein datasets from C. cajan (Varshney et al., 2012), C. arietinum (Varshney et al., 2013), G. max (Schmutz et al., 2010), M. truncatula (Young et al., 2011), P. vulgaris (Schmutz et al., 2014) and A. thaliana (Initiative, 2000).
Analysis of rates of silent‐site substitutions was carried out by searching all peptides against all others for the species Lupinus angustifolius, Glycine max (v 2.0), Lotus japonicus (v 3.0), Medicago truncatula (v 4.0) and Phaseolus vulgaris (v 1.0). Top respective matches were retained between each species per chromosome pairing (allowing for multiple total hits between two species for a given query gene), and within each species (for analysis of whole‐genome duplications). Then in‐frame alignments of coding sequences were made for each retained peptide alignment. From alignments of coding sequences, values for Ks, Ka and Ka/Ks were calculated using the ‘codeml’ method from the PAML package (Yang, 2007). Also from protein alignments, synteny blocks were inferred using DAGchainer (Haas et al., 2004). From the per‐gene‐pair alignments and the synteny blocks, median Ks values for blocks were calculated and used for Ks histogram peaks (Figure 2).
Ages of species divergences and whole‐genome duplications (Figure S7) were calculated from modal Ks peaks (Data S7), by treating initially unknown branch lengths in the known species/duplication tree as variables in a set of equations. The species/duplication tree was rooted at the papilionoid whole‐genome duplication, which predated the main papilionoid radiation (Cannon et al., 2015). A time of 58 Mya for the initial papilionoid radiation was assumed (Lavin et al., 2005). There were 11 unknown branch lengths in the tree in Figure S7, and sufficient data from the modal distances between and within species comparisons to solve for these unknowns algebraically.
To evaluate evidence for a whole‐genome triplication (WGT), synteny blocks were identified using DAGchainer, and synteny coverage depth was calculated using the BEDTools v2.25.0 (Quinlan and Hall, 2010) ‘coverage’ function to make comparisons between other genomes and NLL as the reference, or between NLL and each other genome as the reference. Coverage of synteny blocks was calculated at each nucleotide position using the ‐d option and summarized per coverage depth level.
For visual dot plot assessments of NLL compared with itself and with other legume genomes, we used promer and mummerplot from the MUMmer package (Kurtz et al., 2004), (v3.23) to make comparisons of translated nucleotide sequence, on genomic sequence that was masked for all except exonic sequence. The promer results were filtered to require at least 80% identity.
Free full text: Click here