SMs were called using three different variant callers, each of which relied on a different underlying alignment tool. Sniffles v1.0.12b (Sedlazeck et al. 2018) was used to call SMs based on the pbmm2 read alignments described above. BAM files were preprocessed using SAMtools calmd to generate the MD tag, which records the positions at which each read mismatches the reference. Sniffles was first run on each MA line individually, and the resulting VCF files were merged using SURVIVOR v1.0.7 (Jeffares et al. 2017). Following the pipeline recommended for population calling (https://github.com/fritzsedlazeck/Sniffles/wiki/), Sniffles was then run again with the option “--Ivcf” and the merged VCF as input. This population calling enables consistent presence or absence calls for SMs across all MA lines within a strain. SURVIVOR was used again to generate a multisample VCF.
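The two-pass Sniffles workflow described above can be sketched as an ordered list of shell commands. This is an illustration only, not the exact commands used in this study: the file names, the working directory, and the numeric SURVIVOR merge settings (`SURVIVOR_ARGS`) are assumptions, as is the helper name `population_calling_commands`. The function builds the command strings without executing them.

```python
from pathlib import Path

# Assumed SURVIVOR merge settings (max breakpoint distance, min supporting
# calls, use type, use strand, distance estimation, min size). These are
# placeholders, not values reported in the text.
SURVIVOR_ARGS = "1000 1 1 -1 -1 -1"


def population_calling_commands(bams, reference, workdir="sm_calls"):
    """Return shell commands for per-line calling, merging, and forced calling."""
    work = Path(workdir)
    stems = [Path(b).stem for b in bams]
    md_bams = [work / f"{s}.md.bam" for s in stems]
    raw_vcfs = [work / f"{s}.raw.vcf" for s in stems]
    forced_vcfs = [work / f"{s}.forced.vcf" for s in stems]
    raw_list, forced_list = work / "raw_vcfs.txt", work / "forced_vcfs.txt"
    merged, multisample = work / "merged.vcf", work / "multisample.vcf"

    cmds = []
    # 1. Add the MD tag to every alignment file.
    for bam, md in zip(bams, md_bams):
        cmds.append(f"samtools calmd -b {bam} {reference} > {md}")
    # 2. First pass: call each MA line on its own.
    for md, vcf in zip(md_bams, raw_vcfs):
        cmds.append(f"sniffles -m {md} -v {vcf}")
    # 3. Merge the per-line calls into one candidate set.
    cmds.append(f"printf '%s\\n' {' '.join(map(str, raw_vcfs))} > {raw_list}")
    cmds.append(f"SURVIVOR merge {raw_list} {SURVIVOR_ARGS} {merged}")
    # 4. Second pass: force-call every candidate in every line.
    for md, vcf in zip(md_bams, forced_vcfs):
        cmds.append(f"sniffles -m {md} -v {vcf} --Ivcf {merged}")
    # 5. Merge the forced calls into a multisample VCF.
    cmds.append(f"printf '%s\\n' {' '.join(map(str, forced_vcfs))} > {forced_list}")
    cmds.append(f"SURVIVOR merge {forced_list} {SURVIVOR_ARGS} {multisample}")
    return cmds
```

Printing the returned list for two hypothetical BAM files gives ten commands, which can be reviewed before being run in a shell.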
MUM&Co v3 (O'Donnell and Fischer 2020) was used to call SMs from individual alignments of each MA line assembly to its ancestral reference, setting a genome size of 110 Mb (“-g 110000000”). MUM&Co calls variants from alignments produced by MUMmer v4 (Marçais et al. 2018), which it runs as part of a single script. Variants were obtained as TSV and VCF files.
The variation graph tool (vg) (Garrison et al. 2018) was used to call variants directly from the pangenome alignments using the deconstruct command (“--path-traversals”). The resulting VCF file for each strain was filtered to retain only variants >50 bp.
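A size filter of this kind can be sketched in a few lines. The sketch below assumes that every record carries explicit REF and ALT sequences and that variant size is measured as the largest length difference between REF and any ALT allele; that definition is an assumption, and balanced events such as inversions would need separate handling. The function names are hypothetical.

```python
def variant_size(ref, alts):
    """Largest absolute length difference between REF and any ALT allele."""
    return max(abs(len(alt) - len(ref)) for alt in alts)


def filter_vcf_lines(lines, min_size=50):
    """Yield header lines plus records whose size is strictly above min_size."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")  # VCF columns REF and ALT
        if variant_size(ref, alts) > min_size:
            yield line
```

The generator can be applied to an open file handle and its output written line by line to a new VCF.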
All called variants in callable regions were manually curated via visualization of read and assembly alignments using the Integrative Genomics Viewer (IGV) (Robinson et al. 2011). SMs were rejected if they were not supported unambiguously by the read alignments. Read support for very large SMs was visualized via Ribbon v1.1 (Nattestad et al. 2021), which enables the visualization of reads mapping to discordant genomic regions. Supplemental Figures S12–S26 provide examples of SM visualization and curation. Most variants were entirely spanned by the reads, leading to simple visual confirmation in IGV, but variants >30 kb in length (approximately the upper limit of read lengths), including large inversions and translocations, required additional curation. In addition to read support from Ribbon, these rearrangements were traced in the MA line assemblies by manually assessing the discordant mapping of MA line contigs in the PAF alignment files (see Supplemental Fig. S23). Complex SMs, including large rearrangements and duplications, were further visualized using Ribbon.
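The assessment of discordant contig mappings was manual, but candidate contigs can be shortlisted programmatically before inspection. The sketch below is one possible approach under stated assumptions: it reads standard 12-column PAF records and reports contigs whose long alignments land on more than one target sequence or strand. The function name and the 10 kb alignment-length threshold are hypothetical.

```python
from collections import defaultdict


def discordant_contigs(paf_lines, min_aln_len=10_000):
    """Map contig name to its (target, strand) placements when more than one exists."""
    placements = defaultdict(set)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated records
        qname, strand, tname = fields[0], fields[4], fields[5]
        aln_len = int(fields[10])  # PAF column 11: alignment block length
        if aln_len >= min_aln_len:
            placements[qname].add((tname, strand))
    # A contig split across targets or strands is a rearrangement candidate.
    return {q: sorted(p) for q, p in placements.items() if len(p) > 1}
```

Contigs returned by this function are only candidates; short repeats and assembly errors can produce the same pattern, so each still needs visual confirmation.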
Duplications and deletions were curated as tandem repeat expansions or contractions if they involved the duplication or deletion of one or more monomers of a tandem repeat. Most fell within existing tandem repeat annotations, that is, satellites and microsatellites, whereas a small number required manual inspection of indel flanks by self-vs-self dotplots generated using the MAFFT v7 online server (Katoh et al. 2019). Deletions that perfectly intersected with TEs annotated by RepeatMasker in the ancestor genome were called as mobile excisions. Mobile insertions for described TE families were identified as cases in which the inserted sequence had a near-perfect BLASTN match (Camacho et al. 2009) to the Chlamydomonas repeat library (Craig 2021). These hits all had expected length distributions; LINE and PLE insertions frequently contained only the 3′ end owing to 5′ truncation, whereas insertions of other TEs corresponded to the entire length of the TE. In cases in which an inserted sequence had no match to an existing TE model, we queried the insert sequence against the ancestor genome, extracted and aligned hits, and manually curated new consensus sequences following established protocols for mobile element annotation (Goubert et al. 2022). All insertions unambiguously matched either the existing or newly produced consensus sequences and could be neatly assigned to specific mobile element families. The one exception to this pattern was the duplications mediated by Dualen LINEs, where the sequence called as an insertion partly matched Dualen-4b_cRei and partly matched the sequence immediately flanking the insertion. These Dualen-mediated duplications were manually split into two called SMs: one mobile insertion and one duplication of the appropriate lengths.
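The mobile excision criterion can be illustrated with a simple interval comparison. The sketch below assumes that "perfectly intersected" means the deletion and the annotated TE share the same start and end coordinates, optionally within a small tolerance; that reading, the `Interval` type, the `matching_te` helper, and the half-open coordinate convention are all assumptions made for the example.

```python
from typing import NamedTuple, Optional, Sequence


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str = ""


def matching_te(deletion: Interval, tes: Sequence[Interval],
                tolerance: int = 0) -> Optional[str]:
    """Return the name of a TE whose boundaries coincide with the deletion."""
    for te in tes:
        same_chrom = te.chrom == deletion.chrom
        same_start = abs(te.start - deletion.start) <= tolerance
        same_end = abs(te.end - deletion.end) <= tolerance
        if same_chrom and same_start and same_end:
            return te.name  # candidate mobile excision
    return None  # not an excision under this criterion
```

A nonzero tolerance may be useful in practice because annotation boundaries and called breakpoints rarely agree to the base, but any such value would be a judgment call made during curation.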
When curating inversions and translocations, we noticed that many events featured additional insertions at the rearrangement breakpoints that were not specifically detected by the variant callers. As above, these insertions were compared to the annotated TEs and defined as mobile insertions of specific TE families. Five rearrangements could not be fully characterized because only one of the two breakpoints was clearly supported; the other fell in an uncallable region. These were arbitrarily classified as translocations.