To evaluate assembly quality, several validation tools were applied. Both REAPR [36] and FRCbam [35] use paired Illumina reads to evaluate an assembly, giving a measure of the number of potential errors. Instead of the raw reads, we used error-corrected reads dumped from the ALLPATHS-LG assembly, which reduced the running time of both the alignment step and the tools themselves.
Isoblat was used to determine how much of the Newbler transcriptome assembly of 454 and Sanger reads aligned to the different assemblies [37]. It was run with default options.
CEGMA is a tool that annotates 458 highly conserved genes in an assembly, and it can be used to assess the completeness of the genome assembly [38, 65]. Version 2.4 was applied to all versions of the assemblies.
BUSCO is similar to CEGMA in that it assesses the completeness of a genome by searching for a set of universal single-copy orthologs [39]. In this study, we used the Actinopterygii-specific set of 3698 genes to investigate the completeness of the assemblies generated here.
A linkage map for Atlantic cod has been created from a set of 9355 SNPs (personal communication, Sigbjørn Lien). We used blat_parse.py to compare the linkage map to the different assemblies to evaluate their completeness and long-range correctness. Briefly, this involved mapping the flanking sequences of the SNPs to the assembly using BLAT version 3.5 [88] with the options "-noHead -maxIntron=100 genome.fasta flanking_sequences.fasta", and then parsing the output file while comparing it with the order of the SNPs in the linkage map. A conflict with the linkage map is defined as a sequence to which SNPs from more than one linkage group were mapped. Some SNPs mapped equally well to more than one linkage group, and these were excluded since we could not confidently judge which mapping was correct.
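The conflict definition above can be illustrated with a minimal Python sketch. It is not the blat_parse.py script used in the study. It assumes headerless, tab-separated PSL output from BLAT (matches in column 1, query name in column 10, target name in column 14), treats a SNP as ambiguous when its highest match count is tied across more than one assembly sequence, and does not check the order of SNPs within a linkage group. The function names `best_hits` and `find_conflicts` and the `linkage_group` dictionary are hypothetical.

```python
from collections import defaultdict


def best_hits(psl_lines):
    """Return {snp_id: sequence_name} for SNPs with a unique best-scoring hit.

    Assumes headerless PSL lines (BLAT -noHead), 21 tab-separated columns.
    SNPs whose top match count is tied across several sequences are dropped.
    """
    hits = defaultdict(list)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21:
            continue  # skip blank or malformed lines
        matches = int(fields[0])   # number of matching bases
        snp_id = fields[9]         # query name (SNP flanking sequence)
        sequence = fields[13]      # target name (scaffold or contig)
        hits[snp_id].append((matches, sequence))

    placed = {}
    for snp_id, candidates in hits.items():
        top = max(m for m, _ in candidates)
        top_sequences = {s for m, s in candidates if m == top}
        if len(top_sequences) == 1:  # keep unambiguous placements only
            placed[snp_id] = top_sequences.pop()
    return placed


def find_conflicts(placed, linkage_group):
    """Return {sequence_name: [linkage groups]} for sequences carrying SNPs
    from more than one linkage group."""
    groups = defaultdict(set)
    for snp_id, sequence in placed.items():
        if snp_id in linkage_group:
            groups[sequence].add(linkage_group[snp_id])
    return {s: sorted(g) for s, g in groups.items() if len(g) > 1}
```

Under these assumptions, `find_conflicts(best_hits(open("output.psl")), linkage_group)` would list the sequences that conflict with the linkage map, where `output.psl` is a placeholder file name and `linkage_group` maps each SNP identifier to its linkage group.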