Sequences were downloaded from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) (Daubentonia madagascariensis: SRP007603; Pan troglodytes: SRP012268 [SRX142913]). Raw sequences were preprocessed with Prinseq [32] to remove forward/reverse duplicates and with SeqPrep [33] to remove adapters and merge overlapping reads. All preprocessed sequences were passed through k-mer error correction using BFC [34], specifying the -s parameter for genome size. The multiplicity distribution of 23-mers was computed with Jellyfish2 [35] and KrATER [36] in order to estimate coverage. De novo genome assembly was performed with SOAPdenovo2 [37], using the sparse_pregraph module with the following parameters: -g 15 -d 4 -e 4 -R -r 0, and parameter -M 1 during the contig phase.
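As a rough illustration of how a k-mer multiplicity distribution yields a coverage estimate, the sketch below counts k-mers in plain Python and takes the peak of the histogram. It is not how Jellyfish2 or KrATER are implemented. The function names, the min_multiplicity cutoff used to ignore the low-count error tail, and the conversion from k-mer coverage to base coverage, C = C_k * L / (L - k + 1) for read length L, are assumptions made for this example. Unlike typical k-mer counters, it does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def kmer_multiplicity_histogram(reads, k=23):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def coverage_peak(histogram, min_multiplicity=2):
    """Return the multiplicity with the most distinct k-mers.

    Multiplicities below min_multiplicity are ignored (hypothetical cutoff),
    because sequencing errors produce many k-mers that occur only once.
    """
    candidates = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    if not candidates:
        raise ValueError("no k-mers at or above min_multiplicity")
    return max(candidates, key=candidates.get)


def base_coverage(kmer_coverage, read_length, k=23):
    """Convert k-mer coverage to per-base coverage for a fixed read length."""
    return kmer_coverage * read_length / (read_length - k + 1)


if __name__ == "__main__":
    toy_reads = ["ACGTAC"] * 3 + ["ACGTAA"]
    hist = kmer_multiplicity_histogram(toy_reads, k=4)
    peak = coverage_peak(hist)
    print(sorted(hist.items()), peak, base_coverage(peak, read_length=6, k=4))
```

On real data the histogram would come from the k-mer counter's output rather than from an in-memory dictionary, which does not scale to whole-genome read sets.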
Multiple sets of in silico mate-pairs were generated with Cross-mates. First, paired-end reads of the target organism are mapped onto the reference genome with BWA using default settings [38]. Then, a consensus is computed using samtools/bcftools [39] with the samtools legacy variant calling model. Read pairs are sampled from the consensus in systematic mode, i.e., using exact insert sizes and sampling fragments at regularly spaced offsets, skipping regions with coverage lower than three. For the chimpanzee assembly, 14 scaffolding libraries ranging from 500 bp to 200 kb were generated from the human reference at 10x coverage. For the aye-aye assembly, 16 scaffolding libraries ranging from 500 bp to 20 kb were generated from the human and lemur references, respectively, at 10x coverage.
Finally, gaps in the assembly were filled in using SOAPdenovo2 GapCloser [37]. Assembly statistics and mis-assemblies were measured with Quast [40]. Completeness and biological accuracy of assembly contiguity were measured by searching for 3,023 vertebrate orthologs as implemented in BUSCO [41] on a set of protein predictions generated by Augustus 3.1.0 [42]. Reference assembly sequences used for generating scaffolding libraries and benchmarking were obtained from NCBI: human (GRCh38.p8; GCF_000001405); gray mouse lemur Microcebus murinus (Mmur_2.0; GCF_000165445); aye-aye (DauMad-1.0; GCA_000241425). All steps used for creating in silico scaffolding libraries, including Cross-mates, have been implemented in the pipeline Cross-Species Scaffolding, which is publicly available and maintained on GitHub. An example of the Cross-mates command line scripts used for the pork tapeworm assembly experiments is included in Additional file 1.
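Among the contiguity statistics reported by assembly evaluation tools such as Quast is N50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The sketch below shows the calculation on a list of scaffold lengths; the function name and the toy input are assumptions for illustration, and it is not taken from the Quast code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
```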
For the pork tapeworm test assembly, in silico mate pairs were generated using the reference genomes of four species of tapeworms (Taenia saginata, T. asiatica, T. multiceps, and T. solium) at 10x coverage each, with multiple insert sizes ranging from 600 to 50,000 bp, and assembled with SOAPdenovo. For the yeast test, we used a different assembler (SPAdes [43]) for de novo assembly, with in silico mate pairs of 500, 2,000, 5,000, and 10,000 bp insert sizes at 10x coverage.