The 25,912 contig sequences (Acc. number CACC01000000) generated with Roche/454 sequencing and the Newbler assembler (Roche, Inc), used as the initial dataset;
The 88,000 Sanger BAC end reads (available on the Cocoa Genome Hub
The 398 million Illumina paired-end reads, used as short reads (SR) for error correction (available on the Cocoa Genome Hub
We created four large-insert mate-pair libraries of the Theobroma cacao B97–61/B2 genome with insert sizes of 3–5 kb, 5–8 kb, 8–11 kb and 11–15 kb using the Nextera Mate Pair Sample Preparation Kit (Illumina, San Diego, CA). These libraries were sequenced on an Illumina HiSeq 2000 to 40×, 35×, 19× and 10× genome coverage, respectively. The reads were trimmed using the following criteria: (i) sequences of the Illumina adapters and primers used during library construction were removed from the reads; (ii) nucleotides with a quality value <20 were removed from both ends; (iii) the longest sequence without adapters and low-quality bases was kept, and the sequence between the second unknown nucleotide (N) and the end of the read was trimmed; (iv) reads shorter than 30 nucleotides after trimming were discarded; (v) finally, reads and their mates that mapped onto run quality control sequences (PhiX genome) were removed. These trimming steps were performed using fastx_clean (
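Trimming criteria (ii) to (iv) can be illustrated with a minimal Python sketch. It is not the fastx_clean implementation: the function name `trim_read` and its parameters are hypothetical, qualities are assumed to be already decoded to integer Phred scores, and the second N itself is assumed to be removed along with everything after it. Adapter removal (i) and PhiX filtering (v) need an adapter list and a read mapper, so they are left out.

```python
def trim_read(seq, quals, min_q=20, min_len=30):
    """Apply criteria (ii)-(iv) to one read.

    seq   -- read sequence as a string
    quals -- list of integer Phred scores, same length as seq
    Returns (seq, quals) after trimming, or None if the read is discarded.
    Hypothetical helper for illustration only.
    """
    # (ii) remove bases with quality < min_q from both ends
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # (iii) trim from the second unknown nucleotide (N) to the end of the read
    # (assumption: the second N itself is also removed)
    first_n = seq.find("N")
    if first_n != -1:
        second_n = seq.find("N", first_n + 1)
        if second_n != -1:
            seq, quals = seq[:second_n], quals[:second_n]

    # (iv) discard reads shorter than min_len after trimming
    if len(seq) < min_len:
        return None
    return seq, quals
```

For mate-pair data, a read returned as `None` would normally require a decision about its mate as well (drop the pair or keep the mate as a singleton); the text above does not state which rule was applied for criterion (iv), so the sketch handles reads one at a time.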
We produced Pacific Biosciences sequencing data from 78 SMRT Cells with C2 chemistry, corresponding to 52× genome coverage of long-read (LR) data.