The bed bug is one of 30 i5K pilot genome assemblies that were subjected to automatic gene annotation using a Maker 2.0 (http://www.yandell-lab.org/software/maker.html) annotation pipeline tuned specifically for arthropods. The pipeline is designed to be systematic, providing a single consistent procedure for the species in the pilot study; scalable, to handle hundreds of genome assemblies; evidence guided, using both protein and RNA-seq evidence to guide gene models; and targeted, to utilize extant information on arthropod gene sets. The core of the pipeline was a Maker 2 instance, modified slightly to enable efficient running on our computational resources. The genome assembly was first subjected to de novo repeat prediction and CEGMA analysis (http://korflab.ucdavis.edu/datasets/cegma/) to generate gene models for initial training of the ab initio gene predictors (Supplementary Data 33). Three rounds of training of the Augustus (http://bioinf.uni-greifswald.de/augustus/) and SNAP (http://korflab.ucdavis.edu/software.html) gene predictors within Maker were used to bootstrap to a high-quality training set. Input protein data included 1 million peptides from a non-redundant (nr) reduction (90% identity) of UniProt Ecdysozoa (1.25 million peptides), supplemented with proteomes from 18 additional species (Strigamia maritima, Tetranychus urticae, Caenorhabditis elegans, Loa loa, Trichoplax adhaerens, Amphimedon queenslandica, Strongylocentrotus purpuratus, Nematostella vectensis, Branchiostoma floridae, Ciona intestinalis, Ciona savignyi, Homo sapiens, Mus musculus, Capitella teleta, Helobdella robusta, Crassostrea gigas, Lottia gigantea and Schistosoma mansoni), leading to a final nr peptide evidence set of 1.03 million peptides. RNA-seq from C. lectularius adult males and females was used judiciously to identify exon–intron boundaries, with a heuristic script applied to identify and split erroneously joined gene models. We used CEGMA models for QC purposes: for C. lectularius, of 1,977 CEGMA single-copy orthologue gene models, 1,928 were found in the assembly and 1,892 in the final predicted gene set. Finally, the pipeline uses a nine-way homology prediction with human, Drosophila and C. elegans, and InterProScan 5 to allocate gene names. The automated gene set is available from the BCM-HGSC website (https://www.hgsc.bcm.edu/arthropods/bed-bug-genome-project) and at the National Agricultural Library (https://i5k.nal.usda.gov).
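The CEGMA QC counts reported above can be restated as recovery percentages. The following is a minimal sketch of that arithmetic only, not part of the annotation pipeline; the function and variable names are hypothetical, and the three counts are the ones given in the text.

```python
def recovery_percent(found: int, total: int) -> float:
    """Return the percentage of reference gene models that were recovered."""
    if total <= 0:
        raise ValueError("total must be a positive count")
    return 100.0 * found / total


# Counts taken from the text; the names are illustrative only.
CEGMA_TOTAL = 1977          # CEGMA single-copy orthologue gene models
FOUND_IN_ASSEMBLY = 1928    # models found in the genome assembly
FOUND_IN_GENE_SET = 1892    # models found in the final predicted gene set

assembly_pct = recovery_percent(FOUND_IN_ASSEMBLY, CEGMA_TOTAL)
gene_set_pct = recovery_percent(FOUND_IN_GENE_SET, CEGMA_TOTAL)

# Difference between the two counts (not necessarily a strict subset relation).
count_gap = FOUND_IN_ASSEMBLY - FOUND_IN_GENE_SET

print(f"Found in assembly: {assembly_pct:.1f}%")
print(f"Found in gene set: {gene_set_pct:.1f}%")
print(f"Difference in counts: {count_gap}")
```

On these figures, about 97.5% of the CEGMA models are found in the assembly and about 95.7% in the final gene set, a difference of 36 models between the two counts.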