GenBank data were parsed using a combination of command-line and custom Perl scripts built on BioPerl modules [22]. Tabular data were formatted using Python and plotted in R [23]. We use the terminology of Nilsson et al. (2005) and refer to taxa identified to the species rank as ‘fully identified’ and all other taxa as ‘insufficiently identified’ [24]. We focused on NCBI nucleotide data deposited from 2003, the year COI barcoding was first introduced to the community, to the present (2017) [25].
The names and taxonomic identifications for all Eukaryotes annotated to the species rank were retrieved from the NCBI taxonomy database using the Entrez query "Eukaryota[ORGN]+AND+species[RANK]" with an ebot script [Accessed November 3, 2017] [26]. Taxa were filtered according to the contents of the species field so that only fully identified taxa with a complete Latin binomial (genus and species) were retained. Entries that contained the abbreviations sp., nr., aff., or cf. were discarded. The remaining species names were formatted for use in the next query [species list]. For each year from 2003–2017 [year], records in the NCBI nucleotide database containing COI sequences were retrieved using the Entrez query "(("CO1"[GENE] OR "COI"[GENE] OR "COX1"[GENE] OR "COXI"[GENE]) AND "Eukaryota"[ORGN] AND [year][PDAT]) AND [species list]" [2003–2016, accessed November 2017; 2017, accessed April 2018]. GenBank records were parsed, retaining the year of record deposition and the number of fully identified records. For fully identified records, sequence length as well as country and/or latitude-longitude fields were also parsed.
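The filtering and query-assembly steps above can be sketched in Python as follows. This is a minimal illustration, not the scripts used in the study: the function names, the exact binomial test, and the way the species list is joined (each name tagged with `[ORGN]` and combined with `OR`) are assumptions, since the text does not specify how the species list was formatted.

```python
import re
from typing import Iterable

# Abbreviations that mark an insufficiently identified taxon (from the text).
OPEN_NOMENCLATURE = {"sp.", "nr.", "aff.", "cf."}

# Gene symbols searched in the [GENE] field (from the text).
GENE_SYMBOLS = ("CO1", "COI", "COX1", "COXI")


def is_fully_identified(name: str) -> bool:
    """Return True if the name looks like a complete Latin binomial.

    Assumed rule: at least two tokens, none of them an open-nomenclature
    abbreviation, a capitalized genus, and a lowercase species epithet.
    """
    tokens = name.split()
    if len(tokens) < 2 or any(tok in OPEN_NOMENCLATURE for tok in tokens):
        return False
    genus, epithet = tokens[0], tokens[1]
    return bool(re.fullmatch(r"[A-Z][a-z]+", genus)) and bool(
        re.fullmatch(r"[a-z][a-z-]*", epithet)
    )


def build_coi_query(year: int, species: Iterable[str]) -> str:
    """Assemble an Entrez term for COI records deposited in one year.

    The species clause format is an assumption made for illustration.
    """
    genes = " OR ".join(f'"{g}"[GENE]' for g in GENE_SYMBOLS)
    species_list = " OR ".join(f'"{s}"[ORGN]' for s in species)
    return (
        f'(({genes}) AND "Eukaryota"[ORGN] AND {year}[PDAT]) '
        f"AND ({species_list})"
    )
```

Calling `build_coi_query` once per year from 2003 to 2017 with the filtered names would give one term per deposition year, mirroring the per-year retrieval described above.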
We also assessed the number of high-quality COI sequences that meet the standards developed between the INSDC and the Consortium for the Barcode of Life by looking for the BARCODE keyword in the GenBank record [11]. For each year from 2003–2017 [year], records in the NCBI nucleotide database containing COI BARCODE sequences were retrieved using the Entrez query "(("CO1"[GENE] OR "COI"[GENE] OR "COX1"[GENE] OR "COXI"[GENE]) AND "Eukaryota"[ORGN] AND [year][PDAT] AND "BARCODE"[KYWD]) AND [species list]". Fully identified and geotagged records were parsed as described above.
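As an illustration of what parsing these fields involves, the sketch below reads one record in GenBank flat-file format and reports its sequence length, whether BARCODE appears on the KEYWORDS line, and the `/country` and `/lat_lon` source qualifiers. It is a plain-text approximation written for this example; the study used BioPerl, and the function names and the rule that a record counts as geotagged when either qualifier is present are assumptions.

```python
import re
from typing import Optional


def _qualifier(flatfile: str, name: str) -> Optional[str]:
    """Return the first value of a /name="..." qualifier, or None."""
    match = re.search(rf'/{name}="([^"]*)"', flatfile)
    # Collapse whitespace in case the value wraps across lines.
    return " ".join(match.group(1).split()) if match else None


def summarize_record(flatfile: str) -> dict:
    """Extract length, BARCODE status, and geographic fields from one record."""
    # The LOCUS line carries the sequence length, e.g. "658 bp".
    length = re.search(r"^LOCUS\s+\S+\s+(\d+) bp", flatfile, flags=re.M)

    # KEYWORDS may wrap onto indented continuation lines; capture up to the
    # next line that starts in column one (normally SOURCE).
    kw = re.search(r"^KEYWORDS\s+(.*?)(?=^\S)", flatfile, flags=re.M | re.S)
    keywords = []
    if kw:
        text = " ".join(kw.group(1).split()).rstrip(".")
        keywords = [k.strip() for k in text.split(";") if k.strip()]

    country = _qualifier(flatfile, "country")
    lat_lon = _qualifier(flatfile, "lat_lon")
    return {
        "length": int(length.group(1)) if length else None,
        "is_barcode": "BARCODE" in keywords,
        "country": country,
        "lat_lon": lat_lon,
        # Assumed definition: either geographic field is enough.
        "geotagged": country is not None or lat_lon is not None,
    }
```

A dedicated parser such as BioPerl is the more robust choice for real data, because it handles multi-record files and unusual qualifier layouts that this regular-expression sketch does not.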
For our application example on freshwater biomonitoring, we retrieved a high-level list of relevant groups from Elbrecht and Leese (2017) to facilitate comparisons across studies [27]. Target freshwater taxa included: Annelida classes Clitellata and Polychaeta; Insecta (Arthropoda) orders Coleoptera, Diptera, Ephemeroptera, Megaloptera, Odonata, Plecoptera, and Trichoptera; Malacostraca (Arthropoda) orders Amphipoda and Isopoda; Mollusca classes Bivalvia and Gastropoda; and Platyhelminthes class Turbellaria. These groups are likely to include some non-freshwater taxa; however, this method allowed us to quickly gauge the representation of the freshwater taxa they contain. These are also the groupings often used to summarize results from COI freshwater biomonitoring assessments. A detailed look at specific freshwater taxa at finer taxonomic levels is beyond the scope of this paper and will be published elsewhere. For each freshwater target group, we queried the NCBI taxonomy database for records identified to the species rank as described above. The resulting taxon ids were concatenated and used to query the NCBI nucleotide database as described above. We then assessed the representation of freshwater indicator taxa in the NCBI nucleotide database and their level of annotation as described above.
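One way to concatenate taxon ids into an Entrez term is sketched below in Python. The `txid<N>[ORGN]` form is standard Entrez syntax for restricting a search to a taxon, but the function name, the `OR` joining, and the batching (with an arbitrary `batch_size`) are assumptions for illustration; the text does not say how the ids were combined.

```python
from typing import Iterable, Iterator


def taxid_clauses(taxon_ids: Iterable[int], batch_size: int = 500) -> Iterator[str]:
    """Yield parenthesized Entrez clauses, each covering one batch of taxon ids.

    Ids are de-duplicated and sorted so the output is reproducible.
    Batching keeps each query term to a manageable length; the default
    batch size is an arbitrary illustrative value.
    """
    unique_ids = sorted({int(t) for t in taxon_ids})
    for start in range(0, len(unique_ids), batch_size):
        batch = unique_ids[start:start + batch_size]
        yield "(" + " OR ".join(f"txid{t}[ORGN]" for t in batch) + ")"
```

Each yielded clause could then take the place of the species list in the COI query, so that one nucleotide search is issued per batch of taxon ids.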
For our application example on IUCN endangered animal species, we retrieved a list of endangered species names from http://www.iucnredlist.org for all available years (1996, 2000, 2002–2004, 2006–2017), filtering the results for native Animalia species [Accessed Dec. 12, 2017]. We excluded insufficiently identified species containing the terms ‘affinis’, ‘sp.’, or ‘sp. nov.’, leaving a list of 4,289 endangered animal species as well as 2,089 synonyms. We submitted this combined list of species names to the ‘NCBI Taxonomy name/id Status Report Page’ (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi) and retrieved a list of 2,613 taxon ids. For each taxon id, we queried the NCBI taxonomy and nucleotide databases as described above.
To assess the number of COI records unique to the BOLD database compared with the NCBI nucleotide database, we also retrieved records from the BOLD Application Programming Interface (API) as well as from the BOLD data releases. Since the BOLD database contains records from several DNA barcode markers, such as ITS rDNA for fungi and COI mtDNA for animals, it was necessary to target just the COI records. COI sequences were retrieved from the BOLD API (http://www.boldsystems.org/index.php/API_Public/sequence?) using the terms ‘marker=COI-3P|COI-5P&taxon=’ for each Eukaryote phylum, except for Arthropoda, which was queried separately for each class, and Insecta, which was queried separately for each order, to enable the download of complete files [Accessed Apr. 26, 2018]. Lists of Eukaryote phyla, Arthropoda classes, and Insecta orders were retrieved from the BOLD taxonomy browser (http://www.boldsystems.org/index.php/TaxBrowser_Home). COI records were also retrieved from the BOLD data releases (http://www.boldsystems.org/index.php/datarelease). All available releases of animal COI records up to and including Release 6.50v1 were individually downloaded and parsed. Note that the records retrieved from the data releases may not be as current as those retrieved through the BOLD API.
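The tiered BOLD download described above (phyla, then Arthropoda classes, then Insecta orders) can be sketched as a URL-building step in Python. The endpoint and the `marker` and `taxon` parameters come from the text; the function names are hypothetical, and the taxon lists are inputs that would come from the BOLD taxonomy browser. The sketch only builds URLs and performs no download.

```python
from typing import Iterable, List
from urllib.parse import urlencode

# Endpoint and marker codes as given in the text.
BOLD_SEQUENCE_ENDPOINT = "http://www.boldsystems.org/index.php/API_Public/sequence"
COI_MARKERS = "COI-3P|COI-5P"


def download_targets(
    phyla: Iterable[str],
    arthropod_classes: Iterable[str],
    insect_orders: Iterable[str],
) -> List[str]:
    """Return the taxa to query, splitting the largest groups into subgroups.

    Arthropoda is replaced by its classes and Insecta by its orders, so no
    single request has to return an entire phylum-sized file.
    """
    targets = [p for p in phyla if p != "Arthropoda"]
    targets += [c for c in arthropod_classes if c != "Insecta"]
    targets += list(insect_orders)
    return targets


def bold_sequence_url(taxon: str) -> str:
    """Build the sequence-retrieval URL for one taxon, restricted to COI."""
    # safe="|" keeps the marker separator readable instead of percent-encoding it.
    query = urlencode({"marker": COI_MARKERS, "taxon": taxon}, safe="|")
    return f"{BOLD_SEQUENCE_ENDPOINT}?{query}"
```

Mapping `bold_sequence_url` over the output of `download_targets` yields one request per taxon, which matches the per-phylum, per-class, and per-order queries described in the text.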