The protein reference sequence database was entirely revised for PREPACT 2.0 and now relies on translated coding sequences of complete organelle genomes, taking known editing events into account. Original GenBank accessions (Table 1) were retrieved from NCBI, split into their elements (header, feature list, qualifiers and sequence origin) and, after format checking, saved into the internal MySQL database. The fully functional hierarchical tree of feature and qualifier objects within sequence objects is retained. A complete set of associated methods for position calculation, information retrieval and manipulation within PREPACT 2.0 makes it possible to check for potentially erroneous feature locations, translational mismatches and CDS naming issues during subsequent revision where necessary. To curate the available organellar genomes, flexible regular expression-based search-and-replace classes, which scan the sequence entries for necessary modifications, are stored on a per-accession basis. When present, annotated editing sites were parsed from the different formats currently found in primary accessions (Fig. 1) into a new PREPACT-internal “RNA_editing” feature (Fig. 2). This process simultaneously checked the annotations for consistency. The more common mistakes (eg, annotation of the wrong DNA strand) were resolved automatically, whereas the remaining annotation errors (such as obvious mislabeling of editing positions or misannotation of splicing) were corrected manually. Where no editing was annotated at all (eg, most angiosperm chloroplast [cp] DNA entries), RNA editing annotation was introduced manually into the same modifications database.
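The per-accession search-and-replace curation described above can be illustrated with a minimal Python sketch. It is not the PREPACT 2.0 implementation: the accession number, the `Rule` class, the `RULES` table and the `curate` function are hypothetical names chosen for illustration, and the rules are held in a dictionary rather than in a database.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One search-and-replace modification (hypothetical structure)."""
    pattern: str      # regular expression to search for
    replacement: str  # replacement text, may use backreferences


# Hypothetical rule set, keyed by accession number.
RULES = {
    "XX000001": [
        # Normalize an upper-case gene name.
        Rule(r'/gene="ATP9"', '/gene="atp9"'),
        # Strip trailing whitespace inside any gene qualifier.
        Rule(r'/gene="(\w+)\s+"', r'/gene="\1"'),
    ],
}


def curate(accession: str, entry_text: str, rules=RULES) -> str:
    """Apply all rules stored for this accession, in order."""
    for rule in rules.get(accession, []):
        entry_text = re.sub(rule.pattern, rule.replacement, entry_text)
    return entry_text


entry = 'CDS 1..225\n /gene="ATP9"\nCDS 300..1880\n /gene="cox1 "\n'
print(curate("XX000001", entry))
```

Entries whose accession has no stored rules pass through unchanged, so the original records stay intact and each correction remains traceable to a single rule.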
An auto-annotation module was created to process organelle genome entries that lack annotated RNA editing sites but for which complete sets of cDNA sequences are available. Examples are the complex mitochondrial (mt) DNAs of the lycophytes Isoetes engelmannii and Selaginella moellendorffii, for which cDNAs exist as primary database entries, and Vitis vinifera, for which editing information has been stored in REDIdb. The auto-annotation script aligns cDNA sequences to the corresponding CDS feature(s) in the organelle genome entries and automatically creates new “RNA_editing” features for these.
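The core idea of auto-annotation, deriving editing sites from differences between a genomic CDS and its cDNA, can be sketched as follows. This is a simplified illustration, not the actual script: it assumes that the two sequences are already aligned without gaps and of equal length, and that the cDNA is written in the DNA alphabet (T instead of U). The function name `find_editing_sites` is hypothetical.

```python
def find_editing_sites(genomic_cds: str, cdna: str) -> list[tuple[int, str, str]]:
    """Return (position, genomic base, cDNA base) for every mismatch.

    Positions are 1-based and relative to the start of the CDS.
    Assumes an ungapped, equal-length alignment (simplification).
    """
    if len(genomic_cds) != len(cdna):
        raise ValueError("sequences must be aligned and of equal length")
    sites = []
    pairs = zip(genomic_cds.upper(), cdna.upper())
    for position, (genomic_base, cdna_base) in enumerate(pairs, start=1):
        if genomic_base != cdna_base:
            sites.append((position, genomic_base, cdna_base))
    return sites


# Toy example with two C-to-U (written C-to-T) differences.
print(find_editing_sites("ATGCCATCATAA", "ATGCTATTATAA"))
# [(5, 'C', 'T'), (8, 'C', 'T')]
```

Each reported tuple corresponds to one candidate editing site; a real pipeline would additionally have to handle introns, gaps and sequencing errors before turning such sites into features.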
From each fully curated organelle genome, all CDS features are extracted, translated into proteins taking all corresponding RNA editing into account, and stored as a BLAST database that can be used for analysis. For genomes not represented by a single accession (eg, the lycophyte mtDNAs mentioned above), several accessions can be combined into a single BLAST database.
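Translation "taking RNA editing into account" amounts to applying the annotated edits to the CDS before translating it. The following Python sketch illustrates this under stated assumptions: editing sites are given as (1-based position, genomic base, edited base) tuples, the standard genetic code is used, and the function names `apply_edits` and `translate` are hypothetical rather than part of PREPACT 2.0.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def apply_edits(cds: str, sites: list[tuple[int, str, str]]) -> str:
    """Return the CDS with all editing sites applied (1-based positions)."""
    bases = list(cds.upper())
    for position, genomic_base, edited_base in sites:
        if bases[position - 1] != genomic_base:
            raise ValueError(f"position {position} is not {genomic_base}")
        bases[position - 1] = edited_base
    return "".join(bases)


def translate(cds: str) -> str:
    """Translate up to the first stop codon; unknown codons become 'X'."""
    protein = []
    for start in range(0, len(cds) - len(cds) % 3, 3):
        amino_acid = CODON_TABLE.get(cds[start:start + 3].upper(), "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


cds = "ATGCCATCATAA"
sites = [(5, "C", "T"), (8, "C", "T")]
print(translate(cds))                      # MPS
print(translate(apply_edits(cds, sites)))  # MLL
```

In this toy example the two edits change proline and serine codons into leucine codons, so the edited protein differs from the direct genomic translation. Checking the genomic base before each substitution guards against edits that are annotated at the wrong position or on the wrong strand.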