Structural rRNAs (5S, 16S and 23S) are highly conserved in closely related prokaryotic species. The NCBI RefSeq Targeted Loci collection (22) contains curated sets of the three types of rRNA gene sequences, which serve as reference sets for PGAP (https://ncbi.nlm.nih.gov/RefSeq/targetedloci/). To identify genes for 16S and 23S rRNAs, PGAP uses members of the reference sets as queries in BLASTn (23). Hits that correspond to partial alignments are dropped if they fall below coverage and identity thresholds defined with respect to the average length of the corresponding rRNA (50% coverage and 70% identity for 16S rRNA; 50% coverage and 60% identity for 23S rRNA). Borders of predicted rRNA genes are defined by a voting mechanism similar to the one mentioned below for identifying gene starts among several alternative start codons.
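The threshold filter described above can be illustrated with a minimal Python sketch. Only the percentage thresholds come from the text. The `Hit` record, the function names and the average gene lengths are hypothetical placeholders, and the sketch assumes that a hit is kept only when it meets both the coverage and the identity threshold.

```python
from dataclasses import dataclass

# Thresholds quoted in the text: (minimum coverage, minimum identity).
THRESHOLDS = {"16S": (0.50, 0.70), "23S": (0.50, 0.60)}

# ASSUMPTION: illustrative average gene lengths in nucleotides.
# The text does not state the values used by PGAP.
AVERAGE_LENGTH = {"16S": 1500, "23S": 2900}


@dataclass
class Hit:
    """Hypothetical record for one BLASTn alignment."""
    rrna_type: str      # "16S" or "23S"
    align_length: int   # aligned length on the genome, in nucleotides
    identity: float     # fraction of identical positions, 0..1


def passes_thresholds(hit: Hit) -> bool:
    """Return True if the hit meets both thresholds for its rRNA type."""
    min_coverage, min_identity = THRESHOLDS[hit.rrna_type]
    # Coverage is measured against the average length of the rRNA type.
    coverage = hit.align_length / AVERAGE_LENGTH[hit.rrna_type]
    return coverage >= min_coverage and hit.identity >= min_identity


def filter_hits(hits):
    """Keep only the hits that pass, preserving input order."""
    return [hit for hit in hits if passes_thresholds(hit)]
```

With these assumed lengths, a 600-nt 16S alignment covers 40% of the average gene and is dropped regardless of its identity, whereas a 2000-nt 23S alignment at 65% identity is kept.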
For prediction of 5S rRNAs and small ncRNAs, PGAP uses cmsearch (ver. 1.1.1) along with covariance models, score thresholds and recommended command line options from the Rfam database (release 12.0 (7)). Current execution of this cmsearch version has been optimized to permit direct use of the tool without a preliminary BLASTn search (5–7).
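As a rough illustration of a direct cmsearch run, the sketch below assembles a command line in Python. The text does not list the exact options PGAP passes, so the flags shown are an assumption based on options that Infernal 1.1 provides for Rfam-style searches: `--cut_ga` (use the gathering thresholds stored in each model), `--rfam` (Rfam filter settings), `--nohmmonly` (always use the covariance model) and `--tblout` (tabular output). The function name and file paths are hypothetical.

```python
import subprocess


def build_cmsearch_command(cm_path, genome_fasta, tblout_path, cpus=4):
    """Build an argument list for cmsearch (assumed options, not PGAP's own)."""
    return [
        "cmsearch",
        "--cut_ga",             # per-model gathering score thresholds
        "--rfam",               # filter settings intended for Rfam searches
        "--nohmmonly",          # never fall back to HMM-only scoring
        "--tblout", tblout_path,
        "--cpu", str(cpus),
        cm_path,                # covariance model file, e.g. from Rfam
        genome_fasta,           # genomic sequence to search
    ]


def run_cmsearch(cm_path, genome_fasta, tblout_path, cpus=4):
    """Run cmsearch; requires Infernal to be installed and on PATH."""
    command = build_cmsearch_command(cm_path, genome_fasta, tblout_path, cpus)
    subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file paths contain spaces.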
For prediction of tRNA sequences, PGAP relies on tRNAscan-SE. The input genomic sequence is split into overlapping fragments long enough to cover a tRNA gene with possible introns. These fragments are used as inputs to tRNAscan-SE (8), currently one of the most widely used tRNA gene identification tools. Domain-specific parameters of tRNAscan-SE are selected automatically for each genome (8).
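The fragmentation step can be sketched as follows. The fragment length and overlap are hypothetical parameters, since the text does not give the values PGAP uses. The sketch only shows the general idea: consecutive windows share `overlap` bases, so any feature no longer than the overlap lies entirely inside at least one fragment.

```python
def split_overlapping(sequence, fragment_len, overlap):
    """Split a sequence into overlapping fragments.

    Returns a list of (start_offset, fragment) tuples. The offsets allow
    predictions made on a fragment to be mapped back to genome coordinates.
    """
    if not 0 <= overlap < fragment_len:
        raise ValueError("overlap must be >= 0 and smaller than fragment_len")
    if not sequence:
        return []

    step = fragment_len - overlap
    fragments = []
    start = 0
    while True:
        end = min(start + fragment_len, len(sequence))
        fragments.append((start, sequence[start:end]))
        if end == len(sequence):
            break  # the last fragment reaches the end of the sequence
        start += step
    return fragments
```

Because neighbouring fragments overlap, the same gene can be reported twice. Predictions would need to be shifted by `start_offset` and de-duplicated after the scan.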
All predicted RNA genes from the above steps are collected and presented to GeneMarkS+ as a set of RNA gene ‘footprints’ (Figure 2). GeneMarkS+ has several labels (‘M’, ‘N’ and ‘R’) for RNA gene footprints; the labels specify different types of possible overlaps between protein-coding genes and RNA genes.