The construction of KOGs followed the previously outlined strategy based on sets of consistent BeTs [9,15], but included additional steps that reflected specific features of eukaryotic proteins. Briefly, the procedure was as follows.
1. Detection and masking of widespread, typically repetitive domains, which was performed by using the RPS-BLAST program and the PSSMs for the respective domains from the CDD collection [40]. These domains, namely, PPR (pfam01535), WD40 (pfam00400), IG (pfam00047), IGc1, Igv, IG_like, RRM (pfam00076), ANK (pfam00023), myosin tail (pfam01576), Fn3 (pfam00041), CA, (IG), ANK, kelch (pfam01344), OAD_kelch, SH3 (pfam00018), intermediate filaments (pfam00038), C2H2 finger (pfam00096), PDZ (pfam00595), POZ (pfam00651), PH (pfam00169), ZnF-C4 (pfam00105), spectrin (pfam00435), Sushi (pfam00084), TPR (pfam00017), BTB, LRR_CC, LY, ARM, SH2, and CH, were detected and masked prior to applying the COG construction procedure. Masking was required to ensure robust classification of the eukaryotic orthologous clusters, because hits between these common, "promiscuous" domains resulted in spurious lumping of numerous non-orthologous proteins.
2. All-against-all comparison of protein sequences from the analyzed genomes by using the gapped BLAST program [58], with filtering for low sequence complexity regions performed using the SEG program [59].
3. Detection of triangles of mutually consistent, genome-specific best hits (BeTs).
4. Merging triangles with a common side to form crude, preliminary KOGs.
5. Case-by-case analysis of each candidate KOG. This analysis served to eliminate the false positives that were incorporated in the KOGs during the automatic steps and included, primarily, examination of the domain composition of KOG members, which was determined using the RPS-BLAST program and the CDD collection of position-specific scoring matrices (PSSMs) for individual domains [40]. Generally, proteins were kept in the same KOG when they shared a conserved core domain architecture. However, in cases when KOGs were artificially bridged by multidomain proteins, the latter were split into individual domains (or arrays of domains) and steps (1)-(4) were repeated with these sequences; this resulted in the assignment of individual domains to KOGs in accordance with their distinct evolutionary affinities.
6. Assignment of proteins containing promiscuous domains. In cases when a sequence assigned to a KOG contained one or more masked promiscuous domains, these domains were restored and became part of the respective KOG. Proteins containing promiscuous domains but not assigned to any KOG were classified in Fuzzy Orthologous Groups (FOGs) named after the respective domains.
7. Examination of large KOGs, which included multiple members from all or several of the compared genomes, by using phylogenetic trees, cluster analysis with the BLASTCLUST program, comparison of domain architectures, and visual inspection of alignments; as a result, some of these protein sets were split into two or more smaller ones that were included in the final set of KOGs.
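Steps 3 and 4 of the procedure above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`symmetric_bets`, `find_triangles`, `merge_triangles`, `toy_best_hits`) and the input structures (`best_hits`, `genome_of`) are hypothetical, and the sketch assumes the simplest reading of "mutually consistent", namely that all three sides of a triangle are reciprocal genome-specific best hits between proteins from three different genomes. The actual procedure may be more permissive.

```python
from itertools import combinations


def symmetric_bets(best_hits, genome_of):
    """Return unordered protein pairs that are reciprocal genome-specific best hits.

    best_hits maps (query_protein, target_genome) -> best-hit protein in that genome.
    genome_of maps protein -> genome. Both structures are assumptions of this sketch.
    """
    pairs = set()
    for (query, target_genome), hit in best_hits.items():
        if genome_of[query] == target_genome:
            continue  # ignore hits within the query's own genome
        if best_hits.get((hit, genome_of[query])) == query:
            pairs.add(frozenset((query, hit)))
    return pairs


def find_triangles(pairs, genome_of):
    """Return triangles: three proteins from three genomes, all pairwise reciprocal."""
    neighbours = {}
    for pair in pairs:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    triangles = set()
    for pair in pairs:
        a, b = tuple(pair)
        for c in neighbours[a] & neighbours[b]:
            if len({genome_of[a], genome_of[b], genome_of[c]}) == 3:
                triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two vertices) into preliminary clusters."""
    parent = {t: t for t in triangles}

    def find(t):
        # Union-find root lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    first_with_side = {}
    for t in triangles:
        for side in combinations(sorted(t), 2):
            other = first_with_side.setdefault(side, t)
            parent[find(t)] = find(other)

    clusters = {}
    for t in triangles:
        clusters.setdefault(find(t), set()).update(t)
    return sorted(sorted(members) for members in clusters.values())


def toy_best_hits(groups, genome_of):
    """Build toy best-hit data in which members of each group hit one another."""
    hits = {}
    for group in groups:
        for p in group:
            for q in group:
                if p != q:
                    hits[(p, genome_of[q])] = q
    return hits


# Toy data: proteins a1..d1 from genomes A..D, and so on (all names invented).
genome_of = {"a1": "A", "b1": "B", "c1": "C", "d1": "D",
             "a2": "A", "b2": "B", "c2": "C",
             "a3": "A", "b3": "B"}
best_hits = toy_best_hits([["a1", "b1", "c1", "d1"],
                           ["a2", "b2", "c2"],
                           ["a3", "b3"]], genome_of)
pairs = symmetric_bets(best_hits, genome_of)
kogs = merge_triangles(find_triangles(pairs, genome_of))
print(kogs)
```

In this toy run the four triangles formed by a1, b1, c1 and d1 share sides and collapse into one preliminary cluster, while the a3/b3 pair is dropped because a reciprocal pair alone does not form a triangle. Triangles that share only a single vertex are deliberately kept apart, which mirrors the "common side" requirement of step 4.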
The KOGs were annotated on the basis of the annotations available through GenBank and other public databases, which were critically assessed against the primary literature. For proteins that are currently annotated as "hypothetical" or "unknown", iterative sequence similarity searches with the PSI-BLAST program [58], the results of the RPS-BLAST searches, additional domain architecture analysis performed by using the SMART system [60], and comparison to the COG database by using the COGNITOR program (RLT, unpublished results) were employed to identify distant homologs with experimentally characterized functions and/or structures. The known and predicted functions of KOGs were classified into 23 categories (see legend to Fig. 4); these were modified from the functional classification previously employed for prokaryotic COGs [15] by including several specific eukaryotic categories.