An up-to-date version of HOM39 (26) was extracted from the July 2004 release of HOMSTRAD (17) based on the two criteria used in (26). HOMSTRAD is a curated database of structural alignments of homologous proteins whose coordinates are available. Each HOMSTRAD entry, a structural alignment, is extended by introducing homologous sequences with CLUSTAL W. Only the alignments based on structural superposition were used in this study. Out of the 1033 HOMSTRAD entries, 55 entries (19.7% pairwise identity, 7.69 sequences and 159 aligned residues on average) were extracted for the evaluation of alignment accuracy. This dataset is referred to as ‘HOM+0’ in this paper.
We made the ‘HOM+20,’ ‘HOM+50’ and ‘HOM+100’ datasets by extending each entry of HOM+0 in a way similar to PREFAB (11). Amino acid sequences similar (E-value < 10^-10) to each member of an entry were collected from the SwissProt database (rel. 43) using BLAST (27) and added to the entry. If more than n (= 20, 50 or 100) sequences were collected, we randomly selected n of them to be added. Only the regions of these sequences that BLAST reported as significantly similar were added. The accuracy of an alignment was measured by the fraction of columns aligned identically to the reference alignment. Before the accuracy was evaluated, the n sequences added to each HOM+n entry were removed.
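The column-based accuracy described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the study: the function names (`project`, `columns_as_indices`, `column_score`) are hypothetical, alignments are assumed to be dictionaries mapping sequence names to gapped strings with `-` as the gap character, and the denominator is assumed to be the number of columns in the reference alignment.

```python
def project(alignment, keep):
    """Restrict an alignment to the sequences in `keep` and drop all-gap columns.

    This mimics removing the n added sequences before scoring.
    """
    rows = [alignment[name] for name in keep]
    cols = [c for c in zip(*rows) if any(ch != "-" for ch in c)]
    return {name: "".join(col[i] for col in cols) for i, name in enumerate(keep)}


def columns_as_indices(alignment, names):
    """Represent each column as a tuple of residue indices (None marks a gap)."""
    counters = [0] * len(names)
    cols = []
    for col in zip(*(alignment[n] for n in names)):
        entry = []
        for i, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[i])
                counters[i] += 1
        cols.append(tuple(entry))
    return cols


def column_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    names = sorted(reference)
    ref_cols = columns_as_indices(reference, names)
    test_cols = set(columns_as_indices(project(test, names), names))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


# Toy data for illustration only; "x" plays the role of an added homolog.
reference = {"a": "AC-GT", "b": "ACAGT"}
test = {"a": "ACG-T", "b": "ACAGT", "x": "ACAGT"}
print(column_score(test, reference))
```

Representing columns by residue indices rather than characters means that two columns count as identical only when they align the same positions, not merely the same letters.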
SABmark (18) version 1.65 was downloaded from . SABmark is designed to assess the performance of protein sequence alignment algorithms and consists of two parts, the Twilight Zone set (with ‘very low’ similarity; referred to as the TWI set in this paper) and the Superfamily set (with ‘low’ similarity; referred to as SUP). The TWI set was mainly used in the present study to examine the ability of algorithms to align distantly related sequences. The TWI set was also extended in the same manner as described above. The resulting datasets are hereafter referred to as ‘TWI+n’ (n = 0, 20 and 50). The accuracy value fD, the number of correctly aligned residues divided by the length of the reference alignment, was calculated using the score.pl script provided by the authors of SABmark. The accuracies were considered separately for two subsets. One subset (denoted as TWIf+n) includes only the sequence pairs classified into the same family by Van Walle et al. (18), and the other subset (denoted as TWIs+n) consists of the sequence pairs classified into the same superfamily but not into the same family.
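The split into the two subsets can be sketched as below. This is an illustrative sketch under stated assumptions: the record fields (`family_a`, `superfamily_a` and so on), the function names, and the use of a plain arithmetic mean per subset are all hypothetical, and the fD values are assumed to have been computed beforehand (for example by score.pl).

```python
from collections import defaultdict
from statistics import mean


def subset_of(pair):
    """Label a sequence pair as TWIf (same family) or TWIs (same superfamily only)."""
    if pair["family_a"] == pair["family_b"]:
        return "TWIf"
    if pair["superfamily_a"] == pair["superfamily_b"]:
        return "TWIs"
    return None  # pairs outside both subsets are ignored in this sketch


def mean_fd_by_subset(pairs):
    """Average precomputed fD values separately for each subset."""
    groups = defaultdict(list)
    for pair in pairs:
        label = subset_of(pair)
        if label is not None:
            groups[label].append(pair["fD"])
    return {label: mean(values) for label, values in groups.items()}


# Placeholder records; labels and fD values are made up for illustration.
pairs = [
    {"family_a": "fam1", "family_b": "fam1",
     "superfamily_a": "sf1", "superfamily_b": "sf1", "fD": 0.5},
    {"family_a": "fam1", "family_b": "fam2",
     "superfamily_a": "sf1", "superfamily_b": "sf1", "fD": 0.125},
]
print(mean_fd_by_subset(pairs))
```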
The PREFAB (11) version 3 dataset was downloaded from . The accuracy was measured using Q, the number of correctly aligned residue pairs divided by the number of residue pairs in the reference alignment (11).
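For a single pair of sequences, the Q score as defined above can be sketched as follows. This is a simplified illustration rather than the PREFAB scoring tool: the function names are hypothetical, `-` is assumed to be the gap character, and every reference column that contains two residues is treated as a scored residue pair.

```python
def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue index pairs placed in the same column."""
    pairs = set()
    i = j = 0
    for ca, cb in zip(row_a, row_b):
        if ca != "-" and cb != "-":
            pairs.add((i, j))
        if ca != "-":
            i += 1
        if cb != "-":
            j += 1
    return pairs


def q_score(test_a, test_b, ref_a, ref_b):
    """Correctly aligned residue pairs divided by residue pairs in the reference."""
    ref = aligned_pairs(ref_a, ref_b)
    test = aligned_pairs(test_a, test_b)
    return len(ref & test) / len(ref)


# Toy example: three of the four reference pairs are recovered.
print(q_score("ACG-T", "ACAGT", "AC-GT", "ACAGT"))  # 0.75
```

When the test alignment contains additional sequences, the two rows of interest can be passed directly: columns in which both rows are gaps contribute no pair and do not shift the residue indices.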