Protein sequences were collected from the Swiss-Prot database at http://www.ebi.ac.uk/swissprot/. The detailed procedures are essentially the same as those described in [14]; the only difference is that, to establish a more up-to-date benchmark dataset, version 55.3 of the Swiss-Prot database (released on 29-Apr-2008) was adopted instead of version 50.7 (released on 9-Sept-2006). Strictly following the procedures described in [14], we obtained a benchmark dataset $\mathbb{S}$ containing 7,766 different protein sequences distributed among 22 subcellular locations (Fig. 1); i.e.,

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \mathbb{S}_3 \cup \cdots \cup \mathbb{S}_{22} \tag{1}$$

where $\mathbb{S}_1$ represents the subset for the subcellular location of “acrosome”, $\mathbb{S}_2$ for “cell membrane”, $\mathbb{S}_3$ for “cell wall”, and so forth, while $\cup$ is the symbol for “union” in set theory. A breakdown of the 7,766 eukaryotic proteins in the benchmark dataset according to their 22 location sites is given in Table 1. To avoid redundancy and homology bias, none of the proteins in $\mathbb{S}$ has pairwise sequence identity beyond the cutoff (as in [14]) to any other protein in the same subset. The corresponding accession numbers and protein sequences are given in Online Supporting Information S1.
Because the system investigated here contains both single-location and multiple-location proteins, some of the proteins in $\mathbb{S}$ may occur in two or more location sites. It is therefore instructive to introduce the concept of the “virtual sample”, illustrated as follows. A protein coexisting at two different location sites is counted as 2 virtual samples even though they have an identical sequence; if it coexists at three different sites, as 3 virtual samples; and so forth. Accordingly, the total number of different virtual protein samples is generally greater than the total number of different sequence samples. Their relationship can be formulated as

$$N(\text{vir}) = N(\text{seq}) + \sum_{m=1}^{M} (m-1)\, N(m) \tag{2}$$

where $N(\text{vir})$ is the total number of different virtual protein samples in $\mathbb{S}$, $N(\text{seq})$ the total number of different protein sequences, $N(1)$ the number of proteins with one location, $N(2)$ the number of proteins with two locations, and so forth, while $M$ is the total number of subcellular location sites (for the current case, $M = 22$, as shown in Fig. 1 and Table 1).
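The virtual-sample bookkeeping described above can be illustrated with a minimal sketch. The protein identifiers and location assignments are hypothetical toy data, not entries from the benchmark dataset, and the function name `expand_to_virtual_samples` is chosen here only for illustration.

```python
from typing import Dict, List, Set, Tuple


def expand_to_virtual_samples(
    annotations: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Return one (protein, location) pair per location a protein occupies."""
    virtual = []
    for protein in sorted(annotations):
        for location in sorted(annotations[protein]):
            virtual.append((protein, location))
    return virtual


# Hypothetical toy annotations: one single-site and two multi-site proteins.
toy = {
    "P_A": {"cell membrane"},
    "P_B": {"cell membrane", "cell wall"},
    "P_C": {"acrosome", "cell membrane", "cell wall"},
}

virtual_samples = expand_to_virtual_samples(toy)
print(len(toy), len(virtual_samples))
```

In this toy case three distinct sequences give six virtual samples, because the two-site protein is counted twice and the three-site protein three times.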
For the current 7,766 different protein sequences, 6,687 occur in one subcellular location, 1,029 in two locations, 48 in three locations, 2 in four locations, and none in five or more locations. Substituting these data into Eq. 2, the total number of virtual protein samples is

$$N(\text{vir}) = 7766 + (1-1)\times 6687 + (2-1)\times 1029 + (3-1)\times 48 + (4-1)\times 2 = 8897$$

which is fully consistent with the figures in Table 1 and the data in Online Supporting Information S1.
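As a quick arithmetic check, the counts quoted above can be substituted programmatically: each protein found at m locations contributes m - 1 virtual samples beyond its single sequence. The following is a sketch only; the function and variable names are illustrative assumptions, while the counts are those stated in the text.

```python
from typing import Dict


def virtual_sample_count(n_seq: int, counts_by_multiplicity: Dict[int, int]) -> int:
    """Number of sequences plus (m - 1) extra samples per protein with m locations."""
    extra = sum((m - 1) * n_m for m, n_m in counts_by_multiplicity.items())
    return n_seq + extra


# Number of proteins observed at exactly m locations (figures from the text).
counts = {1: 6687, 2: 1029, 3: 48, 4: 2}

n_seq = sum(counts.values())                 # total different sequences
n_vir = virtual_sample_count(n_seq, counts)  # total virtual samples
print(n_seq, n_vir)  # 7766 8897
```

The same total is obtained by summing m times the number of proteins with m locations, which is a convenient cross-check against a per-location table.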
As stated in a recent comprehensive review [20], one of the most important steps in developing a powerful method for statistically predicting protein subcellular localization is to formulate protein samples with core features that are intrinsically correlated with their localization in a cell. Since the concept of pseudo amino acid composition (PseAAC) was proposed [16], it has provided a very flexible mathematical framework for investigators to incorporate their desired information into the representation of protein samples. According to its original definition [16], the PseAAC is a set of discrete numbers that differs from the classical amino acid composition (AAC) and is derived from a protein sequence so as to retain some of its sequence-order and pattern information, or to reflect some physicochemical and biochemical properties of the constituent amino acids. PseAAC has since been widely used to deal with many protein-related problems and sequence-related systems (see, e.g., [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42] and the long list of PseAAC-related references cited in a recent review [20]). As summarized in [20], 16 different PseAAC modes have so far been used to represent protein samples for predicting their attributes. Each of these modes has its own advantages and disadvantages. In this study, we formulate the protein samples by hybridizing the following three different modes of PseAAC.