The large and rapidly growing protein sequence databases provide a wealth of information for the identification of protein-coding transcript. We derive another three features from parsing the output of BLASTX (16 (link)) search (using the transcript as query, E-value cutoff 1e-10) against UniProt Reference Clusters (UniRef90) which was developed as a nonredundant protein database with a 90% sequence identity threshold (17 (link)). First, a true protein-coding transcript is likely to have more hits with known proteins than a non-coding transcript does. Thus we extract the NUMBER OF HITS as a feature. Second, for a true protein-coding transcript the hits are also likely to have higher quality; i.e. the HSPs (High-scoring Segment Pairs) overall tend to have lower E-value. Thus we define feature HIT SCORE as follows:
where Eij is the E-value of the j-th HSP in frame i, Si measures the average quality of the HSPs in frame i and HIT SCORE is the average of Si across three frames. The higher the HIT SCORE, the better the overall quality of the hits and the more likely the transcript is protein-coding. Thirdly, for a true protein-coding transcript most of the hits are likely to reside within one frame, whereas for a true non-coding transcript, even if it matches certain known protein sequence segments by chance, these chance hits are likely to scatter in any of the three frames. Thus, we define feature FRAME SCORE to measure the distribution of the HSPs among three reading frames:
The higher the FRAME SCORE, the more concentrated the hits are and the more likely the transcript is protein-coding.
We incorporate these six features into a support vector machine (SVM) machine learning classifier (18 ). Mapping the input features onto a high-dimensional feature space via a proper kernel function, SVM constructs a classification hyper-plane (maximum margin hyper-plane) to separate the transformed data (18 ). Known for its high accuracy and good performance, SVM is a widely used classification tool in bioinformatics analysis such as microarray-based cancer classification (19 (link),20 (link)), prediction of protein function (21 (link),22 (link)) and prediction of subcellular localization (23 (link),24 (link)). We employed the LIBSVM package (25 ) to train a SVM model using the standard radial basis function kernel (RBF kernel). The C and gamma parameters were determined by grid-search in the training dataset. We trained the SVM model using the same training data set as CONC used (13 (link)), containing 5610 protein-coding cDNAs and 2670 noncoding RNAs.