To assess a transcript's coding potential, we extract six features from the transcript's nucleotide sequence. A true protein-coding transcript is more likely than a non-coding transcript to have a long, high-quality Open Reading Frame (ORF). Thus, our first three features assess the extent and quality of the ORF in a transcript. We use the framefinder software (14) to identify the longest reading frame in the three forward frames. Known for its error tolerance, framefinder can identify most correct ORFs even when the input transcripts contain sequencing errors such as point mutations, indels and truncations (14,15). We extract the LOG-ODDS SCORE and the COVERAGE OF THE PREDICTED ORF as the first two features by parsing the raw framefinder output with Perl scripts (available for download from the web site). The LOG-ODDS SCORE indicates the quality of a predicted ORF: the higher the score, the higher the quality. A large COVERAGE OF THE PREDICTED ORF is also an indicator of good ORF quality (14). We add a third, binary feature, the INTEGRITY OF THE PREDICTED ORF, which indicates whether an ORF begins with a start codon and ends with an in-frame stop codon.
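As an illustration of the third feature, the following minimal Python sketch checks ORF integrity on a nucleotide string. It is not the Perl code distributed with the method: the function names are hypothetical, the start codon is assumed to be ATG, the stop codons are assumed to be those of the standard genetic code, and coverage is assumed to mean ORF length divided by transcript length, which the text does not state explicitly.

```python
# Assumptions (not taken from the source): standard genetic code,
# ATG as the only start codon, coverage = ORF length / transcript length.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_integrity(orf_nt: str) -> bool:
    """Return True if the ORF starts with a start codon and ends with a
    stop codon that lies in the same reading frame as the start."""
    seq = orf_nt.upper().replace("U", "T")  # accept RNA or DNA letters
    # The stop is in frame with the start only if the ORF length is a
    # multiple of three; at least two codons are needed (start + stop).
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    return seq[:3] == START_CODON and seq[-3:] in STOP_CODONS


def orf_coverage(orf_len: int, transcript_len: int) -> float:
    """Fraction of the transcript covered by the predicted ORF."""
    if transcript_len <= 0:
        raise ValueError("transcript length must be positive")
    return orf_len / transcript_len


if __name__ == "__main__":
    print(orf_integrity("ATGGCCTAA"))   # complete ORF
    print(orf_integrity("GCCGCCTAA"))   # missing start codon
    print(orf_coverage(9, 18))
```

Because framefinder tolerates frameshifts, a real predicted ORF may need to be taken from its reported coordinates before a check of this kind is applied.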
The large and rapidly growing protein sequence databases provide a wealth of information for the identification of protein-coding transcripts. We derive another three features by parsing the output of a BLASTX (16) search (using the transcript as query, E-value cutoff 1e-10) against UniProt Reference Clusters (UniRef90), which was developed as a nonredundant protein database with a 90% sequence identity threshold (17). First, a true protein-coding transcript is likely to have more hits to known proteins than a non-coding transcript does. Thus, we extract the NUMBER OF HITS as a feature. Second, for a true protein-coding transcript the hits are also likely to be of higher quality, i.e. the HSPs (High-scoring Segment Pairs) overall tend to have lower E-values. Thus, we define the feature HIT SCORE as follows:
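$$\text{HIT SCORE} = \operatorname{mean}_{i \in \{0,1,2\}} \{S_i\} = \frac{1}{3}\sum_{i=0}^{2} S_i, \qquad S_i = \operatorname{mean}_{j} \{-\log_{10} E_{ij}\}$$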
where Eij is the E-value of the j-th HSP in frame i, Si measures the average quality of the HSPs in frame i and HIT SCORE is the average of Si across the three frames. The higher the HIT SCORE, the better the overall quality of the hits and the more likely the transcript is protein-coding. Third, for a true protein-coding transcript most of the hits are likely to reside within one frame, whereas for a true non-coding transcript, even if it matches certain known protein sequence segments by chance, these chance hits are likely to be scattered across all three frames. Thus, we define the feature FRAME SCORE to measure the distribution of the HSPs among the three reading frames:
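$$\text{FRAME SCORE} = \operatorname{variance}_{i \in \{0,1,2\}} \{S_i\} = \frac{1}{2}\sum_{i=0}^{2} \left(S_i - \bar{S}\right)^2, \qquad \bar{S} = \frac{1}{3}\sum_{i=0}^{2} S_i$$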
The higher the FRAME SCORE, the more concentrated the hits are and the more likely the transcript is protein-coding.
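The two BLASTX-derived scores can be sketched in a few lines of Python. This is an illustrative sketch rather than the distributed implementation, and it rests on several assumptions that should be read as such: Si is taken as the mean of -log10(E) over the HSPs in frame i, FRAME SCORE is taken as the sample variance of the three Si values, a frame without HSPs contributes Si = 0, and E-values reported as 0 are clamped to a hypothetical floor (`E_FLOOR`) so that the logarithm stays finite. HSPs are passed as (frame, E-value) pairs with frames numbered 0 to 2.

```python
import math
from statistics import mean, variance

# Hypothetical floor for E-values reported as 0.0 (assumption, not from
# the source); it keeps -log10(E) finite.
E_FLOOR = 1e-250


def per_frame_scores(hsps):
    """Return [S_0, S_1, S_2], where S_i is the mean of -log10(E) over
    the HSPs in frame i (0.0 for a frame without HSPs, by assumption)."""
    by_frame = {0: [], 1: [], 2: []}
    for frame, evalue in hsps:
        by_frame[frame].append(-math.log10(max(evalue, E_FLOOR)))
    return [mean(vals) if vals else 0.0 for vals in by_frame.values()]


def hit_score(hsps):
    """Average of S_i across the three frames."""
    return mean(per_frame_scores(hsps))


def frame_score(hsps):
    """Sample variance of S_i across the three frames (divides by 2)."""
    return variance(per_frame_scores(hsps))


if __name__ == "__main__":
    # Three HSPs: two in frame 0, one in frame 1, none in frame 2.
    hsps = [(0, 1e-20), (0, 1e-40), (1, 1e-12)]
    print(per_frame_scores(hsps))
    print(hit_score(hsps), frame_score(hsps))
```

With hits concentrated in one frame, the three Si values differ strongly and the variance is large; with equally good hits in all three frames, the variance falls to zero, which matches the intended reading of FRAME SCORE.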
We incorporate these six features into a support vector machine (SVM) classifier (18). By mapping the input features onto a high-dimensional feature space via a suitable kernel function, the SVM constructs a classification hyper-plane (the maximum-margin hyper-plane) to separate the transformed data (18). Known for its high accuracy, the SVM is a widely used classification tool in bioinformatics, with applications such as microarray-based cancer classification (19,20), prediction of protein function (21,22) and prediction of subcellular localization (23,24). We employed the LIBSVM package (25) to train an SVM model using the standard radial basis function (RBF) kernel. The C and gamma parameters were determined by grid search on the training dataset. We trained the SVM model on the same training dataset that CONC used (13), containing 5610 protein-coding cDNAs and 2670 noncoding RNAs.
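To make the training setup concrete, the sketch below writes one transcript's six features in the sparse text format that LIBSVM reads (`<label> <index>:<value> ...`) and enumerates a candidate (C, gamma) grid. The feature keys and their order are hypothetical, and the grid ranges are an assumption: they follow the exponential defaults of LIBSVM's `grid.py` helper (C from 2^-5 to 2^15, gamma from 2^3 to 2^-15), since the text does not report which ranges were searched.

```python
from itertools import product

# Hypothetical feature keys and ordering (assumption, for illustration).
FEATURE_ORDER = [
    "log_odds_score",
    "orf_coverage",
    "orf_integrity",
    "number_of_hits",
    "hit_score",
    "frame_score",
]


def to_libsvm_line(label: int, features: dict) -> str:
    """Format one example as '<label> 1:<v1> 2:<v2> ...' (1-based indices)."""
    parts = [
        f"{idx}:{features[name]:g}"
        for idx, name in enumerate(FEATURE_ORDER, start=1)
    ]
    return f"{label:+d} " + " ".join(parts)


def rbf_grid(log2c=range(-5, 16, 2), log2g=range(3, -16, -2)):
    """Candidate (C, gamma) pairs on an exponential grid.

    The default ranges are assumed, not taken from the source text.
    """
    return [(2.0 ** c, 2.0 ** g) for c, g in product(log2c, log2g)]


if __name__ == "__main__":
    example = {
        "log_odds_score": 35.2,
        "orf_coverage": 0.85,
        "orf_integrity": 1,
        "number_of_hits": 42,
        "hit_score": 14.0,
        "frame_score": 228.0,
    }
    print(to_libsvm_line(+1, example))  # +1 = coding, -1 = noncoding
    print(len(rbf_grid()), "candidate (C, gamma) pairs")
```

Each candidate pair would then be scored by cross-validation on the training set, and the best-scoring pair used to train the final RBF-kernel model.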