Transcript Coding Potential Assessment

To assess a transcript's coding potential, we extract six features from the transcript's nucleotide sequence. A true protein-coding transcript is more likely to have a long and high-quality Open Reading Frame (ORF) compared with a non-coding transcript. Thus, our first three features assess the extent and quality of the ORF in a transcript. We use the framefinder software (14 ) to identify the longest reading frame in the three forward frames. Known for its error tolerance, framefinder can identify most correct ORFs even when the input transcripts contain sequencing errors such as point mutations, indels and truncations (14 ,15 (link)). We extract the LOG-ODDS SCORE and the COVERAGE OF THE PREDICTED ORF as the first two features by parsing the framefinder raw output with Perl scripts (available for download from the web site). The LOG-ODDS SCORE is an indicator of the quality of a predicted ORF and the higher the score, the higher the quality. A large COVERAGE OF THE PREDICTED ORF is also an indicator of good ORF quality (14 ). We add a third binary feature, the INTEGRITY OF THE PREDICTED ORF, that indicates whether an ORF begins with a start codon and ends with an in-frame stop codon.
The large and rapidly growing protein sequence databases provide a wealth of information for the identification of protein-coding transcript. We derive another three features from parsing the output of BLASTX (16 (link)) search (using the transcript as query, E-value cutoff 1e-10) against UniProt Reference Clusters (UniRef90) which was developed as a nonredundant protein database with a 90% sequence identity threshold (17 (link)). First, a true protein-coding transcript is likely to have more hits with known proteins than a non-coding transcript does. Thus we extract the NUMBER OF HITS as a feature. Second, for a true protein-coding transcript the hits are also likely to have higher quality; i.e. the HSPs (High-scoring Segment Pairs) overall tend to have lower E-value. Thus we define feature HIT SCORE as follows:

where E_ij is the E-value of the j-th HSP in frame i, S_i measures the average quality of the HSPs in frame i and HIT SCORE is the average of S_i across three frames. The higher the HIT SCORE, the better the overall quality of the hits and the more likely the transcript is protein-coding. Thirdly, for a true protein-coding transcript most of the hits are likely to reside within one frame, whereas for a true non-coding transcript, even if it matches certain known protein sequence segments by chance, these chance hits are likely to scatter in any of the three frames. Thus, we define feature FRAME SCORE to measure the distribution of the HSPs among three reading frames:

The higher the FRAME SCORE, the more concentrated the hits are and the more likely the transcript is protein-coding.
We incorporate these six features into a support vector machine (SVM) machine learning classifier (18 ). Mapping the input features onto a high-dimensional feature space via a proper kernel function, SVM constructs a classification hyper-plane (maximum margin hyper-plane) to separate the transformed data (18 ). Known for its high accuracy and good performance, SVM is a widely used classification tool in bioinformatics analysis such as microarray-based cancer classification (19 (link),20 (link)), prediction of protein function (21 (link),22 (link)) and prediction of subcellular localization (23 (link),24 (link)). We employed the LIBSVM package (25 ) to train a SVM model using the standard radial basis function kernel (RBF kernel). The C and gamma parameters were determined by grid-search in the training dataset. We trained the SVM model using the same training data set as CONC used (13 (link)), containing 5610 protein-coding cDNAs and 2670 noncoding RNAs.

Partial Protocol Preview
This section provides a glimpse into the protocol.
The remaining content is hidden due to licensing restrictions, but the full text is available at the following link: Access Free Full Text.

Kong L., Zhang Y., Ye Z.Q., Liu X.Q., Zhao S.Q., Wei L, & Gao G. (2007). CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Research, 35(Web Server issue), W345-W349.

Publication 2007

Cancer Cdnas Frame Gamma Indels Microarray Noncoding rnas Nucleotide sequence Point mutations Protein Protein a Protein sequence Start codon Stop codon Tolerance

Corresponding Organization :

Other organizations : Peking University

Top 5 similar protocols

Protocol cited in 871 other protocols

Variable analysis

independent variables

LOG-ODDS SCORE
COVERAGE OF THE PREDICTED ORF
INTEGRITY OF THE PREDICTED ORF
NUMBER OF HITS
HIT SCORE
FRAME SCORE

dependent variables

Coding potential of the transcript

control variables

Known protein-coding transcripts
Known non-coding transcripts

positive controls

5610 protein-coding cDNAs

negative controls

2670 noncoding RNAs

Annotations

Based on most similar protocols

Etiam vel ipsum. Morbi facilisis vestibulum nisl. Praesent cursus laoreet felis. Integer adipiscing pretium orci. Nulla facilisi. Quisque posuere bibendum purus. Nulla quam mauris, cursus eget, convallis ac, molestie non, enim. Aliquam congue. Quisque sagittis nonummy sapien. Proin molestie sem vitae urna. Maecenas lorem.

As authors may omit details in methods from publication, our AI will look for missing critical information across the 5 most similar protocols.

About PubCompare

Our mission is to provide scientists with the largest repository of trustworthy protocols and intelligent analytical tools, thereby offering them extensive information to design robust protocols aimed at minimizing the risk of failures.

We believe that the most crucial aspect is to grant scientists access to a wide range of reliable sources and new useful tools that surpass human capabilities.

However, we trust in allowing scientists to determine how to construct their own protocols based on this information, as they are the experts in their field.

Ready to get started?

Revolutionizing how scientists
search and build protocols!