Machine Learning for Coding Sequence Identification

To distinguish protein-coding from non-coding sequences, we extracted five features: the length and S-score of the MLCDS, length-percentage, score-distance and codon-bias. The length and S-score of the MLCDS were used as the first two features; they assess the extent and quality of the MLCDS, respectively (Supplementary Table S3). Moreover, as demonstrated earlier in the text, protein-coding transcripts possess one reading frame that is clearly distinct from the other five in its ANT distribution. For each transcript, we analyzed the six MLCDS candidates produced by dynamic programming over the six reading frames, on the assumption that one best MLCDS must exist (as described earlier in the text); non-coding transcripts, by contrast, generally do not show this pattern. We therefore defined two further features, length-percentage and score-distance, as follows:

$$\text{Length-percentage} = \frac{M_l}{\sum_{i=1}^{6} Y_i}$$

where M_l is the length of the best MLCDS (the one with the highest S-score) among the six reading frames, and Y_i is the length of each of the six MLCDS.

$$\text{Score-distance} = \frac{\sum_{j=1}^{5} (S - E_j)}{5}$$

where S is the S-score of the best MLCDS, and E_j is the S-score of each of the other five MLCDS (Supplementary Table S3).
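
The two definitions above can be sketched in Python. This is a minimal illustration rather than the authors' code. It assumes that length-percentage is the length of the best MLCDS divided by the summed lengths of all six candidates, and that score-distance is the mean difference between the best S-score and the other five. The function name `frame_features` and the `(length, s_score)` input pairs are hypothetical.

```python
from typing import List, Tuple


def frame_features(candidates: List[Tuple[int, float]]) -> Tuple[float, float]:
    """Return (length_percentage, score_distance) for six MLCDS candidates.

    Each candidate is a hypothetical (length, s_score) pair, one per reading frame.
    """
    if len(candidates) != 6:
        raise ValueError("expected one MLCDS candidate per reading frame (6 total)")

    # The best MLCDS is the candidate with the highest S-score.
    best_index = max(range(6), key=lambda i: candidates[i][1])
    best_length, best_score = candidates[best_index]

    # Assumed form: best length as a fraction of all six candidate lengths.
    total_length = sum(length for length, _ in candidates)
    if total_length == 0:
        raise ValueError("all candidate lengths are zero")
    length_percentage = best_length / total_length

    # Assumed form: mean gap between the best S-score and the other five.
    others = [s for i, (_, s) in enumerate(candidates) if i != best_index]
    score_distance = sum(best_score - s for s in others) / len(others)

    return length_percentage, score_distance
```

Under these assumptions, a transcript with one dominant reading frame yields high values for both features, whereas six similar candidates give a length-percentage near 1/6 and a score-distance near zero.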
All four of the aforementioned features could, to some extent, distinguish protein-coding from non-coding sequences, and were consistently higher in protein-coding transcripts than in non-coding transcripts (Supplementary Figure S4). Finally, we included a fifth feature, the frequency of nucleotide triplets in the MLCDS, to complement the classification model. This feature was defined as codon-bias, which evaluates the coding versus non-coding bias of each of the 61 codons (the three stop codons were excluded) (Supplementary Figure S5).
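
As a rough illustration of the triplet-frequency step, the sketch below counts in-frame codons in an MLCDS and normalizes them over the 61 sense codons. It does not reproduce the coding versus non-coding bias calculation itself, which is described only in the supplementary material. The function name `codon_frequencies` is hypothetical, and the sketch assumes the sequence is already in the MLCDS reading frame.

```python
from collections import Counter
from itertools import product
from typing import Dict

STOP_CODONS = {"TAA", "TAG", "TGA"}

# The 61 sense codons: all 64 triplets minus the three stop codons.
SENSE_CODONS = [
    "".join(triplet)
    for triplet in product("ACGT", repeat=3)
    if "".join(triplet) not in STOP_CODONS
]
SENSE_CODON_SET = set(SENSE_CODONS)


def codon_frequencies(mlcds: str) -> Dict[str, float]:
    """Return the relative frequency of each sense codon in an in-frame sequence."""
    seq = mlcds.upper().replace("U", "T")

    # Split into non-overlapping triplets, dropping any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

    # Stop codons and triplets with ambiguous bases are ignored.
    counts = Counter(c for c in codons if c in SENSE_CODON_SET)
    total = sum(counts.values())

    return {c: (counts[c] / total if total else 0.0) for c in SENSE_CODONS}
```

Frequencies computed this way on known coding and non-coding sets could then be compared codon by codon to derive a bias score, though the exact comparison used in the paper is not stated here.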
To obtain the positive and negative training sets, we extracted the five features for each best MLCDS from the known protein-coding and non-coding transcript data sets, respectively. We then used these two training sets to construct a support vector machine (SVM) model (Figure 1c). The SVM was trained with A Library for Support Vector Machines (LIBSVM) (13) using the standard radial basis function kernel, with the C and gamma parameters left at their default values.
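
To make the training input concrete, the sketch below writes labelled feature vectors in LIBSVM's sparse text format (`<label> <index>:<value> ...`, with 1-based feature indices). The function name `to_libsvm_lines`, the `+1`/`-1` labelling of coding and non-coding transcripts, and the order of features in each vector are assumptions made for illustration; the source does not specify how the features were laid out.

```python
from typing import Iterable, List, Sequence, Tuple


def to_libsvm_lines(samples: Iterable[Tuple[int, Sequence[float]]]) -> List[str]:
    """Format (label, features) pairs as lines in LIBSVM's sparse text format."""
    lines = []
    for label, features in samples:
        # LIBSVM feature indices start at 1 and must be in ascending order.
        pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
        # Assumed labelling: +1 for protein-coding, -1 for non-coding.
        lines.append(f"{label:+d} {pairs}")
    return lines
```

A file of such lines can be passed to LIBSVM's `svm-train` tool. Its defaults are a C-SVC classifier with a radial basis function kernel, C = 1 and gamma = 1/(number of features), which matches the description of default parameters above.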

Sun L., Luo H., Bu D., Zhao G., Yu K., Zhang C., Liu Y., Chen R., & Zhao Y. (2013). Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Research, 41(17), e166.

Publication 2013

Coding sequences Codons bias Gamma Library Nucleotide Protein Protein coding sequences Reading frames Stop codons Triplets

Other organizations : Institute of Computing Technology, Institute of Biophysics, Chinese Academy of Sciences

Protocol cited in 41 other protocols

Variable analysis

independent variables

Length and S-score of MLCDS
Length-percentage
Score-distance
Codon-bias

dependent variables

Ability to distinguish protein-coding sequences from non-coding sequences

control variables

Six MLCDS candidates outputted by dynamic programming of the six reading frames for each transcript
Known protein-coding and non-coding transcript data sets used for the positive and negative training sets
