Machine Learning for Coding Sequence Identification
To distinguish protein-coding from non-coding sequences, we extracted five features: the length and S-score of the MLCDS, length-percentage, score-distance and codon-bias. The length and S-score of the MLCDS were used as the first two features; they assess the extent and quality of the MLCDS, respectively (Supplementary Table S3). Moreover, as demonstrated earlier in the text, protein-coding transcripts possess one reading frame whose ANT distribution is clearly distinct from that of the other five. For each transcript, we analyzed the six MLCDS candidates produced by dynamic programming over the six reading frames, on the assumption that one best MLCDS must exist (as described earlier in the text); non-coding transcripts generally do not show this pattern. We therefore defined two further features, length-percentage and score-distance, as follows:
$$\text{Length-percentage} = \frac{M_l}{\sum_{i=1}^{6} Y_i}$$
where $M_l$ is the length of the best MLCDS (selected by S-score) among the six reading frames, and $Y_i$ is the length of the MLCDS from reading frame $i$.
$$\text{Score-distance} = \frac{1}{5}\sum_{j=1}^{5} (S - E_j)$$
where $S$ is the S-score of the best MLCDS and $E_j$ is the S-score of each of the other five MLCDS (Supplementary Table S3). All four of these features could, to some extent, distinguish protein-coding from non-coding sequences, and all were consistently higher in protein-coding transcripts than in non-coding transcripts (Supplementary Figure S4). Finally, we included a fifth feature, the frequency of individual nucleotide triplets in the MLCDS, to complement the classification model. This feature, termed codon-bias, evaluates the coding versus non-coding bias of each of the 61 codons (the three stop codons were excluded) (Supplementary Figure S5).
To obtain the positive and negative training sets, we extracted the five features for the best MLCDS of each transcript in the known protein-coding and non-coding transcript data sets, respectively. We then used these two training sets to construct a support vector machine (SVM) model (Figure 1c). The SVM was trained with LIBSVM (A Library for Support Vector Machines) (13) using the standard radial basis function kernel, with the C and gamma parameters left at their default values.
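The two frame-comparison features can be illustrated with a short Python sketch. It is not the authors' implementation: the `Candidate` container and the function names are hypothetical, the best MLCDS is assumed to be the candidate with the highest S-score, and the averaging over the other five frames follows the definitions of length-percentage and score-distance given in the text. The text does not state the exact codon-bias formula, so `codon_frequencies` only computes the relative frequency of each of the 61 sense codons in an MLCDS, which is one plausible raw input to such a feature.

```python
from collections import Counter
from itertools import product
from typing import Dict, NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    """One MLCDS candidate from a single reading frame (hypothetical container)."""
    length: int      # MLCDS length in nucleotides
    s_score: float   # S-score of the MLCDS


def frame_features(candidates: Sequence[Candidate]) -> Tuple[float, float]:
    """Return (length_percentage, score_distance) for six frame candidates."""
    if len(candidates) != 6:
        raise ValueError("expected one MLCDS candidate per reading frame (6 total)")
    # Assumption: the best MLCDS is the candidate with the highest S-score.
    best_idx = max(range(6), key=lambda i: candidates[i].s_score)
    best = candidates[best_idx]
    # Length-percentage: best length divided by the summed length of all six.
    total_length = sum(c.length for c in candidates)
    length_percentage = best.length / total_length if total_length else 0.0
    # Score-distance: mean gap between the best S-score and the other five.
    others = [c for i, c in enumerate(candidates) if i != best_idx]
    score_distance = sum(best.s_score - c.s_score for c in others) / len(others)
    return length_percentage, score_distance


STOP_CODONS = {"TAA", "TAG", "TGA"}
# The 61 sense codons: all 64 triplets minus the three stop codons.
SENSE_CODONS = [
    "".join(p) for p in product("ACGT", repeat=3) if "".join(p) not in STOP_CODONS
]
SENSE_SET = set(SENSE_CODONS)


def codon_frequencies(mlcds: str) -> Dict[str, float]:
    """Relative frequency of each sense codon, read in frame from position 0."""
    seq = mlcds.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    counts = Counter(c for c in codons if c in SENSE_SET)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in SENSE_CODONS}
```

For example, six candidates with lengths 300, 60, 45, 90, 30 and 75 and S-scores 12, 2, 1, 3, 0 and 4 give a length-percentage of 300/600 and a score-distance of (10 + 11 + 9 + 12 + 8)/5. A dominant frame pushes both values up, which matches the observation that these features are higher in protein-coding transcripts.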
Sun L., Luo H., Bu D., Zhao G., Yu K., Zhang C., Liu Y., Chen R., & Zhao Y. (2013). Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Research, 41(17), e166.
Publication 2013
Other organizations :
Institute of Computing Technology, Institute of Biophysics, Chinese Academy of Sciences