Thanks to Next Generation Sequencing methods, transcriptomes are becoming increasingly abundant. Once the transcripts have been assembled and the sequences that were transcribed into RNA are available, we must distinguish between coding transcripts (mRNA) and non-coding ones (ncRNA). This classification can be done by assigning each transcript a score based on its nucleotide composition and patterns.
The "Coding Potential Assessment Tool" provides an easy and fast way to classify transcripts according to their coding score. This tool integrates the CPAT algorithm within OmicsBox. The CPAT algorithm needs models in order to assign a coding potential score to each sequence. OmicsBox incorporates the standard CPAT models and adds models for some of the organisms most commonly used in molecular biology. In addition to the prebuilt models, this tool offers the option to create your own species-specific model.
Run Coding Potential Assessment Tool
This tool can be found under Functional Analysis → Coding Potential Assessment (CPAT). The wizard allows adjusting analysis parameters (Figure 2).
Accuracy: By default, the accuracy is set automatically to keep both false positives and false negatives low: the threshold is the value at which the sensitivity equals the specificity.
If higher accuracy is desired, it can be set manually. Raising the accuracy classifies the sequences into three categories: coding, non-coding, and transcripts with unknown coding potential (Figure 1).
A manually set accuracy can never be lower than the default value. If a lower value is entered, the accuracy automatically falls back to the default value.
Figure 1: Accuracy: Interpretation of the double ROC Curve
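The default threshold described above, the point where sensitivity equals specificity, can be illustrated with a minimal Python sketch. This is not the OmicsBox or CPAT implementation: the function names, the toy scores, and the rule that a score at or above the cutoff counts as coding are assumptions made for illustration only.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called coding.

    labels use 1 for known coding and 0 for known non-coding sequences.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_cutoff(scores, labels):
    """Return the observed score at which sensitivity and specificity are closest."""
    def gap(cutoff):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        return abs(sens - spec)
    return min(sorted(set(scores)), key=gap)


# Hypothetical coding probabilities for a labelled validation set.
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

cutoff = balanced_cutoff(scores, labels)
sens, spec = sensitivity_specificity(scores, labels, cutoff)
print(f"cutoff={cutoff}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Only observed scores are tried as candidate cutoffs, which keeps the sketch short. A real ROC analysis would typically interpolate between points.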
Models: The algorithm needs models to calculate the coding potential for each transcript. Here we can choose the origin of these models:
Prebuilt: Use one of the available prebuilt models. With a prebuilt model, the algorithm runs faster.
From files: Create the model from two FASTA files: one with coding sequences and one with non-coding sequences.
Please make sure to follow these guidelines: http://rna-cpat.sourceforge.net/#how-to-prepare-training-dataset
From NCBI sequences: Create a new taxon-specific model from the sequences available at NCBI.
The model creation takes the following into account:
The number of ncRNA and CDS sequences is the same.
CDS and ncRNA datasets do not contain any duplicates.
All CDS lengths are divisible by 3.
To reach the minimum number of CDS and ncRNA, the tool will search in parent taxa up to the phylum rank, if necessary.
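The dataset constraints listed above (equal numbers of CDS and ncRNA, no duplicates, CDS lengths divisible by 3) can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual procedure: `prepare_training_sets` is a hypothetical helper, duplicates are assumed to mean identical sequences (ignoring case), and the larger set is balanced by random subsampling. The search in parent taxa is not covered.

```python
import random


def prepare_training_sets(cds, ncrna, seed=0):
    """Return (cds, ncrna) dicts that satisfy the three constraints.

    cds, ncrna: dicts mapping a sequence name to its nucleotide string.
    """
    def dedupe(records):
        # Assumption: two records are duplicates if their sequences match.
        seen, unique = set(), {}
        for name, seq in records.items():
            key = seq.upper()
            if key not in seen:
                seen.add(key)
                unique[name] = seq
        return unique

    # Keep only CDS whose length is a whole number of codons.
    cds = {n: s for n, s in dedupe(cds).items() if len(s) % 3 == 0}
    ncrna = dedupe(ncrna)

    # Balance both sets by subsampling to the size of the smaller one.
    size = min(len(cds), len(ncrna))
    rng = random.Random(seed)
    cds_keep = rng.sample(sorted(cds), size)
    ncrna_keep = rng.sample(sorted(ncrna), size)
    return {n: cds[n] for n in cds_keep}, {n: ncrna[n] for n in ncrna_keep}


# Hypothetical toy input.
cds = {
    "c1": "ATGAAATAA",
    "c2": "ATGAAATAA",     # duplicate of c1
    "c3": "ATGCCCGGGTAA",
    "c4": "ATGCC",         # length not divisible by 3
    "c5": "ATGTTTTGA",
}
ncrna = {
    "n1": "ACGTACGTAC",
    "n2": "acgtacgtac",    # duplicate of n1 (case-insensitive)
    "n3": "GGGTTTAAAC",
}
cds_out, ncrna_out = prepare_training_sets(cds, ncrna)
print(len(cds_out), len(ncrna_out))
```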
Figure 2: Wizard Page
Once finished, three results are automatically created:
Coding Potential Table
A table containing the coding potential results for each input sequence (Figure 3).
Tag: Indicates whether each sequence is a coding transcript, a non-coding transcript, or a transcript with unknown coding potential.
Sequence: The name of the sequence.
mRNA size: The length of the original transcript.
ORF size: The size of the potential ORF within the sequence.
Fickett score: The Fickett score is a linguistic feature that distinguishes protein-coding RNA from ncRNA according to the combined effect of nucleotide composition and codon usage bias.
Hexamer score: The hexamer score is calculated using a log-likelihood ratio to measure differential hexamer usage between coding and non-coding sequences.
Coding Probability: The coding probability assigned to each transcript.
Figure 3: CPAT Results
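To make the hexamer score column more concrete, the following Python sketch computes an average log-likelihood ratio of hexamer usage between a coding and a non-coding reference set. It is a simplified illustration, not CPAT's own code: the function names, the step of 3 nucleotides between hexamers, and the pseudocount for unseen hexamers are all assumptions.

```python
import math
from collections import Counter


def hexamer_frequencies(sequences, step=3):
    """Relative frequency of each hexamer across the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 5, step):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {hexamer: n / total for hexamer, n in counts.items()}


def hexamer_score(seq, coding_freq, noncoding_freq, pseudo=1e-6, step=3):
    """Mean log ratio of coding vs. non-coding hexamer frequencies.

    Positive values mean the hexamers look more like the coding set,
    negative values more like the non-coding set.
    """
    seq = seq.upper()
    hexamers = [seq[i:i + 6] for i in range(0, len(seq) - 5, step)]
    if not hexamers:
        return 0.0
    total = sum(
        math.log((coding_freq.get(h, 0.0) + pseudo)
                 / (noncoding_freq.get(h, 0.0) + pseudo))
        for h in hexamers
    )
    return total / len(hexamers)


# Hypothetical reference sets, far smaller than any real training data.
coding_freq = hexamer_frequencies(["ATGGCCGCCGCCGCC"])
noncoding_freq = hexamer_frequencies(["ATATATATATATATAT"])

print(hexamer_score("GCCGCCGCCGCC", coding_freq, noncoding_freq))
print(hexamer_score("ATATATATATAT", coding_freq, noncoding_freq))
```

The pseudocount only prevents division by zero for hexamers missing from one of the tables. With realistic training sets its influence is small.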
Coding Potential Distribution
A pie chart of the classification results for the corresponding sequences, depending on the provided cutoffs (Figure 4).
Figure 4: Coding Potential Distribution
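How the tags and the distribution follow from the coding probabilities and the cutoffs can be sketched in Python. This is a hedged illustration: the cutoff values, the transcript names, and the handling of values exactly on a cutoff are assumptions, not the behaviour documented for OmicsBox.

```python
from collections import Counter


def tag_transcript(probability, lower_cutoff, upper_cutoff):
    """Assign a tag from a coding probability and two cutoffs.

    With lower_cutoff == upper_cutoff this reduces to a single threshold
    and the "unknown" category is never used.
    """
    if probability >= upper_cutoff:
        return "coding"
    if probability < lower_cutoff:
        return "non-coding"
    return "unknown"


def tag_distribution(tags):
    """Percentage of transcripts per tag, as a pie chart would display it."""
    counts = Counter(tags.values())
    total = len(tags)
    return {tag: 100 * counts[tag] / total
            for tag in ("coding", "non-coding", "unknown")}


# Hypothetical coding probabilities and cutoffs.
probabilities = {"t1": 0.98, "t2": 0.50, "t3": 0.10, "t4": 0.85}
tags = {name: tag_transcript(p, lower_cutoff=0.3, upper_cutoff=0.8)
        for name, p in probabilities.items()}
distribution = tag_distribution(tags)
print(tags)
print(distribution)
```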
Model Accuracy via a double ROC-Curve chart
This chart opens when a new model is created or when the accuracy is set manually. It shows the quality, the accuracy, and the different thresholds of a model (Figure 5).
Figure 5: Double ROC Curve