Coding Potential (Genome Analysis)

Introduction

Thanks to Next Generation Sequencing methods, transcriptomes are becoming more and more abundant. Once the transcripts have been assembled and the sequences that were transcribed into RNA are available, we must distinguish the coding transcripts (mRNA) from the non-coding ones (ncRNA). This classification can be done by assigning each transcript a score based on its nucleotide composition and sequence patterns.

Coding Potential Assessment Tool

The "Coding Potential Assessment Tool" provides a quick and easy way to classify transcripts according to their coding score. This tool integrates the CPAT algorithm within OmicsBox. The CPAT algorithm needs models in order to assign a coding potential score to each sequence. OmicsBox incorporates the standard CPAT models and adds models for some of the organisms most commonly used in molecular biology. In addition to the prebuilt models, the tool offers the option to create your own species-specific model.


Figure 1: Coding Potential Assessment Tool in the OmicsBox Analysis Menu

Run Coding Potential Assessment Tool

This tool can be found under Genome Analysis → Coding Potential Assessment (CPAT). The wizard allows you to adjust the analysis parameters (Figure 3).

  • Accuracy: By default, the accuracy is set automatically in order to reduce both false positives and false negatives. This means that the threshold is the value at which the sensitivity equals the specificity.

    If higher accuracy is desired, it can be set manually. Raising the accuracy classifies the sequences into three categories: coding, non-coding, and transcripts with unknown coding potential (Figure 2).

    A manually set accuracy can never be lower than the default value. If a lower value is entered, the accuracy automatically falls back to the default value.

Figure 2: Accuracy: Interpretation of the double ROC Curve
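
To illustrate the default threshold described above, the following minimal Python sketch picks the cutoff at which sensitivity and specificity are closest to each other. It is not the OmicsBox or CPAT implementation: the function names, the toy scores, and the rule that a score at or above the cutoff counts as coding are assumptions made for illustration only.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Return (sensitivity, specificity) when score >= cutoff is called coding."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_cutoff(scores, labels):
    """Pick the observed score where |sensitivity - specificity| is smallest."""
    def gap(cutoff):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        return abs(sens - spec)

    return min(sorted(set(scores)), key=gap)


# Toy data invented for illustration: coding probabilities and true labels
# (True = known coding transcript, False = known non-coding transcript).
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, True, False, True, False, False, False]

cutoff = balanced_cutoff(scores, labels)
sens, spec = sensitivity_specificity(scores, labels, cutoff)
print(f"cutoff={cutoff:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

On real data the two curves rarely cross exactly at an observed score, which is why the sketch minimizes the gap instead of testing for equality.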

  • Models: The algorithm needs models to calculate the coding potential for each transcript. Here we can choose the origin of these models: 
    • Prebuilt: Use one of the prebuilt models available. With a prebuilt
      model, the algorithm runs faster.


      Species                    Accuracy   Coding Cutoff
      Arabidopsis thaliana       0.984      0.415
      Bos taurus                 0.953      0.359
      Caenorhabditis elegans     0.998      0.523
      Danio rerio                0.984      0.38
      Drosophila melanogaster    0.963      0.39
      Gallus gallus              0.93       0.402
      Homo sapiens               0.966      0.364
      Mus musculus               0.955      0.440
      Rattus norvegicus          0.98       0.363
      Sus scrofa                 0.946      0.467
      Xenopus laevis             0.963      0.415
    • From files: Create the model by providing two FASTA files: one with coding sequences and one with non-coding sequences.
    • From NCBI sequences: Create a new species-specific model from the sequences available in the NCBI database by entering the species' scientific name or ID in the search box. At least 1000 coding and 1000 non-coding sequences are required.

      Note: Checking the `Get Parent-Taxa ncRNA` option allows non-coding RNA sequences from higher parent taxa to be used until the 1000 required non-coding sequences are reached.

Figure 3: Wizard Page
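
As a rough illustration of how the coding cutoffs listed for the prebuilt models can be read, the Python sketch below looks up a species cutoff and compares a coding probability against it. The dictionary and the function name `is_coding` are hypothetical, and treating a probability at or above the cutoff as coding is an assumption of this sketch, not a statement about the OmicsBox internals.

```python
# Coding cutoffs copied from the prebuilt model table above.
PREBUILT_CUTOFFS = {
    "Arabidopsis thaliana": 0.415,
    "Bos taurus": 0.359,
    "Caenorhabditis elegans": 0.523,
    "Danio rerio": 0.38,
    "Drosophila melanogaster": 0.39,
    "Gallus gallus": 0.402,
    "Homo sapiens": 0.364,
    "Mus musculus": 0.440,
    "Rattus norvegicus": 0.363,
    "Sus scrofa": 0.467,
    "Xenopus laevis": 0.415,
}


def is_coding(probability, species):
    """Assumed rule: a probability at or above the species cutoff is coding."""
    return probability >= PREBUILT_CUTOFFS[species]


# The same probability can fall on different sides of the cutoff
# depending on the model that was selected.
for species in ("Homo sapiens", "Mus musculus"):
    print(species, is_coding(0.40, species))
```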

Results

Once the analysis has finished, three result types are created automatically:

  • Coding Potential Table: Here you can see the results for each sequence: 
    • Tag: Marks each sequence as a coding, non-coding, or unknown coding potential transcript.
    • Sequence: The name of the sequence.
    • mRNA size: The length of the original transcript.
    • ORF size: The size of the potential ORF within the sequence.
    • Fickett score: A linguistic feature that distinguishes protein-coding RNA from ncRNA according to the combined effect of nucleotide composition and codon usage bias.
    • Hexamer score: Calculated as a log-likelihood ratio that measures differential hexamer usage between coding and non-coding sequences.
    • Coding Probability: The coding probability assigned to each transcript.
  • Pie Chart: Shows how the sequences are distributed among the classification categories according to the provided cutoffs (Figure 4).
  • Model Accuracy via a double ROC-Curve chart: This chart opens when a new model is created or when the accuracy is manually set. In this chart, we can check the quality, the accuracy and the different thresholds of a model (Figure 5).
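
The idea behind the hexamer score in the table can be sketched in a few lines of Python. This is a simplified illustration, not the CPAT code: it assumes in-frame hexamers read in steps of three nucleotides, averages the log ratios over the sequence, and uses a hypothetical pseudo-frequency `pseudo` for hexamers missing from one of the reference sets.

```python
import math
from collections import Counter


def hexamer_frequencies(sequences):
    """Relative frequency of each in-frame hexamer across a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(0, len(seq) - 5, 3):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {hexamer: n / total for hexamer, n in counts.items()}


def hexamer_score(seq, coding_freq, noncoding_freq, pseudo=1e-6):
    """Mean log-likelihood ratio of the hexamers in seq.

    Positive values mean the hexamers are more typical of the coding set,
    negative values mean they are more typical of the non-coding set.
    """
    ratios = []
    for i in range(0, len(seq) - 5, 3):
        hexamer = seq[i:i + 6]
        coding = coding_freq.get(hexamer, pseudo)
        noncoding = noncoding_freq.get(hexamer, pseudo)
        ratios.append(math.log(coding / noncoding))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Tiny invented reference sets, far too small for real use.
coding_freq = hexamer_frequencies(["ATGGCCATGGCC"])
noncoding_freq = hexamer_frequencies(["TTTTTTTTTTTT"])

print(hexamer_score("ATGGCCATGGCC", coding_freq, noncoding_freq))
print(hexamer_score("TTTTTTTTTTTT", coding_freq, noncoding_freq))
```

In practice the reference frequencies come from the model, that is, from large sets of known coding and non-coding sequences.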

Figure 4: Distribution of the coding potential

Figure 5: Double ROC curve showing the model quality, accuracy and threshold
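
The three tags and the distribution shown in the pie chart can be mimicked with a short Python sketch. It assumes that a manually raised accuracy yields two cutoffs, with transcripts between them tagged as unknown. The cutoff values, sequence names, and probabilities below are hypothetical.

```python
from collections import Counter


def tag_transcript(probability, noncoding_cutoff, coding_cutoff):
    """Tag one transcript from its coding probability and two cutoffs."""
    if probability >= coding_cutoff:
        return "coding"
    if probability < noncoding_cutoff:
        return "non-coding"
    return "unknown"


# Hypothetical coding probabilities keyed by sequence name.
probabilities = {"tx1": 0.92, "tx2": 0.45, "tx3": 0.08, "tx4": 0.71, "tx5": 0.30}

# Hypothetical cutoffs; with equal cutoffs the "unknown" category disappears.
tags = {name: tag_transcript(p, 0.35, 0.60) for name, p in probabilities.items()}
distribution = Counter(tags.values())

for tag, count in distribution.items():
    print(f"{tag}: {count} ({100 * count / len(tags):.0f}%)")
```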