Predict Coding Regions

Introduction

The Predict Coding Regions functionality detects candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly. It is based on TransDecoder, a pipeline that recognizes likely coding sequences based on the following criteria:

  • A minimum length open reading frame (ORF) is found in a transcript sequence.

  • A log-likelihood coding score is computed for the ORF, and it must be greater than 0.

  • The above coding score is higher when the ORF is scored in the 1st reading frame as compared to scores in the other 2 forward reading frames.

  • If a candidate ORF is fully contained within the coordinates of another candidate ORF, only the longer one is reported. A single transcript can nevertheless report multiple ORFs (allowing for operons, chimeras, etc.).

  • A Position-Specific Scoring Matrix (PSSM) is built, trained and used to refine the start codon prediction.

  • The putative peptide has a match to a Pfam domain above the noise cut-off score (optional).
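The first and fourth criteria can be illustrated with a minimal Python sketch. It is not TransDecoder's implementation: the function names `find_complete_orfs` and `drop_contained` are hypothetical, only the universal stop codons are considered, and only complete ORFs (ATG to stop) on the forward strand are reported.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: universal genetic code


def find_complete_orfs(seq, min_protein_len=100):
    """Return (start, end) of ATG..stop ORFs in the three forward frames.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    The protein length counts the codons from ATG up to, but not including,
    the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_protein_len:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


def drop_contained(orfs):
    """Keep the longer ORF when one ORF lies fully inside another."""
    kept = []
    for orf in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        if not any(k[0] <= orf[0] and orf[1] <= k[1] for k in kept):
            kept.append(orf)
    return sorted(kept)


toy_transcript = "ATG" + "GCC" * 5 + "TAA"
candidates = drop_contained(find_complete_orfs(toy_transcript, min_protein_len=3))
```

Overlapping ORFs that are not fully contained in one another are all kept, which is how a single transcript can still report several ORFs.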

Please cite TransDecoder as:

Run Predict Coding Regions

This functionality can be found under Transcriptomics → Assembly → Predict Coding Regions. The wizard allows you to provide input files and adjust the analysis parameters (Figure 1 and Figure 2).

Extract the Long ORFs Configuration

  • Genetic Code: Select the genetic code of the organism under study. The available genetic codes are:

  • Universal

  • Acetabularia

  • Candida

  • Ciliate

  • Dasycladacean

  • Euplotid

  • Hexamita

  • Mesodinium

  • Mitochondrial Ascidian

  • Mitochondrial Chlorophycean

  • Mitochondrial Echinoderm

  • Mitochondrial Flatworm

  • Mitochondrial Invertebrates

  • Mitochondrial Protozoan

  • Mitochondrial Pterobranchia

  • Mitochondrial Scenedesmus obliquus

  • Mitochondrial Thraustochytrium

  • Mitochondrial Trematode

  • Mitochondrial Vertebrates

  • Mitochondrial Yeast

  • Pachysolen tannophilus

  • Peritrich

  • SR1 Gracilibacteria

  • Tetrahymena

  • Minimum Protein Length: Minimum length, in amino acids, that a predicted protein must reach for its coding region to be retained.

  • Strand Specific: When this option is enabled, only the top strand is analyzed.

  • Provide Gene-Transcript relation: Provide a tab-delimited file with the information to map from transcript (isoform) IDs to gene IDs. Each line should be of the form: Gene ID[tab]Transcript ID.
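As an illustration of the expected Gene-Transcript relation file, the following Python sketch reads such a file and checks that every line has exactly two tab-separated columns. The function name `read_gene_trans_map` and the example identifiers are hypothetical.

```python
import csv
import io


def read_gene_trans_map(handle):
    """Parse 'Gene ID<TAB>Transcript ID' lines into {transcript_id: gene_id}."""
    transcript_to_gene = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        gene_id, transcript_id = row
        transcript_to_gene[transcript_id] = gene_id
    return transcript_to_gene


# Made-up identifiers: two isoforms of gene1 and one isoform of gene2.
example = io.StringIO("gene1\tgene1_i1\ngene1\tgene1_i2\ngene2\tgene2_i1\n")
mapping = read_gene_trans_map(example)
```

Running a check like this before launching the analysis helps catch files in which the columns are swapped or separated by spaces instead of tabs.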

Homology Search Configuration

  • Pfam Search: Identify ORFs with homology to known proteins via Pfam searches. Searching Pfam identifies common protein domains, and a domain match is then used as an additional ORF retention criterion. Note that this option significantly increases the execution time.

Figure 1: Configuration Page 1

Predict the Likely Coding Regions Configuration

  • Retain Long ORFs Mode: Select the strategy used to retain long ORFs. The dynamic mode sets the length range according to a 1% false discovery rate (FDR) in random sequence of the same GC content. Under the strict mode, all ORFs that are equal to or longer than the Retain Long ORFs Length are kept, even if no other evidence marks them as coding.

  • Retain Long ORFs Length: Select the minimum length to retain ORFs under the strict mode.

  • Single Best Only: Retain only the single best ORF per transcript (prioritized by homology, then ORF length).

  • No Refine Starts: By default, Predict Coding Regions identifies potential start codons for 5' partial ORFs using a position weight matrix (PWM). Check this option to deactivate this refinement.

  • Top Longest ORFs for Training: Number of longest ORFs used to train the Markov model (hexamer statistics). The default value is 500. Note that 10 times this number of longest ORFs is selected first, redundancies are removed, and then this number of longest ORFs is taken from the non-redundant set.
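The selection of the training set described in the last option can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `select_training_orfs` is hypothetical, and removal of exact duplicate sequences stands in for whatever redundancy removal the tool applies internally.

```python
def select_training_orfs(orfs, top_n=500):
    """Pick the training ORFs: longest 10 * top_n, deduplicate, keep top_n."""
    # Step 1: take the 10 * top_n longest ORFs.
    longest_first = sorted(orfs, key=len, reverse=True)
    candidates = longest_first[:10 * top_n]

    # Step 2: remove redundancy (assumption: exact duplicates only).
    seen = set()
    non_redundant = []
    for orf in candidates:
        if orf not in seen:
            seen.add(orf)
            non_redundant.append(orf)

    # Step 3: keep the top_n longest ORFs of the non-redundant set.
    return non_redundant[:top_n]


toy_orfs = ["ATGAAATAA"] * 3 + ["ATGTAA", "ATGCCCGGGTAA"]
training_set = select_training_orfs(toy_orfs, top_n=2)
```

Selecting a larger pool first means that duplicated sequences do not reduce the size of the final training set.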

Figure 2: Configuration Page 2

Results

Once finished, results are returned in three projects (Figure 3):

  • Protein sequences: A sequence table that contains peptide sequences for the final candidate ORFs.

  • CDS sequences: A sequence table that contains nucleotide sequences for coding regions of the final candidate ORFs.

  • ORFs Coordinates: A GFF project that contains positions within the target transcripts of the final selected ORFs.

Note that in both sequence projects, CDSs and proteins, the description field contains details about the predicted ORF. This description includes:

  • The protein identifier, composed of the original transcript identifier followed by '|m.(number)'.

  • The type attribute indicates whether the protein is:

    • Complete: Contains a start and a stop codon.

    • 5' partial: It is missing a start codon and presumably part of the N-terminus.

    • 3' partial: It is missing the stop codon and presumably part of the C-terminus.

    • Internal: It is both 5' and 3' partial.

  • An indicator (+) or (-) to indicate in which strand the coding region was found, along with the coordinates of the ORF in that transcript sequence.
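The four values of the type attribute follow directly from the presence or absence of a start and a stop codon. A minimal Python sketch of this rule is shown below, assuming the universal genetic code, ATG as the only start codon, and a CDS sequence given in frame; the function name `orf_type` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: universal genetic code


def orf_type(cds, start_codons=("ATG",)):
    """Classify an in-frame CDS by the presence of start and stop codons."""
    cds = cds.upper()
    has_start = cds[:3] in start_codons
    has_stop = cds[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5' partial"  # stop codon present, start codon missing
    if has_start:
        return "3' partial"  # start codon present, stop codon missing
    return "internal"        # both ends missing


types = [orf_type(s) for s in ("ATGGCCTAA", "GCCGCCTAA", "ATGGCCGCC", "GCCGCCGCC")]
```

For organisms that use a different genetic code, the set of stop codons and the accepted start codons would need to be adjusted accordingly.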

Figure 3: Predict Coding Regions Results

In addition, a result page will show a summary of the "Predict Coding Regions" results (Figure 4). This page provides a quick evaluation of the results and provides ID lists containing transcript identifiers assigned to the different categories.

Figure 4: Predict Coding Regions Report

Furthermore, the Predict Coding Regions Summary chart (Figure 5) shows the percentage of ORFs that have been predicted as Complete, 5' Partial, 3' Partial, and Internal.

Figure 5: Predict Coding Regions Summary