Predict Coding Regions

Introduction

The Predict Coding Regions functionality detects candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly. It is based on TransDecoder, a pipeline that recognizes likely coding sequences based on the following criteria:

  • A minimum length open reading frame (ORF) is found in a transcript sequence.

  • A log-likelihood coding score is computed for the ORF and must be greater than 0.

  • The above coding score is higher when the ORF is scored in the 1st reading frame as compared to scores in the other 2 forward reading frames.

  • If a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc).

  • A Position-Specific Scoring Matrix (PSSM) is built, trained and used to refine the start codon prediction.

  • The putative peptide has a match to a Pfam domain above the noise cut-off score (optional).
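The first criterion above, a minimum-length ORF in a forward reading frame, can be illustrated with a minimal sketch. This is not TransDecoder's implementation: the function name `find_orfs`, the default threshold and the restriction to complete ORFs (ATG to stop) under the universal genetic code are assumptions made for illustration. Partial ORFs, coding scores and start refinement are left out.

```python
# Illustrative sketch only; not the TransDecoder implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: universal genetic code


def find_orfs(seq, min_protein_len=100):
    """Return (start, end, frame) tuples for complete forward-strand ORFs.

    An ORF is reported when it runs from an ATG to the next in-frame stop
    codon and spans at least `min_protein_len` codons (stop codon excluded).
    Coordinates are 0-based and end-exclusive; `end` includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_protein_len:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy transcript: one ORF of 6 codons plus a stop codon.
toy_transcript = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy_transcript, min_protein_len=5))
```

Raising `min_protein_len` above the ORF length in the toy transcript makes the function return an empty list, which mirrors how the Minimum Protein Length parameter discards short candidates.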

Please cite TransDecoder as:

Run Predict Coding Regions

This functionality can be found under Transcriptomics → Assembly → Predict Coding Regions. The wizard lets you select the input files and adjust the analysis parameters (Figure 1).

  • Genetic Code: Select the genetic code of the organism under study. The available genetic codes are:

    • Universal

    • Acetabularia

    • Candida

    • Ciliate

    • Dasycladacean

    • Euplotid

    • Hexamita

    • Mesodinium

    • Mitochondrial Ascidian

    • Mitochondrial Chlorophycean

    • Mitochondrial Echinoderm

    • Mitochondrial Flatworm

    • Mitochondrial Invertebrates

    • Mitochondrial Protozoan

    • Mitochondrial Pterobranchia

    • Mitochondrial Scenedesmus obliquus

    • Mitochondrial Thraustochytrium

    • Mitochondrial Trematode

    • Mitochondrial Vertebrates

    • Mitochondrial Yeast

    • Pachysolen tannophilus

    • Peritrich

    • SR1 Gracilibacteria

    • Tetrahymena

  • Minimum Protein Length: Minimum protein length to retain coding regions.

  • Strand Specific: When this option is enabled, only the top strand is analyzed.

  • Provide Gene-Transcript relation: Provide a tab-delimited file with information to map from transcript (isoform) IDs to gene IDs. Each line should be of the form: Gene ID[tab]Transcript ID.

  • Pfam Search: Identify ORFs with homology to known proteins via Pfam searches. Searching Pfam identifies common protein domains, whose presence is then used as an additional ORF retention criterion. Note that this option significantly increases the execution time.

  • Retain Long Orfs Mode: Select the strategy for retaining long ORFs. In dynamic mode, the length threshold is set according to a 1% FDR in random sequence of the same GC content. In strict mode, all ORFs equal to or longer than the Retain Long Orfs Length are kept, even if no other evidence marks them as coding.

  • Retain Long Orfs Length: Select the minimum length to retain ORFs under the strict mode.

  • Single Best Only: Retain only the single best ORF per transcript (prioritized by homology, then ORF length).

  • No Refine Starts: By default, Predict Coding Regions identifies potential start codons for 5' partial ORFs using a PWM (position weight matrix). Check this option to disable this step.

  • Top Longest ORF for Training: Number of longest ORFs used to train the Markov model (hexamer statistics). The default value is 500. Note that 10 times this number of ORFs are selected first, redundancies are removed, and this number of longest ORFs is then taken from the non-redundant set.

Figure 1: Configuration Wizard Page
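The Gene-Transcript relation file described above has one `Gene ID[tab]Transcript ID` pair per line. Before running the analysis it can be useful to check that a file follows this layout. The sketch below is one way to do so; the function name `read_gene_transcript_map` and the example identifiers are hypothetical, and the tool itself may validate the file differently.

```python
import csv
import io


def read_gene_transcript_map(handle):
    """Read 'Gene ID<TAB>Transcript ID' lines into {transcript_id: gene_id}.

    Blank lines are skipped; any other line that does not have exactly
    two tab-separated fields raises ValueError with its line number.
    """
    mapping = {}
    reader = csv.reader(handle, delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if not row or all(not field.strip() for field in row):
            continue  # skip empty lines
        if len(row) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(row)}"
            )
        gene_id, transcript_id = (field.strip() for field in row)
        mapping[transcript_id] = gene_id
    return mapping


# Hypothetical example content; real identifiers depend on the assembler.
example = io.StringIO("geneA\ttx1\ngeneA\ttx2\ngeneB\ttx3\n")
print(read_gene_transcript_map(example))
```

For a file on disk, pass an open handle instead, for example `open(path, newline="")`.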

Results

Once finished, results are returned in three projects:

  • Protein sequences: A sequence table that contains peptide sequences for the final candidate ORFs.

  • CDS sequences: A sequence table that contains nucleotide sequences for coding regions of the final candidate ORFs.

  • ORFs Coordinates: A GFF project that contains positions within the target transcripts of the final selected ORFs.

Note that in both sequence projects, CDSs and proteins, the description field contains details about the predicted ORF. This description includes:

  • The protein identifier, composed of the original transcript identifier followed by '|m.(number)'.

  • The type attribute indicates whether the protein is:

    • Complete: Contains a start and a stop codon.

    • 5' partial: It is missing a start codon and presumably part of the N-terminus.

    • 3' partial: It is missing the stop codon and presumably part of the C-terminus.

    • Internal: It is both 5' and 3' partial.

  • A (+) or (-) indicator showing the strand on which the coding region was found, along with the coordinates of the ORF in that transcript sequence.
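When the sequence tables are exported, these description fields can be parsed for downstream filtering. The sketch below assumes a description laid out like `comp10_c0_seq1|m.1 type:complete len:123 (+) comp10_c0_seq1:2-370(+)`. This exact layout, the `type:` token and the example identifiers are assumptions made for illustration, so check them against your own output before relying on the patterns.

```python
import re

# Assumed layout: "... type:<type> ... <transcript>:<start>-<end>(<strand>)"
TYPE_RE = re.compile(r"type:(\S+)")
COORD_RE = re.compile(r"(\S+):(\d+)-(\d+)\(([+-])\)")


def parse_orf_description(description):
    """Extract ORF type, transcript ID, coordinates and strand from a description."""
    type_match = TYPE_RE.search(description)
    coord_match = COORD_RE.search(description)
    if type_match is None or coord_match is None:
        raise ValueError(f"unrecognized description: {description!r}")
    transcript, start, end, strand = coord_match.groups()
    return {
        "type": type_match.group(1),
        "transcript": transcript,
        "start": int(start),
        "end": int(end),
        "strand": strand,
    }


# Hypothetical description string following the assumed layout.
example = "comp10_c0_seq1|m.1 type:complete len:123 (+) comp10_c0_seq1:2-370(+)"
print(parse_orf_description(example))
```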

In addition, a result page shows a summary of the "Predict Coding Regions" results (Figure 2). This page gives a quick overview of the results and provides ID lists with the transcript identifiers assigned to each category.

Figure 2: Predict Coding Regions Report

Furthermore, the Predict Coding Regions Summary chart (Figure 3) shows the percentage of ORFs that have been predicted as Complete, 5' Partial, 3' Partial and Internal.

Figure 3: Predict Coding Regions Summary
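The percentages shown in the summary chart can be reproduced from a list of ORF types, for example when comparing several runs outside the application. This is a minimal sketch: the function name `orf_type_percentages` and the type labels in the example are assumptions, not output of the tool.

```python
from collections import Counter


def orf_type_percentages(orf_types):
    """Return {orf_type: percentage of all ORFs}; empty dict for empty input."""
    counts = Counter(orf_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {orf_type: 100.0 * n / total for orf_type, n in counts.items()}


# Hypothetical type labels for four predicted ORFs.
example_types = ["complete", "complete", "internal", "5prime_partial"]
for orf_type, pct in sorted(orf_type_percentages(example_types).items()):
    print(f"{orf_type}: {pct:.1f}%")
```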