Eukaryotic Gene Finding by Augustus
Augustus is a program that predicts genes in eukaryotic genomic sequences. It is one of the most accurate programs for the species it is trained for. In the human ENCODE project, it proved to be the most accurate gene finder among the tested 'ab initio' programs. In the more recent nGASP (worm) project, it was again among the best in the 'ab initio' and transcript-based categories.
The accuracy of Augustus relies on its precomputed models, which facilitate fast and accurate gene prediction.
Run Eukaryotic Gene Finding
Input FASTA: The query file contains the DNA input sequence, which must be in uncompressed (multiple or single) FASTA format. Every letter other than a, c, g, t, A, C, G, and T is interpreted as an unknown base. Digits and whitespace are ignored. The number of characters per line is not restricted.
Note: The differences that make the FASTA identifiers unique must be within the first 30 characters to be recognized by Augustus.
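The input rules above can be checked before launching the wizard. The following is a minimal sketch, not part of Augustus or OmicsBox: the function names `normalize_sequence` and `ambiguous_identifiers` are hypothetical, and representing an unknown base as `N` is an assumption made only for illustration.

```python
def normalize_sequence(raw: str) -> str:
    """Apply the input rules described above to a raw sequence string.

    Digits and whitespace are dropped. Letters a, c, g, t (either case) are
    kept. Any other character is written as 'N' (an assumption: the rules
    above only say that other letters count as unknown bases).
    """
    cleaned = []
    for ch in raw:
        if ch.isdigit() or ch.isspace():
            continue
        cleaned.append(ch if ch in "acgtACGT" else "N")
    return "".join(cleaned)


def ambiguous_identifiers(headers, prefix_len=30):
    """Return groups of FASTA headers that share their first `prefix_len` characters."""
    groups = {}
    for header in headers:
        key = header.lstrip(">")[:prefix_len]
        groups.setdefault(key, []).append(header)
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    headers = [">" + "x" * 30 + "_1", ">" + "x" * 30 + "_2", ">seqB"]
    # The first two headers differ only after character 30, so they are reported.
    print(ambiguous_identifiers(headers))
```

An empty list from `ambiguous_identifiers` suggests that every identifier is distinguishable within its first 30 characters.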
Figure 1: Augustus Input Data Page
Configuration 1 Page
- Closest species: This list allows you to select the organism most closely related to your query, in order to obtain the most accurate prediction.
- Strand: Here you can choose the direction of the gene search, obtaining the predicted genes on the forward strand, the reverse strand or both strands.
- Type of gene: With this option, you can select the gene model.
- partial: allows prediction of incomplete genes at the sequence boundaries (default)
- intronless: predicts only single-exon genes, as in prokaryotes and some eukaryotes
- complete: predicts only complete genes
- Output type: Specify whether the output sequences will be extracted as nucleotides or amino acids.
- Protein length threshold: Set a minimum length of the predicted proteins.
- Allow in-frame stops: Activating this checkbox allows the detection of genes that contain a stop codon within their reading frame, which makes it possible to detect fragmented genes with some undetected regions; it is normally the most suitable option for an 'ab initio' search.
Figure 2: Augustus Configuration 1 Page
Configuration 2 Wizard Page
Eukaryotic gene finding can be executed 'ab initio', using only the DNA sequence data, or using 'hints' obtained from an RNAseq alignment in order to increase the reliability of the predicted genes.
- RNAseq alignment file: The file containing the alignments in BAM format. This file is the output of RNAseq aligner programs such as TopHat, BWA or STAR. To be able to locate hints in the alignment file, it must not be filtered by any parameter; that is, it must be the same file that you obtain from the aligner. For this reason, the alignment files from Ensembl are not suitable for retrieving hints, as they are filtered and processed.
- Qmap threshold: This parameter filters the aligned reads that will be used to create the intron 'hints'. The Qmap corresponds to the mapping quality, in a range from 0 to 60, and is calculated as $Q_{map} = -10 \log_{10}(P_{error})$, where $P_{error}$ is the probability that the read is mapped incorrectly. This means that a Qmap of 50 corresponds to a mapping error probability of $1 \times 10^{-5}$. Default: 50.
- Minimum read alignment: Specify the minimum length of the read that must map to the reference genome at the beginning of the intron. If this value is too small, it can lead the program to detect an intron derived from a misalignment (figure 4).
Note: The minimum for this value is 0 and the maximum depends on your read length. Default: 11.
- Minimum intron length: Sets the minimum intron length (figure 4). Default: 32.
- Minimum exon length: Sets the minimum exon length (figure 4). Default: 300.
- Depth coverage: Sets the number of reads that must be aligned at a position to consider it as a consistent exon. Default: 20.
Figure 3: Augustus Configuration 2 Page
Figure 4: The concept of minimum read alignment and minimum intron length
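To get a feeling for the Qmap threshold, it can help to convert between mapping quality and error probability. The sketch below assumes that Qmap follows the standard Phred-scaled mapping quality used in SAM/BAM files, Qmap = -10 log10(P_error); the function names are hypothetical and are not part of Augustus or OmicsBox.

```python
import math


def qmap_to_error_probability(qmap: float) -> float:
    """Probability that a read is mapped incorrectly, given its Qmap.

    Assumes the Phred-scaled definition Qmap = -10 * log10(P_error).
    """
    return 10 ** (-qmap / 10)


def error_probability_to_qmap(p_error: float) -> float:
    """Inverse conversion: Qmap for a given mapping error probability."""
    return -10 * math.log10(p_error)


def passes_threshold(qmap: float, threshold: float = 50) -> bool:
    """True if a read with this Qmap would be kept at the given threshold."""
    return qmap >= threshold


if __name__ == "__main__":
    for q in (10, 30, 50, 60):
        print(q, qmap_to_error_probability(q), passes_threshold(q))
```

Under this assumption, lowering the threshold by 10 admits reads whose mapping error probability is ten times higher.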
Two result tables will automatically be opened:
Here you can see the traditional OmicsBox table showing the sequence name, which corresponds to the FASTA ID line plus a gene identifier, and the sequence length.
Note: These sequences can be in nucleotides or in amino acids, depending on the wizard selection.
- GFF3 table columns:
- Sequence: The name of the source sequence to which this feature belongs.
- Source: The name of the program that predicted this feature, in this case 'Augustus'.
- Type: The type of the feature, which can be 'gene', 'mRNA', 'CDS', 'exon', 'Start' or 'stop'.
- Start: The start coordinate of the feature on the source sequence.
- End: The end coordinate of the feature on the source sequence.
- Score: The score assigned to the feature; exons have no score.
- Strand: The strand of the feature, where '+' means that the feature is forward oriented and '-' that it is reverse oriented.
- Phase: The correct frame to translate this feature; the values can be '0', '1' or '2'. The features of one gene can have different phase values, due to a frame shift in an intron.
- Attributes: All the attributes assigned to each feature. These are 'ID', which assigns an identifier to each feature, and 'parent', which is present on the CDS and exon features and indicates the feature to which they belong (referring to it by its ID).
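The columns listed above map directly onto the nine tab-separated fields of a GFF3 line. As an illustration, here is a minimal parsing sketch; the class `Gff3Feature`, the function `parse_gff3_line` and the example line are hypothetical and are not taken from real Augustus output.

```python
from typing import Dict, NamedTuple


class Gff3Feature(NamedTuple):
    sequence: str
    source: str
    type: str
    start: int
    end: int
    score: str   # kept as text because the column may hold '.' (no score)
    strand: str
    phase: str   # kept as text because the column may hold '.' (no phase)
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Split one GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes = {}
    # The attributes column is a ';'-separated list of key=value pairs.
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return Gff3Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       cols[5], cols[6], cols[7], attributes)


if __name__ == "__main__":
    # Made-up example line, for illustration only.
    example = "contig1\tAugustus\tCDS\t120\t480\t0.95\t+\t0\tID=g1.t1.cds;Parent=g1.t1"
    feature = parse_gff3_line(example)
    print(feature.type, feature.start, feature.end, feature.attributes["Parent"])
```

Score and phase are left as strings on purpose, so that features without a value in those columns do not cause a conversion error.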
The resulting GFF3 can be inspected using the Genome Browser. To display a GFF entry, right-click on it and select the Show in the Genome Browser option (figure 5). For more information about this feature, see the Genome Browser section.
Figure 5: How to open the Genome Browser
A Result Viewer is also opened to display the number and names of sequences per split file, the average number of exons, the minimum, maximum and average gene length, and the number of genes per strand (figure 6).
Figure 6: Result Summary
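The summary values can also be recomputed from the predicted genes, for example to cross-check an exported result. This is a rough sketch under stated assumptions: the input structure (a list of dictionaries with `start`, `end`, `strand` and `exon_count` keys) and the function `summarize_genes` are hypothetical, and gene length is taken as `end - start + 1`, which may differ from how the Result Viewer computes it.

```python
from collections import Counter
from statistics import mean


def summarize_genes(genes):
    """Compute summary statistics for a non-empty list of gene records.

    Each record is a dict with 'start', 'end', 'strand' and 'exon_count'.
    Coordinates are assumed to be 1-based and inclusive.
    """
    lengths = [g["end"] - g["start"] + 1 for g in genes]
    return {
        "min_gene_length": min(lengths),
        "max_gene_length": max(lengths),
        "avg_gene_length": mean(lengths),
        "avg_exons": mean([g["exon_count"] for g in genes]),
        "genes_per_strand": dict(Counter(g["strand"] for g in genes)),
    }


if __name__ == "__main__":
    # Made-up records, for illustration only.
    genes = [
        {"start": 1, "end": 100, "strand": "+", "exon_count": 2},
        {"start": 201, "end": 500, "strand": "-", "exon_count": 4},
    ]
    print(summarize_genes(genes))
```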