Eukaryotic Gene Finding by AUGUSTUS
The Eukaryotic Gene Finding functionality predicts gene structures in genomic sequences such as genomes, chromosomes, or scaffolds. It is based on AUGUSTUS, a program designed to predict genes in genomic sequences, especially those of eukaryotic organisms, and one of the most accurate programs for the species on which it has been trained.
AUGUSTUS can be used as an ab initio program, which means it bases its prediction purely on the sequence. It includes pre-trained models for over 100 species. AUGUSTUS may also incorporate hints on the gene structure coming from extrinsic sources such as RNA-Seq, protein, EST/cDNA, and IsoSeq data. Hints are extrinsic evidence about the location and structure of genes; each hint is local information associated with a particular genome region. When hints are supplied, they change the likelihood of gene structure candidates, so AUGUSTUS tends to predict gene structures that agree with the hints.
Escherichia coli K-12
Alveolata & Protozoan
Pichia stipitis (Scheffersomyces stipitis)
Saccharomyces cerevisiae (RM11-1a_1)
Saccharomyces cerevisiae (S288C)
Nematoda & Nemertea (Roundworms & Ribbon worms)
Arthropoda (Insecta & Arachnida)
Chordata (Fish, Bird & Mammal)
Cnidaria & Ctenophora (Jellyfish & Anemone)
Echinodermata (Starfish & Sea Urchin)
Hemichordata & Mollusca (Acorn worm & Mollusk)
Placozoa (Marine free-living organism)
Ostreococcus sp. 'lucimarinus'
RNA-Seq Hints
RNA-Seq alignments provide two types of features that are helpful for gene prediction:
Spliced alignments of reads give information about introns.
Coverage (i.e., how many reads are aligned to a particular position in the genome) gives information about exons.
The integration of coverage (exon part) information is not trivial. The problem is that coverage may be high not only in CDS regions but also in UTRs and in partially retained introns. If the selected species does not have UTR parameters (see the UTR Prediction parameter below), RNA-Seq hints are not recommended.
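To make the notion of coverage concrete, the following minimal sketch counts how many aligned blocks overlap each position of a region. It is only an illustration and not the procedure used by the pipeline; the function name `coverage` and the 0-based, end-exclusive coordinates are assumptions made for this example. A spliced read would contribute one block per aligned segment.

```python
from typing import Iterable, List, Tuple


def coverage(region_length: int, blocks: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the read depth at every position of a region.

    `blocks` are aligned segments given as (start, end) pairs,
    0-based and end-exclusive (an assumption of this sketch).
    """
    # Difference array: +1 where a block starts, -1 where it ends.
    diff = [0] * (region_length + 1)
    for start, end in blocks:
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # A running sum of the differences gives the depth per position.
    depth, running = [], 0
    for delta in diff[:region_length]:
        running += delta
        depth.append(running)
    return depth


if __name__ == "__main__":
    print(coverage(8, [(0, 4), (2, 6), (3, 5)]))  # [1, 1, 2, 3, 2, 1, 0, 0]
```

High depth alone does not tell whether a position is coding, which is why exon part hints are harder to use than intron hints.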
RNA-Seq data is required as sequencing reads in FASTA/FASTQ format. Reads are aligned to the genome using the STAR aligner software. If RNA-Seq hints are provided, please cite STAR as:
Dobin A, Davis CA, Schlesinger F, et al (2012). "STAR: ultrafast universal RNA-seq aligner." Bioinformatics, 29(1):15-21.
Protein Hints
Protein alignments can aid the prediction of CDSs (including the correct reading frame, start and stop codon positions) and the prediction of introns.
Protein data is required in FASTA format. Proteins are aligned to the genome using the GenomeThreader software. If protein hints are provided, please cite GenomeThreader as:
Gremme G, Brendel V, Sparks ME, Kurtz S (2005). "Engineering a software tool for gene structure prediction in higher organisms." Information and Software Technology, 47(15):965-978.
EST & cDNA Hints
ESTs (Expressed Sequence Tags) and cDNAs are suitable for generating intron, exon part, and exon hints.
EST and cDNA sequences are required in FASTA format. ESTs and cDNAs are aligned to the genome using BLAT and pslCDnaFilter. If EST/cDNA hints are provided, please cite BLAT as:
Kent WJ (2002). "BLAT--the BLAST-like alignment tool." Genome Res., 12(4):656-64.
IsoSeq Hints
Single-molecule Pacific Biosciences (PacBio) RNA-Seq reads can improve the identification of new isoforms. Circular Consensus Sequences (CCS) from IsoSeq often constitute near-full-length transcripts.
IsoSeq sequences are required in FASTA format. IsoSeq sequences are aligned to the genome using GMAP. If IsoSeq hints are provided, please cite GMAP as:
Wu TD, Watanabe CK (2005). "GMAP: a genomic mapping and alignment program for mRNA and EST sequences." Bioinformatics, 21(9):1859-75.
Please cite AUGUSTUS as:
Hoff KJ. and Stanke M. (2019). Predicting Genes in Single Genomes with AUGUSTUS. Current protocols in bioinformatics, 65(1), e57.
Input Sequences: Select the file containing the input DNA sequences. This application expects genomic sequences (e.g. genome, chromosomes, scaffolds…). Sequences must be in FASTA or multi-FASTA format. Every letter other than A, C, G, and T is interpreted as an unknown base.
If repeat masked sequences are provided, masked regions must be indicated in lowercase (soft masking).
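Before loading a file, it can be useful to check how many bases would be read as unknown and how much of each sequence is soft-masked. The sketch below is a stand-alone illustration under stated assumptions: the helper names `read_fasta` and `summarize` are hypothetical, and lowercase letters are taken to mean soft-masked bases, as described above.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA or multi-FASTA stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize(seq: str) -> Dict[str, int]:
    """Count unknown bases (anything but A, C, G, T) and lowercase bases."""
    known = sum(seq.upper().count(base) for base in "ACGT")
    return {
        "length": len(seq),
        "unknown": len(seq) - known,
        "soft_masked": sum(1 for char in seq if char.islower()),
    }


if __name__ == "__main__":
    demo = io.StringIO(">scaffold_1 demo\nACGTacgtNNRY\nacgT\n")
    for name, seq in read_fasta(demo):
        print(name, summarize(seq))
```

In the demo record, N, R, and Y are counted as unknown bases, and the seven lowercase letters are counted as soft-masked.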
Figure 1: Input Page
Closest Species: AUGUSTUS has been trained for predicting genes in the following species. The closest species to the query should be selected. Each option shows the scientific name of the species and the kingdom, phylum, and class to which it belongs (if this information is available). Enter any of these taxonomic terms (e.g. a class) to filter the list and find all related species (e.g. if “Fungi” is entered, all species of the Fungi kingdom are displayed).
Strand: Report predicted genes on both strands, only the forward strand, or only the reverse strand.
Ignore Strand Conflicts: Predict genes independently on each strand and allow overlapping genes on opposite strands.
This option is not available for prokaryotic species (archaea and bacteria).
Allowed Gene Structure: Restrict the search to one of these gene models:
Partial: Allow prediction of incomplete genes at the sequence boundaries. This option is recommended.
Intronless: Only predict single-exon genes, as in prokaryotes and some eukaryotes.
Complete: Only predict complete genes.
At Least One: Predict at least one complete gene.
Exactly One: Predict exactly one complete gene.
Output Genomic Features: Specify which features should be reported: introns, start codons, and stop codons.
UTR Prediction: Predict the untranslated regions in addition to the coding sequence. UTR prediction is only supported in combination with the Partial and Complete gene structures. UTR prediction is not possible in combination with the Ignore Strand Conflicts option. This option currently works only for a subset of species.
If RNA-Seq hints are provided, this option is activated automatically (if possible), regardless of the user’s choice.
No In-frame Stop Codons: Do not report transcripts with in-frame stop codons. If this option is not checked, stop codons that span an intron could occur.
Stop Codons Excluded From CDS: By default, stop codons are included in CDSs, which is required by the GFF3 standard. Check this option to exclude stop codons from CDS.
Repeat Masked Sequences: If repeat masked genome sequences are provided, mark this option. Note that AUGUSTUS expects the soft-masked version of the genome (repeat fragments are represented in lowercase characters).
Repeats can severely disturb gene prediction. It is strongly recommended to mask genome sequences for gene prediction. This task can be done within OmicsBox: Repeat Masking.
Sample: AUGUSTUS reports the posterior probabilities of exons, introns, transcripts, and genes. The posterior probabilities are estimated using a sampling algorithm. This parameter adjusts the number of sampling iterations. The higher the value, the more accurate the estimate. The default is 100. If you do not need the posterior probabilities, set this parameter to 0.
Alternatives From Sampling: Report alternative transcripts generated through probabilistic sampling. If this option is checked, the following parameters can be adjusted.
Figure 2: General Configuration Page
Alternatives From Sampling Configuration
Min. Exon Intron Probability: Threshold between 0 and 1 to filter out transcripts with low exon and intron probabilities.
Min. Mean Exon Intron Probability: Threshold between 0 and 1 to filter out transcripts with low mean exon and intron probabilities.
Max. Tracks: Upper limit for the number of transcripts that span any given genome position.
Temperature: If the aim is to produce a diverse, sensitive (inclusive) set of gene structures, this parameter can be increased. The higher the temperature, the more alternatives are sampled. A value of 3 is a good compromise between achieving high sensitivity and not sampling too many exons in total.
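For readers who also run AUGUSTUS on the command line, the options described above correspond roughly to AUGUSTUS parameters. The sketch below is an illustration under assumptions, not a description of how this application invokes AUGUSTUS: the function name `augustus_args`, the species identifier, and the file name are hypothetical, and the option names should be checked against the AUGUSTUS documentation of the installed version.

```python
from typing import Dict, List, Optional

# The "reverse" strand is called "backward" on the AUGUSTUS command line.
STRAND = {"both": "both", "forward": "forward", "reverse": "backward"}
# Sampling options, emitted only when alternatives from sampling are requested.
ALT_OPTIONS = ("minexonintronprob", "minmeanexonintronprob", "maxtracks", "temperature")


def augustus_args(species: str, genome: str, *, strand: str = "both",
                  gene_model: str = "partial",
                  ignore_strand_conflicts: bool = False,
                  utr: bool = False, soft_masked: bool = False,
                  sample: int = 100,
                  alternatives: Optional[Dict[str, float]] = None) -> List[str]:
    """Build an argument list for AUGUSTUS (assumed option names)."""
    # Restrictions stated in the UTR Prediction description above.
    if utr and gene_model not in ("partial", "complete"):
        raise ValueError("UTR prediction needs the partial or complete gene model")
    if utr and ignore_strand_conflicts:
        raise ValueError("UTR prediction cannot be combined with ignoring strand conflicts")

    args = ["augustus", f"--species={species}", f"--strand={STRAND[strand]}",
            f"--genemodel={gene_model}", f"--sample={sample}"]
    if ignore_strand_conflicts:
        args.append("--singlestrand=true")
    if utr:
        args.append("--UTR=on")
    if soft_masked:
        args.append("--softmasking=1")
    if alternatives is not None:
        args.append("--alternatives-from-sampling=true")
        for option in ALT_OPTIONS:
            if option in alternatives:
                args.append(f"--{option}={alternatives[option]}")
    args.append(genome)
    return args


if __name__ == "__main__":
    print(" ".join(augustus_args("saccharomyces_cerevisiae_S288C", "genome.fa",
                                 soft_masked=True,
                                 alternatives={"temperature": 3, "maxtracks": 2})))
```

Building the arguments as a list keeps each option separate, so the result can be passed to a process launcher without shell quoting.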
Configuration: Gene Finding Mode
Gene Finding Mode: Choose the Gene Finding Mode.
Ab initio Prediction: The Ab initio mode relies only on the pre-computed trained models. It predicts genes using probabilistic models based on Hidden Markov Models.
Prediction Using Extrinsic Evidence: The Extrinsic Evidence mode uses experimental evidence to identify parts of gene structures, to uncover alternative splicing, or to improve overall annotation quality. If this option is selected, the Extrinsic Evidence Configuration section can be adjusted.
Extrinsic Evidence Data: The Extrinsic Evidence mode supports extrinsic evidence hints from:
RNA-Seq: Sequencing reads in FASTA or FASTQ format. If data is single-end, provide a single file as an RNA-Seq SE file. If data is paired-end, provide the upstream file as RNA-Seq SE/Upstream, and the downstream file as RNA-Seq Downstream.
Protein: Protein sequences in FASTA format.
EST/cDNA: EST or cDNA sequences in FASTA format.
IsoSeq: Single-molecule Pacific Biosciences (PacBio) reads in FASTA or FASTQ format.
One file of each type is supported.
Extrinsic Evidence Configuration
Minimum Intron Length: Define the minimum length of intron hints.
Maximum Intron Length: Define the maximum length of intron hints.
Allow Hinted Splice Sites (AT/AC): Allow the prediction of the (rare) introns that start with AT and end with AC, in addition to the GT-AG and GC-AG introns that are allowed by default.
Alternatives From Evidence: Report alternative transcripts when they are suggested by hints.
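The intron length limits and the splice site rule can be pictured as a simple filter on candidate introns. The sketch below is an illustration only and does not reproduce the internal logic of AUGUSTUS or of the alignment tools. The function name `accept_intron`, the inclusive length bounds, and the 0-based, end-exclusive forward-strand coordinates are assumptions of this example.

```python
# Donor/acceptor dinucleotide pairs, as listed in the option description.
DEFAULT_SITES = {("GT", "AG"), ("GC", "AG")}
RARE_SITES = {("AT", "AC")}


def accept_intron(seq: str, start: int, end: int,
                  min_len: int, max_len: int,
                  allow_atac: bool = False) -> bool:
    """Return True if the intron seq[start:end] passes length and splice site checks."""
    length = end - start
    if not (min_len <= length <= max_len):
        return False
    # First two and last two bases of the intron, case-insensitive
    # so that soft-masked sequence is handled too.
    sites = (seq[start:start + 2].upper(), seq[end - 2:end].upper())
    return sites in DEFAULT_SITES or (allow_atac and sites in RARE_SITES)


if __name__ == "__main__":
    # Two toy introns: GT...AG at 3-11 and AT...AC at 14-22.
    seq = "AAA" + "GTCCCCAG" + "TTT" + "ATCCCCAC" + "GG"
    print(accept_intron(seq, 3, 11, 4, 100))                     # True
    print(accept_intron(seq, 14, 22, 4, 100))                    # False
    print(accept_intron(seq, 14, 22, 4, 100, allow_atac=True))   # True
```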
Figure 3: Gene Finding Mode Page
The Eukaryotic Gene Finding process returns the results in three projects (Figure 4):
GFF Coordinates: This project contains the coordinates of the predicted genomic features in GFF format. It may contain genes, transcripts, introns, start codons, stop codons, and CDSs, depending on the “Output Genomic Features” selected when configuring the analysis (see the “Configuration: General” section).
CDS Sequences: A sequence table that contains the nucleotide sequences for coding regions of the predicted genes.
Protein Sequences: A sequence table that contains the protein sequences of the predicted genes.
In CDS and Protein projects, identifiers (SeqName) have the format "g1.t1". The "g1" indicates that the CDS / Protein comes from the "g1" gene. The "t1" indicates that the CDS / Protein comes from the "t1" transcript, which belongs to the "g1" gene. When the "Alternatives From Sampling" or "Alternatives From Evidence" options are enabled, more than one transcript (isoform) per gene can be reported. The additional isoforms are then called "g1.t2" and so on. The description column shows the genomic sequence to which each CDS or protein belongs.
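When exported sequences are processed outside the application, the identifier format makes it easy to group isoforms by gene. The following is a minimal sketch; the helper names `split_id` and `isoforms_per_gene` are hypothetical, and the pattern assumes identifiers of exactly the form shown above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Gene part "g<number>", transcript part "t<number>", joined by a dot.
SEQ_NAME = re.compile(r"^(g\d+)\.(t\d+)$")


def split_id(seq_name: str) -> Tuple[str, str]:
    """Split an identifier such as 'g1.t2' into ('g1', 't2')."""
    match = SEQ_NAME.match(seq_name)
    if match is None:
        raise ValueError(f"unexpected identifier: {seq_name!r}")
    return match.group(1), match.group(2)


def isoforms_per_gene(seq_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group transcript identifiers by the gene they belong to."""
    genes: Dict[str, List[str]] = defaultdict(list)
    for seq_name in seq_names:
        gene, transcript = split_id(seq_name)
        genes[gene].append(transcript)
    return dict(genes)


if __name__ == "__main__":
    print(isoforms_per_gene(["g1.t1", "g1.t2", "g2.t1"]))
    # {'g1': ['t1', 't2'], 'g2': ['t1']}
```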
The “coordinates” project follows the GFF format specification. It contains one line per predicted feature. The columns contain:
SeqID: Name of the chromosome or scaffold.
Source: Name of the program that generated this feature (AUGUSTUS).
Type: Feature type name (e.g. gene, transcript, intron, CDS…).
Note that CDS entries in the GFF define exon regions. The sequences contained in the CDS project contain all CDS entries for the corresponding gene/transcript.
Start: Start position of the feature.
End: End position of the feature.
Score: AUGUSTUS reports the posterior probabilities of exons, introns, transcripts, and genes. The reported probability of a gene is the probability that some coding sequence is in the reported range on the reported strand, regardless of the exact transcript. The posterior probabilities are estimated using a sampling algorithm.
Strand: Defined as + (forward) or - (reverse).
Phase: For CDS features, the number of bases (0, 1, or 2) that must be skipped from the start of the feature to reach the first base of the next codon.
Attributes: Provide additional information about the feature.
Attr.ID: Feature identifier.
Attr.Parent: Identifier of the parent feature.
Attr.HintSupport: Hint support percentage. It is the percentage of the feature that has been supported by the extrinsic evidence data provided.
The “Attr.HintSupport” column is only displayed when the Prediction Using Extrinsic Evidence mode is used.
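If the coordinates are exported as a GFF file, each line can be split into the columns described above. The sketch below assumes a GFF3-style line with nine tab-separated columns and `key=value` attributes separated by semicolons. The `Feature` class, the function name, and the example line are hypothetical and were written for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Feature:
    seqid: str
    source: str
    type: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]


def parse_gff_line(line: str) -> Feature:
    """Parse one feature line of a GFF3-style file."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF feature line must have 9 tab-separated columns")
    attributes = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
    return Feature(
        seqid=cols[0], source=cols[1], type=cols[2],
        start=int(cols[3]), end=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),   # "." means no score
        strand=cols[6],
        phase=None if cols[7] == "." else int(cols[7]),     # "." for non-CDS features
        attributes=attributes,
    )


if __name__ == "__main__":
    line = "scaffold_1\tAUGUSTUS\tCDS\t201\t350\t0.98\t+\t0\tID=g1.t1.cds;Parent=g1.t1"
    feature = parse_gff_line(line)
    # GFF coordinates are inclusive, so the length is end - start + 1.
    print(feature.type, feature.end - feature.start + 1, feature.attributes["Parent"])
```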
Figure 4: Eukaryotic Gene Finding Results
In addition to GFF and sequence projects, a result page will show a summary of the “Eukaryotic Gene Finding” results (Figure 5). This page provides information about the input data and the selected species, as well as a quick evaluation of the results obtained. If hint data was provided, an additional section is included, which summarizes the information obtained from the hint data.
Figure 5: Eukaryotic Gene Finding Report
Furthermore, different charts are generated for a global visualization of the results.
Length Distribution Chart
This chart shows the distribution of lengths of the predicted CDS sequences (Figure 6). Note that this distribution is computed from the sequences contained in the CDS project.
Figure 6: Length Distribution Chart
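A comparable distribution can be computed from exported CDS sequences by binning their lengths. This is a minimal sketch and not the charting code of the application; the function name `length_histogram` and the bin width of 300 bases are arbitrary assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def length_histogram(lengths: Iterable[int], bin_width: int = 300) -> Dict[int, int]:
    """Count sequences per length bin; keys are the lower bound of each bin."""
    counts = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    cds_lengths = [150, 299, 300, 910, 1200]
    print(length_histogram(cds_lengths))  # {0: 2, 300: 1, 900: 1, 1200: 1}
```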
Hint Support Distribution Chart
This chart shows the distribution of hint support (%) of the predicted CDS sequences (Figure 7). It is only available for the Prediction Using Extrinsic Evidence mode.
Figure 7: Hint Support Distribution Chart
Hint Type Distribution Chart
This chart shows the distribution of hint types that have been obtained from the extrinsic evidence data provided (Figure 8). A description of each hint type is included in the summary report. This chart is only available for the Prediction Using Extrinsic Evidence mode.
Figure 8: Hint Type Distribution Chart