Eukaryotic Gene Finding by Augustus


Augustus is a program that predicts genes in eukaryotic genomic sequences. It is one of the most accurate programs for the species it has been trained on. In the human ENCODE project, it proved to be the most accurate gene finder among the tested 'ab initio' programs. In the more recent nGASP (worm) project, it was again among the best in the 'ab initio' and transcript-based categories.

The accuracy of Augustus relies on its precomputed species models, which enable fast and accurate gene prediction.

Run Eukaryotic Gene Finding

To speed up the gene finding process, the FASTA file is split by sequence, i.e., each FASTA entry is sent to a different node for parallel execution (figure 1, figure 2 and figure 3).
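The per-sequence split described above can be pictured with a minimal sketch. The function name `split_fasta` and the example input are hypothetical; this is not how OmicsBox implements the split, only an illustration of turning a multiple FASTA into one unit of work per entry.

```python
from typing import Iterator, Tuple


def split_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield one (header, sequence) pair per FASTA entry."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-entry input; each pair could be sent to its own node.
example = ">chr1 test\nACGT\nAC\n>chr2\nGGTT\n"
entries = list(split_fasta(example))
```

Each yielded pair is independent of the others, which is what makes the parallel execution possible.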

Input Page

  • Input FASTA: The query file contains the DNA input sequence, which must be in uncompressed (multiple or single) FASTA format. Every letter other than a, c, g, t, A, C, G, and T is interpreted as an unknown base. Digits and white spaces are ignored. The number of characters per line is not restricted.

    Note: FASTA identifiers must differ within their first 30 characters for Augustus to recognize them as unique.
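The input rules above can be checked before submitting a file. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, 'N' is used here as the symbol for an unknown base, and non-letter characters other than digits and white space are also treated as unknown, which the text above does not specify.

```python
from collections import Counter

ID_PREFIX_LENGTH = 30  # identifiers must be unique within this many characters


def normalise_sequence(raw: str) -> str:
    """Drop digits and white space; map anything other than a/c/g/t to 'N'."""
    out = []
    for ch in raw:
        if ch.isdigit() or ch.isspace():
            continue  # ignored, as described for the input format
        # Assumption: 'N' stands for an unknown base in this sketch.
        out.append(ch if ch in "acgtACGT" else "N")
    return "".join(out)


def ambiguous_prefixes(identifiers):
    """Return the 30-character prefixes shared by more than one identifier."""
    counts = Counter(ident[:ID_PREFIX_LENGTH] for ident in identifiers)
    return sorted(prefix for prefix, n in counts.items() if n > 1)
```

An empty list from `ambiguous_prefixes` means every identifier is distinguishable within its first 30 characters.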

Figure 1: Augustus Input Data Page 

Configuration 1 Page

  • Closest species: Select the organism most closely related to your query from this list to obtain the most accurate prediction.

List of available species:

  • Acyrthosiphon pisum [Metazoa - Arthropoda - Insecta]
  • Aedes aegypti [Metazoa - Arthropoda - Insecta]
  • Amphimedon queenslandica [Metazoa - Porifera - Demospongiae]
  • Ancylostoma ceylanicum [Metazoa - Nematoda - Chromadorea]
  • Apis [Metazoa - Arthropoda - Insecta]
  • Apis dorsata [Metazoa - Arthropoda - Insecta]
  • Arabidopsis thaliana [Plantae - Streptophyta - Magnoliophyta]
  • Aspergillus fumigatus [Fungi - Ascomycota - Eurotiomycetes]
  • Aspergillus nidulans [Fungi - Ascomycota - Eurotiomycetes]
  • Aspergillus oryzae [Fungi - Ascomycota - Eurotiomycetes]
  • Aspergillus terreus [Fungi - Ascomycota - Eurotiomycetes]
  • Bombus impatiens [Metazoa - Arthropoda - Insecta]
  • Bombus terrestris [Metazoa - Arthropoda - Insecta]
  • Botrytis cinerea [Fungi - Ascomycota - Leotiomycetes]
  • Brugia malayi [Metazoa - Nematoda - Chromadorea]
  • Caenorhabditis elegans [Metazoa - Nematoda - Chromadorea]
  • Callorhinchus milii [Metazoa - Chordata - Chondrichthyes]
  • Camponotus floridanus [Metazoa - Arthropoda - Insecta]
  • Candida albicans [Fungi - Ascomycota - Saccharomycetes]
  • Candida guilliermondii [Fungi - Ascomycota - Saccharomycetes]
  • Candida tropicalis [Fungi - Ascomycota - Saccharomycetes]
  • Chaetomium globosum [Fungi - Ascomycota - Sordariomycetes]
  • Chlamydomonas reinhardtii [Plantae - Chlorophyta - Chlorophyceae]
  • Chlorella [Plantae - Chlorophyta - Trebouxiophyceae]
  • Coccidioides immitis [Fungi - Ascomycota - Eurotiomycetes]
  • Conidiobolus coronatus [Fungi - Entomophthoromycota - Entomophthoromycetes]
  • Coprinus [Fungi - Basidiomycota - Agaricomycetes]
  • Coprinopsis cinerea [Fungi - Basidiomycota - Agaricomycetes]
  • Cryptococcus [Fungi - Basidiomycota - Tremellomycetes]
  • Cryptococcus neoformans gattii [Fungi - Basidiomycota - Tremellomycetes]
  • Cryptococcus neoformans neoformans B [Fungi - Basidiomycota - Tremellomycetes]
  • Cryptococcus neoformans neoformans JEC21 [Fungi - Basidiomycota - Tremellomycetes]
  • Culex [Metazoa - Arthropoda - Insecta]
  • Danio rerio [Metazoa - Chordata - Actinopterygii]
  • Debaryomyces hansenii [Fungi - Ascomycota - Saccharomycetes]
  • Drosophila melanogaster [Metazoa - Arthropoda - Insecta]
  • Encephalitozoon cuniculi GB-M1 [Fungi - Microsporidia - Microsporea]
  • Eremothecium gossypii [Fungi - Ascomycota - Saccharomycetes]
  • Escherichia coli [Bacteria - Proteobacteria - Gammaproteobacteria]
  • Fusarium [Fungi - Ascomycota - Sordariomycetes]
  • Fusarium graminearum/Gibberella zeae [Fungi - Ascomycota - Sordariomycetes]
  • Galdieria sulphuraria [Plantae - Rhodophyta - Cyanidiophyceae]
  • Gallus gallus [Metazoa - Chordata - Aves]
  • Heliconius melpomene [Metazoa - Arthropoda - Insecta]
  • Histoplasma capsulatum [Fungi - Ascomycota - Eurotiomycetes]
  • Homo sapiens [Metazoa - Chordata - Mammalia]
  • Kluyveromyces lactis [Fungi - Ascomycota - Saccharomycetes]
  • Laccaria bicolor [Fungi - Basidiomycota - Agaricomycetes]
  • Leishmania tarentolae [Euglenozoa - Kinetoplastida - Trypanosomatidae]
  • Lodderomyces elongisporus [Fungi - Ascomycota - Saccharomycetes]
  • Magnaporthe grisea [Fungi - Ascomycota - Sordariomycetes]
  • Nasonia vitripennis [Metazoa - Arthropoda - Insecta]
  • Neurospora [Fungi - Ascomycota - Sordariomycetes]
  • Neurospora crassa [Fungi - Ascomycota - Sordariomycetes]
  • Oryza sativa [Plantae - Magnoliophyta - Liliopsida]
  • Parasteatoda [Metazoa - Arthropoda - Arachnida]
  • Petromyzontiformes [Metazoa - Chordata - Cyclostomata]
  • Phanerochaete chrysosporium [Fungi - Basidiomycota - Agaricomycetes]
  • Plasmodium falciparum [Alveolata - Apicomplexa - Aconoidasida]
  • Pichia stipitis [Fungi - Ascomycota - Saccharomycetes]
  • Pneumocystis jirovecii [Fungi - Ascomycota - Pneumocystidomycetes]
  • Rhizopus oryzae [Fungi - Mucorales - Void]
  • Rhodnius [Metazoa - Arthropoda - Insecta]
  • Saccharomyces cerevisiae RM11-1a [Fungi - Ascomycota - Saccharomycetes]
  • Saccharomyces cerevisiae S288C [Fungi - Ascomycota - Saccharomycetes]
  • Schistosoma mansoni [Metazoa - Platyhelminthes - Trematoda]
  • Schizosaccharomyces pombe [Fungi - Ascomycota - Schizosaccharomycetes]
  • Solanum lycopersicum [Plantae - Streptophyta - Pentapetalae]
  • Staphylococcus aureus [Bacteria - Firmicutes - Cocci]
  • Tetrahymena [Alveolata - Ciliophora - Oligohymenophorea]
  • Theobroma cacao [Plantae - Streptophyta - Pentapetalae]
  • Thermoanaerobacter tengcongensis/Caldanaerobacter subterraneus subsp. tengcongensis [Bacteria - Firmicutes - Clostridia]
  • Toxoplasma gondii [Alveolata - Apicomplexa - Coccidia]
  • Tribolium castaneum [Metazoa - Arthropoda - Insecta]
  • Trichinella spiralis [Metazoa - Nematoda - Enoplea]
  • Triticum aestivum [Plantae - Streptophyta - Liliopsida]
  • Ustilago [Fungi - Basidiomycota - Ustilaginomycetes]
  • Ustilago maydis [Fungi - Basidiomycota - Ustilaginomycetes]
  • Verticillium albo atrum [Fungi - Ascomycota - Sordariomycetes]
  • Verticillium longisporum [Fungi - Ascomycota - Sordariomycetes]
  • Volvox carteri [Plantae - Chlorophyta - Chlorophyceae]
  • Xiphophorus maculatus [Metazoa - Chordata - Actinopterygii]
  • Yarrowia lipolytica [Fungi - Ascomycota - Saccharomycetes]
  • Zea mays [Plantae - Streptophyta - Liliopsida]

  • Strand: Choose the strand on which genes are searched: the forward strand, the backward strand, or both strands.
  • Type of gene: With this option, you can select the gene model.
    • partial: allows prediction of incomplete genes at the sequence boundaries (default)
    • intronless: predicts only single-exon genes like in prokaryotes and some eukaryotes
    • complete: predicts only complete genes
  • Output type: Specify whether the output sequences will be extracted as nucleotides or amino acids.
  • Protein length threshold: Set a minimum length of the predicted proteins. 
  • Allow in-frame stops: Activating this checkbox allows the prediction of genes that contain a stop codon within their reading frame, which makes it possible to detect fragmented genes with some undetected regions. This is normally the most suitable option for an 'ab initio' search.
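For readers who also use Augustus on the command line, the options above roughly correspond to parameters of the standalone program. The sketch below only builds an argument list; how OmicsBox actually invokes Augustus is not documented here, so the mapping, the function name `build_augustus_command`, and the file name are assumptions. The parameters `--species`, `--strand`, `--genemodel` and `--gff3` are taken from the standalone Augustus program.

```python
VALID_STRANDS = {"both", "forward", "backward"}
VALID_GENE_MODELS = {"partial", "intronless", "complete"}


def build_augustus_command(fasta_path, species, strand="both", gene_model="partial"):
    """Return an argument list for a standalone Augustus run (illustrative only)."""
    if strand not in VALID_STRANDS:
        raise ValueError(f"unknown strand option: {strand}")
    if gene_model not in VALID_GENE_MODELS:
        raise ValueError(f"unknown gene model: {gene_model}")
    return [
        "augustus",
        f"--species={species}",       # identifier of the trained species model
        f"--strand={strand}",
        f"--genemodel={gene_model}",
        "--gff3=on",                  # request GFF3 output
        fasta_path,
    ]


# Hypothetical call: human model, both strands, partial genes allowed.
command = build_augustus_command("query.fa", "human")
```

The list form can be passed to `subprocess.run` unchanged, which avoids shell quoting problems with file names.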

Figure 2: Augustus Configuration 1 Page 

Configuration 2 Wizard Page

The eukaryotic gene finding can be executed 'ab initio', using only the DNA sequence data, or using 'hints' obtained from an RNAseq alignment in order to increase the reliability of the predicted genes.

  • RNAseq alignment file: The file containing the alignments in BAM format. This file is the output of an RNAseq aligner such as TopHat, BWA or STAR. To be able to locate hints in the alignment file, it must not be filtered by any parameter; that is, it must be the same file that you obtain from the aligner. For this reason, the alignment files from Ensembl are not suitable for retrieving hints, as they are filtered and processed.
  • Qmap threshold: This parameter allows filtering the aligned reads that will be used to create the intron 'hints'. The Qmap corresponds to the mapping quality, in a range from 0 to 60, and it is calculated as Qmap = -10 x log10(P), where P is the probability that the read is mapped to the wrong position. This means that a Qmap of 50 corresponds to a mapping error probability of 1 x 10^-5. Default: 50.
  • Minimum read alignment: Specify the minimum length of the read that must map to the reference genome at the beginning of the intron. If this value is too small, the program may detect an intron that derives from a misalignment (figure 4).

    Note: The minimum for this value is 0 and the maximum depends on your read length. Default: 11.

  • Minimum intron length: Sets the minimum intron length (figure 4). Default: 32.
  • Minimum exon length: Sets the minimum exon length (figure 4). Default: 300.
  • Depth coverage: Sets the number of reads that must be aligned at a position to consider it as a consistent exon. Default: 20.
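The relation between Qmap and mapping error, and the way the thresholds above act as filters, can be illustrated with a short sketch. It assumes the Qmap follows the usual Phred-scaled definition of mapping quality in SAM/BAM files; the function names, the argument names, and the choice to treat each threshold as inclusive are hypothetical and not taken from OmicsBox.

```python
import math


def qmap_to_error_probability(qmap: float) -> float:
    """Convert a Phred-scaled mapping quality into an error probability."""
    return 10 ** (-qmap / 10)


def error_probability_to_qmap(p: float) -> float:
    """Inverse conversion: Qmap = -10 * log10(P)."""
    return -10 * math.log10(p)


def keep_intron_hint(qmap, anchor_length, intron_length,
                     qmap_threshold=50, min_read_alignment=11, min_intron_length=32):
    """Decide whether a spliced read may support an intron hint.

    Defaults mirror the defaults listed above; inclusive comparisons
    are an assumption of this sketch.
    """
    return (qmap >= qmap_threshold
            and anchor_length >= min_read_alignment
            and intron_length >= min_intron_length)
```

With the default threshold of 50, only reads whose estimated mapping error is at most one in 100,000 would contribute to intron hints.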

Figure 3: Augustus Configuration 2 Page

Figure 4: The concept of minimum read alignment and minimum intron length


Two result tables will automatically be opened:

  • Sequence table:
    Here you can see the traditional OmicsBox table showing the sequence name, which corresponds to the FASTA ID line plus a gene identifier, and the sequence length.

    Note: These sequences can be in nucleotides or in amino acids, depending on the wizard selection.

  • GFF3 table columns:
    • Sequence: The name of the source sequence to which this feature belongs.
    • Source: The name of the program that predicted this feature, in this case `Augustus'.
    • Type: The type of the feature, which can be `gene', `mRNA', `CDS', `exon', `start' or `stop'.
    • Start: The start coordinate of the feature.
    • End: The end coordinate of the feature.
    • Score: The score assigned to the feature; exons do not receive a score.
    • Strand: The strand of the feature, where `+' means that the feature is forward oriented and `-' that it is backward oriented.
    • Phase: The correct frame to translate this feature; the values can be `0', `1' or `2'. The features of a single gene can have different phase values, because an intron can shift the frame.
    • Attributes: All the attributes assigned to each feature. The attributes are `ID', which assigns an identifier to each feature, and `Parent', which is present on the CDS and exon features and indicates the feature to which they belong (referring to it by its ID).
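The nine columns described above can be read programmatically from an exported GFF3 file. The sketch below parses a single tab-separated line; the class name `Feature`, the function name, and the example line are hypothetical and only illustrate the column layout.

```python
from typing import Dict, NamedTuple, Optional


class Feature(NamedTuple):
    sequence: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Feature:
    """Split one GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    # Attributes are ';'-separated key=value pairs, e.g. ID=...;Parent=...
    attributes = dict(
        item.strip().split("=", 1) for item in cols[8].split(";") if "=" in item
    )
    return Feature(
        sequence=cols[0],
        source=cols[1],
        type=cols[2],
        start=int(cols[3]),
        end=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),  # '.' means no score
        strand=cols[6],
        phase=None if cols[7] == "." else int(cols[7]),    # '.' means no phase
        attributes=attributes,
    )


# Hypothetical CDS line, for illustration only.
feature = parse_gff3_line("seq1\tAugustus\tCDS\t100\t250\t0.98\t+\t0\tID=g1.t1.cds;Parent=g1.t1")
```

Following the `Parent` attribute from a CDS or exon back to its transcript is how the features of one gene are grouped.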

The resulting GFF3 can be inspected with the Genome Browser. To display a GFF entry, right-click on it and select the Show in the Genome Browser option (figure 5). For more information about this feature, see the Genome Browser section.

Figure 5: How to open the Genome Browser   

A Result Viewer is also opened to display the number and name of sequences per split file, the average number of exons, the minimum, maximum and average gene length, and the number of genes per strand (figure 6).

Figure 6: Result Summary
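The figures shown in the result summary can be recomputed from the GFF3 table. The sketch below is one possible way to do so, not the OmicsBox implementation: the function name and the row layout `(type, start, end, strand)` are hypothetical, gene length is taken as end - start + 1, and the average number of exons is approximated as exon rows divided by gene rows.

```python
from collections import Counter
from statistics import mean


def summarise_genes(rows):
    """Summarise (type, start, end, strand) rows taken from a GFF3 table."""
    rows = list(rows)
    genes = [(start, end, strand) for ftype, start, end, strand in rows if ftype == "gene"]
    if not genes:
        raise ValueError("no gene features found")
    lengths = [end - start + 1 for start, end, _ in genes]  # inclusive coordinates
    exon_count = sum(1 for ftype, *_ in rows if ftype == "exon")
    return {
        "genes": len(genes),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        # Assumption: one transcript per gene, so exons / genes is meaningful.
        "mean_exons": exon_count / len(genes),
        "genes_per_strand": dict(Counter(strand for _, _, strand in genes)),
    }


# Hypothetical rows: two genes, three exons in total.
summary = summarise_genes([
    ("gene", 1, 100, "+"),
    ("exon", 1, 40, "+"),
    ("exon", 61, 100, "+"),
    ("gene", 201, 500, "-"),
    ("exon", 201, 500, "-"),
])
```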