Prokaryotic Gene Finding by Glimmer

Introduction

Glimmer (Gene Locator and Interpolated Markov ModelER) is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. Glimmer uses Interpolated Markov Models (IMMs) to identify coding regions and to distinguish them from non-coding DNA. Glimmer was the primary microbial gene finder used at The Institute for Genomic Research (TIGR), where it was first developed, and it has since been used to annotate the genomes of hundreds of bacterial and archaeal species from TIGR and other labs.

The precision of Glimmer lies in its Interpolated Context Models (ICMs), which are built for every query genome by adapting the algorithm parameters to that genome's GC content, start and stop codons, and so on.

Run Prokaryotic Gene Finding

To create the most accurate model for your genome, this tool joins all the input FASTA files and builds the model from the combined sequences. Once the model is built, it performs the gene finding for each entry in the files. This approach allows you to save the model created from all your sequences (belonging to the same organism) and later use it to find genes in a short sequence without loading the entire genome. If you are running this tool on small genomic fragments and no genome is available for your organism, the genome of the closest available evolutionary relative can be used to provide a training set of genes.
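The pooling step described above can be pictured with a minimal Python sketch. It is only an illustration of joining several FASTA inputs into one training set, not the tool's actual implementation; the function names `parse_fasta` and `pool_records` and the example records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    record_id: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # The record ID is the first whitespace-delimited token of the header.
            header = line[1:].split()
            record_id = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line.upper())
    if record_id is not None:
        yield record_id, "".join(chunks)


def pool_records(files: Iterable[Iterable[str]]) -> Dict[str, str]:
    """Pool the records of several FASTA inputs into one training set."""
    pooled: Dict[str, str] = {}
    for lines in files:
        for record_id, sequence in parse_fasta(lines):
            pooled[record_id] = sequence
    return pooled


# Two made-up inputs standing in for two FASTA files of the same organism.
file_a = [">contig1 sample", "ATGAAA", "TTTTAA"]
file_b = [">contig2", "atgcccTAG"]
training_set = pool_records([file_a, file_b])
print(sorted(training_set))
```

In this sketch the model would be trained once on `training_set`, and gene finding would then be run on each record separately.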


Input Page

The query file contains the DNA input sequence and must be in uncompressed FASTA format, with one or several sequences per file (figure 1). You can select a folder or multiple FASTA files.

Note: Be sure to select only the FASTA files containing the sequences of your query organism.

Figure 1: Input Data Page

Configuration 1 Page

This page groups the main settings regarding your query genome (figure 2).

  • Genetic code: Here you can choose the genetic code for your genome. Only the 1st, 2nd and 11th codes are available, corresponding to the General genetic code, the Mycoplasma/Spiroplasma code, and the Bacterial and Archaeal code.

  • Minimum gene length: Sets the minimum length, in nucleotides, that a predicted gene must have to be reported.

  • Maximum gene overlap: Here you can choose the maximum overlap length allowed between two genes. Unlike eukaryotic genes, prokaryotic genes often overlap.

  • Minimum gene score: Every ORF found is assigned a score that depends on its length and on its start and stop codons. Here you can modify the minimum score necessary to consider an ORF a gene. Lowering this value will increase the number of genes found, but will also increase prediction errors.

  • Genome Shape: Here you can specify whether the genome is linear or circular. If a linear genome is assumed, no genes will be predicted that `wrap around' between the end and the beginning of the sequence.

Figure 2: Configuration 1 Page
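To make the effect of the length, overlap and score thresholds concrete, here is a minimal Python sketch. It is a simplified greedy filter written for illustration only and is not how Glimmer selects genes internally; the `Candidate` type, the function names, and the threshold values used in the example are all assumptions.

```python
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float


def overlap(a: Candidate, b: Candidate) -> int:
    """Number of positions shared by two candidates on a linear sequence."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def filter_candidates(
    candidates: Iterable[Candidate],
    min_length: int,
    max_overlap: int,
    min_score: float,
) -> List[Candidate]:
    """Keep candidates that satisfy all three thresholds."""
    kept: List[Candidate] = []
    # Visit high scores first so that, when two candidates overlap too
    # much, the higher-scoring one is the one that survives.
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        length = cand.end - cand.start + 1
        if length < min_length or cand.score < min_score:
            continue
        if all(overlap(cand, other) <= max_overlap for other in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.start)


# Made-up candidates and illustrative threshold values.
candidates = [
    Candidate(1, 300, 50.0),
    Candidate(250, 600, 80.0),
    Candidate(590, 700, 40.0),
    Candidate(800, 860, 90.0),
]
genes = filter_candidates(candidates, min_length=110, max_overlap=50, min_score=30.0)
print(genes)
```

Here the last candidate is dropped for being too short, and the first is dropped because it overlaps a higher-scoring candidate by 51 positions.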

Configuration 2 Page

The second wizard page is dedicated to the Interpolated Context Model (ICM) creation parameters. The ICMs are a further extension of Interpolated Markov Models (IMMs) used to identify the coding regions and distinguish them from non-coding DNA. This step is one of the most sensitive points of the process, as it will determine the accuracy of all the following gene predictions (figure 3).

First, you can choose to create a new ICM or to use one created previously. If you choose to create a new ICM, you can create one with the default parameters or modify the parameters by selecting the advanced parameters checkbox:

  • Allow in-frame stops: When this option is off, ORFs with in-frame stop codons are omitted when building the model. Default: off

  • ICM depth: The maximum number of positions in the context window that will be used to determine the probability of the predicted positions. Default: 7

  • ICM Width: Set the width of the ICM to the specified number. The width includes the predicted position. Default: 12

  • ICM Period: The period is the number of different submodels used for different positions in the text in a cyclic pattern. For example, if the period is 3, the first submodel will determine positions 1, 4, 7, ...; the second submodel will determine positions 2, 5, 8, ...; and the third submodel will determine positions 3, 6, 9, .... For a non-periodic model, use a value of 1. Default: 3

  • Gene entropy cutoff: Only genes with an entropy distance score smaller than the given value will be considered. If this cutoff is raised, more sequences will be identified as coding, resulting in more candidate genes.

    This parameter is inspired by the observation that coding sequences can be translated into an amino acid sequence capable of folding into a protein, whereas non-coding sequences do not have this function. Amino acid sequences capable of folding into a protein show a global organizational order, in contrast to the pseudo-amino-acid sequences generated from non-coding (or completely random) DNA sequences. From the amino acid composition (or abundance) of a sequence we can determine the entropy of the resulting protein, which allows the two kinds of sequences (coding and non-coding) to be separated. Default: 1.15

If you choose to create a new ICM, you can save it by checking the option and selecting the output folder. The ICM file (.icm) can be used in later runs to save computation time.


Figure 3: Configuration 2 Page
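The cyclic assignment of positions to submodels described for the ICM period can be written as a small Python sketch. It only reproduces the position-to-submodel mapping stated above and says nothing about how the submodels themselves are trained; the function names are hypothetical.

```python
from typing import Dict, List


def submodel_index(position: int, period: int = 3) -> int:
    """Return the 1-based submodel used for a 1-based sequence position."""
    if position < 1 or period < 1:
        raise ValueError("position and period must both be at least 1")
    return (position - 1) % period + 1


def positions_by_submodel(length: int, period: int = 3) -> Dict[int, List[int]]:
    """Group positions 1..length by the submodel that determines them."""
    groups: Dict[int, List[int]] = {i: [] for i in range(1, period + 1)}
    for position in range(1, length + 1):
        groups[submodel_index(position, period)].append(position)
    return groups


print(positions_by_submodel(9, period=3))
```

With a period of 1, every position maps to the single submodel, which corresponds to the non-periodic case.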

Configuration 3 Page

This page groups the settings that pertain to the gene finding process. All of these settings come in pairs: the first member of each pair is a checkbox that switches from the automatic value to a manually set value. Note: if a setting is left as `Automatic', its value will be calculated automatically (figure 4).

  • GC content: Allows the G+C percentage to be set.
  • Start codons: Allows the start codons to be set as a comma-separated list. Note: If you want to use only one start codon, it is advisable to list all three start codons and to set the weight of the desired start codon to 1 in the `Start codons weight' parameter.
  • Start codons weight: Specifies the probability of the different start codons (same number and order as in the `Start codons' parameter). If the start codons have been specified without weights, each start codon is assigned an equal weight (a distribution that is very unusual in practice).
  • Stop codons: Allows the stop codons to be set as a comma-separated list.

Figure 4: Configuration 3 Page
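The GC content and start codon weight settings above can be illustrated with a short Python sketch. It is not the tool's own code: the helper names are hypothetical, and the comma-separated format assumed here for the weights simply mirrors the format the page describes for the codon list.

```python
from typing import Dict, Optional


def gc_percent(sequence: str) -> float:
    """Percentage of G+C among the unambiguous bases of a sequence."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def start_codon_weights(codons: str, weights: Optional[str] = None) -> Dict[str, float]:
    """Pair each start codon with a probability, in the order given."""
    names = [c.strip().lower() for c in codons.split(",") if c.strip()]
    if weights is None:
        # No weights given: every start codon gets the same weight.
        values = [1.0] * len(names)
    else:
        values = [float(w) for w in weights.split(",")]
    if len(values) != len(names):
        raise ValueError("need exactly one weight per start codon")
    total = sum(values)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: value / total for name, value in zip(names, values)}


print(gc_percent("ATGCGC"))
print(start_codon_weights("atg,gtg,ttg", "1,0,0"))
```

The second call shows the case mentioned in the note: all three start codons are listed, and only the desired one carries weight.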

Results

Once the gene finding tool has finished, two objects will automatically be opened:

  • Sequence table: Here you can see the traditional OmicsBox table showing the sequence name, which corresponds to the FASTA ID line plus a gene identifier, and the sequence length. Note: the sequence can be in nucleotides or in amino acids, depending on the wizard selection.

  • GFF3 table: Here you can see the results as a GFF file with:

    • Sequence: The name of the source sequence to which this feature belongs.

    • Source: The name of the program that has predicted this feature, in this case, `Glimmer'.

    • Type: The type of the feature, which can be `gene', `mRNA', `CDS', `Start' or `stop'.

    • Start: The coordinate of the start codon.

    • End: The coordinate of the stop codon.

    • Score: The score assigned to the feature. Exons are not scored.

    • Strand: The strand of the feature, where `+' means that the feature is forward oriented and `-' means that it is reverse oriented.

    • Phase: The frame in which this feature must be translated; the value can be `0', `1' or `2'. The features of one gene `set' can have different phase values, due to a frameshift in an intron.

    • Attributes: Here you can see all the attributes assigned to each feature. The attributes are `ID', which assigns an identifier to each feature, and `parent', which is present on the CDS and exon features and indicates the feature to which they belong (referring to it by its ID).
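For readers who export the GFF3 table and want to process it outside the application, the nine columns listed above can be read with a minimal Python sketch such as the following. The `Feature` type, the function name and the example line are made up for illustration; the column order follows the GFF3 format.

```python
from typing import Dict, NamedTuple, Optional


class Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Feature:
    """Split one tab-separated GFF3 feature line into its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes: Dict[str, str] = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return Feature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),   # "." means no score
        strand,
        None if phase == "." else int(phase),     # "." means no phase
        attributes,
    )


# A made-up line, not real output of the tool.
example = "contig1\tGlimmer\tCDS\t101\t400\t12.5\t+\t0\tID=cds1;Parent=gene1"
feature = parse_gff3_line(example)
print(feature.feature_type, feature.end - feature.start + 1, feature.attributes)
```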

The resulting GFF3 can be inspected using the Genome Browser. To display a GFF entry, right-click on it and select the `Show in the Genome Browser' option (figure 5). For more information about this feature, visit the Genome Browser documentation section.

Figure 5: How to open the Genome Browser


A Result Viewer is also opened to display the name of each sequence present in the FASTA file, the number of genes per sequence, the minimum and maximum gene length, and the strand of the genes found (figure 6).

Figure 6: Result Summary
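The kind of per-sequence figures shown in the Result Viewer can be recomputed from a list of predicted genes with a short Python sketch. This is an illustration under assumptions, not the application's code: the input tuples, the function name and the dictionary keys are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Gene = Tuple[str, int, int, str]  # (sequence name, start, end, strand)


def summarize(genes: Iterable[Gene]) -> Dict[str, Dict[str, int]]:
    """Count genes and collect length and strand statistics per sequence."""
    grouped: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for seqid, start, end, strand in genes:
        # Coordinates are 1-based and inclusive, hence the +1.
        grouped[seqid].append((end - start + 1, strand))
    summary: Dict[str, Dict[str, int]] = {}
    for seqid, items in grouped.items():
        lengths = [length for length, _ in items]
        summary[seqid] = {
            "genes": len(items),
            "min_length": min(lengths),
            "max_length": max(lengths),
            "forward": sum(1 for _, strand in items if strand == "+"),
            "reverse": sum(1 for _, strand in items if strand == "-"),
        }
    return summary


# Made-up gene predictions for two sequences.
predicted = [
    ("contig1", 101, 400, "+"),
    ("contig1", 900, 1500, "-"),
    ("contig2", 10, 250, "+"),
]
report = summarize(predicted)
print(report)
```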