Gene Ontology Annotation

Annotation Rule

This is the process of selecting GO terms from the GO pool obtained by the Mapping step and assigning them to the query sequences. In the current OmicsBox version, this is the core type of functional annotation.

GO annotation is carried out by applying an annotation rule (AR) to the ontology terms found during mapping. The rule seeks the most specific annotations that meet a given level of reliability. The process is adjustable in specificity and stringency.

For each candidate GO an annotation score (AS) is computed. The AS is composed of two additive terms.

The first, the direct term (DT), is the highest similarity among the hits carrying this GO, weighted by a factor that corresponds to its evidence code (EC).

The second term (AT) of the AS provides the possibility of abstraction. This is defined as an annotation to a parent node when several child nodes are present in the GO candidate collection. This term multiplies the number of total GOs unified at the node by a user-defined GO weight factor that controls the possibility and strength of abstraction. When GO weight is set to 0, no abstraction is done.
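The two additive terms can be sketched in Python as follows. This is only an illustration of the description above, not OmicsBox code: the function name, the `(similarity, ec_weight)` hit representation and the `unified_go_count` argument are assumptions, and the exact definitions remain those given in figure 1.

```python
def annotation_score(hits, unified_go_count, go_weight=5.0):
    """Return DT + AT for one candidate GO term.

    hits: iterable of (similarity_percent, ec_weight) pairs, one per hit
          that carries this GO term (hypothetical representation).
    unified_go_count: number of GOs unified at this node; assumed to be
          supplied by the caller (see figure 1 for the exact definition).
    """
    # Direct term: highest hit similarity after weighting by the evidence code.
    direct_term = max(similarity * ec_weight for similarity, ec_weight in hits)
    # Abstraction term: vanishes when go_weight is 0 (no abstraction).
    abstraction_term = unified_go_count * go_weight
    return direct_term + abstraction_term
```

For example, with all EC weights at 1 and `go_weight=0`, `annotation_score([(80, 1.0), (60, 1.0)], 3, go_weight=0)` reduces to the highest hit similarity, 80.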

Finally, the AR selects the lowest term per branch that lies over a user-defined threshold. DT, AT and the AR terms are defined as given in figure 1.

Figure 1: OmicsBox Annotation Rule
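The selection step of the rule (lowest term per branch over the threshold) can be sketched as below. The sketch takes the annotation scores as given; the `parents` dictionary, the function name and the choice of `>=` for "lies over the threshold" are assumptions made for illustration.

```python
def select_lowest_terms(scores, parents, threshold=55.0):
    """Pick the most specific GO terms whose score reaches the threshold.

    scores:  dict mapping GO id -> annotation score.
    parents: dict mapping GO id -> list of direct parent GO ids
             (hypothetical representation of the GO DAG).
    """
    passing = {go for go, score in scores.items() if score >= threshold}
    # Every ancestor of a passing term is covered by a more specific term.
    covered = set()
    for go in passing:
        stack = list(parents.get(go, ()))
        while stack:
            parent = stack.pop()
            if parent not in covered:
                covered.add(parent)
                stack.extend(parents.get(parent, ()))
    return passing - covered
```

A parent term is therefore reported only when none of its descendants reaches the threshold.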


To see how the annotation score behaves, consider the following. When the EC weight is set to 1 for all evidence codes (no EC influence) and the GO weight is 0 (no abstraction), the annotation score equals the maximum similarity among the hits that carry that GO term, and the sequence is annotated with that term if the score lies above the threshold. EC weights lower than 1 mean that higher similarities are required to reach the threshold. A GO weight greater than 0 makes it possible for a parent node to reach the threshold even when none of its child nodes does.

The annotation rule provides a general framework for annotation. How annotation actually occurs depends on how the parameters of the AS are set. These can be adjusted in the Annotation Configuration Dialog (figure 2) and in the Evidence Code Weight Configuration Dialog (figure 3).


  1. Annotation Cut-Off (threshold). The annotation rule selects the lowest term per branch that lies over this threshold (default=55).
  2. GO-Weight. This is the weight given to the contribution of mapped child terms to the annotation of a parent term (default=5).
  3. Filter GO by taxonomy. This filter removes the Gene Ontology terms known not to occur in the given taxonomy, using the restrictions defined by Gene Ontology. You can select one of the given options or enter a taxonomy id.
  4. E-Value-Hit-Filter. This value acts as a pre-filter: only GO terms obtained from hits with an e-value lower than the given value (that is, from more significant hits) will be used for annotation and/or shown in a generated graph (default=1.0E-6).
  5. Hsp-HitCoverage CutOff. Sets the minimum required coverage between a hit and its HSP. For example, a value of 80 means that the aligned HSP must cover at least 80% of the length of its hit. Only annotations from hits fulfilling this criterion are considered for annotation transfer.
  6. Hit Filter. This option restricts annotation to the first N hits. It works in combination with the "Only hits with GOs" option.
  7. Only hits with GOs. Together with the "Hit Filter" option, this restricts the first N hits to those that have a GO term candidate.
  8. EC-Weight. Evidence code weights can be modified on the following pages of the Run Annotation dialog, reached by clicking Next. If no influence of evidence codes is wanted, set them all to 1. Alternatively, to exclude GO annotations with a certain EC (for example IEA), set the weight of that EC to 0.
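How the hit-level filters in this list might combine can be sketched as follows. The `Hit` class, the function name and the order in which the filters are applied are assumptions for illustration, not the OmicsBox implementation. The sketch also assumes, as is conventional for e-value cut-offs, that a hit is kept when its e-value does not exceed the cut-off, and it treats a coverage cut-off of 0 as "disabled".

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    e_value: float
    hsp_length: int   # length of the aligned HSP
    hit_length: int   # full length of the hit sequence
    go_terms: list = field(default_factory=list)


def prefilter_hits(hits, e_value_cutoff=1.0e-6, min_hsp_hit_coverage=0.0,
                   max_hits=None, only_hits_with_gos=False):
    """Return the hits that would be passed on to the annotation rule."""
    kept = []
    for hit in hits:
        # E-Value-Hit-Filter: drop hits less significant than the cut-off.
        if hit.e_value > e_value_cutoff:
            continue
        # Hsp-HitCoverage CutOff: percentage of the hit covered by its HSP.
        coverage = 100.0 * hit.hsp_length / hit.hit_length
        if coverage < min_hsp_hit_coverage:
            continue
        # Only hits with GOs: skip hits without GO term candidates.
        if only_hits_with_gos and not hit.go_terms:
            continue
        kept.append(hit)
    # Hit Filter: keep only the first N remaining hits.
    return kept if max_hits is None else kept[:max_hits]
```

With `only_hits_with_gos=True`, the first N hits are counted among the hits that carry GO candidates, which is how the interplay of options 6 and 7 is read here.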



Figure 2: Annotation Configuration

Figure 3: Evidence Code weight configuration

Successful annotation for each query sequence will result in a color change for that sequence from light-green to blue at the Main Sequence Table, and only the annotated GOs will remain in the GO IDs column.

Individual Annotation Results


Annotation results for each sequence can also be visualized on the GO DAG by selecting "Draw Graph of GO-Mapping with Annotation Score" in the context menu. The "Change Annotation and Description" option of the same menu makes it possible to adjust the annotations of a single sequence.
This function edits the annotation of the selected sequence and allows annotations or the sequence description to be typed in or deleted. A manual annotation check-box (see figure 5) is available for marking manually annotated sequences; these sequences receive a pink label in the Main Sequence Table.




Figure 4: Manually change Annotation and Description

Figure 5: Mark Manual Annotation

Statistics

An overview of the extent and intensity of the annotation can be obtained from the Annotation Distribution Chart (figure 6), which shows how many sequences were annotated with a given number of GO terms.


Figure 6: Annotation Distribution


To open the Annotation Statistics Wizard (figure 7), click the "charts" icon in the main toolbar and select "Annotation Statistics".

The following statistics are available:

  • Annotation Distribution: Shows the number of GO terms assigned per sequence.
  • GO Annotation Level Distribution: A bar chart that shows all GO terms of all 3 categories for a given GO level, taking into account the GO hierarchy (parent-child relationships).
  • Annotation Score Distribution: A chart that shows the number of sequences per annotation score.
  • Annotated Seqs/Seq-Length: Shows the relation between the number of annotated sequences and sequence length.
  • Number of GOs/Seq-Length: Shows the relation between sequence length and number of GOs.
  • GO Distribution by Level: A bar chart that shows all GO terms of all 3 categories for GO level 2, taking into account the GO hierarchy.
  • Direct GO Count:
    • Molecular Function: A chart for the Molecular Function GO category, which shows the most frequent GO terms within a data-set without taking into account the GO hierarchy.
    • Biological Process: Same as above but for Biological Process.
    • Cellular Component: Same as above but for Cellular Component.
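As an illustration of what the Annotation Distribution chart counts, the following sketch tallies how many sequences carry a given number of GO terms. The input dictionary and the function name are assumptions; the chart itself is produced by OmicsBox.

```python
from collections import Counter


def annotation_distribution(annotations):
    """Map 'number of GO terms' -> 'number of sequences with that many'.

    annotations: dict mapping sequence name -> collection of GO ids
                 (hypothetical in-memory representation).
    """
    return Counter(len(set(go_ids)) for go_ids in annotations.values())
```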


Figure 7: Annotation Statistics Wizard


Annotate GOs from Blast Descriptions

This tool looks at every significant alignment of each sequence (Right-Click > Show Blast Result on a sequence) and searches the description lines for GO ids. These GOs are annotated directly to the sequence if the similarity of the alignment reaches the desired minimum. Validation can also be applied and is recommended, since it removes redundant intermediate GO terms.
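The idea can be sketched with a regular expression that matches the standard GO id layout ("GO:" followed by seven digits). The function name and the `(description, similarity)` input are assumptions for illustration; the sketch does not include the validation step.

```python
import re

# GO ids consist of the prefix "GO:" and seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


def gos_from_descriptions(alignments, min_similarity):
    """Collect GO ids found in the description lines of sufficiently similar hits.

    alignments: iterable of (description, similarity_percent) pairs
                (hypothetical representation of the Blast result).
    """
    found = []
    for description, similarity in alignments:
        if similarity < min_similarity:
            continue  # alignment does not reach the desired minimum
        for go_id in GO_ID_PATTERN.findall(description):
            if go_id not in found:
                found.append(go_id)  # keep first-seen order, no duplicates
    return found
```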


Exporting Annotation

The annotation results can be exported in a variety of formats. This function is available under File > Export > Export Annotation.

  1. .annot. This is the default option for Annotation export and the exchange annotation format in OmicsBox. Annotations are provided in a three-column fashion. The first column contains the sequence name, the second the annotation code and the third the sequence description. When multiple annotations for the same sequence are available, these come in subsequent rows. GO and EC annotations are exported jointly in the same format.


  2. Genespring format. One row is given per sequence, with three separate columns for Molecular Function, Biological Process and Cellular Component. GO terms are denoted by their description rather than by their code.


  3. GoStats format. One row is given per sequence, and GO terms are denoted only by integers (the "GO:" prefix and leading zeros are dropped).


  4. WEGO format (native). One row is given per sequence, including sequences without annotated GOs. The GOs of each sequence follow its name, separated by tabs. The format corresponds to the "WEGO native format", shown in this example: 
    http://wego.genomics.org.cn/docs/input01.lst.
  5. Custom. The exported annotation file can be customized with respect to the information included and the column separator (see figure 9).
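Two of the conventions above are easy to illustrate: a single three-column .annot row and the integer GO notation used by the GoStats format. Both helper names are hypothetical, and the sketch only covers these two details, not the complete layout of either export file.

```python
def annot_row(seq_name, code, description=""):
    """Build one tab-separated .annot row: name, annotation code, description."""
    return "\t".join((seq_name, code, description))


def gostats_id(go_id):
    """Drop the 'GO:' prefix and leading zeros, e.g. for GoStats-style ids."""
    return int(go_id.split(":", 1)[1])
```

For instance, `gostats_id("GO:0001234")` yields the integer 1234.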



OmicsBox can also export annotations in additional file formats:

  1. Export Annotations in GO Annotation File Format (GAF v.2), which is the primary format currently used by the GO Consortium (http://geneontology.org/page/go-annotation-file-formats).
  2. Export Annotation Descriptions.
  3. Export GO Propagation: Exports the GO parents up to the root for the annotated sequences.
  4. Export Sequences per GO (Gene Sets).
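The last two exports correspond to simple operations on the annotation data, sketched below. The `parents` dictionary, the function names and the decision to include the annotated terms themselves in the propagated set are assumptions made for illustration.

```python
from collections import defaultdict


def propagate(go_ids, parents):
    """Return the given GO ids together with all their ancestors up to the root.

    parents: dict mapping GO id -> list of direct parent GO ids
             (hypothetical representation of the GO DAG).
    """
    seen = set()
    stack = list(go_ids)
    while stack:
        go = stack.pop()
        if go not in seen:
            seen.add(go)
            stack.extend(parents.get(go, ()))
    return seen


def sequences_per_go(annotations):
    """Invert sequence -> GO ids into GO id -> set of sequences (gene sets)."""
    gene_sets = defaultdict(set)
    for seq_name, go_ids in annotations.items():
        for go in go_ids:
            gene_sets[go].add(seq_name)
    return dict(gene_sets)
```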


Figure 8: Export Annotation Configuration


Figure 9: Export Annotations Custom Configuration

GO-Slim

GO-Slim is a reduced version of the Gene Ontology that contains a selected number of relevant nodes. The Run GO-Slim (online) function (under the Functional analysis > Blast2GO Annotation > GO-Slim menu) generates a GO-Slim mapping for the available annotations. Different GO-Slims are available, each adapted to specific organisms. OmicsBox supports the following GO-Slim mappings: General, Plant, Yeast, GOA (GO-Association) and TAIR.

Use the Functional analysis > Blast2GO Annotation > GO-Slim > Remove GO-Slim option to return to the original annotations.

Enzyme Code

OmicsBox provides Enzyme Code (EC) annotation through the direct GO > EC mapping file available at the GO website. This means that only sequences with GO annotations can receive EC numbers, and that the accuracy of the GO annotation carries over to the enzyme annotation. Note that in this section EC stands for Enzyme Code, not for evidence code.
The enzyme codes are then used to load KEGG pathways.
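A GO > EC lookup of this kind can be sketched as follows. The line layout handled here ("EC:number > GO:term name ; GO:id", with comment lines starting with "!") is an assumption about the mapping file, and both function names are hypothetical.

```python
def parse_ec2go(lines):
    """Build a GO id -> list of EC numbers lookup from mapping-file lines.

    Assumed line layout: 'EC:1.1.1.1 > GO:term name ; GO:0004022'.
    """
    go_to_ec = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        ec_part, _, go_part = line.partition(" > ")
        if not go_part:
            continue  # line does not follow the assumed layout
        go_id = go_part.rsplit(";", 1)[-1].strip()
        go_to_ec.setdefault(go_id, []).append(ec_part.strip())
    return go_to_ec


def enzyme_codes(go_ids, go_to_ec):
    """Return the sorted EC numbers implied by a sequence's GO annotations."""
    return sorted({ec for go in go_ids for ec in go_to_ec.get(go, ())})
```

A sequence without any mapped GO term gets an empty list, which matches the statement that only GO-annotated sequences can receive EC numbers.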

Statistics

To see the main enzyme classes in the dataset, generate an Enzyme Code distribution chart from the "Charts and statistics" menu.

  • Main Enzyme Classes: Shows the distribution of the 6 main enzyme classes over all sequences.
  • Second Level Classes: Same as above but for the corresponding subclass.
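The grouping behind these two charts can be sketched by truncating each EC number to its first one or two fields. The function name and input list are assumptions, and the sketch counts annotations rather than reproducing the exact chart.

```python
from collections import Counter


def enzyme_class_counts(ec_numbers, level=1):
    """Count EC numbers by main class (level=1) or subclass (level=2)."""
    counts = Counter()
    for ec in ec_numbers:
        parts = ec.split(":", 1)[1].split(".")
        if len(parts) >= level:
            counts[".".join(parts[:level])] += 1
    return counts
```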


Figure 10: Enzyme Code Distribution

Figure 11: Enzyme Code Statistic

Load Annotation Results (.annot)

Existing annotations can be imported using the .annot format. For import purposes only, the .annot format also allows multiple annotations of the same sequence to be given in a single row, separated by commas, as shown below (schema: Seq-Name <tab> GO(s) or EC(s) <tab> Sequence description):

OmicsBox Annotation File (.annot):

Seq1 GO:0001234 glycolipid transfer protein-like
Seq1 GO:0001264,GO:0004567,...
Seq1 GO:0034567
Seq1 EC:2.1.2.10
Seq2 GO:0001234,... sorbitol transporter
Seq2 GO:0001244
Seq3 GO:0001234,GO:0004567,GO:0009123
Seq3 EC:1.2.4.1, EC:3.1
....
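Reading a file with this schema can be sketched as below. The function name and the returned structure are assumptions; the sketch accepts both one annotation per row and comma-separated annotations, and keeps the first non-empty description found for each sequence.

```python
def parse_annot(lines):
    """Parse .annot lines into {seq_name: {"codes": [...], "description": str}}."""
    records = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # not an annotation row
        record = records.setdefault(fields[0], {"codes": [], "description": ""})
        # The second column may hold one code or several separated by commas.
        for code in fields[1].split(","):
            code = code.strip()
            if code and code not in record["codes"]:
                record["codes"].append(code)
        # The description column is optional.
        if len(fields) > 2 and fields[2].strip() and not record["description"]:
            record["description"] = fields[2].strip()
    return records
```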


Other Annotation Functions

Further annotation functions are available in the submenu:

  • Remove Annotation. Deletes the annotation results of the selected sequences.
  • Filter Annotation by GO Taxa
  • Validate Annotations. OmicsBox annotation generates lowest node annotations. This is not always guaranteed when Annotations have been imported or changed manually. This function can be run to ensure that no parent-child redundancy is present in the annotated set.
  • Remove 1. Level Annotations
  • Annotate GOs from Blast Descriptions. Transfers GOs from the Blast hit descriptions to their sequences.