Fisher Exact Test

Introduction

Fisher’s Exact Test can be used to find GO terms, or other annotations, that are over- or under-represented in a set of genes (test set) with respect to a reference group (reference set). The test set can be, for example, the differentially expressed genes from a differential expression analysis or a set of genes related to a phenotype of interest. Fisher’s Exact Test uses a contingency table to examine the association between two kinds of classification: here, whether a gene belongs to the test set and whether it carries a given annotation.

When the proportion of genes annotated with a given GO term in the test set is significantly higher than the proportion in the reference set, the GO term is reported as over-represented; when it is significantly lower, the term is reported as under-represented.
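The comparison of proportions described above can be sketched as a one-sided Fisher's Exact Test on a 2x2 contingency table. This is a minimal illustration written from the hypergeometric distribution, not the FatiGO or OmicsBox implementation; the function names are hypothetical, and the reference counts in the usage line are invented for illustration.

```python
from math import comb


def hypergeom_pmf(k, n_annotated, n_total, n_test):
    """Probability that exactly k of the n_test test genes carry the annotation."""
    return (comb(n_annotated, k) * comb(n_total - n_annotated, n_test - k)
            / comb(n_total, n_test))


def fisher_one_sided(test_annot, test_not_annot, ref_annot, ref_not_annot):
    """Return (p_over, p_under) for one 2x2 contingency table of gene counts.

    Assumption: the reference counts exclude the genes of the test set.
    """
    n_test = test_annot + test_not_annot
    n_annotated = test_annot + ref_annot
    n_total = n_test + ref_annot + ref_not_annot
    # Smallest and largest possible number of annotated genes in the test set.
    k_min = max(0, n_test - (n_total - n_annotated))
    k_max = min(n_test, n_annotated)
    # Over-representation: probability of seeing at least test_annot annotated genes.
    p_over = sum(hypergeom_pmf(k, n_annotated, n_total, n_test)
                 for k in range(test_annot, k_max + 1))
    # Under-representation: probability of seeing at most test_annot annotated genes.
    p_under = sum(hypergeom_pmf(k, n_annotated, n_total, n_test)
                  for k in range(k_min, test_annot + 1))
    return p_over, p_under


# Hypothetical counts: 9 of 61 test genes and 40 of 1000 reference genes annotated.
p_over, p_under = fisher_one_sided(9, 52, 40, 960)
print(f"over-representation p = {p_over:.3g}, under-representation p = {p_under:.3g}")
```

A small p_over suggests the term is over-represented in the test set, and a small p_under suggests it is under-represented. In practice these raw p-values would still need multiple testing correction before a tag is assigned.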

OmicsBox has integrated the FatiGO package for the statistical assessment of annotation differences between two sets of sequences. This package uses Fisher's Exact Test and corrects for multiple testing. For this analysis, the sequences involved (though not necessarily only those) must be loaded in the application together with their annotations. The annotations can either be the result of an OmicsBox annotation or be imported from a file (.annot); see the Gene Ontology Annotation section of this manual.

This functionality can be found as a side panel button in the following tables:

  • Annotated sequences from Functional Analysis.

  • Pairwise Differential Expression results.

  • Combined Pathway Analysis results.

  • Time Course Differential Expression results.

A dialog screen appears (see Figure 1). If the wizard has been launched from an annotation project, the Test and Reference Sequences can be selected by uploading text files or ID-List .box files containing the lists of sequence IDs for the two groups. When no reference set is chosen, the whole dataset present in the project is taken as the Reference. A detailed description of each parameter is available by clicking the help icon next to the parameter.

If Fisher’s Exact Test is applied to differential expression results, the Test Set can be selected from the significantly differentially expressed features (genes/transcripts). The Reference Set then consists of the remaining annotated features from the Reference Set file provided.

The Fisher's Exact Test implementation is sensitive to the direction of the test: sequences that are present in both the test set and the reference set are removed from the reference set, but not from the test set.
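The handling of shared sequences described above can be illustrated with plain set operations. This is only a sketch of the stated behaviour; the function name `prepare_sets` and the parameter `all_project_ids` are hypothetical and are not part of OmicsBox.

```python
def prepare_sets(test_ids, reference_ids=None, all_project_ids=()):
    """Return (test, reference) as sets of sequence IDs.

    If no reference list is given, the whole project is used as reference.
    Sequences present in both sets are removed from the reference only.
    """
    test = set(test_ids)
    if reference_ids is None:
        reference = set(all_project_ids)
    else:
        reference = set(reference_ids)
    reference -= test  # the test set itself is left untouched
    return test, reference


# Hypothetical sequence IDs: "seq2" is shared and is dropped from the reference.
test, reference = prepare_sets(["seq1", "seq2"], ["seq2", "seq3", "seq4"])
print(sorted(test), sorted(reference))
```

Because the removal is one-directional, swapping the test and reference lists does not in general produce mirrored sets, which is why the direction of the test matters.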

For further details please refer to the FatiGO publication (Al-Shahrour, F., Díaz-Uriarte, R., and Dopazo, J. (2004). Fatigo: a web tool for finding significant associations of gene ontology terms with groups of genes. Bioinformatics, 20(4):578–580).

New: In OmicsBox it is now possible to perform Fisher's Exact Test on different types of annotations for most of the results generated in OmicsBox. The Annotations parameter allows selecting the column of the table to use as the annotation. With this feature, it is possible to perform an enrichment analysis of enzymes or InterPro IDs, for example.

Input parameters

  • Test-set Files. ID-list with sequences belonging to the test-set (Annotation project).

  • Test-set Genes. Subset of significant genes to be considered as the test set. It allows choosing between up-regulated and down-regulated genes (Pairwise Differential Expression and Combined Pathway Analysis results).

  • Type of List. Subset of significant genes to be considered as the test set. In this case, you can select the subset of genes according to either the regression variables or the experimental groups (Time Course Differential Expression).

  • Reference-set Files. ID-list with sequences belonging to the reference-set.

  • Two-tailed test. This option tests for both over- and under-representation: the test-set is tested against the reference-set and vice versa.

  • Annotations. Selects how genes are grouped into gene sets for the enrichment analysis: by GO term, by Enzyme Code, etc.

Click on the Run button to start the analysis. It may take a while depending on the number of annotations.

Results Table

Once the analysis has completed, the results table is shown in a new tab (see Figure 2). It contains all the annotation terms and displays a tag only where the adjusted p-value is below the given threshold. The main columns are:

  • Tag. It indicates if the GO term has been declared over or under-represented in the test-set.

  • Adj. P-value. Corrected p-value by the multiple test correction method chosen (False Discovery Rate control according to Benjamini-Hochberg procedure by default).

  • p-Value. Raw p-Value without multiple testing corrections.

Using the context menu of each row, it is possible to get more details about the annotation and also to create an ID-List with the sequences annotated in the Test-Set or the Reference-Set.

  • #Test is the number of sequences in the test set that are annotated with the GO term.

  • #NotAnnotTest is the number of sequences in the test set that are not annotated with that GO term.

Adding these two numbers gives the total number of annotated sequences in your test set, e.g. for GO:0061135: 9 + 52 = 61.

Figure 2: Enrichment Results Table

Sidebar Options

The sidebar lists all the actions that can be performed on this enrichment result, including three options for displaying the results visually:

  • Actions:

    • Set Over/Under Tags: this option allows defining which column (raw p-value or adjusted p-value) is used to display the enrichment tag, as well as the threshold value. The adjusted p-value column can also be updated by selecting a different multiple test correction method. The available methods are:

      • Benjamini-Hochberg: default value, the most commonly used method when controlling for FDR. For further information about how p-values are adjusted by FDR according to Benjamini-Hochberg procedure please refer to the publication: Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289-300.

      • Bonferroni: the most restrictive method, recommended to avoid type I errors at all costs. Controls the family-wise error rate (FWER), or the probability of making one or more false discoveries.

      • Benjamini-Yekutieli: method for controlling the FDR, more conservative than Benjamini-Hochberg and designed to remain valid when the tests are dependent.

      • Holm: a step-down refinement of the Bonferroni method that is less restrictive while still controlling the FWER.

      • Hochberg: a method similar to Holm.

    • Reduce to Most Specific (only for GO annotations): use this option to remove more general GO terms from the results and keep only the most specific terms (those deepest in the GO DAG).

  • Charts:

    • Make Enriched Graph (only for GO annotations): use this option to generate a representation on the GO DAG (see Figure 3). Nodes are color-highlighted proportionally to their significance value. The user can choose which type of calculated p-value to use for highlighting and the threshold for filtering out nodes. Additionally, the Filter intermediate checkbox hides non-enriched nodes. More options are available in the graph viewer's sidebar. The Gene Ontology Graphs section of this manual gives further information on the graphical functions in OmicsBox.

    • Bar Chart: this option generates a bar chart of the percentages of sequences in both the test and reference sets for each annotation of the table (see Figure 4).

    • Dot Plot: this option generates a dot plot, a chart representing 3 dimensions: the annotation term on the Y axis, the gene ratio (the number of annotated test-set sequences divided by the total number of sequences) on the X axis, and the number of annotated test-set sequences as the dot size (see Figure 5).

Figure 3: Enriched Graph

Figure 4: Enriched Bar Chart

Figure 5: Dot Plot Chart
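To illustrate how the multiple test correction methods listed under Set Over/Under Tags differ, the following sketch computes Bonferroni, Holm and Benjamini-Hochberg adjusted p-values in plain Python. It follows the textbook definitions of these procedures and is not the OmicsBox implementation; the function names and the example p-values are hypothetical, and Benjamini-Yekutieli and Hochberg are omitted for brevity.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of tests (controls the FWER)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Step-down Holm adjustment (controls the FWER, less strict than Bonferroni)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvalues):
    """Step-up Benjamini-Hochberg adjustment (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four annotation terms.
raw = [0.01, 0.04, 0.03, 0.005]
for name, method in [("Bonferroni", bonferroni), ("Holm", holm),
                     ("Benjamini-Hochberg", benjamini_hochberg)]:
    print(name, [round(p, 4) for p in method(raw)])
```

For any given term, the Benjamini-Hochberg value is never larger than the Holm value, which in turn is never larger than the Bonferroni value. This is why switching the correction method can change which terms receive an over or under tag at a fixed threshold.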