Comparative Analysis

This section explains the tools for the comparison of identified OTUs and functional annotation compositions between samples.

The first two tools, the Sample Comparison Chart and the Sample Comparison GO Graph, are visual and allow you to compare function abundances between samples. GO Slim generalizes GO annotations to make them comparable.

OTU Differential Abundance Testing identifies over- and underrepresented OTUs between samples and conditions with the help of edgeR, a Bioconductor package.

Sample Comparison Chart

This feature helps to compare annotations between different samples with distribution charts. It also helps to compare GO annotations from EggNOG and PfamScan (or other tools) for the same sample. First, select the samples to compare. External annotations can also be loaded through File > Load > Load Metagenomic GO Annotations and then added here (figure 1).

Figure 1. Sample comparison charts: input data page.

The second wizard page allows you to configure the distribution chart (figure 2).

  • Columns to Compare: Only annotations that exist in all selected datasets can be selected here.
  • Normalize Counts: In most cases normalizing the counts between 0 and 1 gives better results, because sample sizes are seldom equal.
  • GO Categories: Create charts for each of the 3 main GO categories.
  • Propagation of GO Terms: The GO hierarchy is reflected in the resulting chart and helps to compare less and more specific GO annotations at higher levels.
  • GO Level Filter: If propagation is enabled, GOs at higher levels are represented in higher numbers. This option makes it possible to focus on specific levels.

Figure 2. Configuration page.
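
The effect of the Normalize Counts option can be illustrated with a small sketch. It assumes that normalization means dividing each annotation count by the sample total, which is one common way to bring counts into the 0 to 1 range; the method actually used by the tool is not described here. The function name and the counts are hypothetical.

```python
def normalize_counts(go_counts):
    """Scale raw annotation counts so that they sum to 1 within one sample.

    Assumption: normalization is a simple division by the sample total.
    """
    total = sum(go_counts.values())
    if total == 0:
        return {term: 0.0 for term in go_counts}
    return {term: count / total for term, count in go_counts.items()}


# Hypothetical counts for two samples of very different size.
sample_red = {"intracellular part": 300, "external encapsulating structure": 100}
sample_blue = {"intracellular part": 30, "external encapsulating structure": 30}

norm_red = normalize_counts(sample_red)
norm_blue = normalize_counts(sample_blue)

for term in sample_red:
    print(term, round(norm_red[term], 2), round(norm_blue[term], 2))
```

With raw counts the larger sample would dominate every bar of the chart; after scaling, the bars show how each sample distributes its annotations.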

The example chart compares two samples, both annotated with EggNOG Mapper. Cellular Component GOs from level 5 and lower are shown, ordered by maximum difference. The chart shows that the red sample has major activity in intracellular parts and external encapsulating structures, while the blue sample is active in different parts of the cell.

The chart can be plotted as vertical or horizontal bars, or as a line or area chart. Samples can be included or excluded, and their colors and labels can be changed. The remaining options are self-explanatory (figure 3).

Figure 3. Sample comparison GO chart.

Sample Comparison GO Graph

The colored GO graph visualizes the same data as the chart above. Only GOs that appear in both samples are shown (Sample Filter = 2). Each graph node is divided into colored areas, one per sample, and the size of each area depends on the relative counts (figure 4).

Figure 4. Sample comparison GO graph.

GO Slim

GO Slim is a reduced version of the Gene Ontology that contains a selected number of relevant GOs. More specifically, GO annotations are generalized and lifted up in the hierarchy. This can be seen as a way to normalize GO annotations to simplify comparison between samples.
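
The idea of lifting annotations up in the hierarchy can be sketched as follows. This is a minimal illustration on a hypothetical toy hierarchy, not the mapping procedure used by the tool: each annotated term is replaced by the closest slim terms found at or above it, and the slim terms are then counted.

```python
from collections import Counter

# Hypothetical toy hierarchy: each term maps to its direct parents.
PARENTS = {
    "cell wall": ["external encapsulating structure"],
    "external encapsulating structure": ["cellular component"],
    "ribosome": ["intracellular organelle"],
    "intracellular organelle": ["intracellular part"],
    "intracellular part": ["cellular component"],
    "cellular component": [],
}
# Hypothetical slim subset of the hierarchy above.
SLIM_TERMS = {"external encapsulating structure", "intracellular part"}


def lift_to_slim(term, parents=PARENTS, slim=SLIM_TERMS):
    """Return the closest slim terms found at or above `term`."""
    found, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)  # stop climbing this branch
        else:
            stack.extend(parents.get(current, []))
    return found


def slim_counts(annotations):
    """Count annotations after generalizing each one to its slim terms."""
    counts = Counter()
    for term in annotations:
        for slim_term in lift_to_slim(term):
            counts[slim_term] += 1
    return counts


annotations = ["cell wall", "ribosome", "intracellular organelle", "ribosome"]
print(slim_counts(annotations))
```

Because specific terms such as "ribosome" and "intracellular organelle" end up under the same slim term, samples annotated at different depths become directly comparable.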

OTU Differential Abundance Testing

OTU Differential Abundance Testing is a tool to identify Operational Taxonomic Units (OTUs) whose abundance differs significantly between two microbial communities. This feature is based on edgeR, which belongs to the Bioconductor project and implements statistical tests to evaluate the significance of differences in OTU abundance between a contrast and a reference group.

Figure 5. Differential Abundance Testing: presentation of results.

With a Taxonomic Classification result opened, go to Metagenomics → Comparative Analysis → OTU Differential Abundance Testing. In the wizard, you can select the parameters to run the test. It is divided into three different sections: filtering and normalization (figure 6), experimental design (figure 8) and statistical test (figure 9).

First Wizard Page - Filtering and Normalization

OTUs with low counts will not be considered for the test as they provide little evidence of differential abundance. There are two different filtering steps:

  • Counts per Million Filter. Set a filter to exclude OTUs with low counts across all samples. Filtering is performed on a count-per-million (CPM) basis to account for differences in library sizes between samples (e.g. a CPM of 1 corresponds to a count of 6 in a sample with 6 million total counts). Set this value to 0 if no filtering is desired.
  • Minimum Samples Filter. Set a minimum number of samples in which the CPM has to be above the previous filter. If this value is set to e.g. 5, at least 5 of the samples have to show a count above the given CPM. The number of samples of the smallest group is usually used (e.g. in an experiment that has 2 replicates for each condition or group, an OTU should be counted in at least 2 samples). Set this value to 0 if no filtering is desired.
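
The two filtering steps can be sketched together as shown below. This is a simplified illustration, not the code run by the tool: the function names are hypothetical, library sizes are assumed to be the column totals before filtering, and an OTU is assumed to pass a sample when its CPM is greater than or equal to the threshold (so that a value of 0 disables the filter).

```python
def cpm(count, library_size):
    """Counts per million for one OTU in one sample."""
    return count * 1_000_000 / library_size


def filter_otus(counts, min_cpm, min_samples):
    """Keep OTUs whose CPM reaches `min_cpm` in at least `min_samples` samples.

    counts: dict mapping OTU name -> list of raw counts, one per sample,
            with the same sample order in every list.
    """
    n_samples = len(next(iter(counts.values())))
    # Assumption: library size = total count of each sample before filtering.
    library_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    kept = {}
    for otu, row in counts.items():
        passing = sum(
            1 for count, size in zip(row, library_sizes) if cpm(count, size) >= min_cpm
        )
        if passing >= min_samples:
            kept[otu] = row
    return kept


# Hypothetical counts; every sample has a library size of 1,000,000.
counts = {
    "OTU_A": [500000, 600000, 400000, 700000],
    "OTU_B": [499990, 399995, 599998, 299999],
    "OTU_C": [10, 5, 2, 1],
}

print(sorted(filter_otus(counts, min_cpm=5, min_samples=2)))
print(sorted(filter_otus(counts, min_cpm=5, min_samples=3)))
```

In this example OTU_C reaches a CPM of 5 in only two samples, so it survives a Minimum Samples Filter of 2 but is removed when the filter is raised to 3.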

In this test, the normalization takes the form of scaling factors for library sizes that enter into the statistical model. These correction factors are used to compute the effective library sizes. Five different options are available for the normalization step:

  • TMM (Trimmed Mean of M-values). The M-values are weighted according to inverse variances, which are computed by the delta method for logarithms of binomial random variables.
  • TMMwsp (TMM with singleton pairing). This is a variant of TMM that is intended to perform better for data with a high proportion of zeros (default).
  • RLE (Relative Log Expression). Scale factors are the median ratio of each sample to the median library (geometric mean of all samples).
  • Upper-quartile. 75% quantiles for the counts of each library are used to calculate the scale factors.
  • None. All normalization factors are set to 1.
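
To make the relation between scaling factors and effective library sizes concrete, the following sketch computes factors in the spirit of the RLE option. It is a simplified illustration under stated assumptions and not edgeR's implementation: only OTUs with non-zero counts in every sample are used, and the factors are rescaled so that they multiply to 1. The function names are hypothetical.

```python
import math
from statistics import median


def library_sizes(counts):
    """Total count per sample. counts: list of OTU rows, one count per sample."""
    n_samples = len(counts[0])
    return [sum(row[j] for row in counts) for j in range(n_samples)]


def rle_factors(counts):
    """Simplified RLE-style normalization factors, one per sample."""
    n_samples = len(counts[0])
    # Assumption: OTUs with a zero in any sample are skipped, because their
    # geometric mean across samples is zero.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no OTU has non-zero counts in every sample")
    # Log of the geometric mean across samples: the "median library" reference.
    log_reference = [sum(math.log(c) for c in row) / n_samples for row in usable]
    # Median ratio of each sample to the reference.
    size_factors = [
        math.exp(median(math.log(row[j]) - ref for row, ref in zip(usable, log_reference)))
        for j in range(n_samples)
    ]
    # Express the factors relative to library size and rescale to a product of 1.
    raw = [sf / lib for sf, lib in zip(size_factors, library_sizes(counts))]
    geo_mean = math.exp(sum(math.log(f) for f in raw) / n_samples)
    return [f / geo_mean for f in raw]


# Hypothetical counts: the second sample is the first one sequenced twice as deep.
counts = [[10, 20], [100, 200], [5, 10]]
factors = rle_factors(counts)
effective = [lib * f for lib, f in zip(library_sizes(counts), factors)]
print(factors, effective)
```

When two samples differ only in sequencing depth, as in this example, the factors stay close to 1 and the effective library sizes equal the raw ones; factors different from 1 indicate a compositional difference that library size alone does not capture.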

Figure 6. Differential Abundance Testing wizard: filtering and normalization page.

Second Wizard Page - Experimental Design

Here, the two groups for the test, reference and contrast, have to be specified. You can select the groups by choosing which samples from the taxonomic classification project you want to include in each one, or by loading an experimental design file and selecting the conditions you want to test.

Select samples (no experimental design file loaded)

Select the samples to be considered for the test and divide them into two groups or conditions. The Contrast Group contains the samples that will be tested against the Reference Group.

Experimental design file

You can load your experimental design file. This file must contain the sample names in the first column and the experimental conditions of each sample in the following ones, as can be seen in figure 7. Please make sure the sample names in the first column of this experimental design file match exactly with the samples in the taxonomic classification result.

This experimental design file must be in TSV format (tab-separated values). In this kind of file, fields are separated by a tab character. Please do not use spaces and avoid special characters when writing your experimental design file, to be sure that it will be correctly read and processed.

Experimental Design
Sample	Lake	Time
PAB	Preta	Afternoon
PMB	Preta	Morning
VAB	Verde	Afternoon
VMB	Verde	Morning

Figure 7. Experimental Design file.
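
Before loading the file, its layout can be checked with a few lines of Python. The sketch below is only an illustration of the format described above; the helper names are hypothetical and the checks are not those performed by the tool.

```python
import csv
import io


def read_design(handle):
    """Parse a tab-separated experimental design.

    Returns the list of factor names and a dict: sample -> {factor: condition}.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    factors = header[1:]
    design = {}
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore empty lines
        if len(row) != len(header):
            raise ValueError(f"line {line_number}: expected {len(header)} fields, got {len(row)}")
        design[row[0]] = dict(zip(factors, row[1:]))
    return factors, design


def missing_samples(design, result_samples):
    """Sample names of the classification result that are absent from the design."""
    return sorted(set(result_samples) - set(design))


# The example design from figure 7, with explicit tab characters.
example = (
    "Sample\tLake\tTime\n"
    "PAB\tPreta\tAfternoon\n"
    "PMB\tPreta\tMorning\n"
    "VAB\tVerde\tAfternoon\n"
    "VMB\tVerde\tMorning\n"
)

factors, design = read_design(io.StringIO(example))
print(factors)
print(design["VMB"])
print(missing_samples(design, ["PAB", "PMB", "VAB", "VMB"]))
```

A row with an extra or missing tab raises an error that names the offending line, and a non-empty result from missing_samples points to sample names that do not match exactly.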

Once the file is properly loaded, you can select an experimental factor from the experimental design and the conditions to test for both the Contrast and the Reference group. You can also select samples separately, as described in the previous section, if the Select Samples option is checked.

If a paired design is desired, a Pairing Factor from the experimental design can be optionally selected to adjust for the baseline difference of this factor. Note that this option is only available if you have provided an experimental design file.

Figure 8. Differential Abundance Testing wizard: experimental design page.

Third Wizard Page - Statistical Test

You can enable Test at Specific Taxonomic Levels to consider only results at a specific taxonomic level (species, genus, family, ...).

Here, you can select the statistical test used to detect the differentially abundant OTUs. The test assumes that the OTU counts across groups are distributed as negative binomial random variables. Two different kinds of tests are available:

  • Exact Test. Run an Exact Test to detect a difference in mean between two groups of OTU abundance libraries, reference and contrast groups. This test is performed for each OTU and can only be used if no pairing factor is selected.
  • Generalized Linear Model. Fit a negative binomial generalized log-linear model (GLM) to the counts for each OTU. Two different GLM tests are available:
    • GLM Likelihood Ratio Test. This mode conducts likelihood ratio tests for the coefficients in the linear model using the Cox-Reid dispersion estimates.
    • GLM Quasi-Likelihood F-Test. This mode is similar to the likelihood ratio test, except that it replaces likelihood ratio tests with empirical Bayes quasi-likelihood F-tests. This test provides more robust and reliable error rate control when the number of replicates is small.

Figure 9. Differential Abundance Testing wizard: statistical test page.


Once the analysis has finished, a new table with the results will open (figure 10). Each row of this table corresponds to a tested OTU. The columns contain:

  • Tags. Indicate whether a specific OTU is overrepresented -OVER- (FDR < 0.05 and logFC > 1) or underrepresented -UNDER- (FDR < 0.05 and logFC < -1) in the contrast group.
  • FC (Fold Change). If the mean abundance of an OTU is greater in the contrast group, FC is the ratio of the mean abundance in the contrast condition to that in the reference condition. If it is greater in the reference group, FC is the ratio of the mean abundance in the reference condition to that in the contrast condition, with a negative sign. By default, an OTU is defined as overrepresented if FC > 2 and as underrepresented if FC < -2.
  • LogFC. The log2 fold change. By default, an OTU is defined as overrepresented if logFC > 1 and as underrepresented if logFC < -1, provided it is statistically significant (FDR < 0.05 by default).
  • LogCPM. The average log2 counts per million.
  • LR (Likelihood Ratio). Likelihood Ratio statistic for the GLM (only if GLM LR test is selected).
  • F. Quasi-likelihood F-statistic for the GLM (only if GLM QL test is selected).
  • P-value. The p-value for the null hypothesis of non-differential abundance.
  • FDR. The p-value corrected for multiple testing (Benjamini Y., Hochberg Y., 1995). In addition to meeting the logFC criterion (logFC > 1 or logFC < -1 by default), an OTU must have an FDR < 0.05 to be considered differentially abundant.

Figure 10. OTU Differential Abundance Testing results.
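
The relation between some of these columns can be sketched in a few lines. The code below is an illustration and not the tool's implementation: it assumes that logFC is the log2 of the contrast-to-reference ratio, converts it to the signed FC described above, and applies the standard Benjamini-Hochberg procedure to a list of p-values. The function names are hypothetical.

```python
def signed_fold_change(log_fc):
    """Convert a log2 fold change to the signed FC used in the results table.

    Assumption: log_fc is log2(contrast / reference).
    """
    ratio = 2 ** abs(log_fc)
    return ratio if log_fc >= 0 else -ratio


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity can be enforced.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical values for four OTUs.
print([signed_fold_change(x) for x in (1, -2, 0.5)])
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A logFC of 1 therefore corresponds to FC = 2 and a logFC of -1 to FC = -2, which is why the default cutoffs on both columns select the same OTUs.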

While this project is open, different actions can be carried out from the side panel:

Summary Report

Creates an HTML report, which can be saved as PDF, with the main results of the Differential Abundance Testing: parameters used for the test, number of differentially abundant OTUs, experimental design, ... (figure 11).

Figure 11. OTU Differential Abundance Testing summary report.

Summary Chart

Shows a bar chart with the main results: the number of OTUs before and after the filtering steps, the OTUs considered differentially abundant, and how many of them are over- or underrepresented (figure 12).

Figure 12. OTU Differential Abundance Testing summary chart.

Set Over/Under Tags

Establish new FDR and fold change cutoffs to consider OTUs as differentially abundant. FDR < 0.05 and logFC < -1 or logFC > 1 are set as default (figure 13).

Figure 13. Set Over/Under Tags.
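
The tagging rule can be written as a small function with the two cutoffs as parameters. This is a minimal sketch of the rule as described in this section; the function name and the example values are hypothetical.

```python
def tag_otu(fdr, log_fc, fdr_cutoff=0.05, logfc_cutoff=1.0):
    """Return "OVER", "UNDER" or "" for one OTU, given the two cutoffs."""
    if fdr < fdr_cutoff and log_fc > logfc_cutoff:
        return "OVER"
    if fdr < fdr_cutoff and log_fc < -logfc_cutoff:
        return "UNDER"
    return ""


# Hypothetical results: (OTU name, FDR, logFC).
results = [
    ("OTU_A", 0.001, 2.3),
    ("OTU_B", 0.020, -1.4),
    ("OTU_C", 0.200, 3.0),
    ("OTU_D", 0.010, 0.6),
]

for name, fdr, log_fc in results:
    default_tag = tag_otu(fdr, log_fc)
    strict_tag = tag_otu(fdr, log_fc, fdr_cutoff=0.01, logfc_cutoff=2.0)
    print(name, repr(default_tag), repr(strict_tag))
```

Tightening the cutoffs only removes tags: an OTU that is tagged with the stricter settings is always tagged with the default ones as well.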


Heatmap

Shows a two-dimensional heatmap in which the abundance values are represented by ranges of colors (figure 14). The dendrograms on the left and top sides are produced by a hierarchical clustering method that takes as input the Euclidean distance computed between OTUs (left) and between samples (top).

The upper bars show the experimental conditions of the study (columns), and the OTU names are shown at the right of each row.

You can choose whether to draw the heatmap with the raw counts or with the CPM values, and whether to apply a transformation (base-2 logarithm, Z-score or both).

Figure 14. Heatmap.
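
The transformations and the distance used for clustering can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: a pseudocount of 1 is added before taking the logarithm so that zero counts are defined, the Z-score is computed per OTU (row) with the population standard deviation, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def log2_transform(row, pseudocount=1.0):
    """Base-2 logarithm of each value. Assumption: a pseudocount avoids log2(0)."""
    return [math.log2(value + pseudocount) for value in row]


def z_score(row):
    """Center and scale one row. Assumption: scaling is done per OTU."""
    mu, sd = mean(row), pstdev(row)
    if sd == 0:
        return [0.0 for _ in row]
    return [(value - mu) / sd for value in row]


def euclidean(a, b):
    """Euclidean distance between two rows (or two columns) of the heatmap."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical CPM values for two OTUs across four samples.
otu_a = [0, 1, 3, 7]
otu_b = [1, 3, 7, 15]

both_a = z_score(log2_transform(otu_a))
both_b = z_score(log2_transform(otu_b))
print(both_a, both_b)
print(euclidean(both_a, both_b))
```

The Z-score removes differences in overall abundance between OTUs, so after this transformation the clustering groups OTUs by the shape of their profile across samples rather than by their magnitude.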

Robinson MD, McCarthy DJ and Smyth GK (2010). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics, 26(1), pp. 139-140.