## Comparative Analysis

# Introduction

This section explains the tools for the comparison of identified OTUs and functional annotation compositions between samples.

The first two tools, the Sample Comparison Chart and the Sample Comparison GO Graph, are visual and allow comparing function abundances between samples. GO Slim generalizes GO annotations to make them comparable.

OTU Differential Abundance Testing identifies over- and underrepresented OTUs between samples and conditions with the help of edgeR, a Bioconductor package for R.

# Sample Comparison

## Sample Comparison Chart

This feature compares annotations between different samples using distribution charts. It can also compare GO annotations from EggNOG and PfamScan (or other tools) for the same sample. First, select the samples to compare. External annotations can also be loaded through File > Load > Load Metagenomic GO Annotations and then added here (figure 1).

**Figure 1.** Sample comparison charts: input data page.

The second wizard page allows configuring the distribution chart (figure 2).

- **Columns to Compare:** Only annotations that exist in all selected datasets can be selected here.
- **Normalize Counts:** In most cases, normalizing the counts between 0 and 1 gives better results, because sample sizes are seldom equal.
- **GO Categories:** Create charts for each of the 3 main GO categories.
- **Propagation of GO Terms:** The GO hierarchy is reflected in the resulting chart, which helps to compare less and more specific GO annotations at higher levels.
- **GO Level Filter:** If propagation is enabled, GO terms at higher levels are represented in higher numbers. This option makes it possible to focus on specific levels.
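The effect of **Propagation of GO Terms** and **Normalize Counts** can be illustrated with a small sketch. The GO identifiers and the hierarchy below are made up, and dividing by the number of direct annotations is only one plausible way to scale counts between 0 and 1; the tool may normalize differently.

```python
from collections import defaultdict

# Toy hierarchy: child -> parents (hypothetical IDs, not real GO terms).
PARENTS = {
    "GO:C": ["GO:B"],
    "GO:B": ["GO:A"],
    "GO:D": ["GO:A"],
    "GO:A": [],
}


def ancestors(term, parents):
    """Return the term itself plus all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(counts, parents):
    """Add each term's count to the term and to every one of its ancestors."""
    totals = defaultdict(int)
    for term, count in counts.items():
        for node in ancestors(term, parents):
            totals[node] += count
    return dict(totals)


def normalize(counts, total):
    """Scale counts to the 0-1 range (assumption: divide by the direct annotation total)."""
    return {term: count / total for term, count in counts.items()}


direct = {"GO:C": 3, "GO:D": 1}          # direct annotation counts of one sample
propagated = propagate(direct, PARENTS)   # general terms accumulate counts
relative = normalize(propagated, total=sum(direct.values()))
```

With propagation enabled, the most general term (`GO:A`) collects the counts of all its descendants, which is why higher levels show higher numbers.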

**Figure 2.** Configuration page.

The example chart (figure 3) compares two different samples, both annotated with EggNOG Mapper. Cellular Component GO terms from level 5 and lower are shown, ordered by maximum difference. The chart shows that the red sample has most of its activity in intracellular parts and external encapsulating structures, while the blue sample is active in different parts of the cell.

The chart can be plotted as vertical or horizontal bars, lines, or areas. Samples can be included or excluded, and their colors and labels can be changed. The remaining options are self-explanatory (figure 3).

**Figure 3.** Sample comparison GO chart.

## Sample Comparison GO Graph

The colored GO graph (figure 4) visualizes the same data as the chart above. Only GO terms that appear in both samples are shown (Sample Filter = 2). Each graph node is divided into colored areas, one per sample, and the size of each area depends on the relative counts.

**Figure 4.** Sample comparison GO graph.

## GO Slim

GO Slim is a reduced version of the Gene Ontology that contains a selected subset of relevant GO terms. GO annotations are generalized by lifting them up in the hierarchy to terms in this subset. This can be seen as a way to normalize GO annotations and simplify the comparison between samples.
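As a rough illustration of this lifting step, the sketch below climbs a toy hierarchy until it reaches a term that belongs to the slim subset. All identifiers are hypothetical, and the actual GO Slim mapping used by the tool may follow different rules.

```python
# Toy hierarchy: child -> parents (hypothetical IDs, not real GO terms).
PARENTS = {
    "GO:leaf1": ["GO:mid"],
    "GO:leaf2": ["GO:mid"],
    "GO:mid": ["GO:slim_a"],
    "GO:other": ["GO:slim_b"],
    "GO:slim_a": ["GO:root"],
    "GO:slim_b": ["GO:root"],
    "GO:root": [],
}
SLIM_TERMS = {"GO:slim_a", "GO:slim_b"}


def map_to_slim(term, parents, slim_terms):
    """Return the closest slim ancestors of a term (the term itself if it is in the slim)."""
    found = set()
    visited = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        if current in slim_terms:
            found.add(current)  # stop climbing along this path
        else:
            stack.extend(parents.get(current, []))
    return found


# Two samples annotated with different specific terms become comparable.
sample_1 = map_to_slim("GO:leaf1", PARENTS, SLIM_TERMS)
sample_2 = map_to_slim("GO:leaf2", PARENTS, SLIM_TERMS)
```

Both samples end up annotated with the same general term, even though their original annotations differed.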

# Differential Abundance Analysis of Taxa

The Differential Abundance Analysis of Taxa is a tool to identify Operational Taxonomic Units (OTUs) that differ significantly between two microbial communities. This feature is based on edgeR, which belongs to the Bioconductor project and implements statistical tests to evaluate the significance of differences in OTU abundance between a contrast group and a reference group.

**Figure 5.** Differential Abundance Testing: presentation of results.

With a Taxonomic Classification result open, go to **Metagenomics → Comparative Analysis → Differential Abundance Analysis of Taxa**. The wizard lets you select the parameters for the test and is divided into three pages: filtering and normalization (figure 6), experimental design (figure 8), and statistical test (figure 9).

### First Wizard Page - Filtering and Normalization

OTUs with low counts are not considered for the test, as they provide little evidence of differential abundance. There are two filtering steps:

- **Counts per Million Filter.** Set a filter to exclude OTUs with low counts across all samples. Filtering is performed on a count-per-million (CPM) basis to account for differences in library sizes between samples (e.g. a CPM of 1 corresponds to a count of 6 in a sample with 6 million total counts). Set this value to 0 if no filtering is desired.
- **Minimum Samples Filter.** Set the minimum number of samples in which the CPM has to be above the previous filter. If this value is set to, e.g., 5, at least 5 of the samples have to show a count above the given CPM. The number of samples of the smallest group is usually used (e.g. in an experiment that has 2 replicates for each condition or group, an OTU should be counted in at least 2 samples). Set this value to 0 if no filtering is desired.
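The two filters can be sketched as follows. This is a minimal illustration of the rule described above, not the edgeR implementation; the OTU names and counts are invented.

```python
def counts_per_million(count, library_size):
    """Convert a raw count into counts per million for one sample."""
    return count * 1_000_000 / library_size


def filter_otus(table, min_cpm, min_samples):
    """Keep OTUs whose CPM is above min_cpm in at least min_samples samples.

    table maps an OTU name to its list of raw counts, one per sample.
    """
    n_samples = len(next(iter(table.values())))
    library_sizes = [sum(row[i] for row in table.values()) for i in range(n_samples)]
    kept = {}
    for otu, row in table.items():
        passing = sum(
            1
            for count, size in zip(row, library_sizes)
            if counts_per_million(count, size) > min_cpm
        )
        if passing >= min_samples:
            kept[otu] = row
    return kept


# Hypothetical counts for three OTUs in four samples (library size 1000 each).
counts = {
    "OTU_1": [500, 400, 450, 480],
    "OTU_2": [1, 0, 0, 2],
    "OTU_3": [499, 600, 550, 518],
}
kept = filter_otus(counts, min_cpm=1500, min_samples=2)
```

Here `OTU_2` is removed because its CPM exceeds 1500 in only one sample. Setting both values to 0 keeps every OTU.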

In this test, normalization takes the form of scaling factors for the library sizes that enter into the statistical model. These correction factors are used to compute the effective library sizes. Five options are available for the normalization step:

- **TMM (Trimmed Mean of M-values).** The M-values are weighted by inverse variances, which are computed with the delta method on binomial data.
- **TMMwsp (TMM with singleton pairing).** A variant of TMM that is intended to perform better for data with a high proportion of zeros (default).
- **RLE (Relative Log Expression).** The scale factor of each sample is the median ratio of that sample to the median library, which is calculated as the geometric mean of all samples.
- **Upper-quartile.** The 75% quantiles of the counts of each library are used to calculate the scale factors.
- **None.** All normalization factors are set to 1.
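To make the RLE idea concrete, the sketch below computes median-of-ratios factors for a small invented table. It is a simplified illustration: edgeR applies additional rescaling steps that are omitted here, and restricting the calculation to OTUs observed in every sample is an assumption of this sketch.

```python
import math
from statistics import median


def rle_size_factors(table):
    """Median-of-ratios factors; table is a list of OTU rows with one count per sample."""
    # Use only OTUs observed in every sample so that the logarithm is defined.
    usable = [row for row in table if all(count > 0 for count in row)]
    # The geometric mean of each OTU across samples plays the role of the "median library".
    geo_means = [
        math.exp(sum(math.log(count) for count in row) / len(row)) for row in usable
    ]
    n_samples = len(table[0])
    return [
        median(row[j] / gm for row, gm in zip(usable, geo_means))
        for j in range(n_samples)
    ]


# Hypothetical counts: the second sample was sequenced twice as deeply as the first.
table = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 7],  # ignored: not observed in every sample
]
factors = rle_size_factors(table)
```

The factor of the second sample is twice that of the first, reflecting the difference in sequencing depth.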

**Figure 6.** Differential Abundance Testing wizard: filtering and normalization page.

### Second Wizard Page - Experimental Design

Here, the two groups for the test, reference and contrast, have to be specified. You can define the groups by choosing which samples from the taxonomic classification project to include in each one, or by loading an experimental design file and selecting the conditions you want to test.

#### Select samples (no experimental design file loaded)

Select the samples to be considered for the test and divide them into two groups or conditions. The **Contrast Group** contains the samples that will be tested against the **Reference Group**.

**Figure 8.** Differential Abundance Testing wizard: experimental design page.

#### Experimental design file

You can load your **experimental design file**. This file must contain the sample names in the first column and the experimental conditions of each sample in the following ones, as can be seen in figure 7. Please make sure the sample names in the first column of this file exactly match the sample names in the taxonomic classification result.

The experimental design file must be in **tsv format** (tab-separated values). In this kind of file, fields are separated by a tab character. Do not use spaces, and avoid special characters, to make sure the file is read and processed correctly.

Once the file is properly loaded, you can select an **experimental factor** from the experimental design and the conditions to test for both the Contrast and the Reference group. You can also select samples individually, as described in the previous section, if the **Select Samples** option is checked.

If a paired design is desired, a **Pairing Factor** from the experimental design can be optionally selected to adjust for the baseline difference of this factor. Note that this option is only available if you have provided an experimental design file.

**Experimental Design**

| Sample | Lake | Time |
|--------|-------|-----------|
| PAB | Preta | Afternoon |
| PMB | Preta | Morning |
| VAB | Verde | Afternoon |
| VMB | Verde | Morning |

**Figure 7.** Experimental Design file.
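A design file with this layout can be checked before loading it. The sketch below parses the example content with Python's `csv` module and reports sample names that have no matching row; the helper names (`read_design`, `samples_with`, `missing_samples`) are hypothetical and not part of the tool.

```python
import csv
import io

# Content of the example design file, with fields separated by tabs.
DESIGN_TSV = (
    "Sample\tLake\tTime\n"
    "PAB\tPreta\tAfternoon\n"
    "PMB\tPreta\tMorning\n"
    "VAB\tVerde\tAfternoon\n"
    "VMB\tVerde\tMorning\n"
)


def read_design(handle):
    """Map each sample name (first column) to its experimental conditions."""
    reader = csv.DictReader(handle, delimiter="\t")
    sample_column = reader.fieldnames[0]
    design = {}
    for row in reader:
        name = row.pop(sample_column)
        design[name] = row
    return design


def samples_with(design, factor, condition):
    """List the samples that have the given condition for an experimental factor."""
    return [sample for sample, row in design.items() if row[factor] == condition]


def missing_samples(design, project_samples):
    """Return project sample names that do not appear in the design file."""
    return sorted(set(project_samples) - set(design))


design = read_design(io.StringIO(DESIGN_TSV))
contrast = samples_with(design, "Lake", "Verde")
reference = samples_with(design, "Lake", "Preta")
```

For a file on disk, `read_design(open(path, newline=""))` would work the same way, and a non-empty result from `missing_samples` points to names that do not match exactly.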

### Third Wizard Page - Statistical Test

You can **Test at Specific Taxonomic Levels** to only consider results for a specific taxonomic rank (species, genus, family, ...).

Here, you can also select the statistical test used to detect the differentially abundant OTUs. The test assumes that the OTU counts across groups are distributed as negative binomial random variables. Two kinds of tests are available:

- **Exact Test.** Runs an exact test to detect a difference in mean between two groups of OTU abundance libraries, the reference and contrast groups. This test is performed for each OTU and can only be used if no pairing factor is selected.
- **Generalized Linear Model.** Fits a negative binomial generalized log-linear model (GLM) to the counts for each OTU. Two GLM tests are available:
  - **GLM Likelihood Ratio Test.** Conducts likelihood ratio tests for the coefficients in the linear model using the Cox-Reid dispersion estimates.
  - **GLM Quasi-Likelihood F-Test.** Similar to the likelihood ratio test, except that it replaces likelihood ratio tests with empirical Bayes quasi-likelihood F-tests. This test provides more robust and reliable error rate control when the number of replicates is small.

**Figure 9.** Differential Abundance Testing wizard: statistical test page.

## Results

Once the taxonomic abundance analysis has finished, a new **table with the results** will open (figure 10). Each row of this table corresponds to a different tested OTU. The columns are:

- **Tags.** Indicate if a specific OTU is overrepresented -OVER- (FDR < 0.05 and logFC > 1) or underrepresented -UNDER- (FDR < 0.05 and logFC < -1) in the contrast sample.
- **FC (Fold Change).** The ratio between the mean abundance value of a specific OTU in the contrast condition and this value in the reference condition, if the mean abundance value in the contrast group is bigger than in the reference group. If this value is bigger in the reference group, the FC is calculated as the ratio between the mean abundance value in the reference condition and the value in the contrast condition, with a negative sign. By default, an OTU is defined as overrepresented if FC > 2 and as underrepresented if FC < -2.
- **LogFC.** The log2 FC. By default, an OTU is defined as overrepresented if logFC > 1 and as underrepresented if logFC < -1, provided it is statistically significant (FDR < 0.05 by default).
- **LogCPM.** The average log2 counts per million.
- **LR (Likelihood Ratio).** Likelihood ratio statistic for the GLM (only if the GLM LR test is selected).
- **F.** Quasi-likelihood F-statistic for the GLM (only if the GLM QL test is selected).
- **P-value.** The p-value for the null hypothesis of non-differential abundance.
- **FDR.** A p-value corrected for multiple testing (Benjamini Y., Hochberg Y., 1995). In addition to meeting the logFC criterion (logFC > 1 or logFC < -1 by default), an OTU must have an FDR < 0.05 to be considered differentially abundant.
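The relation between FC, logFC, and the tags can be written down as a short sketch. It assumes positive mean abundances and computes logFC directly from the two means; in the tool these values come from the edgeR model, so the numbers reported there can differ.

```python
import math


def signed_fold_change(contrast_mean, reference_mean):
    """Fold change with a negative sign when the reference mean is the larger one."""
    if contrast_mean >= reference_mean:
        return contrast_mean / reference_mean
    return -(reference_mean / contrast_mean)


def log_fold_change(contrast_mean, reference_mean):
    """log2 of the contrast/reference ratio (simplification of the model-based logFC)."""
    return math.log2(contrast_mean / reference_mean)


def tag(log_fc, fdr, fdr_cutoff=0.05, log_fc_cutoff=1.0):
    """Return OVER, UNDER, or an empty string using the default cutoffs."""
    if fdr < fdr_cutoff and log_fc > log_fc_cutoff:
        return "OVER"
    if fdr < fdr_cutoff and log_fc < -log_fc_cutoff:
        return "UNDER"
    return ""


fc_up = signed_fold_change(8.0, 2.0)      # contrast is 4 times higher
fc_down = signed_fold_change(2.0, 8.0)    # reference is 4 times higher
tag_up = tag(log_fold_change(8.0, 2.0), fdr=0.01)
tag_down = tag(log_fold_change(2.0, 8.0), fdr=0.01)
tag_not_significant = tag(log_fold_change(8.0, 2.0), fdr=0.2)
```

A logFC cutoff of 1 corresponds to an FC cutoff of 2, which is why both defaults describe the same threshold.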

**Figure 10.** OTU Differential Abundance Testing results.

## Side panel

### Summary Report

Creates an HTML report, which can be saved as PDF, with the main results of the Differential Abundance Testing: parameters used for the test, number of differentially abundant OTUs, experimental design, ... (figure 11).

**Figure 11.** OTU Differential Abundance Testing summary report.

### Summary Chart

Shows a bar chart with the main results: the number of OTUs before and after the filtering steps, the OTUs that are considered differentially abundant, and the over- and underrepresented ones (figure 12).

**Figure 12.** OTU Differential Abundance Testing summary chart.

### Set Over/Under Tags

Sets new FDR and fold change cutoffs for considering OTUs differentially abundant. The defaults are FDR < 0.05 and logFC < -1 or logFC > 1 (figure 13).

**Figure 13.** Set Over/Under Tags.

### Heatmap

Shows a two-dimensional heatmap in which the abundance values are represented by ranges of colors (figure 14). The dendrograms on the left and top sides are produced by a hierarchical clustering method that takes as input the Euclidean distance computed between OTUs (left) and between samples (top).

The upper bars show the experimental conditions of the study (columns), and the OTU names are shown at the right of each row.

You can choose whether to draw the heatmap with the raw counts or with the CPM values, and whether to apply a transformation (logarithm in base 2, Z-score, or both).
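The two transformations can be sketched as below. The pseudocount of 1 before the logarithm and the per-OTU (row-wise) Z-score with the sample standard deviation are assumptions made for this illustration; the tool may use other conventions.

```python
import math
from statistics import mean, stdev


def log2_transform(row, pseudocount=1.0):
    """log2 of each value; the pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(value + pseudocount) for value in row]


def z_score(row):
    """Center a row on its mean and scale it by its sample standard deviation."""
    mu = mean(row)
    sd = stdev(row)
    if sd == 0:
        return [0.0 for _ in row]
    return [(value - mu) / sd for value in row]


raw = [0, 1, 3]                 # hypothetical abundances of one OTU in three samples
logged = log2_transform(raw)    # [0.0, 1.0, 2.0]
scaled = z_score(logged)        # [-1.0, 0.0, 1.0]
```

Applying both transformations, as in this example, puts OTUs with very different abundance ranges on a common color scale.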

**Figure 14.** Heatmap.

## References

Robinson MD, McCarthy DJ and Smyth GK (2010). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics, 26(1), pp. 139-140.

# Differential Abundance Analysis of Functions (Pfam / EggNOG)

The **Metagenomics Functional Differential Abundance Analysis** tool is designed to detect which functional annotations are enriched between two different environmental conditions. The statistical test of this tool is based on an over-dispersed Poisson generalized linear model, specifically designed to detect the differentially abundant annotations between metagenomes.

The analysis requires as input one metagenomic annotation (Pfam or EggNOG) per sample, together with the condition to which each sample belongs. To generate the annotations, some previous steps have to be performed in OmicsBox. Starting from the raw reads of each sample, these are quality control, metagenomic assembly, gene finding, and finally annotation. All these steps must be run **for each sample** to obtain one annotation file per sample, as shown in figure 15. The annotation files are then used as input for the metagenomics functional differential abundance test.

This tool is based on the **ShotgunFunctionalizeR** library and on **HirBin**.

The Functional Differential Abundance Test only allows selecting **one type of annotation (Pfam or EggNOG).**

**Figure 15.** Steps previous to the Functional Differential Abundance Analysis.

## General workflow

The general workflow of this tool is drawn below (figure 16).

**Figure 16.** Functional Differential Abundance Analysis: general workflow.

To run the analysis, go to **Metagenomics → Comparative Analysis → Differential Abundance Analysis of Pfam / EggNOG**. The wizard allows the selection of the annotation files, the experimental conditions, and other parameters for the test (figure 17 and figure 18).

### First Wizard Page - Input

On this page, the annotation files of the different samples have to be uploaded as input (figure 17).

In this section, you can also select which **items to compare**. The available options depend on the type of annotations you provided:

- **Pfam Annotations.** You can compare at the Pfam **Domain** level or at the Pfam **Family** level.
- **EggNOG Annotations.** You can compare at the **Cluster of Orthologous Groups** level (COGs, arCOGs, and KOGs) or at the **KEGG Pathways** level.

**Figure 17.** Functional Differential Abundance Analysis wizard: input page.

### Second Wizard Page - Configuration

Here, the two groups for the test, reference and contrast, have to be specified (figure 18). You can select the groups by choosing which samples from the annotation files previously loaded you want to include in each condition.

#### Filtering

Some annotations are rare in the dataset and can cause false positives or distort the statistical test by introducing non-testable features. In **Counts Filter**, you can set the minimum number of times a feature has to be annotated to be included in the statistical test. We highly recommend setting this filter to at least the number of samples in the dataset (default = 1).
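A minimal sketch of such a filter is shown below. It assumes that the threshold applies to the total count of a feature summed over all samples, which is an interpretation made for this example; the feature names and counts are invented.

```python
def counts_filter(table, min_count=1):
    """Keep features whose total count over all samples reaches min_count."""
    return {
        feature: row for feature, row in table.items() if sum(row) >= min_count
    }


# Hypothetical annotation counts for three features in four samples.
annotations = {
    "FEATURE_A": [12, 9, 15, 11],
    "FEATURE_B": [1, 0, 0, 0],
    "FEATURE_C": [2, 1, 1, 0],
}
n_samples = 4

kept_default = counts_filter(annotations)                           # default = 1
kept_recommended = counts_filter(annotations, min_count=n_samples)  # recommended setting
```

With the default value every feature that was annotated at least once is tested, while the recommended setting drops `FEATURE_B`.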

**Figure 18.** Functional Differential Abundance Analysis wizard: configuration page.

## Results

Once the test has finished, a **table with the results** will open (figure 19). Each row of this table corresponds to a specific annotation at the selected level (*Items to compare* option, described above). The columns are:

- **Tags.** Indicate if the annotation is overrepresented -OVER- or underrepresented -UNDER- in the contrast condition. The default thresholds are 0.05 for FDR and 2 and -2 for FC.
- **Feature and Description.** The feature ID and its description.
- **FC (Fold Change).** The ratio between the mean abundance value of a specific annotation in the contrast condition and this value in the reference condition, if the mean abundance value in the contrast group is bigger than in the reference group. If this value is bigger in the reference group, the FC is calculated as the ratio between the mean annotation abundance value in the reference condition and the value in the contrast condition, with a negative sign. By default, an annotation is defined as overrepresented if FC > 2 and as underrepresented if FC < -2.
- **LogFC.** The log2 FC.
- **Std.Error.** The standard error (standard deviation of the coefficient point estimate) in the GLM.
- **P-value.** The p-value for the null hypothesis of an equal number of annotations between conditions.
- **FDR.** A p-value corrected for multiple testing (Benjamini Y., Hochberg Y., 1995). In addition to meeting the logFC criterion (logFC > 1 or logFC < -1 by default), an annotation must have an FDR < 0.05 to be considered over- or underrepresented in the contrast group.
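The FDR column follows the Benjamini-Hochberg procedure cited above. The sketch below is a plain implementation of that procedure for illustration, with invented p-values; it is not taken from the tool's code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.5]   # hypothetical p-values of four annotations
fdr = benjamini_hochberg(p_values)
significant = [value < 0.05 for value in fdr]
```

In this example only the first annotation stays below 0.05 after correction, although three of the raw p-values were below that threshold.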

**Figure 19.** Functional Differential Abundance Analysis main results.

## Side Panel

### Summary Report

Creates an HTML report, which can be saved as PDF, with the main results of the differential abundance test: parameters used for the test, number of enriched annotations, experimental design, and the top 10 over- and underrepresented annotations ordered by logFC and FDR (figure 20).

**Figure 20.** Functional Differential Abundance Analysis summary report.

### Summary Chart

Shows a bar chart with the main results: the number of annotations before and after the filtering steps, the annotations that are considered enriched, and the over- and underrepresented ones (figure 21).

**Figure 21.** Functional Differential Abundance Analysis summary chart.

### Set Over/Under Tags

Sets new FDR and fold change cutoffs for considering an annotation significant. The defaults are FDR < 0.05 and logFC < -1 or logFC > 1 (figure 22).

**Figure 22.** Set Over/Under Tags.

### Summary Dot Plot

Shows a dot plot with the main results of the test (figure 23). On the wizard page, you can select the data to include in this chart: whether to show the **over- or the underrepresented features**, whether to order them by **FC or by FDR**, and **how many annotations** the graph will contain (top 10, top 20, etc.).

Once displayed, each row of the graph contains an enriched feature. The **X-axis** represents the effect size (logFC), the **dot color** represents the significance (FDR), and the **dot size** represents the number of genes in the global dataset annotated with this feature.

**Figure 23.** Functional Differential Abundance dot plot.

## References

Erik Kristiansson, Philip Hugenholtz, Daniel Dalevi, ShotgunFunctionalizeR: an R-package for functional comparison of metagenomes, Bioinformatics, Volume 25, Issue 20, 15 October 2009, Pages 2737–2738, https://doi.org/10.1093/bioinformatics/btp508

Österlund, T., Jonsson, V. & Kristiansson, E. HirBin: high-resolution identification of differentially abundant functions in metagenomes. *BMC Genomics* 18, 316 (2017). doi:10.1186/s12864-017-3686-6