## Pairwise Differential Expression Analysis

# Introduction

This tool is designed to perform differential expression analysis of count data arising from RNA-seq technology. This application, based on the edgeR program, allows the identification of differentially expressed genomic features (e.g. genes) in a pairwise comparison of two different experimental conditions. The software package **edgeR** (empirical analysis of DGE in R), which belongs to the Bioconductor project, implements quantitative statistical methods to evaluate the significance of individual genes between two experimental conditions.

Please cite edgeR as: Robinson MD, McCarthy DJ and Smyth GK (2010). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics, 26, pp. -1.

**Figure 1: **Differential Expression Analysis Interface

## Expression Data

The pairwise differential expression analysis application expects gene expression levels in the form of a count table. In OmicsBox, count tables can be generated via the **Create Count Table** application.

Count tables can also be imported from a text file. Go to **File → Load → Load Count Table** (Figure 2) and select your .txt file containing the count table.

Notes:

This application only accepts raw counts without any type of normalization.

Replicates for each experimental condition are necessary.

**Figure 2: **Load Count Table from File

# Run Pairwise Differential Expression Analysis

Go to **transcriptomics → Run Differential Expression Analysis** and choose the "Pairwise Differential Analysis" option. Here you can specify the following parameters, which are divided into three different sections: Preprocessing Data (Figure 3), Experimental Design (Figure 4), and Comparison and Test (Figure 5).

## Preprocessing Data Page

**Filter low count genes:****CPM Filter:**Establish a filter to exclude genes with low counts across libraries, as those genes may interfere with the subsequent statistical approximations. Filtering is performed on a count-per-million (CPM) basis to account for differences in library size between samples (e.g. a CPM of 1 corresponds to a count of 6 in a sample with 6 million reads).**Samples reaching CPM Filter:**Set a minimum number of samples in which the gene's CPM is above the filter level (is expressed). If this value is set to e.g. five, at least 5 of the samples have to be above the given CPM. The number of samples of the smallest group is usually used (e.g. in an experiment that has two replicates for each condition (or group), a gene should be expressed in at least two samples). Set value to 0 if no filter is desired.

**Calculate normalization factors to scale the raw library sizes:****Normalization Method:**Here the normalization takes the form of scaling factors for library sizes that enter into the statistical model. These correctional factors are used to compute the effective library sizes. For further details please refer to the edgeR User's Guide. You can select the normalization method to be used:**TMM:**Weighted trimmed mean of M-values. In this method, weights are obtained from the delta method on Binomial Data (this method is recommended).**TMM with Zero Pairing:**This is a variant of TMM that should perform better for data with a high proportion of zeros.**RLE:**Relative log expression. Scale factors are the median ratio of each sample to the median library (geometric mean of all samples).**Upper-quartile:**75% quantile for the counts for each library is used to calculate the scale factors.**None:**No normalization method is applied.

**Figure 3: **Preprocessing Data Page

## Experimental Design Page

**Experimental design file:**Select your .txt file containing your experimental factors with the experimental conditions associated with each sample in tab-delimited format. As shown below, rows correspond to samples and columns to experimental factors. Make sure that the names in the first column of the experimental design table are exactly the same as the sample names in the count table header. If your experimental design file has fewer samples than in the count table, only the samples contained in this file will be analyzed.

Name Strain SRR3666079 FA1090 SRR3666080 FA1090cpxA SRR3666081 FA1090cpxR SRR3666082 FA1090 SRR3666083 FA1090cpxA SRR3666084 FA1090cpxR SRR3666085 FA1090 SRR3666086 FA1090cpxA SRR3666087 FA1090cpxR SRR3666088 FA1090 SRR3666089 FA1090cpxA SRR3666090 FA1090cpxR

**Figure 4: **Experimental Design Page

## Comparison and Test Page

**Design Type:**Choose the design type to adjust the analysisSimple design: Makes a pairwise comparison between samples belonging to two experimental conditions. You only have to select the experimental factor of interest and establish the comparison selecting the reference and contrast conditions in ``Primary Target''.

Paired design: Makes a pairwise comparison between samples belonging to two experimental conditions, adjusting for baseline differences of other experimental factors. In this design, you have to establish the conditions for the comparison in ``Primary Target'' and the experimental factor for baseline difference in ``Secondary Target''. This design type is appropriate for paired or blocking design, or experiments with batch effects.

Multifactorial Design: Makes a pairwise comparison between samples belonging to two experimental conditions with two experimental factors. For this design, you have to select the two experimental factors of interest and establish the reference and contrast group for each in ``Primary Target'' and ``Secondary Target''. This design type is appropriate if you want to analyze the effects of combined experimental conditions on gene expression.

**Statistical Test:**Select a statistical test.Exact Test: Based on the quantile-adjusted conditional maximum likelihood (qCML) methods (similar to Fisher's exact test). It is only applicable to datasets with a single factor design (simple design).

GLM (Likelihood Ratio Test): Based on fitting negative binomial Generalized Linear Models (GLMs) with the Cox-Reid dispersion estimates. Is a good choice for inferences with GLMs.

GLM (Quasi Likelihood F-Test): The empirical Bayes quasi-likelihood F-test is an alternative to the Likelihood Ratio Test and provides a more robust and reliable error rate control when the number of replicates is small.

**Robust**: Estimation is strengthened against potential outlier genes.

**Figure 5: **Comparison and Test Page

# Results

Once the input counts have been processed and analyzed via the "Pairwise Differential Expression Analysis'' tool, a new tab is opened containing the results (Figure 6). The results table contains the differential expression statistics, where each row corresponds to a feature:

**logFC:**A measure that describes how much the expression changes between conditions (log2-fold-changes are shown).**logCPM:**The average log2-counts-per-millions.**LR:**Likelihood ratio statistic for the GLM (Likelihood Ratio Test).**F:**Quasi-likelihood F-statistic for the GLM (Quasi Likelihood F-test).**FDR:**False Discovery Rate calculated by the Benjamini-Hochberg method (multiple hypothesis testing corrections).**Tags:**Indicate whether a gene is upregulated (FDR ≤ 0.05, logFC ≥ 0) or downregulated (FDR ≤ 0.05, logFC ≥ 0).

Genes that have not passed the filtering step are not shown in the new tab.

Results can be saved as a Pairwise Results object. Note that it is not possible to perform the analysis on this object. For this purpose, you have to open the Count Table object.

**Figure 6: **Pairwise Differential Expression Results

A result page will show a summary of the pairwise differential expression analysis results (Figure 7).

**Figure 7: **Results Summary

If you want to filter for differential expression based on other FDR and/or logFC cutoffs, you can go to **Side Panel **→ **Set Up/Down Tags** (Figure 8) and establish new values for both cutoffs. Tags will be updated, and the result section of the Result Summary and statistical charts will change according to the new cutoffs. To check the updated summary results go to **Side Panel **→ **Result Summary **and it can be exported in pdf.

**Figure 8: **Set Up/Down Tags

## Charts and Statistics

Different statistics charts can be generated for a global visualization of the results. These charts can be found under the **Side Panel **of the Pairwise Results viewer.

### Results Chart

Bar chart which shows the number of total features, kept features (those who have passed the filtering step), differentially expressed features, up-regulated features, and down-regulated features (Figure 9).

**Figure 9: **Result Summary

### MDS Plot

Generates a two-dimensional scatterplot in which the distances represent the typical log2 fold changes between samples. You can select an experimental factor by which you want to color the MDS graphic (Figure 10).

**Figure 10: **MDS Plot

### Volcano Plot

A scatter plot constructed by plotting the negative log of the adjusted p-values (FDR) on the y-axis versus the log of the fold changes on the x-axis (Figure 11). Upregulated and downregulated genes are shown in green and red respectively.

**Figure 11: **Volcano Plot

### MA Plot

A scatter plot showing the log of the fold changes on the y-axis versus the average of the log of the CPM on the x-axis. Differentially expressed genes are highlighted (Figure 11).

**Figure 11: **MA Plot

### Heatmap

A heatmap is a two-dimensional visual representation of data in which numerical values of points are represented by a range of colors (Figure 12). The dendrograms added to the left and top sides are produced by a hierarchical clustering method that takes as input the Euclidean distance computed between genes (left) and samples (top).

The heatmap supports zooming by keeping clicked a node of either of the two dendrograms. The first bars contain the experimental design of the data showing the association between samples and experimental covariates.

Genes that will be displayed can be selected in the wizard. There are three options:

The Top 50 differentially expressed genes (ranked by FDR).

All differentially expressed genes.

Provide an ID list containing the genes to represent.

Differentially expressed genes are those that are labeled as UP or DOWN in the table project ("Tags" column). The criteria for considering a gene as differentially expressed can be adjusted using the option "Set Up/Down Tags".

Furthermore, the wizard allows adjusting the type of expression data that will be represented, as well as the transformation that can be applied to this data.

**Figure 12: **Heatmap

# Enrichment Analysis

It is possible to perform a functional enrichment analysis from the pairwise differential expression project. Both options, Fisher's Exact Test and Gene Set Enrichment Analysis can be found under the **Side Panel **of the Pairwise Results viewer. For a detailed tutorial on how to launch an Enrichment Analysis from Pairwise Differential Expression results, link here.

## Fisher's Exact Test

Fisher’s Exact Test can be used to find GO terms that are over and under-represented in a set of genes (test set) with respect to a reference group (reference set). This set of genes can be the differentially expressed genes of differential expression analysis, a set of genes related to a phenotype of interest, etc. Fisher’s Exact Test uses a contingency table-based method to examine the association between two kinds of classification.

The subset of genes that will be considered as a Test-set (Figure 13) has to be provided. Up-regulated and down-regulated genes are those that are tagged according to the criteria established by the option "Set Up/Down Tags".

The project containing the functionally annotated sequences that will be used as a reference background set should be provided.

The remaining parameters are explained in the Fisher's Exact Test section.

**Figure 13: **Fisher’s Exact Test

## Gene Set Enrichment Analysis

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).

The ranked gene list will be computed using the following formula:

**Rank = sign(logFC) * -log10(P-Value)**

The project containing the functionally annotated sequences that will be used as a reference background set should be provided (Figure 14).

The rest of the parameters are explained in the Gene Set Enrichment Analysis section.

**Figure 14: **Gene Set Enrichment Analysis