## Pairwise Differential Expression Analysis

## Introduction

This tool is designed to perform differential expression analysis of count data arising from RNA-seq technology. This application, based on the edgeR program, allows identification of differentially expressed genomic features (e.g. genes) in a pairwise comparison of two different experimental conditions. The software package **edgeR** (empirical analysis of DGE in R), which belongs to the Bioconductor project, implements quantitative statistical methods to evaluate the significance of individual genes between two experimental conditions.

Please cite edgeR as:

## General Workflow

The workflow for the analysis of differential expression is described in the following scheme (Figure 2).

## Load Data

Go to **File → Load → Load Count Table** and select your .txt file containing the count table in tab-delimited format (Figure 3). It is also possible to create a Count Table within OmicsBox through the "Create Count Table" functionality (see Quantify Expression section).

**Figure 3: **Count Table File

The Count Table can be saved as 'Count Table' object (**File → Save**).

Notes:

This application only accepts raw counts without any type of normalization.

Replicates for each experimental condition are necessary.

## Run Pairwise Differential Expression Analysis

Go to **transcriptomics → Run Differential Expression Analysis** and choose the "Pairwise Differential Analysis" option. Here you can specify the following parameters, which are divided into three different sections: Preprocessing Data (Figure 4), Experimental Design (Figure 5) and Comparison and Test (Figure 6).

### Preprocessing Data Page

**Filter low count genes:****CPM Filter:**Establish a filter to exclude genes with low counts across libraries, as those genes may interfere with the subsequent statistical approximations. Filtering is performed on a count-per-million (CPM) basis to account for differences in library size between samples (e.g. a CPM of 1 corresponds to a count of 6 in a sample with 6 million reads).**Samples reaching CPM Filter:**Set a minimum number of samples in which the gene's CPM is above the filter level (is expressed). If this value is set to e.g. five, at least 5 of the samples have to be above the given CPM. The number of samples of the smallest group is usually used (e.g. in an experiment that has two replicates for each condition (or group), a gene should be expressed in at least two samples). Set value to 0 if no filter is desired.

**Calculate normalization factors to scale the raw library sizes:****Normalization Method:**Here the normalization takes the form of scaling factors for library sizes that enter into the statistical model. These correctional factors are used to compute the effective library sizes. For further details please refer to the edgeR User's Guide. You can select the normalization method to be used:**TMM:**Weighted trimmed mean of M-values. In this method, weights are obtained from the delta method on Binomial Data (this method is recommended).**TMM with Zero Pairing:**This is a variant of TMM that should perform better for data with a high proportion of zeros.**RLE:**Relative log expression. Scale factors are the median ratio of each sample to the median library (geometric mean of all samples).**Upper-quartile:**75% quantile for the counts for each library is used to calculate the scale factors.**None:**No normalization method is applied.

**Figure 4: **Preprocessing Data Page

Experimental Design Page

**Experimental design file:**Select your .txt file containing your experimental factors with the experimental conditions associated with each sample in tab-delimited format. As demonstrated in Figure 7, rows correspond to samples and columns to experimental factors. Make sure that the names in the first column of the experimental design table are exactly the same as the sample names in the count table header. If your experimental design file has fewer samples than in the count table, only the samples contained in this file will be analyzed.

**Figure 7: **Experimental Design File

**Figure 5: **Experimental Design Page

### Comparison and Test Page

**Design Type:**Choose the design type to adjust the analysisSimple design: Makes a pairwise comparison between samples belonging to two experimental conditions. You only have to select the experimental factor of interest and establish the comparison selecting the reference and contrast conditions in ``Primary Target''.

Paired design: Makes a pairwise comparison between samples belonging to two experimental conditions, adjusting for baseline differences of other experimental factors. In this design, you have to establish the conditions for the comparison in ``Primary Target'' and the experimental factor for baseline difference in ``Secondary Target''. This design type is appropriate for paired or blocking design, or experiments with batch effects.

Multifactorial Design: Makes a pairwise comparison between samples belonging to two experimental conditions with two experimental factors. For this design, you have to select the two experimental factors of interest and establish the reference and contrast group for each in ``Primary Target'' and ``Secondary Target''. This design type is appropriate if you want to analyze the effects of combined experimental conditions on gene expression.

**Statistical Test:**Select a Statistical Test:

Exact Test: Based on the quantile-adjusted conditional maximum likelihood (qCML) methods (similar to Fisher's exact test). It is only applicable to datasets with a single factor design (simple design).

GLM (Likelihood Ratio Test): Based on fitting negative binomial Generalized Linear Models (GLMs) with the Cox-Reid dispersion estimates. Is a good choice for inferences with GLMs.

GLM (Quasi Likelihood F-Test): The empirical Bayes quasi-likelihood F-test is an alternative to the Likelihood Ratio Test and provides a more robust and reliable error rate control when the number of replicates is small.

Robust: Estimation is strengthened against potential outlier genes.

**Figure 6: **Comparison and Test Page

## Results

Once the input counts have been processed and analyzed via the "Pairwise Differential Expression Analysis'' tool, a new tab is opened containing the results (Figure 8). The results table contains the differential expression statistics, where each row corresponds to a feature:

**logFC:**A measure that describes how much the expression changes between conditions (log2-fold-changes are shown).**logCPM:**The average log2-counts-per-millions.**LR:**Likelihood ratio statistic for the GLM (Likelihood Ratio Test).**F:**Quasi-likelihood F-statistic for the GLM (Quasi Likelihood F-test).**FDR:**False Discovery Rate calculated by the Benjamini-Hochberg method (multiple hypothesis testing corrections).**Tags:**Indicate whether a gene is upregulated (FDR ≤ 0.05, logFC ≥ 0) or downregulated (FDR ≤ 0.05, logFC ≥ 0).

Genes that have not passed the filtering step are not shown in the new tab.

**Figure 8: **Table Viewer

Results can be saved as a Pairwise Results object. Note that it is not possible to perform the analysis on this object. For this purpose, you have to open the Count Table object. If you want to see both count table and results, go to the File Manager and open the two .b2g files together.

A result page will show a summary of the pairwise differential expression analysis results (Figure 9).

**Figure 9: **Results Summary

If you want to filter for differential expression based on other FDR and/or logFC cutoffs, you can go to **Side Panel **→ **Set Up/Down Tags** and establish new values for both cutoffs. Tags will be updated, and the result section of the Result Summary and statistical charts will change according to the new cutoffs. To view, the updated summary results go to **Side Panel **→ **Result Summary **and it can be exported in pdf.

## Charts and Statistics

Different statistics charts can be generated for a global visualization of the results. These charts can be found under the **Side Panel **of the Pairwise Results viewer.

### MDS Plot

Generates a two-dimensional scatterplot in which the distances represent the typical log2 fold changes between samples. You can select an experimental factor by which you want to color the MDS graphic (Figure 10(a)).

**Figure 10(a): **MDS Plot

### Results Chart

Bar chart which shows the number of total features, kept features (those who have passed the filtering step), differentially expressed features, up-regulated features and down-regulated features (Figure 10(b)).

**Figure 10(b): **Result Summary

### Volcano Plot

A scatter plot that is constructed by plotting the negative log of the adjusted p-values (FDR) on the y-axis versus the log of the fold changes on the x-axis (Figure 10(c)). Upregulated and downregulated genes are shown in green and red respectively.

**Figure 10(c): **Volcano Plot

### MA Plot

A scatter plot showing the log of the fold changes on the y-axis versus the average of the log of the CPM on the x-axis. Differentially expressed genes are highlighted (Figure 10(d)).

**Figure 10(d): **MA Plot

### Heatmap

A heatmap is a two-dimensional visual representation of data in which numerical values of points are represented by a range of colors (Figure 10 (e)). The dendrograms added to the left and top side are produced by a hierarchical clustering method that takes as input the Euclidean distance computed between genes (left) and samples (top).

The heatmap supports zooming by keeping clicked a node of either of the two dendrograms. The first bars contain the experimental design of the data showing the association between samples and experimental covariates.

Genes that will be displayed can be selected in the wizard. There are three options:

The Top 50 differentially expressed genes (ranked by FDR).

All differentially expressed genes.

Provide an ID list containing the genes to represent.

Differentially expressed genes are those that are labeled as UP or DOWN in the table project ("Tags" column). The criteria for considering a gene as differentially expressed can be adjusted using the option "Set Up/Down Tags".

Furthermore, the wizard allows adjusting the type of expression data that will be represented, as well as the transformation that can be applied to this data.

**Figure 10(e): **Heatmap

## Enrichment Analysis

It is possible to perform a functional enrichment analysis from the pairwise differential expression project. Both options, Fisher's Exact Test and Gene Set Enrichment Analysis, can be found under the **Side Panel **of the Pairwise Results viewer. For a detailed tutorial on how to launch an Enrichment Analysis from Pairwise Differential Expression results, link here.

### Fisher's Exact Test

Choose the subset of genes that will be considered as Test-set. Up-regulated and down-regulated genes are those that are tagged according to the criteria established by the option "Set Up/Down Tags".

The project containing the functionally annotated sequences that will be used as a reference background set should be provided.

The rest of the parameters are explained in the Fisher's Exact Test section.

### Gene Set Enrichment Analysis

The ranked gene list will be computed using the following formula:

**Rank = sign(logFC) * -log10(P-Value)**

The project containing the functionally annotated sequences that will be used as a reference background set should be provided.

The rest of the parameters are explained in the Gene Set Enrichment Analysis section.