Pairwise Differential Expression Analysis (Without Replicates)

Content of this page:

Introduction

Detecting genes that are differentially expressed between two experimental conditions (e.g. diseased vs healthy individuals) is a fundamental part of understanding the molecular basis of phenotypic variation. To carry out this task, there are statistical tools designed to perform differential expression analysis of the count data arising from RNA-seq technology (e.g. edgeR and maSigPro are statistical packages integrated into OmicsBox). However, these tools usually require the presence of replicates (both biological and technical) of each experimental condition that will be testedThis is a problem in cases where no replicates are available.

The Pairwise Differential Expression Analysis (Without Replicates) functionality offers a strategy for analyzing RNA-seq datasets that do not have replicates. It is based on the software package NOISeq, which belongs to the Bioconductor project. NOISeq is a novel nonparametric approach for the identification for differentially expressed genes from RNA-Seq count data. NOISeq creates a null or noise distribution of count changes by contrasting fold-change differences (M) and absolute expression differences (D) for all the genes in samples within the same condition. This reference distribution is then used to assess whether the M and D values computed between two conditions for a given gene are likely to be part of the noise or represent a true differential expression. 

NOISeq method was designed to compute differential expression on RNA-Seq data even when there are no replicates available for any of the experimental conditions. In this scenario, NOISeq can simulate technical replicates. The simulation relies on the assumption that read counts follow a multinomial distribution, where probabilities for each class (feature) in the multinomial distribution are the probability of a read to map to that feature. These mapping probabilities are approximated by using counts in the only sample of the corresponding experimental conditions. Given the sequencing depth (total amount of reads) of the unique available sample, the size of the simulated is a percentage of this sequencing depth, allowing a small variability. 

Please remember that to obtain really reliable statistical results, biological replicates are needed.

Please cite NOISeq as:

Tarazona S, Furio-Tari P, Turra D, Di Pietro A, Nueda MJ, Ferrer A and Conesa, A (2015). "Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package." Nucleic Acids Research, 43(21), e140.

Tarazona S, Garcia-Alcalde F, Dopazo J, Ferrer A and Conesa A (2011). "Differential expression in RNA-seq: a matter of depth." Genome Research, 21(12), 2213-2223.


Run Pairwise Differential Expression Analysis (Without Replicates)

Go to transcriptomics → Run Differential Expression Analysis and choose the "Pairwise Differential Expression Analysis (Without replicates) option. This application requires a Count Table object as input data. Use the Gene-level or Transcript-level expression quantification functionality to obtain a count table from RNA-Seq data.  It is also possible to load a count table from a tabular text file (go to File → Load → Load Count Table). The wizard allows to adjust analysis parameters (Figure 1 and Figure 2).

Preprocessing Data

  • Filter low count genes:
    • CPM FIlter: Establish a filter to exclude genes with low counts across libraries, as those genes may interfere with the subsequent statistical approximations. Filtering is performed on a count-per-million (CPM) basis to account for differences in library size between samples (e.g. a CPM of 1 corresponds to a count of 6 in a sample with 6 million reads). To pass the filter, the gene's CPM should be above the filter level in at least one sample (contrast or reference sample).
  • Normalization procedure: 
    • Normalization Method: Normalization is an important step to make the samples comparable and to remove possible biases (as sequencing depth bias) in count data. The normalization methods available for this analysis are:

      • TMM: Weighted trimmed mean of M-values. In this method, weights are obtained from the delta method on Binomial Data (this method is recommended).
      • RPKM: Reads Per Kilobase per Million mapped reads. This method corrects for gene length and the number of sequencing reads (gene length is required).
      • Upper-quartile: 75% quantile for the counts for each library is used to calculate the scale factors for normalization.
      • None: No normalization method is applied.
    • Feature Length File: For RPKM normalization load a tab-delimited file (or ID-Value object) with two columns containing the name and length of each gene or genomic feature.

Figure 1: Preprocessing Data Page

Analysis Options

  • Replicates Simulation:
    • Number of Simulated Replicates: Set the number of replicates to be simulated for each condition.
    • Size of the Simulated Replicates: Establish the percentage of the total reads used to simulate each sample. 
    • Variability: Variability in the simulated sample total reads. 
  • Targets:
    • Contrast Condition: Choose the sample to be treated as contrast condition. Genes classified as UP will be upregulated in this sample. 
    • Reference Condition: Choose the sample to be treated as reference condition. Genes classified as DOWN will be upregulated in this sample. 

Figure 2: Analysis Options Page

Results

Once the input counts have been processed and analyzed via the "Pairwise Differential Expression Analysis (Without Replicates)" feature, a new tab is opened containing the results (Figure 3). The results table contains the differential expression statistics, where each row corresponds to a feature:

  • Contrast Condition: Normalized expression values for the contrast condition sample.
  • Reference Condition: Normalized expression values for the reference condition sample.
  • M: Is the log2-ratio of the two conditions. 
  • D: The value of the difference between the conditions. 
  • Probability: The probability of differential expression for each feature. It is obtained by comparing the M and D values of a given feature against the noise distribution. If the probability is higher than a given threshold (0.9 by default), the feature is considered to be differentially expressed between conditions.
  • Ranking: Is a summary statistic of M and D values equal to -sign(M)*sqrt(M^2 + D^2), which can be used as a ranked value in gene set enrichment analysis (GSEA).

Genes that have not passed the filtering step are not shown in the results tab.

Figure 3: Table Viewer

Results can be saved as a Pairwise Results object. Note that it is not possible to perform the analysis on this object. For this purpose, you have to open the Count Table object. If you want to see both count table and results, go to the File Manager and open the two .box files together.

A result page will show a summary of the pairwise differential expression analysis results (Figure 4).

Figure 4: Results Summary

If you want to filter for differential expression based on another probability threshold, you can go to Side Panel → Set Up/Down Tags and establish a new threshold value. Tags will be updated, and the result section of the Result Summary and statistical charts will change according to the new cutoffs. To view, the updated summary results go to Side Panel → Result Summary and it can be exported in PDF.

During the Pairwise Differential Expression Analysis (without replicates), raw counts are transformed according to the normalization method selected in the analysis configuration. Go to Export Normalized Counts (side bar) to export normalized counts to a tabular text file. 

Charts and Statistics

Different statistics charts can be generated for a global visualization of the results. These charts can be found under the Side Panel of the Pairwise Results Viewer.

MDS Plot

It generates a two-dimensional scatterplot in which the distances represent the typical log2 fold changes between samples. You can select an experimental factor by which you want to color the MDS graphic (Figure 5(a)).

This plot is only available if the input count table contains more than 2 samples (although only two of them are compared).

Figure 5(a): MDS Plot

Results Chart

Bar chart which shows the number of total features, kept features (those who have passed the filtering step), differentially expressed features, up-regulated features and down-regulated features (Figure 5(b)).

Figure 5(b): Results Chart

Expression Plot

A scatter plot showing the average expression values of each condition (Figure 5(c)). Differentially expressed features considering the probability threshold (0.9 by default) will be highlighted in red.

Figure 5(c): Expression Plot

MD Plot

A scatter plot showing the log-fold change (M) and the absolute value of the difference in expression between conditions (D). D values are displayed in a log-scale (Figure 5 (d)). 

Figure 5(d): MD Plot

Heatmap

A heatmap is a two-dimensional visual representation of data in which numerical values of points are represented by a range of colors (Figure 5 (e)). The dendrograms added to the left and top side are produced by a hierarchical clustering method that takes as input the Euclidean distance computed between genes (left) and samples (top). 

The heatmap supports zooming by keeping clicked a node of either of the two dendrograms. The first bars contain the experimental design of the data showing the association between samples and experimental covariates. 

Genes that will be displayed can be selected in the wizard. There are three options:

  • The Top 50 differentially expressed genes (ranked by FDR). 
  • All differentially expressed genes.
  • Provide an ID list containing the genes to represent.


Differentially expressed genes are those that are labeled as UP or DOWN in the table project ("Tags" column). The criteria for considering a gene as differentially expressed can be adjusted using the option "Set Up/Down Tags". 

Furthermore, the wizard allows to adjust the type of expression data that will be represented, as well as the transformation that can be applied to this data.

Figure 5(e): Heatmap

Enrichment Analysis

It is possible to perform a functional enrichment analysis from the pairwise differential expression project. Both options, Fisher's Exact Test and Gene Set Enrichment Analysis, can be found under the Side Panel of the Pairwise Results viewer. For a detailed tutorial on how to launch an Enrichment Analysis from Pairwise Differential Expression results, link here.


Fisher's Exact Test

Choose the subset of genes that will be considered as Test-set. Up-regulated and down-regulated genes are those that are tagged according to the criteria established by the option "Set Up/Down Tags". 

The project containing the functionally annotated sequences that will be used as a reference background set should be provided. 

The rest of the parameters are explained in the Fisher's Exact Test section.

Gene Set Enrichment Analysis

The "Probability Threshold for Ranked List" parameters allows setting a filter to exclude those genes whose probability value is not above it. The ranked gene list will be created using the "ranking" statistic. 

The project containing the functionally annotated sequences that will be used as a reference background set should be provided. 

The rest of the parameters are explained in the Gene Set Enrichment Analysis section