Single-cell RNAseq Clustering

Introduction

This tool is designed to perform the clustering of cells coming from Single-Cell RNA Sequencing (scRNA-seq) data. That is, it aims to find groups of cells that share similar expression patterns, which should correspond to the same cell type or state. Prior to the clustering, this tool allows filtering and preprocessing the data in order to make it suitable for the clustering algorithm (Figure 1). This application is based on the widely-used Seurat package.

Please cite Seurat as:

Butler, A., Hoffman, P., Smibert, P. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36, 411–420 (2018). https://doi.org/10.1038/nbt.4096

Stuart, T., Butler, A., Hoffman, P. et.al. Comprehensive Integration Of Single-Cell Data. Cell 177 (7): 1888-1902.e21 (2019). doi:10.1016/j.cell.2019.05.031.

Figure 1. General workflow for the single-cell RNA seq clustering analysis

Run single-cell RNA sequencing clustering

Go to transcriptomics → Single Cell RNA-seq →scRNA-seq Clustering. A new wizard will be opened. Here you can modify several parameters of the analysis, which are divided into five different sections: Input (Figure 2), Filtering (Figure 4), Multi-sample Analysis (Figure 7), Processing (Figure 9), and Clustering (Figure 11).

Input

On this wizard (Figure 2), you can specify a file with a count table. This count table must contain cells in columns, genes in rows, and each value corresponds to the gene expression level (Figure 3). This gene expression level can be the raw number of reads or UMIs detected for each gene and cell, or even already-normalized values.

It can be added either in a .box or a .txt file via the Count Table Project and Count Table file options, respectively.

  • The Count Table Project is a .box file generated by OmicsBox. It can be obtained by:

    • Loading a .txt count table into OmicsBox going to “File → Load → Load Count Table (expression data)” (see File Menu section) and saving the object.

    • Creating it through the "Create Count Table" functionality (see Quantify Expression section). This tool is only suitable for full-length scRNA-seq data (e.g. SMARTseq, SMARTer, etc.).

  • The Count Table File is a .txt file, as it is shown in Figure 3. Moreover, it is possible to specify the column separator in the Column Separator option, so it accepts different file formats. It accepts files with columns separated by TAB (“\t”), Space (“ “), Comma (“,”), and Semicolon (“;”). It has to be taken into account that the analysis does not allow for NA values, so there are two options to handle them in the NA Values parameter: “Skip Line”, that is, the whole line containing a NA value is removed, or “Assume Zero Values”, in which the NA is replaced by a 0 and the whole line is kept.

Figure 2: Input Page

Figure 3. Count table file example.

Configuration 1. Filtering Data Page.

In addition to filtering low-quality reads like in bulk RNA-seq analysis, it is important to filter low-quality cells. This can be easily achieved by filtering cells with an excessive or low total number of reads or counts. The first case could correspond to doublets or multiples, that is, groups of two or more cells that have been sequenced together. The second case could correspond to broken cells that could have lost part of their RNA content. Broken cells can also be identified by the percentage of total reads or counts that correspond to mitochondrial RNA (mtRNA). It is so because cells with broken cellular membranes could still conserve the mtRNA inside the mitochondrial membrane, so the percentage of mtRNA is greater compared to intact cells. Furthermore, it is also advisable to filter cells with a low number of mapped reads, because it could be indicative that the RNA content was degraded. This last case can be evidenced by a low number of detected features. So in order to perform this step, the available filters are (Figure 4):

  • Minimum Cells. Include features (genes, exons, etc.) detected in at least this number of cells. This filter is meant to exclude features that are not very informative. This filter removes rows from the count table.

The following filters remove cells (columns) from the count table:

  • Minimum Counts. Discard cells with less than this number of reads/counts.

  • Maximum Counts. Discard cells with more than this number of reads/counts.

  • Minimum Features. Discard cells with less than this number of features.

  • Maximum Features. Discard cells with more than this number of features.

  • Maximum % Mitochondrial Genes. Discard cells with more than this percentage of mitochondrial genes.

  • Mitochondrial Genes File. File with a list of mitochondrial genes, one per line (Figure 5). The gene IDs or names must correspond to the ones used in the count table.

Figure 4. Filtering Data Page.

Figure 5. Mitochondrial Genes File example.

Configuration 2. Multi-sample Analysis Page.

This is an additional step needed when the input count table contains data coming from multiple samples. These samples could come from wild-type and mutant organisms, from different time spans, from control and stimulated samples, etc. All these situations could cause changes in gene expression that could make a joint analysis of all the data difficult, with cells clustering both by condition and by cell type (Figure 6-A). In order to avoid that, the Multi-sample Data Integration step aims to integrate scRNA-seq datasets by identifying common cell types based on common sources of variation. As a result, this step enables the identification of shared populations across datasets (Figure 6-B and C) and thus further downstream comparative analyses. For more details about the integration algorithm please read the papers from the Seurat Team: Butler et al. 2018 and Stuart et al. 2019. In this step, it is possible to modify the following parameters (Figure 7):

  • Multi Sample Analysis. This option must be checked in order to perform the Data Integration step.

  • Experimental Design File. File with a list of all cells with the condition they belong, one per line. Each line must have two or more columns separated by a tab: the first one with a cell ID and the other ones with the conditions they belong to (Figure 8).

  • Condition. Choose the condition to integrate datasets by. That is, the condition that you don’t want to interfere in the clustering of cells.

The next parameters modify the behavior of the integration algorithm. In case the data integration has not been successful, it is advisable to modify these parameters to try to improve the results. See Figure 6-A for a “bad” and Figure 6-B for a “good” integrations. The algorithm consists of two steps: it firsts computes a dimensional reduction using Canonical Correlation Analysis (CCA) and then performs a mutual nearest neighbors (MNNs) algorithm to identify shared subpopulations across datasets. In this way it finds “anchors” to integrate the datasets, that is, equivalent cells across datasets.

  • N. Dimensions for Integration. The number of dimensions to use from the Canonical Correlation Analysis for the integration step.

  • K Anchor. How many neighbors (k) to use during the MNNs when picking anchors.

  • K Filter. How many neighbors (k) to use during the MNNs when filtering anchors.

  • K Score. How many neighbors (k) to use during the MNNs when scoring anchors.

  • K Weight. How many neighbors (k) to use during the MNNs when weighing anchors.

Figure 6.  UMAP plots of 10,799 cells coming from wild type (WT) and Rb mutant Drosophila eye disc, prior to (A) and post (B) alignment. After alignment, cells across conditions are grouped together based on shared cell type, allowing for a single joint clustering (C) to detect 11 populations.

Figure 7. Multi-sample Data Integration Page

Figure 8. Experimental Design File example.

Configuration 3. Data Processing Page.

These steps are meant to prepare data for the clustering analysis (Figure 9).

Normalization

Normalization aims to reduce differences in gene expression due to technical variation. This step tries to ensure that the observed heterogeneity within cells is due to biological reasons, rather than technical biases.

  • Normalize Data. Check this option to perform the normalization step. Extremely recommendable unless your data is already normalized or you are sure that your data does not need it.

  • Normalization Method. Which normalization method to use:

    • Regularized Negative Binomial Regression. This option uses the SCTransform normalization method designed by the Seurat team. This method applies a generalized linear model for each gene, with counts as the response and sequencing depth as the explanatory variable. This is used to learn the parameters of the model that depend on a gene’s average expression. Then, a negative binomial regression is applied and the Person residuals (the difference between the expected and the observed expression) are treated as the normalized expression levels. For more details about the statistical method please see Hafemeister and Satija 2019.

    • Log Normalization. Feature (gene) counts for each cell are divided by the total counts for that cell and multiplied by a scale factor. This is then natural-log transformed using log1p.

    • Relative Counts. Feature counts for each cell are divided by the total counts for that cell and multiplied by a scale factor. No log transformation is applied.

    • Centered Log Ratio Transformation (CLR). For each feature, the CLR-transformation is defined as the logarithm of the feature counts for one cell, divided by the geometric mean of all counts for that cell.

Data Adjustment

These steps are necessary whether the normalization is done or not, in order to prepare data for the dimensional reduction step.

  • High Variable Genes. Number of high variable genes to keep for further analysis. Keeping only those genes reduce both computational cost and non-relevant signals.

  • Scale Data. If checked, it scales features to have unit variance.

  • Center Data. If checked, it centers features to have zero mean.

Data Correction

  • Regress Out Mitochondrial Genes. Check this option to reduce the heterogeneity between cells associated with the expression of mitochondrial genes. This could prevent grouping cells during the clustering step that present a higher expression of those genes. In some cases, this information could not be informative since the higher expression of mitochondrial genes could be caused by technical reasons (e.g. because the mRNA has been leaked during cell manipulation) rather than by biological reasons. In order to perform this analysis, it must be provided a file with a list of mitochondrial genes, one per line (Figure 5). If there’s already a mitochondrial genes file uploaded on the Configuration 1 page, it will appear on the current page as well. It is possible to browse for another file if desired.

  • Regress Out Cell Cycle Genes. Check this option to reduce the heterogeneity between cells associated with the expression of cell cycle genes. This could prevent grouping cells during the clustering step that are in the same developmental stage, independently of its type. In some cases, these differences in cell cycle may be uninformative but, in other cases, they could be treated as indicative of proliferating cell populations which can be different across treatment conditions, for example. So whether to check this option or not will depend on the dataset under study and the target of the experiment. In order to perform this analysis, it must be provided a file with a list of cell cycle genes, one per line, with the cell cycle phase it belongs to (S or G2/M) separated by a tab (Figure 10).

Data Reduction

Common single-cell RNAseq analysis implies from hundreds to thousands of cells and tens of thousands of genes. That is, it is really high dimensional data, so it is advisable to reduce the dimensionality of the dataset in order to reduce the computational cost of further analysis. The most widely used method for dimensional reduction is Principal Component Analysis (PCA). Keeping only the firsts Principal Components for further analysis reduces the dimensionality of the data while keeping the heterogeneity of the dataset since they explain the largest amount of variance present in the sample. For more details please see "Orchestrating Single-Cell Analysis", 2020.

  • Principal Components. Number of Principal Components to compute from the Principal Component Analysis. It must be smaller than the number of cells present in the count table.

Figure 9. Data Processing Page.

Figure 10. Cell Cycle Genes File.

Configuration 4. Clustering Page

This step performs the actual cell clustering. It groups cells with similar expression patterns, which should correspond to the same cell type. For this analysis, Seurat uses a graph-based clustering algorithm. For more information, please see "Orchestrating Single-Cell Analysis", 2020.

Clustering

These parameters affect the clustering algorithm (Figure 11).

  • Define Dimensions by. How to decide the number of dimensions to use from the dimensional reduction (PCA) for the clustering step.

    • Elbow Point. This option automatically decides the number of dimensions to use. The assumption is that each of the top PCs capturing biological signals should explain much more variance than the remaining PCs. Thus, there should be a sharp drop in the percentage of variance explained from the last “biological” PC to the others. This is the so-called “Elow Point”. It should be taken into account that strong biological variation in the early PCs will shift the elbow point to the left, potentially excluding weaker (but still interesting) variation in the next PCs immediately following it. So it could be interesting to repeat the analysis changing the number of dimensions established by the Elbow Point to see if it improves the results.

    • Manual. With this option, it is necessary that the user specifies the number of PCs to use in the “Number of Dimensions” parameter.

  • Number of Dimensions. The number of dimensions to use from the dimensional reduction for the clustering step. This option is only available if the “Manual” option from the “Dimensions Choice” is selected.

  • k-value. The number of neighboring cells computed in the clustering algorithm. Normally, a greater k-value would produce a smaller number of clusters, and vice versa.

  • Resolution. This parameter determines how "fine" the clustering is: values above 1.0 would produce a larger number of clusters, and vice versa.

UMAP visualization

These parameters affect only the UMAP visualization. The Uniform Manifold Approximation and Projection (UMAP) method is a non-linear dimensionality reduction technique (similar to PCA), but this time it is used to plot the high-dimensional single-cell data into two dimensions. The UMAP is used for visualization purposes because it represents the variability of single-cell data more accurately than the PCA. For more information, please refer to "Orchestrating Single-Cell Analysis", 2020.

  • Point’s Minimum Distance. This controls how tightly are the points (cells) compressed together.

  • Point’s Spread. This controls how expanded the points are. In combination with the minimum distance, this determines how clustered the points are.

Figure 11. Clustering Page.

Results

Once the input counts have been processed and analyzed via the "scRNA-seq clustering '' tool, a new tab is opened containing the Clustering Results (Figure 12). This new tab contains a count table but, this time, columns are cell clusters. Each slot holds the mean gene expression of each gene, taking into account all the cells belonging to that cluster.

It also generates a Summary Report (Figure 13):

  • The “Input” section specifies the name of the input count table file used.

  • The “Data Overview” table shows the value of some statistics before and after the filtering step: regarding the total number of cells, the total number of counts (or reads) per cell, and the total number of features per cell.

  • The “Experimental Design” section informs about the conditions present in the Experimental Design File if specified.

  • The “Clustering Results” table shows the total number and a link to a list of cells belonging to each cluster. Before the table, there’s one line specifying the number of dimensions from the PCA used for the clustering step.

  • The “Analysis Parameters” table shows the parameters used for the clustering analysis.

In addition, two charts are generated: the Elbow Plot (Figure 14) and the UMAP (Figure 15). The Elbow Plot shows the amount of variance (given by the standard deviation) explained by successive PCs. This helps to decide how many principal components to use for the clustering algorithm, in case you want to re-run the clustering with different parameters. In the UMAP visualization, the X and Y axis are the values in the UMAP coordinate 1 and 2 of each cell, respectively, and colored by the cluster assigned by the graph-based clustering algorithm. You can play with the options available in the toolbar: change cluster color, rename it, zoom in and out, etc.

Figure 12. scRNA-seq Clustering results example.

Figure 13. Part of Summary Report example.

Figure 14. Elbow Plot chart example.

Figure 15. UMAP chart example.

Side Panel Features

Summary Report

It shows the Summary Report previously explained in the above “Results” section.

Cluster-Cell Relationship

It opens a new tab with an ID-value list with clusters and cells belonging to each one of them (Figure 16).

Figure 16. Cluster-Cells relationship example tab.

Charts

UMAP

It shows the UMAP plot. It has different coloring options available in the wizard (Figure 17):

  • Color by Cluster. It shows the same UMAP plot as the one explained in the above “Results” section (Figure 15).

  • Color by Gene Expression. It shows the UMAP plot but this time cells (dots) are colored according to the expression level of the gene specified in the “Gene” text box (Figure 18 A).

  • Color by Condition. It shows the UMAP with cells colored by the condition they belong to (Figure 18 B). This option is only available if an Experimental Design is specified during the Clustering Analysis (see “Configuration 2. Multi-sample Analysis Page” in the “Run single-cell RNA sequencing clustering” section).

Expression Profile

With this feature, you can see the expression levels of known gene markers across the different clusters in order to identify putative cell types. To this end, a Bubble Plot is generated with clusters in rows and the specified genes in columns (Figure 19). The size of the dot represents the percentage of cells expressing the gene, that is, the percentage of cells that have a gene expression level greater than 0. The color represents the average gene expression in that cluster. You can configure the following options in the wizard (Figure 20):

Input genes.
  • Input Genes. You can specify here which genes to plot. The gene name or ID should correspond to the one in the input count table used during the clustering analysis (see section “Input” in the “Run single-cell RNA sequencing clustering” section). It can be done in two ways:

    • Text: write the genes to plot in the “Genes List” text box, one per line.

    • File: specify a file containing the genes to plot, one gene per line.

Plot Options.

This affects the visualization.

  • Scale Gene Expression. When checked, it applies the Z-Score transformation to scale average gene expression across clusters. It allows the visualization of both highly and lowly expressed genes on the same color scale. It should be noted that this may exaggerate the results, but it is still advisable if you are going to plot genes with different expression level ranges. If unchecked, it colors the dots by the raw average gene expression of each cluster.

Figure 17. UMAP plot wizard (A) with no experimental design and (B) with experimental design.

Figure 18. UMAP plots (A) colored by gene expression and (B) colored by condition.

Figure 19. Bubble Plot example.

Figure 20. Expression Profile Wizard.

Extract Cluster(s)

It may be the case that one or more big clusters have been obtained during the analysis, so it could be desirable to extract them and re-run the clustering to obtain sub-clusters of cells within it. That could make it possible to find more specific cell types.

With the Extract Cluster(s) feature, it is possible to specify one or more clusters to extract by checking them in the “Cluster(s) to Extract” option of the wizard (Figure 21). The result is a new tab with a normalized count table containing only the cells that belong to the selected cluster(s) (Figure 22). A new clustering analysis can be performed with it.

This job could take a while for big datasets.

This feature extracts the normalized counts, not the raw ones. So in the case of running a new clustering, it is advisable not to normalize the counts again.

Figure 21. Extract Cluster(s) wizard.

Figure 22. Extracted count table.

Remove Cluster(s)

It is possible to select which cluster(s) to remove by checking them in the “Cluster(s) to Remove” option of the wizard (Figure 23). This feature removes the clusters and cells belonging to them from the clustering object. They won’t appear anymore in the cluster’s count table, UMAP, heatmap, etc.

This option may be useful if you want to remove clusters prior to performing downstream analyses (e.g. non-interesting or contaminant cells, spurious clusters, etc.).

The result is a modified clustering result (Figure 24). It must be saved in order to conserve the modifications.

This job could take a while for big datasets.

Figure 23. Remove Cluster(s) wizard.

Figure 24. Removed clusters 0 and 1 from the Clustering result.

Context menu options

Rename Cluster

Once you know the cell type of one cluster, it could be interesting to rename it so it has a more descriptive name. In order to do that, go to the cluster column and click the mouse right button. It will open a list of options. On that list, click on “Rename Cluster”. A new wizard will appear to type the new name. The result is the same project but with the renamed cluster (Figure 25). It has to be saved in order to conserve the modification.

Figure 25. Rename a cluster (column) from a Seurat clustering object.