Single-Cell Clustering

Introduction

This dataset contains single-cell RNA sequencing (scRNA-seq) data from FACS-sorted cells, with libraries constructed using the SMART-seq2 protocol. The cells come from fluorescent transgenic zebrafish lines that label distinct blood cell types. The aim is to study hematopoietic and renal cell heterogeneity in adult zebrafish at single-cell resolution.

Dataset description

  • Organism: Danio rerio

  • Instrument: Illumina NextSeq 500

  • Library construction: SMART-seq2

  • Layout: Paired-end, 38 bp / read

  • Number of cells: 246

Publication

Tang, Q., Iyer, S., Lobbardi, R., Moore, J., Chen, H., Lareau, C., et al. (2017). Dissecting hematopoietic and renal cell heterogeneity in adult zebrafish at single-cell resolution using RNA sequencing. Journal of Experimental Medicine, 214(10), 2875-2887. https://doi.org/10.1084/jem.20170976

Abstract

Recent advances in single-cell, transcriptomic profiling have provided unprecedented access to investigate cell heterogeneity during tissue and organ development. In this study, we used massively parallel, single-cell RNA sequencing to define cell heterogeneity within the zebrafish kidney marrow, constructing a comprehensive molecular atlas of definitive hematopoiesis and functionally distinct renal cells found in adult zebrafish. Because our method analyzed blood and kidney cells in an unbiased manner, our approach was useful in characterizing immune-cell deficiencies within DNA–protein kinase catalytic subunit (prkdc), interleukin-2 receptor γ a (il2rga), and double-homozygous–mutant fish, identifying blood cell losses in T, B, and natural killer cells within specific genetic mutants. Our analysis also uncovered novel cell types, including two classes of natural killer immune cells, classically defined and erythroid-primed hematopoietic stem and progenitor cells, mucin-secreting kidney cells, and kidney stem/progenitor cells. In total, our work provides the first, comprehensive, single-cell, transcriptomic analysis of kidney and marrow cells in the adult zebrafish.

Original Data

Bioinformatic Analysis

The first step in single-cell RNA-seq analysis usually consists of identifying clusters of cells, that is, groups of cells that share similar expression patterns. This information is useful because these groups putatively correspond to different cell types, which are the object of study. The clustering algorithm takes a count table as input, that is, a table with genes in rows, cells in columns, and gene expression levels as values. Creating the count table consists of two steps: aligning the reads from each cell to a reference genome or transcriptome, and counting how many reads align to each gene or transcript.

Figure 2. Distribution of library sizes (total number of reads for each cell).
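The layout of a count table can be illustrated with a few lines of Python. This is only a toy sketch: the cell names, the read assignments, and the helper `build_count_table` are made up for illustration, and a real pipeline derives the (cell, gene) pairs from the alignment step.

```python
from collections import Counter

# Toy input: one (cell, gene) pair per aligned read.
# Cell names and read assignments are invented for illustration.
aligned_reads = [
    ("cell_A", "lyz"), ("cell_A", "lyz"), ("cell_A", "cd79b"),
    ("cell_B", "cd79b"), ("cell_B", "il7r"), ("cell_B", "cd79b"),
]

def build_count_table(reads):
    """Return (genes, cells, table) with genes in rows and cells in columns."""
    counts = Counter(reads)
    cells = sorted({cell for cell, _ in reads})
    genes = sorted({gene for _, gene in reads})
    # table[i][j] = number of reads from cells[j] assigned to genes[i]
    table = [[counts[(cell, gene)] for cell in cells] for gene in genes]
    return genes, cells, table

genes, cells, table = build_count_table(aligned_reads)
print("gene", *cells)
for gene, row in zip(genes, table):
    print(gene, *row)
```

Each row of the printed table is a gene and each column is a cell, which is the orientation expected by the clustering step described later.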

1.- scRNA-seq Quantification

As mentioned above, the first step is to obtain the gene expression values of the cells in the dataset. OmicsBox implements the STARsolo pipeline to quantify scRNA-Seq datasets. This pipeline performs both the alignment of reads to the reference genome and the per-cell quantification. It can handle both 3'-enrichment (UMI-based) and full-length scRNA-seq data. For this example dataset, we will use the full-length settings.

Application

Single Cell RNA-Seq Quantification

Input

Parameters

  • Library Type: Full-length Technologies

  • Reference genome: Danio_rerio.GRCz11.dna_rm.primary_assembly.fa

  • Annotation: Danio_rerio.GRCz11.104.gtf

  • Exon Feature: exon

  • Overhang: 38

  • Paired-End: specify all fastq files here

  • Upstream Files Pattern: _1

  • Downstream Files Pattern: _2

  • Strandness: Unstranded

  • Provide Cell IDs: false

  • 2-pass Mapping: false

  • Min. Intron Length: 20

  • Max. Intron Length: 1000000

  • Max. # of Mismatches: 999

  • Max. # of Multiple Alignments: 20

  • Include Chimeric Alignments: false

  • Max. Distance Between Mates: 1000000

  • Feature Counting: Exons+Introns

  • Project Name: zebrafish

  • Save BAM File: false
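For readers who want to relate these settings to a stand-alone STARsolo run, the sketch below assembles a plausible command line in Python. It is not the command OmicsBox actually executes: the mapping from each parameter to a STAR option is our own reading, the index directory `star_index`, the manifest file `manifest.tsv`, and the helper `starsolo_command` are hypothetical, and the `--soloUMIdedup` value is an assumption because the parameter list does not mention it.

```python
import shlex

def starsolo_command(genome_dir, gtf, manifest, out_prefix):
    """Assemble a STARsolo argument list for full-length (SMART-seq2) data."""
    return [
        "STAR",
        "--genomeDir", genome_dir,          # hypothetical path to a STAR index
        "--sjdbGTFfile", gtf,               # Annotation
        "--sjdbOverhang", "38",             # Overhang
        "--readFilesManifest", manifest,    # hypothetical file listing the FASTQ pairs per cell
        "--soloType", "SmartSeq",           # Library Type: Full-length Technologies
        "--soloUMIdedup", "Exact",          # assumption: not part of the listed parameters
        "--soloStrand", "Unstranded",       # Strandness
        "--soloFeatures", "GeneFull",       # Feature Counting: Exons+Introns
        "--alignIntronMin", "20",           # Min. Intron Length
        "--alignIntronMax", "1000000",      # Max. Intron Length
        "--alignMatesGapMax", "1000000",    # Max. Distance Between Mates
        "--outFilterMismatchNmax", "999",   # Max. # of Mismatches
        "--outFilterMultimapNmax", "20",    # Max. # of Multiple Alignments
        "--outSAMtype", "None",             # Save BAM File: false
        "--outFileNamePrefix", out_prefix,  # Project Name
    ]

cmd = starsolo_command(
    "star_index", "Danio_rerio.GRCz11.104.gtf", "manifest.tsv", "zebrafish_"
)
print(shlex.join(cmd))
```

The script only prints the command, so it can be inspected and adapted before anything is run.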

2.- Filtering

It is highly recommended to filter out both lowly expressed genes and low-quality cells in order to reduce technical noise and improve the performance of downstream analyses.

Application

Single-cell RNA-Seq Filtering

Input

Parameters

  • Minimum Cells: 2

  • Minimum Counts: 50

  • Maximum Counts: 2587492

  • Minimum Features: 10

  • Maximum Features: 4988

  • Filter by % of Mitochondrial Genes: false
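The effect of these thresholds can be sketched with NumPy on a genes x cells count matrix. This is a minimal illustration, not the OmicsBox implementation: we assume that "Minimum Cells" means a gene must be detected in at least that many cells, that the count and feature limits apply per cell, and that genes are filtered before cells. The function name `filter_counts` and the toy matrix are hypothetical.

```python
import numpy as np

def filter_counts(counts, min_cells=2, min_counts=50, max_counts=2587492,
                  min_features=10, max_features=4988):
    """Filter a genes x cells count matrix.

    Returns (filtered_matrix, kept_gene_mask, kept_cell_mask).
    """
    counts = np.asarray(counts)
    # Keep genes detected (count > 0) in at least `min_cells` cells.
    keep_genes = (counts > 0).sum(axis=1) >= min_cells
    kept = counts[keep_genes]
    # Per-cell library size and number of detected genes (assumed to be
    # computed after gene filtering).
    library_size = kept.sum(axis=0)
    n_features = (kept > 0).sum(axis=0)
    keep_cells = (
        (library_size >= min_counts) & (library_size <= max_counts)
        & (n_features >= min_features) & (n_features <= max_features)
    )
    return kept[:, keep_cells], keep_genes, keep_cells

# Toy matrix: 4 genes (rows) x 4 cells (columns), with thresholds scaled down.
toy = [[5, 0, 3, 0],
       [0, 0, 1, 0],
       [2, 1, 4, 0],
       [7, 0, 0, 0]]
filtered, gene_mask, cell_mask = filter_counts(
    toy, min_cells=2, min_counts=2, max_counts=100,
    min_features=2, max_features=10,
)
print(filtered)
```

The default arguments mirror the values listed above; the toy call uses much smaller thresholds so that the effect is visible on a 4 x 4 matrix.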

Execution Time

Around 3 minutes.

Output

3.- scRNA-seq Clustering

The next step is to cluster the cells, that is, to find groups of cells that share similar expression patterns and should therefore correspond to the same cell type or state. To this end, OmicsBox uses the widely used Seurat package.

Application

Single-cell RNAseq Clustering

Input

Parameters

  • Input Type: Count Table Project

  • Normalize Data: true

  • Normalization Method: Regularized Negative Binomial Regression

  • High Variable Features: 3000

  • Scale Data: false

  • Center Data: true

  • Regress Out Mitochondrial Genes: false

  • Regress Out Cell Cycle Genes: false

  • Principal Components: 50

  • Integration Factor: None

  • Define Dimensions by: Manual

  • Number of Dimensions: 20

  • k-value: 20

  • Resolution: 0.8

  • Point's Minimum Distance: 0.3

  • Point's Spread: 1.0

Execution Time

Around 10 minutes.

Output

A total of 6 clusters have been obtained (Figure 3). These 6 groups putatively correspond to different cell types, but at this point we do not yet know which ones.

In addition, the Elbow Plot helps to assess if the number of dimensions used during the clustering is adequate. In this case, 20 is near the elbow point so it is a good approximation (Figure 4). There is more information about this plot in OmicsBox’s manual.
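The quantity behind an elbow plot can be reproduced on synthetic data. The sketch below is an illustration in NumPy, not the Seurat code: it computes the fraction of variance explained by each principal component from the singular values of a centered cells x genes matrix. The simulated data (3 latent signals plus noise) and the function name `explained_variance_ratio` are assumptions made for the example.

```python
import numpy as np

def explained_variance_ratio(expr):
    """Fraction of total variance explained by each principal component.

    expr: cells x genes matrix.
    """
    centered = expr - expr.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2 / (expr.shape[0] - 1)
    return variances / variances.sum()

# Synthetic data: 100 cells, 30 genes, driven by 3 latent signals plus noise.
rng = np.random.default_rng(0)
signal = (rng.normal(size=(100, 3)) @ rng.normal(size=(3, 30))) * 5
expr = signal + rng.normal(size=(100, 30))

ratio = explained_variance_ratio(expr)
for i, r in enumerate(ratio[:6], start=1):
    print(f"PC{i}: {r:.3f}")
```

In this simulation the explained variance drops sharply after the third component, which is the kind of elbow one looks for when choosing the number of dimensions.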

Figure 3. UMAP of the cells colored by the cluster they belong to. The cluster label has been obtained with the graph-based algorithm used by Seurat.

Figure 4. The amount of variance (Y-axis) explained for each of the dimensions (X-axis) during Principal Component Analysis.

4.- Marker-based Cell Type Identification

The last step in this pipeline is to label the clusters. One approach is to look at the expression of genes characteristic of a cell type, the so-called marker genes. These genes are more highly expressed in a particular cell type than in the others. Thus, if we identify the cluster in which a marker gene is most highly expressed, we can annotate that cluster with the corresponding cell type. For this example, the list of marker genes is obtained from the original paper (Tang et al., 2017).
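The reasoning above can be written down as a small Python sketch: for each marker gene, pick the cluster with the highest average expression and give it the corresponding label. The expression values are made-up numbers, the helper `annotate` is hypothetical, and only two markers are shown. The association of lyz with neutrophils is the one used in this tutorial; cd79b is included as a commonly used B-cell marker.

```python
# Average expression of each marker gene per cluster (made-up numbers).
mean_expression = {
    "lyz":   {"cluster_1": 0.2, "cluster_2": 4.1, "cluster_3": 0.1},
    "cd79b": {"cluster_1": 0.3, "cluster_2": 0.1, "cluster_3": 3.5},
}

# Marker gene -> cell type it is characteristic of.
marker_to_type = {"lyz": "Neutrophils", "cd79b": "B Cells"}

def annotate(mean_expression, marker_to_type):
    """Label each marker's top-expressing cluster with the marker's cell type."""
    labels = {}
    for gene, cell_type in marker_to_type.items():
        per_cluster = mean_expression[gene]
        best_cluster = max(per_cluster, key=per_cluster.get)
        labels[best_cluster] = cell_type
    return labels

labels = annotate(mean_expression, marker_to_type)
print(labels)
```

In practice several markers per cell type are inspected together, and ambiguous cases are resolved by looking at the plots rather than by a simple maximum.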

OmicsBox offers different visualizations to help with the marker-based cell type identification.

Application

Single-cell RNAseq Clustering - Charts

Input

Parameters

Expression Profile

  • Input Genes: File

  • Genes File: genes.txt

  • Scale Gene Expression: true

UMAP

  • Color by Gene Expression: true

  • Gene:

    • zfpm1

    • lyz

    • il7r

    • cd79b

Execution Time

Seconds.

Output

By looking at the Expression Profile (Figure 5) and the information contained in the original paper (Tang et al., 2017), the cell types can be easily assigned:

  • cluster_1 → HSCs

  • cluster_2 → Neutrophils

  • cluster_3 → B Cells

  • cluster_4 → T Cells

  • cluster_5 → NK Cells

  • cluster_6 → Thrombocytes

This visualization is very useful because it shows the expression level of multiple genes at the same time.
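The two quantities encoded in each dot of the Expression Profile (percentage of cells expressing the gene, and average expression within the cluster) can be computed as in the following NumPy sketch. The expression values, the cluster labels, and the function name `dot_plot_stats` are invented for illustration, and "expressing" is assumed to mean a value greater than zero.

```python
import numpy as np

def dot_plot_stats(expression, clusters):
    """Per-cluster statistics for one gene.

    expression: expression value of the gene in each cell.
    clusters:   cluster label of each cell (same length).
    """
    expression = np.asarray(expression, dtype=float)
    clusters = np.asarray(clusters)
    stats = {}
    for label in np.unique(clusters):
        values = expression[clusters == label]
        stats[str(label)] = {
            # Dot size: percentage of cells with non-zero expression.
            "pct_expressing": float(100.0 * np.mean(values > 0)),
            # Dot color: average expression over the cells of the cluster.
            "avg_expression": float(values.mean()),
        }
    return stats

# Toy example: one gene measured in six cells from two clusters.
expression = [0, 2, 4, 0, 0, 1]
clusters = ["cluster_1"] * 3 + ["cluster_2"] * 3
stats = dot_plot_stats(expression, clusters)
for label, values in stats.items():
    print(label, values)
```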

Another useful visualization for cell type identification is the UMAP colored by gene expression level (Figure 7). For example, it is easy to see that the gene lyz is more highly expressed in cluster_2 than in the rest of the clusters. Since this is a marker gene for Neutrophils, we can assign this label to cluster_2.

Now that we have assigned a label to each cluster, we can rename the columns of the scRNA-Seq Clustering project so they have more meaningful names (Figure 6). This can be done with the “Rename Cluster” tool available in the Context Menu. The new labels will be applied to the plots as well (Figure 8).

Figure 5. Expression profile of zebrafish clustering results. Marker-genes are in columns, and clusters in rows. The size of the dot represents the percentage of cells expressing the gene in the cluster. The color of the dot represents the average gene expression computed with the cells belonging to each cluster.

Figure 6. Renamed scRNA-Seq Clustering Project.

Figure 7. UMAP with cells colored by the expression level of the genes lyz, zfpm1, il7r, and cd79b.

Figure 8. Renamed UMAP colored by cluster.