Single-Cell Clustering


This dataset contains Single-cell RNA sequencing data coming from FACS sorted cells, sequenced using the SMART-seq2 protocol for the library construction. These cells come from fluorescent transgenic zebrafish lines that label distinct blood cell types. The aim is to study the hematopoietic and renal cell heterogeneity in adult zebrafish at single-cell resolution.

Dataset description

  • Organism: Danio rerio

  • Instrument: Illumina NextSeq 500

  • Library construction: SMART-seq2

  • Layout: Paired-end. 38 pb / read.

  • Number of cells: 246


Tang, Q., Iyer, S., Lobbardi, R., Moore, J., Chen, H., & Lareau, C. et al. (2017). Dissecting hematopoietic and renal cell heterogeneity in adult zebrafish at single-cell resolution using RNA sequencing. Journal Of Experimental Medicine214(10), 2875-2887.


Recent advances in single-cell, transcriptomic profiling have provided unprecedented access to investigate cell heterogeneity during tissue and organ development. In this study, we used massively parallel, single-cell RNA sequencing to define cell heterogeneity within the zebrafish kidney marrow, constructing a comprehensive molecular atlas of definitive hematopoiesis and functionally distinct renal cells found in adult zebrafish. Because our method analyzed blood and kidney cells in an unbiased manner, our approach was useful in characterizing immune-cell deficiencies within DNA–protein kinase catalytic subunit (prkdc), interleukin-2 receptor γ a (il2rga), and double-homozygous–mutant fish, identifying blood cell losses in T, B, and natural killer cells within specific genetic mutants. Our analysis also uncovered novel cell types, including two classes of natural killer immune cells, classically defined and erythroid-primed hematopoietic stem and progenitor cells, mucin-secreting kidney cells, and kidney stem/progenitor cells. In total, our work provides the first, comprehensive, single-cell, transcriptomic analysis of kidney and marrow cells in the adult zebrafish.

Original Data

Bioinformatic Analysis

The first step in Single-Cell RNAseq analysis usually consists of the identification of clusters of cells, that is, groups of cells that share similar expression patterns. This information is useful because these groups putatively correspond to different cell types, which are the object of study. The clustering algorithm needs as input a count table, that is, a table containing genes in rows, cells in columns, and gene expression level in values. Creating the count table consists of two steps: aligning the reads coming from each one of the cells to a reference genome or transcriptome, and counting how many reads align in each gene or transcript.

1.- RNA-seq Alignment

The first step of the pipeline is to align the raw reads against the reference genome or transcriptome. The choice will depend on the data available and whether it is preferable to quantify by gene (then reference genome should be used) or by transcript (then reference transcriptome should be used).

In this example, reads will be mapped against a reference genome. Since the species under study is a eukaryotic organism, that is, it has exons in its genome, the right software to use is STAR. STAR is a fast RNA-Seq read mapper, with support for splice-junction and fusion read detection, and it was designed to align non-contiguous sequences directly to a reference genome.


RNA-Seq Alignment - STAR.



  • Input Files: * specify here all the FASTQ files *

  • Upstream Files Pattern: _1

  • Downstream Files Pattern: _2

  • Provide Annotations: true

  • Annotation File: Danio_rerio.GRCz11.104.gtf

  • Overhang: 35

  • 2-pass Mapping: true

  • Sort by Coordinate: true

  • Min. Intron Length: 20

  • Max. Intron Length: 1000000

  • Max. Distance Between Mates: 1000000

  • Max. # of Multiple Alignments: 20

  • Max. # of Mismatches: 999

  • Include Chimeric Alignments: false

  • Add Read Group Information: false

  • Save Splice Junctions: false

  • Save Unmapped Reads: false

Execution Time

Around 4’5 hours.


The percentage of reads aligned to only one location in the reference genome is high, indicating that the alignment was good (Figure 1).

Figure 1. Relative alignments per category (unique mapped, multi-mapped, chimeric, unmapped) for each of the cells. Only 30 cells are displayed.

2.- Expression Quantification

The next step is to quantify the gene expression for each one of the cells. The aim is to obtain the count table necessary for the clustering algorithm.

Since we’ve used a reference genome in the alignment step, the option to choose at this point is the “Gene-level Quantification”. This tool expects files with aligned sequencing reads in SAM/BAM format and a GTF/GFF file with coordinates of genomic features. It uses the HTSeq package to count how many reads map to each feature of interest (e.g. genes, exons, etc.).


Create Count Table, Gene-level Quantification



  • Feature File: Danio_rerio.GRCz11.104.gtf

  • Quantification Level: gene

  • Group by: gene_name

  • Strand Specificity: Non Strand Specific

  • Overlap Mode: Union

  • Lowest Mapping Quality: 10

Execution Time

Around 20 minutes.


The output is a count table with one cell in each column. It is possible to generate a boxplot with the distribution of the library sizes (total number of reads for each cell) (Figure 2). This output is useful to detect outlier cells, that is, cells with too much or too few counts. It is recommendable to filter them during the clustering algorithm.

Figure 2. Distribution of library sizes (total number of reads for each cell).

3.- scRNA-seq Clustering

The next step is to perform the clustering of cells coming from Single-Cell RNA Sequencing (scRNA-seq) data. That is, find groups of cells that share similar expression patterns, which should correspond to the same cell type or state. To this end, OmicsBox’s tool uses the widely-used Seurat package.


Single-cell RNAseq Clustering



  • Input Type: Count Table Project

  • Minimum Cells: 1

  • Set Minimum Counts Cutoff: true

  • Minimum Counts: 100000

  • Set Maximum Counts Cutoff: true

  • Maximum Counts: 1500000

  • Set Minimum Features Cutoff: true

  • Minimum Features: 500

  • Set Maximum Features Cutoff: false

  • Filter by % of Mitochondrial Genes: false

  • Multi Sample Analysis: false

  • Normalize Data: true

  • Normalization Method: Regularized Negative Binomial Regression

  • High Variable Features: 3000

  • Scale Data: false

  • Center Data: true

  • Regress Out Mitochondrial Genes: false

  • Regress Out Cell Cycle Genes: false

  • Principal Components: 50

  • Define Dimensions by: Manual

  • Number of Dimensions: 20

  • k-value: 20

  • Resolution: 0.8

  • Point's Minimum Distance: 0.3

  • Point's Spread: 1.0

Execution Time

Around 10 minutes.


A total of 6 different clusters have been obtained (Figure 3). These 6 groups putatively correspond to different cell types, but we don’t know to which ones for the moment.

In addition, the Elbow Plot helps to assess if the number of dimensions used during the clustering is adequate. In this case, 20 is near the elbow point so it is a good approximation (Figure 4). There is more information about this plot in OmicsBox’s manual.

Figure 3. UMAP of the cells colored by the cluster they belong to. The cluster label has been obtained with the graph-based algorithm used by Seurat.

Figure 4. The amount of variance (Y-axis) explained for each of the dimensions (X-axis) during Principal Component Analysis.

4.- Marker-based Cell Type Identification

The last step in this pipeline is to label the clusters. One approach is to look at the expression of genes characteristic of a cell type, the so-called marker genes. These genes are more expressed in a particular cell type in comparison with the other. Thus, if we identify in which cluster they are more expressed we can annote that cluster with the corresponding cell type. For this example, the list of marker genes is obtained from the original paper (Tang et al., 2017).

OmicsBox offers different visualizations to help with the marker-based cell type identification.


Single-cell RNAseq Clustering - Charts



Expression Profile

  • Input Genes: File

  • Genes File: genes.txt

  • Scale Gene Expression: true


  • Color by Gene Expression: true

  • Gene:

    • zfpm1

    • lyz

    • il7r

    • cd79b

Execution Time



By looking at the Expression Profile (Figure 5) and the information contained in the original paper (Tang et al., 2017), the cell types can be easily assigned:

  • cluster_1 → HSCs

  • cluster_2 → Neutrophils

  • cluster_3 → B Cells

  • cluster_4 → T Cells

  • cluster_5 → NK Cells

  • cluster_6 → Thrombocytes

This visualization it’s really useful since it allows to check for the expression level of multiple genes at the same time.

Still, another interesting visualization that also helps to the cell type identification is the UMAP colored by the gene expression level (Figure 7). For example, it can be easily seen that the gene lyz is more expressed in cluster_2 in comparizon with the rest of the clusters. Since this is a marker gene for Neutrophils, we can assign this label to cluster_2.

Now that we have assigned a label to each cluster, we can rename de columns of the scRNA-Seq Clustering project so they have a more meaningful name (Figure 6). This can be achieved with the “Rename Cluster” tool available in the Context Menu. The new labels will be applied to the plots as well (Figure 8).

Figure 5. Expression profile of zebrafish clustering results. Marker-genes are in columns, and clusters in rows. The size of the dot represents the percentage of cells expressing the gene in the cluster. The color of the dot represents the average gene expression computed with the cells belonging to each cluster.

Figure 6. Renamed scRNA-Seq Clustering Project.

Figure 7. UMAP with cells colored by the expression level of the genes lyz, zfpm1, il7r, and cd29b.

Figure 8. Renamed UMAP colored by cluster.