scRNA-Seq Trajectory Inference Analysis

Introduction

Monocle3 is a scRNA-Seq data analysis toolkit developed by the Trapnell lab. Trajectory inference, as implemented by this tool, reconstructs the developmental trajectory of cells, mapping out their developmental paths or states. One essential concept in this analysis is pseudotime, which measures the progression of individual cells along these trajectories. The minimal input is a table of raw counts (also known as an expression count table) and a text file listing the starting points. Once Monocle3 has performed trajectory inference, it assigns a pseudotime value to each cell, reflecting its position or stage along the developmental trajectory. Cells can then be grouped by pseudotime range for differential expression analysis, which reveals differentially expressed genes that may be responsible for determining cell fate.

This tool is based on the R package Monocle3. Please cite Monocle3 as:

Qiu, Xiaojie, et al. “Reversed Graph Embedding Resolves Complex Single-Cell Trajectories.” Nature Methods, vol. 14, no. 10, 21 Aug. 2017, pp. 979–982, 10.1038/nmeth.4402.

Qiu, Xiaojie, et al. “Single-Cell mRNA Quantification and Differential Analysis with Census.” Nature Methods, vol. 14, no. 3, 23 Jan. 2017, pp. 309–315, 10.1038/nmeth.4150.

Trapnell, Cole, et al. “Pseudo-Temporal Ordering of Individual Cells Reveals Dynamics and Regulators of Cell Fate Decisions.” Nature Biotechnology, vol. 32, no. 4, 1 Apr. 2014, pp. 381–386, 10.1038/nbt.2859.

Figure 1. Monocle3 Wizard in OmicsBox

Accessing Monocle3 in OmicsBox

Generate the scRNA count table using STARsolo (located under Transcriptomics → Single Cell RNA-Seq → Create Count Table) or filter the counts with Seurat (found in Transcriptomics → Single Cell RNA-Seq → Create Count Table → Filtering). After completing the scRNA-Seq quantification or Seurat's count filtering, the count table will appear in the Main Table Output. On the side panel of this output, click "Trajectory Analysis" to initiate Monocle3 (refer to Figure 2). This action uses the selected scRNA-Counts as input for the Trajectory Inference Wizard.

Figure 2. Trajectory Inference with Monocle3, available as the side panel option of scRNA-Seq Count Tables

Select Starting Points (Root Cells) of the Trajectory

Select the root node (a collection of root cells) for Monocle3, as it serves as the reference point for trajectory construction. OmicsBox offers two methods to provide this information: through a list of root cells or by choosing a column from the cell metadata.

  1. List of Root Cells: Submit root cells such as Progenitor cells, Start Cells, or Undifferentiated cells. These cells act as the starting points for the trajectory. Provide this list in a text file format (.tsv or .txt) containing cell names, with one name per line.

  2. Cell Metadata File: Choose a tab-separated file containing meta-information or experimental details about the cells.

    1. Select Column: The options available, based on the information in the Cell Metadata file, will display all potential columns with root information.

    2. Starting Point: After selecting the column with potential starting point information, decide on the actual starting point. If you're considering experimental time, the starting point might be the initial capture time (0h) or something similar. Alternatively, it could be a specific type of cell, like a hematopoietic stem cell.
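As a rough illustration of the two input routes above, the following Python sketch derives a root-cell list from a tab-separated metadata table. The column names (`Cell`, `Embryo_Time`), the value `0h`, the cell names, and the helpers `select_root_cells` and `to_root_cell_list` are hypothetical and not part of OmicsBox; the sketch only mirrors the "select column, then starting point" logic and the one-name-per-line file format.

```python
import csv
import io

# Hypothetical cell metadata (tab-separated); the "Cell" column holds cell names.
METADATA_TSV = (
    "Cell\tEmbryo_Time\tCell_Type\n"
    "cell_001\t0h\tprogenitor\n"
    "cell_002\t0h\tprogenitor\n"
    "cell_003\t12h\tintermediate\n"
    "cell_004\t24h\tdifferentiated\n"
)


def select_root_cells(tsv_text, column, starting_point):
    """Return the names of cells whose `column` value equals `starting_point`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["Cell"] for row in reader if row[column] == starting_point]


def to_root_cell_list(cell_names):
    """Format cell names as a root-cell list: one name per line."""
    return "\n".join(cell_names) + "\n"


root_cells = select_root_cells(METADATA_TSV, "Embryo_Time", "0h")
print(to_root_cell_list(root_cells), end="")
```

The printed text could be saved as a .txt file and supplied as the list of root cells.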

Please Supply Syntactically Correct Names

Monocle3 is written in R and requires syntactically correct naming. Therefore, please refrain from supplying illegal characters in the above files. Visit the last section of the manual to read about syntactically correct naming.

Figure 3. Selection of starting points for the trajectory analysis

Configuration 1. Data Pre-processing & Clustering

  1. Normalization Method: Normalization aims to minimize non-biological variation. Two options are available: log-normalization and size-factor normalization. Log-normalization standardizes the data and is especially useful for columns with high variance. Size-factor normalization removes bias from each cell by dividing its counts by the cell's size factor. Normalization can also be skipped by selecting "none".

  2. Principal component analysis (PCA): This classic dimension reduction method creates linear combinations of gene expression values, termed principal components (PCs). The PCs are orthogonal to each other and capture most of the gene expression variation in far fewer dimensions than the original data.

    1. Dimensions: This refers to the number of dimensions post-PCA. For datasets exceeding 5,000 cells, selecting the top 50 principal components is advisable.

    2. Scaling: Scaled data facilitates model learning. When dealing with variables in different units, scaling before PCA computation is beneficial.

  3. Uniform Manifold Approximation and Projection (UMAP): This dimension reduction technique, akin to t-SNE for visualization, is also suitable for general non-linear dimension reduction.

    1. Minimum Distance: Dictates UMAP's cell clustering tightness. Low values result in dense cell clusters, while higher values emphasize preserving broad topological structure.

    2. Neighbors: This balances local versus global structures. Lower values direct UMAP to concentrate on local structures, while higher values emphasize a broader view, potentially sacrificing fine details.

  4. Clustering

    1. Nearest Neighbors: Specifies the number of nearest neighbors used by the nearest-neighbor algorithm during clustering.

    2. Allow disjoint graph: Activating this merges different partitions into a single trajectory. Otherwise, distant partitions are assigned “Infinite” pseudotime.

    3. Allow loops: Activation discovers potential cyclic trajectories within the data.

    4. Resolution: Sets the granularity of cell clustering. It gauges how cells group based on their expression profiles, either finely or coarsely.
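The normalization options described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code OmicsBox or Monocle3 runs: the size factor is assumed to be a cell's total count divided by the geometric mean of all cells' totals, the log option is assumed to be applied after size-factor division with base 2 and a pseudo-count of 1, and the method labels `"log"`, `"size_only"`, and `"none"` as well as the gene names are placeholders chosen for this sketch.

```python
import math

# Hypothetical raw counts: each row is a gene, each column is a cell.
counts = {
    "geneA": [10, 20, 0],
    "geneB": [30, 60, 5],
    "geneC": [60, 120, 15],
}


def size_factors(counts):
    """Per-cell size factor: total count / geometric mean of all totals.

    Assumed definition; requires every cell to have at least one count.
    """
    n_cells = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n_cells)]
    geo_mean = math.exp(sum(math.log(t) for t in totals) / n_cells)
    return [t / geo_mean for t in totals]


def normalize(counts, method="log", pseudo_count=1.0):
    """Apply 'none', 'size_only', or 'log' normalization to a count table."""
    if method == "none":
        return {g: [float(c) for c in row] for g, row in counts.items()}
    sf = size_factors(counts)
    scaled = {g: [c / s for c, s in zip(row, sf)] for g, row in counts.items()}
    if method == "size_only":
        return scaled
    if method == "log":
        return {
            g: [math.log2(v + pseudo_count) for v in row]
            for g, row in scaled.items()
        }
    raise ValueError(f"unknown normalization method: {method!r}")


sf = size_factors(counts)
size_normalized = normalize(counts, method="size_only")
log_normalized = normalize(counts, method="log")
```

After size-factor division every cell has the same total count, which is the sense in which the per-cell bias is removed.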

Figure 4. Parameter tuning for Monocle3-based trajectory analysis.

Side Panel Actions

After completing the analysis and obtaining the results, you can see the side-panel actions. The currently available side panel options include:

  1. Summary Report: Produces a summary of the analysis.

  2. Extract Counts: Retrieves the counts of the cells in selected pseudotime ranges.

  3. Differential Expression: Analyzes the differences in gene expression.

Actions: Summary Report

Generates a summary report of the analysis.

Figure 5. Side Panel Action after the Monocle3 trajectory analysis in OmicsBox

Actions: Extract Cluster Counts

  1. Extract Counts: Clusters represent the pseudotime ranges. Use this option to choose the pseudotime ranges of interest. The system will then extract the associated cells and their gene/feature counts.
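Conceptually, the extraction step keeps the cells whose pseudotime range was selected and returns their counts. The Python sketch below illustrates that idea only; the range labels, gene and cell names, and the helper `extract_counts` are hypothetical, and the data layout used by OmicsBox may differ.

```python
# Hypothetical result table: cell name -> pseudotime range label.
pseudotime_range = {
    "cell_001": "[0, 5)",
    "cell_002": "[0, 5)",
    "cell_003": "[5, 10)",
    "cell_004": "[10, 15)",
}

# Hypothetical raw counts: gene -> {cell name: count}.
raw_counts = {
    "geneA": {"cell_001": 3, "cell_002": 0, "cell_003": 7, "cell_004": 1},
    "geneB": {"cell_001": 0, "cell_002": 2, "cell_003": 4, "cell_004": 9},
}


def extract_counts(raw_counts, pseudotime_range, ranges_of_interest):
    """Keep only the cells that fall in one of the selected pseudotime ranges."""
    wanted = set(ranges_of_interest)
    cells = [cell for cell, rng in pseudotime_range.items() if rng in wanted]
    return {
        gene: {cell: per_cell[cell] for cell in cells}
        for gene, per_cell in raw_counts.items()
    }


subset = extract_counts(raw_counts, pseudotime_range, ["[0, 5)", "[5, 10)"])
print(subset)
```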

Actions: Differential Expression

For differential expression analysis, refer to the single-cell differential expression tutorial. In this context, the pseudotime range labels are used instead of Cluster labels.

Figure 6. OmicsBox wizard for the extraction of raw counts from Monocle3 results

Side Panel Charts

  1. Trajectory UMAP: This visualization provides a UMAP (Uniform Manifold Approximation and Projection) of the trajectory.

  2. Expression Trends: Charts the trends in gene expression.

  3. Expression UMAP: Offers a UMAP based on gene expression.

  4. Distribution of Cell in Pseudotime: Displays how cells are distributed across different pseudotime values.

The Trajectory UMAP and Distribution of Cell in Pseudotime options directly produce a UMAP and a bar plot, respectively.

Figure 7. Options for data visualization after trajectory analysis in OmicsBox

Expression Trend

  1. Gene ID/ Name: Choose the feature or gene for which you want to plot the trend.

  2. Scaling: Large variations in counts can sometimes obscure finer details. Scaling adjusts the data range to highlight these subtleties.

  3. Log Transform: If a few very large values dominate the data, a log transformation compresses the range so that differences among the remaining values become easier to see. A pseudo-count of 0.5 is added to the raw counts before the log transformation.

  4. Smoothness: Modulate the trend line's smoothness to fit your preference.

  5. Color Cells By: Opt to color the cells based on inferred clusters or partitions from Monocle3.
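The log transform with the 0.5 pseudo-count can be sketched as follows. The pseudo-count is the value stated above; the base of the logarithm is not stated, so base 2 is an assumption of this sketch, and the helper name `log_transform` and the example counts are hypothetical.

```python
import math

PSEUDO_COUNT = 0.5  # added to raw counts before taking the log, as stated in the text


def log_transform(raw_counts):
    """Return log2(count + 0.5) for each raw count (base 2 is an assumption)."""
    return [math.log2(c + PSEUDO_COUNT) for c in raw_counts]


raw = [0, 1, 10, 10000]
transformed = log_transform(raw)
for count, value in zip(raw, transformed):
    print(f"{count:>6} -> {value:.3f}")
```

Thanks to the pseudo-count, a zero count maps to a finite value (log2 of 0.5) instead of being undefined.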

Figure 8. Wizard for plotting gene/feature expression trends along pseudotime using monotonic spline in OmicsBox

Expression UMAP

  1. Gene ID/ Name: Select the feature or gene whose expression is plotted on the UMAP.

  2. Scaling: When visualizing data, large differences in counts can overshadow more subtle differences. Scaling can help bring out the details by adjusting the range of the data.

  3. Log Transform: If a few very large values dominate the data, a log transformation compresses the range so that differences among the remaining values become more noticeable and easier to visualize. A pseudo-count of 0.5 is added to the raw counts before taking the log.

Figure 9. Wizard for plotting gene/feature expression on UMAP embeddings in OmicsBox

Output

Monocle3 in OmicsBox delivers several outputs aligned with a conventional trajectory analysis. These outputs comprise a primary output table, four key plots, and a succinct report detailing the parameters used and the results obtained.

  1. Main Output Table with Pseudotime Information: This table provides detailed pseudotime data for the analyzed cells.

  2. Trajectory UMAP: A visualization showing the trajectory of cells using the UMAP (Uniform Manifold Approximation and Projection) technique.

  3. Expression UMAP: A plot that illustrates how the expression of a particular gene varies across cells in UMAP embeddings.

  4. Expression Trends: Displays the trends in gene expression across pseudotime.

  5. Distribution of Cells Over Pseudotime Ranges: This visualization depicts how cells are spread out across various pseudotime values.

Output Table Fields:

  1. Cell: The names of the cells as provided in the count table and experimental design file.

  2. Pseudotime: The pseudotime assigned to each cell by Monocle3. Cells that haven't been allocated a pseudotime will not have a value in this column.

  3. Pseudotime Range: This represents the clusters of pseudotime. For cells without an assigned pseudotime, this field will explicitly state so. The intervals for these ranges are left-closed (right-open).

  4. Cluster: The clusters to which the cells have been assigned.

  5. Partition: This refers to the assigned super-cluster or partition.
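To make the left-closed (right-open) convention concrete, the following Python sketch assigns a pseudotime value to a range label. The range edges, the label text `unassigned`, and the helper `pseudotime_range_label` are hypothetical; OmicsBox chooses its own range boundaries and wording.

```python
import bisect
import math


def pseudotime_range_label(pseudotime, edges):
    """Label a pseudotime with its left-closed, right-open range [lo, hi).

    `edges` must be sorted in ascending order. Missing, NaN, or infinite
    pseudotime values are reported as "unassigned" (hypothetical wording).
    """
    if pseudotime is None or math.isnan(pseudotime) or math.isinf(pseudotime):
        return "unassigned"
    i = bisect.bisect_right(edges, pseudotime) - 1
    if i < 0 or i >= len(edges) - 1:
        return "out of range"
    return f"[{edges[i]}, {edges[i + 1]})"


edges = [0, 5, 10, 15]  # hypothetical range boundaries
for value in (0.0, 4.999, 5.0, float("inf"), None):
    print(value, "->", pseudotime_range_label(value, edges))
```

A value that sits exactly on a boundary, such as 5.0 here, belongs to the range that starts at that boundary, not to the one that ends there.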

Figure 10. Main output table

Results (Trajectory UMAP)

The Trajectory UMAP combines a UMAP coloured by the continuum of pseudotime with a superimposed line graph that represents the overall progression pattern among the cells. Using the pseudotime slider, users can focus on cells within a particular pseudotime range. Cells that have not been assigned a pseudotime are not coloured, so in those regions the progression line is shown without coloured cells. The chart also supports interactive cell selection, allowing users to select a starting cell and run the trajectory analysis interactively.

Figure 11. Interactive Trajectory UMAP of Monocle3 in OmicsBox

Expression Trend

The expression trend plots the per-cell expression of a chosen feature or gene against pseudotime, fitted with a monotonic spline interpolation. This visualization offers insights into how the expression of a specific feature or gene changes along pseudotime in different cell clusters or partitions.

Figure 12. Expression trend of the selected feature in OmicsBox

Expression UMAP

The expression of a specific feature or gene is depicted on the UMAP, illustrating the variation in gene expression across the cells in UMAP embeddings.

Figure 13. Expression of the selected feature on a UMAP in OmicsBox

Results (Distribution of Cells Across Pseudotime Range)

This visualization displays the distribution of cells based on their pseudotime range, showcasing the number of cells within each specific range. By correlating this distribution with cell type annotations, one can identify progenitor cell types or intermediate cell states, as these often possess lower pseudotime values. The intervals for these ranges are defined as left-closed (right-open).
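The counts behind such a bar plot, and their cross-tabulation with cell type annotations, can be sketched in Python as follows. The cell names, range labels, and cell types are made-up example data, not output from OmicsBox.

```python
from collections import Counter

# Hypothetical (cell, pseudotime range, cell type annotation) records.
cells = [
    ("cell_001", "[0, 5)", "progenitor"),
    ("cell_002", "[0, 5)", "progenitor"),
    ("cell_003", "[5, 10)", "intermediate"),
    ("cell_004", "[5, 10)", "progenitor"),
    ("cell_005", "[10, 15)", "differentiated"),
]

# Number of cells per pseudotime range (the heights of the bars).
cells_per_range = Counter(rng for _, rng, _ in cells)

# Number of cells per (range, cell type) pair, to see which types sit early.
types_per_range = Counter((rng, cell_type) for _, rng, cell_type in cells)

for rng, n in cells_per_range.items():
    print(rng, n)
```

In this toy data the earliest range contains only progenitor cells, which is the kind of pattern the correlation with cell type annotations is meant to expose.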

Figure 14. Number of cells distributed across pseudotime

Syntactically Correct Naming

Please refer to the following table while preparing the data for analysis. Following these rules makes the analysis more robust and enables integration with other tools, both from BioBam and open source (R, Python, Excel).

| Issue | Incorrect Naming | Correct Naming |
| --- | --- | --- |
| Spaces | Embryo Time | Embryo.Time or Embryo_Time |
| Quotes | Embryo'Time | Embryo.Time or Embryo_Time |
| Mathematical Operators | Embryo+Time | Embryo.Time or Embryo_Time |
| Backslash | Embryo\Time | Embryo.Time or Embryo_Time |
| Preceded by a numeric | 2Embryo | Embryo2 or Embryo.2 or Embryo_2 |
| Symbols ($, @, # etc) | Embryo$Time or $EmbryoTime | Embryo.Time or Embryo_Time |
| Web links | Not supported | |
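Names can be checked before import with a small script. The Python sketch below encodes R's rule for syntactic names (letters, digits, `.` and `_` only, starting with a letter or with a `.` that is not followed by a digit) for ASCII names; it does not check R reserved words. The helpers `is_syntactic` and `sanitize` are hypothetical, and the clean-up strategy (underscore replacement, moving leading digits to the end) is only one possible choice modelled on the examples in the table.

```python
import re

# ASCII approximation of R's rule for syntactically valid names.
_SYNTACTIC = re.compile(r"^(?:[A-Za-z]|\.(?![0-9]))[A-Za-z0-9._]*$")


def is_syntactic(name):
    """Return True if `name` looks like a syntactically valid R name."""
    return bool(_SYNTACTIC.match(name))


def sanitize(name):
    """Turn `name` into a valid one, or raise ValueError if that fails."""
    # Replace every run of illegal characters with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._]+", "_", name).strip("_")
    # Move leading digits to the end, e.g. "2Embryo" -> "Embryo2".
    leading = re.match(r"^([0-9]+)([^0-9].*)$", cleaned)
    if leading:
        cleaned = leading.group(2).lstrip("._") + leading.group(1)
    if not is_syntactic(cleaned):
        raise ValueError(f"cannot sanitize name: {name!r}")
    return cleaned


for raw_name in ["Embryo Time", "Embryo'Time", "Embryo+Time", "2Embryo", "$EmbryoTime"]:
    print(repr(raw_name), "->", sanitize(raw_name))
```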