Long-Read Characterization with SQANTI3

Introduction

SQANTI3 is a bioinformatics tool designed for the quality control and filtering of full-length transcripts sequenced with PacBio’s long-read technology. It is designed as the next step of the IsoSeq pipeline. The interest in this tool comes from the usefulness of long-read transcriptome sequencing to describe eukaryotic transcriptomes and replace the use of second-generation sequencing. Illumina short-reads cannot contain a whole transcript and are not able to well-characterize eukaryotic transcriptomes.

This tool can reveal the nature and novelty found by long-read sequencing by classifying transcripts based on the comparison between their splice junctions and the reference transcriptome provided. It combines the long read-defined transcripts (in a FASTA/Q or GTF format) with the reference annotation and with other optional data to provide a wide range of descriptors of transcript quality. SQANTI3 generates a comprehensive report to facilitate quality control and filtering of the isoform models and performs two different tasks, both of them equally important:

  1. Isoform classification and quality control for long read-defined transcriptomes: the categories in which transcripts are classified, together with a long list of attributes and descriptors, allow to carefully inspect the properties of their transcriptome and identify potential problems generated data processing.

  2. Filter for long read-defined transcriptomes: with the supplied information, users can set different parameters to remove potential false positive isoforms.

Please cite SQANTI3 as:

Tardaguila, M., De La Fuente, L., Marti, C., Pereira, C., Pardo-Palacios, F. J., Del Risco, H., ... & Conesa, A. (2018). SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome research, 28(3), 396-411.

(Latest publication is still in process).

Run SQANTI3 for transcriptome characterization

SQANTI3 can be found in the Transcriptomics Module of OmicsBox under Long-Read Analysis → Characterization of Long-Read Transcriptome with SQANTI3. The wizard consists of 5 pages and allows to define the input and output options as well as the analysis parameters (Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6).

Input 1

  • First of all, SQANTI3 can be fed with long-read information in two different ways:

    • GTF/GFF file: this file can be obtained collapsing isoforms from a SAM file using different tools, like cDNA cupcake.

    • PacBio Long Reads: in OmicsBox, SQANTI3 can also accept FASTA/FASTQ files that are the output of IsoSeq3. If this output is chosen, these reads will be mapped against the reference genome using Minimap2 and cDNA cupcake will collapse the isoforms to have a GTF file.

It is recommended that users generate their own GTF/GFF files so as to know exactly which isoforms are being analysed.

  • Reference Genome: this tool also needs the reference genome in FASTA format. Please check that the chromosome/scaffolds names are the same in the reference annotation and the reference genome.

  • Annotation of the Reference Genome: it must be introduced in GTF format. This file will be taken as reference to describe the degree of novelty of each transcript. Make sure your annotation file is based on the correct reference genome version.

Input 2

This tool can also accept short reads (paired-end or single-end) to validate isoforms taking into account the coverage in splice junctions. The expression of each isoform can also be estimated with paired-end short-reads (not with single-end because it is very time-consuming).

If paired-end date is introduced, the upstream and downstream files patterns must be typed in order to differentiate each kind of file.

It is strongly recommended to use short-read data, even if it comes from a different experiment, to validate isoforms using splice junctions coverage.

Input 3

In this page, extra files can be introduced in order to have additional details in the output.

  • Transcription Start Site (TSS) Annotation File: bed file with information of the TSS in a genome. This kind of information can only be used for human and mouse yet. To have more information please visit the referenceTSS webpage.

  • File with PolyA Motifs: Text file with one polyA motif for that species in each line. In the PolyASite webpage different files for H. sapiens, C. elegans and M. musculus can be found with PolyA information, and the last column can be parsed to get the polyA motifs.

  • File With FL Counts: Text or TSV file with Full Length Counts. This information can be obtained, for example, using a cDNA cupake script (get_abundance_post_collapse.py).

Configuration

In this page the parameters for both steps of SQANTI3 can be set:

  • Quality control:

    • Ignore Transcript ID Nomenclature: allow the usage of transcript IDs non related with PacBio's nomenclature (PB.X.Y).

    • Min. length of Reference Transcript: minimum reference transcript length.

    • Skip ORF Prediction: check it to skip ORF prediction so as to save time. If ORF prediction is checked, the translated transcriptome could not be returned.

    • Set of Splice Sites: set of splice sites to be considered as canonical. If the set is going to be changed, type the new splice sites separated with comma and no spaces.

  • Filtering:

    • Filtering: check the box to filter transcripts. This step is strongly recommended.

    • Adenine Percentage: adenine percentage at genomic 3' end to flag an isoform as intrapriming.

    • Adenines in a Row: number of successive adenines at genomic 3' end to flag an isoform as intrapriming.

    • Distance to Annotated TSS: maximum distance to an annotated 3' end to preserve as a valid 3' end and not filter out.

    • Minimum Short-read Coverage: minimum junction coverage for each isoform (only used if paired-end short reads have been used).

    • Filter Mono Exonic Transcripts: filter out all mono exonic transcripts.

Output

  • Set Prefix for Output Files: prefix to add to the output filenames.

  • Information Files: directory to save files with all information of isoforms and junctions, and the transcriptome in GTF format.

  • Isoform FASTA file: directory to save isoform sequences in FASTA format.

  • Translated transcriptome FASTA file: directory to save translated isoform sequences in FASTA format. This is only possible if ORF prediction has not been skipped.

Results

SQANTI3 has the following outputs:

  • Table with the main information of the isoforms (Figure 7).

  • Report with information of the number of genes, isoforms per category and types of junctions before and after filtering isoforms (if not filtering is applied, only information of the first step will be shown) (Figure 8).

  • Charts:

    • General information charts.

    • Distance to TSS and TTS.

    • Filtering.

    • Short and Long Read Coverage.

    • Redundancy Analysis.

    • Quality Features.

  • Files:

    • Transcriptome Annotation File.

    • Transcriptome File.

    • Translated Transcriptome File (optional).

    • Junctions File.

    • Classification File.

Table with information of the isoforms

The next columns can be seen (Figure 7):

  • Category: structural category of the transcript.

  • Isoform: ID of the isoform that comes from the collapsing algorithm.

  • Chr: name of the chromosome/scaffold where the encoding gene is located.

  • Length: isoform length.

  • Exons: number of exons of the isoform.

  • Gene ID: gene ID that appears in the annotation file, if annotated.

  • Transcript ID: transcript ID that appears in the annotation file, if annotated.

  • Gene exons: number of exons of the gene if that gene is annotated. If it is novel, this information is not shown.

  • RT switching: TRUE if one of the junctions could be a RT switching artifact.

  • Min. coverage: minimum junction coverage based on STAR algorithm. If no short reads are given, this column is not shown.

  • FL counts: FL count associated with this isoform if a FL counts file is provided. Otherwise this column is not shown.

  • Isoforms expression: short read expression for this isoform if paired-end reads are provided, otherwise this column is not shown.

  • Gene expression: short read expression for the gene associated with this isoform (summing over all isoforms) if paired-end reads are provided, otherwise this column is not shown.

  • Predicted NMD: TRUE if there's a predicted ORF and CDS ends at least 50 bp before the last junction; FALSE if otherwise. NA if non-coding.

  • CAGE: TRUE if the PacBio transcript start site is within a CAGE Peak. If a TSS annotation file is not given, this column is not shown.

  • PolyA motif: if a polyA motif list is given, shows the top ranking polyA motif found within 50 bp upstream of end. Otherwise, this column is not shown.

Report

In the report, there are two main parts: in the first one, some Job Information can be seen (Figure 8). There are three tables:

  • Table at the gene level: information of the number of annotated genes and novel genes before and after filtering.

  • Table at the isoform level: information of the number of isoforms in the different structural categories before and after filtering and description of each structural category.

  • Table at the splice junction level: information of the number of each type of splice junction before and after filtering and description of each type of splice junction.

Intergenic and Genic Intron transcripts are the structural categories that belong to novel genes. These categories, along with Genic Genomic and Antisense, are not common, so they should not be frequent in the report.

In the second part, the main parameters used in SQANTI3 job are shown (Figure 9).

Charts

In the side panel of the table (Figure 7) there are different action buttons with different chart categories:

  • General Information Charts: these charts are related to general characteristics.

    • Isoforms per Gene: frequency of the number of isoforms.

    • Isoforms per Structural Category: number of isoforms in the different structural categories.

    • Transcript Length per Structural Category: boxplots of the length of the different structural categories.

    • Isoforms per Type of Gene: stacked barcharts with the number of isoforms per gene in annotated genes and novel genes.

    • Exons per Structural Category: boxplots of the number of exons that are present in the different structural categories.

  • Distance to TSS and TTS: these charts are related to the distance of Full Splice Match (FSM) and Incomplete Splice Match (ISM) transcripts to annotated Transcription Start Sites (TSS) and Transcription Termination Sites (TTS).

    • Distances of FSM to TTS: histogram of the distribution of the distance of FSM isoforms to a TTS. If a polyA motif file is introduced, only isoforms with a polyA motif found will be shown in this chart.

    • Distances of ISM to TTS: idem with ISM isoforms.

    • Distances of FSM to TSS: histogram of the distribution of the distance of FSM isoforms to a TSS. If a CAGE annotation file is introduced, only isoforms within a CAGE will be shown in this chart.

    • Distances of ISM to TSS: idem with ISM isoforms.

  • Filtering: these charts are related to the removal of transcripts after filtering.

    • Reasons of Transcripts Removal: percentage of isoforms removed because of intrapriming, RT Switching, or because of Low Coverage/non-canonical junctions. The current filtering rules are as follow:

      • If a transcript is FSM, then it is kept unless the 3' end is unreliable (intrapriming).

      • If a transcript is not FSM, then it is kept only if all of below are true:

        • 3' end is reliable.

        • Does not have a junction that is labeled as RTSwitching.

        • All junctions are either canonical or has short read coverage above the threshold that was set by the user.

    • Isoforms Before and After Filtering: stacked barcharts with the number of isoforms for each structural category before and after filtering.

  • Short and Long Read Coverage: these charts are related to coverage analysis of transcripts by short reads and long reads.

    • Short Read Coverage: stacked barchart with the number of transcripts whose splice junctions are covered by short reads by structural category.

    • Long Read Counts on Genes: boxplots with the number of FL counts per type of gene (annotated or novel). The plotted value is the logarithm to base 2 of the Counts per Million plus 1.

    • Long Read Counts on Categories: idem with structural categories.

    • Saturation Plot: saturation plots are used to assess the performance of RNA-seq experiments and determine the optimal sequencing depth for a given experiment. Saturation plots are generated by plotting the number of unique transcripts detected versus the total number of reads (sequencing depth) for a given RNA-seq experiment. The vertical line shows the saturation point, that is to say, the point where the curve starts flattening (e.g when more sequencing depth is not going to make you find new transcripts).

    • Increment Plot: this plot is similar to the other one, but shows the increase of discovered transcripts as sequencing depth is increased. Because of that, after the saturation point, the bars showing the increase of discovered transcripts should be shorter.

  • Redundancy Analysis: these charts are related to redundancy analysis of transcripts. Redundancy reflects the fact that transcripts, after being validated, can often incorporate 3’ and 5’ end variability, which can lead to the detection of multiple FSM and/or ISM isoforms per reference transcript differing in their start and/or end positions. If a CAGE annotation file and/or a polyA motif file is/are introduced, that data is taken into account.

    • Redundancy Analysis of FSM Transcripts

    • Redundancy Analysis of ISM Transcripts

    • Redundancy Analysis of Both Types

  • Quality Features: these charts summarize features that represent good/bad quality of transcripts per structural categories.

    • Good quality features: percentage of transcripts per structural category with different features that show good quality:

      • All_CJ: every splice junction is a canonical junction.

      • Annotated: Distance of query isoform 3' start to the closest start end of any transcripts of the matching gene is smaller than the distance introduced in Configuration 2 wizard (Figure 5)

      • Cage_support: if CAGE annotation file is added, this bar appear in the chart. Isoform is within a CAGE site.

      • Covered: if short reads files are introduced, this bar is added to the chart. Isoform has short read coverage.

      • PolyA_support: if polyA motif file is introduced, this bar is added to the chart. Isoform has a polyA motif.

    • Bad quality features: percentage of transcripts per structural category with different features that show bad quality:

      • RTSwitching: isoform has at least one junction that might be marked as RT Switching. This information only appear if the filtering step has run.

      • Non_all_CJ: not every splice junction is a canonical junction.

      • Non_covered: if short reads files are introduced, this bar is added to the chart. Isoform has not read coverage.

      • Predicted_NMD: isoform’s predicted ORF and CDS ends at least 50bp before the last junction. This information only appear if the filtering step has run.

Files

  • Transcriptome Annotation File: this GTF annotation file shows the location in the genome of every transcript and exon.

  • Transcriptome File: Multifasta file with the nucleotide sequence of every isoform.

  • Translated Transcriptome File (optional): Multifasta file with the translated amino acidic sequence of every isoform. This file can only be returned if ORF prediction is not deactivated.

  • Junctions File: Text file with information of every splice junction found.

  • Classification File: Text file with information of every isoform found.

Isoform extraction from FASTA

Another useful tool that has been implemented is the extraction of isoforms from the transcriptome file that SQANTI3 returns. To use this tool, just select every isoform whose sequence is needed and right click in one of them and then click in “Extract Isoforms from FASTA”. Then a wizard will open (Figure 16), and the FASTA transcriptome file and the output folder must be introduced. The output filename will be extracted_transcriptome-file-name.fasta.