Long-Read Isoform Definition with FLAIR
Long-read sequencing technologies are becoming increasingly popular for transcriptome analysis because they can provide more accurate and more complete transcript assemblies and gene annotations, especially for complex transcriptomes. However, just as short reads must be assembled to obtain a transcriptome, long reads need to be preprocessed before isoforms can be defined. For this purpose, OmicsBox provides FLAIR (Full-Length Alternative Isoform analysis of RNA).
FLAIR is a computational pipeline designed for the correction, isoform definition and quantification of transcriptomes sequenced with long-read technologies such as PacBio or Oxford Nanopore. The pipeline combines alignment-based methods (using Minimap2) with a subsequent de novo assembly step that collapses long reads into isoforms. The tool is optimized for the specific characteristics of long-read sequencing data, such as high error rates and long read lengths, and it can handle complex gene structures and alternative splicing events that may be difficult to detect with short-read data alone. In addition, FLAIR can quantify the discovered isoforms.
Please cite FLAIR as:
Tang, A. D., Soulette, C. M., van Baren, M. J., Hart, K., Hrabeta-Robinson, E., Wu, C. J., & Brooks, A. N. (2020). Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nature Communications, 11(1), 1-12.
Run FLAIR for Long-Read Isoform Definition
FLAIR can be found under Transcriptomics → Long-Read Analysis → Long-Read Isoform Definition with FLAIR. The wizard consists of 4 pages, which allow you to define the input and output options as well as the analysis parameters (Figure 2, Figure 3, Figure 4 and Figure 5).
FLAIR requires the following input files:
Long-Read Files: FASTA/Q files containing long reads from PacBio or ONT technologies.
Reference Genome: FASTA file with the reference genome.
Genome Annotation: GTF file with annotations of the reference genome.
Make sure that the reference genome and the reference annotation have the same version.
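One quick way to catch a mismatch between the genome and the annotation is to compare sequence names. The following is a minimal sketch, not part of OmicsBox or FLAIR; the function names are hypothetical. It only detects naming differences (for example "chr1" in one file and "1" in the other) and cannot prove that both files come from the same assembly version.

```python
def fasta_sequence_names(fasta_lines):
    """Return the set of sequence names (first word of each '>' header)."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gtf_sequence_names(gtf_lines):
    """Return the set of values found in the first (seqname) GTF column."""
    names = set()
    for line in gtf_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def missing_in_genome(fasta_lines, gtf_lines):
    """Sorted sequence names used in the GTF but absent from the FASTA."""
    return sorted(gtf_sequence_names(gtf_lines) - fasta_sequence_names(fasta_lines))


if __name__ == "__main__":
    # Tiny in-memory example; real files can be passed as open file handles.
    genome = [">chr1 example\n", "ACGT\n", ">chr2\n", "GGCC\n"]
    annotation = [
        "chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id \"g1\";\n",
        "3\tsrc\texon\t1\t4\t.\t+\t.\tgene_id \"g2\";\n",
    ]
    print(missing_in_genome(genome, annotation))
```

An empty result means every sequence named in the annotation exists in the genome file; any name that is reported is worth checking before running the wizard.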
Additionally, some optional files can be included.
Aligned Reads: BAM files obtained by aligning the input long-read FASTA/Q files. Enable this option if you want to map your long reads with a tool other than Minimap2 (the aligner used by FLAIR) or with a specific set of parameters.
Short-read BAM Files: These BAM files must be the result of aligning short reads with a gapped mapper like STAR. They will be converted into a BED file that will be used in the alignment and correction steps to support the existence of the splice junctions that appear in the long-read BAM file.
Reads Manifest: A tab-delimited file used in the quantification step with no header and these columns: sample, condition, batch, filename.
The reads manifest can only be uploaded if the ‘Quantify Reads’ option is checked. In addition, the last column (filename) must match the name of one of the input files added in the Long-Read Files box on the first page.
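As an illustration of the manifest layout described above (tab-delimited, no header, columns sample, condition, batch, filename), here is a minimal sketch that writes such a file and rejects rows whose filename is not among the long-read inputs. The helper name `write_manifest` and the example file names are hypothetical, and whether the wizard expects bare file names or full paths is an assumption left to the user; this sketch compares the strings exactly as given.

```python
import csv
import io


def write_manifest(rows, handle, input_filenames):
    """Write (sample, condition, batch, filename) rows as a header-less TSV.

    Raises ValueError if a filename is not one of the long-read input files.
    """
    known = set(input_filenames)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for sample, condition, batch, filename in rows:
        if filename not in known:
            raise ValueError(f"{filename!r} is not one of the long-read input files")
        writer.writerow([sample, condition, batch, filename])


if __name__ == "__main__":
    # Hypothetical samples and file names, for illustration only.
    long_read_inputs = ["control_1.fastq", "treated_1.fastq"]
    rows = [
        ("control_1", "control", "batch1", "control_1.fastq"),
        ("treated_1", "treated", "batch1", "treated_1.fastq"),
    ]
    buffer = io.StringIO()
    write_manifest(rows, buffer, long_read_inputs)
    print(buffer.getvalue(), end="")
```

To produce a file on disk, pass a handle opened with `open("manifest.tsv", "w", newline="")` instead of the in-memory buffer.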
On this page, the parameters for every FLAIR step can be set:
Native RNA: use native-RNA-specific alignment parameters for minimap2. Enable this flag when the input reads come from native (direct) RNA sequencing rather than from cDNA.
Min. Mapping Quality: minimum mapping quality score of read alignment to the genome.
Retain Secondary Alignments: retain the given number of secondary alignments from minimap2 (i.e. alignments of the same read to other parts of the genome). Please proceed with caution: changing this setting is only useful if you know there are closely related homologues elsewhere in the genome, and it will likely decrease the quality of FLAIR's final results.
Window Size: window size for correcting splice sites.
Minimum Supporting Reads: minimum number of long reads to call an isoform.
Window Size for TSS and TTS: window size for comparing transcript starts (TSS) and ends (TTS).
Ends Determined at Isoform Level: when specified, TSS/TTS for each isoform will be determined from supporting reads for individual isoforms and not from genes.
Get TSS and TTS from Supporting Reads: do not use TSS/TTS from the input GTF to adjust isoform TSS/TTS. Instead, each isoform will be determined from supporting reads.
How to Treat Redundant Isoforms:
No redundancy control: best TSSs/TTSs chosen for each unique set of splice junctions.
TSS/TTS that maximize length: choose this to maximize length of transcripts.
Most supported TSS/TTS: the single TSS/TTS supported by the most reads.
How to Filter Isoforms:
Filter based on support: this is the default filter.
Filter out subset isoforms: any isoform that is a proper subset of another main isoform is removed.
Both options: as the name states, both previous options are used.
Both options and remove single-exon isoforms: the same as the previous option, but single-exon isoforms are also removed. These isoforms are typically considered noise in transcriptome sequencing data and are often discarded.
Common Parameters (shared by more than one of the steps above):
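The idea behind the subset filter can be sketched as follows. This is not FLAIR's implementation: it assumes each isoform is represented only by its set of splice junctions, given as hypothetical (donor, acceptor) coordinate pairs, and it ignores read support and transcript ends, which the real tool may also take into account. Single-exon isoforms (no junctions) are left untouched here because they are handled by a separate option.

```python
def remove_subset_isoforms(isoforms):
    """Return the names of isoforms whose junction set is not a proper
    subset of another isoform's junction set.

    isoforms: dict mapping isoform name -> iterable of (donor, acceptor) pairs.
    """
    junction_sets = {name: frozenset(juncs) for name, juncs in isoforms.items()}
    kept = []
    for name, junctions in junction_sets.items():
        # '<' on frozensets tests for a proper subset.
        is_subset = bool(junctions) and any(
            junctions < other
            for other_name, other in junction_sets.items()
            if other_name != name
        )
        if not is_subset:
            kept.append(name)
    return kept


if __name__ == "__main__":
    example = {
        "iso_A": [(100, 200), (300, 400), (500, 600)],
        "iso_B": [(100, 200), (300, 400)],  # junctions all contained in iso_A
        "iso_C": [(100, 200), (350, 400)],  # uses a different acceptor site
        "iso_D": [],                        # single-exon isoform
    }
    print(remove_subset_isoforms(example))
```

In this toy example only iso_B is dropped, since every one of its junctions also appears in iso_A.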
Minimum Mapping Quality: minimum mapping quality of a read assignment to an isoform.
Stringent Mode: supporting reads must cover 80% of their isoform and extend at least 25 nt into the first and last exons. If those exons are themselves shorter than 25 nt, the requirement is that the read must start within 4 nt from the start or end within 4 nt from the end.
Check Splice Sites: enforces coverage of 4 out of 6 bp around each splice site and no insertions greater than 3 bp at the splice site.
Trust Ends: specify if reads are generated from a long read method with minimal fragmentation.
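The Stringent Mode rule above can be written out as a small check. This is a simplified sketch under stated assumptions, not FLAIR's code: the read is treated as a single genomic span (its own gaps are ignored), exons are half-open (start, end) intervals sorted by genomic position, "first" and "last" are taken in genomic order regardless of strand, and all names are hypothetical.

```python
def exon_overlap(read_start, read_end, exon_start, exon_end):
    """Number of bases shared by the read span and one exon."""
    return max(0, min(read_end, exon_end) - max(read_start, exon_start))


def passes_stringent(read_start, read_end, exons,
                     min_fraction=0.8, min_overhang=25, short_exon_slack=4):
    """Apply the 80% coverage and 25 nt terminal-exon rules to one read."""
    total = sum(end - start for start, end in exons)
    covered = sum(exon_overlap(read_start, read_end, s, e) for s, e in exons)
    if covered < min_fraction * total:
        return False

    first_start, first_end = exons[0]
    last_start, last_end = exons[-1]

    if first_end - first_start >= min_overhang:
        # The read must extend at least 25 nt into the first exon.
        first_ok = read_start <= first_end - min_overhang
    else:
        # Short first exon: the read must start within 4 nt of its start.
        first_ok = abs(read_start - first_start) <= short_exon_slack

    if last_end - last_start >= min_overhang:
        # The read must extend at least 25 nt into the last exon.
        last_ok = read_end >= last_start + min_overhang
    else:
        # Short last exon: the read must end within 4 nt of its end.
        last_ok = abs(read_end - last_end) <= short_exon_slack

    return first_ok and last_ok


if __name__ == "__main__":
    isoform = [(100, 200), (300, 400), (500, 600)]
    print(passes_stringent(120, 580, isoform))  # covers 260 of 300 exonic bases
    print(passes_stringent(150, 560, isoform))  # covers only 210 of 300
```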
Transcriptome Annotation: transcriptome annotation in GTF format. SQANTI3 can check the quality of the assembled transcriptome using this file.
Transcriptome Sequences: sequences of each isoform of the transcriptome in FASTA format.
Isoform-Read Map File: text file that links each defined isoform with the long reads collapsed into it.
Counts File: only if quantification has been applied. This file can be also used in SQANTI3 as the file with the full-length counts.
FLAIR has the following outputs:
Transcriptome Annotation (GTF file). This file can be included in SQANTI3 to do the quality control and characterization of transcripts.
Sequence Transcriptome (FASTA file). File with the sequence of all the defined isoforms.
Isoform-Read Map File. This file links each defined isoform with the long reads used to define it.
Counts File. File with the number of counts per isoform and per sample. This file can also be included in SQANTI3 as an additional file.
Report with information of the correction and collapsing steps.
Length Distribution Chart.
This report shows the number of valid and dismissed transcripts in the correction step, followed by the number of isoforms created and the number of transcripts used to build them. Finally, the chosen parameters are displayed.
Length Distribution Chart
Histogram with the distribution of lengths of the isoforms defined in the collapsing step. This histogram helps to identify the acceptable range of isoform lengths and to decide which length threshold to set in SQANTI3.
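A comparable length distribution can be recomputed outside OmicsBox from the Transcriptome Sequences FASTA output. The sketch below is only an illustration: the function names and the 500 nt bin width are arbitrary assumptions, and the binning used by the OmicsBox chart may differ.

```python
from collections import Counter


def sequence_lengths(fasta_lines):
    """Return the length of every sequence in a FASTA file, in file order."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def length_histogram(lengths, bin_width=500):
    """Count sequences per bin; keys are the lower bound of each bin."""
    counts = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Tiny in-memory example; a real file can be passed as an open handle.
    fasta = [">iso1\n", "ACGT" * 100 + "\n", ">iso2\n", "ACGT" * 300 + "\n"]
    print(length_histogram(sequence_lengths(fasta)))
```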