RNA-Seq de novo Assembly
De novo transcriptome assembly is a common bioinformatics analysis. It reconstructs the transcriptome from RNA sequencing data by assembling short nucleotide sequences into longer ones without the use of a reference genome. This functionality is based on Trinity, a well-known de novo sequence assembler developed at the Broad Institute and the Hebrew University of Jerusalem.
Trinity combines three independent software modules (Inchworm, Chrysalis, and Butterfly) that are applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.
Please cite Trinity as:
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome." Nature Biotechnology, 29(7):644-52.
Sequencing Data: Choose the type of data to be assembled: single-end or paired-end reads. Note that if paired-end is selected, two files per sample are required.
Sequencing Format: Select the format in which the sequencing reads are provided. All files should contain reads in the same format, FASTA or FASTQ.
Input Reads: Provide the files containing sequencing reads. These files are assumed to be in FASTQ format.
If your data comes from SRA, be sure to dump the FASTQ file like so:
SRA_TOOLKIT/fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR3233859
Paired-end configuration: In the case of paired-end reads, the pattern to distinguish upstream files from downstream files is required. The provided patterns are searched right before the extension, and the start of the name should be the same for both files of each sample.
Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.
Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.
For example, if the upstream file is named SRR037717_1.fastq and the downstream one SRR037717_2.fastq, you should establish "_1" as the upstream pattern and "_2" as the downstream pattern.
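The pairing rule described above can be sketched in a few lines of Python. This is only an illustration of the naming convention, not the code the tool runs: the function names, the list of recognized extensions, and the returned structure are all assumptions made for the example.

```python
# Hypothetical helper: the extension list below is an assumption for this sketch.
EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def strip_extension(name):
    """Return the file name without a recognized read-file extension."""
    for ext in EXTENSIONS:
        if name.endswith(ext):
            return name[: -len(ext)]
    return name


def pair_reads(filenames, upstream="_1", downstream="_2"):
    """Group file names into (upstream, downstream) pairs per sample.

    The pattern is looked for right before the extension, and the text
    before the pattern is taken as the sample name shared by both files.
    """
    pairs = {}
    for name in sorted(filenames):
        stem = strip_extension(name)
        if stem.endswith(upstream):
            sample, slot = stem[: -len(upstream)], 0
        elif stem.endswith(downstream):
            sample, slot = stem[: -len(downstream)], 1
        else:
            continue  # file matches neither pattern
        pairs.setdefault(sample, [None, None])[slot] = name
    return {sample: tuple(files) for sample, files in pairs.items()}


print(pair_reads(["SRR037717_2.fastq", "SRR037717_1.fastq"]))
```

A sample whose tuple still contains None after pairing is missing one of its two files, which is worth checking before starting the assembly.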
Figure 1: Input Data Page
Strand Specificity: This option defines the strandedness of the RNA-seq reads:
Non-Strand Specific: This refers to non-strand-specific protocols.
Strand Specific Forward: For single-end data, the single read is in the sense (forward) orientation. For paired-end data, the first read of each fragment pair is sequenced as sense (forward), and the second read is on the antisense (reverse) strand.
Strand Specific Reverse: For single-end data, the single read is in the antisense (reverse) orientation. For paired-end data, the first read of each fragment pair is sequenced as antisense (reverse), and the second read is on the sense (forward) strand. This is typical of the dUTP/UDG sequencing method.
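On the Trinity command line, strandedness is expressed with the --SS_lib_type parameter, which takes F or R for single-end reads and FR or RF for paired-end reads. The sketch below shows how the three options above would map onto those values. The function name and the option strings are hypothetical, and the assumption that the tool passes exactly these values to Trinity is ours.

```python
def ss_lib_type(strandedness, paired_end):
    """Map a strandedness choice to a Trinity --SS_lib_type value.

    strandedness: "non-specific", "forward" or "reverse" (labels invented
    for this sketch). Returns None when no --SS_lib_type should be set.
    """
    if strandedness == "non-specific":
        return None
    table = {
        ("forward", False): "F",   # single read is sense
        ("forward", True): "FR",   # first read sense, second antisense
        ("reverse", False): "R",   # single read is antisense
        ("reverse", True): "RF",   # first read antisense (dUTP/UDG)
    }
    return table[(strandedness, paired_end)]


print(ss_lib_type("reverse", paired_end=True))
```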
Minimum Contig Length: Minimum assembled contig length to report. Trinity uses 200 bp as default value.
Assess the Read Content: To assess the read composition of the assembly, the input RNA-Seq reads are aligned to the transcriptome assembly using Bowtie2. Reads that map to the assembled transcripts are captured and counted, including both properly paired reads and those that are not. Check this option to obtain the read representation charts and table.
Note that this is a computationally expensive operation, so the process will take more time.
Construct Super Transcripts: SuperTranscripts provide a gene-like view of the transcriptional complexity of a gene. A SuperTranscript is constructed by collapsing unique and common sequence regions among splicing isoforms into a single linear sequence.
Do Not Normalize Reads: Trinity normalizes input reads to optimize the assembly procedure. Set this option to skip this step.
Normalization is highly recommended to deal with large datasets. Turning off normalization is not recommended for most applications.
Normalization Maximum Read Coverage: Set the maximum read coverage to which the data will be normalized.
Minimizing Falsely Fused Transcripts: If the RNA-seq data under study are derived from a gene-dense, compact genome, use this option to minimize falsely fused transcripts. It is only available for paired-end data and is highly recommended for compact fungal genomes.
Note that this is a computationally expensive operation, so avoid using it unless necessary.
Pair Distance: Maximum length expected between fragment pairs (500 nucleotides by default). Reads outside this distance are treated as single-end.
Figure 2: Configuration Page 1
Minimum K-mer Coverage: The minimum count for K-mers to be assembled by the Inchworm algorithm.
Maximum Reads Per Graph: The maximum number of reads to anchor within a single graph.
Minimum Glue: The minimum number of reads needed to glue two Inchworm contigs.
Maximum Cluster Size: The maximum number of Inchworm contigs to be included in a single Chrysalis cluster.
Assembly Algorithm: The assembly algorithm to use during the Butterfly step: the original algorithm or Pasafly. Pasafly is a PASA-like algorithm for maximally supported isoforms.
Path Reinforcement Distance: The minimum overlap of reads with the growing transcript path. Set to 1 for the most lenient path extension requirements.
No Path Merging: By default, alternative transcript candidates are merged if they are found to be too similar, taking into account similarity, mismatches, and gaps. When two alternative transcripts are found to be too similar, the transcript with the greatest cumulative compatible read (pair-path) support is retained and the other is discarded. If this option is checked, all final transcript candidates are output (including SNP variations).
Minimum Percent Identity: Minimum percent identity for two paths to be merged into single paths. The identity is calculated as the number of matches divided by the shorter length.
Maximum Allowed Differences: Maximum allowed differences encountered between path sequences to combine them.
Maximum Internal Gap: The maximum number of internal consecutive gap characters allowed for paths to be merged into single paths.
The parameters on this page are only for expert or experimental usage.
Figure 3: Configuration Page 2
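The Minimum Percent Identity and Maximum Internal Gap criteria on this page can be made concrete with a small worked example. The sketch below computes both values for two already aligned path sequences, using "-" as the gap character. It is an illustration of the definitions given above (identity as matches divided by the shorter ungapped length), not Butterfly's actual implementation, and the function name is hypothetical.

```python
import re


def path_similarity(aligned_a, aligned_b):
    """Return (percent identity, longest internal gap run) for two aligned paths."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A match is a column where both paths carry the same base.
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    # Identity is computed against the shorter of the two ungapped paths.
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    identity = 100.0 * matches / shorter
    # Internal gaps only: leading and trailing gap characters are ignored.
    runs = [
        len(run)
        for seq in (aligned_a, aligned_b)
        for run in re.findall(r"-+", seq.strip("-"))
    ]
    return identity, max(runs, default=0)


print(path_similarity("ACGTACGTAC", "ACGT--GTAC"))
```

In this example the identity is 100% even though the second path lacks two bases, because the divisor is the shorter length; the internal gap of 2 is what distinguishes the paths, which is why the two thresholds are used together.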
Transcript to Gene Mapping: Select a location to place the “transcript to gene” mapping file. It is a tab-delimited file with the information to map from transcript (isoform) identifiers to gene identifiers. It could be used in downstream analysis such as Transcript-level Quantification.
Figure 4: Output Data Page
When the RNA-seq de novo assembly completes, it creates a sequence table containing the assembled transcript sequences (Figure 5). Trinity groups transcripts into clusters based on shared sequence content. Such a transcript cluster can be considered a 'gene'.
This information is encoded in the Trinity FASTA accession. For example, the accessions of two isoforms of the same gene are formatted like so:
Isoform 1: TRINITY_DN869_c0_g1_i1
Isoform 2: TRINITY_DN869_c0_g1_i2
The accession encodes the Trinity 'gene' and 'isoform' information. In the example above, the accession 'TRINITY_DN869_c0_g1_i1' indicates Trinity read cluster 'TRINITY_DN869_c0', gene 'g1', and isoform 'i1'; the second accession differs only in its isoform, 'i2'. Because a given run of Trinity involves many clusters of reads, each of which is assembled separately, and because the 'gene' numbering is unique only within a given processed read cluster, the 'gene' identifier should be considered an aggregate of the read cluster and the corresponding gene identifier, which in this case is 'TRINITY_DN869_c0_g1'.
If the Construct Super Transcripts option was checked, two additional outputs will be generated:
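As a minimal sketch, the accession structure can be split with a regular expression and used to group isoforms under their aggregate gene identifier. The pattern below is inferred from the example accessions only, and the function names are hypothetical; accessions from other Trinity versions may follow a different layout.

```python
import re
from collections import defaultdict

# Pattern inferred from accessions such as TRINITY_DN869_c0_g1_i1 (assumption).
ACCESSION = re.compile(r"^(TRINITY_DN\d+_c\d+)_(g\d+)_(i\d+)$")


def parse_accession(accession):
    """Split a Trinity accession into read cluster, aggregate gene id and isoform."""
    match = ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"unrecognized Trinity accession: {accession}")
    cluster, gene, isoform = match.groups()
    # The gene id is only unique when combined with its read cluster.
    return {"cluster": cluster, "gene": f"{cluster}_{gene}", "isoform": isoform}


def group_by_gene(accessions):
    """Collect transcript accessions under their aggregate gene identifier."""
    genes = defaultdict(list)
    for accession in accessions:
        genes[parse_accession(accession)["gene"]].append(accession)
    return dict(genes)


print(group_by_gene(["TRINITY_DN869_c0_g1_i1", "TRINITY_DN869_c0_g1_i2"]))
```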
SuperTranscripts in FASTA format.
Transcript structure annotation in GFF format.
Figure 5: Sequence table project containing the sequences of the assembled transcripts
Furthermore, a result page will show a summary of the RNA-seq de novo assembly results (Figure 6). It contains the following information:
Details of input FASTQ files.
A results overview reporting the total number of transcripts and genes detected, the GC percentage, and the total assembled bases.
Statistics based on the lengths of the assembled transcriptome contigs. The conventional Nx length statistic means that at least x% of the assembled transcript nucleotides are found in contigs that are at least Nx in length. For example, the N50 means that at least half of all assembled bases are in transcript contigs of at least the N50 length value.
The RNA-Seq Read Representation, which allows assessing the read composition of the assembly. It shows the number of reads that map to the assembled transcripts, including both properly paired reads and those that are not (details below).
Figure 6: Summary report
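The Nx statistic listed in the summary can be computed directly from the contig lengths. The following sketch follows the definition given above; the function name is hypothetical, and the lengths are made-up values chosen so the arithmetic is easy to check by hand.

```python
def nx(lengths, x=50):
    """Return the Nx length: the largest L such that contigs of length >= L
    contain at least x% of all assembled bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return 0  # empty input


contig_lengths = [100, 200, 300, 400, 500]  # 1500 bases in total
print(nx(contig_lengths))      # N50
print(nx(contig_lengths, 90))  # N90
```

With these lengths, the two longest contigs (500 + 400 = 900 bases) already cover more than half of the 1500 assembled bases, so the N50 is 400.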
Finally, two charts showing the read representation of the assembly are generated (Figure 7). The first chart displays the number of reads from each input file, broken down by category; the second chart shows the same information as percentages. Bowtie2 is used to align the reads to the transcriptome, and the aligned single-end reads or proper pairs, as well as the improper or orphan read alignments, are then counted.
Figure 7: Read Representation Chart