RNA-Seq de novo Assembly
De novo transcriptome assembly is a common bioinformatics analysis that reconstructs the transcriptome from RNA sequencing data by assembling short nucleotide sequences into longer ones without the use of a reference genome. This functionality is based on Trinity, a well-known de novo transcriptome assembler developed at the Broad Institute and the Hebrew University of Jerusalem.
Trinity combines three independent software modules (Inchworm, Chrysalis, and Butterfly), applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.
Please cite Trinity as:
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome." Nature Biotechnology, 29(7):644-52.
Run RNA-Seq de novo Assembly
- Sequencing Data: Choose the type of input reads: single-end or paired-end. Note that if paired-end is selected, two files per sample are required.
- Input Reads: Provide the files containing sequencing reads. These files are assumed to be in FASTQ format.
If your data comes from SRA, be sure to dump the FASTQ file like so:
- SRA_TOOLKIT/fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR3233859
- Paired-end configuration: For paired-end reads, a pattern is required to distinguish upstream files from downstream files. The patterns are matched immediately before the file extension, and the start of the name must be the same for both files of each sample.
- Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.
- Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.
For example, if the upstream file is named SRR037717_1.fastq and the downstream one SRR037717_2.fastq, you should establish "_1" as the upstream pattern and "_2" as the downstream pattern.
Figure 1: Input Data Page
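The pattern rule for paired-end files can be illustrated with a minimal Python sketch. It is not the application's own implementation: the function names and the list of recognized extensions are assumptions made for illustration only.

```python
# Hypothetical helper: pairs FASTQ files by the pattern found right before
# the extension. The extension list is an assumption for this sketch.
FASTQ_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def split_extension(name):
    """Return (stem, extension) for a FASTQ file name."""
    for ext in FASTQ_EXTENSIONS:
        if name.endswith(ext):
            return name[: -len(ext)], ext
    return name, ""


def pair_fastq_files(filenames, upstream="_1", downstream="_2"):
    """Map each sample name to its (upstream, downstream) file pair."""
    samples = {}
    for name in filenames:
        stem, _ = split_extension(name)
        # The pattern must sit immediately before the extension.
        if stem.endswith(upstream):
            samples.setdefault(stem[: -len(upstream)], {})["upstream"] = name
        elif stem.endswith(downstream):
            samples.setdefault(stem[: -len(downstream)], {})["downstream"] = name
    # Keep only samples for which both files were found.
    return {
        sample: (files["upstream"], files["downstream"])
        for sample, files in samples.items()
        if len(files) == 2
    }


pairs = pair_fastq_files(["SRR037717_1.fastq", "SRR037717_2.fastq"])
print(pairs)
```

With the file names from the example above, the shared start of the name (SRR037717) becomes the sample key, and files without a matching partner are left out.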
- K-mer Size: The k-mers of a read are all the subsequences of length k that it contains. In sequence assembly, k-mers are used during the construction of de Bruijn graphs. The choice of the k-mer size affects the resulting assembly, so it is advisable to try different values and compare the results to choose the best one. Trinity suggests using a k-mer size of 25 (default value). The maximum value allowed is 32.
- Strand Specificity: This option defines the strandedness of the RNA-seq reads:
- Non-Strand Specific: This refers to non-strand-specific protocols.
- Strand Specific Forward: For single-end data, the single read is in the sense (forward) orientation. In the case of paired-end data, the first read of the fragment pair is sequenced as sense (forward), and the second read is in the antisense (reverse) orientation.
- Strand Specific Reverse: For single-end data, the single read is in the antisense (reverse) orientation. In the case of paired-end data, the first read of the fragment pair is sequenced as antisense (reverse), and the second read is in the sense (forward) orientation. This configuration is typical of the dUTP/UDG sequencing method.
- Minimum Contig Length: Minimum assembled contig length to report. Trinity uses 200 bp as the default value.
- Assess the Read Content: To assess the read composition of the assembly, the input RNA-Seq reads are aligned to the transcriptome assembly using Bowtie2. Reads that map to the assembled transcripts are captured and counted, both those that are properly paired and those that are not. Check this option to obtain the read representation charts and table.
Note that this is a computationally expensive operation, so the process will take longer.
- Construct Super Transcripts: SuperTranscripts provide a gene-like view of the transcriptional complexity of a gene. A SuperTranscript is constructed by collapsing unique and common sequence regions among splicing isoforms into a single linear sequence.
- Minimizing Falsely Fused Transcripts: If the RNA-seq data under study are derived from a gene-dense, compact genome, check this option to minimize falsely fused transcripts. This option is only available for paired-end data and is highly recommended for compact fungal genomes.
Note that this is a computationally expensive operation, so avoid using it unless necessary.
- Pair Distance: Maximum length expected between fragment pairs (500 nucleotides by default). Reads outside this distance are treated as single-end.
Figure 2: Configuration Page
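To make the k-mer definition concrete, the following Python sketch lists the k-mers of a single read. It only illustrates the concept and does not reproduce how Trinity builds its graphs; the function name is hypothetical, and the range check simply mirrors the maximum of 32 stated above.

```python
def kmers(read, k=25):
    """Return every subsequence of length k contained in the read, in order."""
    if not 1 <= k <= 32:
        raise ValueError("k must be between 1 and 32")
    # A read of length n contains n - k + 1 k-mers (none if n < k).
    return [read[i : i + k] for i in range(len(read) - k + 1)]


print(kmers("ACGTAC", k=3))
```

A smaller k yields more k-mers per read and more overlaps between reads, while a larger k is more specific, which is one reason why trying several values can be worthwhile.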
- Transcript to Gene Mapping: Select a location to save the transcript-to-gene mapping file. This is a tab-delimited file that maps transcript (isoform) identifiers to gene identifiers. It can be used in downstream analyses such as Transcript-level Quantification.
Figure 3: Output Data Page
When the RNA-seq de novo assembly completes, it creates a sequence table containing the assembled transcript sequences (Figure 4). Trinity groups transcripts into clusters based on shared sequence content. Such a transcript cluster can be considered a 'gene'.
Figure 4: Sequence table project containing the sequences of the assembled transcripts
This grouping is encoded in the Trinity FASTA accession. For example, two isoforms of the same gene have accessions formatted like so:
- Isoform 1: TRINITY_DN869_c0_g1_i1
- Isoform 2: TRINITY_DN869_c0_g1_i2
The accession encodes the Trinity 'gene' and 'isoform' information. In the example above, the accession 'TRINITY_DN869_c0_g1_i1' indicates Trinity read cluster 'TRINITY_DN869_c0', gene 'g1', and isoform 'i1'. Because a given run of Trinity involves many clusters of reads, each of which is assembled separately, and because the 'gene' numbering is unique only within a given processed read cluster, the 'gene' identifier should be considered an aggregate of the read cluster and the corresponding gene identifier, which in this case would be 'TRINITY_DN869_c0_g1'.
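The relationship between transcript and gene identifiers can be sketched in Python as follows. This is an illustrative parser that assumes accessions follow exactly the pattern shown in the example (TRINITY_DN<number>_c<number>_g<number>_i<number>); the function names are hypothetical, and the column layout of the actual mapping file is not reproduced here.

```python
import re

# Assumed accession layout, based on the example accessions above.
ACCESSION = re.compile(
    r"^(?P<gene>(?P<cluster>TRINITY_DN\d+_c\d+)_g\d+)_(?P<isoform>i\d+)$"
)


def gene_id(transcript_id):
    """Return the aggregate gene identifier (read cluster plus gene)."""
    match = ACCESSION.match(transcript_id)
    if match is None:
        raise ValueError(f"unexpected accession format: {transcript_id}")
    return match.group("gene")


def transcript_to_gene(transcript_ids):
    """Map each transcript (isoform) identifier to its gene identifier."""
    return {tid: gene_id(tid) for tid in transcript_ids}


mapping = transcript_to_gene(["TRINITY_DN869_c0_g1_i1", "TRINITY_DN869_c0_g1_i2"])
print(mapping)
```

Both isoforms map to the same gene identifier, 'TRINITY_DN869_c0_g1', which is the aggregate of the read cluster and the gene number.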
If the Construct Super Transcripts option was checked, two additional outputs will be generated:
- SuperTranscripts in FASTA format.
- Transcript structure annotation in GFF format.
Furthermore, a result page will show a summary of the RNA-seq de novo assembly results (Figure 5). It contains the following information:
- Details of input FASTQ files.
- A results overview reporting the total number of transcripts and genes detected, the GC percentage, and the total number of assembled bases.
- Statistics based on the lengths of the assembled transcriptome contigs. The conventional Nx length statistic means that at least x% of the assembled transcript nucleotides are found in contigs of at least Nx length. For example, the N50 means that at least half of all assembled bases are in transcript contigs of at least the N50 length value.
- The RNA-Seq Read Representation, which allows assessing the read composition of the assembly. It shows the number of reads that map to the assembled transcripts, both those that are properly paired and those that are not (details below).
Figure 5: Summary report
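The Nx statistic described above can be computed with a short Python sketch. This is a generic illustration of the definition, not the code used to produce the report, and the function name and contig lengths are made up for the example.

```python
def nx_length(lengths, x=50):
    """Return the Nx length: the largest length L such that contigs of
    length >= L contain at least x% of all assembled bases."""
    threshold = sum(lengths) * x / 100
    running = 0
    # Accumulate bases starting from the longest contig.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0  # no contigs were given


contig_lengths = [100, 200, 300, 400, 500]  # made-up example values
print(nx_length(contig_lengths, x=50))
```

In this example the total is 1500 bases, so the N50 threshold is 750 bases. The two longest contigs (500 and 400) together hold 900 bases, so the N50 is 400.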
Finally, two charts showing the read representation of the assembly are generated (Figure 6 and Figure 7). These charts display the number of reads of each input file sorted into different categories (the second chart represents the same information as percentages). Bowtie2 is used to align the reads to the transcriptome, and the alignments are then counted by category: single-end reads or proper pairs, and improper or orphan read alignments.
Figure 6: Read Representation Chart
Figure 7: Read Representation (%) Chart