RNA-Seq Alignment

Content of this page:

Introduction

Read alignment is a common process applied to high-throughput sequencing data, being one of the first stages required for many different types of analysis. In the RNA-Seq scenario, this process is used to quantify gene expression. The goal of the read alignment is to map short sequencing reads efficiently to a large reference genome to identify the 'correct' genomic loci from which the read originated whilst taking into account errors in the sequence reads. The RNA-Seq Alignment functionality of OmicsBox is based on STAR (Spliced Transcript Alignment to a Reference). 

STAR is a fast RNA-Seq read mapper, with support for splice-junction and fusion read detection, and it was designed to align non-contiguous sequences directly to a reference genome. STAR aligns reads by finding maximal mappable prefix hits between reads (or read pairs) and the genome, using a suffix array index strategy. Different parts of a read can be mapped to different genomic positions, corresponding to splicing or RNA-fusions. If genomic annotations are provided, the genome index includes known splice-junctions from annotated gene models, allowing for sensitive detection of spliced reads. 

The binary nature of the suffix array search results in a favorable logarithmic scaling for the search time with the reference genome length. Since this approach has high memory requirements, it is executed in the Omics Cloud, which provides a high-performance computing environment, allowing fast alignments even against large genomes. 

Please cite STAR as:

Dobin A, Davis CA, Schlesinger F, et al (2012). "STAR: ultrafast universal RNA-seq aligner." Bioinformatics, 29(1):15-21.

Run RNA-Seq Alignment

This functionality can be found under Transcriptomics → RNA-Seq Alignment. The wizard allows to select input files and adjust analysis parameters (Figure 1Figure 2 and Figure 3).

Input

  • Sequencing Data: Choose single-end or paired-end reads. Note that if the paired-end option is selected, two files per sample are required.
  • Input Reads: Provide the files containing RNA sequencing reads. These files are assumed to be in FASTQ format or compressed FASTQ format (.gz). 
  • Paired-end configuration: In case of paired-end reads, a pattern to distinguish upstream files from downstream files is required. The provided patterns are searched in the filenames right before the extension. The beginning of the filenames should be the same for both files of each sample.
    • Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files. 
    • Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files. 

      For example, if the upstream file is SRR037717_1.fastq and the downstream SRR037717_2.fastq, "_1" should be established as the upstream pattern and "_2" as the downstream pattern.

  • Genome Sequences: Specify a FASTA file that contains the genome reference sequences. Multiple reference sequences, e.g. chromosomes or scaffolds, can be provided. It is strongly recommended to include major chromosomes as well as un-placed and un-localized scaffolds since a substantial number of reads may map to these scaffolds (e.g. ribosomal RNA). These reads would be reported as unmapped if the scaffolds are not included, or may be aligned to wrong loci on the chromosomes. On the other hand, patches and alternative haplotypes should not be included in the genome. 

Figure 1: Input Data Page

Configuration

  • Provide annotations: This option allows providing a file with annotated genes and transcripts in GTF/GFF format (GTF is recommended). The aligner will extract splice junctions from this file and use them to improve the accuracy of the alignment. While this is optional, using annotations is highly recommended. Chromosome names in the GTF annotation file have to match chromosome names in the FASTA genome sequences file. 
    • Annotation File: Select the file containing the annotated genes and transcripts in GTF/GFF format.
    • Overhang: Establish the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. This length should be equal to the length of the read -1. For instance, for 100 bp paired-end reads, the ideal value is 99. In case of reads with varying length, the ideal value is the maximum read length -1. 
  • 2-pass Mapping: This option allows a most sensitive novel junction discovery. The aligner algorithm is executed first to collect the junctions. These junctions are used for a second pass mapping. 
  • Sort by Coordinate: The aligner will output BAM files sorted by coordinates. 
  • Minimum Intron Length: Specify the minimum intron size. A genomic gap is considered intron if its length is equal to or greater than the given value. Otherwise, it is considered a deletion.
  • Maximum Intron Length: Specify the maximum intron size. 
  • Maximum Number of Mismatches: Set the maximum number of mismatches allowed per read or read pair. 
  • Maximum Number of Multiple Alignments: Establish the maximum number of multiple alignments allowed per read. If exceeded, the read is considered unmapped.
  • Include Chimeric Alignments: This option allows to include the chimeric alignments together with normal alignments in the main BAM file. The format of chimeric alignments follows the latest SAM/BAM specifications.
  • Maximum Distance Between Mates: Specify the maximum genomic distance between two mate pairs.

Figure 2: Configuration Data Page

Output

  • Save Splice Junctions: Save high confidence collapsed splice junctions in tab-delimited format. Note that STAR defines the junction start/end as intronic bases. Files will be named as sample_name.Sj.out.tab, and will be placed in the "Alignment Files" destination folder.
  • Save Unmapped Reads: Save unmapped and partially mapped. Files will be named as sample_name_Unmapped.fastq.gz, and will be placed in the "Alignment Files" destination folder. 
  • Alignment Files: Select a folder to save the output BAM files. Take into account that one BAM file will be generated for each input FASTQ sample, so make sure there is enough disk space to store them.

Figure 3: Output Data Page

Results

The main outputs are the BAM files (Figure 4). A BAM file (*.bam) is a compressed binary version (BGZF format) of a SAM file that is used to represent aligned sequences. SAM is a TAB-delimited text format consisting of a header section and an alignment section. Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as the mapping position, and a variable number of optional fields for flexible or aligner specific information:

  1. QNAME: Query template (read) name. In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.
  2. FLAGSAM flags summarize many properties of reads, represented by flag bits, into a single number:
    • Read is paired.
    • Read is mapped in a proper pair.
    • Read is unmapped.
    • Mate is unmapped.
    • Read reverse strand.
    • Mate reverse strand.
    • Read is from the first pair.
    • Read is from the second pair.
    • Alignment isn't primary.
    • Read fails platform/vendor quality checks.
    • Read is PCR or optical duplicate.
  3. RNAME: Reference sequence name. If @SQ header lines are present, RNAME must be present in one of the SQ-SN tag.
  4. POS: 1-based leftmost mapping position of the first CIGAR operation. The first base in a reference sequence has coordinate 1.
  5. MAPQ: Mapping quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.
  6. CIGAR: A string describing how the read aligns with the reference. It consists of one or more components. Each component comprises an operator and the number of bases which the operator applies to. Operators are: 
    • M: Align match.
    • I: Insertion to the reference. 
    • D: Deletion from the reference. 
    • N: Skipped region from the reference.
    • S: Soft clipping.
    • H: Hard clipping.
    • P: Padding (silent deletion from padded reference).
    • =: Sequence match
    • X: Sequence mismatch
  7. RNEXT:  Reference sequence name of the primary alignment of the next read in the template. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. 
  8. PNEXT: a 1-based position of the primary alignment of the next read in the template.
  9. TLEN: Signed observed template length.
  10. SEQ: Segment sequence.
  11. QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format).

In addition to these 11 obligatory fields, optional fields may be included. All optional fields follow the TAG:TYPE:VALUE format where TAG is a two-character string.

For more information about the SAM format, visit the SAM Format Specification Page

You can check the meaning of a FLAG number using the SAM Flag Translator

Figure 4: Output Directory

In addition to the BAM files, splice junctions and unmapped reads are generated if the corresponding options were checked.

Splice junctions are stored in tab-delimited format. The columns have the following meaning:

  1. Chromosome.
  2. The first base of the intron (1-based).
  3. The last base of the intron (1-based).
  4. Strand:
    1. 0: undefined.
    2. 1: +.
    3. 2: -.
  5. Intron motif:
    1. 0: non-canonical.
    2. 1: GT/AG.
    3. 2: CT/AC.
    4. 3: GC/AG.
    5. 4: CT/GC.
    6. 5: AT/AC.
    7. 6: GT/AT.
  6. Annotation:
    1. 0: unannotated.
    2. 1: annotated (only if splice junctions database is used)
  7. The number of unique mapping reads crossing the junction. 
  8. The number of multi-mapping reads crossing the junction.
  9. Maximum spliced alignment overhang.

The unmapped files contain the unmapped reads and partially mapped reads (i.e. mapped only one mate of a paired-end read). These reads are stored following the FASTQ specification. Note that if paired-end data were provided, two unmapped files are expected for each sample, one containing upstream unmapped reads and the other containing downstream unmapped reads.

Finally, a report and a chart are generated with complementary information. The report shows a summary of the RNA-Seq Alignment results (Figure 5). This page contains information about the reference genome sequences, the input FASTQ files, and a results overview. The last section is divided into four subsections: unique reads, multi-mapping reads, chimeric reads, and unmapped reads.

The bar chart (Figure 6) shows the number of reads of each input file sorted by different categories, according to how the read was aligned to the reference sequence:

  • Unique reads: Reads that have been assigned once to a location of the reference sequence. 
  • Multi-mapping reads: Reads that have been assign to more than one location of the reference sequence. 
  • Chimeric reads:  Reads that have been aligned to two distinct portions of the reference sequence. 
  • Unmapped Reads: Reads that have not been assigned to any reference transcript.

Finally, the Genome Browser allows you to visualize genomic coordinates (GFF/GTF) in a side-scrolling way. Several tracks can be added to the browser, the currently supported tracks are VCF, DNA Fasta and BAM. The BAM track shows the reads of a BAM file and if the sequence track is active, it will also highlight the differences between the read sequence and the sequence track. 

Figure 5: Summary Report


Figure 6: Alignments per Category Chart