DNA-Seq Alignment

Introduction

Read alignment is a common process applied to high-throughput sequencing data, being one of the first stages required for many different types of analysis. In the DNA-Seq scenario, this process is applied for variant calling and before the polishing procedure. The goal of the read alignment is to map short sequencing reads efficiently to a large reference genome to identify the 'correct' genomic loci from which the read originated whilst taking into account errors in the sequence reads. The DNA-Seq Alignment functionality of OmicsBox is based on BWA (Burrows-Wheeler Aligner).

The Burrows-Wheeler Alignment algorithm is a read alignment package that is based on a backward search with Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. The BWA-MEM algorithm is used, which performs local alignment. It may produce multiple primary alignments for different parts of a query sequence.

Run DNA-Seq Alignment

This functionality can be found under Genome Analysis → DNA-Seq Alignment. The wizard allows to select input files and adjust analysis parameters (Figure 1, Figure 2, Figure 3 and Figure 4).

Input Data

  • Input Reads: Select the files containing sequencing reads. These files are assumed to be in FASTQ format. Both, single and paired-end data are accepted.

  • Paired-end Configuration: If paired-end reads are provided, a pattern to distinguish upstream files from downstream files is required. The provided patterns are searched in the filenames right before the extension. The beginning of the filenames should be the same for both files of each sample.

    • Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.

    • Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.

For example, if the upstream file is SRR037717_1.fastq and the downstream SRR037717_2.fastq, "_1" should be established as the upstream pattern and "_2" as the downstream pattern.

  • Reference Genome: Specify a FASTA file with the genome reference sequences. Multiple reference sequences (e.g. chromosomes or scaffolds) are allowed.

It is not recommended to provide masked genome sequences since the algorithm will force those reads that originate in repeats to map (falselyI somewhere else in the genome.

Figure 1: Input Data Page

Algorithm Options

  • Minimum Seed Length: Matches shorter than this value will be missed. The alignment speed is usually intensive to this value unless it significantly deviates 20.

  • Band Width: Essentially, baps longer than this value will not be found. Note that the maximum gap length is also affected by the scoring matrix and the hit length, not solely determined by this option.

  • Z-dropoff: Also known as Off-diagonal X-dropoff. Stop extension when the difference between the best and the current extension score is above |i-j|*A+Z-dropoff, where i and j are the current positions of the query and reference, respectively, and A is the matching score. Z-dropoff is similar to BLAST’s X-dropoff except that it does not penalize gaps in one of the sequences in the alignment. Z-dropoff not only avoids unnecessary extension but also reduces poor alignments inside a long good alignment.

  • Trigger Re-seeding: Look for internal seeds inside a seed longer than {Minimum Seed Length} * Trigger Re-seeding. This is a key heuristic parameter for tuning the performance. Larger value yields fewer seeds, which leads to faster to alignment speed but lower accuracy.

  • Seed Occurrence: Seed occurrence for the 3rd round seeding.

  • Skip Seeds: Discard a seed if it has more than this number of occurrences in the genome.

  • Drop Chains: Drop chains shorter than this fraction of the longest overlapping chain.

  • Discard Chains: Discard a chain if seeded bases shorter than this value.

  • Mate Rescue Rounds: Perform at most this number of rounds of mate rescues for each read.

  • Skip Mate Rescue: Skip the mate rescue procedure.

  • Skip Pairing: In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair. The mate rescue is performed, unless the Skip Mate Rescue option is also in use.

Figure 2: Algorithm Options Page

Scoring Options

  • Matching Score: Score for a sequence match.

  • Mismatch Penalty: Penalty for a mismatch. The sequence error rate is approximately: {.75 * exp[-log(4) * Mismatch Penalty/Matching Score]}.

  • Gap Open Penalty for Deletions: Gap open penalty for deletions.

  • Gap Open Penalty for Insertions: Gap open penalty for insertions.

  • Gap Extension Penalty for Deletions: Gap extension penalty for deletions. A gap of size k cost '{-O} + {-'Gap Extension Penalty'}*k'.

  • Gap Extension Penalty for Insertions: Gap extension penalty for insertions. A gap of size k cost '{-O} + {-'Gap Extension Penalty'}*k'.

  • 5'-end Clipping Penalty: Penalty for 5'-end clipping. When performing SW extension, BWA-MEM keeps track of the best score reaching the end of the query. If this score is larger than the best SW score minus the clipping penalty, clipping will not be applied. Note that in this case, the SAM AS tag reports the best SW score; the clipping penalty is not deducted.

  • 3’-end Clipping Penalty: Penalty for 3'-end clipping.

  • Unpaired Read Penalty: Penalty for an unpaired read pair.

Output Options

  • Minimum Score: Minimum score to output.

  • Mark Split Alignments as Primary: For split alignment, take the alignment with the smallest coordinate as primary.

  • Not Modify mapQ of Supp. Alignments: Do not modify the mapping quality of supplementary alignments.

  • Output All SE/Unpaired PE Alignments: Output all alignments for Single-End or unpaired Paired-End reads.

  • Soft Clipping for Supplementary: Use soft clipping for supplementary alignments.

  • Mark Shorter Split Hits as Secondary: Mark shorter split hits as secondary.

  • Sort BAM File: Establish how output BAM files should be sorted.

Figure 3: Scoring and Output Options Page

Output Data

  • Alignment Files: Select a destination folder to save output BAM files.

Figure 4: Output Data Page

Results

The main outputs are the BAM files. A BAM file (*.bam) is a compressed binary version (BGZF format) of a SAM file that is used to represent aligned sequences. SAM is a TAB-delimited text format consisting of a header section and an alignment section. Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as the mapping position, and a variable number of optional fields for flexible or aligner specific information.

For more information about the SAM format, visit the SAM Format Specification Page

You can check the meaning of a FLAG number using the SAM Flag Translator

In addition, a report and a chart are generated with complementary information. The report shows a summary of the DNA-Seq Alignment results (Figure 5). This page contains information about the reference genome sequences, the input FASTQ files, and a results overview. The last section is divided into several subsections: globals, paired information, ACTG content, coverage, mapping quality, insert size, mismatches and indels. The bar chart (Figure 6) shows the number of mapped and unmapped reads of each input file.

Finally, the Genome Browser allows you to visualize genomic coordinates (GFF/GTF) in a side-scrolling way. Several tracks can be added to the browser, the currently supported tracks are VCF, DNA Fasta and BAM. The BAM track shows the reads of a BAM file and if the sequence track is active, it will also highlight the differences between the read sequence and the sequence track. 

Figure 5: Summary Report

Figure 6: Alignments per Category chart