DNA-Seq Polishing

Introduction

Third-generation DNA sequencing technologies allow scientists to generate longer sequence reads, which can be used in whole-genome sequencing projects to improve repeat resolution and produce more contiguous genome assemblies. However, although long-read sequencing can produce highly contiguous genomes, the relatively high error rate of long reads makes it challenging to generate a highly accurate final sequence. An effective strategy for obtaining highly contiguous assemblies with a very low overall error rate is to combine long reads with short-read data, using the short reads to “polish” the consensus built from the long reads.

The DNA-Seq Polishing application is based on Pilon. Pilon is a fully automated, all-in-one tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data but is particularly strong when supplied with paired-end data from short-read libraries (e.g. Illumina). Pilon significantly improves draft genome assemblies by correcting bases, fixing misassemblies and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling the identification of more biologically relevant genes.

Pilon requires as input a FASTA file of the genome along with one or more BAM files of reads aligned to the input FASTA file. Pilon uses read alignment analysis to identify inconsistencies between the input genome and the evidence in the reads. It then attempts to make improvements to the input genome, including:

  • Single base differences.

  • Small indels.

  • Larger indel or block substitution events.

  • Gap filling.

  • Identification of local misassemblies, including optional opening of new gaps.

Pilon then outputs a FASTA file containing an improved representation of the genome from the read data.
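For readers who want to see how these inputs and outputs relate outside the wizard, the following is a minimal sketch that assembles a stand-alone Pilon command line. It assumes Pilon is run as a Java jar; the jar path, the file names and the function name are hypothetical, and the use of the --bam option (rather than a library-specific option) is an assumption about how the BAM files would be passed. The command is only built, not executed.

```python
from typing import List, Sequence


def build_pilon_command(
    genome_fasta: str,
    bam_files: Sequence[str],
    output_prefix: str,
    pilon_jar: str = "pilon.jar",  # hypothetical location of the Pilon jar
) -> List[str]:
    """Build (but do not run) a Pilon command line as an argument list."""
    if not bam_files:
        raise ValueError("Pilon needs at least one BAM file of aligned reads")

    command = ["java", "-jar", pilon_jar, "--genome", genome_fasta]
    for bam in bam_files:
        # One --bam entry per alignment file; each BAM must be aligned
        # to the same FASTA that is passed with --genome.
        command += ["--bam", bam]
    # Pilon writes the polished sequences using this prefix.
    command += ["--output", output_prefix]
    return command


if __name__ == "__main__":
    # File names below are placeholders for illustration only.
    print(" ".join(build_pilon_command("draft.fasta", ["reads.sorted.bam"], "polished")))
```

Returning an argument list instead of a single string keeps the sketch safe to hand to a process launcher without shell quoting issues.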

Run DNA-Seq Polishing

This functionality can be found under Genome Analysis → DNA-Seq Polishing. The wizard allows you to select input files and adjust analysis parameters (Figures 1 to 4).

Input Data

  • Input Fasta: Specify a FASTA file with the genome draft sequences to be polished. Multiple reference sequences are allowed (e.g. chromosomes or scaffolds).

  • Input BAMs: Provide one or more BAM files of reads aligned to the input draft genome. BAM files can be obtained with the DNA-Seq Alignment functionality by aligning short reads to the draft genome.

Figure 1: Input Page

Control Options

  • Diploid: Check this option if the sample is from a diploid organism. This will eventually affect the calling of heterozygous SNPs.

  • Issues to Fix: Select the categories of issues to try to fix:

    • SNPs: Try to fix individual base errors.

    • Indels: Try to fix small indels.

    • Gaps: Try to fill gaps.

    • Local Misassemblies: Try to detect and fix local misassemblies.

    • Ambiguous Bases*: Fix ambiguous bases to the most likely alternative.

    • Breaks*: Allow local reassembly to open new gaps. Works with the “Local Misassemblies” category.

    • Circular Elements*: Try to close circular elements when used with long corrected reads.

    • Novel Sequence*: Assemble novel sequence from unaligned non-jump reads.

*Experimental fix types. By default, Pilon corrects for SNPs, indels, gaps and local misassemblies.

  • Duplicates: Use reads marked as duplicates in the input BAMs (ignored by default).

  • IUPAC: Pilon will use IUPAC nucleotide codes in the output FASTA file to represent ambiguous bases and/or heterozygous SNPs.

  • Failed Sequencer Quality: Use reads which failed sequencer quality filtering (ignored by default).

Figure 2: Control Options Page
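As an illustration of how the control options above might translate into Pilon command-line arguments, here is a hedged sketch. The mapping from wizard labels to Pilon --fix tokens and flags is an assumption about what the application does internally, not a description of its actual implementation; the function and dictionary names are hypothetical.

```python
from typing import List, Sequence

# Assumed mapping: wizard label -> token accepted by Pilon's --fix option.
FIX_TOKENS = {
    "SNPs": "snps",
    "Indels": "indels",
    "Gaps": "gaps",
    "Local Misassemblies": "local",
    "Ambiguous Bases": "amb",       # experimental
    "Breaks": "breaks",             # experimental
    "Circular Elements": "circles",  # experimental
    "Novel Sequence": "novel",      # experimental
}

# The four categories that are corrected by default.
DEFAULT_FIXES = ("SNPs", "Indels", "Gaps", "Local Misassemblies")


def control_options_to_args(
    fixes: Sequence[str] = DEFAULT_FIXES,
    diploid: bool = False,
    duplicates: bool = False,
    iupac: bool = False,
    failed_quality: bool = False,
) -> List[str]:
    """Translate wizard-style control options into Pilon arguments."""
    unknown = [name for name in fixes if name not in FIX_TOKENS]
    if unknown:
        raise ValueError(f"Unknown fix categories: {unknown}")

    args = ["--fix", ",".join(FIX_TOKENS[name] for name in fixes)]
    if diploid:
        args.append("--diploid")
    if duplicates:
        args.append("--duplicates")   # also use reads marked as duplicates
    if iupac:
        args.append("--iupac")        # IUPAC codes in the output FASTA
    if failed_quality:
        args.append("--nonpf")        # also use reads that failed QC filtering
    return args
```

For example, `control_options_to_args(diploid=True)` yields the default fix list followed by the diploid flag.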

Heuristics Options

  • Default Quality: Assumes bases are of this quality if qualities are not present in the input BAMs.

  • Flank: Controls how much of the well-aligned reads will be used; this many bases at each end of the good reads will be ignored.

  • Gap Margin: Closed gaps must be within this number of bases of true size to be closed.

  • K-mer Size: K-mer size used by internal assembler.

  • Minimum Depth: Variants (SNPs and indels) will only be called if there is coverage of good pairs at this depth or more. If this value is >= 1, it is an absolute depth. If it is a fraction < 1, then the minimum depth is computed by multiplying this value by the mean coverage for the region, with a minimum value of 5.

The default value is 0.1. This means that the depth to call is 10% of mean coverage or 5, whichever is greater.

  • Unclosed Gaps: Minimum size for unclosed gaps.

  • Minimum Mapping Quality: Minimum alignment mapping quality for a read to count in pileups.

  • Minimum Base Quality: Minimum base quality to consider for pileups.

  • Skip Stray Pairs Identification: Skip making a pass through the input BAM files to identify stray pairs, that is, pairs in which both reads are aligned but not marked valid because they have inconsistent orientation or separation. Identifying stray pairs can help fill gaps and assemble larger insertions, especially of repeat content.

Figure 3: Heuristics Options
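The Minimum Depth rule described above can be written out as a small worked example. This is a sketch of the rule as stated on this page, not Pilon's own code; the function name is hypothetical.

```python
def effective_min_depth(min_depth: float, mean_coverage: float) -> float:
    """Return the depth threshold implied by the Minimum Depth setting.

    Values >= 1 are absolute depths. Values < 1 are fractions of the
    mean coverage of the region, with a floor of 5.
    """
    if min_depth <= 0:
        raise ValueError("min_depth must be positive")
    if min_depth >= 1:
        return float(min_depth)
    return max(5.0, min_depth * mean_coverage)


if __name__ == "__main__":
    # With the default of 0.1: 10% of mean coverage or 5, whichever is greater.
    for coverage in (30, 120):
        print(coverage, effective_min_depth(0.1, coverage))
```

With the default value of 0.1, a region with a mean coverage of 30 falls back to the floor of 5, while a region with a mean coverage of 120 requires a depth of about 12.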

Output Data

  • Output FASTA: Select a file where the polished sequences will be placed.

  • Save Changes: Pilon produces a file containing a space-delimited record of every change made in the assembly. Check this option to obtain this file.

  • Output Changes: Select a file where the “changes” file will be placed. The format for the file is as follows: <Original Scaffold Coordinate> <New Scaffold Coordinate> <Original Sequence> <New Sequence>.

Figure 4: Output Page

To improve performance, both input sequences and alignments are divided into 100 MB batches.


Results

Pilon generates a FASTA file (polished_sequences.fasta) containing the improved genomic sequences. Pilon renames the sequence headers by appending “_pilon” to each FASTA element name. If the “Save Changes” option is checked, Pilon returns a text file reporting all changes applied to the input sequences. The format for this space-delimited file is as follows: <Original Scaffold Coordinate> <New Scaffold Coordinate> <Original Sequence> <New Sequence>.

# Deletion
contig_103:1825 contig_103_pilon:1825 T .
# Insertion
contig_103:233958 contig_103_pilon:233948 . C
# SNP
contig_103:364767 contig_103_pilon:364756 A G
# Segmental
contig_103:1054454-1054491 contig_103_pilon:1054403-1054440 CTAAATGGTAGTTGAGAATAGTGGCTACAAGAATTATA GTAAATGGTAGTTGAGAATAGTGGCTAACAGAATCATT
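The records above could be read programmatically along the following lines. This is a minimal sketch: the classification rule (a “.” marks the absent side of an insertion or deletion, single bases on both sides are a SNP, anything else is segmental) is inferred from the four examples on this page, and the type and function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Change(NamedTuple):
    original_coord: str
    new_coord: str
    original_seq: str
    new_seq: str
    kind: str


def classify_change(original_seq: str, new_seq: str) -> str:
    """Classify a record using the rule inferred from the examples."""
    if new_seq == ".":
        return "deletion"
    if original_seq == ".":
        return "insertion"
    if len(original_seq) == 1 and len(new_seq) == 1:
        return "snp"
    return "segmental"


def parse_changes(lines: Iterable[str]) -> List[Change]:
    """Parse space-delimited change records, skipping blanks and '#' lines."""
    changes = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"Expected 4 fields, got {len(fields)}: {line!r}")
        original_coord, new_coord, original_seq, new_seq = fields
        kind = classify_change(original_seq, new_seq)
        changes.append(Change(original_coord, new_coord, original_seq, new_seq, kind))
    return changes


if __name__ == "__main__":
    example = [
        "contig_103:1825 contig_103_pilon:1825 T .",
        "contig_103:233958 contig_103_pilon:233948 . C",
        "contig_103:364767 contig_103_pilon:364756 A G",
    ]
    # Tally the records by type.
    print(Counter(change.kind for change in parse_changes(example)))
```

Tallying the `kind` field gives per-type counts that can be compared against the numbers in the summary report.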

In addition to the resulting files, a report and two charts are generated. The report shows a summary of the DNA-Seq Polishing results (Figure 5). This page contains information about the input sequencing data and a results overview. The Results Overview section shows the number of changes of each type that have been applied to the input sequences.

Figure 5: Summary Report

The Nx plot (Figure 6) shows Nx values as x varies from 0 to 100 %. The Nx values are displayed for contigs/scaffolds before and after polishing. The Fix Type Distribution Chart (Figure 7) displays the proportion of changes of each type that have been applied to the input sequences.

Figure 6: Nx Plot

Figure 7: Fix Type Distribution Chart
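For reference, the Nx values shown in the Nx plot can be computed from a list of contig or scaffold lengths. The sketch below uses the standard definition (Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x % of the total assembly length); whether the application computes the plot exactly this way is an assumption, and the function name is hypothetical.

```python
from typing import Iterable


def nx(lengths: Iterable[int], x: float) -> int:
    """Return the Nx value (e.g. x=50 gives N50) for sequence lengths."""
    if not 0 <= x <= 100:
        raise ValueError("x must be between 0 and 100")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the cumulative length to the
        # threshold defines the Nx value.
        if running >= threshold:
            return length
    raise ValueError("lengths must not be empty")


if __name__ == "__main__":
    # Made-up lengths for illustration only.
    contig_lengths = [80, 70, 50, 40, 30, 20, 10]
    for x in (0, 50, 90, 100):
        print(f"N{x} = {nx(contig_lengths, x)}")
```

Computing the values for the sequence lengths before and after polishing gives the two curves that the plot compares.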