DNA-Seq de Novo Assembly

Content of this page:

Introduction

Genome assembly refers to the process of taking a large number of short DNA reads and putting them back together to create a representation of the whole genome from which the DNA originates. De novo genome assemblies assume no prior knowledge of the source DNA sequence length, layout or composition (i.e. no reference genome is available). The goal of an assembler is to produce long contiguous pieces of sequences (contigs) from DNA-seq reads. The contigs are then joined together to form scaffolds where possible. Short-insert paired reads provide increased information for maximizing sequencing coverage, while long-insert mate paired-end reads can pair sequence fragments across greater distances. This is especially helpful to cover highly repetitive regions.

This functionality can be found under genome analysis → DNA-Seq De novo Assembly.

Two assembly strategies are available:

  • ABySS: ABySS (Assembly By Short Sequences) is a de novo, parallel, paired-end sequence assembler that is designed for short reads. It implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph. ABySS is capable of assembling large genomes.

  • SPAdes: SPAdes (St Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines based on the Bruijn Graph. SPAdes works with Illumina and IonTorrent data and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. SPAdes is designed for small genomes, and allows to assemble single-cell MDA data as well as standard isolates.

ABySS

ABySS 2.0 is a multistage de novo assembly pipeline consisting of unitig, contig, and scaffold stages.

  • At the unitig stage, the program performs the initial assembly of sequences according to the De Bruijn graph assembly algorithm. The unitig stage loads the full set of k-mers from the input sequencing reads into a hash table and stores auxiliary data for each k-mer such as the number of k-mer occurrences in the reads and the presence/absence of possible neighbor k-mers in the De Bruijn graph.

  • At the contig stage, the paired-end reads are aligned to the unitigs and the pairing information is used to orient and merge overlapping unitigs.

  • At the scaffold stage, the mate-pair reads are aligned to the contigs to orient and join them into scaffolds, inserting runs of "N" characters at gaps in coverage and for unresolved repeats.

The main innovation of ABySS 2.0 is a Bloom filter-based implementation of the unitig assembly stage. It reduces the overall memory requirements, enabling assembly of large genomes. A Bloom filter is a compact data structure for representing a set of elements that supports operations of inserting elements and querying the presence of elements. The Bloom filter data structure consists of a bit vector and one or more hash functions, where the hash functions map each k-mer to a corresponding set of positions within the bit vector (bit signature for the k-mer).

During unitig assembly, two passes are made through the input sequencing reads:

  1. In the first pass, k-mers are extracted from the reads and are loaded into a Bloom filter. The program discards all k-mers with an occurrence count below a user-specified threshold (typically in the range of two to four). In this way, k-mers caused by sequencing errors are filtered out. The retained k-mers are known as solid k-mers.

  2. In the second pass, the program identifies reads that consist entirely of solid k-mers, and extend them left and right within the De Bruijn graph to create unitigs.

Please cite ABySS 2.0 as:

Jackman SD, Vandervalk BP, Mohamadi H, et al (2017). "ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter". Genome Res. 2017;27(5):768-777.

Input Data

  • Input Reads: First, choose the type of sequencing data. Then, select the files of this type of data for the assembly. Both paired-end and single-end short reads can be provided, and both types of data can be combined in the same run.

  • Additional Data: ABySS supports additional data types as supplementary information:

    • Additional Paired-end Libraries: Paired-end libraries that will be used only for merging unitigs into contigs and will not contribute toward the consensus sequence.

    • Mate-pair Libraries: Mate-Pair libraries that will be used for scaffolding. Mate-Pair libraries that will be used for scaffolding. Mate-pair libraries do not contribute toward the consensus sequence.

    • Linked Reads: Linked reads from 10x Genomics Chromium. The linked reads are used to correct assembly errors and scaffolding.

    • Long Sequences Libraries: Provide long sequence libraries (such as RNA-Seq contigs) that will be used for rescaffolding. Long sequence libraries do not contribute toward the consensus sequence.

  • Paired-end Configuration: If paired-end reads are provided, a pattern to distinguish upstream files from downstream files is required. The provided patterns are searched in the filenames right before the extension. The beginning of the filenames should be the same for both files of each sample.

    • Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.

    • Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.

For example, if the upstream file is SRR037717_1.fastq and the downstream SRR037717_2.fastq, "_1" should be established as the upstream pattern and "_2" as the downstream pattern.

Figure 1: Input Data Page

Configuration

  • K-mer Size: The term k-mer refers to all possible subsequences of the given length that are contained in a read. In sequence assembly, k-mers are used during the construction of De Bruijn graphs. The choice of the k-mer size has many different effects on the sequence assembly, it is advisable to try different values and check the results to choose the best one. It is recommended to use odd values of at least half the length of the reads.

  • Use paired De Bruijn graph: Assembly will be performed using a paired De Bruijn graph. In this mode, k-mer pairs are used, which consist of two equal-sized k-mers separated by a fixed distance. To assemble using the paired De Bruijn graph mode, specify the k-mer pair span (distance between k-mers).

  • K-mer Pair Span: Set the span of a k-mer pair (distance between k-mers).

  • Minimum Alignment Length: Establish the minimum alignment length of a read (bp). This means that there must be a perfect match of the established length between each read and its target contig.

  • Hash Functions: Set the number of Bloom filter hash functions. K-mers from each input sequencing read are loaded into the Bloom filter by computing the hash values of each k-mer sequence and setting the corresponding bit.

  • K-mer Count Threshold: Set the k-mer count threshold for Bloom filter assembly. Optimal values are typically in the range of 2-4. K-mers with an occurrence count below the threshold will be discarded.

Figure 2: Configuration Page

Output

  • Unitigs Fasta: Where to store the Fasta file containing the assembled unitigs.

  • Contigs Fasta: Where to store the Fasta file containing the assembled contigs.

  • Scaffolds Fasta: Where to store the Fasta file containing the assembled scaffolds.

  • Long Scaffolds Fasta: Where to store the Fasta file containing the assembled long scaffolds. Note that this file is only generated if long sequence libraries were provided.

Figure 3: Output Data Page

Results

ABySS returns the assembled sequences in three FASTA files (four if long sequence libraries were provided). Each one corresponds to a different stage of the assembly procedure:

  • Unitigs: Contains sequences assembled without using paired-end information. In case you provide only single-end data, this will be the only result file, since pairing information is required to assemble contigs.

  • Contigs: Contains sequences assembled with paired information, scaffolding over sequencing coverage gaps, but no repeats.

  • Scaffolds: Contains sequences assembled with paired information, scaffolding over sequencing coverage gaps and repeats.

  • Long Scaffolds: Contains sequences that were obtained by rescaffolding using long sequences libraries.

In addition to the resulting FASTA files, a report and a chart are generated. The report shows a summary of the DNA-Seq De Novo Assembly results (Figure 4). This page contains information about the input sequencing data and a results overview. The Results Overview table shows a number of common statistics used to describe the quality of a sequence assembly:

  • N50: This statistic defines the assembly quality in terms of contiguity. N50 is calculated by first ordering every unitig, contig or scaffold from longest to shortest. Next, starting from the longest sequence, the lengths of each sequence are summed up, until this running sum equals one-half of the total length of all sequences in the assembly. The N50 of the assembly is the length of the shortest contig in this list. Higher values of N50 indicate a better assembly. Note that any Nx statistic is calculated in the same way, e.g. N75 is calculated summing up all the lengths until the sum equals 75% of the total length.

  • L50: Defined as the smallest number of contigs whose lengths sum makes up half of the total assembly length.

  • Bloom filter False Positive Rate (FPR): The Bloom filter can generate false positives when the bit signatures of different k-mers overlap by chance. This means that a certain fraction of k-mer queries will return true even though the k-mers do not exist in the input sequencing data. Users are recommended to target a Bloom filter false positive rate (FPR) smaller than 5%. Parameters such as the k-mer size, hash functions or k-mer count threshold can influence the false positive rate.

The Nx plot (Figure 5) shows Nx values as x varies from 0 to 100 %. The Nx values are displayed for unitigs, contigs and scaffolds.

Figure 4: Summary Report

Figure 5: Nx Plot

SPAdes

SPAdes is a de novo genome assembly pipeline that can deal with data coming from several sequencing technologies and supports hybrid and single-cell assemblies. The SPAdes assembly pipeline consists of four stages:

  1. Assembly graph construction. SPAdes uses the multisized de Bruijn graph, implements new bulge/tip removal algorithms, detects and removes chimeric reads, aggregates biread information into distance histograms, and allows to backtrack the performed graph operations.

  2. k-bimer adjustment: SPAdes derives accurate distance estimates between k-mers in the genome using joint analysis of distance histograms and paths in the assembly graph.

  3. Constructs the paired assembly graph: Inspired by Paired de Bruijn graphs (PDBG) approach.

  4. Contig construction: SPAde constructs DNA sequences of contigs and the mapping of reads to contigs by backtracking graph simplifications.

SPAdes uses a modification of Hammer for error correction and quality trimming prior assembly.

In general, SPAdes uses two techniques for scaffolding.

  • SPAdes tries to estimate the size of the gap separating contigs using read pairs.

  • SPAdes, using the assembly graph, joins contigs that are separated by a complex tandem repeat, that cannot be resolved exactly, with a fixed gap size of 100 bp.

Contigs produced by SPAdes do not contain N symbols.

Please, cite SPAdes as:

Input Data

  • Input Reads: Select the files containing the sequencing libraries (reads). The assembly strategy requires at least one of these types of sequencing libraries.

    • Illumina single-end, paired-end, or high-quality mate-pairs.

    • IonTorrent single-end, paired-end, or high-quality mate-pairs.

    • PacBio CCS reads (should be provided as single-end data).

These files are assumed to be in FASTQ format. For IonTorrent data, SPAdes supports unpaired reads in unmapped BAM format.

  • IonTorrent Data: This option is required when assembling IonTorrent data. Illumina and IonTorrent libraries should not be assembled together. For IonTorrent data, SPAdes also supports unpaired reads in unmapped BAM format (like the one produced by the Torrent Server).

  • Single-cell Data: This option is required for Multiple Displacement Amplification (MDA) single-cell data assembly.

  • Paired-end Configuration: If paired-end reads are provided, a pattern to distinguish upstream files from downstream files is required. The provided patterns are searched in the filenames right before the extension. The beginning of the filenames should be the same for both files of each sample.

    • Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.

    • Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.

For example, if the upstream file is SRR037717_1.fastq and the downstream SRR037717_2.fastq, "_1" should be established as the upstream pattern and "_2" as the downstream pattern.

Figure 6: Input Data Page

  • Use Additional Mate-Pair Data: SPAdes supports mate-pair only assembly. However, high-quality mate-pair libraries are recommended in these cases. Here, regular mate-pair libraries can be provided as supplementary information. Upstream and downstream files will be distinguished using the pattern established in the previous page (Paired-end Configuration).

  • Use Data for Hybrid Assembly:

    • PacBio (CLR), Oxford Nanopore and Sanger reads can be provided for hybrid assemblies (e.g. with Illumina or IonTorrent data). SPAdes uses this data for gap closure and repeat resolution.

    • Contigs of the same genome (trusted) generated by other assembler(s) can be specified to merge them into SPAdes assembly.

    • Less reliable contigs (untrusted) can be used only for gap closure and repeat resolution.

Only contigs of the same genome should be specified since SPAdes does not work with genomes of closely-related species.

FIgure 7: Input 2 Data Page

Configuration

  • Automatic K-mer Sizes: K-mer sizes are selected automatically based on the read length and data set type:

    • If single-cell data is provided, the default values are 21, 33 and 55.

    • For multicell datasets, K values are automatically selected using maximum read length.

  • K-mer Sizes: Define a comma-separated list of k-mer sizes to be used. These must be odd and less than 128. You can find recommendations about K-mer sizes in the SPAdes documentation.

  • Read Error Correction: Performs a read error correction before assembly. Depending on the sequencing platform, the BayesHammer (Illumina) or the IonHammer (IonTorrent) tools are used for this task. This procedure is recommended to obtain high-quality assemblies but can be turned off if read error correction has been done previously.

  • Mismatch Careful Mode: Tries to reduce the number of mismatches and short indels. It also runs MismactCorrector, a post-processing tool that uses BWA.

  • Read Coverage Cutoff: Configure the read coverage cutoff value that SPAdes will use to obtain the most reliable assembled sequences. Must be a positive decimal number, or automatic, or off. When set to “Automatic” SPAdes automatically computes coverage threshold using conservative strategy.

  • Read Coverage Cutoff Value: If the “Defined by User” option is selected above, set a positive float value.

Figure 8: Configuration Page

Output

  • Contigs Fasta: Where to store the Fasta file containing the assembled contigs.

  • Scaffolds Fasta: Where to store the Fasta file containing the assembled scaffold. Recommended for use as resulting sequences.

Figure 9: Output Data Page

Results

SPAdes returns the assembled sequences in two FASTA files:

  • Contigs: Contains resulting contigs.

  • Scaffolds: Contains resulting scaffolds (recommended for use as resulting sequences).

Contigs/scaffolds names in SPAdes output FASTA files have the following format:

>NODE_3_length_237403_cov_243.207

  • 3 is the number of the contig/scaffold.

  • 237403 is the sequence length in nucleotides.

  • 243.207 is the k-mer coverage for the last (largest) k value used. Note that the k-mer coverage is always lower than the read (per-base) coverage.

In addition to the resulting FASTA files, a report and a chart are generated. The report shows a summary of the DNA-Seq De Novo Assembly results (Figure 10). This page contains information about the input sequencing data and a results overview. The Results Overview table shows a number of common statistics used to describe the quality of a sequence assembly (see the explanation in the previous section).

  • N50: This statistic defines the assembly quality in terms of contiguity. N50 is calculated by first ordering every contig or scaffold from longest to shortest. Next, starting from the longest sequence, the lengths of each sequence are summed up, until this running sum equals one-half of the total length of all sequences in the assembly. The N50 of the assembly is the length of the shortest contig in this list. Higher values of N50 indicate a better assembly. Note that any Nx statistic is calculated in the same way, e.g. N75 is calculated summing up all the lengths until the sum equals 75% of the total length.

  • L50: Defined as the smallest number of contigs whose lengths sum makes up half of the total assembly length.

The Nx plot (Figure 11) shows Nx values as x varies from 0 to 100 %. The Nx values are displayed for contigs and scaffolds.

Figure 10: Summary Report

Figure 11: Nx Plot