DNA-Seq de Novo Assembly

Content of this page:

Introduction

Genome assembly refers to the process of taking a large number of short DNA reads and putting them back together to create a representation of the whole genome from which the DNA originates. De novo genome assemblies assume no prior knowledge of the source DNA sequence length, layout or composition (i.e. no reference genome is available). The goal of an assembler is to produce long contiguous pieces of sequences (contigs) from DNA-seq reads. The contigs are then joined together to form scaffolds where possible. Short-insert paired reads provide increased information for maximizing sequencing coverage, while long-insert mate paired-end reads can pair sequence fragments across greater distances. This is especially helpful to cover highly repetitive regions.

This functionality can be found under genome analysis → DNA-Seq De novo Assembly.

Two assembly strategies are available:

  • ABySS: ABySS (Assembly By Short Sequences) is a de novo, parallel, paired-end sequence assembler that is designed for short reads. It implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph. ABySS is capable of assembling large genomes.

  • SPAdes: SPAdes (St Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines based on the Bruijn Graph. SPAdes works with Illumina and IonTorrent data and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. SPAdes is designed for small genomes, and allows to assemble single-cell MDA data as well as standard isolates.

  • Flye: Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. Flye uses the repeat graph as a core data structure.

ABySS

ABySS 2.0 is a multistage de novo assembly pipeline consisting of unitig, contig, and scaffold stages.

  • At the unitig stage, the program performs the initial assembly of sequences according to the De Bruijn graph assembly algorithm. The unitig stage loads the full set of k-mers from the input sequencing reads into a hash table and stores auxiliary data for each k-mer such as the number of k-mer occurrences in the reads and the presence/absence of possible neighbor k-mers in the De Bruijn graph.

  • At the contig stage, the paired-end reads are aligned to the unitigs and the pairing information is used to orient and merge overlapping unitigs.

  • At the scaffold stage, the mate-pair reads are aligned to the contigs to orient and join them into scaffolds, inserting runs of "N" characters at gaps in coverage and for unresolved repeats.

The main innovation of ABySS 2.0 is a Bloom filter-based implementation of the unitig assembly stage. It reduces the overall memory requirements, enabling assembly of large genomes. A Bloom filter is a compact data structure for representing a set of elements that supports operations of inserting elements and querying the presence of elements. The Bloom filter data structure consists of a bit vector and one or more hash functions, where the hash functions map each k-mer to a corresponding set of positions within the bit vector (bit signature for the k-mer).

During unitig assembly, two passes are made through the input sequencing reads:

  1. In the first pass, k-mers are extracted from the reads and are loaded into a Bloom filter. The program discards all k-mers with an occurrence count below a user-specified threshold (typically in the range of two to four). In this way, k-mers caused by sequencing errors are filtered out. The retained k-mers are known as solid k-mers.

  2. In the second pass, the program identifies reads that consist entirely of solid k-mers, and extend them left and right within the De Bruijn graph to create unitigs.

Please cite ABySS 2.0 as:

Jackman SD, Vandervalk BP, Mohamadi H, et al (2017). "ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter". Genome Res. 2017;27(5):768-777.

Input Data

  • Input Reads: First, choose the type of sequencing data. Then, select the files of this type of data for the assembly. Both paired-end and single-end short reads can be provided, and both types of data can be combined in the same run.

  • Additional Data: ABySS supports additional data types as supplementary information:

    • Additional Paired-end Libraries: Paired-end libraries that will be used only for merging unitigs into contigs and will not contribute toward the consensus sequence.

    • Mate-pair Libraries: Mate-Pair libraries that will be used for scaffolding. Mate-Pair libraries that will be used for scaffolding. Mate-pair libraries do not contribute toward the consensus sequence.

    • Linked Reads: Linked reads from 10x Genomics Chromium. The linked reads are used to correct assembly errors and scaffolding.

    • Long Sequences Libraries: Provide long sequence libraries (such as RNA-Seq contigs) that will be used for rescaffolding. Long sequence libraries do not contribute toward the consensus sequence.

  • Paired-end Configuration: If paired-end reads are provided, a pattern to distinguish upstream files from downstream files is required. The provided patterns are searched in the filenames right before the extension. The beginning of the filenames should be the same for both files of each sample.

    • Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.

    • Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.

For example, if the upstream file is SRR037717_1.fastq and the downstream SRR037717_2.fastq, "_1" should be established as the upstream pattern and "_2" as the downstream pattern.

Figure 1: Input Data Page

Configuration

  • K-mer Size: The term k-mer refers to all possible subsequences of the given length that are contained in a read. In sequence assembly, k-mers are used during the construction of De Bruijn graphs. The choice of the k-mer size has many different effects on the sequence assembly, it is advisable to try different values and check the results to choose the best one. It is recommended to use odd values of at least half the length of the reads.

  • Use paired De Bruijn graph: Assembly will be performed using a paired De Bruijn graph. In this mode, k-mer pairs are used, which consist of two equal-sized k-mers separated by a fixed distance. To assemble using the paired De Bruijn graph mode, specify the k-mer pair span (distance between k-mers).

  • K-mer Pair Span: Set the span of a k-mer pair (distance between k-mers).

  • Minimum Alignment Length: Establish the minimum alignment length of a read (bp). This means that there must be a perfect match of the established length between each read and its target contig.

  • Hash Functions: Set the number of Bloom filter hash functions. K-mers from each input sequencing read are loaded into the Bloom filter by computing the hash values of each k-mer sequence and setting the corresponding bit.

  • K-mer Count Threshold: Set the k-mer count threshold for Bloom filter assembly. Optimal values are typically in the range of 2-4. K-mers with an occurrence count below the threshold will be discarded.

Figure 2: Configuration Page

Output

  • Unitigs Fasta: Where to store the Fasta file containing the assembled unitigs.

  • Contigs Fasta: Where to store the Fasta file containing the assembled contigs.

  • Scaffolds Fasta: Where to store the Fasta file containing the assembled scaffolds.

  • Long Scaffolds Fasta: Where to store the Fasta file containing the assembled long scaffolds. Note that this file is only generated if long sequence libraries were provided.

Figure 3: Output Data Page

Results

ABySS returns the assembled sequences in three FASTA files (four if long sequence libraries were provided). Each one corresponds to a different stage of the assembly procedure:

  • Unitigs: Contains sequences assembled without using paired-end information. In case you provide only single-end data, this will be the only result file, since pairing information is required to assemble contigs.

  • Contigs: Contains sequences assembled with paired information, scaffolding over sequencing coverage gaps, but no repeats.

  • Scaffolds: Contains sequences assembled with paired information, scaffolding over sequencing coverage gaps and repeats.

  • Long Scaffolds: Contains sequences that were obtained by rescaffolding using long sequences libraries.

In addition to the resulting FASTA files, a report and a chart are generated. The report shows a summary of the DNA-Seq De Novo Assembly results (Figure 4). This page contains information about the input sequencing data and a results overview. The Results Overview table shows a number of common statistics used to describe the quality of a sequence assembly:

  • N50: This statistic defines the assembly quality in terms of contiguity. N50 is calculated by first ordering every unitig, contig or scaffold from longest to shortest. Next, starting from the longest sequence, the lengths of each sequence are summed up, until this running sum equals one-half of the total length of all sequences in the assembly. The N50 of the assembly is the length of the shortest contig in this list. Higher values of N50 indicate a better assembly. Note that any Nx statistic is calculated in the same way, e.g. N75 is calculated summing up all the lengths until the sum equals 75% of the total length.

  • L50: Defined as the smallest number of contigs whose lengths sum makes up half of the total assembly length.

  • Bloom filter False Positive Rate (FPR): The Bloom filter can generate false positives when the bit signatures of different k-mers overlap by chance. This means that a certain fraction of k-mer queries will return true even though the k-mers do not exist in the input sequencing data. Users are recommended to target a Bloom filter false positive rate (FPR) smaller than 5%. Parameters such as the k-mer size, hash functions or k-mer count threshold can influence the false positive rate.

The Nx plot (Figure 5) shows Nx values as x varies from 0 to 100 %. The Nx values are displayed for unitigs, contigs and scaffolds.

Figure 4: Summary Report

Figure 5: Nx Plot

SPAdes

SPAdes is a de novo genome assembly pipeline that can deal with data coming from several sequencing technologies and supports hybrid and single-cell assemblies. The SPAdes assembly pipeline consists of four stages:

  1. Assembly graph construction. SPAdes uses the multisized de Bruijn graph, implements new bulge/tip removal algorithms, detects and removes chimeric reads, aggregates biread information into distance histograms, and allows to backtrack the performed graph operations.

  2. k-bimer adjustment: SPAdes derives accurate distance estimates between k-mers in the genome using joint analysis of distance histograms and paths in the assembly graph.

  3. Constructs the paired assembly graph: Inspired by Paired de Bruijn graphs (PDBG) approach.

  4. Contig construction: SPAde constructs DNA sequences of contigs and the mapping of reads to contigs by backtracking graph simplifications.

SPAdes uses a modification of Hammer for error correction and quality trimming prior assembly.

In general, SPAdes uses two techniques for scaffolding.

  • SPAdes tries to estimate the size of the gap separating contigs using read pairs.

  • SPAdes, using the assembly graph, joins contigs that are separated by a complex tandem repeat, that cannot be resolved exactly, with a fixed gap size of 100 bp.

Contigs produced by SPAdes do not contain N symbols.

Please, cite SPAdes as:

Input Data

  • Input Reads: Select the files containing the sequencing libraries (reads). The assembly strategy requires at least one of these types of sequencing libraries.

    • Illumina single-end, paired-end, or high-quality mate-pairs.

    • IonTorrent single-end, paired-end, or high-quality mate-pairs.

    • PacBio CCS reads (should be provided as single-end data).

These files are assumed to be in FASTQ format. For IonTorrent data, SPAdes supports unpaired reads in unmapped BAM format.

  • IonTorrent Data: This option is required when assembling IonTorrent data. Illumina and IonTorrent libraries should not be assembled together. For IonTorrent data, SPAdes also supports unpaired reads in unmapped BAM format (like the one produced by the Torrent Server).

  • Single-cell Data: This option is required for Multiple Displacement Amplification (MDA) single-cell data assembly.

  • Paired-end Configuration: If paired-end reads are provided, a pattern to distinguish upstream files from downstream files is required. The provided patterns are searched in the filenames right before the extension. The beginning of the filenames should be the same for both files of each sample.

    • Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.

    • Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.

For example, if the upstream file is SRR037717_1.fastq and the downstream SRR037717_2.fastq, "_1" should be established as the upstream pattern and "_2" as the downstream pattern.

Figure 6: Input Data Page

  • Use Additional Mate-Pair Data: SPAdes supports mate-pair only assembly. However, high-quality mate-pair libraries are recommended in these cases. Here, regular mate-pair libraries can be provided as supplementary information. Upstream and downstream files will be distinguished using the pattern established in the previous page (Paired-end Configuration).

  • Use Data for Hybrid Assembly:

    • PacBio (CLR), Oxford Nanopore and Sanger reads can be provided for hybrid assemblies (e.g. with Illumina or IonTorrent data). SPAdes uses this data for gap closure and repeat resolution.

    • Contigs of the same genome (trusted) generated by other assembler(s) can be specified to merge them into SPAdes assembly.

    • Less reliable contigs (untrusted) can be used only for gap closure and repeat resolution.

Only contigs of the same genome should be specified since SPAdes does not work with genomes of closely-related species.

FIgure 7: Input 2 Data Page

Configuration

  • Automatic K-mer Sizes: K-mer sizes are selected automatically based on the read length and data set type:

    • If single-cell data is provided, the default values are 21, 33 and 55.

    • For multicell datasets, K values are automatically selected using maximum read length.

  • K-mer Sizes: Define a comma-separated list of k-mer sizes to be used. These must be odd and less than 128. You can find recommendations about K-mer sizes in the SPAdes documentation.

  • Read Error Correction: Performs a read error correction before assembly. Depending on the sequencing platform, the BayesHammer (Illumina) or the IonHammer (IonTorrent) tools are used for this task. This procedure is recommended to obtain high-quality assemblies but can be turned off if read error correction has been done previously.

  • Mismatch Careful Mode: Tries to reduce the number of mismatches and short indels. It also runs MismactCorrector, a post-processing tool that uses BWA.

  • Read Coverage Cutoff: Configure the read coverage cutoff value that SPAdes will use to obtain the most reliable assembled sequences. Must be a positive decimal number, or automatic, or off. When set to “Automatic” SPAdes automatically computes coverage threshold using conservative strategy.

  • Read Coverage Cutoff Value: If the “Defined by User” option is selected above, set a positive float value.

Figure 8: Configuration Page

Output

  • Contigs Fasta: Where to store the Fasta file containing the assembled contigs.

  • Scaffolds Fasta: Where to store the Fasta file containing the assembled scaffold. Recommended for use as resulting sequences.

Figure 9: Output Data Page

Results

SPAdes returns the assembled sequences in two FASTA files:

  • Contigs: Contains resulting contigs.

  • Scaffolds: Contains resulting scaffolds (recommended for use as resulting sequences).

Contigs/scaffolds names in SPAdes output FASTA files have the following format:

>NODE_3_length_237403_cov_243.207

  • 3 is the number of the contig/scaffold.

  • 237403 is the sequence length in nucleotides.

  • 243.207 is the k-mer coverage for the last (largest) k value used. Note that the k-mer coverage is always lower than the read (per-base) coverage.

In addition to the resulting FASTA files, a report and a chart are generated. The report shows a summary of the DNA-Seq De Novo Assembly results (Figure 10). This page contains information about the input sequencing data and a results overview. The Results Overview table shows a number of common statistics used to describe the quality of a sequence assembly (see the explanation in the previous section).

  • N50: This statistic defines the assembly quality in terms of contiguity. N50 is calculated by first ordering every contig or scaffold from the longest to the shortest. Next, starting from the longest sequence, the lengths of each sequence are summed up, until this running sum equals one-half of the total length of all sequences in the assembly. The N50 of the assembly is the length of the shortest contig in this list. Higher values of N50 indicate a better assembly. Note that any Nx statistic is calculated in the same way, e.g. N75 is calculated summing up all the lengths until the sum equals 75% of the total length.

  • L50: Defined as the smallest number of contigs whose lengths sum makes up half of the total assembly length.

The Nx plot (Figure 11) shows Nx values as x varies from 0 to 100 %. The Nx values are displayed for contigs and scaffolds.

Figure 10: Summary Report

Figure 11: Nx Plot

Flye

Flye is a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs:

  1. Flye initially generates disjointigs that represent concatenations of multiple disjoint genomic segments

  2. Concatenates all error-prone disjointigs into a single string (in arbitrary order).

  3. Constructs an accurate assembly graph from the resulting concatenate.

  4. Uses reads to untangle this graph and resolves bridged repeats.

  5. Resolves bridged repeats (which are bridged by some reads in the repeat graph).

  6. Uses the repeat graph to resolve unbridged repeats (which are not bridged by any reads) using small differences between repeat copies.

  7. Output accurate contigs formed by paths in this graph.

Please, cite Flye as:

Input Data

  • Input Reads: Select the files containing the sequencing libraries (long reads). Currently, raw and corrected reads from PacBio and Oxford Nanopore Technologies are supported. Expected error rates are <30% for raw and <2% for corrected reads.

Mixing different read types is not yet supported.

  • Corrected Data: Check this option if error-corrected PacBio/ONT reads are provided. The parameters are optimized for error rates <2%. If you are getting highly fragmented assembly, most likely error rates in the reads are higher. In this case, consider assembling using the raw reads instead.

  • Subassemblies: This option generates a consensus of multiple high-quality contig assemblies, such as produced by different short/long-read assemblers). The expected error rate is <1%. If checked, select a Fasta file containing high-quality contig assemblies.

Figure 12: Input Data Page

Configuration

  • Genome Size: Provide an estimate of the size of the genome. Common suffixes are allowed, for example, "m" (mega) or "g" (giga). The genome size estimate is used to decide how many reads to correct and how sensitive the overlapper should be.

  • Automatic Minimum Overlap: The minimum overlap length for two reads to be considered overlapping is chosen automatically based on the read length distribution (reads N90) and does not require a manual setting.

  • Manual Minimap Overlap: This sets a minimum overlap length for two reads to be considered overlapping. The typical value is 3k-5k. Intuitively, we want to set this parameter as high as possible, so the repeat graph is less tangled. However, higher values might lead to assembly gaps. In some rare cases (for example in case of biased read length distribution) it makes sense to set this parameter manually.

  • Number of Polishing Iterations: Polishing is performed as the final assembly stage. By default, Flye runs one polishing iteration. Additional iterations might correct a small number of extra errors (due to improvements on how reads may align to the correct assembly). If the parameter is set to 0, the polishing is not performed.

  • Plasmids: This option allows to rescue short unassembled plasmids.

  • Keep Haplotypes: Do not collapse alternative haplotypes.

  • Trestle: Trestle is an extra module that resolves simple repeats of multiplicity 2 that were not bridged by reads. Depending on the datasets, it might resolve a few extra repeats, which is helpful for small (bacterial genomes). On large genomes, the contiguity improvements are usually minimal, but the computation might take a lot of time.

Figure 13: Configuration Data Page

Output

  • Assembly Fasta: Select where to store the Fasta file containing the assembled genomic sequences.

  • Graph File: If this option is checked, select where to store the final repeat graph in a .gfa file. Note that the edge sequences might be different (shorter) than contig sequences because contigs might include multiple graph edges.

Figure 14: Output Data Page

Results

Flye returns the results in two different files:

  • Assembly (Fasta): Contains resulting contigs/scaffolds.

  • Assembly Graph (Gfa): Final repeat graph. Note that the edge sequences might be different (shorter) than contig sequences because contigs might include multiple graph edges. Repeat graphs produced by Flye could be visualized using AGB or Bandage.

In addition to the resulting files, a report and a chart are generated. The report shows a summary of the DNA-Seq De Novo Assembly results (Figure 15). This page contains information about the input sequencing data and a results overview. The Results Overview table shows a number of common statistics used to describe the quality of a sequence assembly (see the explanation in the previous section).

  • N50: This statistic defines the assembly quality in terms of contiguity. N50 is calculated by first ordering every contig or scaffold from the longest to the shortest. Next, starting from the longest sequence, the lengths of each sequence are summed up, until this running sum equals one-half of the total length of all sequences in the assembly. The N50 of the assembly is the length of the shortest contig in this list. Higher values of N50 indicate a better assembly. Note that any Nx statistic is calculated in the same way, e.g. N75 is calculated summing up all the lengths until the sum equals 75% of the total length.

  • L50: Defined as the smallest number of contigs whose lengths sum makes up half of the total assembly length.

The Nx plot (Figure 16) shows Nx values as x varies from 0 to 100 %. The Nx values are displayed for contigs and scaffolds.

Figure 15: Summary Report

Figure 16: Nx Plot