Metagenome Assembly

In metagenomics, raw reads (typically Illumina 2 x 150 bp) are usually too short for direct functional characterization. Therefore, we offer metagenome assembly tools as a preliminary step before gene prediction and functional annotation.

metaSPAdes

SPAdes (St. Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines. In OmicsBox, SPAdes is run with the --meta option, which is recommended when assembling metagenomic data sets (see Nurk et al., 2017, for more details).

metaSPAdes (figures 1, 2 and 3) addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved useful in assemblies of single cells and highly polymorphic diploid genomes. Note that SPAdes was initially designed for small genomes and was tested on bacterial (both single-cell MDA and standard isolates), fungal and other small genomes. Currently, metaSPAdes supports only paired-end libraries. metaSPAdes can also be very sensitive to technical sequences remaining in the data (most notably adapter read-throughs), so please run quality control and pre-process your data accordingly.

SPAdes is a de Bruijn graph-based assembler. Input reads are split into k-mers to build the graph, and the assembly is then derived from an Eulerian path through it, i.e. a path that visits every edge exactly once. metaSPAdes employs a few modifications to avoid misassemblies, preferring shorter high-quality contigs over a few long contigs.
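The graph construction and Eulerian traversal described above can be illustrated on a toy example. This is a minimal sketch and not how SPAdes is implemented: it ignores reverse complements, sequencing errors and coverage, and it assumes the reads produce a graph that has a single Eulerian path. All function names are illustrative.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists in the graph."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(
        (n for n in nodes if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a node path back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Three overlapping toy reads sampled from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
contig = spell(eulerian_path(build_de_bruijn(reads, k=4)))
print(contig)  # ATGGCGTGCA
```

In real data, repeats and errors create branches and bubbles in the graph, which is why assemblers report several contigs instead of one path.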

Compared to MEGAHIT, metaSPAdes needs more resources and takes more time, but generally produces more contiguous assemblies, i.e. higher Nx values.

  • Up / Downstream Reads: Choose the files containing the upstream and downstream paired-end reads, respectively. SPAdes cannot continue if the number of upstream reads does not exactly match the number of downstream reads, or if the read names differ.
  • Read Orientation: In forward-reverse orientation, the forward reads are the left reads and the reverse reads are the right reads. In reverse-forward orientation, the left reads are the reverse reads and the right reads are the forward reads.
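Because a mismatch in read count or read names stops the assembly, it can be worth checking the two files beforehand. The following is a minimal sketch of such a check, not part of OmicsBox or SPAdes. It assumes well-formed four-line FASTQ records without blank lines, and it treats a trailing "/1" or "/2" and anything after the first whitespace in the header as not being part of the read name. The function names are hypothetical.

```python
import gzip
from itertools import zip_longest


def read_names(path):
    """Yield the read name of every FASTQ record (four lines per record)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                # Drop the leading '@' and any comment after whitespace.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                # Drop a trailing /1 or /2 mate suffix.
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name


def check_pairing(upstream_path, downstream_path):
    """Return (ok, message) after comparing read counts and read names."""
    pairs = zip_longest(read_names(upstream_path), read_names(downstream_path))
    for index, (up, down) in enumerate(pairs, start=1):
        if up is None or down is None:
            return False, f"read count differs at record {index}"
        if up != down:
            return False, f"name mismatch at record {index}: {up} vs {down}"
    return True, "files are consistently paired"
```

For example, check_pairing("SampleA_1.fastq", "SampleA_2.fastq") returns a flag and a short message that points to the first problematic record.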

Figure 1. MetaSPAdes assembly wizard: input page. 

K-mer sizes: By default, SPAdes automatically selects the k-mer sizes for graph construction. To override this, provide a comma-separated list of odd k-mer sizes (1-128).
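The constraints on a custom k-mer list can be checked before starting a run. The sketch below validates a comma-separated list and builds an equivalent command line for the SPAdes command-line tool (spades.py with --meta, -1, -2, -k and -o). The helper names are hypothetical, the ascending-order check reflects the requirement of the SPAdes command line, and the command OmicsBox actually issues may differ.

```python
def parse_kmer_sizes(text, upper_limit=128):
    """Parse '21,33,55' into a list of odd, ascending k-mer sizes below upper_limit."""
    sizes = [int(token) for token in text.split(",") if token.strip()]
    if not sizes:
        raise ValueError("no k-mer sizes given")
    for k in sizes:
        if k % 2 == 0 or not 0 < k < upper_limit:
            raise ValueError(f"k-mer size {k} must be odd and below {upper_limit}")
    if sizes != sorted(set(sizes)):
        raise ValueError("k-mer sizes must be ascending without duplicates")
    return sizes


def metaspades_command(upstream, downstream, out_dir, kmer_text=None):
    """Build the argument list; omitting kmer_text lets SPAdes choose k itself."""
    command = ["spades.py", "--meta", "-1", upstream, "-2", downstream, "-o", out_dir]
    if kmer_text:
        sizes = parse_kmer_sizes(kmer_text)
        command += ["-k", ",".join(str(k) for k in sizes)]
    return command


print(metaspades_command("SampleA_1.fastq", "SampleA_2.fastq", "assembly", "21,33,55"))
```

The returned list can be passed to subprocess.run if SPAdes is installed locally.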

Figure 2. MetaSPAdes assembly wizard: configuration page.

  • Contigs Fasta: Choose where to save the resulting multi-fasta file.
  • Scaffolds Fasta: Choose where to save the resulting file containing the scaffolds.

Figure 3. MetaSPAdes assembly wizard: output page.

The results of SPAdes are the assembled contigs and scaffolds in two separate multi-FASTA files. Additionally, QUAST is used to generate some basic statistics to assess the quality of the assembly. The PDF report is accompanied by an Nx distribution chart.
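For readers unfamiliar with Nx values: Nx is the length of the shortest contig in the smallest set of longest contigs that together cover at least x % of the total assembly length (N50 being the most common choice). The following sketch shows the calculation on a list of contig lengths. It is only an illustration of the definition, not the QUAST implementation, and the function names are hypothetical.

```python
def contig_lengths(fasta_path):
    """Return the sequence length of every record in a multi-FASTA file."""
    lengths = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                lengths.append(0)
            elif lengths:
                lengths[-1] += len(line.strip())
    return lengths


def nx(lengths, x=50):
    """Length of the contig at which the running total reaches x % of the assembly."""
    if not lengths:
        raise ValueError("no contigs given")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length


lengths = [100, 80, 60, 40, 20]  # total assembly length: 300
print(nx(lengths, 50))  # 80, because 100 + 80 = 180 >= 150
print(nx(lengths, 90))  # 40, because 100 + 80 + 60 + 40 = 280 >= 270
```

Plotting nx(lengths, x) for x from 1 to 100 gives a curve comparable to the Nx distribution chart.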

Bankevich A et al. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology : a journal of computational molecular cell biology, 19(5), 455-77.

Li D., Liu CM., Luo R., Sadakane K. and Lam TW. (2015). MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics (Oxford, England), 31(10), 1674-6.

Nurk S., Meleshko D., Korobeynikov A. and Pevzner PA. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome research, 27(5), 824-834.

van der Walt AJ., van Goethem MW., Ramond JB., Makhalanyane TP., Reva O. and Cowan DA. (2017). Assembling metagenomes, one community at a time. BMC genomics, 18(1), 521.

Vollmers J., Wiegand S. and Kaster AK. (2017). Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist's Perspective - Not Only Size Matters! PloS one, 12(1), e0169662.

MEGAHIT

MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. MEGAHIT assembles the data as a whole, i.e. no pre-processing like partitioning and normalization is needed (figures 4, 5 and 6). 

MEGAHIT was created in the same research group that was involved in the development of SOAPdenovo and SOAPdenovo2 and may be seen as the successor of these tools. It uses a range of k-mer values to iteratively improve assemblies, a strategy adopted from the IDBA assemblers. It employs a new data structure, the "succinct de Bruijn graph", which has been designed to significantly reduce memory requirements.

As an additional step to further reduce memory consumption, only k-mers occurring at a frequency at or above a specified cutoff are retained as “solid k-mers”, while the rest are removed as potential sequencing errors. By default, the cutoff value is 2, so k-mers occurring at least twice are kept while singleton k-mers are discarded. Because this not only eliminates sequencing errors but also removes information from genuinely low-abundance genome fragments, a “mercy k-mer” strategy was introduced, which recovers discarded k-mers if they provide new and useful information within a trustworthy context: discarded singleton k-mers that occur on the same read as “solid k-mers” and are needed to connect these “solid k-mers” within the de Bruijn graph are recovered and added to the graph. This minimizes the loss of sequencing information while still keeping the influence of sequencing errors low.
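The interplay of solid and mercy k-mers can be illustrated with a small sketch. It is a strong simplification of what MEGAHIT does: reverse complements and the graph-based conditions on the flanking solid k-mers are ignored, and a discarded k-mer is rescued simply when it lies between two solid k-mers on the same read. Function names are illustrative.

```python
from collections import Counter


def kmers(read, k):
    """All k-mers of a read, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def solid_kmers(reads, k, min_count=2):
    """K-mers that occur at least min_count times across all reads."""
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    return {kmer for kmer, count in counts.items() if count >= min_count}


def mercy_kmers(reads, k, solid):
    """Non-solid k-mers flanked by solid k-mers on both sides within one read."""
    rescued = set()
    for read in reads:
        gap, seen_solid = [], False
        for kmer in kmers(read, k):
            if kmer in solid:
                if seen_solid:
                    rescued.update(gap)  # the gap is closed by a second solid k-mer
                gap, seen_solid = [], True
            elif seen_solid:
                gap.append(kmer)
    return rescued


reads = [
    "ACGTT", "ACGTT",   # well-covered region
    "GGATC", "GGATC",   # another well-covered region
    "ACGTTGGATC",       # single read linking both regions
    "ATCAA",            # single read ending in low-coverage sequence
]
solid = solid_kmers(reads, k=3)
mercy = mercy_kmers(reads, k=3, solid=solid)
print(sorted(mercy))  # ['TGG', 'TTG']
```

The singleton k-mers TTG and TGG connect the two solid regions and are rescued, whereas TCA and CAA at the end of the last read connect nothing and remain discarded.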

  • Sequencing Data: Choose the type of input data: single-end, paired-end or interleaved paired-end reads. If paired-end is selected, two files per sample are required and the file pattern has to be provided.
  • Input Reads: Provide the files containing sequencing reads. These files are assumed to be in FASTQ format, optionally gzip-compressed.
  • Paired-end configuration: When working with paired-end libraries, a so-called pattern has to be established to help the software distinguish between upstream and downstream read files. By default, we assume the following pattern:
    • upstream: SampleA_1.fastq
    • downstream: SampleA_2.fastq

Note:

For example, if the upstream file is named SRR037717_1.fastq and the downstream one SRR037717_2.fastq, you should establish "_1" as the upstream pattern and "_2" as the downstream pattern.
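How such a pattern can separate upstream from downstream files is sketched below. This is a hypothetical illustration and not the matching logic used by OmicsBox: it assumes that the pattern is the end of the file name before the first dot, and that the remaining prefix is the sample name.

```python
from pathlib import Path


def pair_files(paths, upstream_pattern="_1", downstream_pattern="_2"):
    """Group read files into {sample: (upstream_file, downstream_file)}."""
    upstream, downstream = {}, {}
    for path in map(Path, paths):
        # "SRR037717_1.fastq.gz" -> "SRR037717_1"
        stem = path.name.split(".", 1)[0]
        if stem.endswith(upstream_pattern):
            upstream[stem[:-len(upstream_pattern)]] = path
        elif stem.endswith(downstream_pattern):
            downstream[stem[:-len(downstream_pattern)]] = path
    unpaired = set(upstream) ^ set(downstream)
    if unpaired:
        raise ValueError(f"samples without a mate file: {sorted(unpaired)}")
    return {sample: (upstream[sample], downstream[sample]) for sample in sorted(upstream)}


pairs = pair_files(["SRR037717_2.fastq", "SRR037717_1.fastq"])
print(pairs["SRR037717"])
```

With files named SampleA_R1.fastq and SampleA_R2.fastq, the patterns "_R1" and "_R2" would be passed instead.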

Figure 4. MEGAHIT assembly wizard: input page. 

  • Minimum Multiplicity (d): K-mers that occur fewer than d times are filtered out; more precisely, (kmin+1)-mers with multiplicity lower than d are discarded. Be cautious about setting d below 2, as this leads to a much larger and noisier graph. We recommend the default value of 2 for metagenomic assembly.
  • K-mer Sizes: Provide a list of k-mer sizes for iterative graph creation. Values have to be odd and in the range 15-255.
    • For ultra-complex metagenomics data such as soil, a larger kmin, say 27, is recommended to reduce the complexity of the de Bruijn graph. Quality trimming is also recommended.
    • For high-depth generic data, a large --k-min (25 to 31) is recommended.
    • A smaller --k-step, say 10, is friendlier to low-coverage datasets.
  • No Mercy K-mers: Do not add mercy k-mers. Mercy k-mers are specially designed for metagenomics assembly to recover low-coverage sequences. For generic datasets with coverage >= 30x, MEGAHIT may generate better results with the --no-mercy option.
  • Bubble Level: Intensity of bubble merging. Bubbles occur in the de Bruijn graph when several paths start in the same vertex and end together in another vertex.
  • Bubble Merge Level L: Complex bubbles with length <= L * k-mer size are merged.
  • Bubble Merge Level S: Complex bubbles with similarity >= S are merged.
  • Prune Level: Strength of low depth pruning.
  • Prune Depth: Remove unitigs with average k-mer depths less than this value.
  • Low Local Ratio: Ratio threshold to define low local coverage contigs.
  • Max Tip Length: Remove tips shorter than this value.
  • Disable Local Assembly: Skip the local assembly step. The local assembly module was introduced in version 1.0 and creates local contigs between iterations with high-confidence k-mers.
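Several of the options above correspond to flags of the MEGAHIT command-line tool (--min-count, --k-list, --no-mercy and --no-local, next to -1, -2 and -o). The sketch below derives a k-mer list from a minimum, maximum and step, validates it against the range given above, and assembles an argument list. The helper names are hypothetical, only a subset of options is covered, and the command OmicsBox runs internally may differ.

```python
def k_range(k_min, k_max, k_step):
    """Odd k-mer sizes from k_min up to at most k_max in even steps."""
    if k_min % 2 == 0 or k_step % 2 != 0 or k_step <= 0:
        raise ValueError("k_min must be odd and k_step a positive even number")
    return list(range(k_min, k_max + 1, k_step))


def validate_k_list(k_list):
    """Check the documented constraints: odd values in the range 15-255."""
    for k in k_list:
        if k % 2 == 0 or not 15 <= k <= 255:
            raise ValueError(f"invalid k-mer size {k}: must be odd and within 15-255")
    return list(k_list)


def megahit_command(upstream, downstream, out_dir, min_count=2,
                    k_list=None, no_mercy=False, no_local=False):
    """Build an argument list for a paired-end MEGAHIT run."""
    command = ["megahit", "-1", upstream, "-2", downstream, "-o", out_dir,
               "--min-count", str(min_count)]
    if k_list:
        command += ["--k-list", ",".join(str(k) for k in validate_k_list(k_list))]
    if no_mercy:
        command.append("--no-mercy")
    if no_local:
        command.append("--no-local")
    return command


# Soil-like sample: larger k-min of 27 and a k-step of 10.
soil_k = k_range(27, 87, 10)
print(soil_k)  # [27, 37, 47, 57, 67, 77, 87]
print(megahit_command("SampleA_1.fastq", "SampleA_2.fastq", "megahit_out", k_list=soil_k))
```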

Figure 5. MEGAHIT assembly wizard: configuration page.

  • Contig Fasta: Choose where to save the final FASTA file containing the assembled contigs.


Figure 6. MEGAHIT assembly wizard: output page.

The results of MEGAHIT are the assembled contigs in a multi-FASTA file. Additionally, QUAST is used to generate some basic statistics to assess the quality of the assembly. The PDF report is accompanied by an Nx chart.

Li D., Liu CM., Luo R., Sadakane K. and Lam TW. (2015). MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics (Oxford, England), 31(10), 1674-6.

van der Walt AJ., van Goethem MW., Ramond JB., Makhalanyane TP., Reva O. and Cowan DA. (2017). Assembling metagenomes, one community at a time. BMC genomics, 18(1), 521.

Vollmers J., Wiegand S. and Kaster AK. (2017). Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist's Perspective - Not Only Size Matters! PloS one, 12(1), e0169662.