Taxonomic Classification

Introduction

It is known that less than 1% of bacteria observed in natural environments can be cultured under normal laboratory conditions, so the vast majority of microorganisms are unable to be studied with traditional microbiology procedures. Metagenomics, the application of sequencing techniques to DNA from microbial communities in natural environments, allows detecting the whole diversity of microorganisms and their genes in these kinds of samples.

The first main goal of metagenomics experiments is often to detect and quantify the microorganisms that are present. This process is called taxonomic classification or profiling. There are two main strategies to evaluate the taxonomic composition of a sample: amplicon sequencing (16S/18S/ITS) and whole-genome sequencing (WGS).

Amplicon Sequencing (16S/18S/ITS)

The 16S rRNA gene has been the gold standard for the taxonomic identification of bacterial communities. This gene includes both conserved regions (used for designing primers) and hypervariable regions (V1 to V9, used to identify different taxa).

The variable areas can be amplified to observe these specific regions and to identify or quantify the presence of a microorganism by looking at this gene. There are numerous strategies to design the amplification primers: some studies suggest that sequencing should include one or more of the V2, V3, V4, V6, or V3/V4 regions, but there are still no conclusions about what are the most adequate hypervariable regions to analyze. The resolution at which taxa can be detected depends directly on the sequencing depth and on the regions chosen to amplify.

This amplicon-based approach has the main advantage of requiring little effort in sequencing, making the analysis very cheap. But the strategy also has some limitations. Between some bacterial species, there are no sufficient differences among their rRNA genes, which would allow clear separation. In addition, many bacteria have multiple rRNA gene copies in their genome, which can blur species quantification results. Other considerations, such as amplification biases or chimera formation make the 16S classification even more difficult.

There are different publicly available databases that contain information about the DNA sequence of the 16S rRNA genes for many known organisms, for example, GreenGenes (with Archaeal and Bacterial 16S sequences) or Silva (Archaeal, Bacterial, and Eukaryotic). They contain information about long subunits (LSU) and short subunits (SSU) of ribosomal genes. The strategy for the taxonomic classification with amplicon data consists of aligning the sequences to the databases.

Whole Genome Sequencing (WGS)

The use of high-throughput sequencing allows to sequence of all the genomic information from the whole community of a sample with whole metagenome shotgun sequencing (WGS or WMGS) to generate metagenomes that contain genomic information across the whole genome.

Taxonomic classification tools for WMGS match sequences - typically reads or assembled contigs - against a database of microbial genomes to identify the taxon of each sequence. In the early days of metagenomics, sequence alignments (e.g. BLAST) were often used to query reads against comprehensive databases (RefSeq or GenBank). As the reference databases and the amount of sequencing data have grown, alignment using BLAST has become computationally infeasible. This led to the development of metagenomics classifiers that provide results much faster with similar sensitivity. A variety of strategies are available for the matching step, i.e.:

  • Aligning reads to a database of reference genomes (MEDUSA, GOTTCHA).

  • Mapping k-mers (Kraken, MetaCV). 

  • Aligning marker genes only (MetaPhlAn). 

  • Translating metagenomic DNA and aligning it to protein sequences (Kaiju). 

OmicsBox comes with Kraken as the tool of choice for taxonomic classification because of its overall positive characteristics, such as being amplicon and WGS capable and showing good benchmark scores.

Taxonomic Classification with Kraken

Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA short reads. It does so by examining the k-mers within a read and querying a database with those k-mers. This database contains a mapping of every k-mer in Kraken's genomic library to the lowest common ancestor (LCA) in a taxonomic tree of all genomes that contain that k-mer. The set of LCA taxa that correspond to the k-mers in a read are then analyzed to create a single taxonomic label for the read; this label can be any of the nodes in the taxonomic tree. Kraken is designed to be rapid, sensitive, and highly precise. This approach is feasible for metagenomics WGS as well as 16S/ITS amplicon read input data.

The current version of Kraken is Kraken2, which provides significant improvements to Kraken1 like faster classification speeds and smaller database sizes, which allows us to include more data. To run Kraken2 go to Taxonomic Classification > Kraken2 (Figure 1 and Figure 2). Kraken1 is also accessible under Kraken1 (Legacy). The current Kraken database version is 2020_03 for Kraken2, and 2019_06 for Kraken1, created by us. The contained species and strains in both databases can be visualized and explored through Taxonomic Classification > Database Info. 

  • Sequencing Data: Choose the type of input data: single-end, paired-end or interleaved paired-end reads If paired-end is selected, two files per sample are required and the file pattern has to be provided.

  • Reads, Contigs, or Genes: Select files that contain the desired input data. Kraken was designed to work with short reads, but works reliably with assembled sequences or genes.

  • Paired-end configuration: When working with paired-end libraries, a so-called pattern has to be established to help the software distinguish between upstream and downstream read files. Per default, we assume the following pattern:

    • upstream: SampleA_1.fastq

    • downstream: SampleA_2.fastq


Note:

For example, if the upstream file is named SRR037717_1.fastq and the downstream one SRR037717_2.fastq, you should establish "_1" as the upstream pattern and "_2" as the downstream pattern.


Figure 1.Taxonomic Classification Wizard: input page

The second wizard page contains information about the Kraken version and how to cite it.

Figure 2. Taxonomic Classification Wizard: configuration page.

Using a Custom Kraken2 Database

This section describes how to upload and use a custom Kraken2 database for taxonomic classification in OmicsBox.

Custom Database Requirements

  • The custom database is in Kraken2 format consisting of three files: hash.k2d, taxo.k2d and opts.k2d

  • The custom database has been created from nucleotide sequence data. Kraken in OmicsBox does not accept protein-based databases.

  • The custom database is built based on the NCBI taxonomy. Other taxonomies e.g. Silva, Green Genes, or GTDB are not supported in OmicsBox.

Please have a look at this page if you want to download alternative pre-formatted Kraken2 databases:

If you want to create your own Kraken2 Database from a fasta file please follow the instructions provided by the official Kraken documentation below. This is only recommended if you know some shell scripting, know how to handle big amounts of data, and have access to a powerful Linux machine (50+GB of RAM).

How to upload a custom database

  1. Access the Cloud Files Tab within OmicsBox: On the left-hand side of the OmicsBox interface, you will find a panel with different tabs. Click on the "Cloud Files" tab to access your personal cloud storage.

  2. Create a New Folder: Right-click in any location of the "Cloud Files" tab to create a new folder with a significant name at the desired location. The folder name will determine the database name in the OmicsBox user interface.

  3. Upload the Kraken2 Database Files: Drag and drop all three database files (hash.k2d, taxo.k2d, opts.k2d) from your local computer into the newly created folder within the "Cloud Files" tab.

How to use a Custom Database

To classify taxa using your custom Kraken2 database in OmicsBox, follow these steps:

  1. Open the Taxonomic Classification dialog under the Metagenomics menu.

  2. In the analysis settings, choose your custom Kraken2 database from the "Database" dropdown.

  3. Adjust any required parameters and settings for your classification analysis.

  4. Continue with your taxonomic classification analysis in OmicsBox as usual.

Results

The results of the taxonomic classification with Kraken are:

  • The main result table shows all identified OTUs for each provided sample.

  • PDF Report that shows overall input and carried-out analysis information.

  • Stacked bar chart to compare samples at specific taxonomic levels.

  • Radial cladogram in as a Krona chart to study OTU abundances in a sample.

  • Rarefaction curves to assess the sequencing depth.

  • Chao1 (species) diversity curve to evaluate the (species) diversity in the whole data set.

  • Principal Coordinates Analysis plot to get an overview and to identify outliers.

Figure 4. Taxonomic Classification results table.

Main result table

The result table (Figure 4) shows for each analyzed sample the number of OTUs found and their confidence scores. OTU stands for operational taxonomic unit. A count of 0 means the OTU is not present in the current sample. The evidence scores vary from 0 to 1, where 1 is best. The counts shown are cumulative, meaning that they do not only show direct hits for a certain OTU, but also sum up OTUs that converge into them in the taxonomic tree i.e. the taxonomic tree's hierarchy is taken into account. We use the taxonomic hierarchy from the NCBI in a simplified form and it consists of 8 main levels instead of 33, i.e. the many levels are summarized as shown in Figure 5.

The result table can be used to filter for certain taxonomic levels via the column header filter function, e.g. showing only the species or phylum level OTUs.

Right-clicking an OTU shows the table context menu to create statistics and id lists and provides extra functionality. The Extract Sequences option allows exporting the read names and actual reads of the currently selected OTUs. This makes it possible to export all reads that have been classified as e.g. bacteria and thus reduce the dataset size for further gene finding and functional annotation. Many other scenarios and use cases exist.

Figure 5. NCBI taxonomic hierarchy.

Add, Remove, and Rename Samples

It is possible to combine taxonomic classification results by selecting Add Samples from the side panel. This will open a dialog to browse taxonomic classification results and which samples to add. A new object with the desired samples will be created.

Samples can also be removed or renamed by right-clicking the desired column/samples.

All side panel actions such as the report, bar chart, etc have to be created anew to reflect these changes.

Figure 6a. Add, Remove, and Rename Samples.

Stacked Bar Chart

The Stacked bar chart (Figure 6b) is a combined view for inter-sample comparison, separated in the 7 main taxonomic levels. Average OTUs are ordered by abundance from high to low. Only the 500 biggest OTUs are shown for each sample, the remaining are gathered into an extra group called Others. These low frequent OTUs can be analyzed in detail with the Krona Pie Chart. The button Hide Unclassified in the top-right corner shows how the percentages change when only taking into account the data that could be classified by Kraken.

The graphic can be exported as a PNG image by clicking the corresponding icon in the top-right corner.

Figure 6b. Stacked bar chart.

Krona Pie Chart

This graphic (Figure 7) shows a slightly modified Krona chart with various options in the side panel. Again, the counts are cumulative and grouped into roughly 8 main levels. However, all direct counts are shown as well, which is helpful when looking at the "below species" level, which includes subspecies and strains.

The currently visualized sample is selected from a list in the side panel, the All Combined entry shows all samples together in one chart. Furthermore, text sizes can be adjusted and OTUs can be searched. Coloring by average Kraken evidence scores is also possible.

The graphic can be exported as a PNG image and PDF by clicking the corresponding icons in the top-right corner.

Summary Report

A summary report which shows basic statistics and alpha-diversity indices for each analyzed sample. It also gives information about the percentages of reads that were classified. In addition, for each of the 7 main taxonomic levels (Superkingdom, Phylum, Class, Order, Family, Genus, and Species), the top 10 OTUs per sample are listed.

The graphic can be exported as PDF by clicking the corresponding icon in the top-right corner.

Figure 7. Krona pie chart.

In addition, more charts and statistics can be generated to offer a global visualisation of the taxonomic classification results. These charts can be found in the side panel of the taxonomic classification results.

Rarefaction Curves

Rarefaction is a technique widely used in ecology which is applied to OTU analysis. A rarefaction graph (Figure 8) represents the number of expected OTUs (Y-axis) found in n NGS reads (X-axis).

The goal of rarefaction is to determine whether sequencing coverage is deep enough to get a good estimate of the total number of the OTUs present in a specific sample.

If the rarefaction curve still presents a growing trend at its end, it means that the coverage is not enough to adequately represent the real microbial diversity of the sample. By contrast, if the curve shows a horizontal asymptote, it means that a good estimation of diversity was obtained.

Note that the results of the rarefaction technique give us a suggestion about the coverage, but they are not conclusive, i.e. rare OTUs that are present in the sample could not be yet observed even if the curve presents an asymptotic trend.

Figure 8. Rarefaction curves.

Diversity Curve

An accumulation or diversity curve (Figure 9) plots the cumulative number of distinct OTUs discovered as a function of the number of samples examined. I.e. The minimum, average, and maximum number of OTUs, when looking at 1, 2, ... N samples of the current dataset.

This curve can be used to evaluate the benefits in microbial diversity of including additional samples to the dataset or to compare this diversity across datasets with similar sampling efforts.

Figure 9. Diversity curve.

Principal Coordinate Analysis (PCoA Plot)

PCoA (Figure 10) is a two-dimensional plot in which the Bray-Curtis distances between samples are drawn. You can select an experimental condition to color the points from an experimental design, and the taxonomy level from which the distances will be calculated. 


Figure 10. PCoA plot.

References

  • Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments.Genome Biology 2014, 15:R46.

  • Wood DE., Lu J. and Langmead B. (2019). Improved metagenomic analysis with Kraken 2. Genome biology, 20(1), 257.

  • Langmead B. and Salzberg SL. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357-9.

  • Ondov BD., Bergman NH. and Phillippy AM. (2011). Interactive metagenomic visualization in a Web browser. BMC bioinformatics, 12, 385.