Taxonomic Classification

Content of this page:

Introduction

It is known that less than 1% of bacteria observed in natural environments can be cultured under normal laboratory conditions, so the vast majority of microorganisms are unable to be studied with traditional microbiology procedures. Metagenomics, the application of sequencing techniques to DNA from microbial communities in natural environments, allows to detect the whole diversity of microorganisms and their genes in this kind of samples.

The first main goal of metagenomics experiments is often to detect and to quantify the microorganisms that are present. This process is called taxonomic classification or profiling. There are two main strategies to evaluate the taxonomic composition of a sample: amplicon sequencing (16S/18S/ITS) and whole genome sequencing (WGS).

Amplicon Sequencing (16S/18S/ITS)

The 16S rRNA gene has been the gold standard for taxonomic identification of bacterial communities. This gene includes both conserved regions (used for designing primers) and hypervariable regions (V1 to V9, used to identify different taxa).

The variable areas can be amplified to observe these specific regions and to identify or quantify the presence of a microorganism by looking at this gene. There are numerous strategies to design the amplification primers: some studies suggest that sequencing should include one or more of the V2, V3, V4, V6 or V3/V4 regions, but there are still no conclusions about what are the most adequate hypervariable regions to analyse. The resolution at which taxa can be detected depends directly on the sequencing depth and on the regions chosen to amplify.

This amplicon-based approach has the main advantage of requiring little efforts in sequencing, making the analysis very cheap. But the strategy also has some limitations. Between some bacterial species, there are no sufficient differences among their rRNA genes, which would allow clear separation. In addition, many bacteria have multiple rRNA gene copies in their genome, which can blur species quantification results. Other considerations, such as amplification biases or chimera formation make the 16S classification even more difficult.

There are different publicly available databases which contain information about the DNA sequence of the 16S rRNA genes for many known organisms, for example GreenGenes (with Archaeal and Bacterial 16S sequences) or Silva (Archaeal, Bacterial and Eukaryotic). They contain information about long subunits (LSU) and short subunits (SSU) of ribosomal genes. The strategy for the taxonomic classification with amplicon data consists of aligning the sequences to the databases.

Whole Genome Sequencing (WGS)

The use of high-throughput sequencing allows to sequence all the genomic information from the whole community of a sample with whole metagenome shotgun sequencing (WGS or WMGS) to generate metagenomes that contain genomic information across the whole genome.

Taxonomic classification tools for WMGS match sequences - typically reads or assembled contigs - against a database of microbial genomes to identify the taxon of each sequence. In the early days of metagenomics, sequence alignments (e.g. BLAST) were often used to query reads against comprehensive databases (RefSeq or GenBank). As the reference databases and the amount of sequencing data have grown, alignment using BLAST has become computationally infeasible. This led to the development of metagenomics classifiers that provide results much faster with similar sensitivity. A variety of strategies are available for the matching step, i.e.:

  • Aligning reads to a database of reference genomes (MEDUSA, GOTTCHA).
  • Mapping k-mers (Kraken, MetaCV). 
  • Aligning marker genes only (MetaPhlAn). 
  • Translating metagenomic DNA and aligning it to protein sequences (Kaiju). 

OmicsBox comes with Kraken as the tool of choice for taxonomic classification because of its overall positive characteristics, such as being amplicon and WGS capable and showing good benchmark scores.

Taxonomic Classification with Kraken

Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA short reads. It does so by examining the k-mers within a read and querying a database with those k-mers. This database contains a mapping of every k-mer in Kraken's genomic library to the lowest common ancestor (LCA) in a taxonomic tree of all genomes that contain that k-mer. The set of LCA taxa that correspond to the k-mers in a read are then analyzed to create a single taxonomic label for the read; this label can be any of the nodes in the taxonomic tree. Kraken is designed to be rapid, sensitive, and highly precise. This approach is feasible for metagenomics WGS as well as 16S/ITS amplicon read input data.

The current version of Kraken is Kraken2, which provides significant improvements to Kraken1 like faster classification speeds and smaller database sizes, which allows us to include more data. To run Kraken2 go to Taxonomic Classification > Kraken2 (Figure 1 and Figure 2). Kraken1 is also accessible under Kraken1 (Legacy). The current Kraken database version is 2020_03 for Kraken2, and 2019_06 for Kraken1, created by us. The contained species and strains in both databases can be visualized and explored through Taxonomic Classification > Database Info. 

  • Sequencing Data: Choose the type of input data: single-end, paired-end or interleaved paired-end reads If paired-end is selected, two files per sample are required and the file pattern has to be provided.
  • Reads, Contigs, or Genes: Select files that contain the desired input data. Kraken was designed to work with short reads, but works reliably with assembled sequences or genes.
  • Paired-end configuration: When working with paired-end libraries, a so-called pattern has to be established to help the software distinguish between upstream and downstream read files. Per default, we assume the following pattern:
    • upstream: SampleA_1.fastq
    • downstream: SampleA_2.fastq


Note:

For example, if the upstream file is named SRR037717_1.fastq and the downstream one SRR037717_2.fastq, you should establish "_1" as the upstream pattern and "_2" as the downstream pattern.

Figure 1.Taxonomic Classification Wizard: input page


The second wizard page contains information about the Kraken version and how to cite it.

Figure 2. Taxonomic Classification Wizard: configuration page.

Kraken1 Contaminant Screening

Metagenome sequence data is often "contaminated" by the host organism, e.g. a human gut genome sequencing project will contain reads from Homo sapiens. Therefore, Kraken1 can be combined with a prior contaminant screening to filter out reads that may stem from a host organism. We offer various model organism genomes, the screening is performed with the help of Bowtie2.

All the genomes of these model organisms are already included in the Kraken2 database, so they can be detected directly within the Taxonomic Classification analysis with Kraken2. This makes the previous screening step obsolete.

Available genomes for contaminant screening in Kraken1 are:

  • Homo sapiens
  • Arabidopsis thaliana
  • Bos taurus
  • Drosophila melanogaster
  • Escherichia coli
  • Mus musculus
  • Rattus norvegicus
  • Sus scrofa

Figure 3. Taxonomic Classification Wizard with Kraken 1: configuration page and Contaminant Screening.

Results

The results of the taxonomic classification with Kraken are:

  • Main result table, that shows all identified OTUs for each provided sample.
  • PDF Report that shows overall input and carried-out analysis information.
  • Stacked bar chart to compare samples at specific taxonomic levels.
  • Radial cladogram in form of a Krona chart to study OTU abundances in a sample.
  • Rarefaction curves to assess the sequencing depth.
  • Chao1 (species) diversity curve to evaluate the (species) diversity in the whole data-set.
  • Principal Coordinates Analysis plot to get an overview and to identify outliers.

Main result table

The result table (Figure 4) shows for each analyzed sample the number of OTUs found and their confidence scores. OTU stands for operational taxonomic unit. A count of 0 means the OTU is not present in the current sample. The evidence scores vary from 0 to 1, where 1 is best. The counts shown are cumulative, meaning that they do not only show direct hits for a certain OTU, but also sum up OTUs that converge into them in the taxonomic tree i.e. the taxonomic tree's hierarchy is taken into account. We use the taxonomic hierarchy from the NCBI in a simplified form and it consists of 8 main levels instead of 33, i.e. the many levels are summarized as shown in Figure 5.

The result table can be used to filter for certain taxonomic levels via the column header filter function, e.g. showing only the species or phylum level OTUs.

Right-clicking an OTU shows the table context menu to create statistics and id lists and provides extra functionality. The Extract Sequences option allows exporting the read names and actual reads of the currently selected OTUs. This makes it possible to export all reads that have been classified as e.g. bacteria and thus reduce the dataset size for further gene finding and functional annotation. Many other scenarios and use cases exist.

Figure 4. Taxonomic Classification results table.


Figure 5. NCBI taxonomic hierarchy.

Stacked Bar Chart

The Stacked bar chart (Figure 6) is a combined view for inter-sample comparison, separated in the 7 main taxonomic levels. Average OTUs are ordered by abundance from high to low. Only the 500 biggest OTUs are shown for each sample, the remaining are gathered into an extra group called Others. These low frequent OTUs can be analyzed in detail with the Krona Pie Chart. The button Hide Unclassified in the top-right corner shows how the percentages change when only taking into account the data that could be classified by Kraken.

The graphic can be exported as a PNG image by clicking the corresponding icon in the top-right corner.

Figure 6. Stacked bar chart.

Krona Pie Chart

This graphic (Figure 7) shows a slightly modified Krona chart with various options in the side panel. Again, the counts are cumulative and grouped into roughly 8 main levels. However, all direct counts are shown as well, which is helpful when looking at the "below species" level, which includes subspecies and strains.

The currently visualized sample is selected from a list in the side panel, the All Combined entry shows all samples together in one chart. Furthermore, text sizes can be adjusted and OTUs can be searched. Coloring by average Kraken evidence scores is also possible.

The graphic can be exported as a PNG image and PDF by clicking the corresponding icons in the top-right corner.

Summary Report

A summary report which shows basic statistics and alpha-diversity indices for each analyzed sample. It also gives information about the percentages of reads that were classified. In addition, for each of the 7 main taxonomic levels (Superkingdom, Phylum, Class, Order, Family, Genus, and Species), the top 10 OTUs per sample are listed.

The graphic can be exported as PDF by clicking the corresponding icon in the top-right corner.

Figure 7. Krona pie chart.


In addition, more charts and statistics can be generated to offer a global visualization of the taxonomic classification results. These charts can be found in the side panel of the taxonomic classification results.

Rarefaction Curves

Rarefaction is a technique widely used in ecology which is applied to OTU analysis. A rarefaction graph (Figure 8) represents the number of expected OTUs (Y-axis) found in n NGS reads (X-axis).

The goal of rarefaction is to determine whether sequencing coverage is deep enough to get a good estimate of the total number of the OTUs present in a specific sample.

If the rarefaction curve still presents a growing trend at its end, it means that the coverage is not enough to adequately represent the real microbial diversity of the sample. By contrast, if the curve shows a horizontal asymptote, it means that a good estimation of diversity was obtained.

Note that the results of the rarefaction technique give us a suggestion about the coverage, but they are not conclusive, i.e. rare OTUs that are present in the sample could not be yet observed even if the curve presents an asymptotic trend.

Figure 8. Rarefaction curves.

Diversity Curve

An accumulation or diversity curve (Figure 9) plots the cumulative number of distinct OTUs discovered as a function of the number of samples examined. I.e. The minimum, average, and maximum number of OTUs, when looking at 1, 2, ... N samples of the current dataset.

This curve can be used to evaluate the benefits in microbial diversity of including additional samples to the dataset or to compare this diversity across datasets with similar sampling efforts.

Figure 9. Diversity curve.

Principal Coordinate Analysis (PCoA Plot)

PCoA (Figure 10) is a two-dimensional plot in which the Bray-Curtis distances between samples are drawn. You can select an experimental condition to color the points from an experimental design, and the taxonomy level from which the distances will be calculated. 


Figure 10. PCoA plot.

Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments.Genome Biology 2014, 15:R46.

Wood DE., Lu J. and Langmead B. (2019). Improved metagenomic analysis with Kraken 2. Genome biology, 20(1), 257.

Langmead B. and Salzberg SL. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357-9.

Ondov BD., Bergman NH. and Phillippy AM. (2011). Interactive metagenomic visualization in a Web browser. BMC bioinformatics, 12, 385.