Multi-Locus Sequence Typing (MLST)


Introduction

Multi-locus sequence typing (MLST) is a portable and reproducible typing system for studying the genetic diversity of important public health pathogens. It is a nucleotide sequence-based procedure that unambiguously characterizes isolates of bacterial species using the sequences of internal fragments of (usually) seven housekeeping genes. Internal fragments of approximately 450-500 bp of each gene are used, as these can be accurately sequenced on both strands using an automated DNA sequencer. For each housekeeping gene, the different sequences present within a bacterial species are assigned as distinct alleles and, for each isolate, the alleles at each of the seven loci define the allelic profile or sequence type (ST).
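As a minimal sketch of how an allelic profile maps to a sequence type, the lookup can be pictured as a table keyed by the allele numbers at each locus. The locus names, allele numbers, and ST values below are hypothetical placeholders chosen for illustration; they are not taken from PubMLST or from any real scheme, and this is not the tool's actual implementation.

```python
# Hypothetical seven-locus scheme; names and numbers are illustrative only.
LOCI = ("geneA", "geneB", "geneC", "geneD", "geneE", "geneF", "geneG")

# Hypothetical profile table: allelic profile (one allele number per locus) -> ST.
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 3, 1, 1, 4): 2,
}


def sequence_type(alleles):
    """Return the ST for a dict of locus -> allele number, or "Unknown".

    "Unknown" is returned when the combination of alleles has no ST assigned
    in the profile table, even if every individual allele is known.
    """
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(profile, "Unknown")


isolate = dict(zip(LOCI, (1, 2, 1, 3, 1, 1, 4)))
print(sequence_type(isolate))
```

The order of the loci matters, which is why the profile is built by iterating over `LOCI` rather than over the input dictionary.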

Please cite MLST as:

Larsen MV et al. (2012). Multilocus sequence typing of total-genome-sequenced bacteria. Journal of clinical microbiology, 50(4), 1355-61.

Run MLST

This functionality can be found under Genome Analysis → Multi-locus Sequence Typing (MLST). The wizard allows you to select the input files and set the parameters (Figure 1 and Figure 2).

Input

  • Input File Type: Choose between Assembled or Draft Genome/Contigs (FASTA) or Raw Sequencing Reads (FASTQ). Single-end or paired-end reads can be used when selecting Raw Sequencing Reads (FASTQ). Note that if paired-end is selected, two files per sample are required.
  • Input Data: Provide the files containing sequencing reads or contigs. These files can be in FASTQ or FASTA format.
  • Paired-end configuration: In the case of paired-end reads, the pattern that distinguishes upstream files from downstream files is required. The provided patterns are searched for right before the file extension, and the start of the name must be the same for both files of each sample. Files whose names match the upstream and downstream patterns will be treated as paired-end data. The remaining files, and those for which no partner is detected, will be treated as single-end data.
    • Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.
    • Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files. 

For example, if the upstream file is named SRR3666079_1.fastq and the downstream one SRR3666079_2.fastq, you should establish "_1" as the upstream pattern and "_2" as the downstream pattern. 
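The pairing rule described above can be sketched as follows. This is an illustrative approximation, not the application's own code: the function name `pair_reads` is hypothetical, and the sketch assumes a single file extension (a compressed name such as `.fastq.gz` would need extra handling).

```python
from pathlib import Path


def pair_reads(filenames, upstream="_1", downstream="_2"):
    """Group FASTQ file names into paired-end samples and single-end files."""
    ups, downs, singles = {}, {}, []
    for name in filenames:
        stem = Path(name).stem  # file name without its last extension
        if stem.endswith(upstream):
            ups[stem[: -len(upstream)]] = name
        elif stem.endswith(downstream):
            downs[stem[: -len(downstream)]] = name
        else:
            singles.append(name)

    # A sample is paired only when both an upstream and a downstream file exist.
    paired = {s: (ups[s], downs[s]) for s in ups if s in downs}

    # Files that match a pattern but have no partner are treated as single-end.
    for table in (ups, downs):
        singles.extend(f for s, f in table.items() if s not in paired)
    return paired, sorted(singles)


paired, singles = pair_reads(
    ["SRR3666079_1.fastq", "SRR3666079_2.fastq", "other_1.fastq", "plain.fastq"]
)
print(paired)
print(singles)
```

Here `SRR3666079` is paired, while `other_1.fastq` (no downstream partner) and `plain.fastq` (no pattern) end up in the single-end list.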


Figure 1: Input Data Page

Configuration

  • MLST Configuration: Select the species database that will be used as a template for MLST prediction. If the wrong species is selected, the run may fail, or the output will show zero or minimal identity and coverage. MLST allele sequence and profile data are obtained from PubMLST.org.

For four organisms, two or three different MLST schemes are available. These are:

  1. Acinetobacter baumannii: (Acinetobacter baumannii #1, Acinetobacter baumannii #2)
  2. Escherichia coli: (Escherichia coli #1, Escherichia coli #2)
  3. Pasteurella multocida: (Pasteurella multocida #1 (RIRDC), Pasteurella multocida #2 (multihost))
  4. Leptospira: (Leptospira #1, Leptospira #2, Leptospira #3)

Figure 2: MLST Configuration Page

Results

When the MLST run completes, it creates a sequence table containing the MLST results (Figure 3). This table contains the following columns:

Figure 3: Results Table Page


  1. Tags: It gives a quick overview of the MLST result of your sample. One of three tags is assigned:
    1. Matched: A complete match was found with no errors or SNPs. The average identity between all the housekeeping gene templates and the query reads/sequences, as well as the average coverage, is 100%. All samples with "Matched" results will be highlighted in green.
    2. Partial: Partial matches were found. The average identity between all the housekeeping gene templates and the query, or the average coverage, is less than or equal to 99%. This happens when potential errors or SNPs have been detected. All samples with "Partial" results will be highlighted in orange.
    3. No Matched: The query sequence did NOT match any housekeeping gene template within the chosen MLST configuration. All samples with "No Matched" results will be highlighted in red.
  2. Name: It displays the input file name.
  3. Sequence Type: It contains the corresponding MLST sequence type. Please note that for all "Partial" results, the sequence type will consist of a number and an asterisk. The asterisk indicates that the Sequence Type number shown is not a 100% match; the alleles with discrepancies are listed in the "Notes" column.

    Sequence Type: Unknown

    Please note that "Unknown" can be the Sequence Type for samples with a "Matched" result. In that case, all the alleles in the query match alleles in the database template sequences at 100%, but the combination of alleles does not have an MLST number assigned yet. For samples with "Partial" results, the "Unknown" Sequence Type is likewise not due to the discrepancies, but to the combination of alleles not having an MLST number assigned yet. Lastly, for samples with "No Matched" results, the Sequence Type is "Unknown" because no MLST locus matched the input data. This is most common when the wrong MLST scheme was chosen on the MLST configuration wizard page.

    Some "Unknown" results may also report a "Nearest ST..." if enough coverage and identity is found between the query reads/sequences and any template sequence in the database. Figure 3 shows an example of this case for the sample named "scaffolds4", in which the sequence type is reported as "Unknown, Nearest ST: 34, 196".

  4. Average Identity: All query reads/sequences that match a housekeeping gene template sequence in the database will return the percentage identity of the alignment. This percentage identity obtained for each housekeeping gene will be averaged and this average will be displayed in this column. 
  5. Average Coverage: All query reads/sequences that match a housekeeping gene template sequence in the database will return the percentage coverage of the alignment. This percentage coverage obtained for each housekeeping gene will be averaged and this average will be displayed in this column. 
  6. Notes: It contains important information relevant to the sequence type result. "Matched" results will NOT show any information in this column. "Partial" results will list all the alleles with discrepancies; a discrepancy may indicate a novel allele, errors, or SNPs. A detailed report containing the nucleotide differences and their location within the alleles can be found in the "MLST Alignment Report". "No Matched" results will indicate that no MLST locus was found in the input data; in that case, make sure that the correct MLST scheme was chosen.
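The tagging and averaging rules in the list above can be sketched as follows. This is an illustrative approximation rather than the tool's implementation: the function names are hypothetical, and the sketch assumes that any average below 100% counts as "Partial", which is a simplification of the thresholds stated above.

```python
from statistics import mean


def classify(loci_hits):
    """Assign a tag from per-locus (identity, coverage) percentages.

    loci_hits holds one (identity, coverage) pair per housekeeping gene
    that matched a template; an empty list means nothing matched.
    """
    if not loci_hits:
        return "No Matched", 0.0, 0.0
    avg_identity = mean(identity for identity, _ in loci_hits)
    avg_coverage = mean(coverage for _, coverage in loci_hits)
    # Assumption: anything short of a perfect average is reported as Partial.
    if avg_identity == 100 and avg_coverage == 100:
        tag = "Matched"
    else:
        tag = "Partial"
    return tag, avg_identity, avg_coverage


def format_st(st, tag):
    """Append an asterisk to a numbered ST when the result is Partial."""
    if tag == "Partial" and st != "Unknown":
        return f"{st}*"
    return str(st)


tag, identity, coverage = classify([(100, 100)] * 6 + [(99.2, 100)])
print(tag, round(identity, 2), coverage, format_st(34, tag))
```

Keeping the tag separate from the Sequence Type reflects the text above: a "Matched" sample can still have an "Unknown" ST when its allele combination has no number assigned.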


When the MLST run completes, it also creates an MLST Report. This contains the information relevant to the MLST run, including the input data, MLST configuration used, MLST results, and the parameters used for the analysis (Figure 4).

Figure 4: MLST Report Page




An MLST Result report can be generated for each sample or file, listing all the housekeeping gene sequences found in the query reads/sequences. To access this report, right-click on the row of the sample and select the "Show MLST Result" option (Figure 5).

Figure 5: Generating MLST Results


The MLST Result report contains the information relevant to that specific sample or input file. This includes the input data, MLST configuration used, MLST results, and the parameters used for the analysis (Figure 6). The MLST results portion contains a table with the following columns:

  • Locus: The name of the housekeeping gene that the query reads/sequences have been aligned to.
  • Identity: The percentage of the query reads/sequences that matched a template sequence in the database.
  • Coverage: How much of the template sequence in the database has been covered by the query reads/sequences.
  • Alignment Length: The total number of nucleotides of the query reads/sequences that have aligned against a template sequence in the database.
  • Allele Length: The total number of nucleotides in the allele or template sequence in the database.
  • Gaps: Indicates whether any gaps or deletions have been detected.
  • Allele: The name of the housekeeping gene or sequence in the database.
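To make the identity, coverage, and gaps columns described above more concrete, here is a minimal sketch that derives them from one gapped pairwise alignment. The formulas are assumptions made for illustration (identity as matches over alignment length, coverage as aligned template positions over allele length, gaps as a count of gap characters); the documentation does not state how the tool computes these values, and the function name is hypothetical.

```python
def alignment_stats(query_aln, template_aln, allele_length):
    """Summarize one pairwise alignment given as two equal-length gapped strings.

    A "-" character marks a gap in either sequence.
    """
    if not query_aln or len(query_aln) != len(template_aln):
        raise ValueError("aligned strings must be non-empty and of equal length")
    columns = list(zip(query_aln, template_aln))
    alignment_length = len(columns)
    # Columns where both sequences carry the same nucleotide.
    matches = sum(q == t and q != "-" for q, t in columns)
    # Template positions that are aligned to a query nucleotide.
    covered = sum(q != "-" and t != "-" for q, t in columns)
    gaps = query_aln.count("-") + template_aln.count("-")
    return {
        "identity": 100.0 * matches / alignment_length,
        "coverage": 100.0 * covered / allele_length,
        "alignment_length": alignment_length,
        "allele_length": allele_length,
        "gaps": gaps,
    }


print(alignment_stats("ACGTAC-T", "ACGTACGT", allele_length=8))
```

In this toy alignment a single deletion in the query lowers both identity and coverage to 87.5% and is counted as one gap.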


Figure 6: MLST Results Page for a Specific Sample or Input File

An MLST Alignment report can be generated for each sample or file, showing all the housekeeping gene sequences found in the query reads aligned against the housekeeping gene template sequences (Figure 8). To access this report, right-click on the row of the sample and select the "Show MLST Alignment Report" option (Figure 7).

Figure 7: Generating MLST Alignment Report Page


The MLST Alignment report contains a detailed, color-coded report of the alignments. In this report, the alignment between the query reads/sequences and the template sequences in the database is divided by detected allele. Each allele can be identified by the pound sign (#) in front of its name. Discrepancies are highlighted (Figure 8).

Figure 8: MLST Alignment Report Page