Demultiplexing, or barcode splitting, is the processing step in which barcode information is used to determine which sequences came from which sample after the samples were sequenced together. Barcodes are the unique sequences that were ligated to each sample's genetic material before the samples were pooled. Depending on your sequencing facility, you may receive your samples already split into individual FASTQ files, or they may be combined in a single FASTQ file with the barcodes still attached, leaving the splitting to you. In the latter case, you should also have a mapping (barcode) file that tells you which barcode corresponds to which sample.
This tool takes FASTA/FASTQ files and splits them into several smaller files based on barcode matching. FastX-Toolkit is used for this task.
Barcode Splitter Wizard
Page 1 - Input
Reads - Select the FASTA/FASTQ files containing the sequences whose attached barcodes link each sequence to its sample.
Barcode File - Select the mapping file that establishes the connection between each barcode and sample.
Barcode file format
Barcode files are simple text files. Each line should contain an identifier (descriptive name for the barcode), and the barcode itself (A/C/G/T), separated by a TAB character. Example:
#This line is a comment (starts with a 'number' sign)
BC1	GATCT
BC2	ATCGT
BC3	GTGAT
BC4	TGTCT
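The barcode file format described above can be checked with a few lines of Python before starting the wizard. This is only a minimal sketch, not part of the tool: the function name `parse_barcode_file` and the validation rules (exactly two TAB-separated fields, barcode limited to A/C/G/T) are assumptions based on the description above.

```python
import io

def parse_barcode_file(handle):
    """Return a list of (identifier, barcode) pairs from a TAB-separated barcode file.

    Hypothetical helper for illustration; it is not part of the wizard or of FastX-Toolkit.
    """
    barcodes = []
    for line_number, raw_line in enumerate(handle, start=1):
        line = raw_line.strip()
        # Skip blank lines and comment lines that start with a 'number' sign.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected identifier<TAB>barcode")
        identifier, barcode = fields[0].strip(), fields[1].strip().upper()
        # The barcode itself may only contain A/C/G/T.
        if not barcode or set(barcode) - set("ACGT"):
            raise ValueError(f"line {line_number}: invalid barcode {barcode!r}")
        barcodes.append((identifier, barcode))
    return barcodes

# The example barcode file from above, held in memory instead of on disk.
example = (
    "#This line is a comment (starts with a 'number' sign)\n"
    "BC1\tGATCT\n"
    "BC2\tATCGT\n"
    "BC3\tGTGAT\n"
    "BC4\tTGTCT\n"
)
barcodes = parse_barcode_file(io.StringIO(example))
print(barcodes)
```

A file that was saved with spaces instead of TAB characters raises a `ValueError` in this sketch, which is a common reason for a barcode file not being read as expected.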
For each barcode, a new FASTQ file will be created (with the barcode's identifier as part of the file name). Sequences matching the barcode will be stored in the appropriate file.
Splitting with the example barcode file above (saved, for instance, as "mybarcodes.txt"), the prefix "/tmp/bla_" and the suffix ".txt" will create the following files:
/tmp/bla_BC1.txt
/tmp/bla_BC2.txt
/tmp/bla_BC3.txt
/tmp/bla_BC4.txt
/tmp/bla_unmatched.txt
The 'unmatched' file will contain all sequences that didn't match any barcode.
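The naming scheme of the output files can be sketched as follows. The helper `output_paths` is hypothetical, and the prefix `/tmp/bla_` and suffix `.txt` are assumptions read off the example file names; in the wizard they correspond to the Prefix and Suffix settings.

```python
def output_paths(prefix, identifiers, suffix):
    """Build one output path per barcode identifier, plus the 'unmatched' file.

    Illustrative only: prefix + identifier + suffix, as suggested by the example names.
    """
    paths = [f"{prefix}{identifier}{suffix}" for identifier in identifiers]
    # Sequences that match no barcode go to a separate 'unmatched' file.
    paths.append(f"{prefix}unmatched{suffix}")
    return paths

paths = output_paths("/tmp/bla_", ["BC1", "BC2", "BC3", "BC4"], ".txt")
for path in paths:
    print(path)
```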
Figure 1: Wizard page 1, Input Files
Page 2 - Configuration
Prefix - File prefix that will be added to the output files.
Suffix - File suffix that will be added to the output files.
Match Barcode - Match the barcodes at the beginning (5') or end (3') of each sequence.
Mismatches - Maximum number of allowed mismatches for barcodes.
Partial - Allow partial overlap of barcodes.
Without partial matching:
Mismatches are counted between the FASTA/FASTQ sequences and each barcode. The barcode with the lowest mismatch count 'gets' the sequence, provided that the count is less than or equal to the Mismatches setting.
Example (using the above barcodes):
Input sequence is GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
Matching at the beginning of sequences with 1 allowed mismatch:
GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (1 mismatch, BC1)
ATCGT (4 mismatches, BC2)
GTGAT (3 mismatches, BC3)
TGTCT (3 mismatches, BC4)
This sequence will be classified as 'BC1', because it has the lowest mismatch count.
If mismatches = 0 were specified, this sequence would be classified as 'unmatched' because, although BC1 has the lowest mismatch count, that count is above the maximum allowed mismatches.
Matching barcodes at the end of the sequences does the same, but from the other side of the sequence.
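The classification rule without partial matching can be sketched in Python as below. This is an illustration of the rule as described here, not the FastX-Toolkit code. The names `count_mismatches` and `classify` are hypothetical, ties going to the first barcode listed is an assumption, and the read is reconstructed (the sequence from the partial-matching example with its leading 'G' restored).

```python
BARCODES = [("BC1", "GATCT"), ("BC2", "ATCGT"), ("BC3", "GTGAT"), ("BC4", "TGTCT")]

def count_mismatches(fragment, barcode):
    """Count differing positions; barcode bases beyond a short fragment also count."""
    differing = sum(1 for a, b in zip(fragment, barcode) if a != b)
    return differing + (len(barcode) - len(fragment))

def classify(read, barcodes, max_mismatches, at_start=True):
    """Return (identifier, mismatch count) for the best barcode, or 'unmatched'."""
    best_identifier, best_count = None, None
    for identifier, barcode in barcodes:
        # Compare the barcode only against the matching end of the read.
        fragment = read[:len(barcode)] if at_start else read[-len(barcode):]
        count = count_mismatches(fragment, barcode)
        # Assumption: on a tie, the barcode listed first keeps the read.
        if best_count is None or count < best_count:
            best_identifier, best_count = identifier, count
    if best_count is None or best_count > max_mismatches:
        return "unmatched", best_count
    return best_identifier, best_count

# Reconstructed example read (assumption, see the text above).
READ = "GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG"

print(classify(READ, BARCODES, max_mismatches=1))  # ('BC1', 1)
print(classify(READ, BARCODES, max_mismatches=0))  # ('unmatched', 1)
```

Passing `at_start=False` compares each barcode against the last bases of the read instead, which corresponds to matching at the end (3') of the sequences.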
With partial matching (very similar to indels):
Same as above, with the following addition: barcodes are also checked for partial overlap.
Input sequence is ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
(Same as above, but note the missing 'G' at the beginning.)
Matching (without partial overlapping) against BC1 yields 4 mismatches:
ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (4 mismatches)
Partial overlapping would also try the following match:
-ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (1 mismatch)
Note: Scoring counts a missing base as a mismatch, so the final mismatch count is 2 (1 'real' mismatch, 1 'missing base' mismatch).
If running with mismatches = 2 (meaning allowing up to 2 mismatches), this sequence will be classified as BC1.
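The scoring of a partial overlap at the beginning of a sequence can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool's implementation: the function `best_partial_score` and its `max_missing` parameter (how many leading barcode bases may be missing from the read) are hypothetical, and only 5' matching is covered.

```python
def count_mismatches(fragment, barcode):
    """Count differing positions; barcode bases beyond a short fragment also count."""
    differing = sum(1 for a, b in zip(fragment, barcode) if a != b)
    return differing + (len(barcode) - len(fragment))

def best_partial_score(read, barcode, max_missing=1):
    """Lowest score over alignments with 0..max_missing leading barcode bases missing.

    Each missing base is scored as one mismatch, as described in the note above.
    """
    best = None
    for missing in range(max_missing + 1):
        # Drop the barcode bases assumed to be missing from the start of the read.
        trimmed = barcode[missing:]
        fragment = read[:len(trimmed)]
        score = count_mismatches(fragment, trimmed) + missing
        if best is None or score < best:
            best = score
    return best

READ = "ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG"  # input sequence from the example above

print(best_partial_score(READ, "GATCT", max_missing=0))  # 4 (no partial overlap)
print(best_partial_score(READ, "GATCT", max_missing=1))  # 2 (1 real + 1 missing base)
```

With a maximum of 2 allowed mismatches, the score of 2 for BC1 is within the limit, which agrees with the classification given above.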
Figure 2: Wizard page 2, Parameters
Page 3 - Output
Output Folder - Define a folder to save the results.
Figure 3: Wizard page 3, Output Folder