Introduction
Demultiplexing, or barcode splitting, is the processing step in which barcode information is used to determine which sequences came from which sample after the samples were sequenced together. Barcodes are the unique sequences that were ligated to each individual sample's genetic material before the samples were pooled. Depending on your sequencing facility, your samples may arrive already split into individual FASTQ files, or they may be combined in a single FASTQ file with the barcodes still attached, leaving the splitting to you. In the latter case, you should also have a mapping or barcode file telling you which barcodes correspond to which samples.
This tool takes FASTA/FASTQ files and splits them into several smaller files based on barcode matching. The FASTX-Toolkit is used for this task.
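As a rough illustration of what a barcode file carries, the sketch below parses one into a name-to-barcode table. It assumes the layout used by the FASTX-Toolkit barcode splitter (one barcode per line, an identifier and the barcode separated by a TAB or other whitespace, and lines starting with '#' treated as comments). The function name parse_barcode_lines is hypothetical, and a mapping file from your facility may use a different layout.

```python
def parse_barcode_lines(lines):
    """Build a {name: barcode} table from barcode-file lines.

    Assumed layout: '<identifier><whitespace><barcode>' per line,
    with blank lines and lines starting with '#' ignored.
    """
    barcodes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"expected 'name barcode', got: {line!r}")
        name, barcode = fields[0], fields[1]
        barcodes[name] = barcode.upper()
    return barcodes


example_lines = [
    "# identifier <TAB> barcode",
    "BC1\tGATCT",
    "BC2\tATCGT",
    "BC3\tGTGAT",
    "BC4\tTGTCT",
]
barcode_table = parse_barcode_lines(example_lines)
```

For a real file, the same function can be given an open file handle, for example parse_barcode_lines(open(path)) with path pointing at your own barcode file.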
Page 2 - Configuration
Prefix - File prefix that will be added to the output files.
Suffix - File suffix that will be added to the output files.
Match Barcode - Match the barcodes at the beginning (5') or end (3') of each sequence.
Mismatches - Maximum number of allowed mismatches for barcodes.
Partial - Allow partial overlap of barcodes.
Without partial matching:
Count the mismatches between the FASTA/Q sequences and the barcodes. The barcode that matches with the lowest mismatch count 'gets' the sequence, provided the count is smaller than or equal to the Mismatches setting ('--mismatches N').
Example (barcodes: BC1 = GATCT, BC2 = ATCGT, BC3 = GTGAT, BC4 = TGTCT):
Input sequence is GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
Matching at the beginning of the sequence with 1 mismatch allowed:
GATCT (1 mismatch, BC1)
ATCGT (4 mismatches, BC2)
GTGAT (3 mismatches, BC3)
TGTCT (3 mismatches, BC4)
This sequence will be classified as 'BC1', because it has the lowest mismatch count.
If mismatches = 0 were specified, this sequence would be classified as 'unmatched' because, although BC1 has the lowest mismatch count, that count is above the maximum allowed mismatches.
Matching barcodes at the end of the sequences does the same, but from the other side of the sequence.
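The rule above can be sketched in a few lines of Python. This is a minimal illustration of the described behaviour, not the FASTX-Toolkit code itself; the names count_mismatches and classify and the at_start parameter are hypothetical. It assumes ties go to the first barcode listed and that bases missing from a read shorter than the barcode count as mismatches.

```python
def count_mismatches(a, b):
    """Number of positions at which two strings differ (compared pairwise)."""
    return sum(1 for x, y in zip(a, b) if x != y)


def classify(sequence, barcodes, max_mismatches=1, at_start=True):
    """Return (barcode name or 'unmatched', lowest mismatch count)."""
    best_name, best_count = None, None
    for name, barcode in barcodes.items():
        n = len(barcode)
        # Only the first (5') or last (3') n bases of the read are compared.
        fragment = sequence[:n] if at_start else sequence[-n:]
        count = count_mismatches(fragment, barcode)
        count += n - len(fragment)  # read shorter than barcode (assumption)
        if best_count is None or count < best_count:
            best_name, best_count = name, count
    if best_count is None or best_count > max_mismatches:
        return "unmatched", best_count
    return best_name, best_count


barcodes = {"BC1": "GATCT", "BC2": "ATCGT", "BC3": "GTGAT", "BC4": "TGTCT"}
read = "GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG"

print(classify(read, barcodes, max_mismatches=1))  # BC1 with 1 mismatch
print(classify(read, barcodes, max_mismatches=0))  # unmatched
```

Setting at_start=False compares the barcodes against the last bases of the read instead, which corresponds to matching at the 3' end.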
With partial matching (very similar to indels):
Same as above, with the following addition: barcodes are also checked for partial overlap.
Input sequence is ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
(Same as above, but note the missing 'G' at the beginning.)
Matching (without partial overlapping) against BC1 yields 4 mismatches:
ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (4 mismatches)
Partial overlapping would also try the following match:
-ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (1 mismatch)
Note: Scoring counts a missing base as a mismatch, so the final mismatch count is 2 (1 'real' mismatch, 1 'missing base' mismatch).
When running with mismatches = 2 (allowing up to 2 mismatches), this sequence is classified as BC1.
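The partial-overlap scoring can be sketched as follows for matching at the beginning of a sequence. This is an illustrative reading of the description above, not the toolkit's implementation; the function name best_partial_score and the max_missing parameter (how many leading barcode bases may be absent from the read) are hypothetical.

```python
def count_mismatches(a, b):
    """Number of positions at which two strings differ (compared pairwise)."""
    return sum(1 for x, y in zip(a, b) if x != y)


def best_partial_score(sequence, barcode, max_missing=1):
    """Lowest score over alignments in which up to max_missing leading
    barcode bases are missing from the start of the read.

    Each missing base is scored as one mismatch.
    """
    best = None
    for missing in range(max_missing + 1):
        trimmed = barcode[missing:]          # barcode without its first bases
        fragment = sequence[:len(trimmed)]   # start of the read
        score = count_mismatches(fragment, trimmed) + missing
        score += len(trimmed) - len(fragment)  # read too short (assumption)
        if best is None or score < best:
            best = score
    return best


read = "ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG"  # leading 'G' is missing

print(best_partial_score(read, "GATCT", max_missing=0))  # no partial overlap: 4
print(best_partial_score(read, "GATCT", max_missing=1))  # 1 real + 1 missing: 2
```

With a limit of 2 allowed mismatches, the second score would let this read be assigned to BC1, which matches the worked example.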
Page 3 - Output
Output Folder - Define a folder to save the results.