Demultiplexing with Cutadapt
Demultiplexing, or barcode splitting, is the processing step in which barcode information is used to determine which sequences came from which sample after the samples were sequenced together. Barcodes are the unique sequences ligated to each sample's genetic material before the samples are pooled. Depending on your sequencing facility, you may receive your reads already split into individual FASTQ files, or combined in a single FASTQ file with the barcodes still attached, leaving the splitting to you. In the latter case, you should also have a mapping or barcode file telling you which barcodes correspond to which samples.
This tool takes FASTA/FASTQ files and splits them into several smaller files based on barcode matching. Cutadapt is used for this task.
Page 1 - Input
Reads - Select the FastQ/A files whose sequences carry the barcodes that link them to their respective samples.
Barcode File - Select the mapping file that establishes the connection between each barcode and sample.
Barcode file format
Barcode sequences can be provided in three different formats:
1. Two-Columns TXT/CSV file: Barcode files are simple TXT files or CSV/TSV files. Each line should contain an identifier (descriptive name for the barcode), and the barcode itself (A/C/G/T), separated by a TAB character. Example:
BC1	GATCT
BC2	ATCGT
BC3	GTGAT
BC4	TGTCT
2. Three-Columns TXT/CSV file: This format is similar to the previous one but has a third column containing the names of the files where you want to look for each barcode. Example:
BC1	GATCT	filename1
BC2	ATCGT	filename1
BC3	GTGAT	filename2
BC3	GTGAT	filename3
BC4	TGTCT	filename3
In this case, each barcode will be searched only in the files indicated in the third column.
3. Fasta file: Barcode sequences are contained in a fasta file, each preceded by its barcode ID as a fasta header (a line with ">" as its first character). Example:
>BC1
GATCT
>BC2
ATCGT
>BC3
GTGAT
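The three formats above can be read with a few lines of code. The following is a minimal sketch, not the tool's own parser: the `Barcode` tuple and the `parse_barcodes` function are hypothetical names introduced for illustration, and the comma fallback for CSV input is an assumption.

```python
from typing import List, NamedTuple, Optional


class Barcode(NamedTuple):
    name: str
    sequence: str
    filename: Optional[str]  # None means "search in every input file"


def parse_barcodes(text: str) -> List[Barcode]:
    """Parse barcode definitions in fasta, two-column or three-column form."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    barcodes: List[Barcode] = []
    if lines and lines[0].startswith(">"):
        # Fasta: a header line (">ID") followed by the barcode sequence.
        name = None
        for ln in lines:
            if ln.startswith(">"):
                name = ln[1:].strip()
            elif name is not None:
                barcodes.append(Barcode(name, ln.upper(), None))
                name = None
        return barcodes
    for ln in lines:
        # Tabular: split on TAB, falling back to comma (assumed CSV variant).
        fields = ln.split("\t") if "\t" in ln else ln.split(",")
        fields = [f.strip() for f in fields]
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 2 or 3 columns, got: {ln!r}")
        filename = fields[2] if len(fields) == 3 else None
        barcodes.append(Barcode(fields[0], fields[1].upper(), filename))
    return barcodes
```

In this sketch a barcode with `filename` set to `None` applies to every input file, while a three-column entry restricts the search to the named file.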
For each barcode and file, a new FASTA/FASTQ file will be created (with the barcode's identifier as part of the output file name). Sequences matching the barcode will be stored in the appropriate file. The name of the new files will contain the name of the original input file as well.
Running the above example (assuming the barcode file contains the above barcodes) will create the following files:
[filename]-BC1.fastq.gz
[filename]-BC2.fastq.gz
[filename]-BC3.fastq.gz
[filename]-BC4.fastq.gz
[filename]-unknown.fastq.gz
Note that, in this case, .fastq.gz has been chosen as the file suffix.
The 'unknown' file will contain all sequences that didn't match any barcode.
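The naming pattern described above can be sketched as a small function. This is an illustration only: `output_names` and its parameters are hypothetical and are not part of the tool or of Cutadapt.

```python
from typing import List


def output_names(input_name: str, barcode_ids: List[str],
                 suffix: str = ".fastq.gz",
                 save_unmatched: bool = True) -> List[str]:
    """Build the expected output file names for one input file."""
    # One file per barcode, named [filename]-[BarcodeID][suffix].
    names = [f"{input_name}-{bc}{suffix}" for bc in barcode_ids]
    if save_unmatched:
        # Reads that matched no barcode go to the 'unknown' file.
        names.append(f"{input_name}-unknown{suffix}")
    return names


for name in output_names("reads", ["BC1", "BC2", "BC3", "BC4"]):
    print(name)
```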
Figure 1. Wizard page 1, Input Files
Page 2 - Configuration
Adapter Position - Match the barcodes at the beginning (5') of the sequences, at the end (3'), or anywhere along the sequences.
Allowed Errors - Maximum number of allowed errors (mismatches and indels, if allowed) for barcodes, ranging from 0 to 10.
Allow Indels - Count insertions and deletions, in addition to mismatches, as allowed errors.
Save Unmatched Sequences - Check to save all unmatched sequences in an 'unknown' FastQ/A file.
Include Sample Name - Check to include the input file name as a prefix of the output files.
Disabling this option generates output files with the following file name structure: [BarcodeID].fastq.gz
In this case, all reads from all input files that match a given barcode will be placed in the same output file.
File Format - Select the output format (fasta or fastq) and whether the output files are compressed (.gz) or not.
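To make the Adapter Position and Allowed Errors options more concrete, the sketch below assigns a read to a barcode by counting mismatches only. It is a simplified illustration under stated assumptions, not Cutadapt's matching algorithm: it ignores indels, and the `find_barcode` function and its position labels are hypothetical names.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_barcode(read: str, barcodes: Dict[str, str],
                 position: str = "5prime",
                 max_errors: int = 0) -> Optional[str]:
    """Return the ID of the best-matching barcode, or None if none qualifies."""
    best_id, best_errors = None, max_errors + 1
    for bc_id, bc in barcodes.items():
        if len(bc) > len(read):
            continue
        if position == "5prime":
            starts = [0]                                 # beginning of the read
        elif position == "3prime":
            starts = [len(read) - len(bc)]               # end of the read
        else:
            starts = range(len(read) - len(bc) + 1)      # anywhere
        for s in starts:
            errors = hamming(read[s:s + len(bc)], bc)
            # Strict comparison: on ties, the first barcode seen wins.
            if errors < best_errors:
                best_id, best_errors = bc_id, errors
    return best_id


barcodes = {"BC1": "GATCT", "BC2": "ATCGT"}
print(find_barcode("GATCAAAAA", barcodes, max_errors=1))
```

A read for which the function returns `None` would be the kind of read that ends up in the 'unknown' file when Save Unmatched Sequences is checked.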
Page 3 - Output
Output Folder - Define a folder to save the results.
Save Counts Table - Check to save a table containing the results of matching all provided barcodes in each input file.
Counts File - Define the file name in which to save the barcode counts.
Cutadapt provides the following results:
Report with information for each input sample regarding the proportion of reads matched with any provided barcode.
Matches per Category Chart: Stacked bar plot representing the absolute number of reads in every input file and the number of them matched by any provided barcode.
Relative Matches per Category Chart: Stacked bar plot similar to the previous chart, showing the relative number of reads per sample file (Figure 4). Useful when the number of reads differs greatly between input files.
Output FastQ/A files containing all matched reads, demultiplexed and with their adapters trimmed. The demultiplexed reads are grouped into these files by barcode and input file if the “Include Sample Name” parameter has been checked. Otherwise, they are grouped only by the provided barcodes, even if they come from different input files.
Counts table in a tabular TXT file. This file includes the count of all barcode matches across all the input samples. It also includes the total number of matches per sample and the number of unmatched sequences. It is formatted as a table, with the barcodes as rows and the samples as columns, and it can be opened in any spreadsheet program.
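The layout of such a counts table, with barcodes as rows and samples as columns, can be sketched as follows. This is an assumption-laden illustration: the `counts_table` function and the row labels `barcode`, `total_matched` and `unmatched` are hypothetical and may differ from the labels in the file the tool writes.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def counts_table(assignments: Iterable[Tuple[str, Optional[str]]],
                 barcode_ids: List[str], samples: List[str]) -> str:
    """Build a tab-separated table from (sample, barcode_id) pairs.

    A barcode_id of None marks a read that matched no barcode.
    """
    counts = Counter(assignments)
    rows = ["\t".join(["barcode"] + samples)]
    for bc in barcode_ids:
        # One row per barcode, one column per sample.
        rows.append("\t".join([bc] + [str(counts[(s, bc)]) for s in samples]))
    totals = [sum(counts[(s, bc)] for bc in barcode_ids) for s in samples]
    rows.append("\t".join(["total_matched"] + [str(t) for t in totals]))
    rows.append("\t".join(["unmatched"] + [str(counts[(s, None)]) for s in samples]))
    return "\n".join(rows)


reads = [("s1", "BC1"), ("s1", "BC1"), ("s1", None), ("s2", "BC2")]
print(counts_table(reads, ["BC1", "BC2"], ["s1", "s2"]))
```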
Please cite Cutadapt as:
Cutadapt removes adapter sequences from high-throughput sequencing reads.
Marcel Martin. EMBnet.journal, 17(1):10-12, May 2011.