Variant Calling using BCFtools

Introduction

BCFtools is a widely-used variant calling tool, especially among non-human species, which is characterized by its small time of execution and its precision.

BCFtools uses two algorithms. The first one is called mpileup. This algorithm reads the alignments and, for each position of the genome, constructs a vertical slice across all reads covering the position (“pileup”). Genotype likelihoods are then calculated, representing how consistent are the observed data with the possible diploid genotypes. The calculation takes into account mapping qualities of the reads, base qualities, and the probability of local misalignment, per-base alignment quality (BAQ), and the user can set thresholds for each of these parameters. The second step, 'bcftools call', then evaluates the most likely genotype under the assumption of Hardy-Weinberg equilibrium.

Please cite BCFtools as:

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., ... & Li, H. (2021). Twelve years of SAMtools and BCFtools. Gigascience, 10(2), giab008.

Figure 1. Workflow of the BCFtools Mpileup-Call Pipeline

Run BCFtools for Variant Calling

BCFtools can be found under Genetic Variation → Variant Calling → BCFtools. The wizard consists of 4 pages and allows to define the input and output options as well as the analysis parameters (Figure 2, Figure 3, Figure 4 and Figure 5).

Input

  • First of all, this tool requires two types of necessary files:

    • BAM files: alignment files in BAM format. To obtain them, you must align FASTQ files using a DNA-Seq Alignment Strategy, like BWA or Bowtie 2.

    • Reference Genome: FASTA file with the reference genome.

  • Make sure that read alignment was executed using the same reference genome as the one that is used here as input.

Figure 2. Input Page

Configuration 1

In this page, you can set the option of BAM Preprocessing using Picard and basic parameters for the mpileup step.

  • Remove Duplicates: Mark this option if you have Whole Genome Sequencing or Whole Exome Sequencing in order to remove PCR duplicates. For GBS or RADSeq dataset, this option is not recommended.

  • Adjust Mapping Quality: coefficient for downgrading mapping quality for reads containing excessive mismatches. This parameter is disable by default (0 value).

  • Maximum Depth: maximum raw per-file depth. By default, this value is set to 250, as this depth is enough to achieve good results. In this case, if the coverage depth is higher than this threshold, the algorithm will pick up 250 alignments at random. Nevertheless, if you want to tune this value higher, keep in mind that the time of execution will also be higher.

  • Minimum Mapping Quality: minimum mapping quality for an alignment to be used. The Mapping Quality quantify the probability that a read is misplaced.

  • Minimum Base Quality: minimum base quality for a nucleotide to be used. This base quality is the Phred score calculated at each position by the sequencing machine.

  • Ignore @RG Tags: in a BAM file, the RG tag means Read Group. A Read Group is a set of alignments that come from the same sample. If this option is checked and BAM files have different RG values, all alignments in a BAM files are treated as from the same sample.

  • BAQ options. These options are related to the realignment of some sequences in order to check the per-Base Alignment Quality (BAQ). BAQ evaluates the probability that there is a misalignment in each base:

    • No BAQ: disable realignment. This option is recommended as it helps to reduce false positives.

    • Redo BAQ: recalculate BAQ values in problematic regions.

    • Full BAQ: redo BAQ in all positions.

Figure 3. Configuration 1 Page

Configuration 2

In this page, you can adjust the advanced parameters for the mpileup step. These parameters do not have great consequences in the output.

  • Extension Error Probability: phred-scaled gap extension sequencing error probability. Reducing this value leads to longer indels.

  • Minimum Fraction of Gapped Reads: threshold of the proportion of reads that have a gap in order to call an indel in a position.

  • Tandem Quality: coefficient for modeling homopolymer errors. A higher value of this parameter means a higher reliance on indels in homopolymers. Nevertheless, this parameter will not affect in a high degree to your results, although if you want a higher reliance on SNPs rather than in indels, set a smaller value.

  • Skip Indel Calling: if this parameter is check, the VCF will only contain SNPs. The time of execution will also be shorter.

  • Gapped Reads for Indel: number of reads necessary to call an indel.

  • Phred Open Sequencing Error: phred-score gap opening error. If you set a smaller value, more indels will be called.

In addition, some parameters for the call step can also be set:

  • Keep Alternate Alleles: keep all alternate alleles, even those that appear in the mpileup step but are not called in any of the genotypes in the second step.

  • Use Groups: this option allows to group samples into populations and apply the Hardy Weinberg Equilibrium within population. If this option is disabled, the Hardy Weinberg Equlibrium is applied within all samples as one big population. By using this option, you gain power on SNPs shared between samples but lose power on singleton SNPs. We recommend to use this option only if you have a low-coverage dataset and you are going to do GWAS.

  • Group Experiment File: tab-delimited file with no header and two columns. The first column has sample names and the second one has population names.

Figure 4. Configuration 2 Page

Output

  • VCF File: filename of the resulting VCF file.

Figure 5. Output Page

Results

Variant Calling has the following outputs:

  • VCF file with all the variants that were found.

  • Report with summary details:

    • Input data: name of the input reference file and number of BAM files used.

    • Information about the resulting VCF: information about the types of variant found and the number of alleles per variant.

    • Adjusted parameters: as you might want to repeat the variant calling with other parameters, it is important to keep this table for reproducibility.

Figure 6. Summary Report from BCFtools.

  • Distribution charts of different quality variables found in the VCF. This charts might be important to know how to filter the VCF subsequently:

  1. Raw Read Depth Histogram: in BCFtools, Raw Read Depth (the DP field) means the number of times that site was read no matter the nucleotide it is in that position (the sum of the reference nucleotide and other variants).





Figure 7. Raw Read Depth Histogram.

2. Proportion 'Quality/Depth' Histogram: the quality column in VCF files generated using BCFtools is the Phred-scaled probability that the site has no variant. Nevertheless, it is better to rely on this quality normalized by depth.

Figure 8. Proportion 'Quality/Depth' Histogram.

3. Average Mapping Quality Histogram: the MQ value in the info field of a BCFtools VCF file relates to the average of all mapping qualities of the reads supporting the variant.

Figure 9. Average Mapping Quality Histogram.