Variant Filtering

Introduction

Variant filtering is a secondary analysis operation that follows the Variant Calling step. It consists of identifying high-confidence variants and removing those that are more likely to be false calls. To filter out these false variants, the user must select a threshold considered adequate for each of several information fields of the VCF. This step is crucial to avoid analyzing false positives, and it also speeds up every subsequent step because fewer variants are studied.

For VCF files generated with either BCFtools or Freebayes, the user can set thresholds for parameters such as quality, depth, average mapping quality and quality normalized by depth. In addition, the user can remove SNPs with more than one alternative allele (multiallelic variants), which may be useful before performing an association analysis. Moreover, we have incorporated two additional Freebayes-specific parameters: the user can select variants with at least one read on each strand and/or variants that are supported by reads on both sides of the site.

Run Variant Filtering

This tool can be found under Genetic Variation → Variant Filtering. The wizard consists of 2 pages and allows you to define the input, the filtering parameters and the output options (Figure 1 and Figure 2).

Input and parameters

  • VCF file: this file must come from a Variant Calling analysis.

  • Proportion ‘Quality / Counts’: ratio between the Phred quality of the SNP and the count of full observations of alternate haplotypes. If there is more than one alternative haplotype, the mean is taken. This filter is more powerful than using the Phred quality or the observation counts on their own.

  • Raw Read Depth: total number of reads overlapping that position.

  • Phred Quality: Phred-scaled quality score for the assertion that the alternative allele exists and is not a sequencing error.

  • Average Mapping Quality (MQ): mapping quality measures how uniquely a read maps to the reference. It is derived from the probability that the read is misplaced, so a higher value means a lower probability of misplacement.

  • Genotypic Options:

    • Remove Variants with Multiple Alleles: this is recommended if you are going to run a Genome-Wide Association Study.

    • Missing Genotypes per Variant: maximum fraction of genotypes that can be missing in a variant. For example, if you set this value to 0.1, only variants with at least 10 out of 100 genotypes will be kept.

    • Genotype Depth (GD) Threshold: minimum number of reads that must support the genotype.

    • Genotype Quality (GQ) Threshold: minimum Phred-scaled confidence that the genotype assignment is correct.

    • Minimum Allele Frequency Threshold: the Minor Allele Frequency (MAF) is the frequency of the least frequent allele of a variant in a population. Variants with a MAF smaller than this threshold will be filtered out.

If the population consists of a single individual (i.e., you provided only one BAM file with aligned reads from one sample in the Variant Calling step), Genotype Quality will be equal to Phred Quality, and Genotype Depth will be equal to Raw Read Depth. You can set those parameters to "0" to disable them.
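As a worked illustration of the MAF definition given under Genotypic Options, the following Python sketch counts alleles across the genotypes of one variant and returns the frequency of the least frequent allele. The function name and the genotype strings are hypothetical, and the way the tool itself computes MAF may differ; missing alleles (".") are assumed to be ignored.

```python
def minor_allele_frequency(genotypes):
    """Return the frequency of the least frequent allele.

    genotypes: list of VCF GT strings such as '0/1', '1|1' or './.'.
    Assumption: missing alleles ('.') do not count towards the total.
    """
    counts = {}
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[allele] = counts.get(allele, 0) + 1
    total = sum(counts.values())
    # A variant with no calls, or with a single observed allele, has MAF 0.
    if total == 0 or len(counts) < 2:
        return 0.0
    return min(counts.values()) / total


# Five samples, one of them missing: allele 0 is seen 5 times, allele 1 3 times.
maf = minor_allele_frequency(["0/0", "0/1", "1/1", "0/0", "./."])
keep_variant = maf >= 0.05  # 0.05 is the MAF threshold listed in Table 1
```

In this example the MAF is 3/8 = 0.375, so the variant would be kept with a threshold of 0.05. A variant where the alternate allele appears once among 20 diploid samples (1/40 = 0.025) would be filtered out.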

  • Freebayes-specific Parameters:

    • Check Reads in Both Strands: verifies that there is at least one read on each strand. Enabling this parameter is recommended because a real variant should be observed on both DNA strands.

    • Check if Reads are Balanced: verifies that there are at least two reads 'balanced' to each side of the site (e.g. there is at least one read placed to the right and another one placed to the left of the variant).
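To make the site-level parameters above more concrete, the following Python sketch shows one way such thresholds could be applied to a single line of a Freebayes VCF. It is not the tool's implementation. It assumes the Freebayes INFO tags DP, AO, MQM, SAF, SAR, RPL and RPR, assumes that a variant is kept when its value is greater than or equal to each threshold, and reads the balance check as "at least one supporting read on each side". The function names are hypothetical, and the default values mirror the WGS column of Table 1.

```python
def parse_info(info_field):
    """Turn 'DP=30;AO=12,3' into {'DP': ['30'], 'AO': ['12', '3']}."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value.split(",") if value else []
    return info


def numbers(info, key):
    """Return the values of an INFO tag as floats (one per alternate allele)."""
    return [float(v) for v in info[key]]


def passes_site_filters(vcf_line, min_qual=20.0, min_depth=10, min_mq=55.0,
                        min_qual_per_count=2.0, drop_multiallelic=True,
                        both_strands=True, balanced=True):
    """Hypothetical helper: True if the variant passes all enabled filters."""
    cols = vcf_line.rstrip("\n").split("\t")
    alt, qual = cols[4], float(cols[5])  # ALT and QUAL columns
    info = parse_info(cols[7])           # INFO column

    if drop_multiallelic and "," in alt:
        return False
    if qual < min_qual:
        return False
    if numbers(info, "DP")[0] < min_depth:
        return False
    mapping_quals = numbers(info, "MQM")
    if sum(mapping_quals) / len(mapping_quals) < min_mq:
        return False

    # Quality / Counts: Phred quality divided by the mean number of
    # alternate observations (mean taken over the alternate alleles).
    alt_counts = numbers(info, "AO")
    mean_count = sum(alt_counts) / len(alt_counts)
    if mean_count == 0 or qual / mean_count < min_qual_per_count:
        return False

    # At least one alternate observation on the forward and reverse strand.
    if both_strands and not all(
            f > 0 and r > 0
            for f, r in zip(numbers(info, "SAF"), numbers(info, "SAR"))):
        return False
    # Supporting reads placed to the left and to the right of the site
    # (assumed reading of the balance rule).
    if balanced and not all(
            l > 0 and r > 0
            for l, r in zip(numbers(info, "RPL"), numbers(info, "RPR"))):
        return False
    return True


record = "chr1\t100\t.\tA\tG\t60\t.\tDP=30;AO=12;MQM=60;SAF=5;SAR=7;RPL=6;RPR=6"
kept = passes_site_filters(record)
```

The example record passes every check: its quality is 60, its depth is 30, and its Quality / Counts ratio is 60 / 12 = 5. VCF files produced by BCFtools use different INFO tags, so the tag names in this sketch would need to be adapted for them.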

Variant Filtering strongly depends on the genotyping protocol used to obtain the dataset. The two main types of experiment in this field are WGS for GWAS analysis and reduced-representation techniques such as GBS or RADseq. Table 1 lists recommended values for each.

Table 1. Recommended Parameters

Parameters                               WGS     GBS or RADseq
Quality / Counts                         2       2
Raw Read Depth                           10      2
Phred Quality                            20      20
Average MQ                               55      55
Remove Variants with Multiple Alleles    True    True
Missing Genotypes                        0.6     0.1
GD Threshold                             8       1
GQ Threshold                             20      1
MAF Threshold                            0.05    0.05

Figure 1. Input and Parameters Page

Output

  • Filtered VCF: filename for the filtered VCF file.

Figure 2. Output Page

Results

Variant Filtering of BCFtools VCF files has the following outputs:

  • Filtered VCF file.

  • Report with information about the number of variants before and after filtering, and the parameters used.

  • Quality Control Charts: these charts are the same as the ones that appear in the Variant Calling step. In addition, the Phred Quality distribution and the MAF distribution are included, as both are used as filtering parameters.

Figure 3. Summary Report of Variant Filtering

Figure 4. Phred Quality Distribution

Figure 5. Minimum Allele Frequency Distribution