Variant filtering is a secondary analysis operation that follows the Variant Calling step. It consists of identifying high-confidence variants and removing those that are more likely to be false calls. To filter out these false variants, the user must select thresholds considered adequate for different information fields of the VCF. Keep in mind that this step is crucial to avoid analyzing false positives, which also speeds up every subsequent step because fewer variants are studied.
For VCF files generated with either BCFtools or Freebayes, the user can set thresholds on parameters such as quality, depth, average mapping quality, and quality normalized by depth. In addition, the user can remove SNPs with more than one alternative allele (multiallelic variants), which can be useful before performing an association analysis. Moreover, we have incorporated two additional Freebayes-specific parameters: the user can select variants with at least one read on each strand and/or variants that are supported by reads placed on both sides of the variant site.
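As an illustration of what such thresholds mean for a single VCF record, the following is a minimal Python sketch, not the tool's actual implementation. It assumes the depth and mapping quality are stored in the INFO tags DP and MQ (the mapping-quality tag can be changed through the hypothetical mq_key argument), and the default threshold values are placeholders chosen only for the example.

```python
def parse_info(info):
    """Turn 'DP=35;MQ=60;INDEL' into {'DP': '35', 'MQ': '60', 'INDEL': True}."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True  # flags have no '=' part
    return parsed


def passes_thresholds(line, min_qual=20.0, min_depth=10, min_mq=30.0,
                      min_qual_by_depth=2.0, drop_multiallelic=True,
                      mq_key="MQ"):
    """Return True if a tab-separated VCF data line meets every threshold.

    All default values are illustrative assumptions, not recommended settings.
    """
    fields = line.rstrip("\n").split("\t")
    alt, qual_text, info = fields[4], fields[5], parse_info(fields[7])

    # A comma in the ALT column means more than one alternative allele.
    if drop_multiallelic and "," in alt:
        return False
    if qual_text == ".":  # missing quality cannot be compared to a threshold
        return False

    qual = float(qual_text)
    depth = int(info.get("DP", 0))
    mapping_quality = float(info.get(mq_key, 0))
    if depth == 0:
        return False

    return (qual >= min_qual
            and depth >= min_depth
            and mapping_quality >= min_mq
            and qual / depth >= min_qual_by_depth)  # quality normalized by depth
```

For example, passes_thresholds("chr1\t100\t.\tA\tG\t50\t.\tDP=20;MQ=60") is accepted with these placeholder thresholds, while the same record with ALT set to "G,T" is rejected as multiallelic.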
Run Variant Filtering
This tool can be found under Genetic Variation → Variant Filtering. The wizard has two pages, where the user defines the input, the filtering parameters, and the output options (Figure 1 and Figure 2).
Input and parameters
VCF file: this file must come from a Variant Calling analysis.
Proportion ‘Quality / Counts’: ratio between the Phred quality of the SNP and the count of full observations of alternate haplotypes. If there is more than one alternative haplotype, the mean is taken. This filter is more powerful than using the Phred quality or the observation counts on their own.
Raw Read Depth: total number of reads overlapping that position.
Phred Quality: Phred-scaled quality score for the assertion that the alternative allele exists and is not a sequencing error.
Average Mapping Quality: mapping quality measures how uniquely a read is placed on the reference; that is, it reflects the probability that the read is misplaced (higher values mean a lower probability).
Remove Variants with Multiple Alleles: this is recommended if you are going to run a Genome-Wide Association Study.
Check Reads in Both Strands: check to verify that there is at least one read on each strand. Enabling this parameter is recommended because a real variant should be observed on both DNA strands.
Check if Reads are Balanced: check if there are at least two reads 'balanced' to each side of the site (e.g. there is at least one read placed to the right and another one placed to the left of the variant).
Filtered VCF: filename for the filtered VCF file.
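The Freebayes-oriented parameters above can be sketched in Python as follows. This is a hedged illustration only: it assumes the record carries the Freebayes INFO tags AO (alternate observation count), SAF/SAR (alternate observations on the forward/reverse strand) and RPL/RPR (reads placed left/right), that the mean for multiallelic sites is taken over the AO counts, and that the hypothetical arguments min_ratio and min_per_side stand in for whatever thresholds the user selects.

```python
from statistics import mean


def _numbers(info, key):
    """Read a comma-separated INFO value such as '12,3' as a list of floats."""
    return [float(value) for value in info[key].split(",")]


def freebayes_checks(qual, info, min_ratio=10.0, min_per_side=1):
    """Evaluate the three Freebayes-oriented checks for one variant.

    `info` maps INFO tag names to their raw string values.
    The default thresholds are assumptions made for this example.
    """
    # Quality / Counts: Phred quality divided by the mean alternate count.
    mean_count = mean(_numbers(info, "AO"))
    quality_per_count = mean_count > 0 and qual / mean_count >= min_ratio

    # Both strands: every alternate allele seen on forward and reverse reads.
    both_strands = all(
        forward > 0 and reverse > 0
        for forward, reverse in zip(_numbers(info, "SAF"), _numbers(info, "SAR"))
    )

    # Balanced: supporting reads placed to the left and to the right of the site.
    balanced = all(
        left >= min_per_side and right >= min_per_side
        for left, right in zip(_numbers(info, "RPL"), _numbers(info, "RPR"))
    )

    return {
        "quality_per_count": quality_per_count,
        "both_strands": both_strands,
        "balanced": balanced,
    }
```

A variant would be kept only if every check the user enabled returns True; returning the three results separately makes it easy to report which criterion removed a variant.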
Variant Filtering of BCFtools VCF files has the following outputs:
Filtered VCF file.
Report with information about the number of variants before and after filtering, and the parameters used.
Quality Control Charts: these charts are the same as the ones that appear in the Variant Calling step. In addition, the Phred Quality distribution is included, since Phred Quality is also a filtering parameter.
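To make the relationship between the filtered VCF and the report concrete, here is a minimal Python sketch under stated assumptions: the function name filter_vcf, the keep callback and the report keys are all hypothetical, and the tool's real report may contain different fields. Header lines (starting with "#") are copied unchanged, and only data lines are counted.

```python
def filter_vcf(lines, keep):
    """Filter VCF lines with `keep` and count variants before and after.

    `keep` is any function that takes one data line and returns True or False.
    """
    kept_lines = []
    before = 0
    after = 0
    for line in lines:
        if line.startswith("#"):  # header lines are always preserved
            kept_lines.append(line)
            continue
        if not line.strip():  # ignore blank lines
            continue
        before += 1
        if keep(line):
            after += 1
            kept_lines.append(line)
    report = {"before": before, "after": after, "removed": before - after}
    return kept_lines, report


# Usage: keep variants whose QUAL column (sixth field) is at least 20.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=20",
    "chr1\t200\t.\tC\tT\t5\t.\tDP=20",
]
filtered, summary = filter_vcf(example, lambda l: float(l.split("\t")[5]) >= 20)
```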