Genetic Variation Annotation Pipeline

Introduction

Genetic Variation Analysis of Glycine max.

Dataset Description

This dataset contains 301 Canadian soybean lines genotyped by genotyping-by-sequencing (GBS) with ApeKI digestion.

  • Organism: Glycine max.

  • Instrument: Illumina HiSeq 2000.

  • Layout: Paired-end.

Publication

Torkamaneh, Davoud, Jérôme Laroche, and François Belzile. "Genome-wide SNP calling from genotyping by sequencing (GBS) data: a comparison of seven pipelines and two sequencing technologies." PloS one 11.8 (2016): e0161333.

Abstract

In this work we describe a comprehensive comparison of seven GBS bioinformatics pipelines developed to process raw GBS sequence data into SNP genotypes. We compared five pipelines requiring a reference genome (TASSEL-GBS v1 & v2, Stacks, IGST, and Fast-GBS) and two de novo pipelines that do not require a reference genome (UNEAK and Stacks). Using Illumina sequence data from a set of 24 re-sequenced soybean lines, we performed SNP calling with these pipelines and compared the GBS SNP calls with the re-sequencing data to assess their accuracy. The number of SNPs called without a reference genome was lower (13k to 24k) than with a reference genome (25k to 54k SNPs) while accuracy was high (92.3 to 98.7%) for all but one pipeline (TASSEL-GBS v1, 76.1%). Among pipelines offering a high accuracy (>95%), Fast-GBS called the greatest number of polymorphisms (close to 35,000 SNPs + Indels) and yielded the highest accuracy (98.7%).

To speed up the analysis, a subset of 23 samples was used.

Original Data

Bioinformatic Analysis

1 - DNA-Seq Alignment

Application

DNA-Seq Alignment (BWA).

Input

Parameters

  • Single-end Data.

  • Genome Sequences: GCF_000004515.6_Glycine_max_v4.0_genomic.fna

  • Minimum Seed Length: 19

  • Band Width: 100

  • Z-dropoff: 100

  • Trigger Re-seeding: 1.5

  • Seed Occurrence: 20

  • Skip Seeds: 500

  • Drop Chains: 0.5

  • Discard Chains: 0

  • Mate Rescue Rounds: 50

  • Skip Mate Rescue: false

  • Skip Pairing: false

  • Matching Score: 1

  • Mismatch Penalty: 4

  • Gap Open Penalty (DEL): 6

  • Gap Open Penalty (INS): 6

  • Gap Extension Penalty (DEL): 1

  • Gap Extension Penalty (INS): 1

  • 5'-end Clipping Penalty: 5

  • 3'-end Clipping Penalty: 5

  • Unpaired Read Penalty: 17

  • Minimum Score: 30

  • Split Alignments as Primary: false

  • MapQ of Supp. Alignments: false

  • Output All Alignments: false

  • Soft Clipping for Supp.: false

  • Shorter Split Hits as Secondary: false

  • Sort BAM File: By Coordinates

  • Add Read Group Information: false
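
For readers who want to reproduce this step outside the application, the values above match the defaults of `bwa mem`. The following is a minimal sketch that assembles a comparable command line in Python. The mapping of each parameter label to a `bwa mem` flag is an assumption based on the BWA manual, not something stated by the application, and the FASTQ and BAM file names are hypothetical.

```python
import shlex

# Assumed mapping from the parameter labels above to `bwa mem` flags.
BWA_MEM_OPTIONS = {
    "-k": 19,      # Minimum Seed Length
    "-w": 100,     # Band Width
    "-d": 100,     # Z-dropoff
    "-r": 1.5,     # Trigger Re-seeding
    "-y": 20,      # Seed Occurrence
    "-c": 500,     # Skip Seeds
    "-D": 0.5,     # Drop Chains
    "-W": 0,       # Discard Chains
    "-m": 50,      # Mate Rescue Rounds
    "-A": 1,       # Matching Score
    "-B": 4,       # Mismatch Penalty
    "-O": "6,6",   # Gap Open Penalty (DEL,INS)
    "-E": "1,1",   # Gap Extension Penalty (DEL,INS)
    "-L": "5,5",   # 5'-end and 3'-end Clipping Penalty
    "-U": 17,      # Unpaired Read Penalty
    "-T": 30,      # Minimum Score
}


def build_bwa_mem_command(reference, fastq, threads=1):
    """Return the `bwa mem` argument list for one single-end FASTQ file."""
    cmd = ["bwa", "mem", "-t", str(threads)]
    for flag, value in BWA_MEM_OPTIONS.items():
        cmd += [flag, str(value)]
    cmd += [reference, fastq]
    return cmd


def build_sort_command(out_bam):
    """Return a `samtools sort` argument list that reads SAM from stdin.

    samtools sorts by coordinate unless told otherwise, which corresponds
    to "Sort BAM File: By Coordinates".
    """
    return ["samtools", "sort", "-o", out_bam, "-"]


# Hypothetical file names, printed as a shell pipeline for inspection only.
align = build_bwa_mem_command(
    "GCF_000004515.6_Glycine_max_v4.0_genomic.fna", "sample.fastq", threads=4
)
sort = build_sort_command("sample.sorted.bam")
print(shlex.join(align) + " | " + shlex.join(sort))
```

The sketch only builds and prints the command; running it requires BWA and samtools to be installed and the reference to be indexed with `bwa index` beforehand.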

Execution Time

Around 30 minutes.

Output

2 - Variant Calling

Application

Variant Calling by BCFtools.

Input

Parameters

  • BAM files: bam.files folder

  • Reference Genome: GCF_000004515.6_Glycine_max_v4.0_genomic.fna.gz

  • Adjust Mapping Quality: 0

  • Max. Depth: 250

  • Min. Mapping Quality: 0

  • Min. Base Quality: 13

  • Ignore @RG Tags: False

  • BAQ option: No BAQ

  • Extension Error Probability: 20

  • Minimum Fraction of Gapped Reads: 0.002

  • Tandem Quality: 500

  • Skip Indel Calling: False

  • Gapped Reads for Indel: 1

  • Phred Open Sequencing Error: 40

  • Keep Alternate Alleles: True

  • Use Groups: False

  • VCF File: bcftools.vcf.gz
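
As a rough command-line counterpart, the parameters above can be expressed as a `bcftools mpileup` call piped into `bcftools call`. The sketch below builds both argument lists in Python. The correspondence between labels and options is an assumption based on the BCFtools manual, the choice of the multiallelic caller and variants-only output is also an assumption (the parameter list does not say), and the BAM file names are hypothetical.

```python
import shlex


def build_mpileup_command(reference, bam_files):
    """Return a `bcftools mpileup` argument list using the values listed above."""
    return [
        "bcftools", "mpileup",
        "--fasta-ref", reference,
        "--adjust-MQ", "0",        # Adjust Mapping Quality
        "--max-depth", "250",      # Max. Depth
        "--min-MQ", "0",           # Min. Mapping Quality
        "--min-BQ", "13",          # Min. Base Quality
        "--no-BAQ",                # BAQ option: No BAQ
        "--ext-prob", "20",        # Extension Error Probability
        "--gap-frac", "0.002",     # Minimum Fraction of Gapped Reads
        "--tandem-qual", "500",    # Tandem Quality
        "--min-ireads", "1",       # Gapped Reads for Indel
        "--open-prob", "40",       # Phred Open Sequencing Error
        "--output-type", "u",      # uncompressed BCF for piping
        *bam_files,
    ]


def build_call_command(out_vcf):
    """Return a `bcftools call` argument list that reads from stdin."""
    return [
        "bcftools", "call",
        "--multiallelic-caller",   # assumption: caller model is not stated above
        "--variants-only",         # assumption: only variant sites are written
        "--keep-alts",             # Keep Alternate Alleles: True
        "--output-type", "z",      # bgzip-compressed VCF
        "--output", out_vcf,
    ]


# Hypothetical BAM names, printed as a shell pipeline for inspection only.
pileup = build_mpileup_command(
    "GCF_000004515.6_Glycine_max_v4.0_genomic.fna.gz",
    ["bam.files/sample1.bam", "bam.files/sample2.bam"],
)
call = build_call_command("bcftools.vcf.gz")
print(shlex.join(pileup) + " | " + shlex.join(call))
```

Long option names are used throughout because they are easier to match against the parameter labels than the single-letter forms.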

Execution Time

16 minutes.

Output

3 - Variant Filtering

Application

Variant Filtering.

Input

Parameters

  • Proportion ‘Quality/Counts’: 2

  • Raw Read Depth: 2

  • Phred Quality: 20

  • Average Mapping Quality: 59

  • Remove Multiple Alleles: True

  • Missing Genotypes Per Variant: 0

  • Genotype Depth Threshold: 1

  • Genotype Quality Threshold: 0

  • Minimum Allele Frequency Threshold: 0.05

  • Check Reads in Both Strands: False

  • Check if Reads are Balanced: False
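
To make the effect of the main thresholds concrete, the sketch below applies a subset of them to variant records held as plain Python dictionaries. It is an illustration under stated assumptions, not the application's implementation: the record layout is hypothetical, each threshold is treated as an inclusive minimum, and the "Proportion 'Quality/Counts'", missing-genotype, and per-genotype filters are left out because their exact definitions are not given here.

```python
def minor_allele_frequency(genotypes):
    """Compute the minor allele frequency from VCF-style genotype strings.

    genotypes: list such as ["0/0", "0/1", "1|1", "./."]; missing alleles
    (".") are ignored. Assumes a biallelic site.
    """
    alleles = []
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                alleles.append(allele)
    if not alleles:
        return 0.0
    alt_freq = sum(1 for a in alleles if a != "0") / len(alleles)
    return min(alt_freq, 1.0 - alt_freq)


def passes_filters(record, min_qual=20, min_depth=2, min_mapq=59, min_maf=0.05):
    """Return True if a record passes the assumed subset of filters.

    record: hypothetical dict with keys "alt" (list of ALT alleles),
    "qual" (QUAL), "dp" (raw read depth), "mq" (average mapping quality)
    and "genotypes" (list of genotype strings).
    """
    if len(record["alt"]) > 1:       # Remove Multiple Alleles: True
        return False
    if record["qual"] < min_qual:    # Phred Quality: 20
        return False
    if record["dp"] < min_depth:     # Raw Read Depth: 2
        return False
    if record["mq"] < min_mapq:      # Average Mapping Quality: 59
        return False
    # Minimum Allele Frequency Threshold: 0.05
    return minor_allele_frequency(record["genotypes"]) >= min_maf


# Made-up record for illustration only.
example = {
    "alt": ["A"],
    "qual": 50.0,
    "dp": 10,
    "mq": 60,
    "genotypes": ["0/0", "0/1", "1/1", "./."],
}
print(passes_filters(example))
```

In practice the same thresholds would be applied to a VCF file with a dedicated tool; the dictionary form is used here only to keep the logic visible.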

Execution Time

2 seconds.

Output

4 - Variant Annotation

Application

Variant Annotation using VEP.

Input

Execution Time

10 minutes.

Output

  • annotation.box: Table with information on each variant found.

  • report.box: Summary report covering the types of variants, their consequences, and some population genetics statistics.

Workflow