Genetic Variation


Genetic Variation Analysis of Glycine max

Dataset Description

This dataset contains 301 Canadian soybean lines that were subjected to GBS analysis (with ApeKI digestion).

  • Organism: Glycine max.

  • Instrument: Illumina HiSeq 2000.

  • Layout: Paired-end.




In this work, we perform a detailed examination of the imputation of missing genotypes in a catalog of SNP markers obtained via a genotyping by sequencing (GBS) approach among a collection of 301 soybean lines. In addition, we extend the work by using imputation to integrate two highly complementary SNP datasets (from GBS and a SNP array). In all cases, we measured the accuracy of genotype imputation through the resequencing of a subset of the lines and show that the resulting datasets are highly accurate. In a final segment of the work, we show that performing genome-wide association scans with superior marker coverage (resulting from marker imputation) leads to improved QTL detection.

To shorten the running time of this example, only a subset of 23 of the 301 samples was analyzed.

Original Data

Bioinformatic Analysis

1- DNA-Seq Alignment


DNA-Seq Alignment (BWA).



  • Single-end Data.

  • Genome Sequences: GCF_000004515.6_Glycine_max_v4.0_genomic.fna

  • Minimum Seed Length: 19

  • Band Width: 100

  • Z-dropoff: 100

  • Trigger Re-seeding: 1.5

  • Seed Occurrence: 20

  • Skip Seeds: 500

  • Drop Chains: 0.5

  • Discard Chains: 0

  • Mate Rescue Rounds: 50

  • Skip Mate Rescue: false

  • Skip Pairing: false

  • Matching Score: 1

  • Mismatch Penalty: 4

  • Gap Open Penalty (DEL): 6

  • Gap Open Penalty (INS): 6

  • Gap Extension Penalty (DEL): 1

  • Gap Extension Penalty (INS): 1

  • 5'-end Clipping Penalty: 5

  • 3'-end Clipping Penalty: 5

  • Unpaired Read Penalty: 17

  • Minimum Score: 30

  • Split Alignments as Primary: false

  • MapQ of Supp. Alignments: false

  • Output All Alignments: false

  • Soft Clipping for Supp.: false

  • Shorter Split Hits as Secondary: false

  • Sort BAM File: By Coordinates

  • Add Read Group Information: false
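
For readers who want to reproduce a comparable alignment on the command line, the settings listed above can be sketched as a `bwa mem` call followed by a coordinate sort. This is only an illustration: the mapping from each label to a `bwa mem` flag is an assumption (the values coincide with the BWA-MEM defaults), the read and output file names are hypothetical, the `-t` thread count is not part of the original settings, and the reference is assumed to have been indexed beforehand with `bwa index`. All boolean options above are false, so no switch flags are added.

```python
import shlex

# Values copied from the parameter list above.
# ASSUMPTION: the label-to-flag mapping below is inferred, not taken from the source.
BWA_MEM_OPTIONS = {
    "-k": 19,      # Minimum Seed Length
    "-w": 100,     # Band Width
    "-d": 100,     # Z-dropoff
    "-r": 1.5,     # Trigger Re-seeding
    "-y": 20,      # Seed Occurrence
    "-c": 500,     # Skip Seeds
    "-D": 0.5,     # Drop Chains
    "-W": 0,       # Discard Chains
    "-m": 50,      # Mate Rescue Rounds
    "-A": 1,       # Matching Score
    "-B": 4,       # Mismatch Penalty
    "-O": "6,6",   # Gap Open Penalty (DEL,INS)
    "-E": "1,1",   # Gap Extension Penalty (DEL,INS)
    "-L": "5,5",   # 5'-end and 3'-end Clipping Penalty
    "-U": 17,      # Unpaired Read Penalty
    "-T": 30,      # Minimum Score
}


def bwa_mem_command(reference, reads, threads=4):
    """Return the bwa mem argument list for one single-end FASTQ file."""
    cmd = ["bwa", "mem", "-t", str(threads)]  # -t is a hypothetical addition
    for flag, value in BWA_MEM_OPTIONS.items():
        cmd += [flag, str(value)]
    cmd += [reference, reads]
    return cmd


def alignment_pipeline(reference, reads, out_bam):
    """Return a shell pipeline that aligns the reads and sorts the BAM by coordinates."""
    align = shlex.join(bwa_mem_command(reference, reads))
    sort = shlex.join(["samtools", "sort", "-o", out_bam, "-"])
    return f"{align} | {sort}"


if __name__ == "__main__":
    # Hypothetical sample file names, for illustration only.
    print(alignment_pipeline(
        "GCF_000004515.6_Glycine_max_v4.0_genomic.fna",
        "sample_01.fastq.gz",
        "sample_01.sorted.bam",
    ))
```

The script only prints the pipeline; it does not run it, so it can be inspected before being executed once per sample.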

Execution Time

Around 30 minutes.


2- Variant Calling


Variant Calling by BCFtools.



  • BAM files: bam.files folder

  • Reference Genome: GCF_000004515.6_Glycine_max_v4.0_genomic.fna.gz

  • Adjust Mapping Quality: 0

  • Max. Depth: 250

  • Min. Mapping Quality: 0

  • Min. Base Quality: 13

  • Ignore @RG Tags: False

  • BAQ option: No BAQ

  • Extension Error Probability: 20

  • Minimum Fraction of Gapped Reads: 0.002

  • Tandem Quality: 500

  • Skip Indel Calling: False

  • Gapped Reads for Indel: 1

  • Phred Open Sequencing Error: 40

  • Keep Alternate Alleles: True

  • Use Groups: False

  • VCF File: bcftools.vcf.gz
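
As a rough command-line equivalent, the settings above can be sketched as `bcftools mpileup` piped into `bcftools call`. The correspondence between each label and a BCFtools option is an assumption, as are the use of the multiallelic caller, the restriction to variant sites, and the BAM file names inside the `bam.files` folder. Note that `bcftools mpileup` expects a compressed reference to be bgzip-compressed and indexed with `samtools faidx`.

```python
import shlex


def mpileup_command(bam_files, reference):
    """Return the bcftools mpileup argument list (label-to-option mapping is assumed)."""
    return [
        "bcftools", "mpileup",
        "--fasta-ref", reference,
        "--adjust-MQ", "0",       # Adjust Mapping Quality
        "--max-depth", "250",     # Max. Depth
        "--min-MQ", "0",          # Min. Mapping Quality
        "--min-BQ", "13",         # Min. Base Quality
        "--no-BAQ",               # BAQ option: No BAQ
        "--ext-prob", "20",       # Extension Error Probability
        "--gap-frac", "0.002",    # Minimum Fraction of Gapped Reads
        "--tandem-qual", "500",   # Tandem Quality
        "--min-ireads", "1",      # Gapped Reads for Indel
        "--open-prob", "40",      # Phred Open Sequencing Error
        "--output-type", "u",     # uncompressed BCF for piping
        *bam_files,
    ]


def call_command(out_vcf):
    """Return the bcftools call argument list."""
    return [
        "bcftools", "call",
        "--multiallelic-caller",  # ASSUMPTION: calling model is not stated in the source
        "--variants-only",        # ASSUMPTION: only variant sites are written
        "--keep-alts",            # Keep Alternate Alleles: True
        "--output-type", "z",
        "--output", out_vcf,
    ]


def calling_pipeline(bam_files, reference, out_vcf="bcftools.vcf.gz"):
    """Return the full shell pipeline as a single string."""
    return f"{shlex.join(mpileup_command(bam_files, reference))} | {shlex.join(call_command(out_vcf))}"


if __name__ == "__main__":
    # Hypothetical BAM names; the real folder holds one sorted BAM per sample.
    bams = ["bam.files/sample_01.sorted.bam", "bam.files/sample_02.sorted.bam"]
    print(calling_pipeline(bams, "GCF_000004515.6_Glycine_max_v4.0_genomic.fna.gz"))
```

Long option names are used throughout so that each setting stays readable next to the label it is assumed to represent.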

Execution Time

16 minutes.


3- Variant Filtering


Variant Filtering.



  • Proportion ‘Quality/Counts’: 2

  • Raw Read Depth: 2

  • Phred Quality: 20

  • Average Mapping Quality: 59

  • Remove Multiple Alleles: True

  • Check Reads in Both Strands: False

  • Check if Reads are Balanced: False
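
To make the effect of these thresholds concrete, the following minimal sketch applies them to a few hand-written variant records. It rests on several assumptions: 'Quality/Counts' is read here as the variant quality divided by the raw read depth, all thresholds are treated as inclusive minimums, and the `Variant` class with its field names is hypothetical. The actual filtering tool may define these criteria differently.

```python
from dataclasses import dataclass

# Thresholds copied from the list above.
MIN_QUAL_PER_DEPTH = 2     # Proportion 'Quality/Counts' (ASSUMED to mean QUAL / depth)
MIN_DEPTH = 2              # Raw Read Depth
MIN_QUAL = 20              # Phred Quality
MIN_MAPPING_QUALITY = 59   # Average Mapping Quality


@dataclass
class Variant:
    """Hypothetical, simplified view of one VCF record."""
    qual: float
    depth: int
    mapping_quality: float
    alt_alleles: tuple


def passes_filters(v):
    """Return True if the variant satisfies every active filter."""
    if len(v.alt_alleles) != 1:      # Remove Multiple Alleles: True
        return False
    if v.depth < MIN_DEPTH:
        return False
    if v.qual < MIN_QUAL:
        return False
    if v.mapping_quality < MIN_MAPPING_QUALITY:
        return False
    # depth is at least MIN_DEPTH here, so the division is safe
    return v.qual / v.depth >= MIN_QUAL_PER_DEPTH


if __name__ == "__main__":
    examples = [
        Variant(qual=60.0, depth=10, mapping_quality=60.0, alt_alleles=("A",)),
        Variant(qual=60.0, depth=10, mapping_quality=60.0, alt_alleles=("A", "T")),
        Variant(qual=15.0, depth=5, mapping_quality=60.0, alt_alleles=("G",)),
    ]
    kept = [v for v in examples if passes_filters(v)]
    print(f"{len(kept)} of {len(examples)} variants kept")
```

The strand and balance checks are disabled in the settings above, so the sketch does not model them.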

Execution Time

2 seconds.


4- Variant Annotation


Variant Annotation using VEP.


Execution Time

10 minutes.


  • Table with information on each variant found.

  • Summary report covering the types of variants, their consequences, and some population genetics information.