Genome Assembly Quality Assessment
The Genome Assembly Quality Assessment tool assesses the quality of de novo genome assemblies by comparing them against a reference genome. Multiple assemblies can be specified, which allows easy comparison between them.
This tool is useful in several scenarios. For example, it makes it easier to compare de novo assembled genomes obtained with different assembly algorithms. There is usually more than one way to perform an assembly, so it is difficult to decide at first glance which will work best for a given dataset. It may also be interesting to try the same algorithm with different configurations. In all these cases, the Genome Assembly Quality Assessment tool allows assemblies obtained with different strategies to be compared in order to choose the best configuration. Furthermore, once the best assembly strategy for a dataset has been chosen, it may be transferable to data with similar characteristics (same sequencing platform and related species, for example).
This tool is based on QUAST. Please cite QUAST as: Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806.
Run Genome Assembly Quality Assessment
The tool is available in genome analysis > Genome Assembly Quality Assessment.
Assemblies. One or more genome assemblies in FASTA format. They can be the output of OmicsBox’s DNA-Seq de Novo Assembly tool.
Genome Type. Select the type of genome of the species under study:
Prokaryote implies a circular genome, so QUAST takes this feature into account and correctly processes its linear representation.
Eukaryote indicates to QUAST that the genome is not circular.
Eukaryote + Large. In addition to being eukaryotic, the genome is large (typically > 100 Mbp). QUAST therefore uses parameters optimized for the evaluation of large genomes: it modifies the default values of the Minimum Contig Size, Minimum Alignment Length, and Extensive Misassemblies Size. These values can be overridden manually in the corresponding parameters. In addition, this mode tries to identify misassemblies caused by transposable elements and excludes them from the number of misassemblies. See Mikheenko et al., 2018 for more details.
Reference Genome. Reference genome to compare the assemblies with. The assemblies are aligned to the reference using Minimap2 in order to obtain different quality metrics.
Fragmented Genome. The reference genome is fragmented (e.g. a scaffold reference). QUAST will try to detect misassemblies caused by the fragmentation and mark them as fake.
Minimum Contig Size (in bp). Contigs shorter than this size are not taken into account when computing QUAST metrics, except for metrics that explicitly state otherwise.
Minimum Alignment Length (in bp). The assemblies are aligned against the reference genome. Alignments shorter than this value are filtered out and not taken into account when computing QUAST metrics. Note that alignments shorter than 65 bp are filtered out regardless of this threshold.
Min. Identity Alignment (%). Minimum percentage of identity for an alignment to be considered valid. Alignments with an identity (%) lower than this value are filtered out and not taken into account when computing QUAST metrics. Note that alignments with an identity lower than 80% are filtered out regardless of this threshold.
Extensive Misassemblies Size. Gap or overlap size between the left and right flanking sequences of the aligned contig to be considered as a relocation. Gaps or overlaps larger than this value are counted as Extensive Misassemblies, whereas smaller gaps or overlaps are counted as Local Misassemblies.
Local Misassemblies Size. Gap or overlap size between the left and right flanking sequence of the aligned contig to be considered as a Local Misassembly. Shorter inconsistencies are considered as (long) indels. Note that this value must be lower than the Extensive Misassemblies Size.
Max Scaffold Gap Size. Maximum gap size between scaffolds to be detected as such. Longer inconsistencies are considered relocations and are therefore counted as extensive misassemblies. Note that this value must be greater than the Extensive Misassemblies Size.
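The relationship between the three size thresholds above can be illustrated with a minimal Python sketch. It is not QUAST's implementation: the function names are hypothetical, the numeric values are only illustrative, and the handling of sizes exactly equal to a threshold is an assumption, since the descriptions above do not specify it.

```python
def validate_thresholds(local_size, extensive_size, scaffold_gap_size):
    """Check the ordering constraints stated in the parameter descriptions."""
    if not local_size < extensive_size:
        raise ValueError(
            "Local Misassemblies Size must be lower than Extensive Misassemblies Size"
        )
    if not scaffold_gap_size > extensive_size:
        raise ValueError(
            "Max Scaffold Gap Size must be greater than Extensive Misassemblies Size"
        )


def classify_inconsistency(size_bp, local_size, extensive_size):
    """Classify a gap (positive) or overlap (negative) between flanking alignments.

    Assumption: a size exactly equal to a threshold falls into the smaller
    category for the extensive threshold and the larger one for the local threshold.
    """
    size = abs(size_bp)
    if size > extensive_size:
        return "extensive misassembly"
    if size >= local_size:
        return "local misassembly"
    return "indel"


# Illustrative values only (in bp).
validate_thresholds(local_size=200, extensive_size=1000, scaffold_gap_size=10000)
for size in (50, -500, 1500):
    print(size, classify_inconsistency(size, local_size=200, extensive_size=1000))
```

With these illustrative thresholds, a 50 bp gap is treated as an indel, a 500 bp overlap as a local misassembly, and a 1500 bp gap as an extensive misassembly.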
The main result is an OmicsBox table containing a set of summary statistics for each analyzed assembly (Figure 4). Those statistics are:
Total Length. The total length in bp of the assembly.
Largest Contig. The length in bp of the largest contig in the assembly.
N50. It measures the contiguity of the assembly. The greater the number, the more contiguous the assembly is. It is calculated by sorting the contigs from longest to shortest and then summing their lengths until 50% of the total assembly size is reached. The N50 is the length of the contig at which this threshold is reached.
NG50. It’s similar to N50 but it is calculated taking into account the reference genome size instead of the total assembly size. Since the same reference size has been used to calculate the NG50 of all the assemblies, this metric is more comparable between assemblies.
L50. It’s another statistic that measures the contiguity of the assembly. The lower the number, the more contiguous the assembly is. It is calculated with the same procedure as the N50. However, the L50 is the number of contigs needed to reach 50% of the total assembly size. It is also the number of contigs with lengths equal to or greater than N50.
LG50. Similar to L50 but it is calculated taking into account the reference genome size instead of the total assembly size. It is also more suitable to compare assemblies. It’s only shown if QUAST has been able to calculate it.
Num. Misassemblies. The total number of detected misassemblies. This includes relocations, translocations, and inversions. Please visit QUAST’s manual for more information about how QUAST detects each type of misassemblies.
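The contiguity statistics described in the list above (N50, L50, NG50 and LG50) can be sketched in a few lines of Python. This is a minimal illustration of the definitions given here, not QUAST's code; the function name `n_and_l` and the toy contig lengths are hypothetical.

```python
def n_and_l(contig_lengths, target_size, fraction=0.5):
    """Return (Nx, Lx) for the given fraction of target_size.

    target_size is the total assembly size for N50/L50 and the
    reference genome size for NG50/LG50. Returns (None, None) when
    the contigs never add up to the required threshold.
    """
    threshold = target_size * fraction
    running_total = 0
    ordered = sorted(contig_lengths, reverse=True)  # longest first
    for count, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= threshold:
            return length, count
    return None, None


contigs = [100, 80, 60, 40, 20]          # toy assembly, 300 bp in total

n50, l50 = n_and_l(contigs, sum(contigs))  # threshold: 150 bp
ng50, lg50 = n_and_l(contigs, 400)         # hypothetical 400 bp reference

print(n50, l50)    # 80 2
print(ng50, lg50)  # 60 3
```

In this sketch, a reference much larger than the assembly (for example 1000 bp here) means the threshold is never reached and no NG50 or LG50 value can be produced.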
The Summary Report contains all the statistics calculated by QUAST (Figure 5). In addition, the parameters used during the analysis are specified as well at the end of the report. Please visit QUAST’s manual for more details about each statistic.
This plot shows the accumulation of contig sizes (Figure 3). The x-axis orders contigs from largest to smallest, while the y-axis represents the total size of the x largest contigs in the assembly.
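As a rough illustration of how the y values of such a plot are obtained, the following Python sketch computes the running total of contig lengths after sorting them from largest to smallest. The function name and the toy lengths are hypothetical.

```python
from itertools import accumulate


def cumulative_lengths(contig_lengths):
    """Return the total size of the 1, 2, ..., n largest contigs."""
    ordered = sorted(contig_lengths, reverse=True)  # largest first
    return list(accumulate(ordered))


# Position i (1-based) on the x-axis maps to the i-th value below.
print(cumulative_lengths([40, 100, 20, 80]))  # [100, 180, 220, 240]
```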
Shows the distribution of GC content in the contigs (Figure 4). The x value is the GC percentage (0 to 100%). The y value is the number of non-overlapping 100 bp windows in which GC content equals x%. For a single genome, the distribution is typically Gaussian. However, for assemblies with contaminants, the GC distribution appears to be a superposition of Gaussian distributions, giving a plot with multiple peaks.
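The windowed counting described above can be sketched as follows. This is only an illustration under stated assumptions: trailing fragments shorter than one window are dropped, ambiguous bases are not treated specially, and GC percentages are rounded to the nearest integer. None of these details is confirmed by the text above, and the function name is hypothetical.

```python
from collections import Counter


def gc_window_histogram(contigs, window=100):
    """Count non-overlapping windows by their rounded GC percentage."""
    counts = Counter()
    for sequence in contigs:
        sequence = sequence.upper()
        # Step through the contig in full, non-overlapping windows.
        for start in range(0, len(sequence) - window + 1, window):
            chunk = sequence[start:start + window]
            gc_bases = chunk.count("G") + chunk.count("C")
            counts[round(100 * gc_bases / window)] += 1
    return counts


toy_contigs = [
    "GC" * 50 + "AT" * 50,   # one 100% GC window and one 0% GC window
    "ACGT" * 25 + "GGG",     # one 50% GC window; the 3 bp tail is dropped
]
histogram = gc_window_histogram(toy_contigs)
print(sorted(histogram.items()))  # [(0, 1), (50, 1), (100, 1)]
```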
The genome fraction (%) is a measure of the number of bases in the reference genome that have been aligned to at least one contig in the assembly (Figure 5). This percentage is calculated by dividing the number of aligned bases by the total number of bases in the reference genome. It's important to note that contigs from repetitive regions may align to multiple locations in the genome, potentially leading to an overestimation of the genome fraction.
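The calculation can be sketched in Python as the length of the union of aligned intervals divided by the reference length. This is a simplified illustration, assuming a single reference sequence and half-open `(start, end)` alignment coordinates. The function name and the coordinates are hypothetical.

```python
def genome_fraction(alignments, reference_length):
    """Percentage of reference bases covered by at least one alignment.

    alignments: iterable of (start, end) half-open intervals on the reference.
    Overlapping intervals are merged so each base is counted only once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(alignments):
        if current_end is None or start > current_end:
            # Close the previous merged interval and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current one.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return 100.0 * covered / reference_length


toy_alignments = [(0, 400), (300, 600), (800, 900)]
print(genome_fraction(toy_alignments, reference_length=1000))  # 70.0
```

Merging the intervals first matters: simply adding the three alignment lengths would give 800 aligned bases instead of the 700 distinct reference bases that are actually covered.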
Plots the values of Nx as a function of x, with x ranging from 0 to 100% (Figure 6). To calculate N50, for example, contigs are first sorted from longest to shortest. Then, the contig lengths are summed until 50% of the total assembly length is reached. N50 is the length of the contig at which that threshold is reached.
Plots the values of NGx as a function of x, with x ranging from 0 to 100% (Figure 7). NGx is calculated in the same way as Nx, but using the reference genome size instead of the assembly size. This makes NGx more comparable between assemblies. In addition, this statistic is more robust against changes in the assembly (e.g. filtering of the shortest contigs).
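The robustness of NGx to filtering can be illustrated with a small worked example in Python. The contig lengths, the 40 bp cut-off and the 300 bp reference size are toy values chosen for the illustration, and `nx` is a hypothetical helper, not a QUAST function.

```python
def nx(contig_lengths, x, target_size):
    """Length of the contig at which x% of target_size is reached (or None)."""
    threshold = target_size * x / 100
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    return None


full = [100, 50, 40] + [10] * 11            # 300 bp in total
filtered = [n for n in full if n >= 40]      # short contigs removed: 190 bp
reference_size = 300                         # hypothetical reference size

# N50 depends on the assembly size, so filtering changes it.
print(nx(full, 50, sum(full)), nx(filtered, 50, sum(filtered)))   # 50 100

# NG50 uses the fixed reference size, so it stays the same here.
print(nx(full, 50, reference_size), nx(filtered, 50, reference_size))  # 50 50
```

Removing the short contigs doubles N50 in this toy case even though no contig became longer, while NG50 is unchanged.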
Number of Misassemblies
Shows the total number of misassemblies found in each assembly (Figure 8).
The x value (Feature space) is the total maximum number of misassemblies allowed in the contigs (Figure 9). The y value (Genome coverage %) is the total number of aligned bases in the contigs, divided by the reference length. The response (quality) of the assembler output is analyzed as a function of the maximum number of possible misassemblies allowed in the contigs.
Plots the values of NAx as a function of x, with x ranging from 0 to 100% (Figure 8). NAx is calculated in the same way as Nx, but taking into account only the part of the assembly that is aligned to the reference genome.
Plots the values of NGAx as a function of x, with x ranging from 0 to 100% (Figure 9). NGAx is calculated in the same way as NGx, but taking into account only the part of the assembly that is aligned to the reference genome.