## Charts and Statistics

# Introduction

A statistical chart helps to reveal the shape or distribution of a sample. It is an effective way of presenting data because it shows where values cluster and where there are only a few data values.

Chart types commonly used to summarise and organise data include the dot plot, the bar chart, the histogram, the frequency polygon (a type of broken-line graph), the pie chart, and the box plot.

# Run Statistics

This functionality can be found under **functional analysis → Charts and Statistics**.

For each step of the analysis, it is possible to generate different types of statistical charts.

- **Project:** Charts to summarise the data related to the sequence project.
- **Blast:** Statistic charts to see the distribution of the Blast results.
- **InterProScan:** Statistic charts to see the distribution of the InterProScan results.
- **Mapping:** Statistic charts to see the distribution of the Mapping results.
- **Annotation:** Statistic charts to see the distribution of the Annotation results.
- **Enzyme:** Statistic charts to see the distribution of the Enzyme results.

**Figure 1:** Statistics Options

# Project Statistics

It is possible to generate different statistical charts related to the sequence project and to follow the progress of the analysis (figure 3, figure 4 and figure 5).

- **Data distribution bar chart:** Bar chart showing the number of sequences with Blast (with or without hits), GO Mapping and GO Annotation results.
- **Data distribution pie chart:** Pie chart showing the number of sequences with Blast (with or without hits), GO Mapping and GO Annotation results.
- **Analysis Progress:** Bar chart showing the cumulative number of sequences with Blast hits, InterProScan, GO Mapping and GO Annotation results.
- **Sequence Length Distribution:** Area chart showing the number of sequences for each sequence length.

**Figure 2:** Project Statistics

**Figure 3:** Data Distribution Bar Chart

**Figure 4:** Analysis Progress

**Figure 5:** Sequence Length
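
The Sequence Length Distribution described above boils down to counting how many sequences share each length. The following is a minimal sketch of that counting step, not the tool's own implementation; the `project` dictionary and the function name `length_distribution` are hypothetical and serve only as illustration.

```python
from collections import Counter


def length_distribution(sequences):
    """Return {sequence_length: number_of_sequences}, sorted by length."""
    counts = Counter(len(seq) for seq in sequences.values())
    return dict(sorted(counts.items()))


# Hypothetical toy project: sequence name -> nucleotide sequence.
project = {
    "seq1": "ATGCGT",
    "seq2": "ATGC",
    "seq3": "GGCATA",
    "seq4": "ATGCGTAAT",
}

# Print a crude text version of the chart: one '#' per sequence.
for length, n in length_distribution(project).items():
    print(f"{length:>4} {'#' * n} ({n})")
```

For real projects with many distinct lengths, the lengths would normally be grouped into bins before plotting.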

# Blast Statistics

Different BLAST statistics charts (figure 7, figure 8 and figure 9) can be generated for a global visualization of the results. These charts give a general view of the similarity of the query set to the selected databases and can be used to choose cut-off levels for the E-value, similarity and annotation threshold parameters at the annotation step.

- **E-Value Distribution:** This chart plots the distribution of E-values for all selected BLAST hits. It is useful for evaluating the success of the alignment against a given sequence database and helps to adjust the E-value cut-off in the annotation step.
- **Similarity Distribution:** This chart displays the distribution of all calculated sequence similarities (as percentages). It shows the overall performance of the alignments and helps to adjust the annotation score in the annotation step.
- **Species Distribution:** This chart lists the species to which most sequences were aligned during the BLAST step.
- **Top-Hit Species Distribution:** Bar chart showing the species distribution of all top BLAST hits.
- **Hit Distribution:** This chart shows the distribution of the number of hits per blasted sequence in a data set.
- **Hsp Distribution:** This bar chart shows the distribution of the number of HSPs (high-scoring segment pairs) per hit.
- **Hsp/Seq Distribution:** This chart shows the distribution of the percentage coverage between the HSPs and their corresponding sequences.
- **Hsp/Hit Distribution:** Same as above, but for hits instead of sequences.
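
To make the Hsp/Seq and Hsp/Hit charts more concrete, the sketch below computes a coverage percentage and groups the values into bins. It assumes that coverage is the HSP length divided by the length of the query sequence (or of the hit) and that values are grouped into 10 % bins; both are assumptions made for illustration, and the function names are hypothetical.

```python
def hsp_coverage_percent(hsp_length, reference_length):
    """Percentage of the reference (query sequence or hit) covered by the HSP.

    Assumption: coverage = HSP length / reference length, capped at 100 %.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return min(100.0, 100.0 * hsp_length / reference_length)


def bin_percentages(values, bin_width=10):
    """Count percentages per bin; bin_width is assumed to divide 100 evenly.

    A value of exactly 100 is counted in the last bin.
    """
    n_bins = 100 // bin_width
    bins = [0] * n_bins
    for value in values:
        index = min(int(value // bin_width), n_bins - 1)
        bins[index] += 1
    return bins


# Hypothetical (hsp_length, sequence_length) pairs.
pairs = [(50, 200), (180, 200), (200, 200)]
coverages = [hsp_coverage_percent(h, s) for h, s in pairs]
print(bin_percentages(coverages))
```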

**Figure 6:** Blast Statistics

**Figure 7:** Similarity Distribution

**Figure 8:** Top-Hit Species Distribution

**Figure 9:** E-value Distribution
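
As an illustration of how an E-value distribution can guide the choice of a cut-off, the sketch below groups E-values by their order of magnitude and reports the fraction of hits that would pass a candidate cut-off. This is a sketch under stated assumptions, not the tool's implementation: the grouping by rounded `-log10(E-value)`, the cap used for E-values of 0, and all function names are hypothetical.

```python
import math
from collections import Counter


def evalue_exponent(evalue, cap=180):
    """Return -log10(evalue) rounded to the nearest integer.

    An E-value of 0 has no logarithm, so it is mapped to `cap` (assumption).
    """
    if evalue <= 0:
        return cap
    return min(cap, round(-math.log10(evalue)))


def exponent_distribution(evalues):
    """Return {exponent: number_of_hits}, sorted by exponent."""
    return dict(sorted(Counter(evalue_exponent(e) for e in evalues).items()))


def fraction_passing(evalues, cutoff):
    """Fraction of hits whose E-value is at or below the cut-off."""
    return sum(e <= cutoff for e in evalues) / len(evalues)


# Hypothetical E-values of the selected BLAST hits.
evalues = [1e-50, 1e-20, 2e-6, 1e-3, 0.5]

print(exponent_distribution(evalues))
for cutoff in (1e-3, 1e-6):
    print(cutoff, fraction_passing(evalues, cutoff))
```

Comparing the fraction of retained hits for a few candidate cut-offs gives a rough idea of how strict each setting would be for the data set at hand.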

# InterProScan Statistics

The InterProScan statistics show how many sequences have or do not have InterProScan (IPS) results and how many sequences have GO terms derived from InterProScan.

- **InterProScan Results:** This chart reflects the effect of adding the GO terms retrieved through the InterProScan results (figure 11).
- **InterProScan Families Distribution:** Bar chart representing the number of sequences that belong to a particular IPS family.
- **InterProScan Domains Distribution:** Bar chart showing the number of sequences that belong to a particular IPS domain.
- **InterProScan Repeats Distribution:** Bar chart reflecting the number of sequences that belong to a particular IPS repeat.
- **InterProScan Sites Distribution:** Bar chart representing the number of sequences that belong to a particular IPS site.
- **InterProScan IDs Distribution:** Bar chart showing the number of sequences that have been annotated with a particular InterProScan ID.
- **InterProScan IDs by Database:** Pie chart reflecting the number of sequences per InterProScan ID for a particular InterProScan database. In figure 10 the Pfam database is selected.

**Figure 10:** InterProScan Statistics Configuration Window

**Figure 11:** InterProScan Statistics

# Mapping Statistics

Three different charts are available to summarise the mapping step:

- **GO Mapping Distribution:** This shows the distribution of the number of Gene Ontology candidate terms assigned to each sequence during the GO Mapping step.
- **EC Distribution for Sequences:** This chart shows the distribution of GO evidence codes for the functional terms obtained during the mapping step. It gives an idea about how many annotations derive from automatic/computational annotations or manually curated ones.
- **EC Distribution for Blast Hits:** Evidence Codes associated with the obtained GO pool.

**Figure 12:** Mapping Statistics

**Figure 13:** GO Mapping Distribution
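
The evidence code (EC) distribution can be thought of as a simple tally of the GO evidence codes attached to the mapped terms. The sketch below shows such a tally and the share of automatically assigned annotations. It assumes that only the code IEA (Inferred from Electronic Annotation) counts as automatic and that every other code was reviewed by a curator; the function names and the sample data are hypothetical.

```python
from collections import Counter

# Assumption: IEA is the only evidence code treated as fully automatic.
ELECTRONIC_CODES = {"IEA"}


def evidence_code_distribution(codes):
    """Return a Counter mapping each evidence code to its frequency."""
    return Counter(codes)


def electronic_fraction(codes):
    """Fraction of annotations carrying an automatic evidence code."""
    codes = list(codes)
    if not codes:
        return 0.0
    return sum(code in ELECTRONIC_CODES for code in codes) / len(codes)


# Hypothetical evidence codes collected during the mapping step.
codes = ["IEA", "IEA", "IDA", "ISS", "IEA", "TAS"]

print(evidence_code_distribution(codes).most_common())
print(electronic_fraction(codes))
```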

# Annotation Statistics

The number of sequences annotated by the annotation rule can be summarised with the following statistics:

- **Annotation Distribution:** This chart shows the number of GO terms assigned per sequence.
- **GO Annotation Level Distribution:** A bar chart that shows all GO terms for all three categories for a given GO level, taking into account the GO hierarchy (parent-child relationships).
- **Annotation Score Distribution:** A chart that shows the number of sequences per annotation score.
- **Annotated Seqs/Seq-Length:** This shows the relation between the number of annotated sequences and sequence length.
- **Number of GOs/Seq-Length:** This shows the relation between sequence length and the number of GOs.
- **GO Distribution by Level:** A bar chart that shows all the GO terms for all three categories for GO level 2, taking into account the GO hierarchy.
- **Direct GO Count:**
  - **Molecular Function:** A chart for the Molecular Function GO category, which shows the most frequent GO terms within a data set without taking into account the GO hierarchy.
  - **Biological Process:** A chart for the Biological Process GO category.
  - **Cellular Component:** A chart for the Cellular Component GO category.

**Figure 14:** Annotation Statistics

An overview of the extent and intensity of the annotation can be obtained from the Annotation Distribution chart (figure 15), which shows how many sequences are annotated with a given number of GO terms.

**Figure 15:** Annotation Distribution

**Figure 16:** Molecular Function Direct GO Count
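
The Annotation Distribution and the Direct GO Count are both plain counts over the annotation table. The sketch below illustrates the two ideas: the first function counts sequences per number of assigned GO terms, the second ranks GO terms by the number of sequences annotated with them, ignoring the GO hierarchy. It is a minimal sketch, not the tool's implementation; the `annotations` dictionary and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical annotation table: sequence name -> assigned GO term IDs.
annotations = {
    "seq1": ["GO:0005524", "GO:0016301"],
    "seq2": ["GO:0005524"],
    "seq3": [],
    "seq4": ["GO:0005524", "GO:0003677"],
}


def annotation_distribution(annotations):
    """Return {number_of_GO_terms: number_of_sequences}, sorted by key."""
    counts = Counter(len(set(terms)) for terms in annotations.values())
    return dict(sorted(counts.items()))


def direct_go_count(annotations, top=10):
    """Return the `top` most frequent GO terms as (term, n_sequences) pairs.

    Terms are counted as assigned; parent terms are not added.
    """
    counts = Counter(
        term for terms in annotations.values() for term in set(terms)
    )
    return counts.most_common(top)


print(annotation_distribution(annotations))
print(direct_go_count(annotations, top=3))
```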

# Enzyme Code Statistics

To see the main enzyme classes in the data set, it is possible to generate an Enzyme Code distribution chart.

- **Main Enzyme Classes:** This shows the distribution of the 7 main enzyme classes over all sequences.
- **Second Level Classes:** It is possible to create a distribution chart of the enzyme subclasses.

**Figure 17:** Enzyme Code Statistics

**Figure 18:** Enzyme Code Distribution
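
The main enzyme class of an annotation is given by the first number of its Enzyme Code. The sketch below derives the class from that first number and counts sequences per main class. It assumes Enzyme Codes written as `EC:2.7.11.1` or `2.7.11.1` and counts a sequence once per class even if it carries several codes of that class; the function names and the sample data are hypothetical.

```python
from collections import Counter

# The seven top-level classes of the Enzyme Commission nomenclature.
MAIN_ENZYME_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def main_class(ec_number):
    """Map an Enzyme Code such as 'EC:2.7.11.1' to its main class name."""
    first = ec_number.upper().replace("EC", "").lstrip(": ").split(".")[0]
    return MAIN_ENZYME_CLASSES[int(first)]


def main_class_distribution(ec_by_sequence):
    """Count the sequences annotated with each main enzyme class."""
    counts = Counter()
    for ec_numbers in ec_by_sequence.values():
        # A set ensures each sequence is counted once per class.
        counts.update({main_class(ec) for ec in ec_numbers})
    return counts


# Hypothetical annotation result: sequence name -> Enzyme Codes.
ec_by_sequence = {
    "seq1": ["EC:2.7.11.1", "EC:2.7.1.1"],
    "seq2": ["EC:3.4.21.4"],
    "seq3": ["EC:2.7.11.1", "EC:1.1.1.1"],
}

print(main_class_distribution(ec_by_sequence).most_common())
```

Counting on the first two numbers instead of the first one would give the corresponding subclass distribution.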