BLAST

OmicsBox uses the Basic Local Alignment Search Tool (BLAST) to find sequences similar to your query set. Please refer to http://www.ncbi.nlm.nih.gov/BLAST for details on the BLAST function. Figure 2 shows the BLAST Configuration Dialog Window that controls the BLAST step.

BLAST in OmicsBox can be run in three different ways:

  1. CloudBlast. This is a cloud-based OmicsBox Community Resource for massive sequence alignment tasks. It allows you to execute standard NCBI Blast+ searches directly from within OmicsBox in a dedicated computing cloud. CloudBlast is a high-performance, secure and cost-optimized solution for your analysis. It is a BLAST service that is completely independent of the NCBI servers and provides fast and reliable sequence alignments. See the Run BLAST using CloudBLAST section for more information.
  2. QBlast@NCBI. NCBI offers a public service that allows searching molecular sequence databases with the BLAST algorithm. The main advantages of this service are its versatility and that no database maintenance is required. Selecting this option in OmicsBox therefore requires no additional installation.
  3. Local BLAST against your own database. It is possible to use the BLAST+ executables to query a local database of your own. See https://www.blast2go.com/make-own-database-and-blast and the Make Blast Database section for how to prepare your own FASTA database and blast against it locally.

QBlast at NCBI is the only feature available for OmicsBox Basic users.

The next figure shows the menu used to select between NCBI BLAST, local BLAST, CloudBlast, AWS Blast, or blasting against your own database.

Figure 1: Select between NCBI, Local or CloudBlast

Run BLAST at the NCBI

Here you can specify the following parameters, which are divided into three pages: Blast Configuration (figure 2), Advanced (figure 3) and Save Results (figure 4):

Blast Configuration Page

  • Your e-mail address in case you are using the NCBI BLAST web service.
  • BLAST program: The algorithm you want to use:
    • blastp - Compares an amino acid query sequence against a protein sequence database.
    • blastn (-task blastn) - Compares a nucleotide query sequence against a nucleotide sequence database.
    • blastx - Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. Used to find potential translation products of an unknown nucleotide sequence.
    • tblastn - Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
    • blastx-fast
    • blastp-fast
    • blastp-short
    • blastn (-task megablast)
    • blastn (-task dc-megablast)
    • blastn-short
    • tblastn-fast
  • BLAST DB: The name of the database to search in (e.g. nr, swissprot, pdb). For a list of possible DBs at NCBI, see http://data.biobam.com/ncbi_blast_dbs_protein.pdf
  • Taxonomy Filter: Search for Blast results only in the selected taxonomy.
  • BLAST expect value: The statistical significance threshold for reporting matches against database sequences. If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold shows less stringent matches.
  • Number of BLAST hits: The number of alignments you want to retrieve (0-100).
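
In OmicsBox these options are set through the dialog. For readers who want to see how the same choices look on a standalone NCBI BLAST+ command line, the following is a minimal sketch. The function name build_blast_command, the dictionary PROGRAMS and the default values are illustrative assumptions and are not part of OmicsBox; only the BLAST+ option names (-task, -query, -db, -evalue, -max_target_seqs, -outfmt, -remote) are standard.

```python
# Sketch only: maps the program choices listed above to a BLAST+ command line.
# Names and defaults here are illustrative assumptions, not OmicsBox internals.

# Program choice -> (BLAST+ executable, value for -task or None).
PROGRAMS = {
    "blastp": ("blastp", None),
    "blastn": ("blastn", "blastn"),
    "blastx": ("blastx", None),
    "tblastn": ("tblastn", None),
    "blastx-fast": ("blastx", "blastx-fast"),
    "blastp-fast": ("blastp", "blastp-fast"),
    "blastp-short": ("blastp", "blastp-short"),
    "megablast": ("blastn", "megablast"),
    "dc-megablast": ("blastn", "dc-megablast"),
    "blastn-short": ("blastn", "blastn-short"),
    "tblastn-fast": ("tblastn", "tblastn-fast"),
}


def build_blast_command(program, query, db, evalue=1e-3, max_hits=20, remote=False):
    """Return the BLAST+ command as an argument list (nothing is executed)."""
    if program not in PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program}")
    if max_hits < 1:
        # BLAST+ expects a positive integer for -max_target_seqs.
        raise ValueError("max_hits must be at least 1")
    executable, task = PROGRAMS[program]
    command = [executable]
    if task is not None:
        command += ["-task", task]
    command += [
        "-query", query,
        "-db", db,
        "-evalue", str(evalue),              # expect value threshold
        "-max_target_seqs", str(max_hits),   # number of hits to keep
        "-outfmt", "5",                      # 5 = BLAST XML output
    ]
    if remote:
        command.append("-remote")            # search at NCBI instead of locally
    return command
```

The returned list could be passed to subprocess.run on a machine where BLAST+ is installed, for example build_blast_command("megablast", "queries.fasta", "nt", evalue=1e-5, remote=True).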

BLAST Description Annotator: The BDA finds the best possible description for a new sequence based on a given BLAST result.

Figure 2: Blast Configuration Page

Figure 3: Advanced Page

Figure 4: Save Results Page

Advanced Page

  • Blast Parameters:
    • Word size: One of the important parameters governing the sensitivity of BLAST searches is the length of the initial words. The word size is adjustable in blastn and can be reduced from the default value to increase sensitivity. This word size can also be increased to increase the search speed and limit the number of database hits.
    • Low complexity filter: The BLAST programs employ the SEG algorithm to filter low complexity regions from proteins before executing a database search. The default is ON.
  • Filter Options:
    • HSP length cutoff: A cutoff value for the minimal length of the first hsp of a BLAST hit, used to exclude hits with only small local alignments from the BLAST result. The given length corresponds to amino acids or nucleotides, depending on the type of BLAST performed.
    • HSP-Hit Coverage
    • Filter by description: Filter out BLAST hits by their description.
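
The two documented filters, the HSP length cutoff and the description filter, can be pictured with a short sketch. The hit structure used here (a dictionary with "description" and "hsps" keys) and the function name filter_hits are hypothetical and chosen only for illustration; OmicsBox applies these filters internally.

```python
def filter_hits(hits, min_hsp_length=0, exclude_terms=()):
    """Keep hits whose first hsp is long enough and whose description
    contains none of the excluded terms (case-insensitive)."""
    kept = []
    for hit in hits:
        hsps = hit["hsps"]
        # The cutoff is applied to the first hsp of the hit, as described above.
        if not hsps or hsps[0]["align_len"] < min_hsp_length:
            continue
        description = hit["description"].lower()
        if any(term.lower() in description for term in exclude_terms):
            continue
        kept.append(hit)
    return kept


# Made-up example hits, for illustration only.
example_hits = [
    {"description": "kinase domain protein", "hsps": [{"align_len": 150}]},
    {"description": "hypothetical protein", "hsps": [{"align_len": 200}]},
    {"description": "transporter", "hsps": [{"align_len": 20}]},
]
filtered = filter_hits(example_hits, min_hsp_length=33, exclude_terms=["hypothetical"])
```

With these made-up hits, only the first one passes: the second is excluded by its description and the third by its short first hsp.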

Save Results Page

The results of the BLAST queries can also be directly saved to a file in different formats by selecting the corresponding checkboxes at the BLAST Save Results Page. If the chosen file already exists, upcoming results will be appended. Choose a format type to additionally save your BLAST results.

  • XML2: This is a new BLAST result format provided by NCBI and can also be loaded into OmicsBox.
  • XML: It is recommended to save your BLAST results as XML as this format is supported by the OmicsBox Load BLAST Results function.
  • TXT: Saves the BLAST results of each sequence as a text file.
  • HTML: For each sequence, a file in HTML format will be saved.

Run BLAST using CloudBLAST

CloudBlast offers a highly optimized, self-sustained HPC solution to address a very specific need of the OmicsBox community.
CloudBlast is a BLAST service that is completely independent of the NCBI servers and provides fast and reliable sequence alignments. It consists of a high-performance computing cluster dedicated exclusively to BLAST searches.
All OmicsBox subscriptions include "ComputationUnits" for this resource, which allow you to run BLAST searches for tens of thousands of sequences within a few days against a large collection of protein databases. Each sequence alignment performed in the system consumes a certain amount of computation time, depending on the sequence length, the BLAST algorithm (blastx, blastp) and the parameters used. The smaller the database you blast against, the more sequences you can analyse with 6,000,000 ComputationUnits (see Cloud Usage in the View Menu section to learn how to monitor the ComputationUnits). For example, if you blast against the vertebrate NR subset, you would be able to blast approx. one million (1,000,000) sequences. If you blast against the NR database, the largest protein database available, you should be able to blast approx. 80,000 sequences (with an average length of 800 nt per sequence).
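
The figures above imply a rough average cost per sequence, which can be worked out as follows. This is only back-of-the-envelope arithmetic based on the approximate numbers quoted in this section; the real consumption depends on sequence length, algorithm and parameters, and the function names are illustrative.

```python
TOTAL_UNITS = 6_000_000  # ComputationUnits figure quoted above


def units_per_sequence(total_units, n_sequences):
    """Average ComputationUnits consumed per sequence."""
    return total_units / n_sequences


def estimate_sequences(available_units, cost_per_sequence):
    """Roughly how many sequences fit into the available units."""
    return int(available_units // cost_per_sequence)


# Approximate sequence counts quoted in the text.
nr_cost = units_per_sequence(TOTAL_UNITS, 80_000)             # full NR
vertebrate_cost = units_per_sequence(TOTAL_UNITS, 1_000_000)  # vertebrate NR subset

print(f"NR: about {nr_cost:.0f} units per sequence")
print(f"Vertebrate subset: about {vertebrate_cost:.0f} units per sequence")
print(f"Half the units against NR: about {estimate_sequences(3_000_000, nr_cost)} sequences")
```

On these averages, a search against NR costs roughly 75 units per sequence and a search against the vertebrate subset roughly 6.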

For the advanced and save-results parameters, see the Advanced Page and Save Results Page sections for detailed information.

Figure 5: CloudBlast Configuration Page

Run BLAST Locally

With Local BLAST you can blast the sequences against your own database. OmicsBox allows creating a BLAST database from a FASTA file with the option "Make Blast Database" (see Make Blast Database section). Download and format your database and choose the corresponding folder (see figure 6). Databases have to be formatted for NCBI Blast+.

The main parameters on the Local BLAST Configuration page are very similar to those for NCBI and CloudBlast. The main difference is the database selection, where OmicsBox expects a .pal file or a .p* file. Under "Run Parameters" on the Advanced Page, it is possible to select the number of threads to be used. This field does not need to be set, as OmicsBox detects the number of threads available on the computer. The Advanced Page section provides a detailed description of each parameter. As in CloudBlast, the BLAST results will be saved in XML file format.
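
To check a database folder before selecting it in the dialog, a small script can list the files that look like a protein BLAST database. This is a sketch under stated assumptions: it prefers alias files (.pal) and otherwise lists the index files (.pin) that makeblastdb writes for protein databases, and it uses os.cpu_count() as one common way to find the number of available threads. How OmicsBox itself performs these checks is not described here, and the function names are hypothetical.

```python
import os
from pathlib import Path


def candidate_db_files(folder):
    """List protein database files in a folder that could be selected.

    Assumption: alias files (.pal) are preferred when present; otherwise
    the index files (.pin) of single-volume databases are listed.
    """
    folder = Path(folder)
    aliases = sorted(folder.glob("*.pal"))
    return aliases if aliases else sorted(folder.glob("*.pin"))


def default_thread_count():
    """Number of logical CPUs, falling back to 1 if it cannot be determined."""
    return os.cpu_count() or 1
```

An empty result from candidate_db_files suggests that the folder does not contain a formatted protein database.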

Figure 6: Local Blast Configuration Page

Show BLAST Results

As the BLAST search progresses, sequences with successful BLAST results change their color in the Main Sequence Table from white to orange and the BLAST-related columns are filled in. If no results could be retrieved for a given sequence, its row turns dark red.
Right-clicking on a sequence displays the Single Sequence Menu, from which the BLAST results of each sequence can be viewed individually. Show BLAST Results (figure 7) will generate a tab in the Results containing information on the results of the similarity search of the selected sequence. For each of the obtained hits, the following information is given:

  • Hit id and definition
  • Gene name assigned to the hit by its accession
  • e-value of the alignment
  • Alignment length of the longest hsp
  • Positive matches of the longest hsp
  • Hsp similarity of the hit
  • Number of hsps
  • Mapped GO-Terms with their evidence codes
  • UniProt codes of the hit sequences


Figure 7: Show BLAST Results


Figure 8: Individual BLAST Result Table View


Figure 9: Individual BLAST Result in Alignment View

Statistics

Different BLAST statistics charts (figure 11, figure 12 and figure 13) can be generated for a global visualization of the results. These charts provide a general view of the similarity of the query set with the selected databases and can be used to choose cut-off levels for the e-value, similarity and annotation threshold parameters at the annotation step.
Additionally, a BLAST hit species distribution chart is available. To generate the BLAST statistics charts, click the arrow next to the "Chart" icon and select the statistics to be displayed (see figure 10).

Figure 10: Blast Statistics

  • E-Value Distribution: This chart plots the distribution of E-values for all selected BLAST hits. It is useful for evaluating the success of the alignment against a given sequence database and helps to adjust the E-value cutoff in the annotation step.
  • Similarity Distribution: This chart displays the distribution of all calculated sequence similarities (percentages). It shows the overall performance of the alignments and helps to adjust the annotation score in the annotation step.
  • Species Distribution: This chart gives a listing of the different species to which most sequences were aligned during the BLAST step.
  • Top-Hit Species Distribution: Bar chart showing the species distribution of all Top-Blast hits.
  • Hit Distribution: This chart shows a distribution of the number of hits for the blasted sequences in a data-set.
  • Hsp Distribution: This bar chart shows the distribution of hsps per hit.
  • Hsp/Seq Distribution: This chart shows the distribution of percentages representing the coverage between the hsps and their corresponding sequences.
  • Hsp/Hit Distribution: Same as above but for hits instead of sequences.
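
To make the charts above more concrete, the following sketch computes three of the underlying distributions from a small in-memory result set. The data layout (a dictionary mapping each query name to its hits, best hit first, with "species" and "evalue" keys) and the function names are assumptions made for this illustration; the binning of E-values by order of magnitude is likewise just one possible choice and may differ from what OmicsBox plots.

```python
import math
from collections import Counter


def top_hit_species_distribution(results):
    """Count the species of the best hit of every sequence that has hits."""
    return Counter(hits[0]["species"] for hits in results.values() if hits)


def hit_distribution(results):
    """Count how many sequences have 0, 1, 2, ... hits."""
    return Counter(len(hits) for hits in results.values())


def evalue_distribution(results):
    """Count hits per order of magnitude of the E-value (-log10, rounded down)."""
    bins = Counter()
    for hits in results.values():
        for hit in hits:
            evalue = hit["evalue"]
            # An E-value of 0.0 has no logarithm; it gets its own bin (infinity).
            key = math.inf if evalue == 0 else math.floor(-math.log10(evalue))
            bins[key] += 1
    return bins


# Made-up results, for illustration only.
example_results = {
    "seq1": [
        {"species": "Homo sapiens", "evalue": 3e-5},
        {"species": "Mus musculus", "evalue": 2e-20},
    ],
    "seq2": [{"species": "Homo sapiens", "evalue": 0.0}],
    "seq3": [],
}
```

Sequences without hits, such as seq3 here, appear only in the hit distribution and are ignored by the other two functions.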

Figure 11: Similarity Distribution

Figure 12: Species Distribution

Figure 13: E-Value Distribution

Load BLAST results

If a BLAST result is already available in XML format, it can be loaded directly into OmicsBox via Load > Load Blast Results in the File menu. Here you can choose to import the BLAST results as an XML file or in the new XML2/JSON formats. The new formats can be loaded as a Zip file.
In the Load Blast Results dialog, either a whole directory containing a collection of BLAST XML files or a single XML file can be selected (figure 14). The BLAST results will be added to your current OmicsBox session.
OmicsBox also allows the input of TimeLogic DeCypher Blast results. 
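
Before loading a BLAST XML file, it can be useful to inspect it outside OmicsBox. The sketch below reads the classic BLAST XML format with Python's standard library and lists the hits per query. The element names (Iteration, Hit_id, Hit_def, Hsp_evalue, Hsp_positive, Hsp_align-len) are those of the BLAST XML output; the embedded sample record is hand-written and made up, and the similarity is computed as positives divided by alignment length, which is an assumption about how such a percentage can be derived. For simplicity the sketch uses the first hsp listed for each hit.

```python
import xml.etree.ElementTree as ET

# Hand-written, made-up sample in the classic BLAST XML layout.
SAMPLE_XML = """
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>example|HIT1</Hit_id>
          <Hit_def>made-up sample protein</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_evalue>2e-30</Hsp_evalue>
              <Hsp_positive>90</Hsp_positive>
              <Hsp_align-len>120</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def read_hits(xml_text):
    """Return one dictionary per hit, based on the first hsp of each hit."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")
            if hsp is None:
                continue  # hit without any hsp: nothing to report
            align_len = int(hsp.findtext("Hsp_align-len"))
            positives = int(hsp.findtext("Hsp_positive"))
            rows.append({
                "query": query,
                "hit_id": hit.findtext("Hit_id"),
                "definition": hit.findtext("Hit_def"),
                "evalue": float(hsp.findtext("Hsp_evalue")),
                "align_len": align_len,
                "similarity": 100.0 * positives / align_len,
            })
    return rows


rows = read_hits(SAMPLE_XML)
```

For a real file, the text can be read with Path("result.xml").read_text() and passed to read_hits; the file name here is a placeholder.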

Figure 14: Load / Import Blast Results

Make Blast Database

This option allows creating a BLAST database from the sequences of any OmicsBox project or from a FASTA file (figure 15). It can be found under the arrow next to the BLAST icon.

  • Current project: OmicsBox will use the loaded sequences to create the BLAST database. Note: If the resulting database will be used for further GO mapping, a proper ID and a description line with "GO mappable" information are needed.
  • FASTA file: This option allows choosing your own FASTA file. The FASTA file has to be correctly formatted for NCBI Blast+.
  • Output Folder: Select the directory in which to save the created BLAST database.
  • Blast Database Name: Provide a name for the BLAST database.
  • Taxonomy Options:
    • Taxonomy ID: Enter the NCBI species ID.
    • Mapping file: If the sequences come from different species, it is possible to provide a text file that lists each sequence name together with its species ID, so that each ID is mapped to the corresponding sequence in the FASTA file.

      Example:

      TR|A0A022PMT6|ERYGU	4155
      TR|A0A022PMU0|ERYGU	4155
      TR|A0A059BJ72|EUCGR	71139
      TR|A0A059BJ72|EUCGR	71139
      TR|A0A061FDU3|THECC	3641
      TR|A0A067DJ79|CITSI	2711
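
A mapping file in the format shown above can be generated from the FASTA headers with a few lines of code. The following is a minimal sketch under two assumptions taken from the example: the sequence name is the first whitespace-delimited token of the header, and it ends with a species code after the last "|". The lookup table reuses the species IDs from the example; the function name and the sample sequences are made up.

```python
# Species code -> NCBI species ID, taken from the example above.
SPECIES_TO_TAXID = {"ERYGU": 4155, "EUCGR": 71139, "THECC": 3641, "CITSI": 2711}


def mapping_lines(fasta_text, species_to_taxid):
    """Build tab-separated 'sequence name <TAB> species ID' lines."""
    lines = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        if not fields:
            continue  # header without a name
        name = fields[0]
        species_code = name.rsplit("|", 1)[-1]
        # A KeyError here means the species code is missing from the lookup table.
        lines.append(f"{name}\t{species_to_taxid[species_code]}")
    return lines


# Made-up sequences; only the header names follow the example above.
example_fasta = """>TR|A0A022PMT6|ERYGU some description
MKTAYIAK
>TR|A0A067DJ79|CITSI
MAAGLLK
"""
mapping = mapping_lines(example_fasta, SPECIES_TO_TAXID)
```

The resulting lines can be written to a text file with "\n".join(mapping) and selected as the mapping file.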


Visit the following tutorial for more information on how to create the Taxonomy ID file.




Figure 15: Make Blast Database

Retrieve Blast Top-Hit

This feature allows retrieving the sequence information of Top Blast Hits in an OmicsBox project. Data can be obtained from the NCBI, Ensembl or UniProt web services and either stored in a new project or used to replace the existing IDs/sequences (see figure 16). A possible use case would be a so-called "Double-Blast": the BLAST results of a first run are used to replace the sequence data for a second run against a different set of query sequences. Imagine an RNA-seq data-set with a high percentage of sequences without any alignments against a protein database (e.g. blastx against NR). This feature could be used to select and extract the sequences without hits (the red ones) into a new project. These sequences could first be blasted against a set of EST sequences. The initially unaligned sequences are then replaced with the ESTs, and the initial blastx search is repeated against the protein database.

For each Top-Hit (the first significant alignment from an already performed BLAST), the filters (bottom part of the dialog) are applied and the hit is searched in the corresponding online database.
It is possible either to replace the sequences in your data-set or to extract them into a new data-set (Action option). You can also decide whether to keep the original sequence names or to rename them to the names of the downloaded sequences. The latter adds a small note to the sequence description stating the original name.
The last option lets you decide whether to replace your sequences with the downloaded ones or only to retrieve their names. This option is activated by default.


Figure 16: Retrieve Blast Top-Hit Dialog.

Retrieve Blat Top-Hit

This tool is very similar to "Retrieve Blast Top-Hit" explained above, but it employs BLAT instead (figure 17). The dialog is therefore quite similar and the first 3 options are identical. BLAT needs a reference FASTA file, which it uses to search for similar sequences. The last 2 options allow you to filter by similarity and to choose whether BLAT should consider the reverse strand.


Figure 17: Retrieve Blat Top-Hit Dialog.

This tool can be useful after running Prokaryotic Gene Finding, in order to replace the sequence names retrieved from Glimmer with the top hit from a reference FASTA file. For further details, click here.

Other BLAST Functions

  • Remove Blast Results: This option will remove the BLAST results from the selected sequences.
  • Run Blast-Descriptor-Annotator (BDA): This will run the BDA algorithm. For further details, please see Blast Configuration Page section.
  • Recover original Best-Blast-Hit Description: When this option is executed the sequence description column on the Main Sequence Table will contain the top blast hit description and not the one from the BDA.