The most important source of new data for GenBank is direct submissions from scientists. GenBank depends on its contributors to help to keep the database as comprehensive, current, and accurate as possible. NCBI provides timely and accurate processing and biological review of new entries, updates existing entries and assists authors with the submission of new data.
NCBI Submission Tool
The NCBI data submission tool (General Tools → Create NCBI GenBank Genome Submission Files) facilitates the creation of GenBank files ready for submission. The tool combines a reference genome (fasta file), the gene coordinates (gff file) and the functional annotations of OmicsBox to create a feature table, which is then validated with the tbl2asn program. The `tbl2asn' command-line program automates the creation of sequence records (.sqn files). For further information about tbl2asn visit: http://www.ncbi.nlm.nih.gov/genbank/tbl2asn2.
Figure 1: NCBI Submission Tool
- This tool requires an internet connection: the tbl2asn program needs to connect to the NCBI databases in order to validate the annotations.
To successfully submit the annotated sequences, it is first necessary to prepare the source of the annotations, i.e. the reference genome to which the sequences belong, the position on the genome, and the functional annotation. These files are processed by OmicsBox and validated by `tbl2asn' to create the ASN1 file (.sqn) and the validation files (Figure 2).
Figure 2: General Workflow
Three elements are necessary to create the submission files:
- Reference genome: This file provides the organism's nucleotide sequence and may contain one or more chromosomes. The chromosome names in the fasta description line have to match the GFF file name(s).
- Genomic annotation: This data is provided by the GFF3 files, and is also used to link the annotations and the genome reference sequences in the Fasta file. The sequence names used in the OmicsBox project should appear in the attributes column of the GFF3 file. The corresponding feature ID can be specified as a parameter (default is seqName).
- Functional annotation: This information is provided by your OmicsBox project, and is intended to provide the functional features of your sequences, including gene names, Gene Ontology terms and enzyme numbers. The option to create the submission file is only activated when an OmicsBox Project file is loaded and selected.
The sequence name of the functional annotation in your OmicsBox project has to match a feature of your choice in the gff file.
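The naming requirement for the reference genome can be checked before running the tool. The following is a minimal sketch, assuming that "match the GFF file name" means the first word of each fasta description line equals the base name of a GFF file (without extension). The function names and the example paths are hypothetical and are not part of OmicsBox.

```python
from pathlib import Path


def fasta_entry_names(fasta_text):
    """Return the first word of every FASTA description line."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            names.append(words[0] if words else "")
    return names


def unmatched_entries(fasta_text, gff_paths):
    """Return FASTA entry names with no GFF file of the same base name.

    Assumption: a GFF file "matches" an entry when its file name without
    the extension is identical to the entry name.
    """
    gff_names = {Path(p).stem for p in gff_paths}
    return [n for n in fasta_entry_names(fasta_text) if n not in gff_names]


# Illustrative data only: two entries, but a GFF file for just one of them.
example_fasta = ">chr1 example chromosome\nACGT\n>chr2\nGGCC\n"
missing = unmatched_entries(example_fasta, ["annotations/chr1.gff"])
print(missing)
```

Any name reported by `unmatched_entries` points to a fasta entry that would need either a renamed GFF file or a corrected description line.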
Preparing Your Data
In order to integrate all the information and to create the NCBI submission files, we need to create informative links between them. As discussed above, the GFF3 files act as a link between the genome sequence and the functional annotation (Figure 3). The sequence name used in the OmicsBox project should appear in the attributes field (ninth column) of the GFF3 file. The corresponding feature ID can be specified as a parameter (for the GFF files created by Augustus and Glimmer included in OmicsBox, the IDs correspond to `seqName' by default).
Figure 3: Information Integration Scheme
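The link between project sequence names and the GFF3 attributes field can be verified with a short script. This is a sketch under stated assumptions: the attribute key `ID` and the example records are placeholders, and the real key depends on the feature ID chosen in the wizard. The function names are hypothetical.

```python
def gff_attribute_values(gff_text, key):
    """Collect the values of one attribute key from column 9 of GFF3 text."""
    values = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # skip malformed lines rather than guessing
        for pair in columns[8].split(";"):
            name, sep, value = pair.strip().partition("=")
            if sep and name == key:
                values.add(value)
    return values


def names_missing_from_gff(project_names, gff_text, key="ID"):
    """Return project sequence names that never appear under `key`."""
    found = gff_attribute_values(gff_text, key)
    return sorted(set(project_names) - found)


# Illustrative GFF3 records; the key "ID" is an assumption for this example.
example_gff = (
    "##gff-version 3\n"
    "chr1\tAugustus\tgene\t100\t900\t.\t+\t.\tID=g1\n"
    "chr1\tAugustus\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n"
)
print(names_missing_from_gff(["g1.t1", "g2.t1"], example_gff))
```

An empty result means that every sequence name in the project can be linked to a GFF3 feature through the chosen key.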
Page 1: Project Details
- Locus tag: The `locus tag' is an alphanumeric identifier of your project, either provided by the NCBI or chosen by the user when registering the BioProject at: https://submit.ncbi.nlm.nih.gov/subs/bioproject
- Laboratory ID: The laboratory ID is a unique tag that refers to your own laboratory and allows the sequences to be associated with it.
- Submission type: Here you can choose the type of submission you want to perform.
- One or a few nucleotide sequences: use this option if you have a small data-set containing few sequences (less than a chromosome, or a chromosome at scaffold stage).
- Complete eukaryotic genomes or chromosomes: use this option if you have a complete data set without N's, forming a chromosome or a whole genome.
- Incomplete genomes (WGS): use this option if your data-set consists of incomplete genomic or chromosomal assemblies derived from shotgun sequencing methods.
- Assembly details (only for WGS submission): These details provide information about the more technical steps of the assembly. Here we can find:
- Assembly method: The program or algorithms used to assemble the genome.
- Assembly name: This is a short project identifier.
- Long assembly name: This is a longer and more explanatory name of your project.
- Genome coverage: This is the mean genome coverage obtained by the assembler and has a general format of one or more digits followed by an `x' (e.g. 12x or 76x).
- Sequencing technology: The name of the technology used to perform the sequencing of the query genome. If the technology used is not in the list shown, you can manually enter the name.
- Optional source qualifiers: These are additional qualifiers that apply to your whole project. Specifying optional qualifiers allows you to add useful information regarding the organism, chromosome, type, etc. If you are going to submit a WGS project, add source information such as the organism and the relevant strain, breed, cultivar or isolate, if one exists for the sequenced organism.
Figure 4: Information Integration Scheme
- The `gcode' qualifier, which corresponds to the genetic code, is only mandatory if the submitted organism is not specified or is not in the NCBI Taxonomy Browser.
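The genome coverage format described on this page (one or more digits followed by an `x') can be checked with a regular expression. The sketch below is deliberately strict and follows only that description. Whether other forms such as decimal values are accepted is not stated here, so they are rejected. The function name is hypothetical.

```python
import re

# One or more digits followed by a lowercase "x", e.g. "12x" or "76x".
COVERAGE_PATTERN = re.compile(r"\d+x")


def is_valid_coverage(value):
    """Return True if `value` has the form <digits>x and nothing else."""
    return COVERAGE_PATTERN.fullmatch(value) is not None


for candidate in ["12x", "76x", "12", "x", "12.5x"]:
    print(candidate, is_valid_coverage(candidate))
```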
Page 2: Sequence Data and Annotation Files
- Output Directory: The creation and validation of the submitted sequences will produce multiple files that may be checked. This option allows the files to be saved in an existing folder or in a newly created one.
- Fasta File: The reference genome is the FASTA or multi-FASTA file containing the sequences to be submitted. This tool is designed to submit complete eukaryotic genomes or chromosomes. If you are submitting a single complete chromosome, it must be in a single fasta entry; if you are submitting a complete genome, you must have a single entry for each chromosome. Important note: if this is a complete genome or chromosome submission, remove all the ‘Ns’ present in the fasta file.
- Genome annotation: The genome annotation refers to the .gff file containing the gene coordinates for each annotated gene. This file must be named according to the data entry to which it corresponds.
- Feature ID: The feature ID refers to the tag in the ninth column of the gff file that contains the name of the sequence, displayed as SeqName in OmicsBox.
- Gene names: Here you can choose how names are assigned to your annotated sequences. The options are: “hypothetical protein”, the “SeqName” assigned in the OmicsBox project, or the name of the “Top BLAST Hit”. If this last option is selected, you can set thresholds for:
- E-value: The maximum E-value accepted for the alignment between the top BLAST hit and your query (default value is 1E-6).
- Similarity: The minimum percent similarity between the two sequences (0-100).
- Coverage: The minimum percent coverage between the two sequences (0-100).
Figure 5: Sequence Data and Annotation Files page
- If the threshold is not reached, the name of the gene will be “hypothetical protein”.
- Manual gene name annotations always take priority.
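The naming rules for the “Top BLAST Hit” option can be summarised in a few lines of code. This is only a sketch of the logic as described on this page, not the OmicsBox implementation. The function name, the dictionary keys, and the similarity and coverage values in the example are hypothetical, and only the default E-value of 1E-6 comes from the text.

```python
def choose_gene_name(top_hit, min_similarity, min_coverage,
                     max_evalue=1e-6, manual_name=None):
    """Pick a gene name following the priorities described above.

    top_hit is None or a dict with the (hypothetical) keys
    "description", "evalue", "similarity" and "coverage".
    """
    if manual_name:
        return manual_name  # manual annotations take priority
    if top_hit is None:
        return "hypothetical protein"
    passes = (
        top_hit["evalue"] <= max_evalue
        and top_hit["similarity"] >= min_similarity
        and top_hit["coverage"] >= min_coverage
    )
    return top_hit["description"] if passes else "hypothetical protein"


# Illustrative hit; all numbers are made up for this example.
hit = {"description": "example kinase", "evalue": 1e-20,
       "similarity": 85.0, "coverage": 90.0}
print(choose_gene_name(hit, min_similarity=80.0, min_coverage=80.0))
print(choose_gene_name(hit, min_similarity=95.0, min_coverage=80.0))
```

The first call returns the hit description because all three thresholds are met. The second falls back to “hypothetical protein” because the similarity is below the requested minimum.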
Page 3: Author’s and Affiliation data
- Contact data: This page allows you to provide contact details for the submitting person. This information will not be publicly visible and can only be used by the NCBI staff for validation.
- Institution data: Information about the institution where the sequencing was performed is provided here.
- Title of the manuscript: This title is provisional and can be modified at any time via email request to the NCBI.
- Release date: This is the date when your submitted and validated data will be accessible in the NCBI database. If the release date is the same day as the submission or earlier, the data will be available automatically once it is validated.
- Names and Initials: Insert the names and initials of the individuals who must receive scientific credit for the generation of the sequences and annotations in this submission. If the authors are part of a consortium, it is not necessary that they appear as individual authors, as they are represented in the ‘Consortium’ option.
Figure 6: Author’s and Affiliation page
Once the input data has been analysed and processed via the tbl2asn tool, several result files are created.
- .sqn: This is the ASN1 file containing the compressed information of the .tbl and .sbt files.
- .tbl: The file containing the coordinates and the features for each annotated gene.
- .sbt: The file containing the authors and project information.
- .gbf: This is the GenBank flat file, a preview of how the .sqn will look once it is published.
- .ecn: This file contains the Enzyme Commission (EC) number errors and the changes applied by the NCBI automatic validation you have just performed.
- .val: This is the same file as “errorsummary.val”, with more details and explanations that will guide you in making the appropriate corrections.
- .txt: This file contains additional information about the errors found.
A result page provides a summary of the different types of errors and warnings. Errors must be corrected and warnings should be reviewed (depending on your data, they may not actually be harmful). Modifications can be made by editing the gff or the annotation in the project. Once errors are corrected, the tool should be run again until an error-free validation is achieved. Once the submission files have no errors, the ASN1 (.sqn) file is ready for submission via the NCBI Genome Submission Tools (www.ncbi.nlm.nih.gov/projects/GenomeSubmit/genome_submit.cgi). Whenever you submit a new genome, it is necessary to send an email to the Submission Processing Center (firstname.lastname@example.org), specifying the registered BioProject and organism name in the message as well as the requested release date for the genome.
Figure 7: Results summary
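When the validation has to be repeated several times, it can help to tally the messages in the .val file outside of OmicsBox. The sketch below assumes that each message line starts with a severity, followed by `valid` and a bracketed code such as `[SEQ_FEAT.InternalStop]`. This layout is an assumption and should be checked against your own .val file. The example lines are made-up placeholders, not real tbl2asn output, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed line layout: "<SEVERITY>: valid [<MODULE>.<Code>] <message...>"
LINE_PATTERN = re.compile(r"^(\w+): valid \[([\w.]+)\]")


def summarize_validation(val_text):
    """Count messages per (severity, code) pair in .val-style text."""
    counts = Counter()
    for line in val_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            severity, code = match.groups()
            counts[(severity.upper(), code)] += 1
    return counts


# Placeholder lines that only imitate the assumed layout.
example_val = (
    "ERROR: valid [SEQ_FEAT.InternalStop] placeholder message\n"
    "WARNING: valid [SEQ_FEAT.CDSmRNArange] placeholder message\n"
    "ERROR: valid [SEQ_FEAT.InternalStop] placeholder message\n"
)
for (severity, code), n in sorted(summarize_validation(example_val).items()):
    print(severity, code, n)
```

Sorting the tally by severity makes it easy to see whether any errors remain before the tool is run again.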
Most Common Errors and Warnings
- ERROR(s): InternalStop + StartCodon + BadProteinStart + StopInProtein: These error codes usually appear grouped and refer to the same sequence. This may be due to an error in the gff that has shifted the reading frame. You can correct it by changing the frame in the .gff file.
- StartCodon: An illegal start codon was used. Some possible explanations are: (1) the wrong genetic code may have been selected; (2) the wrong reading frame may be in use; or (3) the coding region may be incomplete at the 5’ end, in which case a partial location should be indicated. This can be fixed in the .gff file, or by selecting the correct code in the ‘source qualifiers’ on the first wizard page.
- InternalStop: Internal stop codons are found in the protein sequence. Some possible explanations are: (1) the wrong genetic code may have been selected; (2) the wrong reading frame may be in use; (3) the coding region may be incomplete at the 5’ end, in which case a partial location should be indicated; or (4) the CdRegion feature location is incorrect. This can be fixed in the .gff file by modifying the start of the sequence, or by selecting the correct code in the ‘source qualifiers’ on the first wizard page.
- WARNING CDSmRNArange: This warning alerts you that two or more ‘CDS’ features are under the same ‘mRNA’ feature, but they are not contiguous. If you are working with prokaryotes, you must fix this in the .gff file. If you are working with eukaryotes, it is normal, as eukaryotic genes contain introns.
- WARNING CDSwithNoMRNAOverlap: This warning alerts you that a ‘CDS’ feature lies outside the ‘mRNA’ bounds. It should be fixed in the .gff file by extending the mRNA range.
- WARNING BadProteinName: The name assigned to this protein is not adequate. Remember that the protein name should not contain the words ‘hypothetical’ or ‘partial’, and must follow the UniProt guidelines for protein product names. Modify it in the OmicsBox project or directly in the .sqn file.
- WARNING CollidingGeneNames: Two gene features should not have the same name. This can be fixed in the OmicsBox project.
- WARNING MissingMRNAproduct: The mRNA feature refers to a cDNA product that is not contained in the record. This must be fixed in the .gff file.
- WARNING DuplicateInterval: The location has identical adjacent intervals, e.g., a duplicate exon reference. This can be fixed by removing the duplicated ‘exon’ or ‘CDS’ from the .gff file.
- WARNING mRNAgeneRange: An mRNA is overlapped by a gene feature, but is not completely contained by it. This can be corrected in the .gff by extending the range of the ‘mRNA’.
- NoOrgFound: This entry does not specify the organism that was the source of the sequence. Please enter a name for the organism on the first page of the wizard, in the ‘Optional source qualifiers’.
For more information about these errors, please refer to http://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/errmsg/valid.msg
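The reading frame explanation for the InternalStop and StartCodon errors can be explored with a small translation script. This is a minimal sketch that uses only the standard genetic code (NCBI translation table 1). An organism with another `gcode' would need a different table. The function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds, frame=0):
    """Translate a DNA sequence from the given frame offset (0, 1 or 2)."""
    cds = cds.upper()
    protein = []
    for i in range(frame, len(cds) - 2, 3):
        # Codons with ambiguous bases are reported as "X".
        protein.append(STANDARD_CODE.get(cds[i:i + 3], "X"))
    return "".join(protein)


def internal_stop_positions(cds, frame=0):
    """Return protein positions of stop codons other than the last codon."""
    protein = translate(cds, frame)
    return [i for i, aa in enumerate(protein[:-1]) if aa == "*"]


def frames_without_internal_stops(cds):
    """List the frame offsets whose translation has no internal stop."""
    return [f for f in (0, 1, 2) if not internal_stop_positions(cds, f)]


example_cds = "ATGAAATGA"  # made-up sequence for illustration
print(translate(example_cds))
print(internal_stop_positions(example_cds, frame=1))
print(frames_without_internal_stops(example_cds))
```

If a coding sequence only translates cleanly at a non-zero offset, the frame recorded for that feature in the .gff file is a likely place to look.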