A Field Guide to Whole Genome Sequencing Assembly and Annotation

Whole Genome Sequencing (WGS) assembly and annotation are critical steps in understanding the genetic makeup of organisms. This guide provides a comprehensive overview of the process, focusing on prokaryotic and eukaryotic genomes, and includes best practices for submission to GenBank.

Types of Genome Submissions: WGS vs. non-WGS

When submitting genomes to GenBank via the Submission Portal, you’ll need to classify them as either WGS or non-WGS. Understanding the differences is crucial for proper data handling and processing.

Figure: WGS vs. non-WGS genome assembly. In a non-WGS assembly every chromosome is a single piece; a WGS assembly may have chromosomes in multiple pieces, plus unplaced sequences.

non-WGS Genomes:

Each chromosome is represented by a single, contiguous sequence.
All sequences within the genome are assigned to a specific chromosome, plasmid, or organelle.
Plasmids and organelles can still exist in multiple fragments.

WGS Genomes:

One or more chromosomes are fragmented into multiple pieces.
Some sequences might not be assembled into chromosomes.

Commonalities:

Both types can contain gaps within sequences; this information needs to be specified during submission.
Plasmids and organelles can be in multiple pieces.
Internal sequences must maintain the correct order and orientation. Concatenated sequences of unknown order are not permitted.
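
The distinction above can be written as a small decision rule. This is only a sketch: the function name and its inputs are hypothetical, and the Submission Portal simply asks you to make this choice yourself. Plasmids and organelles are deliberately ignored, since they may be in multiple pieces in either type.

```python
def classify_assembly(pieces_per_chromosome, unplaced_sequences=0):
    """Return "non-WGS" or "WGS" for an assembly.

    pieces_per_chromosome: dict mapping a chromosome name to the number of
        sequences that make up that chromosome (hypothetical input shape).
    unplaced_sequences: number of sequences not assigned to any
        chromosome, plasmid, or organelle.
    """
    has_chromosomes = len(pieces_per_chromosome) > 0
    all_single_piece = all(n == 1 for n in pieces_per_chromosome.values())
    # non-WGS: every chromosome is one contiguous sequence and every
    # sequence is assigned somewhere.
    if has_chromosomes and all_single_piece and unplaced_sequences == 0:
        return "non-WGS"
    # Anything else (fragmented chromosomes or unplaced sequences) is WGS.
    return "WGS"
```

For example, `classify_assembly({"1": 1, "2": 1})` gives "non-WGS", while a draft assembly made only of unplaced contigs gives "WGS".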

Submitting Genomes: Single vs. Batch

There are two primary submission routes: single genome submission and batch genome submission. The choice depends on the number of genomes you are submitting and the level of standardization across them.

Submitting a Single Genome

This straightforward approach involves completing a web form in the Submission Portal and uploading FASTA (or SQN) files of the genome sequences. Key requirements include:

BioProject Association: Linking to an existing BioProject (created during read submission to SRA) or registering a new one.
BioSample Association: Linking to an existing BioSample (created during read submission to SRA) or registering a new one.
WGS/non-WGS Designation: Declaring the genome assembly type.
Sequence Upload: Providing FASTA sequences of the genome (or SQN files if annotated).
Metadata Provision: Answering prompts regarding genome assembly data, gap information, chromosome/plasmid assignments, author details, and release date.
Optional Annotation: Requesting annotation of prokaryotic genomes via PGAP.

Submitting a Batch of Genomes

This method enables the submission of up to 400 WGS or non-WGS genomes simultaneously. It requires selecting “Batch/multiple” in the Genome Submission Portal, completing the web form, uploading a Genome Info file containing genome metadata, and uploading (or preloading) FASTA files (or SQN files if annotated) of the genome sequences. Crucial requirements:

BioProject Consistency: All genomes must belong to the same BioProject.
Assembly Type Homogeneity: All genomes must be either WGS or non-WGS, not a mix.
Uniform Release Date: All genomes must share the same initial release date.
Consistent Gap Information: All genomes must have the same gap/Ns information.
File Type Uniformity: Use either FASTA or ASN (SQN) files exclusively, not a combination. FASTA files are recommended unless annotation or Genome-Assembly-Data structured comments are required.
Single File Per Genome: Each genome must be in its own file, and that file must also contain the genome's plasmids and organelles, if any. Do not concatenate several genomes into one file.
Annotation Request Consistency: Either all genomes request PGAP annotation or none do (relevant for prokaryotic genomes only).
BioProject and BioSample Details: Same requirements as single genome submission.
Sequence Assignment: Chromosome, plasmid, and organelle assignment information must be encoded within the FASTA files (see Additional requirements for batch submissions).
Genome Info Table: Upload a Genome Info table with genome-specific information.
Metadata: Provide gap information, author details, and release date via web page prompts.
Annotation (Optional): Request PGAP annotation if desired for prokaryotic genomes.
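
Several of the batch requirements above are uniformity checks that can be run locally before you open the portal. The sketch below is a partial pre-check under stated assumptions: the dictionary keys are hypothetical names chosen for illustration (they are not Submission Portal fields), and gap information and sequence assignment are not checked here.

```python
from pathlib import PurePath

MAX_BATCH_SIZE = 400  # batch limit stated above

# Attributes that must have one shared value across the whole batch.
# These key names are hypothetical, chosen for this example only.
UNIFORM_KEYS = ("bioproject", "assembly_type", "release_date", "request_pgap")


def check_batch(genomes):
    """Return a list of problems for a planned batch; empty if none found.

    genomes is a list of dicts, each with a "file_name" key plus the keys
    named in UNIFORM_KEYS.
    """
    problems = []
    if len(genomes) > MAX_BATCH_SIZE:
        problems.append(
            f"{len(genomes)} genomes exceeds the limit of {MAX_BATCH_SIZE}"
        )
    for key in UNIFORM_KEYS:
        values = {str(g[key]) for g in genomes}
        if len(values) > 1:
            problems.append(f"{key} is not uniform: {sorted(values)}")
    # FASTA and SQN files must not be combined in one batch.
    suffixes = {PurePath(g["file_name"]).suffix.lower() for g in genomes}
    if ".sqn" in suffixes and len(suffixes) > 1:
        problems.append("FASTA and SQN files are mixed")
    # One file per genome: no file may be listed twice.
    names = [g["file_name"] for g in genomes]
    if len(set(names)) != len(names):
        problems.append("a file name is listed for more than one genome")
    return problems
```

An empty list means only that these particular checks passed; the portal's own validation remains authoritative.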

Essential Submission Files and Formats

Preparing your files correctly is crucial for a smooth submission process. Here’s a breakdown of the required file formats and their specifications.

FASTA Files

FASTA files (.fsa suffix) contain the nucleotide sequences of your genome. Adhere to these guidelines:

Definition Line: Each sequence must have a definition line starting with “>” and a unique identifier (SeqID), e.g., contig001.
- SeqIDs must be shorter than 50 characters.
- SeqIDs can only include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
- SeqIDs must be unique within a genome.
Organism Information: Include the organism and relevant strain, breed, cultivar, or isolate in the definition line. Additional source qualifiers will be added from the registered BioSample during processing.
Sequence Quality: Remove any Ns from the beginning or end of each sequence.
Contig Length: Contigs should be at least 200 nucleotides long.
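
The guidelines above are mechanical enough to check before upload. The following is a minimal sketch, not an NCBI tool: the function name and the wording of the messages are made up for illustration, and it checks one record at a time rather than parsing a whole file.

```python
import re

# Character set and limits taken from the SeqID and contig rules above.
SEQID_PATTERN = re.compile(r"[A-Za-z0-9_.:*#-]+")
MAX_SEQID_LENGTH = 49    # SeqIDs must be under 50 characters
MIN_CONTIG_LENGTH = 200  # shortest contig the guidelines allow


def check_record(defline, sequence, seen_ids):
    """Return a list of problems found in one FASTA record.

    seen_ids is a set of SeqIDs already used in this genome; the SeqID of
    this record is added to it so later duplicates can be detected.
    """
    if not defline.startswith(">"):
        return ["definition line does not start with '>'"]
    fields = defline[1:].split(None, 1)
    seq_id = fields[0] if fields else ""
    problems = []
    if not SEQID_PATTERN.fullmatch(seq_id):
        problems.append(f"SeqID {seq_id!r} is empty or uses disallowed characters")
    if len(seq_id) > MAX_SEQID_LENGTH:
        problems.append(f"SeqID {seq_id!r} is 50 characters or longer")
    if seq_id in seen_ids:
        problems.append(f"SeqID {seq_id!r} is not unique within the genome")
    seen_ids.add(seq_id)
    # Ns are not allowed at either end of a sequence.
    if sequence[:1] in ("N", "n") or sequence[-1:] in ("N", "n"):
        problems.append("sequence begins or ends with N")
    if len(sequence) < MIN_CONTIG_LENGTH:
        problems.append(f"sequence is only {len(sequence)} nt long")
    return problems
```

Call it once per record with a shared `seen_ids` set for each genome, so that uniqueness is checked within a genome and not across genomes.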

Crucial for Batch Submissions:

All sequences belonging to a single genome must be contained within one file.
Chromosome, plasmid, and organelle assignment information must be encoded within the input files for batch submissions.

Chromosome Designation: To indicate a sequence represents a chromosome, include [location=chromosome] in the FASTA definition line. This is mandatory for at least one sequence when “non-WGS genome” is selected.
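Circular Topology: Sequences representing circular chromosomes or plasmids must include topology information, plus completeness information when the molecule is complete:
- [topology=circular] [completeness=complete] for a complete, circularized sequence
- [topology=circular] alone when there is a gap at the end and the sequence is not circularized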
Plasmid/Organelle/Chromosome Assignment: Designate sequences belonging to plasmids, organelles, or specific nuclear chromosomes:
- [plasmid-name=pBR322]
- [plasmid-name=unnamed] (use unique names like unnamed1, unnamed2 for distinct unnamed plasmids)
- [location=mitochondrion]
- [location=chloroplast]
- [chromosome=2]
Plasmid and Chromosome Naming: Follow the Plasmid and chromosome names rules.

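Example definition lines:

A complete circular plasmid:
>contig02 [organism=Clostridium difficile] [strain=ABDC] [plasmid-name=pABDC1] [topology=circular] [completeness=complete]

A chromosome represented by a single sequence:
>Seq001 [organism=Puma concolor] [isolate=ABDC] [location=chromosome] [chromosome=2]

Two sequences assigned to the same chromosome, which is therefore in multiple pieces (WGS):
>Seq001 [organism=Puma concolor] [isolate=ABDC] [chromosome=5]
>Seq002 [organism=Puma concolor] [isolate=ABDC] [chromosome=5]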
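
When many definition lines have to be written, generating the bracketed modifiers from code avoids typos. This is a sketch with hypothetical helper names; it only formats and reads back `[key=value]` pairs and does not check that a modifier name is one GenBank accepts.

```python
import re

# Matches one bracketed [key=value] modifier.
MODIFIER_PATTERN = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def build_defline(seq_id, organism, **modifiers):
    """Build a FASTA definition line with bracketed source modifiers.

    Keyword names use underscores where the modifier has a hyphen, so
    plasmid_name="pBR322" becomes [plasmid-name=pBR322].
    """
    parts = [f">{seq_id}", f"[organism={organism}]"]
    for key, value in modifiers.items():
        parts.append(f"[{key.replace('_', '-')}={value}]")
    return " ".join(parts)


def parse_modifiers(defline):
    """Return the [key=value] modifiers of a definition line as a dict."""
    return {k.strip(): v.strip() for k, v in MODIFIER_PATTERN.findall(defline)}
```

For instance, `build_defline("Seq001", "Puma concolor", isolate="ABDC", chromosome=5)` reproduces the form of the chromosome 5 lines above, and `parse_modifiers` can be used to confirm that every sequence in a batch file carries an assignment.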

SQN Files

SQN files are generally required only when submitting annotation along with the genome sequence. Annotation is optional for GenBank genome submissions.

Use table2asn to prepare a .sqn file for submission. The program reads a template file together with the FASTA sequence files and the annotation table files, and writes an ASN (.sqn) file for submission to GenBank.

Steps to Generate SQN files:

Prepare Data Files: Prepare FASTA files as described above, with one file per genome. Also, prepare additional files for annotation.
Run table2asn:
- If annotation is in GenBank-specific GFF files, follow the instructions for GFF files.
- If annotation is in .tbl files, follow the instructions for feature table (.tbl) files.
- Common table2asn command:
```
table2asn -indir path_to_files -t template -M n -Z
```
- Command for sequences with gaps:
```
table2asn -indir path_to_files -t template -M n -Z -gaps-min 10 -l paired-ends
```
Check output: Check the validation and discrepancy report and fix errors. All Errors and Rejects need to be fixed.
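
If table2asn is run as part of a larger pipeline, the two commands above can be assembled programmatically. The sketch below uses only the flags shown in this section; the function name and parameters are hypothetical, and it builds the argument list without running anything.

```python
def table2asn_command(indir, template, gaps_min=None, linkage_evidence=None):
    """Return the table2asn argument list for the commands shown above.

    gaps_min and linkage_evidence are only added when given; they map to
    the -gaps-min and -l options used for sequences with gaps.
    """
    cmd = ["table2asn", "-indir", str(indir), "-t", str(template),
           "-M", "n", "-Z"]
    if gaps_min is not None:
        cmd += ["-gaps-min", str(gaps_min)]
    if linkage_evidence is not None:
        cmd += ["-l", linkage_evidence]
    return cmd
```

On a machine where table2asn is installed, the returned list can be passed to `subprocess.run(cmd, check=True)`. Passing a list, not a single shell string, avoids quoting problems with paths that contain spaces.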

Genome Info Table

The Genome Info table is mandatory for batch submissions. It supplies the Genome Assembly Data for each genome. Use the Genome Info file template to prepare this table.

Each row in the template represents one genome. The required and optional fields are:

Required Fields:

Biosample accession OR sample_name
Assembly method
Assembly method version
Genome coverage
Sequencing technology
File name

Optional Fields:

Assembly date
Assembly name
Reference genome
Update (update_for)
bacteria_available_from
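
For a large batch it can be easier to generate the table from a script than to fill it in by hand. The sketch below checks the required fields listed above and writes tab-delimited text. Two things are assumptions: the column keys are hypothetical spellings of the field names (the official Genome Info template defines the real headers, so copy them from there), and tab-delimited output is assumed to be an acceptable format.

```python
import csv
import io

# Hypothetical column keys mirroring the field lists above.
SAMPLE_KEYS = ["biosample_accession", "sample_name"]  # one of these is required
REQUIRED = ["assembly_method", "assembly_method_version", "genome_coverage",
            "sequencing_technology", "file_name"]
OPTIONAL = ["assembly_date", "assembly_name", "reference_genome",
            "update_for", "bacteria_available_from"]


def missing_fields(row):
    """Return the names of required fields that are absent or empty."""
    missing = [key for key in REQUIRED if not row.get(key)]
    if not any(row.get(key) for key in SAMPLE_KEYS):
        missing.append("biosample_accession or sample_name")
    return missing


def to_tsv(rows):
    """Validate every row, then return the table as tab-delimited text."""
    for number, row in enumerate(rows, start=1):
        missing = missing_fields(row)
        if missing:
            raise ValueError(f"row {number} is missing: {', '.join(missing)}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SAMPLE_KEYS + REQUIRED + OPTIONAL,
                            delimiter="\t", restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Each row is a dictionary with one entry per field; optional fields that are left out are written as empty cells.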

Core Metadata Requirements

Several key metadata elements are crucial for every genome submission.

BioProject

The BioProject provides a description of the research effort, grant information, and links to public data. Every genome must be associated with a BioProject. Reuse existing BioProjects if appropriate, and avoid creating duplicates.

BioSample

The BioSample contains source information for the sequenced sample. Use the same BioSample for the sequence reads and the genome assembly. For unannotated genomes, you can register a new BioSample during the genome submission process. For genomes submitted with annotation, register the BioSample before submission, because pre-registration is what provides the locus_tag prefix used in the annotation.

Genome Assembly Data

This includes details on the assembly process:

Assembly method: Name of the assembly algorithm(s).
Assembly method version or date: Version of the algorithm or the date it was run.
Genome coverage: Estimated base coverage across the genome (e.g., 12x).
Sequencing technology: Sequencing platform(s) used.
Assembly date: Optional. The date of the assembly, given to the year, month, or day.
Assembly name: Optional. A short name for display.
Full or Partial Genome in the sample: Indicate if the entire genome or a subset was sequenced.
Reference genome: If not a de novo assembly, provide the accession.version and/or assembly name of the reference genome.
Update: Accession of the genome being updated (if applicable).
bacteria_available_from: Optional. For prokaryotes, provide contact information for obtaining the bacterial culture.
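
Genome coverage is commonly estimated as the total number of sequenced bases divided by the assembly length. The sketch below applies that rule of thumb; it is one common estimate and not a method prescribed by GenBank, and the function names are hypothetical.

```python
def estimate_coverage(total_read_bases, assembly_length):
    """Estimate average base coverage as read bases / assembly length."""
    if assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    return total_read_bases / assembly_length


def format_coverage(coverage):
    """Format a coverage value in the "12x" style used above."""
    return f"{coverage:.0f}x"
```

With 60,000,000 sequenced bases and a 5,000,000 bp assembly, the estimate is 12.0, which formats as "12x".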

Gap Information

Specify details about the gaps in your assembly:

Minimum number of consecutive Ns representing a gap (≤10).
Number of Ns representing a gap of unknown length.
Evidence used to link sequences across the gap (usually paired-ends).
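
To answer these questions accurately, it helps to see which runs of Ns in your sequences would be treated as gaps. The sketch below lists runs that reach a minimum length and flags those whose length equals the value you declare for gaps of unknown length. The default of 100 for `unknown_length` is only an illustrative assumption; use whatever value your assembly actually uses.

```python
import re


def find_gaps(sequence, min_gap=10, unknown_length=100):
    """Return (start, end, kind) for each run of Ns at least min_gap long.

    start and end are 0-based, end-exclusive positions. kind is "unknown"
    when the run length equals unknown_length, otherwise "estimated".
    Shorter runs of N are ignored (treated as ambiguous bases).
    """
    if min_gap < 1:
        raise ValueError("min_gap must be at least 1")
    pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    gaps = []
    for match in pattern.finditer(sequence):
        length = match.end() - match.start()
        kind = "unknown" if length == unknown_length else "estimated"
        gaps.append((match.start(), match.end(), kind))
    return gaps
```

Running this over every sequence before submission shows whether the minimum you plan to declare matches how gaps are actually represented in the files.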

Chromosome and Plasmid Assignments

Indicate which sequences are chromosomes or plasmids. Adhere to the naming rules:

Chromosome Names

Can contain only ASCII letters, digits, dots, and underscores.
Cannot include “chr” or “chromosome”. Use “LG” for linkage groups.
Limited to 33 characters.
Cannot be “unknown,” “Un,” “Unk,” or “0”.

Plasmid Names

Can contain only ASCII letters, digits, dots, and underscores.
Should start with lowercase “p” unless the name is unknown (use “unnamed,” “unnamed1,” etc.).
Cannot include “plasmid.”
Limited to 20 characters.
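
The naming rules can be checked with a few lines of code. This sketch makes two interpretive assumptions that are worth verifying against the official rules: "ASCII characters" is read as ASCII letters, and the reserved chromosome names ("unknown", "Un", "Unk", "0") are treated as forbidden whole names, compared without regard to case. The function names are hypothetical.

```python
import re

# Assumed character set: ASCII letters, digits, dots, and underscores.
NAME_CHARS = re.compile(r"[A-Za-z0-9._]+")
RESERVED_CHROMOSOME_NAMES = {"unknown", "un", "unk", "0"}


def chromosome_name_problems(name):
    """Return a list of rule violations for a chromosome name."""
    problems = []
    if not NAME_CHARS.fullmatch(name):
        problems.append("empty or contains disallowed characters")
    if "chr" in name.lower():  # also catches "chromosome"
        problems.append('contains "chr" or "chromosome"')
    if len(name) > 33:
        problems.append("longer than 33 characters")
    if name.lower() in RESERVED_CHROMOSOME_NAMES:
        problems.append("is a reserved name")
    return problems


def plasmid_name_problems(name):
    """Return a list of rule violations for a plasmid name."""
    problems = []
    if not NAME_CHARS.fullmatch(name):
        problems.append("empty or contains disallowed characters")
    if "plasmid" in name.lower():
        problems.append('contains "plasmid"')
    if len(name) > 20:
        problems.append("longer than 20 characters")
    # Names start with lowercase p unless they are unnamed, unnamed1, ...
    if not (name.startswith("p") or re.fullmatch(r"unnamed\d*", name)):
        problems.append('does not start with "p" and is not "unnamed"')
    return problems
```

Both functions return an empty list for names such as "LG4" or "pBR322" and a list of messages for names such as "chr2" or "plasmid1".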

Submitting and Post-Submission Process

Submit all files via the Genome Submission Portal. Choose “Single genome” or “Batch/multiple genomes,” answer the questions, upload the files, review, and submit. A “SUB” identifier will be assigned.

Post-Submission Steps

NCBI performs automated validations and staff reviews. If issues are found, you’ll receive an email with details and instructions for correction.

Problems that could be found:

Errors and Warnings from validation (.val files).
FATAL errors from the discrepancy report (.dr file).
Sequence contamination.
Unexpected genome size.
Misidentification of the organism (based on ANI analysis).

Fix the issues, log back into the Genome Submission Portal, retrieve the submission, click “FIX,” delete problematic files, and upload corrected versions.

After accession number assignment, staff performs a thorough review, which can lead to further communication if issues are encountered.

Submission Statuses

Queued: Waiting for initial review.
Error: Files have errors; resubmission needed.
Processing, no accession number: Passed initial validations, waiting for review.
Processing, accession number: Accessions assigned; NCBI staff processing.
Processed: Publicly released.

Prokaryotic Genome Annotation Pipeline (PGAP)

You can request annotation by the Prokaryotic Genome Annotation Pipeline (PGAP) during submission. Alternatively, you can download PGAP, run it yourself before submission, and generate a GenBank-compliant annotated genome that is ready to submit.

This field guide covers what you need to prepare and submit assembled genomes, with or without annotation, to GenBank. Following these guidelines helps a submission pass validation and helps ensure that your genomic data is represented accurately in the public record.