A set of Bash scripts that download all known genomic sequences from NCBI GenBank for a given list of taxonomy IDs, using NCBI Datasets CLI tools. Assembly metadata is saved as TSV files and genome sequences are stored as gzip-compressed FASTA files.
The scripts rely on NCBI Datasets CLI tools, which are installed and managed through a Conda environment. To set up the environment, you need a working Conda installation, for instance Miniconda. The scripts were tested on Ubuntu 24.04.4 LTS x86-64.
Create and activate the ncbi-cli Conda environment:
conda env create --file envs/ncbi-cli.yml
conda activate ncbi-cli
Before running the scripts, prepare the two input files located by default in the input/ directory:
- taxids.txt – one NCBI Taxonomy ID per line, identifying the taxa for which genome assemblies will be downloaded
- fields.txt – metadata column names to include in the output TSV files, one per line; the first name must be accession; a comprehensive list of all available field names is provided in the reference for the dataformat tool
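As an illustration of the two file formats, the following sketch writes both files into a scratch directory and runs a basic sanity check on them. The `validate_inputs` helper is hypothetical and not part of the repository, and the field names after `accession` are only illustrative picks that should be checked against the dataformat reference.

```bash
# Work in a scratch directory so nothing in the repository is overwritten.
workdir=$(mktemp -d)
mkdir -p "$workdir/input"

# One NCBI Taxonomy ID per line (the two example taxa from this README).
printf '%s\n' 985762 53344 > "$workdir/input/taxids.txt"

# One metadata field name per line; "accession" must come first.
# The other two names are illustrative, not a recommendation.
printf '%s\n' accession organism-name assminfo-level > "$workdir/input/fields.txt"

# Hypothetical helper: succeeds only if both files look well-formed.
validate_inputs() {
    taxids_file=$1
    fields_file=$2
    # Every non-empty taxid line must consist of digits only.
    if grep -v '^$' "$taxids_file" | grep -qv '^[0-9][0-9]*$'; then
        echo "non-numeric taxid in $taxids_file" >&2
        return 1
    fi
    # The first field name must be exactly "accession".
    if [ "$(head -n 1 "$fields_file")" != "accession" ]; then
        echo "first field in $fields_file must be 'accession'" >&2
        return 1
    fi
    return 0
}

validate_inputs "$workdir/input/taxids.txt" "$workdir/input/fields.txt" \
    && echo "input files look valid"
```

A check like this also catches Windows line endings in taxids.txt, because a trailing carriage return makes the line non-numeric.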
In the input/ directory you will find example files for downloading genomes of Staphylococcus agnetis (taxid 985762) and Staphylococcus delphini (taxid 53344). You can use these files to test the scripts before changing anything: run the scripts in the proper order with default argument values (see the following sections).
The scripts use the following directory structure:
fetch-genomes/
├── envs/
│   └── ncbi-cli.yml
├── input/
│   ├── taxids.txt
│   └── fields.txt
├── output/
├── genomes/
├── fna/
├── genomes.zip
├── fetch_genomes.sh
├── check_md5sums.sh
└── restruct.sh
fetch_genomes.sh, check_md5sums.sh, and restruct.sh are the main scripts described in section 5. The envs/ directory contains the Conda environment file for NCBI Datasets CLI. Input files with taxonomy IDs and metadata field names are located in input/. The output/ directory is created by fetch_genomes.sh and contains assembly metadata TSV files and the list of accession numbers. genomes.zip is the intermediate dehydrated genome archive downloaded by fetch_genomes.sh. The genomes/ directory is also created by fetch_genomes.sh and holds the rehydrated NCBI dataset with the downloaded genome sequences. You can use the check_md5sums.sh script for an optional but recommended verification of MD5 checksums of the downloaded genome files. The fna/ directory is created by restruct.sh and provides a flat view of all genome FASTA files, each hard-linked from genomes/.
| Script | Description |
|---|---|
| fetch_genomes.sh | Downloads assembly metadata from NCBI for each taxid listed in input/taxids.txt and saves it as per-taxid TSV files (assemblies_taxid-*.tsv) and one merged TSV file (assemblies.tsv) in the output/ directory. Then downloads and rehydrates the corresponding genome sequences into the genomes/ directory. |
| check_md5sums.sh | Verifies MD5 checksums of the rehydrated .fna.gz genome files against the expected values listed in md5sum.txt in the genomes/ directory. Reports each mismatch and prints the total count of failed checks at the end. |
| restruct.sh | Restructures the NCBI dataset directory into a flat directory of genome FASTA files by creating hard links for all .fna.gz files from the genomes/ncbi_dataset/data/ directory into the fna/ directory, one file per assembly accession. |
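The internals of fetch_genomes.sh are not reproduced here. As a rough orientation, the dry-run sketch below prints, without executing anything, one plausible sequence of NCBI Datasets CLI calls for the workflow in the table. The helper names `fields_csv` and `print_plan` are hypothetical, and the choice of `--assembly-source GenBank` and the exact ordering of steps are assumptions, so the real script may differ.

```bash
# Join the lines of a fields file into the comma-separated form
# accepted by dataformat's --fields option.
fields_csv() {
    paste -s -d, "$1"
}

# Hypothetical dry run: print, but do not execute, a possible command
# sequence. Arguments: taxids file, fields file.
print_plan() {
    fields=$(fields_csv "$2")
    # Read one taxid per line; also handle a missing final newline.
    while IFS= read -r taxid || [ -n "$taxid" ]; do
        [ -n "$taxid" ] || continue
        echo "datasets summary genome taxon $taxid --assembly-source GenBank --as-json-lines | dataformat tsv genome --fields $fields > output/assemblies_taxid-$taxid.tsv"
    done < "$1"
    # Download a dehydrated archive for all collected accessions,
    # unpack it, then fetch the actual sequence files.
    echo "datasets download genome accession --inputfile output/accessions.txt --dehydrated --filename genomes.zip"
    echo "unzip genomes.zip -d genomes"
    echo "datasets rehydrate --directory genomes"
}

# Example call from the repository directory:
# print_plan input/taxids.txt input/fields.txt
```

Printing the commands first makes it easy to review which taxa and fields would be requested before any download starts.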
With the ncbi-cli environment active and the input files prepared, run the scripts in order from the repository directory.
Download assembly metadata and genome sequences:
./fetch_genomes.sh
Optional arguments:
- --taxid FILE – path to the file with taxids (default: input/taxids.txt)
- --fields FILE – path to the file with metadata field names (default: input/fields.txt)
- --outdir DIR – output directory for TSV files and the accession list (default: output/)
- --genarch FILE – path for the ZIP file with dehydrated genomes (default: genomes.zip)
- --gendir DIR – directory for the rehydrated genomes (default: genomes/)
- --redo – forces re-download of assembly metadata even when output files from a previous run already exist; by default, if output/accessions.txt is present, the script skips metadata download and proceeds directly to downloading genomes
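To make the interplay between the defaults and --redo concrete, here is a minimal sketch of how such options could be parsed and how the skip decision could be expressed. It is an illustration under assumptions, not the code of fetch_genomes.sh: the function names `parse_args` and `need_metadata` and the variable names are hypothetical.

```bash
# Hypothetical parser: sets the variables below from the given arguments.
parse_args() {
    taxid_file=input/taxids.txt
    fields_file=input/fields.txt
    outdir=output
    genarch=genomes.zip
    gendir=genomes
    redo=0
    while [ $# -gt 0 ]; do
        case $1 in
            --redo) redo=1; shift; continue ;;
            --taxid|--fields|--outdir|--genarch|--gendir) ;;
            *) echo "unknown argument: $1" >&2; return 1 ;;
        esac
        # All remaining options take exactly one value.
        [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
        case $1 in
            --taxid)   taxid_file=$2 ;;
            --fields)  fields_file=$2 ;;
            --outdir)  outdir=$2 ;;
            --genarch) genarch=$2 ;;
            --gendir)  gendir=$2 ;;
        esac
        shift 2
    done
}

# Metadata is fetched when --redo was given or when no accession list
# from a previous run exists in the output directory.
need_metadata() {
    [ "$redo" -eq 1 ] || [ ! -f "$outdir/accessions.txt" ]
}

# Example: parse_args --outdir results --redo && need_metadata && echo "fetching metadata"
```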
Optionally, verify the integrity of the downloaded genome files:
./check_md5sums.sh
Optional argument:
- --gendir DIR – directory containing the .fna.gz genome files and the md5sum.txt file (default: genomes/)
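The verification step can be approximated with GNU md5sum alone. The sketch below assumes that md5sum.txt uses the standard md5sum line format (checksum followed by a path relative to the dataset directory) and that GNU coreutils is available, as on Ubuntu. The `check_fna_md5` helper is hypothetical and only mirrors the behavior described above: report each mismatch, then print the total.

```bash
# Hypothetical helper. Argument: dataset directory containing md5sum.txt.
# Runs in a subshell (note the parentheses) so the cd does not leak.
check_fna_md5() (
    cd "$1" || exit 2
    [ -f md5sum.txt ] || { echo "md5sum.txt not found in $1" >&2; exit 2; }
    # Verify only the .fna.gz entries; --quiet prints failing files only.
    # LC_ALL=C keeps the "FAILED" marker untranslated.
    report=$(grep '\.fna\.gz$' md5sum.txt | LC_ALL=C md5sum -c --quiet - 2>/dev/null)
    failed=$(printf '%s\n' "$report" | grep -c ': FAILED')
    [ -z "$report" ] || printf '%s\n' "$report"
    echo "failed checks: $failed"
    [ "$failed" -eq 0 ]
)

# Example: check_fna_md5 genomes
```

The function returns a non-zero status when any file fails, so it can gate later steps such as restructuring or cleanup.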
Restructure the downloaded genomes into a flat FASTA directory:
./restruct.sh
Optional arguments:
- --gendir DIR – rehydrated NCBI dataset directory to restructure (default: genomes/)
- --flatdir DIR – output directory for the hard-linked .fna.gz files; must not exist yet (default: fna/)
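The hard-linking idea behind this step can be sketched in a few lines. This is an assumption-laden illustration, not the contents of restruct.sh: the `flatten_fna` helper is hypothetical, and it keeps the original file names, whereas the real script may name the links differently.

```bash
# Hypothetical helper. Arguments: dataset directory, flat output directory.
flatten_fna() {
    src=$1/ncbi_dataset/data
    dest=$2
    # Refuse to reuse an existing directory, mirroring the --flatdir rule.
    if [ -e "$dest" ]; then
        echo "$dest already exists" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # Hard-link every .fna.gz file into the flat directory. The links
    # share their data with the originals, so no extra space is used.
    find "$src" -type f -name '*.fna.gz' -exec ln {} "$dest"/ \;
}

# Example: flatten_fna genomes fna
```

Hard links only work within a single filesystem, so the flat directory has to live on the same volume as the dataset directory.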
You do not need to pass any arguments to the scripts. If none are provided, the scripts will work correctly using the default values and the directory structure described in section 4.
Once all three scripts have been run and the genome sequences are collected in the flat fna/ directory (or the directory set with --flatdir), the intermediate archive and dataset directory can be safely removed:
rm -rf genomes.zip genomes
Note: rm -rf permanently deletes the files. Make sure the restruct.sh script completed successfully before removing them.
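For a more cautious cleanup, the removal can be guarded by a simple count comparison. The sketch below is a hypothetical helper, not part of the repository: it deletes the archive and the dataset directory only if the flat directory holds as many .fna.gz files as the dataset does. It assumes the default NCBI dataset layout under ncbi_dataset/data/.

```bash
# Hypothetical helper. Arguments: dataset directory, flat directory, archive.
safe_cleanup() {
    gendir=$1
    flatdir=$2
    genarch=$3
    # Count genome files on both sides; a missing directory counts as 0.
    n_src=$(find "$gendir/ncbi_dataset/data" -type f -name '*.fna.gz' 2>/dev/null | wc -l | tr -d ' ')
    n_flat=$(find "$flatdir" -type f -name '*.fna.gz' 2>/dev/null | wc -l | tr -d ' ')
    if [ "$n_src" -eq 0 ] || [ "$n_src" -ne "$n_flat" ]; then
        echo "refusing to delete: $n_src source files vs $n_flat links" >&2
        return 1
    fi
    # Hard links in the flat directory keep the data alive after this.
    rm -rf "$genarch" "$gendir"
}

# Example: safe_cleanup genomes fna genomes.zip
```

Because the files in the flat directory are hard links, their contents remain available after the dataset directory is removed.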