Multiple Bacteria Genome Compressor (MBGC) is a tool for compressing genomes in FASTA (or gzipped FASTA) input format. It is tailored for fast and efficient compression of bacteria species collections.
The implementation:
- supports gzipped (.gz) archives as input,
- preserves folder structure of compressed files,
- decompresses DNA streams without EOLs or with a fixed number of bases per line,
- handles FASTA files of non-bacterial species as well,
- supports standard input and output during (de)compression.
mbgc is now available through the bioconda repository. Once the conda package manager is installed, run the following command to install mbgc:
conda install -c bioconda mbgc
mbgc requires the libdeflate library to work (an example install howto for Debian and Ubuntu).
The following steps create an mbgc executable.
On Linux, building mbgc requires cmake version >= 3.4 (check using cmake --version):
git clone https://github.com/kowallus/mbgc.git
cd mbgc
mkdir build
cd build
cmake ..
make mbgc
Usage for multiple file compression (list of files given as input):
mbgc [-c compressionMode] [-t noOfThreads] <sequencesListFile> <archiveFile>
Usage for single file compression:
mbgc [-c compressionMode] [-t noOfThreads] -i <inputFastaFile> <archiveFile>
Usage for decompression:
mbgc -d [-t noOfThreads] [-f pattern] [-l dnaLineLength] <archiveFile> [<outputPath>]
<sequencesListFile> name of text file containing a list of FASTA files (raw or in gz archives)
(given in separate lines) for compression
<inputFastaFile> name of a FASTA file (raw or in gz archive) for compression
<archiveFile> mbgc archive filename
<outputPath> extraction target path root (if skipped the root path is the current directory)
Basic options:
-c select compression mode (speed: 0; default: 1; repo: 2; max: 3)
-d decompression mode
-l format decompressed DNA (i.e., sets the number of bases per row)
-f decompress files with names containing the given pattern
-t number of threads used (default: 8)
-h print full help and exit
-v print version number and exit
Compression modes description:
(0) speed - for speed (fast compression and decompression)
(1) default - regular mode (fast compression and good ratio)
(2) repo - for public repositories (better ratio and fast decompression)
(3) max - for long-term storage (the best ratio)
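The mode numbers above can be wrapped in a small helper so that scripts refer to modes by name. This is only a sketch: the function name mode_number and the thread count are hypothetical, and only the -c and -t options listed above are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of mbgc): map a mode name to the number expected by -c.
mode_number() {
    case "$1" in
        speed)   echo 0 ;;
        default) echo 1 ;;
        repo)    echo 2 ;;
        max)     echo 3 ;;
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Build the command line for long-term storage with 4 threads (thread count is arbitrary).
cmd="mbgc -c $(mode_number max) -t 4 seqlist.txt comp.mbgc"
echo "$cmd"   # prints: mbgc -c 3 -t 4 seqlist.txt comp.mbgc
```

The command is only printed here; run it directly once seqlist.txt exists.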
compression of FASTA files (raw or gzipped) listed in seqlist.txt file (one FASTA file per line):
./mbgc seqlist.txt comp.mbgc
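The list file can be generated rather than written by hand. The sketch below is one possible way to do it: make_seqlist is a hypothetical helper, and the directory layout and file suffixes are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list FASTA files under a directory, one path per line.
# The suffixes below are assumptions; adjust them to the collection at hand.
make_seqlist() {
    find "$1" -type f \( -name '*.fasta' -o -name '*.fna' \
        -o -name '*.fasta.gz' -o -name '*.fna.gz' \) | sort > "$2"
}

# Demo on placeholder data in a temporary directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/genomes/sub"
printf '>seq1\nACGT\n' > "$workdir/genomes/a.fasta"
printf '>seq2\nTTGA\n' | gzip > "$workdir/genomes/sub/b.fna.gz"
printf 'not a genome\n' > "$workdir/genomes/notes.txt"
make_seqlist "$workdir/genomes" "$workdir/seqlist.txt"
cat "$workdir/seqlist.txt"

# Run the compression only when mbgc is actually on PATH.
if command -v mbgc >/dev/null 2>&1; then
    mbgc "$workdir/seqlist.txt" "$workdir/comp.mbgc"
fi
```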
compression of a single FASTA file (in FASTA or gzipped FASTA format):
./mbgc -i input.fasta comp.mbgc
decompression to out folder (which is created if it does not exist) using 80 bases per row DNA formatting:
./mbgc -l 80 -d comp.mbgc out
Please note that decompression overwrites existing files!
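Because -l changes line wrapping, a decompressed file may differ byte-for-byte from its original even when the sequences are identical. A minimal sketch of a wrapping-insensitive comparison follows; normalize_fasta is a hypothetical helper, not part of mbgc, and it assumes Unix line endings.

```shell
#!/bin/sh
# Hypothetical helper: keep header lines as they are and join the sequence
# lines of each record into a single line, so line wrapping no longer matters.
normalize_fasta() {
    awk '
        /^>/ { if (seq != "") print seq; seq = ""; print; next }
        { seq = seq $0 }
        END { if (seq != "") print seq }
    ' "$@"
}

# Demo: the same record wrapped at 4 bases per line and unwrapped.
workdir=$(mktemp -d)
printf '>s1 demo\nACGT\nACGT\nAC\n' > "$workdir/wrapped.fasta"
printf '>s1 demo\nACGTACGTAC\n' > "$workdir/unwrapped.fasta"

a=$(normalize_fasta "$workdir/wrapped.fasta")
b=$(normalize_fasta "$workdir/unwrapped.fasta")
if [ "$a" = "$b" ]; then
    echo "sequences match"
else
    echo "sequences differ"
fi
```

To check a real round trip, pass the original file and its counterpart under the output folder to normalize_fasta in the same way.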
Example data and scripts demonstrating the use of MBGC in basic compression scenarios are located in the example-scripts folder.
Following POSIX convention, a single hyphen character can be used to specify input from or output to the standard input and output streams.
for standard input set <inputFastaFile> to -
for standard output (resp. input) in compression (resp. decompression) set <archiveFile> to -
for standard output set <outputPath> to - (all files are concatenated)
compression of FASTA in standard input data stream (in raw or gzipped FASTA format):
./mbgc -i - comp.mbgc
decompression to standard output (without EOL symbols within DNA sequences):
./mbgc -d comp.mbgc -
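Since decompression to standard output concatenates all files, the stream can be fed directly to other tools. As a sketch, the hypothetical helper count_records below counts FASTA header lines on standard input; the mbgc call is guarded and assumes that a comp.mbgc archive exists in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: count FASTA records (header lines starting with '>') on stdin.
count_records() {
    grep -c '^>'
}

# Demo on an inline two-record stream.
n=$(printf '>a\nAC\n>b\nGT\n' | count_records)
echo "$n"   # prints: 2

# With mbgc installed and an archive present, count records across the whole archive.
if command -v mbgc >/dev/null 2>&1 && [ -f comp.mbgc ]; then
    mbgc -d comp.mbgc - | count_records
fi
```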