This is the code for the paper Succinct k-mer Set Representations Using Subset Rank Queries on the Spectral Burrows-Wheeler Transform (SBWT). The repository includes implementations of the various SBWT variants described in the paper. The data structures answer k-mer membership queries on the input data. Note that contrary to many other k-mer membership data structures, our code is not aware of DNA reverse complements. That is, it considers a k-mer and its reverse complement as separate k-mers.
This construction algorithm is based on the lightning-fast k-mer counter KMC. We call the KMC binaries directly from our code. The construction is very disk-heavy, so it is recommended to run construction code off a fast SSD drive.
git submodule init
git submodule update
# Build the KMC components
cd KMC
make -j4
cd ..
# Build the SBWT code
cd build
cmake .. -DMAX_KMER_LENGTH=32
make -j4
Change the parameter -DMAX_KMER_LENGTH=32 to increase the maximum allowed k-mer length, up to 255. Larger values lead to slower construction and higher disk usage during construction.
Troubleshooting: If you run into problems involving the <filesystem> header, you probably need to update your compiler. The compiler g++-8 should be sufficient. Install a new compiler and direct CMake to use it with the -DCMAKE_CXX_COMPILER option. For example, to set the compiler to g++-8, run CMake with the option -DCMAKE_CXX_COMPILER=g++-8.
Note: the Elias-Fano variants make use of the _pext_u64 instruction in the BMI2 instruction set. Older CPUs might not support this instruction. In that case, we fall back to a simple software implementation, which will ruin the performance of the Elias-Fano variants (those whose variant name starts with "mef").
Below is the command to build the SBWT for input data example_data/coli3.fna provided in this repository, with k = 30. The index is written to the file index.sbwt.
./build/bin/sbwt build -i example_data/coli3.fna -o index.sbwt -k 30
This builds the default variant, which is the plain matrix SBWT. Other variant can be specified with the --variant option.
The list of all command line options and parameters is below:
Construct an SBWT variant.
Usage:
build [OPTION...]
-i, --in-file arg The input sequences as a FASTA or FASTQ
file, possibly gzipped. If the file
extension is .txt, the file is interpreted
as a list of input files, one file on each
line. All input files must be in the same
format.
-o, --out-file arg Output file for the constructed index.
-k, --kmer-length arg The k-mer length.
--variant arg The SBWT variant to build. Available
variants: plain-matrix rrr-matrix
mef-matrix plain-split rrr-split mef-split
plain-concat mef-concat plain-subsetwt
rrr-subsetwt (default: plain-matrix)
--add-reverse-complements
Also add the reverse complement of every
k-mer to the index. Warning: this creates a
temporary reverse-complemented duplicate of
each input file before construction. Make
sure that the directory at --temp-dir can
handle this amount of data. If the input is
gzipped, the duplicate will also be
compressed, which might take a while.
--no-streaming-support Save space by not building the streaming
query support bit vector. This leads to
slower queries.
-t, --n-threads arg Number of parallel threads. (default: 1)
-a, --min-abundance arg Discard all k-mers occurring fewer than
this many times. By default we keep all
k-mers. Note that we consider a k-mer
distinct from its reverse complement.
(default: 1)
-b, --max-abundance arg Discard all k-mers occurring more than this
many times. (default: 1000000000)
-m, --ram-gigas arg RAM budget in gigabytes (not strictly
enforced). Must be at least 2. (default: 2)
-d, --temp-dir arg Location for temporary files. (default: .)
-h, --help Print usage
To query for existence of all k-mers in an index for all sequences in a fastq-file, run the following command:
./build/bin/sbwt search -i index.sbwt -q example_data/queries.fastq -o out.txt
This prints for each query of length n in the input a line containing n-k+1 space-separated integers, which are the ranks of the columns representing the k-mer in the index. If the k-mer is not found, -1 is printed. If the index was built streaming support (which is the default), the faster streaming query algorithm is automatically used. The full options are:
Query all k-mers of all input reads.
Usage:
search [OPTION...]
-o, --out-file arg Output filename.
-i, --index-file arg Index input file.
-q, --query-file arg The query in FASTA or FASTQ format, possibly
gzipped. Multi-line FASTQ is not supported. If the
file extension is .txt, this is interpreted as a
list of query files, one per line. In this case,
--out-file is also interpreted as a list of output
files in the same manner, one line for each input
file.
-z, --gzip-output Writes output in gzipped form. This can shrink the
output files by an order of magnitude.
-h, --help Print usage
git submodule init
git submodule update
# Build googletest
cd googletest
mkdir build
cd build
cmake ..
make
# Build SBWT
cd ../../build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTS=1 -DMAX_KMER_LENGTH=32
make
This will build the executable ./build/bin/sbwt_tests. Make sure to run the test executable from the root of the repository, or otherwise it will not find the example data in ./example_data.
The command-line parsing and the gzip support are implemented using cxxopts and zstr respectively.