SBWT

This is the code for the paper Succinct k-mer Set Representations Using Subset Rank Queries on the Spectral Burrows-Wheeler Transform (SBWT). The repository includes implementations of the various SBWT variants described in the paper. The data structures answer k-mer membership queries on the input data. Note that contrary to many other k-mer membership data structures, our code is not aware of DNA reverse complements. That is, it considers a k-mer and its reverse complement as separate k-mers.

This construction algorithm is based on the lightning-fast k-mer counter KMC. We call the KMC binaries directly from our code. The construction is very disk-heavy, so it is recommended to run construction code off a fast SSD drive.

Compiling

git submodule init
git submodule update

# Build the KMC components
cd KMC
make -j4
cd ..

# Build the SBWT code
cd build
cmake .. -DMAX_KMER_LENGTH=32
make -j4

Change the parameter -DMAX_KMER_LENGTH=32 to increase the maximum allowed k-mer length, up to 255. Larger values lead to slower construction and higher disk usage during construction.

Troubleshooting: If you run into problems involving the <filesystem> header, you probably need to update your compiler. The compiler g++-8 should be sufficient. Install a new compiler and direct CMake to use it with the -DCMAKE_CXX_COMPILER option. For example, to set the compiler to g++-8, run CMake with the option -DCMAKE_CXX_COMPILER=g++-8.

Note: the Elias-Fano variants make use of the _pext_u64 instruction in the BMI2 instruction set. Older CPUs might not support this instruction. In that case, we fall back to a simple software implementation, which will ruin the performance of the Elias-Fano variants (those whose variant name starts with "mef").

Index construction

Below is the command to build the SBWT for input data example_data/coli3.fna provided in this repository, with k = 30. The index is written to the file index.sbwt.

./build/bin/sbwt build -i example_data/coli3.fna -o index.sbwt -k 30

This builds the default variant, which is the plain matrix SBWT. Other variant can be specified with the --variant option. The list of all command line options and parameters is below:

Construct an SBWT variant.
Usage:
  build [OPTION...]

  -i, --in-file arg             The input sequences as a FASTA or FASTQ 
                                file, possibly gzipped. If the file 
                                extension is .txt, the file is interpreted 
                                as a list of input files, one file on each 
                                line. All input files must be in the same 
                                format.
  -o, --out-file arg            Output file for the constructed index.
  -k, --kmer-length arg         The k-mer length.
      --variant arg             The SBWT variant to build. Available 
                                variants: plain-matrix rrr-matrix 
                                mef-matrix plain-split rrr-split mef-split 
                                plain-concat mef-concat plain-subsetwt 
                                rrr-subsetwt (default: plain-matrix)
      --add-reverse-complements
                                Also add the reverse complement of every 
                                k-mer to the index. Warning: this creates a 
                                temporary reverse-complemented duplicate of 
                                each input file before construction. Make 
                                sure that the directory at --temp-dir can 
                                handle this amount of data. If the input is 
                                gzipped, the duplicate will also be 
                                compressed, which might take a while.
      --no-streaming-support    Save space by not building the streaming 
                                query support bit vector. This leads to 
                                slower queries.
  -t, --n-threads arg           Number of parallel threads. (default: 1)
  -a, --min-abundance arg       Discard all k-mers occurring fewer than 
                                this many times. By default we keep all 
                                k-mers. Note that we consider a k-mer 
                                distinct from its reverse complement. 
                                (default: 1)
  -b, --max-abundance arg       Discard all k-mers occurring more than this 
                                many times. (default: 1000000000)
  -m, --ram-gigas arg           RAM budget in gigabytes (not strictly 
                                enforced). Must be at least 2. (default: 2)
  -d, --temp-dir arg            Location for temporary files. (default: .)
  -h, --help                    Print usage

Running queries

To query for existence of all k-mers in an index for all sequences in a fastq-file, run the following command:

./build/bin/sbwt search -i index.sbwt -q example_data/queries.fastq -o out.txt

This prints for each query of length n in the input a line containing n-k+1 space-separated integers, which are the ranks of the columns representing the k-mer in the index. If the k-mer is not found, -1 is printed. If the index was built streaming support (which is the default), the faster streaming query algorithm is automatically used. The full options are:

Query all k-mers of all input reads.
Usage:
  search [OPTION...]

  -o, --out-file arg    Output filename.
  -i, --index-file arg  Index input file.
  -q, --query-file arg  The query in FASTA or FASTQ format, possibly 
                        gzipped. Multi-line FASTQ is not supported. If the 
                        file extension is .txt, this is interpreted as a 
                        list of query files, one per line. In this case, 
                        --out-file is also interpreted as a list of output 
                        files in the same manner, one line for each input 
                        file.
  -z, --gzip-output     Writes output in gzipped form. This can shrink the 
                        output files by an order of magnitude.
  -h, --help            Print usage

For developers: building and running the tests

git submodule init
git submodule update

# Build googletest
cd googletest
mkdir build
cd build
cmake ..
make

# Build SBWT
cd ../../build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTS=1 -DMAX_KMER_LENGTH=32
make

This will build the executable ./build/bin/sbwt_tests. Make sure to run the test executable from the root of the repository, or otherwise it will not find the example data in ./example_data.

Acknowledgements

The command-line parsing and the gzip support are implemented using cxxopts and zstr respectively.

Name		Name	Last commit message	Last commit date
Latest commit History 126 Commits
KMC @ a646d8a		KMC @ a646d8a
build		build
ecoli_dataset		ecoli_dataset
example_data		example_data
googletest @ 8d51dc5		googletest @ 8d51dc5
include		include
sdsl-lite @ 9ccbba4		sdsl-lite @ 9ccbba4
src		src
temp		temp
tests		tests
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE.txt		LICENSE.txt
README.md		README.md
version.h.in		version.h.in

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SBWT

Compiling

Index construction

Running queries

For developers: building and running the tests

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SBWT

Compiling

Index construction

Running queries

For developers: building and running the tests

Acknowledgements

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages