This is the main repo of the AnyBlox project, containing the runtime, dataset utilities, benchmark tools, and the C++ integration.
The main runtime lives in the `anyblox` crate. It produces the library and the `anyblox-cli` binary.
In `anyblox-bench` there is a simple CLI app that runs a given workload and outputs a set of metrics. Check the CLI documentation for details.
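As a minimal sketch, the benchmark app can be invoked through cargo; the arguments below are assumptions, so consult the CLI's own documentation for the real interface:

```sh
# Hypothetical invocation; the flags are assumptions,
# check the benchmark CLI's documentation for the real ones.
cargo run -p anyblox-bench --release -- --help
```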
In `anyblox-datafusion` there is the DataFusion table source for reading AnyBlox data.
In `anyblox-format` we define the serialization and deserialization of AnyBlox metadata files. This is a library only; most other crates depend on it.
In `anyblox-tools` there are three important tools for working with and testing AnyBlox; see the usage sketch after the list below.
- The `bundler` binary creates a full `.anyblox` bundle from a metadata file, a wasm decoder, and optional data (for self-contained datasets).
- The `viewer` can open an existing bundle and read its metadata, verifying correctness.
- The `anyblox2csv` tool can run AnyBlox on an existing bundle and output the results to a CSV file. This allows testing by comparing the output with the input dataset that was compressed in the first place.
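A hedged sketch of running these tools: the `cargo run -p anyblox-tools --bin …` form mirrors the build command given later in this document, but the positional arguments are assumptions, so check each tool's own help output:

```sh
# Hypothetical invocations; the arguments are assumptions.
cargo run -p anyblox-tools --bin bundler -- bundle-config.toml
cargo run -p anyblox-tools --bin viewer -- dataset.anyblox
cargo run -p anyblox-tools --bin anyblox2csv -- dataset.anyblox > roundtrip.csv
```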
In `compress` the compression algorithms used by the project are defined.
In `dataset-utils` there are scripts that each compress a specific dataset with a predefined compression scheme. For example, `single_column_rle` can compress any
single-column integer dataset with RLE, while `taxpayer` is meant specifically for the Taxpayer Public BI dataset.
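As a hedged example, assuming these scripts are cargo binaries inside `dataset-utils` (the argument shape and paths here are hypothetical):

```sh
# Hypothetical: compress a single-column integer CSV with RLE.
# The binary name comes from the text; the arguments are assumptions.
cargo run -p dataset-utils --bin single_column_rle -- input.csv dataset/output.bin
```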
In `decompress` all AnyBlox decompression algorithms are defined. They use the common `decoder-lib` utility library for wasm, which defines helpers, the expected return bundle type,
and a wasm memory allocator.
The C++ and JNI integrations are in `anyblox-cpplib`. The `rust` directory contains the C API, built as a static library.
In `cpp` the `libanyblox-cpp` library is defined, using the static Rust library as a dependency. In `jni` the JNI bindings are defined so that
the interface can be loaded from JVM-based engines like Spark.
In `vendored` live our forked versions of Rust's `arrow` and `parquet` crates.
In `anyblox-parquet` lives an entirely experimental prototype of a Parquet writer for AnyBlox.
You need Rust to build the Rust parts, as well as the mold linker (on Debian-based systems, `apt install mold`).
Having `just` installed helps with automation.
For C++ the prerequisite is an installation of the C++ Arrow library so that CMake can detect it with `find_package`.
The recommended way is to build `arrow/cpp` from source separately; follow the Arrow documentation.
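An abbreviated sketch of a from-source Arrow build; the exact CMake options depend on your setup, so treat this only as a starting point and follow the Arrow build docs:

```sh
# Abbreviated; see the Arrow C++ build docs for the full set of options.
git clone https://github.com/apache/arrow.git
cd arrow/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
sudo cmake --install build
```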
To install `libanyblox` system-wide, go to `anyblox-cpplib` and run `just install`.
For `anyblox-spark` you need `sbt`, Maven, and a JDK. The recommended way is to manage these dependencies with SDKMAN!.
The recommended JDK is `11.0.25-tem`. The required Scala version is 2.12.
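For example, with SDKMAN! installed, the stated JDK and the build tools can be set up like this (standard `sdk install` usage; everything except the `11.0.25-tem` identifier uses the default candidates):

```sh
# Install the recommended Temurin JDK plus the build tools via SDKMAN!.
sdk install java 11.0.25-tem
sdk install sbt
sdk install maven
```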
Run `just` in the root directory to build the library, the CLI, all wasm decoders, and the benchmarks.
To build a specific tool use cargo with the package specifier, e.g. `cargo build -p anyblox-tools --bin bundler` builds the `bundler`.
The datasets we use are mostly from the Public BI benchmark.
`Bimbo` contains integer columns that are great for RLE compression. `Taxpayer` contains string columns for FSST compression.
We also use the TPC-H lineitem table for compressing with RLE+Bitpacking.
To make a bundle, first use `dataset-utils` to extract data from a CSV dataset into a compressed binary format.
Then use the `bundler` tool to create a bundle; see the sketch below. Example TOML configurations are in `anyblox-tools/example-bundle`, e.g. there is
`Bimbo-rle-ext.toml`, which produces an extension that works for any of the Bimbo columns compressed with RLE.
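A hedged sketch of this step, using the example configuration named above (whether `bundler` takes the TOML path as its sole positional argument is an assumption):

```sh
# Hypothetical: build a bundle from the example TOML config.
# The argument shape is an assumption; check the bundler's help output.
cargo run -p anyblox-tools --bin bundler -- anyblox-tools/example-bundle/Bimbo-rle-ext.toml
```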
The best place for datasets is a `dataset` directory at the top level (it is gitignored).
To run the bundle you can use `anyblox-cli` or `anyblox-bench`. Type `help` in the CLI to get the list of commands.
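As a final sketch, the interactive CLI can be launched through cargo (the package and binary names come from the crate description above; the `--release` flag is just a suggestion):

```sh
# Launch the interactive CLI built from the `anyblox` crate,
# then type `help` at the prompt for the list of commands.
cargo run -p anyblox --bin anyblox-cli --release
```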