PDX is a vertical data layout for vectors that stores the dimensions of different vectors together. In a nutshell, PDX is PAX for vector similarity search.
- ✂️ Efficient pruning of dimensions with partial distance calculations.
- ⚡ Up to 7x faster IVF queries (compared to FAISS+AVX512) when pairing PDX with the pruning algorithm ADSampling.
- ⚡ Up to 13x faster exhaustive search thanks to pruning.
- ⚡ Raw distance kernels (no pruning) in PDX are up to 1.6x faster than the `float32` kernels available in SimSIMD (USearch) and FAISS.
  - Why? Distance kernels in PDX are free of dependencies and have fewer LOAD/STORE operations.
  - Distance kernels can auto-vectorize efficiently without explicit SIMD for `float32`.
  - Distance kernels on small vectors (`d < 16`) are up to 8x faster than SIMD kernels in SimSIMD.
- More efficient compressed representations of vectors (WIP).
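As a rough illustration of the layout, here is a minimal NumPy sketch (not the repository's C++ implementation; the sizes and the `block_size` value are only for illustration):

```python
# Minimal sketch of a PDX-style (vertical, PAX-like) layout in NumPy.
# NOT the library's actual storage format; sizes are illustrative.
import numpy as np

n, d, block_size = 1_000, 128, 64
vectors = np.random.rand(n, d).astype(np.float32)   # horizontal: row i holds all d dims of vector i

# Vertical blocks: within each block, dimension j of every vector in the
# block is stored contiguously (a per-block transpose).
blocks = [vectors[i:i + block_size].T.copy()         # each block has shape (d, <=block_size)
          for i in range(0, n, block_size)]
```

Storing blocks dimension-major lets a kernel stream one dimension of many vectors at a time, which is what makes per-segment pruning and auto-vectorization effective.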
Pruning means avoiding checking all the dimensions of a vector to determine whether it will make it into the KNN of a query. Pruning speeds up vector similarity search because (i) less data must be fetched and (ii) fewer computations must be done.
However, pruning methods that do partial distance calculations have a hard time keeping up with SIMD-optimized kernels like the ones in FAISS and SimSIMD.
It is only thanks to the PDX layout that pruning methods outperform SIMD-optimized kernels on all CPU microarchitectures (Intel, AMD Zen, AWS Graviton). This is because, in PDX, distance calculations are efficient even on small vector segments, and the evaluation of the pruning predicate is not interleaved with distance calculations.
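The idea of partial distance calculations with pruning can be sketched on a vertical block as follows (illustrative only; the function `prune_block`, the fixed segment size, and the plain threshold test are assumptions, not the actual ADSampling or PDX-BOND predicate):

```python
# Sketch of pruning via partial distances on one vertical (d, m) block.
# Hypothetical helper, not part of the pdxearch API.
import numpy as np

def prune_block(block, query, threshold, segment=16):
    """block: (d, m) dimension-major; query: (d,); threshold: current k-th NN distance."""
    d, m = block.shape
    alive = np.arange(m)                       # candidates still in the running
    partial = np.zeros(m, dtype=np.float32)    # partial squared L2 distances
    for start in range(0, d, segment):
        seg = block[start:start + segment][:, alive]      # only surviving candidates
        diff = seg - query[start:start + segment, None]
        partial[alive] += np.sum(diff * diff, axis=0)
        alive = alive[partial[alive] <= threshold]        # evaluate the pruning predicate per segment
        if alive.size == 0:
            break
    return alive, partial[alive]
```

With the distance of the current k-th nearest neighbor as `threshold`, most candidates are discarded after only a few dimension segments, while each segment is still processed with contiguous, vectorizable loops.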
Pruning algorithms are especially effective when:
- Vectors are of high dimensionality (`d > 512`).
- High recalls are needed (`> 0.90`).
- Exact results are needed.
- `k` is relatively low (`k = 10, 20`).
Below are some quick benchmarks of our Python bindings vs FAISS. The complete benchmarks are available in our publication, where you will also find details on how we adapted these novel pruning algorithms to work in PDX.
We also refer to the recent research on pruning algorithms with partial distance calculations: ADSampling, DDC (previously named BSA), and DADE. All of these rely on rotating the vector collection to prune effectively. Alongside PDX, we also introduce PDX-BOND, a simpler pruning algorithm that does not need to rotate the vectors.
You can quickly try PDX with your own data using our Python bindings and examples. We have implemented PDX on flat IVF indexes and in exhaustive search settings.
- Python 3.11 or higher
- FAISS with Python Bindings
- Clang++17 or higher
- CMake 3.26 or higher
- Clone the repository and init submodules (`Eigen` for efficient matrix operations and `pybind11`):
```sh
git clone https://github.com/cwida/PDX
git submodule init
git submodule update
```
- Install Python dependencies and the `pdxearch` Python bindings:
```sh
export CXX="/usr/bin/clang++-18" # Set proper CXX first
pip install -r requirements.txt
python setup.py clean --all
python -m pip install .
```
- Run the examples under `/examples`:
```sh
# Creates an IVF index with FAISS on random data
# Then, it compares the search performance of PDXearch and FAISS
python ./examples/pdxearch_simple.py
```
For more details on the available examples and how to use your own data, refer to `/examples/README.md`.
- We heavily rely on FAISS to create the underlying IVF indexes.
- PDX is an ongoing research project. In its current state, it is not production-quality code.
PDX paired with ADSampling on IVF indexes works great in most scenarios with less than 0.001 recall loss. The higher the dimensionality, the higher the gains from pruning. The following benchmarks are from /examples/pdxearch_ivf.py.
NOTE that in these benchmarks: (i) both FAISS and PDXearch scan exactly the same vectors, and (ii) the recall loss of ADSampling is always less than 0.001.
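For reference, the FAISS baseline in such a comparison can be set up roughly as follows (a hedged sketch, not the exact code of `/examples/pdxearch_ivf.py`; the dataset sizes, `nlist`, and `nprobe` values are placeholders that would be tuned to hit the reported recall targets):

```python
# Rough sketch of timing a FAISS IVFFlat baseline; sizes and parameters are
# illustrative placeholders, not the benchmark's actual configuration.
import time
import faiss
import numpy as np

d, n, nq, k = 128, 100_000, 1_000, 10
data = np.random.rand(n, d).astype(np.float32)
queries = np.random.rand(nq, d).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)   # nlist = 1024 (illustrative)
index.train(data)
index.add(data)

index.nprobe = 32   # raise or lower nprobe to trade recall for speed
start = time.perf_counter()
distances, labels = index.search(queries, k)
print(f"avg. query time: {(time.perf_counter() - start) / nq * 1e3:.2f} ms")
```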
| Avg. query time [Intel SPR \| r7iz.2x] | FAISS AVX512 (R@10: 0.99 · 0.95 · 0.90) | PDXearch | Improvement |
|---|---|---|---|
| DBPedia · d=1536 · 1M | 53.6 · 18.0 · 7.7 ms | 7.4 · 3.7 · 2.4 ms | 7.2x · 4.9x · 3.2x |
| arXiv · d=768 · 2.25M | 25.9 · 7.7 · 3.5 | 6.2 · 2.8 · 1.7 | 4.2x · 2.8x · 2.1x |
| SIFT · d=128 · 1M | 1.7 · 0.6 · 0.4 | 1.0 · 0.4 · 0.3 | 1.7x · 1.5x · 1.3x |
| Avg. query time [Graviton 4 \| r8g.2x] | FAISS SVE (R@10: 0.99 · 0.95 · 0.90) | PDXearch | Improvement |
|---|---|---|---|
| DBPedia · d=1536 · 1M | 47.1 · 18.4 · 6.7 ms | 7.1 · 4.1 · 2.5 ms | 6.6x · 4.5x · 2.7x |
| arXiv · d=768 · 2.25M | 25.3 · 7.0 · 3.2 | 5.9 · 2.7 · 1.7 | 4.3x · 2.6x · 1.9x |
| SIFT · d=128 · 1M | 1.1 · 0.5 · 0.3 | 0.7 · 0.4 · 0.2 | 1.6x · 1.3x · 1.3x |
In PDX, building an IVF index can significantly improve exact search speed: the index lets the search visit the most promising clusters first, which tightens the pruning bounds early even though every vector is still scanned. The following benchmarks are from `/examples/pdxearch_ivf_exhaustive.py`.
| Avg. query time [Intel SPR \| r7iz.2x] | FAISS AVX512 | PDXearch | Improvement |
|---|---|---|---|
| DBPedia · d=1536 · 1M | 411.7 ms | 34.9 ms | 11.8x |
| GIST · d=960 · 1M | 252.9 | 27.6 | 9.1x |
| arXiv · d=768 · 2.25M | 454.5 | 41.4 | 11.0x |
| SIFT · d=128 · 1M | 34.4 | 6.5 | 5.3x |
| Avg. query time [Graviton 4 \| r8g.2x] | FAISS SVE | PDXearch | Improvement |
|---|---|---|---|
| DBPedia · d=1536 · 1M | 277.2 ms | 21.7 ms | 12.8x |
| GIST · d=960 · 1M | 170.9 | 19.4 | 8.8x |
| arXiv · d=768 · 2.25M | 306.0 | 27.2 | 11.3x |
| SIFT · d=128 · 1M | 21.3 | 4.7 | 4.5x |
Use PDX-BOND, our pruning algorithm. Here, vectors are not transformed, and no additional index is used. Gains vary depending on how values are distributed across the dimensions. The following benchmarks are from `/examples/pdxearch_exact_bond.py`.
| Avg. query time [Intel SPR \| r7iz.2x] | FAISS AVX512 | PDXearch | Improvement |
|---|---|---|---|
| DBPedia · d=1536 · 1M | 374 ms | 216 ms | 1.7x |
| arXiv · d=768 · 2.25M | 422 | 212 | 2.0x |
| MSong · d=420 · 1M | 117 | 15 | 7.8x |
| SIFT · d=128 · 1M | 31 | 10 | 3.1x |
| Avg. query time [Graviton 4 \| r8g.2x] | FAISS SVE | PDXearch | Improvement |
|---|---|---|---|
| DBPedia · d=1536 · 1M | 278 ms | 139 ms | 2.0x |
| arXiv · d=768 · 2.25M | 305 | 155 | 2.0x |
| MSong · d=420 · 1M | 70 | 15 | 4.7x |
| SIFT · d=128 · 1M | 22 | 11 | 2.0x |
PDX distance kernels are also faster than state-of-the-art SIMD kernels on all major architectures, relying only on auto-vectorization (for `float32`). The following benchmarks are from `/examples/pdx_brute.py`.
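For intuition on why auto-vectorization suffices, here is the access pattern of a vertical kernel sketched in NumPy (the repository's actual kernels are plain C++ loops over fixed-size blocks; `l2_block_vertical` is an illustrative name, not part of the library):

```python
# Access pattern of a dimension-major (vertical) distance kernel.
# NumPy sketch for illustration; the real kernels are C++ loops that
# compilers auto-vectorize without explicit SIMD intrinsics.
import numpy as np

def l2_block_vertical(block, query):
    """block: (d, m) dimension-major PDX block; query: (d,). Returns m squared L2 distances."""
    d, m = block.shape
    acc = np.zeros(m, dtype=np.float32)
    for j in range(d):                  # outer loop over dimensions
        diff = block[j] - query[j]      # contiguous slice: dimension j of every vector in the block
        acc += diff * diff              # this inner (per-block) loop is the one that vectorizes
    return acc
```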
| Avg. query time [Intel SPR \| r7iz.2x] | USearch | FAISS AVX512 | PDXearch | Improvement |
|---|---|---|---|---|
| GIST · d=960 · n=1M | 260 ms | 247 ms | 208 ms | 1.3x · 1.2x |
| STL · d=9216 · n=90K | 223 | 220 | 176 | 1.3x · 1.3x |
| Avg. query time [Graviton 4 \| r8g.2x] | USearch | FAISS SVE | PDXearch | Improvement |
|---|---|---|---|---|
| GIST · d=960 · n=1M | 203 ms | 172 ms | 124 ms | 1.6x · 1.4x |
| STL · d=9216 · n=90K | 175 | 160 | 109 | 1.6x · 1.5x |
Refer to our publication for the complete benchmarks of PDXearch on 4 microarchitectures: Intel SPR, Zen 4, Zen 3, and Graviton 4. There, you will also find a comparison against ADSampling in the horizontal layout (`.fvecs` layout), as well as details on how we adapted pruning algorithms to work in PDX.
- Compression: The vertical layout opens opportunities for compression, as indexing algorithms group together vectors that share some numerical similarity within their dimensions. The next step for PDX is to apply our scalar quantization algorithm LEP, which uses database compression techniques (ALP) to compress vectors with higher compression ratios and less information loss.
- More data types: For compressed vectors, we need to implement vertical distance kernels on vectors of variable bit size.
- PDX in HNSW: For this, we need a layout similar to the ones proposed in Starling or AiSAQ, in which neighborhoods of the graph are stored and fetched in blocks.
- Improve code readability and usability.
- Add a testing framework.
- Add the DDC or DADE algorithms to the Python bindings.
To run our benchmark suite in C++, refer to BENCHMARKING.md.
The code used for the experiments presented at SIGMOD'25 can be found in the `sigmod` branch.