pq-vector

Vector Search with only Parquet and DataFusion

Features

Embedded index: The index is stored inside the Parquet file itself, with no separate index files.
Standard-compatible: Indexed files remain valid Parquet, so DuckDB, pandas, and other readers can open them normally.
DataFusion integration: Ergonomic vector search with plain SQL.
Zero-copy: Indexes Parquet files in place, without copying the data.

Quick start

1) Build an index

use pq_vector::IndexBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    IndexBuilder::new(
        "data/embeddings.parquet", // Source file (indexed in-place by default)
        "embedding",               // Column name containing vectors
    )
    .n_clusters(100)
    .max_iters(20)
    .seed(42)
    .build_inplace()?;

    // Optional: write to a new file instead of in-place
    IndexBuilder::new("data/embeddings.parquet", "embedding")
        .build_new("data/embeddings_indexed.parquet")?;

    Ok(())
}
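
The builder options above (n_clusters, max_iters, seed) read like the parameters of a k-means clustering pass. As a rough illustration of what such parameters usually control, and not a description of pq-vector's actual implementation, here is a minimal standard-library sketch of seeded k-means. The names `kmeans`, `nearest`, `squared_l2`, and the tiny `Lcg` generator are hypothetical helpers introduced only for this example.

```rust
/// Tiny linear congruential generator so the sketch needs no external crates.
/// (Hypothetical helper; real code would use a proper RNG.)
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random index in `0..bound`.
    fn next_index(&mut self, bound: usize) -> usize {
        // Multiplier and increment from Knuth's MMIX generator.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % bound
    }
}

/// Squared Euclidean distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid closest to `v`.
fn nearest(centroids: &[Vec<f32>], v: &[f32]) -> usize {
    let mut best = 0;
    let mut best_d = f32::INFINITY;
    for (i, c) in centroids.iter().enumerate() {
        let d = squared_l2(c, v);
        if d < best_d {
            best = i;
            best_d = d;
        }
    }
    best
}

/// Lloyd's k-means: returns `n_clusters` centroids after at most `max_iters` rounds.
/// The same `seed` always yields the same centroids.
fn kmeans(data: &[Vec<f32>], n_clusters: usize, max_iters: usize, seed: u64) -> Vec<Vec<f32>> {
    assert!(!data.is_empty() && n_clusters > 0 && n_clusters <= data.len());
    let dim = data[0].len();
    let mut rng = Lcg(seed);

    // Seeded initialization: pick distinct rows as the starting centroids.
    let mut chosen: Vec<usize> = Vec::new();
    while chosen.len() < n_clusters {
        let i = rng.next_index(data.len());
        if !chosen.contains(&i) {
            chosen.push(i);
        }
    }
    let mut centroids: Vec<Vec<f32>> = chosen.iter().map(|&i| data[i].clone()).collect();

    for _ in 0..max_iters {
        // Assignment step: accumulate each vector into its nearest cluster.
        let mut sums = vec![vec![0.0f32; dim]; n_clusters];
        let mut counts = vec![0usize; n_clusters];
        for v in data {
            let c = nearest(&centroids, v);
            counts[c] += 1;
            for (s, x) in sums[c].iter_mut().zip(v) {
                *s += x;
            }
        }
        // Update step: move each centroid to the mean of its members.
        let mut changed = false;
        for c in 0..n_clusters {
            if counts[c] == 0 {
                continue; // leave the centroid of an empty cluster unchanged
            }
            let new: Vec<f32> = sums[c].iter().map(|s| s / counts[c] as f32).collect();
            if new != centroids[c] {
                changed = true;
                centroids[c] = new;
            }
        }
        if !changed {
            break; // converged before reaching max_iters
        }
    }
    centroids
}
```

In this sketch a larger n_clusters gives smaller clusters, max_iters caps the refinement work, and a fixed seed makes the build reproducible. Whether pq-vector uses these options the same way is an assumption.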

2) Search with Rust

use pq_vector::TopkBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let query_vector: Vec<f32> = vec![/* your query embedding */];

    let results = TopkBuilder::new("data/embeddings_indexed.parquet", &query_vector)
        .k(10)?
        .nprobe(5)?
        .search()
        .await?;

    for result in results {
        println!("Row {}: distance {:.4}", result.row_idx, result.distance);
    }

    Ok(())
}

3) DataFusion SQL

use datafusion::execution::SessionStateBuilder;
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use pq_vector::df_vector::{PqVectorSessionBuilderExt, VectorTopKOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let indexed = "data/embeddings_indexed.parquet";

    let options = VectorTopKOptions {
        nprobe: 8,
        max_candidates: None,
    };
    let state = SessionStateBuilder::new()
        .with_default_features()
        .with_pq_vector(options) // ENABLE pq-vector here!
        .build();
    let ctx = SessionContext::new_with_state(state);

    ctx.register_parquet("t", indexed, ParquetReadOptions::default())
        .await?;

    let df = ctx
        .sql(
            r#"
            SELECT id
            FROM t
            WHERE id >= 100
            ORDER BY array_distance(embedding, [0.0, 0.0])
            LIMIT 5
            "#,
        )
        .await?;
    let _batches = df.collect().await?;
    Ok(())
}
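
For reference, the query above asks for the five ids with id >= 100 whose embedding lies closest to the literal vector, where DataFusion's array_distance is the Euclidean distance. A brute-force equivalent can be handy for checking approximate results on small data. The sketch below uses only the standard library; the `Row` struct and the `exact_topk` and `euclidean` functions are hypothetical names for this example, not part of pq-vector.

```rust
/// One table row: an id plus its embedding (hypothetical shape for this example).
struct Row {
    id: i64,
    embedding: Vec<f32>,
}

/// Euclidean (L2) distance between two equal-length vectors.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Exact, full-scan equivalent of:
///   SELECT id FROM t WHERE id >= min_id
///   ORDER BY array_distance(embedding, query) LIMIT limit
fn exact_topk(rows: &[Row], query: &[f32], min_id: i64, limit: usize) -> Vec<i64> {
    // Apply the WHERE filter first, then score every surviving row.
    let mut scored: Vec<(f32, i64)> = rows
        .iter()
        .filter(|r| r.id >= min_id)
        .map(|r| (euclidean(&r.embedding, query), r.id))
        .collect();
    // Ascending distance; ties keep input order (SQL leaves tie order unspecified).
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.into_iter().take(limit).map(|(_, id)| id).collect()
}
```

Comparing the ids from `exact_topk` with the ids returned by the indexed query gives a simple recall check when tuning nprobe.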

If you already have a custom SessionConfig, enable pq-vector like this:

use datafusion::execution::SessionStateBuilder;
use datafusion::prelude::SessionConfig;
use pq_vector::df_vector::{PqVectorSessionBuilderExt, PqVectorSessionConfigExt, VectorTopKOptions};

let options = VectorTopKOptions {
    nprobe: 8,
    max_candidates: None,
};
let config = SessionConfig::new()
    .with_target_partitions(2)
    .with_pq_vector();
let state = SessionStateBuilder::new()
    .with_default_features()
    .with_pq_vector(options)
    .with_config(config)
    .build();

Notes

k sets how many nearest neighbors are returned.
nprobe trades speed for recall: a higher value searches more of the index, which is more accurate but slower.
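
To make the nprobe trade-off concrete, here is a minimal sketch of cluster-based (inverted-file) search, assuming the index groups vectors under centroids and a query scans only the nprobe closest groups. That layout is an assumption based on the n_clusters option, not something this README confirms, and `IvfIndex` and `ivf_search` are hypothetical names.

```rust
/// A toy inverted-file index: `lists[i]` holds the (row index, vector) pairs
/// assigned to `centroids[i]`. (Hypothetical structure for illustration only.)
struct IvfIndex {
    centroids: Vec<Vec<f32>>,
    lists: Vec<Vec<(usize, Vec<f32>)>>,
}

/// Squared Euclidean distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Scans only the `nprobe` clusters whose centroids are closest to `query`
/// and returns up to `k` nearest rows as (row index, squared distance).
fn ivf_search(index: &IvfIndex, query: &[f32], k: usize, nprobe: usize) -> Vec<(usize, f32)> {
    // Rank clusters by how close their centroid is to the query.
    let mut order: Vec<usize> = (0..index.centroids.len()).collect();
    order.sort_by(|&a, &b| {
        squared_l2(&index.centroids[a], query)
            .total_cmp(&squared_l2(&index.centroids[b], query))
    });

    // Score candidates from the first `nprobe` clusters only.
    let mut hits: Vec<(usize, f32)> = order
        .iter()
        .take(nprobe)
        .flat_map(|&c| index.lists[c].iter())
        .map(|(row, v)| (*row, squared_l2(v, query)))
        .collect();

    // Keep the k closest candidates.
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(k);
    hits
}
```

With centroids at [0, 0] and [10, 0], a query at [4, 0], and one row per cluster ([-1, 0] and [6, 0]), nprobe = 1 scans only the first cluster and returns [-1, 0], while nprobe = 2 also scans the second and finds the true nearest neighbor [6, 0]. Scanning more clusters costs more distance computations, which is the speed side of the trade-off.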

License

MIT or Apache
