#reranking #cross-encoder #semantic-search #nlp #search

flash_rerank

Core reranking engine — cross-encoder and ColBERT inference via ONNX Runtime

1 unstable release

Uses new Rust 2024

0.2.0 Apr 2, 2026
0.1.1 Mar 28, 2026
0.1.0 Mar 28, 2026

#2219 in Algorithms


103 downloads per month
Used in 3 crates

AGPL-3.0-or-later

155KB
3.5K SLoC

Flash-Rerank

Rust · Python · WASM · CLI

The fastest reranker in the world. Pure Rust. No GPU required.



72ms to rerank 100 documents on CPU. No GPU, no cloud API, no Python in the inference path. Pure Rust with INT8 quantization, parallel sub-batch scoring, and zero-copy tokenization.

Combined with BM25-Turbo, Flash-Rerank searches 8.8 million documents and semantically reranks the top 100 in 80ms, which is less time than the hosted rerankers benchmarked below (Jina, Cohere, Voyage) take to rerank 100 pre-selected documents alone.

import flash_rerank

reranker = flash_rerank.load("cross-encoder/ms-marco-MiniLM-L-6-v2")
results = reranker.rerank("what is machine learning?", [
    "ML is a subset of artificial intelligence",
    "The weather today is sunny",
    "Deep learning uses neural networks",
], top_k=2)

Performance

Benchmarked on Intel i9-12900H (14 cores, 20 threads), RTX 3080 Ti Laptop GPU, 32GB RAM. Model: cross-encoder/ms-marco-MiniLM-L-6-v2 (22M params, INT8 quantized). Parallel sub-batch inference with 8 workers. All benchmarks are reproducible via cargo bench.

Competitive Comparison — Reranking 100 Documents

Competitive comparison chart

| Provider | Latency | Type | vs Flash-Rerank |
|---|---|---|---|
| Flash-Rerank (Parallel CPU INT8) | 72ms | Local, open-source | baseline |
| Flash-Rerank + BM25-Turbo | 80ms | Search 8.8M docs + rerank | Entire pipeline |
| Jina Reranker v3 | 188ms | Cloud API, GPU | 2.6x slower |
| Cohere Rerank 3.5 | 595ms | Cloud API | 8.3x slower |
| Voyage Rerank 2.5 | 603ms | Cloud API | 8.4x slower |

Key insight: Our full pipeline (keyword search across 8.8M documents + semantic reranking of top 100) completes in 80ms — faster than Jina, Cohere, or Voyage can rerank 100 pre-selected documents alone. And we're running on a laptop CPU.

Latency Scaling

Latency scaling chart

| Documents | P50 Latency | Per-Doc Cost | QPS |
|---|---|---|---|
| 1 | 2.7ms | 2.7ms | 353 |
| 10 | 10.3ms | 1.0ms | 96 |
| 50 | 39.5ms | 0.8ms | 25 |
| 100 | 72ms | 0.7ms | 14 |

Per-document cost decreases with batch size — parallel sub-batch scoring saturates all CPU cores as the batch grows.
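The per-document and throughput columns can be derived from the P50 latency. The sketch below uses only the figures from the latency scaling table and assumes QPS is approximated as 1000 divided by the P50 latency in milliseconds; the published QPS values were presumably measured, so they differ slightly from this estimate.

```python
# P50 latencies (ms) copied from the latency scaling table above.
p50_ms = {1: 2.7, 10: 10.3, 50: 39.5, 100: 72.0}

def per_doc_ms(n_docs: int, latency_ms: float) -> float:
    """Amortized latency per document for a batch of n_docs."""
    return latency_ms / n_docs

def approx_qps(latency_ms: float) -> float:
    """Rough single-stream throughput: one request completes every latency_ms."""
    return 1000.0 / latency_ms

for n, ms in p50_ms.items():
    print(f"{n:>3} docs: {per_doc_ms(n, ms):.2f} ms/doc, ~{approx_qps(ms):.0f} QPS")
```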

Two-Stage Pipeline Performance

Pipeline comparison chart

Why Flash-Rerank Beats GPU-Accelerated APIs on CPU

For models under ~100M parameters, CPU INT8 with AVX-512 + parallel workers outperforms GPU because:

| Factor | CPU (Flash-Rerank) | GPU (competitors) |
|---|---|---|
| Model fits in cache | L2/L3 cache (~30MB) holds entire model | VRAM access adds latency |
| Kernel overhead | None (direct computation) | GPU kernel launch costs 2-5ms |
| Data transfer | Zero (model lives in CPU memory) | Host→Device copies add latency |
| Quantization | INT8 on AVX-512 is extremely efficient | FP16/FP32 wastes precision headroom |
| Parallelism | N workers × M threads = all cores busy | Single model instance, fixed parallelism |

Quick Start

Python

pip install flash-rerank

import flash_rerank

# Load model (auto-downloads from HuggingFace Hub on first use)
reranker = flash_rerank.load("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Rerank documents by semantic relevance
results = reranker.rerank("distributed systems at scale", [
    "Designing Data-Intensive Applications covers distributed systems",
    "A cookbook with Italian recipes",
    "Raft consensus algorithm for fault-tolerant systems",
    "Introduction to watercolor painting techniques",
    "MapReduce: simplified data processing on large clusters",
], top_k=3)

for index, score in results:
    print(f"  doc {index}: {score:.4f}")

Rust

cargo add flash_rerank

use flash_rerank::models::ModelRegistry;
use flash_rerank::engine::ort_backend::OrtScorer;
use flash_rerank::engine::Scorer;
use flash_rerank::ModelConfig;

// This snippet belongs inside a function that returns a Result, since it uses `?`.
// `dirs` is a separate crate (cargo add dirs), and `docs` holds your candidate documents.

// Load from HuggingFace cache
let cache_dir = dirs::home_dir().expect("home directory not found").join(".cache/huggingface/hub");
let registry = ModelRegistry::new(cache_dir);
let model_dir = registry.load("cross-encoder/ms-marco-MiniLM-L-6-v2")?;

// Score documents
let scorer = OrtScorer::new(ModelConfig::default(), &model_dir)?;
let results = scorer.score("query", &docs)?;
// Results are sorted by descending score; each entry carries an index and a score in [0.0, 1.0]

For maximum throughput, use ParallelScorer instead of OrtScorer:

use flash_rerank::engine::parallel::ParallelScorer;

let scorer = ParallelScorer::new(ModelConfig::default(), &model_dir, None)?;
// Splits batch across N worker threads — 1.7x faster than single-session
let results = scorer.score("query", &docs)?;
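To illustrate the idea of splitting one batch across workers, here is a conceptual sketch in Python. It is not the crate's implementation: `toy_score` is a hypothetical stand-in for cross-encoder inference (plain word overlap), and the strided chunking is an assumption. In a real engine each worker would run native inference, so the threads would do genuinely parallel work.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for model inference: fraction of query words found in doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def score_chunk(query, chunk):
    # chunk is a list of (original_index, document) pairs.
    return [(i, toy_score(query, doc)) for i, doc in chunk]

def parallel_rerank(query, docs, workers=8):
    indexed = list(enumerate(docs))
    # Strided split keeps the sub-batches close to equal in size.
    chunks = [indexed[w::workers] for w in range(workers)]
    chunks = [c for c in chunks if c]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: score_chunk(query, c), chunks))
    merged = [pair for part in parts for pair in part]
    # Descending score; ties are broken by original index for determinism.
    return sorted(merged, key=lambda p: (-p[1], p[0]))
```

Carrying the original index through each sub-batch is what lets the merged list be sorted once at the end without losing track of which document each score belongs to.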

CLI

cargo install flash-rerank-cli
# Download a model from HuggingFace Hub
flash-rerank download --model cross-encoder/ms-marco-MiniLM-L-6-v2

# Benchmark reranking latency
flash-rerank bench --model cross-encoder/ms-marco-MiniLM-L-6-v2 --documents 100

# Start the HTTP server
flash-rerank serve --model cross-encoder/ms-marco-MiniLM-L-6-v2 --port 8080

# Manage cached models
flash-rerank models list

# Generate shell completions
flash-rerank completions bash > ~/.bash_completion.d/flash-rerank

WASM / JavaScript

npm install flash-rerank-wasm

import init, { load_model, rerank } from 'flash-rerank-wasm';

await init();

// Load model bytes and tokenizer (fetch from your server or CDN)
const modelBytes = await fetch('/models/minilm-int8.onnx').then(r => r.arrayBuffer());
const tokenizerJson = await fetch('/models/tokenizer.json').then(r => r.text());
load_model(new Uint8Array(modelBytes), tokenizerJson);

// Rerank entirely client-side — documents never leave the browser
const results = rerank("machine learning", ["ML is a subset of AI", "Weather is sunny"]);
console.log(results); // [{index: 0, score: 0.95}, ...]

Two-Stage Pipeline — BM25-Turbo + Flash-Rerank

Flash-Rerank is the direct sequel to BM25-Turbo. Together they form the fastest end-to-end retrieval + reranking pipeline in the world.

┌──────────────────────────────────────────────────────────────────┐
│  User Query: "distributed systems at scale"                      │
│                                                                  │
│  Stage 1: BM25-Turbo (8.6ms)                                    │
│  ├── Searches 8.8 MILLION documents via precomputed BM25         │
│  ├── Returns top 100 keyword matches                             │
│  └── No math at query time — sparse vector lookup                │
│                                                                  │
│  Stage 2: Flash-Rerank (72ms)                                    │
│  ├── Cross-encoder scores each (query, document) pair            │
│  ├── INT8 quantized inference on CPU                             │
│  ├── Parallel sub-batch across 8 worker threads                  │
│  └── Returns top 5 semantically ranked results                   │
│                                                                  │
│  Total: 80ms — Search millions, get the best 5.                  │
└──────────────────────────────────────────────────────────────────┘

Installation

pip install flash-rerank bm25-turbo

Usage

import flash_rerank
import bm25_turbo

# Build BM25 index (one-time, ~60s for 8.8M docs)
# corpus is assumed to be your collection of document strings, loaded elsewhere
index = bm25_turbo.build_index(corpus, method="lucene")

# Load reranker
reranker = flash_rerank.load("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Two-stage pipeline: keyword retrieval → semantic reranking
pipeline = flash_rerank.Pipeline(index, reranker)
results = pipeline.search("distributed systems at scale", top_k=5, retrieve=100)
# 80ms end-to-end: BM25 retrieval (8.6ms) + neural rerank (72ms)
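The retrieve-then-rerank flow can be illustrated without either library. Everything in this sketch is a toy under stated assumptions: `keyword_retrieve` and `semantic_score` are hypothetical stand-ins for BM25-Turbo and the cross-encoder, and only the control flow (a cheap first stage narrows the corpus, a costlier second stage orders the survivors) mirrors the pipeline described above.

```python
def keyword_retrieve(query, corpus, retrieve):
    """Stage 1 stand-in: rank by count of shared words, keep the top `retrieve`."""
    q = set(query.lower().split())
    scored = [(i, len(q & set(doc.lower().split()))) for i, doc in enumerate(corpus)]
    scored = [s for s in scored if s[1] > 0]
    scored.sort(key=lambda s: (-s[1], s[0]))
    return [i for i, _ in scored[:retrieve]]

def semantic_score(query, doc):
    """Stage 2 stand-in: Jaccard overlap in place of a cross-encoder score."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def two_stage_search(query, corpus, top_k=5, retrieve=100):
    # Only the retrieved candidates are passed to the expensive scorer.
    candidates = keyword_retrieve(query, corpus, retrieve)
    reranked = sorted(
        ((i, semantic_score(query, corpus[i])) for i in candidates),
        key=lambda p: (-p[1], p[0]),
    )
    return reranked[:top_k]
```

The `retrieve` parameter bounds the second stage's cost: the reranker sees at most that many documents no matter how large the corpus is.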

Hybrid RRF Fusion

Combine BM25 keyword scores with vector similarity and neural reranking via Reciprocal Rank Fusion:

results = reranker.rerank_hybrid(
    query="distributed systems at scale",
    documents=candidates,
    bm25_scores=bm25_results,
    alpha=0.6,  # weight toward neural scores (default 0.5)
)
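For intuition, a weighted form of Reciprocal Rank Fusion over two rankings can be sketched as follows. This is an illustration, not the library's code: how `rerank_hybrid` applies `alpha` internally is not documented here, so the weighting below is an assumption, as are the helper names. The constant k=60 is the value commonly used for RRF.

```python
def ranks_from_scores(scores):
    """Map document index -> 1-based rank (1 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return {doc: rank for rank, doc in enumerate(order, start=1)}

def weighted_rrf(neural_scores, bm25_scores, alpha=0.5, k=60):
    """Fuse two rankings of the same documents.

    alpha weights the neural ranking and (1 - alpha) the BM25 ranking.
    Both score lists must have the same length and document order.
    """
    neural_rank = ranks_from_scores(neural_scores)
    bm25_rank = ranks_from_scores(bm25_scores)
    fused = {
        doc: alpha / (k + neural_rank[doc]) + (1 - alpha) / (k + bm25_rank[doc])
        for doc in neural_rank
    }
    return sorted(fused.items(), key=lambda p: (-p[1], p[0]))
```

Because RRF works on ranks rather than raw scores, the BM25 and neural scores do not need to share a scale before fusion.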

Why This Pipeline Wins

| Approach | Latency | Accuracy | Cost |
|---|---|---|---|
| BM25-Turbo + Flash-Rerank | 80ms | High (neural reranking) | Free (local) |
| Elasticsearch + Cohere Rerank | ~700ms | High | $1/1000 queries |
| Vector DB + Jina Rerank | ~300ms | High | $0.80/1000 queries |
| Pure vector search (Qdrant/Pinecone) | ~50ms | Medium (no reranking) | $0.05+/query |
| BM25 only (no reranking) | 8.6ms | Low-medium | Free |

Flash-Rerank + BM25-Turbo gives you neural-level accuracy at BM25-level latency and zero marginal cost.


Requirements

  • Rust 1.85+ (edition 2024) — for building from source or cargo install
  • Python 3.9+ — for pip install flash-rerank
  • NVIDIA GPU (optional) — enable with cargo build --features cuda,tensorrt

No GPU drivers, CUDA toolkit, or cloud API keys required for CPU inference.

Installation

| Platform | Command | Package |
|---|---|---|
| Rust library | cargo add flash_rerank | crates.io/crates/flash_rerank |
| CLI binary | cargo install flash-rerank-cli | crates.io/crates/flash-rerank-cli |
| Python | pip install flash-rerank | pypi.org/project/flash-rerank |
| WASM / npm | npm install flash-rerank-wasm | npmjs.com/package/flash-rerank-wasm |
| Server | cargo add flash-rerank-server | crates.io/crates/flash-rerank-server |

Features

Inference Engine

  • Parallel sub-batch inference — N workers x M threads saturate all CPU cores
  • Auto INT8 model selection — quantized ONNX models auto-selected for CPU (2x speedup)
  • GPU acceleration — CUDA and TensorRT execution providers (optional feature)
  • Dynamic model input detection — supports both 2-input (BGE) and 3-input (MiniLM) models
  • Memory-mapped model loading — cold start in microseconds via memmap2

Scoring & Fusion

  • Score calibration — sigmoid normalization + custom Platt scaling (0.0-1.0 scores)
  • Hybrid RRF fusion — combine BM25, vector, and neural scores with configurable weights
  • Cascade reranking — fast model filters, accurate model refines uncertain results
  • ColBERT late interaction — MaxSim scoring with LRU embedding cache
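As an illustration of the score calibration item above, the sketch below maps a raw logit into the 0.0-1.0 range. It assumes the common form of Platt scaling, a sigmoid applied to an affine transform of the logit; sign conventions vary between references, the parameters `a` and `b` would normally be fit on labeled data, and the function names are hypothetical rather than the crate's API.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def platt_calibrate(logit: float, a: float = 1.0, b: float = 0.0) -> float:
    """Platt-style calibration: sigmoid of an affine transform of the raw logit.

    With the defaults a=1, b=0 this reduces to plain sigmoid normalization.
    """
    return sigmoid(a * logit + b)
```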

Server & Operations

  • Dynamic batching — SLA-aware request grouping for GPU throughput
  • Multi-GPU scaling — least-loaded routing across available GPUs
  • A/B model comparison — deterministic traffic splitting with per-variant metrics
  • Canary deployment — gradual rollout (1% → 5% → 25% → 100%) with auto-rollback
  • Score drift detection — KL divergence monitoring for distribution shifts
  • OpenTelemetry tracing — behind telemetry feature flag, zero-cost when disabled
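Score drift detection via KL divergence can be pictured with a small sketch: bin the baseline and current scores into histograms, then compare the two distributions. The bin count, smoothing constant, and threshold below are illustrative assumptions, not the server's actual settings, and scores are assumed to lie in [0, 1].

```python
import math

def histogram(scores, bins=10, eps=1e-6):
    """Normalized histogram of scores in [0, 1], smoothed so no bin is zero."""
    counts = [eps] * bins
    for s in scores:
        # Clamp so that s == 1.0 (or slightly out-of-range values) stay in range.
        idx = max(0, min(int(s * bins), bins - 1))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def drifted(baseline_scores, current_scores, threshold=0.1):
    """True when the current score distribution diverges from the baseline."""
    p = histogram(current_scores)
    q = histogram(baseline_scores)
    return kl_divergence(p, q) > threshold
```

The smoothing term keeps every bin strictly positive, which avoids division by zero when a bin is empty in the baseline.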

Developer Experience

  • Async Python: arerank() releases the GIL and runs on asyncio executors
  • WASM browser reranking — client-side inference via tract-onnx (documents never leave the device)
  • BEIR benchmarking suite — accuracy and latency benchmarks against standard IR datasets
  • 235+ tests — property-based (proptest), snapshot (insta), integration, stress, CLI

Model Zoo

| Model | Params | Latency (100 docs) | Best For |
|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 22M | 72ms (CPU INT8) | Speed (fastest inference) |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | 33M | 165ms (CPU INT8) | Speed + accuracy balance |
| BAAI/bge-reranker-v2-m3 | 568M | ~7s (CPU) | Multilingual (100+ languages) |
| Qwen3-Reranker-0.6B | 600M | | Most accurate small model |

Any ONNX cross-encoder model from HuggingFace Hub works with Flash-Rerank. The engine auto-detects model inputs and selects the best available quantized variant.

# Download any model
flash-rerank download --model cross-encoder/ms-marco-MiniLM-L-6-v2

# List cached models
flash-rerank models list

# Benchmark a model
flash-rerank bench --model cross-encoder/ms-marco-MiniLM-L-6-v2 --documents 100

HTTP Server

Quick Start

# Install and start
cargo install flash-rerank-cli
flash-rerank download --model cross-encoder/ms-marco-MiniLM-L-6-v2
flash-rerank serve --model cross-encoder/ms-marco-MiniLM-L-6-v2 --port 8080

API

# Rerank documents
curl -X POST http://localhost:8080/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning",
    "documents": ["ML is a subset of AI", "The weather is sunny", "Neural networks learn patterns"],
    "top_k": 2
  }'

# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics
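The same /rerank call can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl example above; the response schema is not documented in this section, so the decoded JSON is returned as-is, and the function names are hypothetical.

```python
import json
import urllib.request

def build_rerank_request(base_url, query, documents, top_k):
    """Build a POST /rerank request with the same JSON body as the curl example."""
    payload = {"query": query, "documents": documents, "top_k": top_k}
    return urllib.request.Request(
        url=f"{base_url}/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def rerank(base_url, query, documents, top_k, timeout=5.0):
    """Send the request and return the decoded JSON body unchanged."""
    req = build_rerank_request(base_url, query, documents, top_k)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `rerank("http://localhost:8080", "machine learning", docs, 2)` requires the server from the Quick Start to be running on that port.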

Advanced Server Options

# Dynamic batching (group requests within 5ms window, max batch 256)
flash-rerank serve --model ms-marco-MiniLM-L-6-v2 --max-batch 256 --max-wait-ms 5

# Multi-GPU (distribute across GPUs 0, 1, 2)
flash-rerank serve --model ms-marco-MiniLM-L-6-v2 --gpus 0,1,2

# With OpenTelemetry tracing
flash-rerank serve --model ms-marco-MiniLM-L-6-v2 --otlp-endpoint http://localhost:4317
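The effect of --max-batch and --max-wait-ms can be pictured with a small offline simulation. This sketch assumes the simple window rule stated in the comment above (a batch closes when it is full or when the wait window since its first request has elapsed); the server's actual SLA-aware logic is not shown in this document and may differ.

```python
def group_into_batches(arrivals_ms, max_batch=256, max_wait_ms=5.0):
    """Group sorted request arrival times (ms) into batches.

    A batch opens when its first request arrives. It closes when it already
    holds max_batch requests, or when the next request arrives more than
    max_wait_ms after the batch's first request.
    """
    batches, current = [], []
    for t in arrivals_ms:
        if current and (len(current) >= max_batch or t - current[0] > max_wait_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches
```

A longer wait window yields larger batches and better throughput at the cost of added latency for the first request in each batch.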

Management Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /rerank | POST | Score and rank documents |
| /health | GET | Health check |
| /metrics | GET | Prometheus metrics |
| /admin/drift/status | GET | Score drift detection status |
| /admin/drift/reset-baseline | POST | Reset drift baseline |
| /admin/ab | POST/PUT/DELETE | A/B test management |
| /admin/ab/metrics | GET | Per-variant comparison metrics |
| /admin/canary | POST | Start canary deployment |
| /admin/canary/status | GET | Canary rollout status |
| /admin/canary/advance | POST | Advance to next rollout stage |
| /admin/canary | DELETE | Abort and rollback |

Architecture

Flash-Rerank Workspace
├── flash_rerank/              # Core library (Scorer trait + backends)
│   ├── engine/
│   │   ├── ort_backend.rs     #   ONNX Runtime: CPU, CUDA, TensorRT
│   │   ├── parallel.rs        #   Parallel sub-batch scorer (N workers)
│   │   ├── colbert.rs         #   ColBERT MaxSim late interaction
│   │   └── tensorrt.rs        #   TensorRT engine compiler
│   ├── tokenize/              #   HuggingFace tokenizers wrapper
│   ├── calibrate/             #   Sigmoid + Platt score calibration
│   ├── fusion/                #   Reciprocal Rank Fusion (RRF)
│   ├── cascade/               #   Fast model → accurate model pipeline
│   ├── batch/                 #   Dynamic request batching
│   ├── multi_gpu/             #   Least-loaded GPU routing
│   └── models/                #   HuggingFace Hub download + cache
├── flash-rerank-server/       # HTTP server (axum)
│   ├── routes.rs              #   /rerank, /health, /metrics, /admin/*
│   ├── drift.rs               #   KL divergence score monitoring
│   ├── ab_test.rs             #   A/B traffic splitting
│   ├── canary.rs              #   Progressive rollout
│   └── telemetry.rs           #   OpenTelemetry OTLP exporter
├── flash-rerank-cli/          # CLI binary (clap)
│   ├── download.rs            #   Model download from HuggingFace
│   ├── compile.rs             #   ONNX → TensorRT compilation
│   ├── bench.rs               #   Latency + accuracy benchmarks
│   ├── serve.rs               #   Start HTTP server
│   └── models.rs              #   Cache management
├── flash-rerank-python/       # Python bindings (PyO3 + maturin)
├── flash-rerank-wasm/         # Browser WASM (tract-onnx)
├── benchmarks/                # Criterion + divan + BEIR + MSMARCO
└── examples/                  # Runnable examples

Why Flash-Rerank is Fast

Most existing rerankers (Cohere, Voyage, Jina, BGE) are typically served from a Python server on top of PyTorch or ONNX Runtime. Flash-Rerank removes those layers of overhead:

┌─────────────────────────────────────────────────────────────────┐
│  Competitor Stack              │  Flash-Rerank Stack            │
│                                │                                │
│  Python HTTP server            │  Rust HTTP server (axum)       │
│  ├── Python GIL                │  ├── No GIL, no interpreter   │
│  ├── NumPy array conversion    │  ├── Zero-copy tensors        │
│  ├── PyTorch / ONNX Runtime    │  ├── ONNX Runtime (direct)    │
│  ├── FP32 inference            │  ├── INT8 auto-quantized      │
│  ├── Single session            │  ├── N parallel workers       │
│  └── GPU (required)            │  └── CPU-only (GPU optional)  │
│                                │                                │
│  Result: 188-603ms             │  Result: 72ms                  │
└─────────────────────────────────────────────────────────────────┘
  1. Pure Rust — No Python interpreter or GIL in the inference hot path
  2. INT8 quantized models — Auto-selected on CPU via AVX-512 instructions (2x speedup)
  3. Parallel sub-batch scoring — 8 worker threads each running an independent ORT session
  4. Zero-copy tokenization — HuggingFace tokenizers encodes directly into model input tensors
  5. Memory-mapped model loading — Models load via memmap2 in microseconds
  6. Dynamic model input detection — Auto-detects 2-input vs 3-input models at load time

Benchmarks

Reproducing Our Numbers

git clone https://github.com/alessandrobenigni/Flash-Rerank-Rust-Python-WASM-CLI-.git
cd Flash-Rerank-Rust-Python-WASM-CLI-

# Download model
cargo run -p flash-rerank-cli --release -- download --model cross-encoder/ms-marco-MiniLM-L-6-v2

# Run competitive benchmark
cargo bench --bench competitive

# Run latency profiling
cargo run --example micro_profile --release

# Run pipeline benchmark (BM25-Turbo + Flash-Rerank)
cargo run --example pipeline_benchmark --release

# Run the accuracy evaluation on MS MARCO (one of the BEIR datasets)
MSMARCO_PATH=/path/to/msmarco cargo bench --bench msmarco_accuracy

Our Benchmark Hardware

| Component | Spec |
|---|---|
| CPU | Intel i9-12900H (14 cores, 20 threads) |
| GPU | NVIDIA RTX 3080 Ti Laptop (16GB VRAM) |
| RAM | 32GB DDR5 |
| OS | Windows 11 |
| Rust | 1.93.1 |
| ONNX Runtime | 2.0.0-rc.12 |

On a desktop RTX 3090 Ti with TensorRT INT8, the target is <20ms for 100 documents — 10x faster than Jina's published benchmark.


Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

# Build everything
cargo build --workspace

# Run tests (235+ tests)
cargo test --workspace --exclude flash-rerank-python --exclude flash-rerank-wasm

# Lint
cargo clippy --workspace --exclude flash-rerank-python --exclude flash-rerank-wasm -- -D warnings

# Format
cargo fmt --all

# Build Python wheel
cd flash-rerank-python && maturin develop --release

# Build WASM
cd flash-rerank-wasm && wasm-pack build --target web

Roadmap

  • TensorRT INT8 engine compilation with auto-calibration
  • Pre-compiled engine distribution via HuggingFace Hub
  • WebGPU acceleration for browser inference
  • Early-exit transformer inference for adaptive compute
  • Multi-modal reranking (text + images)
  • Listwise reranking (RankGPT-style)

License

v0.2.0+ is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

  • Individuals and open-source projects: Free to use under AGPL-3.0 terms.
  • Enterprises and commercial use: A commercial license is available at alessandrobenigni.com — use Flash-Rerank in proprietary software without the AGPL copyleft requirement. See COMMERCIAL_LICENSE.md for details.

Legacy: v0.1.x was released under MIT/Apache-2.0. Those versions remain under their original terms but are no longer maintained.


Built by The Sauce Suite
Part of the BM25-Turbo ecosystem — the fastest search stack in the world.

Dependencies

~28–44MB
~561K SLoC