March

Trie-based KV cache with prefix sharing for LLM inference.

When multiple LLM requests share a system prompt or conversation history, a naive cache stores N full copies of the same key/value tensors. With 100 concurrent sessions on a 512-token system prompt, that's 100 identical blobs in GPU memory. The prefix is identical. The storage isn't.

March fixes that. One copy, shared by everyone.

How it works

March inserts token sequences into a Trie. Nodes that share a path share physical memory. A fixed-size mmap'd pool manages pages. Reads return raw pointers into that pool — zero memcpy, zero allocation.

insert([sys, A, B, C],    kv₁)
insert([sys, A, B, C, D], kv₂)

  root → [sys] → [A] → [B] → [C] ← KVPage  (shared by both)
                                └── [D] ← KVPage  (one new page)

query([sys, A, B, C, D])
  → 5 hops, each O(1) via internal hashmap
  → returns pointers: [&page_C, &page_D]
  → zero memcpy, zero allocation

Each node's children live in an open-addressing hashmap with Fibonacci hashing, so each hop is O(1) regardless of vocabulary size. The overall lookup is O(L) where L is sequence length — which is the minimum possible, since you must traverse the prefix.

Memory:

Traditional: N × L × page_size
March: unique_nodes × page_size
Savings: 80–97% in prefix-heavy workloads

Install

# one-liner install from GitHub (compiles C core automatically)
uv pip install git+https://github.com/lizixi-0x2F/March

# with HuggingFace integration
uv pip install "march[hf] @ git+https://github.com/lizixi-0x2F/March"

# local development
git clone https://github.com/lizixi-0x2F/March && cd March
uv pip install -e ".[hf]"

Requires: gcc, Python ≥ 3.10. HF integration additionally requires transformers ≥ 4.40 and torch.

Usage

Core cache

from march import MarchCache

cache = MarchCache(page_size=4096, max_pages=256)

cache.insert([1, 2, 3], kv_bytes)
pages, matched = cache.query([1, 2, 3, 4])

cache.print_stats()
# ── March Stats ──────────────────────────
#   uptime          : 0.3s
#   inserts         : 1
#   queries         : 1
#   hit rate        : 100.0%  (1/1)
#   token reuse     : 75.0%   (3/4 tokens served from cache)
#   page pool       : 256 pages × 4096 B = 1024.0 KB total
# ─────────────────────────────────────────

HuggingFace integration

MarchPrefixCache wraps a HF model and intercepts prefill. On a cache hit it skips the forward pass entirely and returns stored past_key_values directly.

from march.hf import MarchPrefixCache

prefix_cache = MarchPrefixCache(model, tokenizer)

# Miss — runs prefill, caches result
past_kv, hit = prefix_cache.prefill(input_ids)   # hit=False

# Same prefix again — no forward pass
past_kv, hit = prefix_cache.prefill(input_ids)   # hit=True

outputs = model.generate(
    input_ids,
    past_key_values=past_kv,
    use_cache=True,
)

prefix_cache.print_stats()

When March wins

Workload	Savings	Why
Multi-turn chat	80–97%	Same system prompt every request
Batch inference	80–97%	N requests, 1 shared prefix copy
Speculative / beam search	40–70%	Branching from a shared root
Random sequences	~0%	No shared prefixes to deduplicate

Architecture

Three C modules. One Python binding layer. No runtime dependencies.

PageAllocator — mmap-backed fixed-size pool. Alloc and free are O(1) via a free-list stack. Pages are reference-counted for safe sharing. LRU timestamps enable eviction when the pool fills.

KVTrie — Token-keyed prefix trie. Each node holds an open-addressing hashmap over its children, so vocabulary size doesn't affect lookup speed. Nodes without KV data act as internal routing nodes only.

ViewBuilder — Walks a path through the Trie and fills a caller-provided pointer array with references to live pool pages, root-to-leaf. The inference engine reads directly from the mmap'd buffer.

march_insert → KVTrie → PageAllocator  (alloc page, memcpy kv_data in)
march_query  → KVTrie → ViewBuilder    → out_ptrs[]  (zero-copy into pool)

Demo

python3 march/demo/demo_basic.py

pip install transformers torch
python3 march/demo/demo_smollm2_memory.py

500 multi-turn conversations, same system prompt. Left: traditional. Right: March.

March is early-stage research software. The C core is stable; the Python API may evolve.

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
march		march
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
setup.py		setup.py
smollm2_memory_comparison.png		smollm2_memory_comparison.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

March

How it works

Install

Usage

Core cache

HuggingFace integration

When March wins

Architecture

Demo

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

March

How it works

Install

Usage

Core cache

HuggingFace integration

When March wins

Architecture

Demo

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages