4thel00z/konan

konan

Like the paper angel of the Akatsuki, konan folds your documents into precise pieces — Rust chunkers behind a Python API.

PyPI CI python-ci Python 3.12+ Rust License: MIT


Why konan?

  • 🦀 Rust core — all chunking runs in native code, no Python-loop overhead
  • Multithreaded by default: chunk_many() fans out across all cores via rayon and releases the GIL
  • 🌀 Real async: chunk_async / chunk_many_async are native async def, not thread-pool wrappers
  • 🎯 Char-accurate offsets: text[chunk.start:chunk.end] == chunk.text, always (Python slicing semantics, emoji-safe)
  • 🧠 Semantic chunking — splits on topic shifts using any OpenAI-compatible embeddings endpoint, or your own async embedder
  • 🔌 Ports & adapters — the Embedder port is injectable; bring your own backend
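
The char-accurate offsets point is easy to underestimate: Rust strings are indexed by UTF-8 byte, while Python slices by code point, so the two disagree as soon as the text contains a multi-byte character. The sketch below is pure Python and is not konan's code; `byte_to_char_offset` is a hypothetical helper that only illustrates the conversion a byte-oriented core has to perform somewhere.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Python (code point) offset.

    The byte offset must fall on a character boundary.
    """
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


text = "fold 📄 paper"
piece = "paper"

# Offsets as a byte-oriented core would see them.
data = text.encode("utf-8")
byte_start = data.find(piece.encode("utf-8"))
byte_end = byte_start + len(piece.encode("utf-8"))

# Offsets as Python slicing needs them.
start = byte_to_char_offset(text, byte_start)
end = byte_to_char_offset(text, byte_end)

print(byte_start, byte_end)  # 10 15
print(start, end)            # 7 12
assert text[start:end] == piece
assert text[byte_start:byte_end] != piece  # raw byte offsets slice the wrong span
```

The emoji occupies four bytes but one code point, which is why the byte offsets overshoot by three.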

Strategies

| Chunker | Splits by | Best for |
| --- | --- | --- |
| NaiveChunker | fixed word count | quick & dirty baselines |
| FixedSizeChunker | chars, sentence-aware, overlap | classic RAG pipelines |
| RecursiveChunker | separator hierarchy (\n\n\n → … ) | general text, LangChain-compatible |
| SentenceChunker | unicode sentence boundaries | prose, multilingual text |
| MarkdownChunker | document structure + heading breadcrumbs | docs, wikis, READMEs |
| TokenChunker | exact token counts (cl100k_base, o200k_base) | embedding-model token limits |
| SemanticChunker | embedding similarity drops | topic-coherent chunks |
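
To make "separator hierarchy" concrete, here is a minimal pure-Python sketch of the general idea: split on the coarsest separator first, merge small pieces back together, and only fall through to finer separators for pieces that are still too large. It is an illustration, not konan's or LangChain's algorithm: the function name and separator list are assumptions, and overlap is left out for brevity.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    buffer = ""
    for piece in text.split(sep):
        candidate = piece if not buffer else buffer + sep + piece
        if len(candidate) <= chunk_size:
            buffer = candidate          # keep merging small neighbours
            continue
        if buffer:
            chunks.append(buffer)       # flush what has been merged so far
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
pieces = recursive_split(sample, chunk_size=12, separators=["\n\n", "\n", " "])
print(pieces)  # ['alpha beta', 'gamma delta', 'epsilon', 'zeta']
```

The middle paragraph is too long for one chunk, so only that paragraph is re-split on spaces; the other two survive intact.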

Installation

uv add konan        # or: pip install konan

Quickstart

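from pathlib import Path

from konan import RecursiveChunker

chunker = RecursiveChunker(chunk_size=1000, chunk_overlap=200)

# read_text closes the file and pins the encoding, so offsets do not depend on the platform default
text = Path("moby_dick.txt").read_text(encoding="utf-8")
chunks = chunker.chunk(text)
print(chunks[0].text, chunks[0].start, chunks[0].end, chunks[0].hash)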

Parallel — all cores, one call

# rayon work-stealing across every core, GIL released:
all_chunks = chunker.chunk_many(documents)

Async — real async def, no thread-pool wrappers

chunks = await chunker.chunk_async(text)
batches = await chunker.chunk_many_async(documents)

Markdown with breadcrumbs

from konan import MarkdownChunker

chunks = MarkdownChunker(chunk_size=800).chunk(readme_text)
# chunk text is prefixed with its heading trail: "# Guide > ## Install\n\n..."
# code fences are never split
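
For intuition, a heading trail like the one in the comment above can be computed with a small stack of open headings. The sketch below is pure Python under stated assumptions: `heading_trails` is a hypothetical helper, it only handles ATX (`#`) headings, it ignores chunk sizes entirely, and it is not how MarkdownChunker is implemented.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # built at runtime so this example can live inside a fenced block


def heading_trails(markdown: str) -> list[tuple[str, str]]:
    """Return (breadcrumb, body) pairs; text before the first heading is dropped."""
    stack: list[tuple[int, str]] = []            # open headings as (level, "## Title")
    sections: list[tuple[str, list[str]]] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence              # '#' inside a code fence is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            level = len(match.group(1))
            # A new heading closes every open heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, f"{match.group(1)} {match.group(2)}"))
            sections.append((" > ".join(title for _, title in stack), []))
        elif sections:
            sections[-1][1].append(line)

    result = []
    for trail, body_lines in sections:
        body = "\n".join(body_lines).strip()
        if body:
            result.append((trail, body))
    return result


doc = "\n".join([
    "# Guide",
    "Intro text.",
    "## Install",
    FENCE + "sh",
    "# not a heading",
    "uv add konan",
    FENCE,
    "## Usage",
    "Call chunk().",
])
trails = heading_trails(doc)
for trail, _body in trails:
    print(trail)
```

Running it prints `# Guide`, `# Guide > ## Install`, and `# Guide > ## Usage`; the `# not a heading` line stays inside the Install section because it sits in a code fence.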

Token-exact chunks

from konan import TokenChunker

chunker = TokenChunker(chunk_size=512, chunk_overlap=64, encoding="o200k_base")
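
A rough way to reason about `chunk_size` and `chunk_overlap` in token space is as a sliding window whose step is `chunk_size - chunk_overlap`. The sketch below assumes exactly that windowing and works on token counts only; `token_windows` is a hypothetical helper, and TokenChunker's real boundary rules may differ.

```python
def token_windows(n_tokens: int, chunk_size: int, chunk_overlap: int) -> list[tuple[int, int]]:
    """Return half-open [start, end) token index ranges for overlapping windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break                      # the last window already reaches the end
        start += step
    return windows


windows = token_windows(n_tokens=1200, chunk_size=512, chunk_overlap=64)
print(windows)  # [(0, 512), (448, 960), (896, 1200)]
```

Each window starts 448 tokens after the previous one, so consecutive chunks share 64 tokens.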

Semantic chunking

Point it at any OpenAI-compatible /embeddings endpoint (OpenAI, vLLM, Ollama, LiteLLM, …):

from konan import OpenAIEmbedder, SemanticChunker

embedder = OpenAIEmbedder(
    base_url="https://api.openai.com/v1",
    model="text-embedding-3-small",
    api_key="sk-...",
    batch_size=128,    # texts per request
    timeout=30.0,      # request timeout, seconds
    max_retries=2,     # exponential backoff on 429/5xx/connect errors
    dimensions=512,    # optional: shorten text-embedding-3-* vectors
)
chunker = SemanticChunker(embedder=embedder, threshold=0.75)
chunks = await chunker.chunk_async(article)
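
The README does not spell out how `threshold` is applied, so the following is only one plausible reading of "embedding similarity drops": embed consecutive sentences and start a new chunk whenever the cosine similarity to the previous sentence falls below the threshold. The function names and the hand-written vectors are assumptions made for illustration, not konan's implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_on_similarity_drops(sentences, vectors, threshold):
    """Group consecutive sentences; open a new group when similarity drops below threshold."""
    groups = []
    for i, sentence in enumerate(sentences):
        if i == 0 or cosine(vectors[i - 1], vectors[i]) < threshold:
            groups.append([sentence])
        else:
            groups[-1].append(sentence)
    return groups


sentences = ["Cats purr.", "Cats nap.", "Rust compiles.", "Rust links."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]  # made-up 2-d embeddings
groups = split_on_similarity_drops(sentences, vectors, threshold=0.75)
print(groups)  # [['Cats purr.', 'Cats nap.'], ['Rust compiles.', 'Rust links.']]
```

Only the middle pair of sentences has low similarity (about 0.22), so that is the single split point.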

Or inject your own embedder — any async callable works:

async def my_embedder(texts: list[str]) -> list[list[float]]:
    return await my_model.embed(texts)

chunker = SemanticChunker(embedder=my_embedder, percentile=95.0)
chunks = await chunker.chunk_async(article)   # async-only for Python embedders

Python embedders must return list[list[float]] — call .tolist() on numpy arrays. They are async-only: chunk()/chunk_many() raise a RuntimeError pointing you at the _async variants.
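
For local experiments without a model server, any coroutine with the signature shown above will do. The toy embedder below is a sketch only: it hashes words into a fixed number of buckets, so its vectors carry almost no semantic signal, and the name `hashing_embedder` and the dimension are arbitrary choices. It does return plain `list[list[float]]`, which is the shape the README asks for.

```python
import asyncio
import hashlib
import math

DIM = 16  # arbitrary size for the toy vectors


async def hashing_embedder(texts: list[str]) -> list[list[float]]:
    """Toy embedder: count words per hash bucket, then L2-normalize."""
    vectors = []
    for text in texts:
        vec = [0.0] * DIM
        for word in text.lower().split():
            digest = hashlib.blake2b(word.encode("utf-8"), digest_size=2).digest()
            vec[int.from_bytes(digest, "big") % DIM] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero for empty text
        vectors.append([v / norm for v in vec])
    return vectors


vectors = asyncio.run(hashing_embedder(["paper angel", "paper crane", ""]))
print(len(vectors), len(vectors[0]))  # 3 16
```

Following the snippet above, it would be passed as `SemanticChunker(embedder=hashing_embedder, ...)` and driven through `chunk_async`.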

The Chunk object

chunk.text    # the chunk's text
chunk.start   # char offset into the source (Python slicing semantics)
chunk.end     # char offset, exclusive
chunk.index   # 0-based position
chunk.hash    # xxh3-64 content hash
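
These fields imply a contract that is cheap to check in your own tests. The helper below is a sketch: `FakeChunk` is a stand-in dataclass that merely reuses the attribute names listed above (it is not konan's class), and treating `index` as the position in the returned list is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeChunk:
    """Stand-in with the same attribute names as the fields above (hash omitted)."""
    text: str
    start: int
    end: int
    index: int


def check_chunks(source: str, chunks) -> None:
    """Raise AssertionError if any chunk violates the offset or index contract."""
    for position, chunk in enumerate(chunks):
        assert chunk.index == position, f"chunk {position} has index {chunk.index}"
        assert 0 <= chunk.start <= chunk.end <= len(source)
        assert source[chunk.start:chunk.end] == chunk.text, (
            f"chunk {position}: offsets do not reproduce its text"
        )


source = "折り紙 🦢 folds"
chunks = [FakeChunk("折り紙", 0, 3, 0), FakeChunk("🦢 folds", 4, 11, 1)]
check_chunks(source, chunks)  # passes silently
```

Because the check only relies on attribute access, the same function should work on any objects exposing `text`, `start`, `end`, and `index`.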

Benchmarks

Benchmarked on Apple M3 Pro (arm64), Python 3.12.10. Decimal MB/s, median of 5 runs, measured from Python (the numbers you actually get). Reproduce both tables and plots with uv run --extra bench benchmarks/bench.py; Rust-level criterion benches: cargo bench -p konan-core.
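
The measurement convention described above (decimal MB/s, median of 5 runs, timed from Python) can be sketched in a few lines. This is not benchmarks/bench.py, whose details are not shown here; `throughput_mb_s` is a hypothetical helper, and `str.split` merely stands in for a chunker call.

```python
import statistics
import time


def throughput_mb_s(fn, payload: str, runs: int = 5) -> float:
    """Median decimal MB/s of fn(payload) over `runs` timed calls."""
    n_bytes = len(payload.encode("utf-8"))
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(payload)
        timings.append(time.perf_counter() - t0)
    # Decimal megabytes: 1 MB = 1_000_000 bytes.
    return n_bytes / 1_000_000 / statistics.median(timings)


document = "The quick brown fox jumps over the lazy dog. " * 20_000
rate = throughput_mb_s(str.split, document)
print(f"{rate:.0f} MB/s")
```

Using the median rather than the mean keeps a single slow run (page faults, a background process) from skewing the figure.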

vs other libraries (same 1 MB document, identical configs)

[Plot: konan vs other libraries]

| Strategy | Library | Throughput | Chunks |
| --- | --- | --- | --- |
| recursive | konan | 282 MB/s | 1298 |
| recursive | semantic-text-splitter | 129 MB/s | 1368 |
| recursive | chonkie | 87 MB/s | 1413 |
| recursive | langchain-text-splitters | 24 MB/s | 1468 |
| token | konan | 68 MB/s | 438 |
| token | semchunk | 24 MB/s | 616 |
| token | langchain-text-splitters | 23 MB/s | 438 |
| token | chonkie | 18 MB/s | 438 |
| token | semantic-text-splitter | 4 MB/s | 505 |
| sentence | konan | 305 MB/s | 1042 |
| sentence | chonkie | 3 MB/s | 2124 |
| recursive (unicode) | semantic-text-splitter | 167 MB/s | 1150 |
| recursive (unicode) | konan | 166 MB/s | 1092 |
| recursive (unicode) | chonkie | 57 MB/s | 1171 |

Throughput per strategy (1 MB document)

[Plot: konan throughput per strategy]

| Chunker | Config | Throughput | Chunks |
| --- | --- | --- | --- |
| NaiveChunker | 200 words | 553 MB/s | 804 |
| FixedSizeChunker | 1000 chars, 200 overlap | 1,602 MB/s | 1255 |
| RecursiveChunker | 1000 chars, 200 overlap | 282 MB/s | 1298 |
| SentenceChunker | 1000 chars, 1 overlap | 306 MB/s | 1130 |
| MarkdownChunker | 1000 chars, 200 overlap | 448 MB/s | 1511 |
| TokenChunker | 512 tokens, 64 overlap (cl100k) | 69 MB/s | 438 |

Parallel scaling — rayon goes brrr

[Plot: chunk_many parallel scaling]

64 docs × 256 KB through RecursiveChunker:

| Mode | Time | Throughput | Speedup |
| --- | --- | --- | --- |
| sequential chunk() loop | 61 ms | 267 MB/s | 1.0× |
| chunk_many() (rayon, GIL released) | 10 ms | 1,568 MB/s | 5.9× |

(Pure-Rust criterion puts the same workload at ~6 GiB/s; the Python numbers include chunk-object conversion.)
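
As a sanity check, the parallel-scaling figures are mutually consistent if "256 KB" is read as decimal kilobytes; that reading is an assumption, chosen because it matches the decimal MB/s convention used for the benchmarks. The arithmetic:

```python
n_docs, doc_bytes = 64, 256_000            # assumes decimal KB (1 KB = 1000 bytes)
total_mb = n_docs * doc_bytes / 1_000_000

sequential_mb_s, parallel_mb_s = 267, 1568  # reported throughputs

sequential_ms = total_mb / sequential_mb_s * 1000
parallel_ms = total_mb / parallel_mb_s * 1000
speedup = parallel_mb_s / sequential_mb_s

print(f"{total_mb:.3f} MB, {sequential_ms:.0f} ms, {parallel_ms:.0f} ms, {speedup:.1f}x")
# 16.384 MB, 61 ms, 10 ms, 5.9x
```

The derived times and speedup round to the reported 61 ms, 10 ms, and 5.9×.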

Caveats, honestly: recursive/token use identical configs across libraries (1000 chars / 200 overlap; cl100k, 512 / 64 — note the identical token chunk counts for the libraries with the same windowing semantics). Sentence configs are not directly comparable (konan groups by chars, chonkie by tokens) — read those rows as per-library cost, not head-to-head. The recursive (unicode) rows run mixed German/CJK/Cyrillic/emoji prose, off konan's ASCII fast path. konan tokenizes with bpe-openai (tiktoken-equivalent output, much faster encoder).

Development

uv sync                       # set up the venv (builds the extension)
uv run maturin develop --uv   # rebuild after Rust changes
cargo test --workspace        # rust unit tests
uv run pytest -q              # python integration tests

The workspace is hexagonal: crates/konan-core is pure Rust (no PyO3) with Chunker and Embedder ports; crates/konan-py adapts it to Python.
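
The ports-and-adapters split lives in Rust, but the same shape can be sketched in Python for readers more at home there. Everything below is an analogy under stated assumptions, not konan's code: the `Embedder` protocol mirrors the async-callable contract from the semantic chunking section, and `LengthEmbedder` and `embed_sentences` are made-up names.

```python
import asyncio
from typing import Protocol


class Embedder(Protocol):
    """Port: the only thing the core knows about an embedding backend."""

    async def __call__(self, texts: list[str]) -> list[list[float]]: ...


class LengthEmbedder:
    """Adapter: a trivial backend that satisfies the port (illustration only)."""

    async def __call__(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]


async def embed_sentences(embedder: Embedder, sentences: list[str]) -> list[list[float]]:
    """Core logic depends on the port, never on a concrete backend."""
    return await embedder(sentences)


result = asyncio.run(embed_sentences(LengthEmbedder(), ["fold", "paper"]))
print(result)  # [[4.0], [5.0]]
```

Swapping the backend means writing another adapter; the core function does not change.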

License

MIT


Named after Konan of the Akatsuki — the only one who could fold paper into anything. 🗞️
