chunkedrs


English | 简体中文 | 日本語

Token-accurate text chunking for RAG pipelines — recursive, markdown-aware, and semantic splitting. Built on tiktoken, the fastest pure-Rust BPE tokenizer.

Highlights

  • Token-accurate — every chunk is guaranteed to fit within your token budget, not a character-count approximation
  • 3 strategies — recursive (fast, general), markdown-aware (preserves headers), semantic (embedding-based breakpoints)
  • Rich metadata — byte offsets, token counts, and section headers on every chunk
  • Overlap — configurable token overlap between chunks for retrieval context preservation
  • Any tokenizer — auto-detect from model name (gpt-4o, claude, llama) or specify encoding directly
  • Built on tiktoken — the fastest pure-Rust BPE tokenizer with 9 encodings across all major LLMs

Why chunkedrs?

RAG pipelines need text split into chunks that fit model context windows. Naive splitting (by character count or fixed size) breaks mid-word, mid-sentence, or mid-paragraph — destroying meaning and hurting retrieval quality.

chunkedrs splits at semantic boundaries (paragraphs, sentences, words) while enforcing exact token limits. No chunk ever exceeds max_tokens.
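
For a concrete picture of the failure mode, here is what naive fixed-size splitting looks like (an illustrative sketch, not part of chunkedrs):

fn naive_split(text: &str, size: usize) -> Vec<&str> {
    text.as_bytes()
        .chunks(size)
        // panics if a chunk boundary lands inside a multi-byte codepoint
        .map(|bytes| std::str::from_utf8(bytes).unwrap())
        .collect()
}

// naive_split("retrieval quality", 7) => ["retriev", "al qual", "ity"]
// both words are mangled; chunkedrs backs off to the nearest space instead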

| Feature | chunkedrs | text-splitter | Manual |
|---|---|---|---|
| Token-accurate limits | Yes (tiktoken) | Character-based | No |
| Recursive splitting | Yes | Yes | DIY |
| Markdown-aware | Yes (section metadata) | No | DIY |
| Semantic splitting | Yes (via embedrs) | No | DIY |
| Byte offsets | Yes | No | DIY |
| Token count per chunk | Yes | No | DIY |
| Overlap support | Token-level | Character-level | DIY |
| Tokenizer selection | Model name or encoding | N/A | N/A |

Strategies

| Strategy | Use case | Speed |
|---|---|---|
| Recursive (default) | General text — paragraphs, sentences, words | Fastest |
| Markdown | Documents with # headers — preserves section metadata | Fast |
| Semantic | High-quality RAG — splits at meaning boundaries via embeddings | Slower (API calls) |
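
All three strategies share the same builder; selecting one is a single call (a condensed sketch in which text and client stand in for your input and embedding client; each strategy is shown in full below):

// recursive is the default; no extra call needed
let chunks = chunkedrs::chunk(text).split();

// markdown-aware splitting
let chunks = chunkedrs::chunk(text).markdown().split();

// semantic splitting (requires the `semantic` feature and an async context)
let chunks = chunkedrs::chunk(text).semantic(&client).split_async().await?;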

Quick start

Add to your Cargo.toml:

[dependencies]
chunkedrs = "1"

Split text with defaults (recursive, 512 max tokens, no overlap):

use chunkedrs::Chunk;

let chunks: Vec<Chunk> = chunkedrs::chunk("your long text here...").split();
for chunk in &chunks {
    println!("[{}] {} tokens (bytes {}..{})", chunk.index, chunk.token_count, chunk.start_byte, chunk.end_byte);
}
// Output:
// [0] 4 tokens (bytes 0..21)

Token-accurate splitting

let chunks = chunkedrs::chunk("your long text here...")
    .max_tokens(256)
    .overlap(50)
    .model("gpt-4o")
    .split();

// every chunk is guaranteed to have <= 256 tokens
assert!(chunks.iter().all(|c| c.token_count <= 256));

Markdown-aware splitting

let markdown = "# Intro\n\nSome text.\n\n## Details\n\nMore text here.\n";
let chunks = chunkedrs::chunk(markdown).markdown().split();

// each chunk knows which section it belongs to
assert_eq!(chunks[0].section.as_deref(), Some("# Intro"));
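
The section field enables a common RAG pattern: prepend the header to each chunk before embedding, so retrieval preserves document structure. A sketch of the pattern (not a chunkedrs API):

let texts: Vec<String> = chunks
    .iter()
    .map(|c| match c.section.as_deref() {
        // keep the header with its chunk so the embedding sees the section context
        Some(header) => format!("{}\n\n{}", header, c.content),
        None => c.content.clone(),
    })
    .collect();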

Semantic splitting

With the semantic feature enabled, split at meaning boundaries using embeddings:

[dependencies]
chunkedrs = { version = "1", features = ["semantic"] }

Then:

let client = embedrs::openai("sk-...");
let chunks = chunkedrs::chunk("your long text here...")
    .semantic(&client)
    .threshold(0.5)
    .split_async()
    .await?;
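
Note that split_async returns a future, so it must be awaited inside an async runtime (for example, tokio); the recursive and markdown strategies stay synchronous via split().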

Chunk metadata

Every Chunk carries rich metadata:

pub struct Chunk {
    pub content: String,         // the text
    pub index: usize,            // position in sequence
    pub start_byte: usize,       // byte offset in original text
    pub end_byte: usize,         // byte offset (exclusive)
    pub token_count: usize,      // exact token count
    pub section: Option<String>, // markdown header (if applicable)
}
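
Since start_byte and end_byte index into the original input, each chunk should round-trip back to a verbatim slice of the source (a sanity-check sketch; the verbatim-slice behavior is an assumption implied by the offsets):

let text = "your long text here...";
let chunks = chunkedrs::chunk(text).split();
for c in &chunks {
    // assumed invariant: a chunk's content equals its byte slice of the input
    assert_eq!(c.content, &text[c.start_byte..c.end_byte]);
}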

Overlap

Token overlap between consecutive chunks preserves context at boundaries — critical for retrieval quality:

let chunks = chunkedrs::chunk("your long text here...")
    .max_tokens(256)
    .overlap(50)
    .split();
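
Even with overlap, the budget guarantee holds: no chunk exceeds max_tokens, so the shared tokens presumably count against each chunk's own limit. A quick sanity check built on the metadata guarantees above:

for (i, c) in chunks.iter().enumerate() {
    assert_eq!(c.index, i);        // indices are sequential positions
    assert!(c.token_count <= 256); // overlap never pushes a chunk over budget
}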

Tokenizer selection

// auto-detect from model name
let chunks = chunkedrs::chunk(text).model("gpt-4o").split();

// or specify encoding directly
let chunks = chunkedrs::chunk(text).encoding("cl100k_base").split();

// default: o200k_base (GPT-4o and newer OpenAI models)

Ecosystem

chunkedrs is part of airs (AI in Rust Series) — a family of crates for AI infrastructure:

| Crate | Description |
|---|---|
| tiktoken | High-performance BPE tokenizer for all major LLMs |
| embedrs | Unified embedding — cloud APIs + local inference through one interface |
| instructors | Type-safe structured output extraction from LLMs |
| chunkedrs | Token-accurate text chunking (this crate) |

License

MIT
