# chunkedrs
Token-accurate text chunking for RAG pipelines — recursive, markdown-aware, and semantic splitting. Built on tiktoken, the fastest pure-Rust BPE tokenizer.
## Highlights
- Token-accurate — every chunk is guaranteed within your token budget, not character-approximate
- 3 strategies — recursive (fast, general), markdown-aware (preserves headers), semantic (embedding-based breakpoints)
- Rich metadata — byte offsets, token counts, and section headers on every chunk
- Overlap — configurable token overlap between chunks for retrieval context preservation
- Any tokenizer — auto-detect from model name (`gpt-4o`, `claude`, `llama`) or specify encoding directly
- Built on tiktoken — the fastest pure-Rust BPE tokenizer, with 9 encodings across all major LLMs
## Why chunkedrs?
RAG pipelines need text split into chunks that fit model context windows. Naive splitting (by character count or fixed size) breaks mid-word, mid-sentence, or mid-paragraph — destroying meaning and hurting retrieval quality.
chunkedrs splits at semantic boundaries (paragraphs, sentences, words) while enforcing exact token limits. No chunk ever exceeds `max_tokens`.
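To make that concrete, here is a minimal, hypothetical sketch of recursive boundary splitting (illustrative only, not chunkedrs internals): try the coarsest separator first, and only descend to finer ones when a piece still exceeds the token budget.

```rust
// Illustrative sketch: whitespace word count stands in for real BPE token
// counting, and separators are dropped for brevity.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

fn split_recursive(text: &str, max_tokens: usize, seps: &[&str]) -> Vec<String> {
    // piece fits the budget, or no coarser boundary left to try
    if count_tokens(text) <= max_tokens || seps.is_empty() {
        // a real implementation would hard-split at the token level here
        return vec![text.to_string()];
    }
    text.split(seps[0])
        .filter(|p| !p.trim().is_empty())
        .flat_map(|piece| split_recursive(piece, max_tokens, &seps[1..]))
        .collect()
}

fn main() {
    let text = "First paragraph.\n\nSecond paragraph. It has two sentences.";
    // paragraphs first, then sentences, then words
    for c in split_recursive(text, 4, &["\n\n", ". ", " "]) {
        println!("{:?} ({} tokens)", c, count_tokens(&c));
    }
}
```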
| Feature | chunkedrs | text-splitter | Manual |
|---|---|---|---|
| Token-accurate limits | Yes (tiktoken) | Character-based | No |
| Recursive splitting | Yes | Yes | DIY |
| Markdown-aware | Yes (section metadata) | No | DIY |
| Semantic splitting | Yes (via embedrs) | No | DIY |
| Byte offsets | Yes | No | DIY |
| Token count per chunk | Yes | No | DIY |
| Overlap support | Token-level | Character-level | DIY |
| Tokenizer selection | Model name or encoding | N/A | N/A |
## Strategies

| Strategy | Use case | Speed |
|---|---|---|
| Recursive (default) | General text — paragraphs, sentences, words | Fastest |
| Markdown | Documents with `#` headers — preserves section metadata | Fast |
| Semantic | High-quality RAG — splits at meaning boundaries via embeddings | Slower (API calls) |
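All three strategies go through the same builder; side by side, using the same calls as the sections below (assuming `text` and, for semantic, an embedrs `client` are in scope):

```rust
// recursive (default): no extra configuration
let general = chunkedrs::chunk(text).split();

// markdown-aware: preserves section headers as metadata
let sectioned = chunkedrs::chunk(text).markdown().split();

// semantic: requires the `semantic` feature and an async context
let meaningful = chunkedrs::chunk(text).semantic(&client).split_async().await?;
```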
## Quick start

Add to your `Cargo.toml`:

```toml
[dependencies]
chunkedrs = "1"
```
Split text with the defaults (recursive, 512 max tokens, no overlap):

```rust
use chunkedrs::Chunk;

let chunks: Vec<Chunk> = chunkedrs::chunk("your long text here...").split();

for chunk in &chunks {
    println!("[{}] {} tokens (bytes {}..{})", chunk.index, chunk.token_count, chunk.start_byte, chunk.end_byte);
}
// Output:
// [0] 4 tokens (bytes 0..21)
```
## Token-accurate splitting

```rust
let chunks = chunkedrs::chunk("your long text here...")
    .max_tokens(256)
    .overlap(50)
    .model("gpt-4o")
    .split();

// every chunk is guaranteed to have <= 256 tokens
assert!(chunks.iter().all(|c| c.token_count <= 256));
```
## Markdown-aware splitting

```rust
let markdown = "# Intro\n\nSome text.\n\n## Details\n\nMore text here.\n";
let chunks = chunkedrs::chunk(markdown).markdown().split();

// each chunk knows which section it belongs to
assert_eq!(chunks[0].section.as_deref(), Some("# Intro"));
```
## Semantic splitting

With the `semantic` feature enabled, split at meaning boundaries using embeddings:

```toml
[dependencies]
chunkedrs = { version = "1", features = ["semantic"] }
```

```rust
let client = embedrs::openai("sk-...");

// requires an async context; split_async yields a Result
let chunks = chunkedrs::chunk("your long text here...")
    .semantic(&client)
    .threshold(0.5)
    .split_async()
    .await?;
```
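Embedding-based splitters of this kind typically embed each sentence and start a new chunk wherever similarity between adjacent sentences falls below the threshold. A self-contained sketch of that idea with toy 2-D vectors (illustrative only, not chunkedrs internals):

```rust
// cosine similarity between two embedding vectors
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// indices where a new chunk should start
fn breakpoints(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    let mut cuts = Vec::new();
    for i in 1..embeddings.len() {
        // low similarity between neighbors => meaning shift => new chunk
        if cosine(&embeddings[i - 1], &embeddings[i]) < threshold {
            cuts.push(i);
        }
    }
    cuts
}

fn main() {
    // toy "embeddings": sentences 0 and 1 are similar, sentence 2 is not
    let embs = vec![vec![1.0, 0.0], vec![0.9, 0.1], vec![0.0, 1.0]];
    println!("{:?}", breakpoints(&embs, 0.5)); // [2]
}
```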
## Chunk metadata

Every `Chunk` carries rich metadata:

```rust
pub struct Chunk {
    pub content: String,         // the text
    pub index: usize,            // position in sequence
    pub start_byte: usize,       // byte offset in original text
    pub end_byte: usize,         // byte offset (exclusive)
    pub token_count: usize,      // exact token count
    pub section: Option<String>, // markdown header (if applicable)
}
```
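One handy consequence, assuming `content` is stored verbatim and the offsets index into the original input (as the field docs above suggest): any chunk can be mapped back to its source span.

```rust
let text = "your long text here...";
let chunks = chunkedrs::chunk(text).split();

for c in &chunks {
    // start_byte..end_byte should recover the chunk's text from the input
    assert_eq!(&text[c.start_byte..c.end_byte], c.content);
}
```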
## Overlap

Token overlap between consecutive chunks preserves context at boundaries — critical for retrieval quality:

```rust
let chunks = chunkedrs::chunk("your long text here...")
    .max_tokens(256)
    .overlap(50)
    .split();
```
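To see what the overlap does, here is a hypothetical token-window sketch (not chunkedrs internals): each window restarts `overlap` tokens before the previous one ended, so boundary context lands in both neighboring chunks.

```rust
fn overlapping_windows<'a>(tokens: &[&'a str], max: usize, overlap: usize) -> Vec<Vec<&'a str>> {
    assert!(overlap < max, "overlap must be smaller than the window");
    let step = max - overlap;
    let mut windows = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + max).min(tokens.len());
        windows.push(tokens[start..end].to_vec());
        if end == tokens.len() {
            break;
        }
        start += step; // next window starts `overlap` tokens early
    }
    windows
}

fn main() {
    let tokens: Vec<&str> = "a b c d e f g h".split(' ').collect();
    for w in overlapping_windows(&tokens, 5, 2) {
        println!("{:?}", w);
    }
    // ["a", "b", "c", "d", "e"]
    // ["d", "e", "f", "g", "h"]
}
```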
## Tokenizer selection

```rust
// auto-detect from model name
let chunks = chunkedrs::chunk(text).model("gpt-4o").split();

// or specify an encoding directly
let chunks = chunkedrs::chunk(text).encoding("cl100k_base").split();

// default: o200k_base (the GPT-4o encoding)
```
## Ecosystem
chunkedrs is part of airs (AI in Rust Series) — a family of crates for AI infrastructure:
| Crate | Description |
|---|---|
| tiktoken | High-performance BPE tokenizer for all major LLMs |
| embedrs | Unified embedding — cloud APIs + local inference through one interface |
| instructors | Type-safe structured output extraction from LLMs |
| chunkedrs | Token-accurate text chunking (this crate) |
## License
MIT