A fast multilingual sentence splitter written in Rust. Provides both a library crate and a CLI binary.
- Language-aware splitting with dedicated pipelines for German, Chinese, and Sanskrit
- 244+ language fallback via sentencex (rule-based, no ML models)
- Text preprocessing -- normalizes whitespace, punctuation, and language-specific quirks before splitting
- Zero empty results -- output is always trimmed and filtered
| Language | Code | Splitter | Preprocessing |
|---|---|---|---|
| German | de | sentencex | Quote normalization (»«„" -> "), date protection (15. Januar won't split), default |
| Chinese | zh | Regex on 。?!.?! | CJK symbol/bracket stripping |
| Sanskrit | sa | Regex on dandas । ॥ and ?! | Default |
| Everything else | en, fr, ru, ... | sentencex | Default (collapse spaces, commas, dashes) |
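
The two regex-based rows above can be approximated with a plain character scan. The following is a minimal sketch, not the crate's implementation: `split_on_terminators` and both constants are hypothetical names, the terminator sets are read off the table (the full-width `？` and `！` in the Chinese set are an assumption), and the real splitters use regexes that may treat edge cases differently.

```rust
/// Hypothetical helper: cut `text` after each run of terminator characters.
/// A plain character scan standing in for the crate's regex splitters.
fn split_on_terminators(text: &str, terminators: &[char]) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    let mut chars = text.chars().peekable();

    while let Some(c) = chars.next() {
        current.push(c);
        if terminators.contains(&c) {
            // Keep consecutive terminators (e.g. "?!") attached to the same sentence.
            while let Some(&next) = chars.peek() {
                if terminators.contains(&next) {
                    current.push(next);
                    chars.next();
                } else {
                    break;
                }
            }
            let trimmed = current.trim();
            if !trimmed.is_empty() {
                sentences.push(trimmed.to_string());
            }
            current.clear();
        }
    }

    // Text after the last terminator still counts as a sentence.
    let tail = current.trim();
    if !tail.is_empty() {
        sentences.push(tail.to_string());
    }
    sentences
}

// Assumed terminator sets: ideographic full stop, full-width ? and !, ASCII . ? !
const CHINESE_TERMINATORS: &[char] = &['\u{3002}', '\u{FF1F}', '\u{FF01}', '.', '?', '!'];
// Assumed terminator sets: danda, double danda, ASCII ? !
const SANSKRIT_TERMINATORS: &[char] = &['\u{0964}', '\u{0965}', '?', '!'];
```

Treating a bare `.` as a terminator also splits decimals such as `3.14`, which is one reason a real splitter needs more context than this scan has.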
The default preprocessing normalizes double spaces, repeated commas, and double dashes (to em-dash) across all languages except Chinese, which has its own pipeline.
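
As an illustration of the default normalization just described, here is a small sketch using only the standard library. `default_preprocess` is a hypothetical name, and the exact rules (for example, how runs of three or more dashes are handled) are assumptions rather than the crate's documented behavior.

```rust
/// Hypothetical sketch of the default normalization: collapse repeated
/// spaces and commas, and turn a double dash into an em dash.
fn default_preprocess(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut chars = text.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            // Collapse a run of identical spaces or commas into one character.
            ' ' | ',' => {
                while chars.peek() == Some(&c) {
                    chars.next();
                }
                out.push(c);
            }
            // Replace "--" with an em dash (U+2014); a lone '-' is kept as is.
            '-' if chars.peek() == Some(&'-') => {
                chars.next();
                out.push('\u{2014}');
            }
            _ => out.push(c),
        }
    }
    out
}
```

Under these assumptions, `default_preprocess("Hello,,  world -- ok")` yields `Hello, world` followed by a space, U+2014, and ` ok`.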
Requires Rust >= 1.91.
cargo install --path .

sentenza <LANG> [TEXT]
echo "text" | sentenza <LANG>
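
The two invocation forms above suggest that the binary takes the text from the optional `TEXT` argument and falls back to stdin otherwise. A minimal sketch of that choice follows; `resolve_text` is a hypothetical helper and is not taken from `main.rs`.

```rust
use std::io::{self, Read};

/// Hypothetical helper: use the TEXT argument when given, otherwise read
/// everything from the supplied reader (stdin in the real binary).
fn resolve_text(arg: Option<String>, mut input: impl Read) -> io::Result<String> {
    match arg {
        Some(text) => Ok(text),
        None => {
            let mut buf = String::new();
            input.read_to_string(&mut buf)?;
            Ok(buf)
        }
    }
}
```

In a `main` function this could be called as `resolve_text(std::env::args().nth(2), std::io::stdin())`, with the language code read from `std::env::args().nth(1)`.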
Text as argument:
$ sentenza en "Hello world. How are you? I am fine."
Hello world.
How are you?
I am fine.

German with date protection:
$ sentenza de "Am 15. Januar war es kalt. Es schneite."
Am 15. Januar war es kalt.
Es schneite.

Chinese:
$ sentenza zh "你好世界。今天天气怎么样?非常好!"
你好世界。
今天天气怎么样?
非常好!

Sanskrit:
$ sentenza sa "धर्मक्षेत्रे कुरुक्षेत्रे। समवेता युयुत्सवः॥"
धर्मक्षेत्रे कुरुक्षेत्रे।
समवेता युयुत्सवः॥

Pipe from stdin:
$ echo "Bonjour le monde. Comment allez-vous?" | sentenza fr
Bonjour le monde.
Comment allez-vous?

Add to your Cargo.toml:
[dependencies]
sentenza = { path = "../sentenza" }

use sentenza::split_sentences;
let sentences = split_sentences("Hello world. How are you?", "en");
assert_eq!(sentences, vec!["Hello world.", "How are you?"]);
// German dates don't cause false splits
let de = split_sentences("Am 15. Januar war es kalt. Es schneite.", "de");
assert_eq!(de.len(), 2);
// Chinese with CJK bracket cleanup
let zh = split_sentences("「你好」世界。再见!", "zh");
assert_eq!(zh.len(), 2);

split_sentences(text, lang)
|
v
preprocess(text, lang) # language-specific text normalization
|
v
languages::split(text, lang) # dispatch to splitter
|-- "zh" -> chinese::split (regex)
|-- "sa" -> sanskrit::split (regex)
|-- "de" -> german::split (sentencex)
|-- _ -> fallback::split (sentencex)
|
v
languages::postprocess(sentences, lang)
|-- "de" -> restore date placeholders
|-- _ -> passthrough
|
v
trim + filter empty
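
The German path in the flow above (protect dates, split, restore placeholders, trim and filter) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the month list, and the private-use placeholder character are all hypothetical, and the splitter is passed in as a closure so that the sketch does not depend on sentencex.

```rust
/// Assumed placeholder: a private-use code point unlikely to occur in input.
const DATE_DOT: char = '\u{E000}';

const MONTHS: &[&str] = &[
    "Januar", "Februar", "März", "April", "Mai", "Juni",
    "Juli", "August", "September", "Oktober", "November", "Dezember",
];

/// Replace the period in "<digit>. <Month>" with the placeholder so that a
/// downstream splitter does not see a sentence boundary there.
fn protect_dates(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(pos) = rest.find(". ") {
        let (before, after) = rest.split_at(pos);
        let after_dot = &after[2..]; // text following ". "
        let day_before = before.chars().last().is_some_and(|c| c.is_ascii_digit());
        let month_after = MONTHS.iter().any(|m| after_dot.starts_with(m));
        out.push_str(before);
        out.push(if day_before && month_after { DATE_DOT } else { '.' });
        out.push(' ');
        rest = after_dot;
    }
    out.push_str(rest);
    out
}

/// Put the original periods back after splitting.
fn restore_dates(sentence: &str) -> String {
    sentence.replace(DATE_DOT, ".")
}

/// Protect, split with the supplied splitter, restore, then trim and drop empties.
fn split_with_date_protection<F>(text: &str, splitter: F) -> Vec<String>
where
    F: Fn(&str) -> Vec<String>,
{
    splitter(&protect_dates(text))
        .iter()
        .map(|s| restore_dates(s).trim().to_string())
        .filter(|s| !s.is_empty())
        .collect()
}
```

Passing the splitter as a parameter keeps the protect/restore round trip testable on its own; in the actual crate that role is played by sentencex.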
src/
  lib.rs              # Public API: split_sentences()
  main.rs             # CLI binary
  preprocessing.rs    # Default, German, Chinese preprocessing
  languages/
    mod.rs            # Language dispatch
    fallback.rs       # sentencex passthrough (EN, FR, and all others)
    german.rs         # sentencex + date protection postprocess
    chinese.rs        # Regex splitter
    sanskrit.rs       # Regex splitter (dandas)
cargo test

39 unit/integration tests + 1 doc-test covering all language paths, preprocessing, and edge cases (empty input, whitespace, trimming).
MIT