tlkahn/sentenza

Sentenza

A fast multilingual sentence splitter written in Rust. Provides both a library crate and a CLI binary.

Features

  • Language-aware splitting with dedicated pipelines for German, Chinese, and Sanskrit
  • 244+ language fallback via sentencex (rule-based, no ML models)
  • Text preprocessing -- normalizes whitespace, punctuation, and language-specific quirks before splitting
  • Zero empty results -- output is always trimmed and filtered

Language support

Language        | Code            | Splitter               | Preprocessing
German          | de              | sentencex              | Quote normalization (»«„" -> "), date protection (15. Januar won't split), default
Chinese         | zh              | Regex on 。?!.?!       | CJK symbol/bracket stripping
Sanskrit        | sa              | Regex on dandas and ?! | Default
Everything else | en, fr, ru, ... | sentencex              | Default (collapse spaces, commas, dashes)

The default preprocessing normalizes double spaces, repeated commas, and double dashes (to em-dash) across all languages except Chinese, which has its own pipeline.
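As a rough illustration of the default normalization described above, here is a minimal std-only sketch. The function name `default_preprocess` is hypothetical, and the exact rules (for example, how runs of three or more dashes are handled) are assumptions rather than sentenza's actual behavior.

```rust
/// Hypothetical sketch of a default normalization pass; not sentenza's actual code.
fn default_preprocess(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // Collapse a run of spaces into a single space.
            ' ' => {
                while chars.peek() == Some(&' ') {
                    chars.next();
                }
                out.push(' ');
            }
            // Collapse a run of commas into a single comma.
            ',' => {
                while chars.peek() == Some(&',') {
                    chars.next();
                }
                out.push(',');
            }
            // Turn "--" into an em-dash (U+2014); a lone '-' is kept as-is.
            '-' => {
                if chars.peek() == Some(&'-') {
                    chars.next();
                    out.push('\u{2014}');
                } else {
                    out.push('-');
                }
            }
            other => out.push(other),
        }
    }
    out
}
```

Under these assumptions, `default_preprocess("Wait,,  what -- no")` collapses the doubled comma and space and replaces the double dash with a single U+2014 character.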

Installation

Requires Rust >= 1.91.

cargo install --path .

CLI usage

sentenza <LANG> [TEXT]
echo "text" | sentenza <LANG>

Text as argument:

$ sentenza en "Hello world. How are you? I am fine."
Hello world.
How are you?
I am fine.

German with date protection:

$ sentenza de "Am 15. Januar war es kalt. Es schneite."
Am 15. Januar war es kalt.
Es schneite.
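One way date protection of this kind could work is to swap the period in patterns like "15. Januar" for a placeholder before splitting and restore it afterwards. The sketch below assumes that approach; `protect_dates`, `restore_dates`, the month list, and the placeholder character are all hypothetical and not taken from sentenza's source.

```rust
/// German month names used to recognise dates such as "15. Januar" (illustrative list).
const MONTHS: [&str; 12] = [
    "Januar", "Februar", "März", "April", "Mai", "Juni",
    "Juli", "August", "September", "Oktober", "November", "Dezember",
];

/// Private-use character standing in for a protected period.
/// Assumption: this character never appears in real input.
const PLACEHOLDER: char = '\u{E000}';

/// Replace the period in "<digit>. <Month>" with PLACEHOLDER so a splitter ignores it.
fn protect_dates(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(pos) = rest.find(". ") {
        let (before, after) = rest.split_at(pos);
        let tail = &after[2..]; // text following ". "
        let after_digit = before
            .chars()
            .next_back()
            .is_some_and(|c| c.is_ascii_digit());
        let before_month = MONTHS.iter().any(|m| tail.starts_with(m));
        out.push_str(before);
        out.push(if after_digit && before_month { PLACEHOLDER } else { '.' });
        out.push(' ');
        rest = tail;
    }
    out.push_str(rest);
    out
}

/// Undo `protect_dates` once the text has been split.
fn restore_dates(text: &str) -> String {
    text.replace(PLACEHOLDER, ".")
}
```

With this sketch, the period after "15" is hidden from the splitter while the period after "kalt" stays visible, and `restore_dates` returns the original text.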

Chinese:

$ sentenza zh "你好世界。今天天气怎么样?非常好!"
你好世界。
今天天气怎么样?
非常好!

Sanskrit:

$ sentenza sa "धर्मक्षेत्रे कुरुक्षेत्रे। समवेता युयुत्सवः॥"
धर्मक्षेत्रे कुरुक्षेत्रे।
समवेता युयुत्सवः॥
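The Chinese and Sanskrit examples above both come down to cutting the text after a sentence-terminating character. The README says sentenza uses regexes for this; the sketch below is a std-only stand-in that shows the same idea without the regex crate. The helper name `split_on_terminators` and the choice to keep runs of terminators (such as "?!") together are assumptions.

```rust
/// Hypothetical helper: end a sentence after each run of terminator characters.
fn split_on_terminators(text: &str, terminators: &[char]) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        current.push(c);
        let at_terminator = terminators.contains(&c);
        // Keep consecutive terminators (e.g. "?!") in the same sentence.
        let next_is_terminator = chars.peek().is_some_and(|n| terminators.contains(n));
        if at_terminator && !next_is_terminator {
            let trimmed = current.trim();
            if !trimmed.is_empty() {
                sentences.push(trimmed.to_string());
            }
            current.clear();
        }
    }
    // Keep any trailing text that has no terminator.
    let trimmed = current.trim();
    if !trimmed.is_empty() {
        sentences.push(trimmed.to_string());
    }
    sentences
}
```

For Sanskrit, the terminator set would include the danda (U+0964) and double danda (U+0965); for Chinese, the ideographic full stop (U+3002) plus question and exclamation marks.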

Pipe from stdin:

$ echo "Bonjour le monde. Comment allez-vous?" | sentenza fr
Bonjour le monde.
Comment allez-vous?
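The two invocation forms (text as an argument, or piped through stdin) suggest a simple fallback rule in the binary. The following is a minimal sketch of that rule, not the actual contents of main.rs; `resolve_text` is a hypothetical helper, written against a generic reader so it can be exercised without a real terminal.

```rust
use std::io::{self, Read};

/// Hypothetical helper: use the TEXT argument if present, otherwise read all of `input`.
fn resolve_text<R: Read>(text_arg: Option<&str>, mut input: R) -> io::Result<String> {
    match text_arg {
        Some(text) => Ok(text.to_string()),
        None => {
            let mut buf = String::new();
            input.read_to_string(&mut buf)?;
            Ok(buf)
        }
    }
}
```

In a binary, `text_arg` would come from the second command-line argument and `input` would be `std::io::stdin()`.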

Library usage

Add to your Cargo.toml:

[dependencies]
sentenza = { path = "../sentenza" }

Then call split_sentences from your code:

use sentenza::split_sentences;

let sentences = split_sentences("Hello world. How are you?", "en");
assert_eq!(sentences, vec!["Hello world.", "How are you?"]);

// German dates don't cause false splits
let de = split_sentences("Am 15. Januar war es kalt. Es schneite.", "de");
assert_eq!(de.len(), 2);

// Chinese with CJK bracket cleanup
let zh = split_sentences("「你好」世界。再见!", "zh");
assert_eq!(zh.len(), 2);

Architecture

split_sentences(text, lang)
    |
    v
preprocess(text, lang)          # language-specific text normalization
    |
    v
languages::split(text, lang)    # dispatch to splitter
    |-- "zh" -> chinese::split      (regex)
    |-- "sa" -> sanskrit::split     (regex)
    |-- "de" -> german::split       (sentencex)
    |-- _    -> fallback::split     (sentencex)
    |
    v
languages::postprocess(sentences, lang)
    |-- "de" -> restore date placeholders
    |-- _    -> passthrough
    |
    v
trim + filter empty
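The flow above (dispatch on the language code, then trim and drop empty results) can be sketched in a few lines. This is an illustration under assumptions: `split_stub`, `dispatch`, and `split_pipeline` are hypothetical names, the stub splits naively on terminator characters instead of using regexes or sentencex, and the preprocessing and German postprocessing stages are left out.

```rust
/// Stand-in splitter: cut after any terminator character.
/// The real crate uses regexes (zh, sa) and sentencex (everything else).
fn split_stub(text: &str, terminators: &[char]) -> Vec<String> {
    text.split_inclusive(terminators).map(String::from).collect()
}

/// Dispatch on the language code, mirroring the match arms in the diagram.
fn dispatch(text: &str, lang: &str) -> Vec<String> {
    match lang {
        "zh" => split_stub(text, &['\u{3002}', '?', '!']),
        "sa" => split_stub(text, &['\u{0964}', '\u{0965}', '?', '!']),
        // "de" and all other languages go through sentencex in the real crate.
        _ => split_stub(text, &['.', '?', '!']),
    }
}

/// Final stage: trim every sentence and drop the empty ones.
fn split_pipeline(text: &str, lang: &str) -> Vec<String> {
    dispatch(text, lang)
        .into_iter()
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty())
        .collect()
}
```

Keeping the trim-and-filter step at the very end means every language path gets the "zero empty results" guarantee without each splitter having to enforce it.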

Project structure

src/
  lib.rs                # Public API: split_sentences()
  main.rs               # CLI binary
  preprocessing.rs      # Default, German, Chinese preprocessing
  languages/
    mod.rs              # Language dispatch
    fallback.rs         # sentencex passthrough (EN, FR, and all others)
    german.rs           # sentencex + date protection postprocess
    chinese.rs          # Regex splitter
    sanskrit.rs         # Regex splitter (dandas)

Testing

cargo test

The suite has 39 unit/integration tests plus 1 doc-test, covering all language paths, preprocessing, and edge cases (empty input, whitespace, trimming).

License

MIT
