1 unstable release

0.1.3	Mar 9, 2026

Used in 4 crates

MIT/Apache

5KB
76 lines

🦀 vn-nlp

Vietnamese NLP library in pure Rust: tokenization, normalization, and sentence segmentation.
Zero-copy, `no_std`-compatible (requires `alloc`), built on zero-cost abstractions.

Quick Start

Add to your Cargo.toml:

[dependencies]
vn-nlp = "0.1"

Tokenize

use vn_nlp::tokenize;

let tokens = tokenize("Xin chào Việt Nam!").unwrap();
assert_eq!(tokens[0].text, "Xin");
assert_eq!(tokens[1].text, "chào");
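
To make the "zero-copy" idea concrete, here is a minimal std-only sketch of a tokenizer whose tokens borrow from the input instead of allocating new strings. It is not the crate's implementation: `Token`, its `start` field, and `tokenize_sketch` are hypothetical names, and whether vn-nlp emits punctuation as separate tokens is an assumption. It only splits at whitespace and ASCII punctuation, so it works at the syllable level and does not group multi-syllable words.

```rust
/// Hypothetical token type (not vn-nlp's): `text` borrows from the input,
/// so tokenizing never copies string data.
#[derive(Debug, PartialEq)]
pub struct Token<'a> {
    pub text: &'a str,
    /// Byte offset of the token within the original input.
    pub start: usize,
}

/// Splits at whitespace and emits each ASCII punctuation character
/// as its own token. Illustrative only.
pub fn tokenize_sketch(input: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut word_start: Option<usize> = None;

    for (i, ch) in input.char_indices() {
        let is_punct = ch.is_ascii_punctuation();
        if ch.is_whitespace() || is_punct {
            // Close the word in progress, if any.
            if let Some(s) = word_start.take() {
                tokens.push(Token { text: &input[s..i], start: s });
            }
            if is_punct {
                let end = i + ch.len_utf8();
                tokens.push(Token { text: &input[i..end], start: i });
            }
        } else if word_start.is_none() {
            word_start = Some(i);
        }
    }
    // Flush a trailing word that runs to the end of the input.
    if let Some(s) = word_start {
        tokens.push(Token { text: &input[s..], start: s });
    }
    tokens
}
```

Because indices come from `char_indices`, every slice falls on a UTF-8 character boundary, which matters for multi-byte Vietnamese letters such as `ệ`.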

Normalize

use vn_nlp::normalize;

let clean = normalize::strip_diacritics("Tiếng Việt");
assert_eq!(clean, "Tieng Viet");
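
For readers curious how diacritic stripping can work without external crates, the following is a rough sketch using a lookup table of precomposed Vietnamese letters. It is an illustration under stated assumptions, not vn-nlp's code: `GROUPS`, `strip_char`, and `strip_diacritics_sketch` are hypothetical names, and the crate may well use Unicode decomposition instead.

```rust
/// Base letter paired with its precomposed Vietnamese variants (lowercase).
/// Hypothetical table for illustration; not taken from vn-nlp.
const GROUPS: &[(char, &str)] = &[
    ('a', "àáảãạăằắẳẵặâầấẩẫậ"),
    ('e', "èéẻẽẹêềếểễệ"),
    ('i', "ìíỉĩị"),
    ('o', "òóỏõọôồốổỗộơờớởỡợ"),
    ('u', "ùúủũụưừứửữự"),
    ('y', "ỳýỷỹỵ"),
    ('d', "đ"),
];

/// Maps one character to its unaccented form, or `None` to drop it.
fn strip_char(c: char) -> Option<char> {
    // Drop combining marks so decomposed input (base letter + marks) also works.
    if ('\u{0300}'..='\u{036F}').contains(&c) {
        return None;
    }
    let lower = c.to_lowercase().next().unwrap_or(c);
    for &(base, variants) in GROUPS {
        if variants.contains(lower) {
            // Preserve the case of the original letter.
            return Some(if c.is_uppercase() {
                base.to_ascii_uppercase()
            } else {
                base
            });
        }
    }
    Some(c)
}

/// Removes Vietnamese diacritics, leaving all other characters untouched.
pub fn strip_diacritics_sketch(input: &str) -> String {
    input.chars().filter_map(strip_char).collect()
}
```

Unlike the tokenizer, this step has to allocate a new `String`, since the output bytes differ from the input.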

Sentence Segmentation

use vn_nlp::segment;

let sentences = segment("Hôm nay trời đẹp. Tôi đi chơi.").unwrap();
assert_eq!(sentences.len(), 2);
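
As a rough illustration of rule-based sentence splitting, the sketch below treats `.`, `!`, or `?` as a boundary only when it is followed by whitespace or the end of the input, so decimals such as `3.14` stay intact. The function name `segment_sketch` and the boundary rule are assumptions made for this example; the crate's `segment` may use different rules and returns a `Result`, which this sketch does not model.

```rust
/// Splits text into sentences, returning slices that borrow from `input`.
/// A terminator ('.', '!' or '?') ends a sentence only when followed by
/// whitespace or the end of the input. Illustrative only.
pub fn segment_sketch(input: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = input.char_indices().peekable();

    while let Some((i, ch)) = chars.next() {
        if matches!(ch, '.' | '!' | '?') {
            let at_boundary = match chars.peek() {
                None => true,
                Some(&(_, next)) => next.is_whitespace(),
            };
            if at_boundary {
                let end = i + ch.len_utf8();
                let sentence = input[start..end].trim();
                if !sentence.is_empty() {
                    sentences.push(sentence);
                }
                start = end;
            }
        }
    }
    // Keep any trailing text that lacks a terminator.
    let rest = input[start..].trim();
    if !rest.is_empty() {
        sentences.push(rest);
    }
    sentences
}
```

A rule this simple will still split after abbreviations that end in a period, which is one reason real segmenters carry exception lists.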

Feature Flags

Feature	Default	Description
`tokenize`	✅	Word tokenization
`normalize`	✅	Unicode normalization & diacritics
`segment`	✅	Sentence segmentation
`dictionary`	❌	Dictionary-based word segmentation

# Use only the tokenizer
vn-nlp = { version = "0.1", default-features = false, features = ["tokenize"] }

Documentation

License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE)
MIT License (LICENSE-MIT)

at your option.

Dependencies

~120–480KB
~11K SLoC