1 unstable release
| 0.1.3 | Mar 9, 2026 |
|---|---|
🦀 vn-nlp
Vietnamese NLP library in pure Rust — tokenization, normalization, segmentation.
Zero-copy, no_std compatible (with alloc), zero-cost abstractions.
Quick Start
Add to your Cargo.toml:
[dependencies]
vn-nlp = "0.1"
Tokenize
use vn_nlp::tokenize;
let tokens = tokenize("Xin chào Việt Nam!").unwrap();
assert_eq!(tokens[0].text, "Xin");
assert_eq!(tokens[1].text, "chào");
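To make the "zero-copy" claim concrete, here is a minimal sketch of a tokenizer whose tokens borrow from the input string instead of allocating. It is not the crate's implementation: `Token`, its `start` field, and `tokenize_sketch` are hypothetical names chosen for illustration, and the sketch returns a plain `Vec` where the crate's `tokenize` returns a `Result`.

```rust
/// Hypothetical token type (an assumption, not the crate's type):
/// `text` is a slice of the original input, so no string data is copied.
#[derive(Debug, PartialEq)]
pub struct Token<'a> {
    pub text: &'a str,
    /// Byte offset of the token within the original input.
    pub start: usize,
}

/// Splits on whitespace and emits each ASCII punctuation character
/// as its own token. Real Vietnamese tokenization needs more than this.
pub fn tokenize_sketch(input: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut word_start: Option<usize> = None;

    for (i, ch) in input.char_indices() {
        if ch.is_whitespace() || ch.is_ascii_punctuation() {
            // Close the word that was in progress, if any.
            if let Some(s) = word_start.take() {
                tokens.push(Token { text: &input[s..i], start: s });
            }
            if ch.is_ascii_punctuation() {
                let end = i + ch.len_utf8();
                tokens.push(Token { text: &input[i..end], start: i });
            }
        } else if word_start.is_none() {
            word_start = Some(i);
        }
    }

    // Flush a trailing word that runs to the end of the input.
    if let Some(s) = word_start {
        tokens.push(Token { text: &input[s..], start: s });
    }
    tokens
}
```

Because every `Token` holds a `&str` into the input, the tokens cannot outlive the string they were produced from; the borrow checker enforces that at compile time.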
Normalize
use vn_nlp::normalize;
let clean = normalize::strip_diacritics("Tiếng Việt");
assert_eq!(clean, "Tieng Viet");
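For readers curious how diacritic stripping can work without external dependencies, the following is a rough sketch using only the standard library. It is an assumption about one possible approach, not the crate's actual code: `base_letter` and `strip_diacritics_sketch` are hypothetical names. It maps precomposed Vietnamese letters to their ASCII base by code point range and drops combining marks so that decomposed input is handled too.

```rust
/// Maps a precomposed Vietnamese letter to its unaccented ASCII base,
/// preserving case. Returns `None` for any other character.
fn base_letter(ch: char) -> Option<char> {
    let lower = match ch as u32 {
        0x00C0..=0x00C3 | 0x00E0..=0x00E3 | 0x0102 | 0x0103 | 0x1EA0..=0x1EB7 => 'a',
        0x00C8..=0x00CA | 0x00E8..=0x00EA | 0x1EB8..=0x1EC7 => 'e',
        0x00CC | 0x00CD | 0x00EC | 0x00ED | 0x0128 | 0x0129 | 0x1EC8..=0x1ECB => 'i',
        0x00D2..=0x00D5 | 0x00F2..=0x00F5 | 0x01A0 | 0x01A1 | 0x1ECC..=0x1EE3 => 'o',
        0x00D9 | 0x00DA | 0x00F9 | 0x00FA | 0x0168 | 0x0169 | 0x01AF | 0x01B0
        | 0x1EE4..=0x1EF1 => 'u',
        0x00DD | 0x00FD | 0x1EF2..=0x1EF9 => 'y',
        0x0110 | 0x0111 => 'd',
        _ => return None,
    };
    Some(if ch.is_uppercase() { lower.to_ascii_uppercase() } else { lower })
}

/// Strips Vietnamese diacritics from `input`.
/// Combining marks (U+0300..=U+036F) are removed, so decomposed (NFD)
/// text works as well as precomposed (NFC) text.
pub fn strip_diacritics_sketch(input: &str) -> String {
    input
        .chars()
        .filter(|ch| !('\u{0300}'..='\u{036F}').contains(ch))
        .map(|ch| base_letter(ch).unwrap_or(ch))
        .collect()
}
```

Unlike the tokenizer, this operation has to allocate a new `String`, since the output bytes differ from the input.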
Sentence Segmentation
use vn_nlp::segment;
let sentences = segment("Hôm nay trời đẹp. Tôi đi chơi.").unwrap();
assert_eq!(sentences.len(), 2);
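As an illustration of the idea behind sentence segmentation, here is a naive rule-based sketch: a sentence ends at `.`, `!`, or `?` when that character is followed by whitespace or the end of the input. The function name `segment_sketch` and the rule itself are assumptions for illustration; the crate's `segment` may use different rules and returns a `Result`.

```rust
/// Splits `input` into sentences, returning trimmed slices of the input.
/// A terminator only counts when followed by whitespace or end of input,
/// so a decimal such as "3.5" is not split.
pub fn segment_sketch(input: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = input.char_indices().peekable();

    while let Some((i, ch)) = chars.next() {
        if matches!(ch, '.' | '!' | '?') {
            let at_boundary = chars
                .peek()
                .map_or(true, |&(_, next)| next.is_whitespace());
            if at_boundary {
                let end = i + ch.len_utf8();
                let sentence = input[start..end].trim();
                if !sentence.is_empty() {
                    sentences.push(sentence);
                }
                start = end;
            }
        }
    }

    // Keep any trailing text that lacks a terminator.
    let tail = input[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}
```

A rule this simple will mis-split abbreviations such as "TP. Hồ Chí Minh"; handling those typically requires an abbreviation list or a trained model.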
Feature Flags
| Feature | Default | Description |
|---|---|---|
| tokenize | ✅ | Word tokenization |
| normalize | ✅ | Unicode normalization & diacritics |
| segment | ✅ | Sentence segmentation |
| dictionary | ❌ | Dictionary-based word segmentation |
# Use only the tokenizer
vn-nlp = { version = "0.1", default-features = false, features = ["tokenize"] }
Documentation
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.