#nlp #tokenize #vietnamese #linguistics

vn-nlp-core

Core types, traits, and errors for vn-nlp

1 unstable release

0.1.3 Mar 9, 2026

#16 in #vietnamese


Used in 4 crates

MIT/Apache

5KB
76 lines

🦀 vn-nlp

License: MIT OR Apache-2.0

Vietnamese NLP library in pure Rust — tokenization, normalization, segmentation.
Zero-copy, no_std compatible (with alloc), zero-cost abstractions.

Quick Start

Add to your Cargo.toml:

[dependencies]
vn-nlp = "0.1"

Tokenize

use vn_nlp::tokenize;

let tokens = tokenize("Xin chào Việt Nam!").unwrap();
assert_eq!(tokens[0].text, "Xin");
assert_eq!(tokens[1].text, "chào");
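To make the "zero-copy" claim concrete, here is a minimal sketch of a tokenizer whose tokens borrow slices of the input instead of allocating new strings. It is not the crate's implementation: `Token`, its `start` field, and `tokenize_sketch` are hypothetical names, and the splitting rule (whitespace plus ASCII punctuation, one token per syllable) is an assumed simplification.

```rust
/// Hypothetical token type (not the crate's): `text` borrows from the input,
/// so no string data is copied.
#[derive(Debug, PartialEq)]
pub struct Token<'a> {
    pub text: &'a str,
    /// Byte offset of the token's first byte in the input.
    pub start: usize,
}

/// Splits on whitespace and emits each ASCII punctuation character as its
/// own token. Assumed rule for illustration only.
pub fn tokenize_sketch(input: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut word_start: Option<usize> = None;

    for (i, ch) in input.char_indices() {
        let is_punct = ch.is_ascii_punctuation();
        if ch.is_whitespace() || is_punct {
            // Close the word in progress, if any.
            if let Some(start) = word_start.take() {
                tokens.push(Token { text: &input[start..i], start });
            }
            if is_punct {
                tokens.push(Token { text: &input[i..i + ch.len_utf8()], start: i });
            }
        } else if word_start.is_none() {
            word_start = Some(i);
        }
    }
    // Flush a trailing word that runs to the end of the input.
    if let Some(start) = word_start {
        tokens.push(Token { text: &input[start..], start });
    }
    tokens
}
```

Because each `Token` holds a `&str` into the original buffer, the only allocation is the `Vec` itself, which is consistent with a `no_std` plus `alloc` setup.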

Normalize

use vn_nlp::normalize;

let clean = normalize::strip_diacritics("Tiếng Việt");
assert_eq!(clean, "Tieng Viet");
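As an illustration of what diacritic stripping involves, the following sketch maps each precomposed Vietnamese letter back to its base Latin letter using a small lookup table. It is a standalone approximation under stated assumptions, not the crate's code: `strip_diacritics_sketch` and `base_of` are hypothetical names, and the input is assumed to be in precomposed (NFC) form, so decomposed combining marks are not handled.

```rust
/// Base letter paired with its precomposed Vietnamese variants (lowercase).
const VARIANTS: &[(char, &str)] = &[
    ('a', "àáảãạăằắẳẵặâầấẩẫậ"),
    ('e', "èéẻẽẹêềếểễệ"),
    ('i', "ìíỉĩị"),
    ('o', "òóỏõọôồốổỗộơờớởỡợ"),
    ('u', "ùúủũụưừứửữự"),
    ('y', "ỳýỷỹỵ"),
    ('d', "đ"),
];

/// Returns the base letter for a lowercase character, or the character
/// itself when it is not in the table.
fn base_of(lower: char) -> char {
    for &(base, variants) in VARIANTS {
        if variants.contains(lower) {
            return base;
        }
    }
    lower
}

/// Hypothetical helper: strips Vietnamese diacritics from NFC input.
pub fn strip_diacritics_sketch(input: &str) -> String {
    input
        .chars()
        .map(|ch| {
            if ch.is_uppercase() {
                // Look up the lowercase form, then restore the case.
                let lower = ch.to_lowercase().next().unwrap_or(ch);
                let base = base_of(lower);
                if base != lower { base.to_ascii_uppercase() } else { ch }
            } else {
                base_of(ch)
            }
        })
        .collect()
}
```

A table-driven approach like this avoids a dependency on a Unicode normalization library, at the cost of ignoring input that uses combining marks.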

Sentence Segmentation

use vn_nlp::segment;

let sentences = segment("Hôm nay trời đẹp. Tôi đi chơi.").unwrap();
assert_eq!(sentences.len(), 2);
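For intuition about sentence segmentation, here is a naive sketch that ends a sentence at `.`, `!`, or `?` when the mark is followed by whitespace or the end of the input. The function name `split_sentences_sketch` and the boundary rule are assumptions for illustration; the crate's actual rules (for abbreviations, quotes, and so on) are not described in this README.

```rust
/// Hypothetical helper: splits text into sentences, borrowing from the input.
pub fn split_sentences_sketch(input: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = input.char_indices().peekable();

    while let Some((i, ch)) = chars.next() {
        if matches!(ch, '.' | '!' | '?') {
            // Only treat the mark as a boundary when whitespace or the end
            // of the input follows, so "3.14" is not split.
            let at_boundary = match chars.peek() {
                None => true,
                Some(&(_, next)) => next.is_whitespace(),
            };
            if at_boundary {
                let end = i + ch.len_utf8();
                let sentence = input[start..end].trim();
                if !sentence.is_empty() {
                    sentences.push(sentence);
                }
                start = end;
            }
        }
    }
    // Keep any trailing text that lacks a terminal mark.
    let tail = input[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}
```

Like the tokenizer example, the result holds slices of the original string, so no sentence text is copied.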

Feature Flags

Feature      Default   Description
tokenize               Word tokenization
normalize              Unicode normalization & diacritics
segment                Sentence segmentation
dictionary             Dictionary-based word segmentation
# Use only the tokenizer
vn-nlp = { version = "0.1", default-features = false, features = ["tokenize"] }

Documentation

License

Licensed under either of the MIT license or the Apache License, Version 2.0, at your option.

Dependencies

~120–480KB
~11K SLoC