MCE — Morphological Computation Engine

Browser-first Finnish NLP engine -- morphological analysis, POS tagging, spell checking, grammar checking, hyphenation, compound analysis, and morphological generation, all running offline in WebAssembly.

MCE uses a mathematically grounded architecture: a Writer Comonad for morphophonological rules, Constraint Grammar for disambiguation, and a suffix-based statistical tagger. Together they reach 94.66% UPOS accuracy (UD Finnish-TDT dev) with a ~380KB WASM binary (~9.2MB including dictionary and model) and no server.

Why MCE?

MCE occupies an unusual position in the Finnish NLP landscape. To our knowledge, no other system combines all five of these properties:

Browser-first, no server. MCE compiles to a ~380KB WASM module. The dictionary and model are fetched once from a CDN (~9.2MB, gzip ~2-3MB) and cached in IndexedDB. After first load, everything runs offline with zero network dependency. To our knowledge, no other Finnish NLP tool runs in the browser.

Mathematical foundation. Morphophonological rules (consonant gradation, vowel harmony, boundary effects) are expressed as coKleisli arrows over a Writer Comonad with a DeletionSet monoid. To our knowledge, this is the first production NLP system to use comonadic composition for morphophonology, ensuring that rules compose purely without mutation or sentinel characters.

94.66% UPOS in 9.2MB. MCE achieves near-neural accuracy on the UD Finnish-TDT dev split through a hybrid pipeline: 62 Constraint Grammar rules prune impossible readings, a suffix-based statistical tagger provides emission scores, and Viterbi decoding selects the optimal POS sequence. Neural systems (TurkuNLP, Trankit) score 2-4pp higher but require 100-1000x more storage and a GPU server.
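
The last stage of that pipeline, Viterbi decoding, can be sketched as follows. This is a minimal illustration and not MCE's code: the `Candidate` type, the `viterbi` function, and all scores are assumptions made up for the example. Each token has the candidate tags that survived pruning, each candidate has an emission log-score, and a transition function scores adjacent tag pairs.

```typescript
// One candidate reading for a token: a POS tag plus a log-score for emitting it.
interface Candidate {
  tag: string;
  emission: number;
}

// Returns the highest-scoring tag sequence through the lattice.
// Assumes every position has at least one candidate.
function viterbi(
  lattice: Candidate[][],
  transition: (prev: string, next: string) => number
): string[] {
  if (lattice.length === 0) return [];
  // prevScores[k] = best score of any path ending in candidate k of the previous token
  let prevScores: number[] = lattice[0].map((c) => c.emission);
  const back: number[][] = [];
  for (let i = 1; i < lattice.length; i++) {
    const scores: number[] = [];
    const pointers: number[] = [];
    for (const cand of lattice[i]) {
      let bestScore = -Infinity;
      let bestPrev = 0;
      for (let k = 0; k < lattice[i - 1].length; k++) {
        const s = prevScores[k] + transition(lattice[i - 1][k].tag, cand.tag);
        if (s > bestScore) {
          bestScore = s;
          bestPrev = k;
        }
      }
      scores.push(bestScore + cand.emission);
      pointers.push(bestPrev);
    }
    back.push(pointers);
    prevScores = scores;
  }
  // Pick the best final candidate, then follow the back-pointers.
  let j = 0;
  for (let k = 1; k < prevScores.length; k++) {
    if (prevScores[k] > prevScores[j]) j = k;
  }
  const tags: string[] = [];
  for (let i = lattice.length - 1; i >= 0; i--) {
    tags.unshift(lattice[i][j].tag);
    if (i > 0) j = back[i - 1][j];
  }
  return tags;
}

// "Koira tuli kotiin": "tuli" is ambiguous between NOUN (fire) and VERB (came).
const demoLattice: Candidate[][] = [
  [{ tag: "NOUN", emission: 0 }],
  [{ tag: "NOUN", emission: -0.7 }, { tag: "VERB", emission: -0.7 }],
  [{ tag: "NOUN", emission: 0 }],
];
// Invented transition scores: NOUN-VERB and VERB-NOUN are cheap, everything else costly.
const demoTransition = (prev: string, next: string): number => {
  if (prev === "NOUN" && next === "VERB") return -0.5;
  if (prev === "VERB" && next === "NOUN") return -0.5;
  return -2.0;
};
const bestPath = viterbi(demoLattice, demoTransition);
console.log(bestPath.join(" ")); // NOUN VERB NOUN
```

With equal emission scores for the two readings of "tuli", the transition scores alone decide the path, which is the situation where sequence decoding helps most.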

Complete writer-tools stack. MCE is not just a POS tagger. A single 9.2MB deployment provides spell checking, spelling suggestions, grammar checking (21 rules), hyphenation, compound word analysis, and morphological generation (11 noun cases, 4 verb conjugation types). To our knowledge, no other tool offers all of these in one package that runs client-side.

Offline-first, privacy-preserving. No text ever leaves the user's device. All computation happens locally in the browser or native runtime. This makes MCE suitable for sensitive documents, air-gapped environments, and privacy-conscious applications.

Features

  • Morphological analysis with full inflection details and POS disambiguation
  • POS tagging at 94.66% UPOS accuracy on UD Finnish-TDT dev (CG + Suffix Tagger)
  • Spell checking with compound word and derivation support
  • Spelling suggestions with context-aware ranking
  • Grammar checking with 21 rule-based checks
  • Hyphenation with compound-aware syllable splitting
  • Compound word analysis with 6 linking morpheme types
  • Morphological generation for nouns (22 forms: 11 singular + 11 plural) and verbs (4 conjugation types, beta -- irregular verbs may produce incorrect forms)
  • Sentence-level disambiguation via Viterbi + Constraint Grammar + Suffix Tagger
  • Writer Comonad pipeline for morphophonological rules (consonant gradation, vowel harmony)
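
To make the Constraint Grammar step in the disambiguation feature more concrete, here is a hedged sketch of a REMOVE-style rule. The `Reading`, `Cohort`, and `RemoveRule` types and the rule itself are hypothetical and do not come from MCE's rule set; the sketch only shows the general idea of discarding a reading when its context matches, while never deleting the last remaining reading.

```typescript
interface Reading {
  lemma: string;
  upos: string;
}

// A token together with all readings the analyzer proposed for it.
interface Cohort {
  form: string;
  readings: Reading[];
}

// REMOVE-style rule: drop readings tagged `target` at position i when `when` holds.
interface RemoveRule {
  target: string;
  when: (sentence: Cohort[], i: number) => boolean;
}

// Applies each rule once, left to right. Contexts are tested against the
// original sentence; a real CG engine would typically iterate to a fixed point.
function applyRemoveRules(sentence: Cohort[], rules: RemoveRule[]): Cohort[] {
  return sentence.map((cohort, i) => {
    let readings = cohort.readings;
    for (const rule of rules) {
      if (!rule.when(sentence, i)) continue;
      const kept = readings.filter((r) => r.upos !== rule.target);
      // A rule never removes the last remaining reading.
      if (kept.length > 0) readings = kept;
    }
    return { form: cohort.form, readings };
  });
}

// Toy rule: remove VERB directly after an unambiguous ADJ.
const noVerbAfterAdj: RemoveRule = {
  target: "VERB",
  when: (s, i) =>
    i > 0 && s[i - 1].readings.length === 1 && s[i - 1].readings[0].upos === "ADJ",
};

// "kuuma tuli" (hot fire): the VERB reading of "tuli" is pruned.
const demoSentence: Cohort[] = [
  { form: "kuuma", readings: [{ lemma: "kuuma", upos: "ADJ" }] },
  {
    form: "tuli",
    readings: [
      { lemma: "tuli", upos: "NOUN" },
      { lemma: "tulla", upos: "VERB" },
    ],
  },
];
const pruned = applyRemoveRules(demoSentence, [noVerbAfterAdj]);
console.log(pruned[1].readings.map((r) => r.upos).join(",")); // NOUN
```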

Use Cases

Browser-Based Writing Tools

  • In-browser Finnish spell checker with no server -- data never leaves the device
  • Grammar checker extension for web editors (Google Docs, Notion, CMS platforms)
  • Real-time hyphenation engine for responsive Finnish typography
  • Offline-capable PWA for Finnish writers working without internet

Education & Language Learning

  • Interactive morphology explorer showing all 22 noun forms (11 cases x singular/plural) for any Finnish word
  • Verb conjugation trainer that generates paradigms and quizzes learners
  • Reading assistant: hover over any word to see its lemma, POS tag, and case
  • L2 Finnish error correction with explanations of grammar mistakes

Accessibility & Privacy

  • Fully offline NLP for air-gapped environments (military, healthcare, legal)
  • GDPR-compliant by design -- no cloud dependency, text never leaves the browser
  • TTS preprocessing: correct morphological analysis improves text-to-speech output
  • 9.2MB total footprint fits on IoT, edge, and mobile devices

Research & NLP Pipelines

  • Morphological annotation of Finnish corpora at 94.66% UPOS accuracy (UD Finnish-TDT dev)
  • Lemmatization for information retrieval and search indexing
  • Compound word decomposition for machine translation preprocessing
  • Writer Comonad as a research tool for studying morphophonological theory

Developer Integration

  • npm package (@yongsk0066/mce) -- drop-in for any JS/TS project
  • Rust crate -- native speed for server-side batch processing
  • CLI with 11 subcommands for scripting and automation
  • WASM API with 22 methods covering the full NLP pipeline

What Only MCE Can Do

  • Run a complete Finnish NLP stack in a browser tab with no internet connection
  • Apply category theory (comonads) to morphophonological rule composition
  • Generate full noun paradigms (22 forms: 11 singular + 11 plural) and verb conjugations client-side
  • Analyze Finnish compounds of arbitrary depth (lentokonesuihkuturbiinimoottori)
  • Deliver spell check + grammar check + hyphenation + POS tagging in about 9.2MB
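
As an illustration of what analyzing compounds of arbitrary depth involves, the sketch below enumerates every way to cut a word into lexicon entries by plain recursion. It is a deliberately simpler technique than the pushdown transducer MCE describes, it ignores linking morphemes, and the `segmentations` function and the tiny lexicon are assumptions made for the example.

```typescript
// Returns every segmentation of `word` into parts that are all in `lexicon`.
// Worst-case cost is exponential; fine for a demo, not for production.
function segmentations(word: string, lexicon: Set<string>): string[][] {
  if (word.length === 0) return [[]];
  const results: string[][] = [];
  for (let cut = 1; cut <= word.length; cut++) {
    const head = word.slice(0, cut);
    if (!lexicon.has(head)) continue;
    for (const rest of segmentations(word.slice(cut), lexicon)) {
      results.push([head].concat(rest));
    }
  }
  return results;
}

// Hypothetical mini-lexicon; a real system would query the dictionary instead.
const demoLexicon = new Set<string>(["rauta", "tie", "asema", "rautatie"]);
const splits = segmentations("rautatieasema", demoLexicon);
console.log(JSON.stringify(splits));
// [["rauta","tie","asema"],["rautatie","asema"]]
```

Both results are valid cuts; choosing between the flat and the nested reading is the part that needs structural analysis rather than string matching.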

Quick Start

npm (Browser / Node.js)

npm install @yongsk0066/mce

import init, { MceEngine } from '@yongsk0066/mce';

await init();
const dictBytes = await fetch('mor.vfst').then(r => r.arrayBuffer());
const engine = MceEngine.load(new Uint8Array(dictBytes));

// Morphological analysis
engine.analyze('koirien');         // JSON: [{ BASEFORM: 'koira', CLASS: 'nimisana', ... }]

// Spell checking
engine.spell_check('koira');       // true
engine.suggest('koirra', 1);       // ['koira', ...]

// Sentence-level analysis with POS disambiguation
engine.analyze_sentence('Koira juoksee nopeasti.');

// Grammar checking
engine.grammar_check('Koira koira juoksee pihalla.');

// Hyphenation
engine.hyphenate('suomalainen');   // 'suo-ma-lai-nen'

// Compound word splitting
engine.compound_split('rautatieasema');

// Morphological generation
engine.generate_form('koira', 'genitive', 'singular');
engine.generate_verb_form('juosta', 'present', '3sg', 'affirmative');

// Load suffix tagger model for higher accuracy (94.66% UPOS, UD Finnish-TDT dev)
const modelBytes = await fetch('suffix_tagger.bin').then(r => r.arrayBuffer());
engine.load_model(new Uint8Array(modelBytes));

// Optional: load wordlist for spelling suggestions
const wordlistBytes = await fetch('wordlist.txt').then(r => r.arrayBuffer());
engine.load_wordlist(new Uint8Array(wordlistBytes));

engine.free();
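
The `hyphenate` call above returns 'suo-ma-lai-nen'. The core rule behind output like this can be sketched in a few lines: a syllable boundary falls before any consonant that is directly followed by a vowel, once the word has already produced a vowel. This is only an approximation written for illustration, not MCE's implementation; the `hyphenateCV` name is hypothetical, and vowel-to-vowel breaks, compound boundaries, and typographic minimums are not handled.

```typescript
const FINNISH_VOWELS = "aeiouyäö";

function isVowel(ch: string): boolean {
  return ch.length === 1 && FINNISH_VOWELS.indexOf(ch.toLowerCase()) >= 0;
}

// Inserts "-" before every consonant + vowel sequence that follows an earlier vowel.
function hyphenateCV(word: string): string {
  let out = "";
  let seenVowel = false;
  for (let i = 0; i < word.length; i++) {
    const ch = word.charAt(i);
    const nextIsVowel = i + 1 < word.length && isVowel(word.charAt(i + 1));
    if (seenVowel && !isVowel(ch) && nextIsVowel) {
      out += "-";
    }
    if (isVowel(ch)) seenVowel = true;
    out += ch;
  }
  return out;
}

console.log(hyphenateCV("suomalainen")); // suo-ma-lai-nen
console.log(hyphenateCV("traktori")); // trak-to-ri
```

The `seenVowel` guard keeps word-initial consonant clusters such as "tr" together.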

Rust

cd crates
cargo test --all-features     # 1,619 Rust tests (1,994 total with 375 JS integration tests)
cargo clippy --all-features -- -D warnings

CLI

The CLI provides eleven subcommands. Typical invocations:

export MCE_DICT_PATH=/path/to/dictionary
cargo run -p mce-cli -- analyze koira
cargo run -p mce-cli -- spell koirra
cargo run -p mce-cli -- compound rautatieasema
cargo run -p mce-cli -- sentence "Koira juoksee nopeasti."
cargo run -p mce-cli -- grammar "Koira koira juoksee pihalla."
cargo run -p mce-cli -- hyphenate suomalainen
cargo run -p mce-cli -- hyphenate-text "Koira juoksee pihalla nopeasti."
cargo run -p mce-cli -- info
cargo run -p mce-cli -- eval --conllu fi_tdt-ud-dev.conllu
cargo run -p mce-cli -- benchmark --iterations 500
cargo run -p mce-cli -- benchmark --rules

How It Fits Together

flowchart LR
    dict[VFST Dictionary] --> fst[mce-fst]
    fst --> core[mce-core]
    core --> comonad[mce-comonad]
    core --> tokenizer[mce-tokenizer]
    core --> speller[mce-speller]
    fst --> speller
    comonad --> fi[mce-fi]
    fst --> fi
    speller --> fi
    core --> disambig[mce-disambig]
    fi --> grammar[mce-grammar]
    tokenizer --> grammar
    disambig --> grammar
    fi --> wasm[mce-wasm]
    fi --> cli[mce-cli]
    fi --> eval[mce-eval]
    disambig --> eval
    grammar --> wasm
    grammar --> cli
    wasm --> js[JS/TS npm]

The Rust workspace contains 11 crates:

| Crate | Role |
| --- | --- |
| mce-core | Shared types, character classification, LOUDS succinct trie (M1) |
| mce-fst | FST engine with format abstraction and VFST traversal |
| mce-tokenizer | Text tokenizer (words, sentences, URLs, emails) |
| mce-speller | Spell checking and suggestion engine |
| mce-comonad | Writer Comonad morphophonological engine (M2') + CG rules |
| mce-disambig | Disambiguation: Viterbi + CG-lite + Suffix Tagger (M4') |
| mce-fi | Finnish language module (analysis, generation, compounds, hyphenation) |
| mce-grammar | Grammar checking (21 rules) |
| mce-eval | UPOS/Lemma evaluation against UD treebanks |
| mce-wasm | WebAssembly bindings (22 API methods) |
| mce-cli | Command-line tools (11 subcommands) |

Performance

| Metric | Value (UD Finnish-TDT dev) |
| --- | --- |
| UPOS accuracy (CG + Suffix Tagger) | 94.66% |
| UPOS accuracy (rule-only) | 83.92% |
| Lemma accuracy | 93.09% |
| Coverage | 99.35% |
| Speed | 88,285 tokens/sec (~0.8ms per sentence) |
| WASM binary | ~380KB |
| Total deploy size | ~9.2MB (WASM + dictionary + model) |
| Deploy size (gzip) | ~2-3MB |
| CG rules | 62 active (85 total) |
| Grammar rules | 21 |
| Tests | 1,994 passed (1,619 Rust + 375 JS integration) |
| Lines of code | ~45,600 Rust |

Test split, for reference: UPOS 94.58%, Lemma 88.44%.

Comparison with Other Finnish NLP Tools

| | MCE | Omorfi | TurkuNLP (TNPP) | Trankit |
| --- | --- | --- | --- | --- |
| UPOS | 94.66% | 83.88% | 96.91% | 98.48% |
| Environment | Browser (WASM) | CLI / HFST | GPU server | GPU server |
| Offline | Yes (fully) | Yes | No | No |
| Deploy size | 9.2MB | ~50MB+ | ~1GB+ | ~1GB+ |
| Latency | 1.35ms/sent | ~10ms | ~100ms+ | ~100ms+ |
| Writer tools | Yes | No | No | No |
| Maintained | Yes | Yes | Deprecated | Yes |

MCE trades ~2-4pp of UPOS accuracy for a deployment that is orders of magnitude smaller and runs entirely in the browser with no network dependency.

UPOS and lemma accuracy measured on the UD Finnish-TDT development split with gold tokenization and PUNCT/SYM excluded (CoNLL standard). TNPP and Omorfi figures are reproduced from Pirinen (2019) on an earlier UD Finnish-TDT version; comparison is approximate. End-to-end accuracy with MCE's own tokenizer may differ slightly.
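
The scoring convention described above can be written down as a short function. This is a sketch of the metric under stated assumptions, not the code in mce-eval: the `Token` type and `uposAccuracy` name are hypothetical, gold tokenization means both sequences have the same length, and tokens are excluded based on their gold tag.

```typescript
interface Token {
  form: string;
  upos: string;
}

// Fraction of gold tokens whose predicted UPOS matches, skipping PUNCT and SYM.
function uposAccuracy(gold: Token[], predicted: string[]): number {
  if (gold.length !== predicted.length) {
    throw new Error("token counts differ; gold tokenization is assumed");
  }
  let scored = 0;
  let correct = 0;
  for (let i = 0; i < gold.length; i++) {
    const tag = gold[i].upos;
    if (tag === "PUNCT" || tag === "SYM") continue; // excluded from the score
    scored++;
    if (predicted[i] === tag) correct++;
  }
  return scored === 0 ? 0 : correct / scored;
}

// Three scored tokens, one wrong prediction (ADJ instead of ADV).
const demoGold: Token[] = [
  { form: "Koira", upos: "NOUN" },
  { form: "juoksee", upos: "VERB" },
  { form: "nopeasti", upos: "ADV" },
  { form: ".", upos: "PUNCT" },
];
const demoPredicted = ["NOUN", "VERB", "ADJ", "PUNCT"];
const demoAccuracy = uposAccuracy(demoGold, demoPredicted);
console.log((demoAccuracy * 100).toFixed(2) + "%"); // 66.67%
```

Excluding punctuation lowers the score slightly compared with counting every token, since punctuation is almost always tagged correctly.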

Architecture

MCE v3 combines four computational machines:

| Machine | Role | Mathematical Basis |
| --- | --- | --- |
| M1: Succinct Trie | Dictionary lookup / spell checking | LOUDS encoding |
| M2': Comonadic Engine | Morphological analysis + morphophonological rules | Writer Comonad (coKleisli composition) |
| M3: PDT | Compound word structure analysis | Pushdown Transducer |
| M4': Weighted Lattice | POS disambiguation | Viterbi + CG-lite + Suffix Tagger |
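
For the suffix tagger in M4', one common way to obtain emission scores is longest-suffix backoff: look up the longest word ending that the model knows and use its tag scores. The sketch below assumes that scheme. The table contents, the `emissionScores` and `bestTag` names, and the maximum suffix length are invented for illustration, and the actual format of suffix_tagger.bin is not described in this README.

```typescript
type TagScores = { [tag: string]: number };

// Returns log-scores for the longest known suffix of `word`,
// falling back to the entry for the empty suffix.
function emissionScores(
  word: string,
  table: Map<string, TagScores>,
  maxSuffixLen: number = 5
): TagScores {
  const lower = word.toLowerCase();
  for (let len = Math.min(maxSuffixLen, lower.length); len >= 1; len--) {
    const hit = table.get(lower.slice(lower.length - len));
    if (hit !== undefined) return hit;
  }
  const fallback = table.get("");
  return fallback !== undefined ? fallback : {};
}

// Picks the tag with the highest score, or undefined for an empty score set.
function bestTag(scores: TagScores): string | undefined {
  let best: string | undefined;
  for (const tag of Object.keys(scores)) {
    if (best === undefined || scores[tag] > scores[best]) best = tag;
  }
  return best;
}

// Made-up scores: "-sti" is typical of adverbs, "-ee" of 3rd person verbs.
const demoTable = new Map<string, TagScores>([
  ["sti", { ADV: -0.1, NOUN: -3.0 }],
  ["ee", { VERB: -0.4, NOUN: -1.5 }],
  ["", { NOUN: -1.0, VERB: -1.6, ADJ: -2.0 }],
]);

console.log(bestTag(emissionScores("nopeasti", demoTable))); // ADV
console.log(bestTag(emissionScores("juoksee", demoTable))); // VERB
```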

The Writer Comonad (M2') expresses all Finnish morphophonological rules -- consonant gradation (11 patterns), vowel harmony, and boundary effects -- as pure coKleisli arrows that compose without mutation or sentinel characters.
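
The idea of composing rules as coKleisli arrows can be sketched as follows. The README does not show MCE's type definitions, so this uses the textbook pair-shaped comonad: a value carried next to an annotation, with rules typed as functions from the annotated value to a plain result. The names `W`, `extend`, `composeCo`, `harmonize`, and `realize`, and the archiphoneme "A", are assumptions made for illustration. The deletion set is read-only in this sketch, whereas MCE's engine may accumulate it differently.

```typescript
// A value plus the set of character positions to drop from the surface form.
interface W<A> {
  readonly deleted: ReadonlySet<number>;
  readonly value: A;
}

function extract<A>(w: W<A>): A {
  return w.value;
}

// Lifts a context-reading rule so its result is re-wrapped in the same context.
function extend<A, B>(f: (w: W<A>) => B): (w: W<A>) => W<B> {
  return (w) => ({ deleted: w.deleted, value: f(w) });
}

// coKleisli composition: run f on the full context, then g on the re-wrapped result.
function composeCo<A, B, C>(
  f: (w: W<A>) => B,
  g: (w: W<B>) => C
): (w: W<A>) => C {
  return (w) => g(extend(f)(w));
}

// Vowel harmony: "A" becomes "a" after back vowels, otherwise "ä".
// Length-preserving, so positions in `deleted` stay valid afterwards.
function harmonize(w: W<string>): string {
  const hasBackVowel = /[aou]/.test(w.value);
  return w.value.replace(/A/g, hasBackVowel ? "a" : "ä");
}

// Realization: drop every position listed in `deleted` (e.g. a gradated consonant).
function realize(w: W<string>): string {
  return w.value
    .split("")
    .filter((_, i) => !w.deleted.has(i))
    .join("");
}

const inessive = composeCo(harmonize, realize);

// kukka + ssA, with the second "k" (index 3) marked deleted by gradation.
const kukassa = inessive({ deleted: new Set<number>([3]), value: "kukkassA" });
// kylä + ssA, nothing deleted, front-vowel stem.
const kylassa = inessive({ deleted: new Set<number>(), value: "kylässA" });
console.log(kukassa, kylassa); // kukassa kylässä
```

Neither rule mutates its input or inserts marker characters into the string; the deletion information travels beside the value, which is the property the paragraph above attributes to the comonadic design.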

License

The MCE engine code (all Rust crates) is licensed under Apache-2.0.

The Finnish morphological dictionary (mor.vfst) is loaded at runtime and is distributed under GPL-2.0-or-later as part of the Voikko project. See THIRD_PARTY_NOTICES.md for complete attribution and compliance details.

Citation

If you use MCE in your research, please cite:

@inproceedings{jang2026comonadic,
  title     = {Comonadic Morphophonology: A Compositional Framework
               for Context-Dependent Morphological Rules in {F}innish},
  author    = {Jang, Yongseok},
  booktitle = {Proceedings of the Society for Computation in Linguistics (SCiL)},
  year      = {2026}
}

Reproducing the SCiL 2026 Results

The accuracy and throughput figures in the paper are produced by the just eval / just eval-test / just bench-throughput recipes against the UD Finnish-TDT treebank and the dictionary/model assets shipped in data/. To reproduce them on a fresh clone:

git clone https://github.com/yongsk0066/mce.git
cd mce
git submodule update --init vendor/ud-finnish-tdt    # CC-BY-SA 4.0
export MCE_DICT_PATH="$(pwd)/data"
just eval          # dev split   -- expect UPOS 94.66%, Lemma 93.09%
just eval-test     # test split  -- expect UPOS 94.58%
just bench-throughput   # expect ~88k tok/s on a recent laptop CPU
just wasm-size     # WASM binary -- expect ~380 KB (budget 420 KB)

The scil-2026-camera-ready-results git tag marks the exact commit whose measurements are reported in the paper. Bundled assets (data/mor.vfst, data/suffix_tagger.bin, data/lemma_dict.tsv) and the pinned vendor/ud-finnish-tdt submodule revision together make the pipeline deterministic from a clean clone.

Contributing

Contributions are welcome! See CONTRIBUTING.md for development setup and guidelines.

Credits

MCE is built by Yongseok Jang as the analytical core for corevoikko, a Rust+WASM rewrite of Voikko. The Finnish dictionary data originates from the Voikko project contributors.

Development was assisted by Anthropic's Claude for English-language editing and portions of the Rust implementation. All design decisions, mathematical correctness, and final code review are the author's responsibility.

Documentation

  • CLAUDE.md -- project context and architecture details
  • ARCHITECTURE.md -- 4-machine architecture and crate dependency graph
