oppaWord: Super Fast Myanmar Word Segmenter

แ€€แ€ปแ€ฝแ€”แ€บแ€แ€ฑแ€ฌแ€บแ€แ€ญแ€ฏแ€ท แ€…แ€ฌแ€›แ€ฑแ€ธแ€แ€ฒแ€ทแ€กแ€แ€ซแ€™แ€พแ€ฌ แ€กแ€„แ€บแ€นแ€‚แ€œแ€ญแ€•แ€บแ€…แ€ฌแ€œแ€ญแ€ฏแ€™แ€ปแ€ญแ€ฏแ€ธ แ€…แ€ฌแ€œแ€ฏแ€ถแ€ธแ€แ€ฝแ€ฑแ€€แ€ญแ€ฏ แ€–แ€ผแ€แ€บแ€•แ€ผแ€ฎแ€ธแ€›แ€ฑแ€ธแ€œแ€ฑแ€ทแ€™แ€›แ€พแ€ญแ€แ€ฌแ€€แ€ผแ€ฑแ€ฌแ€„แ€บแ€ทแŠ NLP/AI แ€กแ€œแ€ฏแ€•แ€บแ€แ€ฝแ€ฑแ€กแ€แ€ฝแ€€แ€บ แ€™แ€ผแ€”แ€บแ€™แ€ฌแ€…แ€ฌแ€€แ€ผแ€ฑแ€ฌแ€„แ€บแ€ธแ€แ€ฝแ€ฑแ€€แ€ญแ€ฏ แ€…แ€ฌแ€œแ€ฏแ€ถแ€ธแ€–แ€ผแ€แ€บแ€›แ€แ€ฒแ€ทแ€กแ€œแ€ฏแ€•แ€บแ€€ แ€แ€€แ€šแ€บแ€แ€€แ€บแ€แ€ฒแ€•แ€ซแ€แ€šแ€บแ‹ แ€กแ€ฒแ€’แ€ซแ€€แ€ผแ€ฑแ€ฌแ€„แ€บแ€ท แ€€แ€ปแ€ฝแ€”แ€บแ€แ€ฑแ€ฌแ€บแ€ฅแ€ฎแ€ธแ€†แ€ฑแ€ฌแ€„แ€บแ€”แ€ฑแ€แ€ฒแ€ท Language Understanding Lab. แ€™แ€พแ€ฌ Word Segmentation แ€žแ€ฏแ€แ€ฑแ€žแ€”แ€กแ€œแ€ฏแ€•แ€บแ€แ€ฝแ€ฑแ€€แ€ญแ€ฏ แ€กแ€แ€ปแ€ญแ€”แ€บแ€›แ€›แ€„แ€บแ€›แ€žแ€œแ€ญแ€ฏ แ€†แ€€แ€บแ€แ€ญแ€ฏแ€€แ€บแ€œแ€ฏแ€•แ€บแ€–แ€ผแ€…แ€บแ€”แ€ฑแ€•แ€ซแ€แ€šแ€บแ‹ แ€แ€แ€ปแ€ญแ€ฏแ€ทแ€กแ€œแ€ฏแ€•แ€บแ€แ€ฝแ€ฑแ€กแ€แ€ฝแ€€แ€บ sylbreak แ€”แ€ฒแ€ท แ€แ€แ€นแ€แ€–แ€ผแ€แ€บแ€•แ€ผแ€ฎแ€ธ แ€œแ€ฏแ€•แ€บแ€แ€šแ€บแŠ แ€แ€แ€ปแ€ญแ€ฏแ€ทแ€กแ€œแ€ฏแ€•แ€บแ€แ€ฝแ€ฑแ€กแ€แ€ฝแ€€แ€บ myWord แ€”แ€ฒแ€ท แ€…แ€ฌแ€œแ€ฏแ€ถแ€ธแ€–แ€ผแ€แ€บแ€แ€šแ€บแ‹ แ€’แ€ซแ€ทแ€กแ€•แ€ผแ€„แ€บ แ€„แ€ซแ€ธแ€•แ€ญแ€œแ€ญแ€ฏแ€ท แ€”แ€ฌแ€™แ€Šแ€บแ€•แ€ฑแ€ธแ€‘แ€ฌแ€ธแ€แ€ฒแ€ท semantic chunking แ€œแ€ญแ€ฏแ€™แ€ปแ€ญแ€ฏแ€ธ แ€…แ€ฌแ€œแ€ฏแ€ถแ€ธแ€แ€ฝแ€ฑแ€€แ€ญแ€ฏ แ€—แ€€แ€บแ€แ€ฌแ€•แ€ผแ€ฑแ€ฌแ€„แ€บแ€ธแ€•แ€ผแ€ฎแ€ธ แ€กแ€“แ€ญแ€•แ€นแ€•แ€ซแ€šแ€บแ€•แ€ซ แ€†แ€ฝแ€ฒแ€šแ€ฐแ€–แ€ญแ€ฏแ€ท แ€–แ€ผแ€…แ€บแ€”แ€ญแ€ฏแ€„แ€บแ€แ€ฒแ€ท แ€…แ€ฌแ€œแ€ฏแ€ถแ€ธแ€–แ€ผแ€แ€บแ€”แ€Šแ€บแ€ธแ€แ€ฝแ€ฑแ€€แ€ญแ€ฏแ€œแ€Šแ€บแ€ธ แ€œแ€ฑแ€ทแ€œแ€ฌแ€…แ€™แ€บแ€ธแ€žแ€•แ€บแ€–แ€ผแ€…แ€บแ€แ€ฒแ€ทแ€•แ€ซแ€แ€šแ€บแ‹ แ€’แ€ฎแ€›แ€€แ€บแ€•แ€ญแ€ฏแ€„แ€บแ€ธแ€แ€ฝแ€ฑแ€กแ€แ€ฝแ€„แ€บแ€ธแ€™แ€พแ€ฌ Lab แ€›แ€ฒแ€ท internship แ€€แ€ปแ€ฑแ€ฌแ€„แ€บแ€ธแ€žแ€ฌแ€ธแ€แ€ฝแ€ฑแ€€แ€ญแ€ฏ word segmentation แ€”แ€ฒแ€ทแ€•แ€แ€บแ€žแ€€แ€บแ€แ€ฌแ€œแ€Šแ€บแ€ธ แ€…แ€ฌแ€žแ€„แ€บแ€–แ€ญแ€ฏแ€ทแ€•แ€ผแ€„แ€บแ€›แ€„แ€บแ€ธแ€”แ€ฒแ€ท แ€Ÿแ€ญแ€ฏแ€ธแ€กแ€›แ€„แ€บแ€€ แ€…แ€™แ€บแ€ธแ€แ€ฒแ€ทแ€–แ€ฐแ€ธแ€แ€ฒแ€ท DAG (directed acyclic graph) แ€˜แ€€แ€บแ€€แ€ญแ€ฏ แ€•แ€ผแ€”แ€บแ€œแ€พแ€Šแ€บแ€ทแ€–แ€ผแ€…แ€บแ€แ€ฒแ€ทแ€•แ€ซแ€แ€šแ€บแ‹ 
แ€กแ€ญแ€ฏแ€€แ€บแ€’แ€ฎแ€›แ€ฌแ€กแ€”แ€ฑแ€”แ€ฒแ€ทแ€€แ€แ€ฑแ€ฌแ€ท ARPA n-gram language model แ€›แ€šแ€บ syllable frequency count แ€แ€ฝแ€ฑแ€›แ€šแ€บแ€”แ€ฒแ€ท แ€™แ€ผแ€”แ€บแ€™แ€ฌแ€…แ€ฌ แ€…แ€ฌแ€œแ€ฏแ€ถแ€ธแ€แ€ฝแ€ฑแ€€แ€ญแ€ฏ แ€–แ€ผแ€แ€บแ€แ€ฒแ€ท แ€”แ€Šแ€บแ€ธแ€œแ€™แ€บแ€ธแ€•แ€ซแ‹ แ€žแ€ฐแ€€ OOV แ€€แ€ญแ€ฏแ€œแ€Šแ€บแ€ธ backoff แ€”แ€ฒแ€ท แ€›แ€พแ€ฑแ€ฌแ€„แ€บแ€œแ€ญแ€ฏแ€ท แ€›แ€แ€ฌแ€™แ€ญแ€ฏแ€ท แ€€แ€ปแ€ฝแ€”แ€บแ€แ€ฑแ€ฌแ€บแ€€ แ€…แ€ญแ€แ€บแ€แ€„แ€บแ€…แ€ฌแ€ธแ€แ€šแ€บแ‹ แ€œแ€€แ€บแ€แ€ฝแ€ฑแ€ทแ€™แ€พแ€ฌแ€€ แ€™แ€ผแ€”แ€บแ€™แ€ฌแ€…แ€ฌแ€กแ€แ€ฝแ€€แ€บ แ€แ€€แ€šแ€บแ€€แ€ฑแ€ฌแ€„แ€บแ€ธแ€แ€ฒแ€ท language model แ€†แ€ฑแ€ฌแ€€แ€บแ€–แ€ญแ€ฏแ€ทแ€†แ€ญแ€ฏแ€แ€ฌแ€€ แ€™แ€œแ€ฝแ€šแ€บแ€€แ€ฐแ€•แ€ซแ€˜แ€ฐแ€ธแ‹ แ€กแ€™แ€ปแ€ญแ€ฏแ€ธแ€™แ€ปแ€ญแ€ฏแ€ธ แ€€แ€ผแ€ญแ€ฏแ€ธแ€…แ€ฌแ€ธแ€€แ€ผแ€Šแ€บแ€ทแ€œแ€Šแ€บแ€ธ แ€›แ€œแ€’แ€บแ€€ แ€‘แ€„แ€บแ€žแ€œแ€ฑแ€ฌแ€€แ€บ แ€แ€€แ€บแ€™แ€œแ€ฌแ€•แ€ซแ€˜แ€ฐแ€ธแ‹ แ€กแ€ฒแ€’แ€ซแ€”แ€ฒแ€ท แ€”แ€ฑแ€ฌแ€€แ€บแ€†แ€ฏแ€ถแ€ธแ€แ€ฑแ€ฌแ€ท แ€กแ€˜แ€ญแ€“แ€ฌแ€”แ€บ แ€”แ€ฒแ€ท bidirectional maximum matching แ€€แ€ญแ€ฏแ€•แ€ซ แ€แ€ฝแ€ฒแ€œแ€ญแ€ฏแ€€แ€บแ€•แ€ผแ€ฎแ€ธ score แ€œแ€ฏแ€•แ€บแ€แ€ฌ fallback แ€œแ€ฏแ€•แ€บแ€แ€ฌแ€แ€ฝแ€ฑแ€”แ€ฒแ€ท แ€แ€ฝแ€ฒแ€–แ€ผแ€…แ€บแ€žแ€ฝแ€ฌแ€ธแ€•แ€ซแ€แ€šแ€บแ‹ แ€›แ€œแ€’แ€บแ€€ --bimm-boost แ€”แ€ฒแ€ท tuning แ€œแ€ฏแ€•แ€บแ€›แ€„แ€บแ€ธ F1-score 90+ แ€‘แ€ญ แ€แ€€แ€บแ€กแ€ฑแ€ฌแ€„แ€บ แ€œแ€ฏแ€•แ€บแ€œแ€ญแ€ฏแ€ทแ€›แ€แ€ฌแ€€แ€ญแ€ฏ แ€›แ€พแ€ฌแ€–แ€ฝแ€ฑแ€แ€ฝแ€ฑแ€ทแ€›แ€พแ€ญแ€แ€ฒแ€ทแ€แ€šแ€บแ‹ แ€กแ€ฒแ€’แ€ซแ€”แ€ฒแ€ทแ€•แ€ฒ แ€€แ€ปแ€ฑแ€ฌแ€„แ€บแ€ธแ€žแ€ฌแ€ธแŠ แ€€แ€ปแ€ฑแ€ฌแ€„แ€บแ€ธแ€žแ€ฐแ€แ€ฝแ€ฑแ€€แ€ญแ€ฏแ€œแ€Šแ€บแ€ธ Hybrid DAG + Bi-MM + LM architecture แ€€แ€ญแ€ฏแ€กแ€แ€ผแ€ฑแ€แ€ถแ€แ€ฒแ€ท word segmentation แ€€แ€ญแ€ฏ แ€™แ€ญแ€แ€บแ€†แ€€แ€บแ€•แ€ฑแ€ธแ€›แ€„แ€บแ€ธ แ€’แ€ฎ oppaWord แ€€แ€ญแ€ฏ แ€กแ€™แ€ปแ€ฌแ€ธแ€žแ€ฏแ€ถแ€ธแ€œแ€ญแ€ฏแ€ทแ€›แ€–แ€ญแ€ฏแ€ทแ€กแ€‘แ€ญ coding แ€†แ€€แ€บแ€œแ€ฏแ€•แ€บแ€–แ€ผแ€…แ€บแ€แ€ฒแ€ทแ€แ€šแ€บแ‹

oppaWord แ€†แ€ญแ€ฏแ€แ€ฒแ€ท แ€”แ€ฌแ€™แ€Šแ€บแ€œแ€ฌแ€ธแ‹ แ€กแ€…แ€€ ARPA LM แ€€แ€ญแ€ฏ แ€žแ€ฏแ€ถแ€ธแ€‘แ€ฌแ€ธแ€แ€ฌแ€™แ€ญแ€ฏแ€ทแ€œแ€ญแ€ฏแ€ท arpaWord แ€†แ€ญแ€ฏแ€•แ€ผแ€ฎแ€ธ แ€”แ€ฌแ€™แ€Šแ€บแ€•แ€ฑแ€ธแ€–แ€ญแ€ฏแ€ท แ€…แ€‰แ€บแ€ธแ€…แ€ฌแ€ธแ€แ€ฒแ€ทแ€แ€ฌแ‹ แ€แ€€แ€šแ€บแ€แ€™แ€บแ€ธแ€€ hybrid_dag_bimm_lm.py แ€†แ€ญแ€ฏแ€•แ€ผแ€ฎแ€ธ แ€•แ€ฑแ€ธแ€›แ€„แ€บแ€แ€ฑแ€ฌแ€ท แ€žแ€ฏแ€ถแ€ธแ€‘แ€ฌแ€ธแ€แ€ฒแ€ท approach แ€กแ€ฌแ€ธแ€œแ€ฏแ€ถแ€ธแ€œแ€ญแ€ฏแ€œแ€ญแ€ฏแ€€แ€ญแ€ฏ แ€แ€ผแ€ฏแ€ถแ€„แ€ฏแ€ถแ€™แ€ญแ€•แ€ซแ€แ€šแ€บแ‹ แ€แ€€แ€บแ€แ€ฌแ€€ แ€กแ€™แ€ปแ€ฌแ€ธแ€กแ€แ€ฝแ€€แ€บ แ€™แ€พแ€แ€บแ€™แ€ญแ€–แ€ญแ€ฏแ€ท แ€แ€€แ€บแ€•แ€ซแ€œแ€ญแ€™แ€บแ€ทแ€™แ€šแ€บแ‹ แ€œแ€€แ€บแ€›แ€พแ€ญ แ€€แ€ญแ€ฏแ€›แ€ฎแ€ธแ€šแ€ฌแ€ธ แ€…แ€€แ€ฌแ€ธแ€œแ€ฏแ€ถแ€ธ oppa (์˜ค๋น ) แ€†แ€ญแ€ฏแ€›แ€„แ€บแ€แ€ฑแ€ฌแ€ท แ€™แ€ผแ€”แ€บแ€™แ€ฌแ€œแ€ฐแ€„แ€šแ€บแ€แ€ญแ€ฏแ€„แ€บแ€ธแ€œแ€ญแ€ฏแ€œแ€ญแ€ฏ แ€›แ€„แ€บแ€ธแ€”แ€พแ€ฎแ€ธแ€•แ€ผแ€ฎแ€ธแ€žแ€ฌแ€ธ แ€–แ€ผแ€…แ€บแ€•แ€ซแ€œแ€ญแ€™แ€บแ€ทแ€™แ€šแ€บแ‹ แ€•แ€ผแ€ฎแ€ธแ€แ€ฑแ€ฌแ€ท แแ€แ€แ€”แ€บแ€ธแ€กแ€•แ€ผแ€ฎแ€ธแ€™แ€พแ€ฌ แ€”แ€พแ€…แ€บแ€แ€ฑแ€ฌแ€บแ€แ€ฑแ€ฌแ€บแ€€แ€ผแ€ฌแ€€แ€ผแ€ฌ แ€แ€ญแ€ฏแ€€แ€บแ€€แ€ฝแ€™แ€บแ€’แ€ญแ€ฏแ€กแ€ฌแ€ธแ€€แ€…แ€ฌแ€ธแ€€แ€ญแ€ฏ แ€œแ€ฏแ€•แ€บแ€–แ€ผแ€…แ€บแ€แ€ฒแ€ทแ€แ€ฒแ€ท แ€„แ€šแ€บแ€˜แ€ แ€กแ€™แ€พแ€แ€บแ€แ€›แ€แ€ฝแ€ฑแ€›แ€šแ€บแŠ แ€€แ€ปแ€ฝแ€”แ€บแ€แ€ฑแ€ฌแ€บแ€›แ€ฒแ€ท แ€แ€ญแ€ฏแ€€แ€บแ€€แ€ฝแ€™แ€บแ€’แ€ญแ€ฏแ€†แ€›แ€ฌแ€แ€ฝแ€ฑแ€กแ€™แ€ปแ€ฌแ€ธแ€€แ€ผแ€ฎแ€ธแ€‘แ€ฒแ€€แ€แ€…แ€บแ€ฅแ€ฎแ€ธแ€–แ€ผแ€…แ€บแ€แ€ฒแ€ท แ€€แ€™แ€นแ€˜แ€ฌแ€ทแ€แ€ปแ€”แ€บแ€•แ€ฎแ€šแ€ถ แ€†แ€›แ€ฌแ€‚แ€ปแ€ฝแ€”แ€บ แ€€แ€ญแ€ฏแ€œแ€Šแ€บแ€ธ แ€žแ€แ€ญแ€›แ€แ€ฌแ€”แ€ฒแ€ท oppaWord แ€œแ€ญแ€ฏแ€ทแ€•แ€ฒ แ€”แ€ฌแ€™แ€Šแ€บแ€•แ€ฑแ€ธแ€–แ€ผแ€…แ€บแ€แ€ฒแ€ทแ€•แ€ซแ€แ€šแ€บแ‹

แ€œแ€€แ€บแ€›แ€พแ€ญแ€กแ€แ€ปแ€ญแ€”แ€บแ€‘แ€ญ แ€กแ€™แ€ปแ€ญแ€ฏแ€ธแ€™แ€ปแ€ญแ€ฏแ€ธ experiment แ€แ€ฝแ€ฑ แ€œแ€ฏแ€•แ€บแ€€แ€ผแ€Šแ€บแ€ทแ€แ€ฒแ€ทแ€•แ€ผแ€ฎแ€ธ แ€…แ€ฌแ€€แ€ผแ€ฑแ€ฌแ€„แ€บแ€ธแ€›แ€ฑ แ€œแ€ฑแ€ธแ€žแ€ฑแ€ฌแ€„แ€บแ€ธแ€€แ€ปแ€ฑแ€ฌแ€บแ€€แ€ญแ€ฏ (e.g. myPOS corpus แ€แ€…แ€บแ€แ€ฏแ€œแ€ฏแ€ถแ€ธ) แ€…แ€ฌแ€œแ€ฏแ€ถแ€ธแ€–แ€ผแ€แ€บแ€€แ€ผแ€Šแ€บแ€ทแ€แ€ฌ แƒ แ€…แ€€แ€นแ€€แ€”แ€บแ€ทแ€แ€ฑแ€ฌแ€„แ€บ แ€™แ€€แ€ผแ€ฌแ€•แ€ซแ€˜แ€ฐแ€ธแ‹ แ€กแ€ฒแ€’แ€ซแ€€แ€ผแ€ฑแ€ฌแ€„แ€บแ€ท แ€œแ€€แ€บแ€›แ€พแ€ญแ€กแ€แ€ปแ€ญแ€”แ€บแ€‘แ€ญ แ€™แ€ผแ€”แ€บแ€™แ€ฌแ€…แ€ฌแ€กแ€แ€ฝแ€€แ€บ แ€กแ€™แ€ผแ€”แ€บแ€†แ€ฏแ€ถแ€ธ word segmenter แ€•แ€ซแ€•แ€ฒแ‹ ์˜คppa๋น ord แ€€แ€ญแ€ฏ แ€กแ€ฌแ€ธแ€•แ€ฑแ€ธแ€€แ€ผแ€•แ€ซแ€ฅแ€ฎแ€ธแ€œแ€ญแ€ฏแ€ทแ‹

แ€›แ€ฒแ€€แ€ปแ€ฑแ€ฌแ€บแ€žแ€ฐ
4 Aug 2025

Overview

Written Myanmar does not consistently mark word boundaries, so word segmentation is an essential first step for NLP tasks. Tools such as sylbreak (a syllable segmenter) and myWord (a multi-level segmenter) already exist; oppaWord fills the need for a fast, training-free word segmenter with domain adaptation capabilities through:

  • Hybrid DAG + Bi-MM + LM architecture
  • Post-editing rule support
  • Visual debugging tools
  • Syllable-level processing

Key Advantages:

  • Faster than Neural Network based segmenters
  • No training required - just provide a dictionary
  • Tunable segmentation strategies via simple parameters

Algorithm Explained

Overview of oppaWord Segmenter

oppaWord combines three core techniques:

  1. DAG Construction:

    • Builds all possible segmentations from candidate words of up to --max-word-len syllables (configurable from 3 to 12; default 6)
    • Scores paths using dictionary, frequency, and language model features
  2. Bidirectional Maximum Matching (Bi-MM):

    • Fallback mechanism when DAG paths are uncertain
    • Configurable score boosting (--bimm-boost)
  3. Multi-Feature Scoring:

    Total_Score = Dict_Weight + Syllable_Freq + LM_Score + (Bi-MM_Boost)
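
The DAG step can be illustrated with a small sketch. This is not the code in oppa_word.py: the function names, the edge-scoring rule (dictionary weight times word length, log relative frequency for out-of-dictionary syllables), and the toy Latin "syllables" are assumptions made for illustration only, and the language model and Bi-MM terms are left out.

```python
import math

def build_dag(syllables, dictionary, max_word_len=6):
    """Return {start: [end, ...]} listing every candidate word that starts at `start`."""
    n = len(syllables)
    dag = {}
    for i in range(n):
        ends = [i + 1]  # a single syllable is always a candidate, so OOV text stays reachable
        for j in range(i + 2, min(n, i + max_word_len) + 1):
            if "".join(syllables[i:j]) in dictionary:
                ends.append(j)
        dag[i] = ends
    return dag

def segment(syllables, dictionary, syl_freq, dict_weight=10.0, max_word_len=6):
    """Pick the highest-scoring path through the DAG with dynamic programming."""
    n = len(syllables)
    dag = build_dag(syllables, dictionary, max_word_len)
    total = sum(syl_freq.values())
    best = [-math.inf] * (n + 1)   # best[j]: best score of any path ending at position j
    back = [0] * (n + 1)           # back[j]: start position of the last word on that path
    best[0] = 0.0
    for i in range(n):
        for j in dag[i]:
            word = "".join(syllables[i:j])
            if word in dictionary:
                edge = dict_weight * (j - i)   # hypothetical: reward longer dictionary words
            else:
                # hypothetical: out-of-dictionary syllables get a log relative frequency (<= 0)
                edge = math.log((syl_freq.get(word, 0) + 1) / (total + 1))
            if best[i] + edge > best[j]:
                best[j] = best[i] + edge
                back[j] = i
    words, j = [], n
    while j > 0:                   # walk the back-pointers from the end of the sentence
        i = back[j]
        words.append("".join(syllables[i:j]))
        j = i
    return words[::-1]

toy_syllables = ["my", "an", "mar", "sa", "ka"]
toy_dict = {"myanmar", "saka"}
toy_freq = {"my": 5, "an": 3, "mar": 4, "sa": 6, "ka": 2}
print(segment(toy_syllables, toy_dict, toy_freq))  # ['myanmar', 'saka']
```

Because single syllables are always candidates, every sentence has at least one complete path, which is what lets a dictionary-based scorer degrade gracefully on unknown words.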

Installation

git clone https://github.com/ye-kyaw-thu/oppaWord.git  
cd oppaWord  
7z x data/5gramLM.7z.001  # Extract language model

Usage

Basic Command

python oppa_word.py \
  --input input.txt \
  --dict data/myg2p_mypos.dict \
  --output segmented.txt

oppaWord is fast even on large corpora. The benchmark below segments the entire myPOS corpus (43,196 sentences) in about 2.6 seconds without a language model:

ye@lst-hpc3090:~/exp/myTokenizer/oppaWord$ time python oppa_word.py \
  --input ../../corpus_info/tool/dagWord/data/mypos-ver.3.0.shuf.notag.nopunc.txt.seg_normalized2 \
  --dict data/myg2p_mypos.dict \
  --space-remove-mode my_not_num \
  --use-bimm-fallback \
  --bimm-boost 150 \
  --output ./mypos-ver.3.0.noLM.norules.token.txt

real    0m2.586s    # Total elapsed time (including space removal preprocessing)
user    0m2.555s
sys     0m0.024s

[Screenshot: oppaWord running]

Recommended Configuration

For best accuracy with the current 5-gram LM:

python oppa_word.py \
  --input text.txt \
  --dict data/myg2p_mypos.dict \
  --arpa data/myMono_clean_syl.trie.bin \
  --use-bimm-fallback \
  --bimm-boost 150 \
  --space-remove-mode "my_not_num"
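
For readers unfamiliar with the technique behind --use-bimm-fallback, here is a minimal sketch of textbook bidirectional maximum matching over syllables. It is a generic illustration, not oppaWord's implementation: the function names and the tie-breaking heuristics (fewer words, then fewer out-of-dictionary words, then the backward result) are assumptions, and the --bimm-boost scoring is not modeled.

```python
def forward_mm(syllables, dictionary, max_len=6):
    """Greedy longest match, scanning left to right."""
    words, i = [], 0
    while i < len(syllables):
        for k in range(min(max_len, len(syllables) - i), 0, -1):
            word = "".join(syllables[i:i + k])
            if k == 1 or word in dictionary:   # fall back to a single syllable
                words.append(word)
                i += k
                break
    return words

def backward_mm(syllables, dictionary, max_len=6):
    """Greedy longest match, scanning right to left."""
    words, j = [], len(syllables)
    while j > 0:
        for k in range(min(max_len, j), 0, -1):
            word = "".join(syllables[j - k:j])
            if k == 1 or word in dictionary:
                words.append(word)
                j -= k
                break
    return words[::-1]

def bimm(syllables, dictionary, max_len=6):
    """Run both directions and keep the result that looks better."""
    fwd = forward_mm(syllables, dictionary, max_len)
    bwd = backward_mm(syllables, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd   # prefer fewer words

    def oov(words):
        return sum(1 for w in words if w not in dictionary)

    return fwd if oov(fwd) < oov(bwd) else bwd       # then fewer OOV words, else backward

print(bimm(list("abcd"), {"ab", "abc", "cd"}))  # ['ab', 'cd']
```

In the toy call, forward matching greedily takes "abc" and strands "d", while backward matching finds "ab" + "cd"; both have two words, so the result with fewer out-of-dictionary words is kept.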

Full Options

$ python ./oppa_word.py --help
usage: oppa_word.py [-h] --input INPUT [--output OUTPUT] --dict DICT [--sylfreq SYLFREQ] [--arpa ARPA]
                    [--postrule-file POSTRULE_FILE] [--max-order MAX_ORDER] [--dict-weight DICT_WEIGHT]
                    [--use-bimm-fallback] [--bimm-boost BIMM_BOOST] [--visualize-dag] [--dag-output-dir DAG_OUTPUT_DIR]
                    [--space-remove-mode {all,my,my_not_num}] [--max-word-len MAX_WORD_LEN]

oppa_word, Hybrid DAG + BiMM + LM Myanmar Word Segmenter with optional Aho-Corasick support

options:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        Input file with one sentence per line (UTF-8)
  --output OUTPUT, -o OUTPUT
                        Optional output file path (default: stdout)
  --dict DICT, -d DICT  Word dictionary file (one word per line)
  --sylfreq SYLFREQ, -s SYLFREQ
                        Syllable frequency file (syllable<TAB>frequency, for scoring)
  --arpa ARPA, -a ARPA  ARPA-format syllable-level language model (optional)
  --postrule-file POSTRULE_FILE
                        Optional post-processing rules (e.g., merging, corrections)
  --max-order MAX_ORDER
                        Max LM n-gram order (default: 5)
  --dict-weight DICT_WEIGHT
                        Dictionary path weight in scoring (default: 10.0)
  --use-bimm-fallback   Enable Bi-directional Maximum Matching as fallback
  --bimm-boost BIMM_BOOST
                        Boost score added to Bi-MM fallback path (default: 0.0)
  --visualize-dag       Generate DAG visualization (PDF per sentence)
  --dag-output-dir DAG_OUTPUT_DIR
                        Directory to save DAG PDFs if --visualize-dag is used (default: 'dag_viz')
  --space-remove-mode {all,my,my_not_num}
                        Preprocessing mode to remove spaces: 'all', 'my' (Myanmar only), or 'my_not_num (Myanmar but not
                        including Myanmar numbers'
  --max-word-len MAX_WORD_LEN
                        Maximum word length in syllables (3-12, default:6)
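
The help text above describes two plain-text resource formats: a dictionary with one word per line, and a syllable frequency file with syllable<TAB>frequency per line. A minimal sketch of loading both follows. The function names are hypothetical, and treating frequencies as integers and skipping malformed lines are assumptions rather than documented behavior.

```python
def load_dictionary(path):
    """Read a word list (one word per line, UTF-8) into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def load_syllable_freq(path):
    """Read 'syllable<TAB>frequency' lines into a dict; skip lines without a tab."""
    freq = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            syllable, sep, count = line.rstrip("\n").partition("\t")
            if not sep or not syllable:
                continue  # assumption: silently ignore malformed or empty lines
            freq[syllable] = int(count.strip())  # assumption: counts are integers
    return freq
```

A set gives constant-time membership checks for candidate words, which matters when every syllable span of a sentence is looked up.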

Visualization

Debug segmentation decisions using DAG visualizations:

python oppa_word.py \
  --input ./data/10lines.ref \
  --dict ./data/myg2p_mypos.dict \
  --space-remove-mode "my_not_num" \
  --use-bimm-fallback \
  --bimm-boost 150 \
  --visualize-dag \
  --dag-output-dir debug_viz2

Visualization Demo:

ye@lst-hpc3090:~/exp/myTokenizer/oppaWord$ python oppa_word.py \
>   --input ./data/10lines.ref \
>   --dict ./data/myg2p_mypos.dict \
>   --space-remove-mode "my_not_num" \
>   --use-bimm-fallback \
>   --bimm-boost 150 \
>   --visualize-dag \
>   --dag-output-dir debug_viz2
แแ‰แ†แ‚ แ€แ€ฏแ€”แ€พแ€…แ€บ แ€แ€”แ€บแ€ทแ€™แ€พแ€”แ€บแ€ธ แ€žแ€”แ€บแ€ธแ€แ€ฑแ€ซแ€„แ€บแ€…แ€ฌแ€›แ€„แ€บแ€ธ แ€กแ€› แ€œแ€ฐแ€ฆแ€ธ แ€›แ€ฑ แแแ…แ‰แƒแ แ€šแ€ฑแ€ฌแ€€แ€บ แ€›แ€พแ€ญ แ€žแ€Šแ€บ
แ€œแ€ฐ แ€แ€ญแ€ฏแ€„แ€บแ€ธ แ€แ€ฝแ€„แ€บ แ€žแ€„แ€บแ€ทแ€™แ€ผแ€แ€บ แ€œแ€ปแ€ฑแ€ฌแ€บแ€€แ€”แ€บ แ€…แ€ฝแ€ฌ แ€€แ€”แ€บแ€ทแ€žแ€แ€บ แ€‘แ€ฌแ€ธ แ€žแ€Šแ€บแ€ท แ€กแ€œแ€ฏแ€•แ€บแ€œแ€ฏแ€•แ€บแ€แ€ปแ€ญแ€”แ€บ แ€กแ€•แ€ผแ€„แ€บ แ€œแ€…แ€ฌ แ€”แ€พแ€„แ€บแ€ทแ€แ€€แ€ฝ แ€กแ€แ€ซ แ€€แ€ฌแ€œ แ€กแ€ฌแ€ธแ€œแ€ปแ€ฑแ€ฌแ€บแ€…แ€ฝแ€ฌ แ€žแ€แ€บแ€™แ€พแ€แ€บ แ€‘แ€ฌแ€ธ แ€žแ€Šแ€บแ€ท แ€กแ€œแ€ฏแ€•แ€บ แ€กแ€ฌแ€ธแ€œแ€•แ€บแ€›แ€€แ€บ แ€™แ€ปแ€ฌแ€ธ แ€•แ€ซแ€แ€„แ€บ แ€žแ€Šแ€บแ€ท แ€กแ€”แ€ฌแ€ธแ€šแ€ฐแ€แ€ฝแ€„แ€บแ€ท แ€”แ€พแ€„แ€บแ€ท แ€กแ€ฌแ€ธแ€œแ€•แ€บแ€แ€ฝแ€„แ€บแ€ท แ€แ€ถแ€…แ€ฌแ€ธแ€•แ€ญแ€ฏแ€„แ€บแ€แ€ฝแ€„แ€บแ€ท แ€›แ€พแ€ญ แ€žแ€Šแ€บ
แ€ค แ€”แ€Šแ€บแ€ธ แ€€แ€ญแ€ฏ แ€…แ€…แ€บแ€šแ€ฐ แ€žแ€ฑแ€ฌ แ€”แ€Šแ€บแ€ธ แ€Ÿแ€ฏ แ€แ€ฑแ€ซแ€บ แ€žแ€Šแ€บ
แ€…แ€ฌแ€•แ€ผแ€”แ€บแ€•แ€ฝแ€ฒ แ€†แ€ญแ€ฏ แ€แ€ฌ แ€€ แ€กแ€ฌแ€‚แ€ฏแ€ถแ€†แ€ฑแ€ฌแ€„แ€บ แ€กแ€œแ€ฝแ€แ€บแ€€แ€ปแ€€แ€บ แ€‘แ€ฌแ€ธ แ€แ€ฒแ€ท แ€•แ€ญแ€‹แ€€แ€แ€บแ€žแ€ฏแ€ถแ€ธแ€•แ€ฏแ€ถ แ€…แ€ฌแ€•แ€ฑ แ€แ€ฝแ€ฑ แ€€แ€ญแ€ฏ แ€…แ€ฌแ€…แ€…แ€บ แ€žแ€ถแ€ƒแ€ฌแ€แ€ฑแ€ฌแ€บแ€€แ€ผแ€ฎแ€ธ แ€แ€ฝแ€ฑ แ€›แ€ฒแ€ท แ€›แ€พแ€ฑแ€ทแ€™แ€พแ€ฌ แ€กแ€œแ€ฝแ€แ€บ แ€•แ€ผแ€”แ€บ แ€•แ€ผแ€ฎแ€ธ แ€›แ€ฝแ€แ€บแ€•แ€ผ แ€› แ€แ€ฌ แ€•แ€ฑแ€ซแ€ท
แ€’แ€ฎ แ€™แ€พแ€ฌ แ€€แ€ปแ€ฝแ€”แ€บแ€แ€ฑแ€ฌแ€บแ€ท แ€žแ€€แ€บแ€žแ€ฑแ€แ€ถแ€€แ€แ€บ แ€•แ€ซ
แ‚แ€ แ€›แ€ฌแ€…แ€ฏ แ€™แ€ผแ€”แ€บแ€™แ€ฌแ€ท แ€žแ€™แ€ญแ€ฏแ€„แ€บแ€ธ แ€žแ€”แ€บแ€ธ แ€แ€„แ€บแ€ธ แ€œแ€พแ€ญแ€ฏแ€„แ€บ แ‚แ€แ€แ‰ แ€แ€ฏ แ€™แ€ฑ แ€œ แ€€แ€ถแ€€แ€ฑแ€ฌแ€บแ€แ€แ€บแ€›แ€Šแ€บ แ€…แ€ฌแ€•แ€ฑ
แ€€แ€ปแ€ฝแ€”แ€บแ€แ€ฑแ€ฌแ€บ แ€™แ€ปแ€€แ€บแ€™แ€พแ€”แ€บ แ€แ€…แ€บ แ€œแ€€แ€บแ€œแ€ฏแ€•แ€บ แ€แ€ปแ€„แ€บแ€•แ€ซแ€แ€šแ€บ
แ€€แ€ปแ€ฝแ€”แ€บแ€แ€ฑแ€ฌแ€บ แ€แ€ญแ€ฏแ€ท แ€€ แ€’แ€ฎ แ€กแ€™แ€พแ€ฏ แ€›แ€ฒแ€ท แ€€แ€ผแ€ถแ€›แ€ฌแ€•แ€ซ แ€€แ€ญแ€ฏ แ€–แ€™แ€บแ€ธแ€™แ€ญ แ€–แ€ญแ€ฏแ€ท แ€€แ€ผแ€ญแ€ฏแ€ธแ€…แ€ฌแ€ธ แ€แ€ฒแ€ท แ€แ€šแ€บ
แ€€แ€œแ€ฑแ€ธ แ€™แ€ฎแ€ธแ€–แ€ฝแ€ฌแ€ธ แ€–แ€ญแ€ฏแ€ท แ€แ€”แ€ทแ€บแ€™แ€พแ€”แ€บแ€ธ แ€›แ€€แ€บ แ€€ แ€˜แ€šแ€บแ€แ€ฑแ€ฌแ€ท แ€•แ€ซ แ€œแ€ฒ
แ€กแ€›แ€ญแ€ฏแ€ธแ€›แ€พแ€„แ€บแ€ธแ€†แ€ฏแ€ถแ€ธ แ€€แ€ฌแ€—แ€ญแ€ฏแ€Ÿแ€ญแ€ฏแ€€แ€บแ€’แ€›แ€ญแ€แ€บ แ€™แ€พแ€ฌ แ€‚แ€œแ€ฐแ€ธแ€€แ€ญแ€ฏแ€ทแ€…แ€บ แ€‚แ€œแ€€แ€บแ€แ€ญแ€ฏแ€ทแ€…แ€บ แ€–แ€›แ€•แ€บแ€แ€ญแ€ฏแ€ทแ€…แ€บ แ€…แ€žแ€Šแ€บแ€ท แ€™แ€ญแ€ฏแ€”แ€ญแ€ฏแ€†แ€€แ€บแ€€แ€›แ€ญแ€ฏแ€€แ€บ แ€™แ€ปแ€ฌแ€ธ แ€–แ€ผแ€…แ€บ แ€žแ€Šแ€บ
ye@lst-hpc3090:~/exp/myTokenizer/oppaWord$ ls debug_viz2/
dag_line_0000.dot  dag_line_0002.dot  dag_line_0004.dot  dag_line_0006.dot  dag_line_0008.dot
dag_line_0000.pdf  dag_line_0002.pdf  dag_line_0004.pdf  dag_line_0006.pdf  dag_line_0008.pdf
dag_line_0001.dot  dag_line_0003.dot  dag_line_0005.dot  dag_line_0007.dot  dag_line_0009.dot
dag_line_0001.pdf  dag_line_0003.pdf  dag_line_0005.pdf  dag_line_0007.pdf  dag_line_0009.pdf
ye@lst-hpc3090:~/exp/myTokenizer/oppaWord$

Example output:

dag_line_0004.png
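
The demo above produces one .dot file and one .pdf per input line. As a rough idea of what such a file can contain, the sketch below writes a Graphviz DOT description of a segmentation DAG. The node and edge layout, the label format, and the function name are assumptions; the DOT files written by oppaWord may look different.

```python
def dag_to_dot(syllables, edges):
    """Build DOT text for a DAG.

    `edges` is an iterable of (start, end, score) tuples, where start and end
    are syllable positions and syllables[start:end] is the candidate word.
    """
    lines = ["digraph dag {", "  rankdir=LR;"]
    for i in range(len(syllables) + 1):
        lines.append(f'  n{i} [label="{i}"];')          # one node per syllable boundary
    for start, end, score in edges:
        label = "".join(syllables[start:end]).replace('"', '\\"')
        lines.append(f'  n{start} -> n{end} [label="{label} ({score:.1f})"];')
    lines.append("}")
    return "\n".join(lines)

print(dag_to_dot(["a", "b", "c"], [(0, 2, 20.0), (2, 3, -1.5)]))
```

Saved to a file, text like this can be rendered with the Graphviz command line, for example `dot -Tpdf dag.dot -o dag.pdf`.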

File Structure

oppaWord/
├── data/ # Resource files
│   ├── myg2p_mypos.dict # Main dictionary
│   ├── myg2p_mypos_name.dict # Extended dictionary with Myanmar names
│   ├── myMono.freq # Syllable frequency counts
│   ├── myMono_clean_syl.trie.bin # 5-gram syllable LM (optimized binary)
│   └── rules.txt # Post-processing correction rules
├── doc/ # Documentation
├── tools/ # Evaluation and preprocessing scripts
└── oppa_word.py # Main segmenter code

Note: exp_1, exp_2, and exp_20250727_1553 are output folders from some earlier experiments.

Data Preparation

Input Format

  • One sentence per line
  • Space removal optional (handled by --space-remove-mode)
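
The three --space-remove-mode values can be pictured with the sketch below. It is one possible reading of the help text, not the actual preprocessing code: here 'my' removes spaces that sit between two Myanmar characters, and 'my_not_num' does the same but leaves spaces next to Myanmar digits (U+1040 to U+1049) in place. The constants and the function name are hypothetical.

```python
import re

MYANMAR = "\u1000-\u109F"                         # Myanmar Unicode block
MYANMAR_NO_DIGITS = "\u1000-\u103F\u104A-\u109F"  # same block without digits U+1040-U+1049

def remove_spaces(line, mode="my_not_num"):
    """Remove spaces from an already segmented line before re-segmenting it."""
    if mode == "all":
        return line.replace(" ", "")
    if mode == "my":
        cls = MYANMAR
    elif mode == "my_not_num":
        cls = MYANMAR_NO_DIGITS
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Drop a run of spaces only when both neighbours belong to the chosen class.
    return re.sub(f"(?<=[{cls}]) +(?=[{cls}])", "", line)
```

Under this reading, 'my_not_num' keeps numbers as separate tokens while still gluing Myanmar words back together.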

Post-Editing Rules (rules.txt)

oppaWord supports post-segmentation corrections through a rules file. This helps fix systematic errors and improve readability.

Rule Format

WRONG_FORM|||CORRECT_FORM

Types of Rules

  1. Exact Word Replacements:

    ပါတယ်|||ပါ တယ်
    မရှိ|||မ ရှိ
    
    • Merges or splits exact word matches
    • Example: "ပါတယ်" → "ပါ တယ်"
  2. Regular Expressions:

    (\S)([၊။])|||\1 \2
    
    • Uses regex patterns to handle:
      • Punctuation attachment ("ပါ။" → "ပါ ။")
    • Regex syntax follows Python's re module
  3. Best Practices

    • Order Matters: Rules are applied top-to-bottom
    • Balance Specificity:
      • Prefer exact matches (ပါဘူး|||ပါ ဘူး) over broad regex when possible
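
A minimal sketch of how rules in the WRONG_FORM|||CORRECT_FORM format could be parsed and applied top-to-bottom is shown here. It assumes every left-hand side is compiled as a Python regular expression and matched anywhere in the line; whether oppaWord treats exact-word rules as literals or restricts them to token boundaries is not stated above, and the function names are hypothetical.

```python
import re

def load_rules(lines):
    """Parse 'WRONG|||CORRECT' lines into (compiled pattern, replacement) pairs."""
    rules = []
    for line in lines:
        line = line.rstrip("\n")
        if "|||" not in line:
            continue  # skip blank or malformed lines
        wrong, correct = line.split("|||", 1)
        rules.append((re.compile(wrong), correct))
    return rules

def apply_rules(text, rules):
    """Apply every rule in file order, so earlier rules feed later ones."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

rules = load_rules([
    "ပါတယ်|||ပါ တယ်",
    r"(\S)([၊။])|||\1 \2",
])
print(apply_rules("ပါတယ်။", rules))  # ပါ တယ် ။
```

The example shows why order matters: the first rule splits the word, and the second rule then detaches the sentence-final punctuation from the newly created last token.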

Performance Tips

  1. For speed: Run dictionary-only (omit --arpa) with --use-bimm-fallback --bimm-boost 150
  2. For accuracy: Add --arpa data/myMono_clean_syl.trie.bin
  3. Domain adaptation:
    • Customize data/rules.txt for post-editing
    • Add domain terms to dictionary

Evaluation

We provide eval_segmentation.py - a comprehensive evaluation tool that offers:

Key Features

  1. Multi-Level Metrics:

    • Word-Level: Precision, Recall, F1 (exact word matches)
    • Boundary-Level: Accuracy of word boundaries
    • Vocabulary-Level: Type-based analysis
  2. Error Analysis:

    • Categorizes errors into:
      • Over-Segmentation: System incorrectly splits words
      • Under-Segmentation: System incorrectly merges words
      • Complex Errors: Mixed boundary mistakes
    • Shows top-K most frequent errors (configurable with --top-k)
  3. Efficient Processing:

    • Handles large files (>500k words) in seconds
    • Optional --no-errors flag for faster metric-only evaluation
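
Word-level precision, recall and F1 with exact word matches are commonly computed by comparing character spans, as in the sketch below. This is a generic illustration and not the code of eval_segmentation.py; the function names are hypothetical, and the boundary-level and vocabulary-level metrics are not covered.

```python
def word_spans(segmented_line):
    """Map a space-segmented line to the set of (start, end) character spans of its words."""
    spans, pos = set(), 0
    for word in segmented_line.split():
        spans.add((pos, pos + len(word)))
        pos += len(word)          # positions ignore the spaces themselves
    return spans

def word_prf(ref_lines, hyp_lines):
    """Corpus-level word precision, recall and F1 from matching spans."""
    correct = n_ref = n_hyp = 0
    for ref, hyp in zip(ref_lines, hyp_lines):
        ref_spans, hyp_spans = word_spans(ref), word_spans(hyp)
        correct += len(ref_spans & hyp_spans)
        n_ref += len(ref_spans)
        n_hyp += len(hyp_spans)
    precision = correct / n_hyp if n_hyp else 0.0
    recall = correct / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Only "ab" occupies the same span in both lines: 1 correct out of 3 words on each side.
print(word_prf(["ab cd e"], ["ab c de"]))
```

A word counts as correct only if both of its boundaries agree with the reference, which is why one misplaced boundary typically costs two words.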

Usage Examples

ye@lst-hpc3090:~/exp/myTokenizer/oppaWord$ python ./tools/eval_segmentation.py --help
usage: eval_segmentation.py [-h] -r REFERENCE [-H HYPOTHESIS] [--top-k TOP_K] [--no-errors]

Enhanced Word Segmentation Evaluator with Error Analysis

options:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        Reference (gold standard) file (default: None)
  -H HYPOTHESIS, --hypothesis HYPOTHESIS
                        Hypothesis (system output) file (use - for stdin) (default: -)
  --top-k TOP_K         Show top K most frequent errors (default: 10)
  --no-errors           Skip error analysis to save time (default: False)
ye@lst-hpc3090:~/exp/myTokenizer/oppaWord$

Basic Evaluation

python ./tools/eval_segmentation.py \
  -r reference.txt \
  -H hypothesis.txt > results.txt

Example Evaluation Results

Word segmentation evaluation on the mypos-ver.3.0 corpus (43,196 sentences) using the following parameters: --dict, --space-remove-mode, --use-bimm-fallback, --bimm-boost, and --postrule-file:

ye@lst-hpc3090:~/exp/myTokenizer/oppaWord$ time python oppa_word.py   --input ../../corpus_info/tool/dagWord/data/mypos-ver.3.0.shuf.notag.nopunc.txt.seg_normalized2   \
--dict data/myg2p_mypos.dict --space-remove-mode my_not_num  --use-bimm-fallback   \
--bimm-boost 150 --postrule-file ./data/rules.txt --output ./mypos-ver.3.0.noLM.rules.token.txt

real    0m2.721s
user    0m2.688s
sys     0m0.026s
ye@lst-hpc3090:~/exp/myTokenizer/oppaWord$

Running eval_segmentation.py to compare reference and hypothesis files, with output saved to result_noLM_Rule2.txt:

ye@lst-hpc3090:~/exp/myTokenizer/oppaWord$ python ./tools/eval_segmentation.py -r ../../corpus_info/tool/dagWord/data/mypos-ver.3.0.shuf.notag.nopunc.txt.seg_normalized2 \
-H  ./mypos-ver.3.0.noLM.rules.token.txt > result_noLM_Rule2.txt

Contents of the evaluation output file (result_noLM_Rule2.txt):


Word Segmentation Evaluation Results
============================================================
Metric                         Score
------------------------------------------------------------
Word Precision                0.9120
Word Recall                   0.8889
Word F1-score                 0.9003
------------------------------------------------------------
Boundary Precision            0.6475
Boundary Recall               0.6311
Boundary F1-score             0.6392
------------------------------------------------------------
Vocab Precision               0.8754
Vocab Recall                  0.9398
Vocab F1-score                0.9065
============================================================

Additional Statistics:
Reference words: 510437
Hypothesis words: 497503
Correct words: 453708
Reference vocabulary size: 24257
Hypothesis vocabulary size: 26042
Common vocabulary: 22797

Top Segmentation Errors Analysis
============================================================
Total errors: 57106

Most Frequent Over-Segmentation Errors (System split where it shouldn't):
  120 ร— REF: 'แ€žแ€ญแ€ฏแ€ทแ€™แ€Ÿแ€ฏแ€แ€บ' โ†’ HYP: 'แ€žแ€ญแ€ฏแ€ทแ€™|แ€Ÿแ€ฏแ€แ€บ'
  101 ร— REF: 'แ€’แ€ซแ€™แ€พแ€™แ€Ÿแ€ฏแ€แ€บ' โ†’ HYP: 'แ€’แ€ซแ€™แ€พแ€™|แ€Ÿแ€ฏแ€แ€บ'
   81 ร— REF: 'แ€’แ€ฑแ€ซแ€บแ€œแ€ฌ' โ†’ HYP: 'แ€’แ€ฑแ€ซแ€บ|แ€œแ€ฌ'
   76 ร— REF: 'แ€”แ€ฌแ€›แ€ฎ' โ†’ HYP: 'แ€”แ€ฌ|แ€›แ€ฎ'
   57 ร— REF: 'แ€€แ€ปแ€ฝแ€”แ€บแ€แ€ฑแ€ฌแ€บ' โ†’ HYP: 'แ€€แ€ปแ€ฝแ€”แ€บ|แ€แ€ฑแ€ฌแ€บ'
   54 ร— REF: 'แ€กแ€žแ€€แ€บ' โ†’ HYP: 'แ€ก|แ€žแ€€แ€บ'
   50 ร— REF: 'แ€™แ€ญแ€”แ€…แ€บ' โ†’ HYP: 'แ€™แ€ญ|แ€”แ€…แ€บ'
   49 ร— REF: 'แ€Šแ€”แ€ฑ' โ†’ HYP: 'แ€Š|แ€”แ€ฑ'
   47 ร— REF: 'แ€™แ€”แ€€แ€บ' โ†’ HYP: 'แ€™|แ€”แ€€แ€บ'
   42 ร— REF: 'แ€กแ€™แ€พแ€แ€บ' โ†’ HYP: 'แ€ก|แ€™แ€พแ€แ€บ'

Most Frequent Under-Segmentation Errors (System joined what should be separate):
  247 ร— REF: 'แ€™|แ€›' โ†’ HYP: 'แ€™แ€›'
  175 ร— REF: 'แ€แ€…แ€บ|แ€แ€ฏ|แ€แ€ฏ' โ†’ HYP: 'แ€แ€…แ€บแ€แ€ฏแ€แ€ฏ'
  137 ร— REF: 'แ€šแ€‰แ€บแ€€แ€ปแ€ฑแ€ธ|แ€™แ€พแ€ฏ' โ†’ HYP: 'แ€šแ€‰แ€บแ€€แ€ปแ€ฑแ€ธแ€™แ€พแ€ฏ'
  137 ร— REF: 'แ€”แ€ญแ€ฏแ€„แ€บแ€„แ€ถ|แ€›แ€ฑแ€ธ' โ†’ HYP: 'แ€”แ€ญแ€ฏแ€„แ€บแ€„แ€ถแ€›แ€ฑแ€ธ'
  124 ร— REF: 'แ€˜แ€ฌ|แ€œแ€ฒ' โ†’ HYP: 'แ€˜แ€ฌแ€œแ€ฒ'
  111 ร— REF: 'แ€…|แ' โ†’ HYP: 'แ€…แ'
  110 ร— REF: 'แ€แ€…แ€บ|แ€œแ€ฏแ€ถแ€ธ' โ†’ HYP: 'แ€แ€…แ€บแ€œแ€ฏแ€ถแ€ธ'
  107 ร— REF: 'แ€•แ€ซ|แ€ฅแ€ฎแ€ธ' โ†’ HYP: 'แ€•แ€ซแ€ฅแ€ฎแ€ธ'
  104 ร— REF: 'แ€แ€…แ€บ|แ€•แ€แ€บ' โ†’ HYP: 'แ€แ€…แ€บแ€•แ€แ€บ'
  103 ร— REF: 'แ€™|แ€แ€ฝแ€ฑแ€ท' โ†’ HYP: 'แ€™แ€แ€ฝแ€ฑแ€ท'

Most Frequent Complex Boundary Errors:
  350 ร— REF: 'แ€•แ€ฑแ€ซแ€บ' โ†’ HYP: 'แ€•แ€ฑแ€ซแ€บแ€™แ€พแ€ฌ'
  230 ร— REF: 'แ€–แ€ผแ€…แ€บ' โ†’ HYP: 'แ€žแ€Šแ€บ'
  218 ร— REF: 'แ€žแ€Šแ€บ' โ†’ HYP: 'แ€žแ€Šแ€บแ€™แ€พแ€ฌ'
  192 ร— REF: 'แ€Ÿแ€ฏแ€แ€บ' โ†’ HYP: 'แ€Ÿแ€ฏแ€แ€บแ€แ€šแ€บ'
  188 ร— REF: 'แ€”แ€ญแ€ฏแ€„แ€บแ€„แ€ถ' โ†’ HYP: 'แ€”แ€ญแ€ฏแ€„แ€บ'
  167 ร— REF: 'แ' โ†’ HYP: 'แ€„แ€ถแ'
  162 ร— REF: 'แ€ฆแ€ธ' โ†’ HYP: 'แ€ฅแ€ฎแ€ธ'
  150 ร— REF: 'แ€–แ€ผแ€…แ€บ' โ†’ HYP: 'แ€–แ€ผแ€…แ€บแ'
  146 ร— REF: 'แ€€แ€ผ' โ†’ HYP: 'แ€€แ€ผแ'
  120 ร— REF: 'แ€แ€ฑแ€ฌแ€„แ€บแ€ธแ€•แ€”แ€บ' โ†’ HYP: 'แ€แ€ฑแ€ฌแ€„แ€บแ€ธแ€•แ€”แ€บแ€•แ€ซ'
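
The word-level and vocabulary-level scores in the report follow directly from the counts listed under Additional Statistics. The short check below recomputes them; the helper name is hypothetical, and the boundary-level scores are not recomputed because the report does not list the underlying boundary counts.

```python
def prf(correct, n_hyp, n_ref):
    """Precision, recall and F1 from raw counts."""
    precision = correct / n_hyp
    recall = correct / n_ref
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the report above.
word = prf(correct=453708, n_hyp=497503, n_ref=510437)
vocab = prf(correct=22797, n_hyp=26042, n_ref=24257)

print(" ".join(f"{x:.4f}" for x in word))   # 0.9120 0.8889 0.9003
print(" ".join(f"{x:.4f}" for x in vocab))  # 0.8754 0.9398 0.9065
```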

License

Source Code & Tools

MIT License - Full terms available at:
https://github.com/ye-kyaw-thu/oppaWord/blob/main/LICENSE

Dictionary Data

The dictionary combines words from multiple sources:

  • myG2P dictionary (originally from Myanmar Language Commission)
  • myPOS corpus
  • Personal names from myRoman corpus and LU Lab's Myanmar names collection (for R&D)
  • myMono monolingual corpus (not publicly released)

Usage Restrictions:

  • ✅ Allowed: Myanmar language NLP/AI research and development
  • ❌ Not Allowed: Commercial use without explicit permission

Detailed Technical Information

For those interested in oppaWord's segmentation methodology and development process:

  1. Technical Introduction

  2. Experimental Notebooks
    Open Notebooks
    Contains Jupyter notebooks demonstrating:

    • Different segmentation strategies
    • Error analysis workflows
    • Parameter tuning examples
  3. Research Publications (Coming Soon)

    • Planned journal paper on the hybrid segmentation approach

Citation

If you use oppaWord in your work, please cite it as follows:

@misc{oppaWord_2025,
  author       = {Ye Kyaw Thu},
  title        = {{oppaWord: Hybrid DAG+Bi-MM+LM Myanmar Word Segmenter}},
  version      = {1.0},
  month        = {August},
  year         = {2025},
  publisher    = {GitHub},
  url          = {https://github.com/ye-kyaw-thu/oppaWord},
  note         = {Accessed: YYYY-MM-DD},
  institution  = {Language Understanding Lab (LU Lab), Myanmar}
}
