Code and paper for matching the TUFS Basic Vocabulary to OMW

Pipeline

tufsdata/*.txt  (PostgreSQL dumps from TUFS Open Language Resources)
      │
      ▼  scripts/munge.py
tufs-vocab.tsv  (10-column TSV: cid lang wid lemma comment iids examples is_basic scenes bunrui)
      │
      ▼  scripts/tufs2wn.py
build/lmf/tufs-{lang}.xml  (WN-LMF 1.4, one file per language)
      │
      ▼  Cygnet  (external/cygnet/, auto-cloned on first run)
docs/cygnet/tufs.db.gz  (SQLite for web UI)

scripts/munge.py — parses SQL dumps → flat TSV with columns: cid, lang, wid, lemma, comment, iids, examples, is_basic, scenes, bunrui.
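
To make the TSV layout concrete, here is a minimal sketch of reading tufs-vocab.tsv into named rows. The column order comes from the list above; the names `VocabRow` and `read_vocab` are hypothetical and are not taken from scripts/munge.py. Whether the real file has a header row is an assumption, so the sketch tolerates both cases.

```python
import csv
from collections import namedtuple

# Column order as listed above for tufs-vocab.tsv.
COLUMNS = ("cid", "lang", "wid", "lemma", "comment",
           "iids", "examples", "is_basic", "scenes", "bunrui")

# Hypothetical row type for illustration; not part of the repository.
VocabRow = namedtuple("VocabRow", COLUMNS)


def read_vocab(handle):
    """Yield one VocabRow per well-formed 10-column line of an open TSV file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        if tuple(fields) == COLUMNS:
            continue  # assumption: a header row may or may not be present
        yield VocabRow(*fields)
```

Typical use would be `with open("tufs-vocab.tsv", encoding="utf-8", newline="") as fh: rows = list(read_vocab(fh))`. All fields are kept as strings; how `is_basic` or `iids` are encoded is not specified here, so the sketch does not try to convert them.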

scripts/tufs2wn.py — reads TSV → LMF XML per language, extracting:

lemmas and semicolon-separated variant forms
morphological variants from (morph:tag) annotations and Japanese comment patterns
pronunciations: hiragana readings (ja-Hira) as <Form> variants, pinyin (zh-pinyin), and audio URLs
synset definitions from 【意味】 comment fields (keeps TUFS-internal synsets alive in the DB)
span-annotated example sentences, thematic domain labels, basic-vocabulary flags
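
Two of the extraction steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in scripts/tufs2wn.py: the helper names `split_forms` and `extract_definition` are hypothetical, variants are assumed to be separated by an ASCII semicolon, and the definition is assumed to run from the 【意味】 marker to the next 【 marker or the end of the comment.

```python
import re
from typing import List, Optional

# Assumption: the definition text follows 【意味】 and ends at the next
# bracketed marker (【...】) or at the end of the comment field.
DEFINITION_RE = re.compile(r"【意味】\s*(.*?)\s*(?=【|$)", re.DOTALL)


def split_forms(lemma_field: str) -> List[str]:
    """Split a lemma field on ';' and drop empty pieces.

    The first element is treated as the lemma, the rest as variant forms.
    """
    return [part.strip() for part in lemma_field.split(";") if part.strip()]


def extract_definition(comment: str) -> Optional[str]:
    """Return the text after 【意味】, or None if the marker is absent or empty."""
    match = DEFINITION_RE.search(comment)
    if match is None:
        return None
    return match.group(1) or None
```

For example, `split_forms("colour; color")` gives a lemma plus one variant, and `extract_definition` returns `None` for comments without a 【意味】 field, which lets the caller decide whether a concept can carry its own definition.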

ILI mapping: tufs-omw-map.tsv — TUFS concept IDs → Princeton WordNet 3.0 synset IDs. ~644 concepts are ILI-linked; ~2,274 are TUFS-internal with definitions from 【意味】.
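
The split between ILI-linked and TUFS-internal concepts can be illustrated as follows. The layout of tufs-omw-map.tsv is not described here, so this sketch assumes the first two tab-separated columns are the TUFS concept ID and the synset ID, and that lines starting with `#` are comments; `load_mapping` and `partition_concepts` are hypothetical names.

```python
import csv


def load_mapping(handle):
    """Return {concept_id: synset_id} from an open two-column TSV file.

    Assumption: column 1 is the TUFS concept ID, column 2 the synset ID.
    """
    mapping = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < 2 or fields[0].startswith("#"):
            continue  # skip blank lines, short lines, and comments
        cid, synset = fields[0].strip(), fields[1].strip()
        if cid and synset:
            mapping[cid] = synset
    return mapping


def partition_concepts(concept_ids, mapping):
    """Split concept IDs into (linked, internal) sorted lists."""
    unique = set(concept_ids)
    linked = sorted(c for c in unique if c in mapping)
    internal = sorted(c for c in unique if c not in mapping)
    return linked, internal
```

With the concept IDs taken from the `cid` column of tufs-vocab.tsv, the lengths of the two lists would correspond to the linked and TUFS-internal counts quoted above.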

Requirements

uv — Python package manager
TUFS SQL dumps in tufsdata/ — from TUFS Open Language Resources
Internet access on first run (to clone Cygnet and download OEWN)

Cygnet is auto-cloned to external/cygnet/ on first build. A .venv virtual environment is created automatically by uv.

Build

bash build.sh              # full build: TSV → LMF XML → Cygnet DB → docs/
bash build.sh --lmf-only   # TSV + LMF XML only (no Cygnet)
bash build.sh --cygnet-only  # Cygnet packaging only (reuses existing XML)

Preview locally

bash run.sh   # serves docs/ at http://localhost:8000
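
If bash is not available, a rough Python equivalent is sketched below. It assumes run.sh does nothing more than serve docs/ as static files, which is not confirmed here; `make_server` is a hypothetical helper built on the standard library's `http.server`.

```python
import functools
import http.server


def make_server(directory="docs", port=8000):
    """Build a static-file HTTP server for `directory`, bound to localhost."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling `make_server().serve_forever()` from the repository root would then serve docs/ at http://localhost:8000 until interrupted. Serving over HTTP rather than opening the files directly keeps the page and the database on the same origin, as in the deployed site.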

Tests

uv run pytest tests/

Unit tests (test_tufs2wn.py, test_source.py) cover TSV parsing helpers. Integration tests (test_db.py) check the built docs/cygnet/tufs.db.gz.
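
As an illustration of what an integration check against the gzipped database can look like, the sketch below decompresses a `.db.gz` file and lists its tables. It is not taken from tests/test_db.py, and because the Cygnet schema is not described here it makes no assumption about table names; `list_tables` is a hypothetical helper.

```python
import gzip
import shutil
import sqlite3
import tempfile
from pathlib import Path


def list_tables(db_gz_path):
    """Decompress a gzipped SQLite file and return its table names, sorted."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "unpacked.db"
        # SQLite cannot open a gzip stream directly, so unpack to a temp file.
        with gzip.open(db_gz_path, "rb") as src, open(db_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        conn = sqlite3.connect(str(db_path))
        try:
            rows = conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            ).fetchall()
        finally:
            conn.close()
    return [name for (name,) in rows]
```

A test could call `list_tables("docs/cygnet/tufs.db.gz")` and assert that the result is non-empty, which catches a truncated or corrupt build before any schema-specific checks run.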

Release

The typical workflow is:

bash build.sh                       # build + validate XMLs + deploy docs/
bash make-release.sh                # build tarballs + wn load test (inspect locally)
bash make-release.sh --pre-release  # tag, push, create pre-release on GitHub
# ... test locally with run.sh, share for review ...
bash make-release.sh --release      # promotes the existing pre-release (no re-upload)

If you discover a bug after pre-release that requires rebuilding the XMLs, bump VERSION, rebuild, and release under the new version.

GitHub Pages deployment

The web UI at https://omwn.github.io/tufs/ is deployed by the GitHub Actions workflow in .github/workflows/pages.yml. It runs automatically on every published release and can also be triggered manually via gh workflow run.

Why a workflow instead of committing the DB files? The .db.gz files are ~40 MB total and change with every release — too large to track in git. GitHub also blocks cross-origin fetches from releases/latest/download/ (CORS policy), so the browser cannot load them directly. The workflow solves both problems by downloading the DB files from the release assets and deploying them alongside docs/ as a single Pages artifact — same origin, no CORS issue.

The Pages source must be set to GitHub Actions (not "Deploy from branch") in the repo's Settings → Pages. If you ever need to re-deploy without a new release, run:

gh workflow run pages.yml --repo omwn/tufs

Intermediate TSV (`tufs-vocab.tsv`)

Francis Bond, Hiroki Nomoto, Luís Morgado da Costa, and Arthur Bond. 2020. Linking the TUFS Basic Vocabulary to the Open Multilingual Wordnet. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3181–3188, Marseille, France. European Language Resources Association.
Francis Bond. 2025. Adding Audio to Wordnets. In Proceedings of the 13th International Global Wordnet Conference (GWC2025). (Note: non-final version)
