A mapping between the Apertium and Universal Dependencies (UD) tagsets, based on the information from the Apertium wiki.
Loosely based on this code, hence the GPLv3 license.
To install the latest released version, run
python -m pip install apertium2ud

NB!
- The instrument is far from being perfect.
- It was originally developed for working with apertium-kir, i.e. with the Kyrgyz language.
- The package ships with the apertium-kir .udx conversion rules. For other languages, you may need to add the corresponding submodule (see below).
The package needs two kinds of resources that are generated at build time and not committed to the repository:
- apertium2ud/resources/tags_map.json: the Apertium↔UD tag map, scraped from the Apertium wiki "List of symbols" page; and
- apertium2ud/resources/<lang>.udx: copied from the relevant Apertium language repository, vendored as a git submodule under external/.
The Apertium language repositories are referenced as submodules, so clone with:
git clone --recurse-submodules https://github.com/alexeyev/apertium2ud.git
# or, if already cloned:
git submodule update --init --recursive

Languages with packaged .udx rules: Kyrgyz (apertium-kir, the default),
Kazakh (apertium-kaz), and Uyghur (apertium-uig). Also selectable,
but falling back to the generic wiki-scraped map (no language-specific
overrides) because their Apertium repo ships no .udx: Tatar, Turkish,
Uzbek, Azerbaijani, and Sakha/Yakut. To add a language, add its
Apertium repo as a submodule and list it in SUPPORTED_LANGS (or, if it has no
.udx, GENERIC_MAP_LANGS) in build_resources.py:
git submodule add https://github.com/apertium/apertium-tat external/apertium-tat

Building the package generates the resources automatically (a plain
pip install . or python -m build runs the wiki scrape and copies the
.udx files via an in-tree build backend), so the produced wheel/sdist already
contains them. To (re)generate the resources manually:
python build_resources.py # all configured languages
python build_resources.py --langs kir
python build_resources.py --skip-wiki # reuse existing tags_map.json

.udx files are sanitized on copy to repair known upstream defects that would
otherwise emit invalid UD output (e.g. comma-joined feature values such as
Number[psor]=Plur,Sing, which is dropped as underspecified).
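
The dropping rule described above can be illustrated with a minimal sketch. It assumes features are already available as `Name=Value` strings; the function name `drop_comma_joined` is hypothetical, and the actual sanitizer in build_resources.py may operate differently (for instance, directly on the raw .udx text).

```python
def drop_comma_joined(features):
    """Keep only features whose value is a single value (no comma).

    Hypothetical helper for illustration; not part of apertium2ud.
    """
    kept = []
    for feat in features:
        _name, sep, value = feat.partition("=")
        # A value such as "Plur,Sing" is treated as underspecified and dropped.
        if sep and "," in value:
            continue
        kept.append(feat)
    return kept


print(drop_comma_joined(["Number[psor]=Plur,Sing", "Person[psor]=3", "Case=Acc"]))
# prints ['Person[psor]=3', 'Case=Acc']
```

Dropping the whole feature, instead of picking one of the joined values, avoids asserting a value the upstream rule did not commit to.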
>>> from apertium2ud.convert import a2ud
>>> a2ud(["n", "pl", "acc"])
(['NOUN'], ['Number=Plur', 'Case=Acc', 'Definite=Def'])
>>> a2ud(["v", "tv", "ger", "nom", "cop", "aor", "p3", "pl"])
(['VERB'], ['VerbForm=Ger', 'Case=Nom', 'Tense=Aor', 'VerbForm=Fin', 'Mood=Ind', 'Person=3', 'Number=Plur'])
(Exact features depend on the bundled .udx rules; the examples above reflect
the packaged apertium-kir rules.)
By default a2ud uses the Kyrgyz (apertium-kir) rules. To use another
packaged language's rules, load them and pass them in:
>>> import apertium2ud
>>> apertium2ud.available_languages()
['aze', 'kaz', 'kir', 'sah', 'tat', 'tur', 'uig', 'uzb']
>>> kaz_rules = apertium2ud.load_language_rules("kaz")
>>> from apertium2ud.convert import a2ud
>>> a2ud(["n", "pl", "acc"], rules=kaz_rules)
(['NOUN'], ['Number=Plur', 'Case=Acc'])
The default behaviour (calling a2ud without rules=) is unchanged.
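
If the result is meant to go into a CoNLL-U file, the returned pair has to be turned into UPOS and FEATS column strings. The sketch below is one possible way to do that and is not part of the package: `to_conllu_columns` is a hypothetical helper, taking the first UPOS from the list is an assumption, and the input is copied from the apertium-kir example output shown earlier so that the snippet runs without apertium2ud installed.

```python
def to_conllu_columns(upos, feats):
    """Turn an (upos_list, feature_list) pair into CoNLL-U UPOS and FEATS strings.

    Hypothetical helper for illustration; not part of apertium2ud.
    """
    # Assumption: when several UPOS candidates are returned, use the first one.
    upos_col = upos[0] if upos else "_"
    # CoNLL-U orders features alphabetically by name (case-insensitively),
    # joins them with "|", and uses "_" for an empty column.
    ordered = sorted(feats, key=lambda f: f.split("=", 1)[0].lower())
    feats_col = "|".join(ordered) or "_"
    return upos_col, feats_col


# Pair copied from the a2ud(["n", "pl", "acc"]) example (apertium-kir rules).
upos, feats = ["NOUN"], ["Number=Plur", "Case=Acc", "Definite=Def"]
print(to_conllu_columns(upos, feats))
# prints ('NOUN', 'Case=Acc|Definite=Def|Number=Plur')
```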
So far, the conversion is far from perfect:
Кыз NOUN {'Number[psor]=Sing', 'Number=Sing', 'Case=Nom', 'Person[psor]=3', 'Person=3'} ->
<px3sg><n><subj?nom?><sg><p3><px3sp>
досуна NOUN {'Number[psor]=Sing', 'Number=Sing', 'Person[psor]=3', 'Case=Dat', 'Person=3'} ->
<px3sg><n><sg><dat><p3><px3sp>
кат NOUN {'Case=Nom', 'Person=3', 'Number=Sing'} ->
<n><subj?nom?><sg><p3>
жазган VERB {'Aspect=Perf', 'Polarity=Pos', 'Number=Sing', 'Tense=Past', 'Person=3', 'Evident=Fh'} ->
<past3p><vblex?v?vbmod?><sg><aff><aor?past?pret?><perf><p3>
. PUNCT set() ->
<sent?apos?percent?clb?punct?>
python build_resources.py # generate resources first
python -m pytest
ruff check . # lint

CI (GitHub Actions) lints with ruff, checks out the submodules recursively, builds the resources, runs the test suite across Python 3.8–3.12, and verifies that a freshly built wheel installs and works in a clean environment.
tests/integration_ud_eval.py evaluates the converter against real Universal
Dependencies data, with the treebanks vendored as submodules under external/.
Run a full report with:
python tests/integration_ud_eval.py # all available languages
python tests/integration_ud_eval.py kaz # one language

The converter is exercised against eight Turkic UD treebanks (~52K tokens). For each, UPOS round-trip recovery is reported, and no invalid UD feature values are produced for any language:
| Lang | Treebank | Tokens | UPOS round-trip | Forward POS* |
|---|---|---|---|---|
| kir | Kyrgyz-KTMU | 11771 | 100.0% | — |
| kaz | Kazakh-KTB | 10007 | 98.9% | 98.2% |
| uig | Uyghur-UDT | 10330 | 99.9% | — |
| tur | Turkish-IMST | 10032 | 100.0% | — |
| uzb | Uzbek-UT | 5930 | 99.9% | — |
| tat | Tatar-NMCTT | 2280 | 99.9% | — |
| sah | Yakut-YKTDT | 1460 | 100.0% | 99.6% |
| aze | Azerbaijani-TueCL | 912 | 100.0% | — |
* Forward POS is the stronger, real-direction metric (gold Apertium tag →
a2ud → compared to gold UPOS). It is only available when a treebank stores
genuine Apertium tags in its XPOS column — among these, Kazakh-KTB and
Yakut-YKTDT do. The others either use a non-Apertium XPOS
(Kyrgyz/Turkish/Uyghur) or leave it blank (Uzbek/Tatar/Azerbaijani), so only the
(lossier) round-trip metric applies there; this is detected automatically.
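
The forward metric described in this note can be sketched as follows. This is an illustration under stated assumptions, not the code in tests/integration_ud_eval.py: the Apertium tags are assumed to be already split into a list, a token counts as correct when its gold UPOS is among the predicted ones, and `toy_convert` is a stand-in with the same return shape as a2ud so that the snippet runs on its own.

```python
def forward_pos_accuracy(tokens, convert):
    """Share of tokens whose gold UPOS is recovered from the gold Apertium tags.

    tokens:  iterable of (apertium_tags, gold_upos) pairs
    convert: callable returning (upos_list, feature_list), like a2ud
    """
    total = correct = 0
    for tags, gold_upos in tokens:
        predicted_upos, _feats = convert(tags)
        total += 1
        correct += gold_upos in predicted_upos
    return correct / total if total else 0.0


def toy_convert(tags):
    # Stand-in for a2ud with an invented two-entry table; not the real converter.
    table = {"n": "NOUN", "v": "VERB"}
    return [table[t] for t in tags if t in table], []


# Invented sample tokens, not treebank data.
sample = [(["n", "pl"], "NOUN"), (["v", "tv"], "VERB"), (["n", "sg"], "PROPN")]
print(round(forward_pos_accuracy(sample, toy_convert), 3))
```

To try it with the real converter, `a2ud` could be passed in place of `toy_convert`, assuming the package and its resources are installed.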
Caveats: kir/kaz/uig use packaged .udx rules, while tat/tur/uzb/
aze/sah use the generic wiki map (less refined feature output). Some
treebanks add noise — Tatar-NMCTT is a Tatar–Russian code-switching corpus, so a
portion of its tokens are Russian.
- POS: fully covered; every UPOS in the eight treebanks is produced.
- Features: ~1.4% of feature instances in the treebanks use a UD pair the converter cannot produce (most frequent: Aspect=Prog, Polite=Infm, Reflex=Yes for generic-map langs, Mood=Des, Voice=Rcp, Case=Equ). These are absent from the Apertium "List of symbols" wiki and from the upstream .udx files; they are not invented here.
- Generic-map languages (tat/tur/uzb/aze/sah) have no upstream .udx, so their feature output is coarser than kir/kaz/uig.
- chunks and XML tags are intentionally out of scope (upstream).
- Verb transitivity (tv/iv) is not emitted as a feature, matching the upstream .udx decision.
- No Turkic language with both an Apertium analyser and a UD treebank is currently left out; languages missing one of the two (e.g. Chuvash, Bashkir, Crimean Tatar, Dungan) cannot be added.
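
A figure like the feature-coverage share in the list above can be computed along the following lines. This is a rough sketch, not the project's evaluation code: `uncovered_share` is a hypothetical helper, and the counts are invented toy numbers, not treebank statistics.

```python
from collections import Counter


def uncovered_share(gold_counts, producible):
    """Return (share of gold feature instances the converter cannot produce,
    missing Name=Value pairs sorted by frequency)."""
    total = sum(gold_counts.values())
    missing = Counter(
        {feat: n for feat, n in gold_counts.items() if feat not in producible}
    )
    share = sum(missing.values()) / total if total else 0.0
    return share, missing.most_common()


# Invented toy counts for illustration only.
gold = Counter({"Case=Nom": 60, "Number=Sing": 35, "Aspect=Prog": 4, "Case=Equ": 1})
producible = {"Case=Nom", "Number=Sing"}

share, missing = uncovered_share(gold, producible)
print(f"{share:.1%}", missing)
```

Counting instances instead of distinct pairs keeps rare, unproducible features from dominating the figure.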
A citation is greatly appreciated if you use this work.
@misc{apertium2ud2023alekseev,
title = {{alexeyev/apertium2ud: mapping tagsets}},
year = {2023},
url = {https://github.com/alexeyev/apertium2ud}
}