A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.
The main focus is on open source tools, downloadable data and research papers with code.
If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:
- The repository's owners explicitly say that "this library is not maintained".
- The repository has had no commits for a long time (2~3 years).
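The two deprecation criteria above can be checked mechanically. The following is a minimal sketch, not part of the list's tooling: the function name, its parameters, and the choice of a 2-year cutoff (the lower end of "2~3 years") are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: "2~3 years" is read as a 2-year cutoff; the list does not fix an exact number.
STALE_AFTER = timedelta(days=2 * 365)


def is_deprecated(last_commit: date, unmaintained_notice: bool, today: date) -> bool:
    """Return True if either deprecation criterion applies.

    last_commit: date of the most recent commit in the repository.
    unmaintained_notice: True if the owners explicitly say the project is not maintained.
    today: the date the check is run on (passed in to keep the function deterministic).
    """
    if unmaintained_notice:
        return True
    return today - last_commit > STALE_AFTER


# Example: a repository last touched in early 2020, checked in 2024.
print(is_deprecated(date(2020, 1, 1), False, date(2024, 1, 1)))
```

Passing `today` explicitly keeps the check reproducible; a maintainer script could supply `date.today()` instead.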
- Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
- kyWaC: Kyrgyz corpus from the web, 19M words, Jan 2012 [not open]
- Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
- TilCorpusu: Kyrgyz corpus, 100M words, news+fiction, made public in July 2023 (just the News part due to legal restrictions)
- kloop corpus: 16,826 articles (sqlite3 DB file) + crawler code
- Kyrgyz language hand-written letters (Kyrgyz MNIST): hand-written Kyrgyz alphabet letters collection for machine learning applications; original images (a total of 80,213) have been transformed to 50x50 images, then to CSV format
- KyrgyzLLM-Bench: KyrgyzMMLU (392 tasks) and KyrgyzRC (80 tasks) are the original datasets; GSM8K, HellaSwag, BoolQ, WinoGrande, and TruthfulQA are the translated and manually corrected versions of respective English-language datasets
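The hand-written letters dataset above is described as 50x50 images flattened to CSV. A rough sketch of turning one CSV row back into a 50x50 grid follows. The column layout (a label followed by 2,500 pixel values in row-major order) is an assumption, not something the dataset description states, so check the actual file header before relying on it; `row_to_image` is a hypothetical helper name.

```python
import csv
import io
from typing import List, Tuple

SIDE = 50  # image side length stated in the dataset description


def row_to_image(row: List[str]) -> Tuple[str, List[List[int]]]:
    """Split one CSV row into (label, 50x50 pixel grid).

    Assumption: the first column is the label and the remaining
    SIDE * SIDE columns are pixel intensities in row-major order.
    """
    label, pixels = row[0], [int(value) for value in row[1:]]
    if len(pixels) != SIDE * SIDE:
        raise ValueError(f"expected {SIDE * SIDE} pixel values, got {len(pixels)}")
    grid = [pixels[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]
    return label, grid


# Synthetic one-row CSV used only to exercise the function (not real data).
fake_csv = io.StringIO(",".join(["А"] + ["0"] * (SIDE * SIDE)) + "\n")
for csv_row in csv.reader(fake_csv):
    letter, image = row_to_image(csv_row)
    print(letter, len(image), len(image[0]))
```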
- UD project's comments on the difficulties of Turkish language processing; they may shed light on why parsing Kyrgyz is hard as well
- KTMU's UD Treebank, 781 sentences; UPD: now even more sentences! + some fixes in the previous version of the dataset
- Small UD Treebank: 145 sentences (incl. 20 Cairo sentences), and ~ 100 sentences suggested by UD Turkic Group; a part of UD Turkic Treebank; also note that the translations to English, Azerbaijani and Turkish are available
- Verbal paradigms for Kyrgyz (100 Kyrgyz verbs fully conjugated in all tenses) by Aytnatova Alima, annotation for Unimorph by E. Chodroff
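UD treebanks such as the ones listed above are distributed in the CoNLL-U format: one token per line, ten tab-separated columns, blank lines between sentences. The sketch below reads that format with the standard library only. The sample sentence and its annotation are hand-written for illustration and are not taken from any of the treebanks; `read_conllu` is a hypothetical helper, and dedicated CoNLL-U libraries exist if you need full validation.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts; comment lines are skipped."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comments (# text = ..., # sent_id = ...)
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            continue                  # ignore malformed lines in this sketch
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        yield sentence


# Hand-written illustrative sentence ("I read a book."), NOT from a real treebank.
SAMPLE = "\n".join([
    "# text = Мен китеп окудум.",
    "\t".join(["1", "Мен", "мен", "PRON", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "китеп", "китеп", "NOUN", "_", "_", "3", "obj", "_", "_"]),
    "\t".join(["3", "окудум", "оку", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
    "",
])

for tokens in read_conllu(SAMPLE):
    print([(t["form"], t["upos"], t["deprel"]) for t in tokens])
```

Multiword-token ranges (IDs like `1-2`) and empty nodes (IDs like `1.1`) are kept as ordinary rows here; filter on the `id` field if you need syntactic words only.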
- WikiANN has a Kyrgyz language part
- KyrgyzNER: 1,499 news articles from the 24.KG news portal, 10,900 sentences, 39,075 entity mentions, 27 named entity classes
- Kyrgyz Multi-Label News Classification: training and evaluation code as well as the dataset of 1000/500 news documents are available
- Kyrgyz Word Embedding Evaluation: [not published yet]; the 2 best models have been released
- kyrgyz-nlp/disambiguator: a project that studies the ability of popular embedding models to select word senses based on word hints (anchor words): paper
- Machine-Translated Alpaca: Stanford Alpaca instructions translated into Kyrgyz using ChatGPT and Google Translate
- Country names table: Kyrgyz-Russian-English
- KyrSpell — Kyrgyz orthographic thesaurus.
  ⚠️ License caveat: the package's terms appear to prohibit unpacking/extracting its contents; review the license before doing so.
- Tatu Ylonen's enwiktionary-based dictionary (also please see the derived En-Ky Anki deck for language learners)
- Kyrgyz Language Seed Dataset OLDI, 6,193 Kyrgyz-English sentences: github, experiments | huggingface PR | WMT2025 paper
- TurkLang-7, 426,190 Kyrgyz-Russian pairs (not publicly available): paper
- X-WMT, 500 Kyrgyz-English translated pairs: github, data | paper
- NLLBv1 (use with care, e.g. non-Kyrgyz sentences detected!), 21,360,637 Kyrgyz-English pairs: data
- GoURMET project data, 14,498 Kyrgyz-English and 23,017 Kyrgyz-Russian sentences
- FLORES+, professional translation, dev:devtest 997:1012 sentences
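Since some of the parallel data above (NLLBv1 in particular) is reported to contain non-Kyrgyz sentences, a cheap pre-filter can help before training. The following is a heuristic sketch only: it measures the share of letters that belong to the 36-letter Kyrgyz Cyrillic alphabet. The function names and the 0.9 threshold are assumptions. It catches Latin-script text and letters specific to other Cyrillic alphabets, but it cannot tell Kyrgyz from Russian, so a proper language-identification model is still advisable.

```python
# The 36 letters of the Kyrgyz Cyrillic alphabet (33 Russian letters plus ң, ө, ү).
KYRGYZ_LETTERS = set("абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя")


def kyrgyz_letter_ratio(text: str) -> float:
    """Share of alphabetic characters that belong to the Kyrgyz alphabet."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in KYRGYZ_LETTERS for ch in letters) / len(letters)


def looks_kyrgyz(text: str, min_ratio: float = 0.9) -> bool:
    """Heuristic check; min_ratio is an arbitrary threshold chosen for this sketch."""
    return kyrgyz_letter_ratio(text) >= min_ratio


candidates = [
    "Бүгүн аба ырайы жакшы.",
    "The weather is nice today.",
]
kept = [sentence for sentence in candidates if looks_kyrgyz(sentence)]
print(kept)
```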
- Mozilla Common Voice — Kyrgyz: crowd-sourced read speech with transcripts, released under CC-0. Kyrgyz has been part of the project since 2019; recent releases are mirrored on Hugging Face under `mozilla-foundation` (e.g., `common_voice_12_0`) and as community mirrors such as `fsicoli/common_voice_22_0`.
- CSLT ASR data: 128h speech, 163 speakers (100m/63f), audio transcriptions, lexicon at the word level. License must be requested by emailing `shiying@cslt.org` or `lilt@cslt.org`; original Kyrgyz texts are not shared (only the phonetic transcription).
- Polyglot morfessor — pretrained morfessor model, number 6
- fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
- compressed fastText — fasttext-ky-mini prepared by Liebl Bernhard in 2021.
- fastText trained on Leipzig Corpora — best-performing 100/300-dimensional fastText vectors provided by the authors of the HJ-Ky-0.1 paper.
- fastText from Kuriyozov et al.'2020 — trained on SketchEngine's KyWaC
- BERT-based NER — `bert-base-multilingual-cased` fine-tuned on WikiANN for NER on Kyrgyz. The author warns that this model is not usable and was built just as a proof of concept. It will be updated later.
- Manas-GPT — Janar Osmonaliev's fun personal project: training nanoGPT on Sayakbai Karalaev's version of the Epic of Manas
- kyrgyz-tokenizers-collection — pre-trained subword tokenizers for Kyrgyz (by @metinovadilet)
- KyrgyzBert — BERT (6 encoders, 8 heads, hidden dim 512) trained on Kyrgyz texts (data is not available) from scratch (by @metinovadilet)
- AkylAI TTS Mini (code) — Matcha-TTS-based Kyrgyz text-to-speech model, trained on ~13h of speech / 7,000 samples; training scripts included.
- TurkicTTS — multilingual TTS by IS2AI covering ten Turkic languages including Kyrgyz; built on top of KazakhTTS2.
- spaCy basic support: tokenization, stopwords, `like_num`
- stanza-ky pipeline called 'ktmu'; use with care, as its handling of brackets looks very suspicious
- Kyrgyz for Apertium: morphological analysis and generation, PoS-tagging; installation script: install_apertium_kir.sh. A much, much easier way: `import apertium; apertium.installer.install_module("kir")`.
- [DEPRECATED] kymopl: Kyrgyz morphology in Prolog
- ӨҮҢизатор: a proof-of-concept letter-replacement Telegram bot demo, fixes incorrect usages of 'О', 'У', 'Н' ⇒ 'Ө', 'Ү', 'Ң'.
- Tilchi electronic Russian-Kyrgyz dictionary, open source desktop application
- Number-to-words conversion (JavaScript) by @AzamatSooldaev
- Number-to-words conversion (TypeScript) by @timursaurus
- Telegram bot for Kyrgyz morphological analysis by @sasha-kir based on Apertium data for Kyrgyz
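Apertium's morphological analysers print their results in the Apertium stream format, where each lexical unit looks like `^surface/lemma<tag><tag>/...$`. If you post-process such output yourself, a small parser along the following lines may be enough. This is a simplified sketch: `parse_stream` is a hypothetical helper, escaped characters are not handled, and the example string is hand-written to illustrate the format rather than copied from the Kyrgyz analyser's real output.

```python
import re
from typing import Dict, List

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")   # one ^...$ unit
TAG = re.compile(r"<([^<>]+)>")           # one <tag>


def parse_stream(stream: str) -> List[Dict]:
    """Parse Apertium stream-format text into surface forms and readings.

    Simplification: backslash-escaped characters are not treated specially.
    Unknown words (^word/*word$) come out with a lemma starting with "*" and no tags.
    """
    units = []
    for match in LEXICAL_UNIT.finditer(stream):
        surface, *analyses = match.group(1).split("/")
        readings = []
        for analysis in analyses:
            lemma = analysis.split("<", 1)[0]
            readings.append({"lemma": lemma, "tags": TAG.findall(analysis)})
        units.append({"surface": surface, "readings": readings})
    return units


# Hand-written illustration of the format (tags chosen for the example).
example = "^китептер/китеп<n><pl><nom>$"
print(parse_stream(example))
```

Ambiguous words simply produce several entries in `readings`, one per `/`-separated analysis.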
- Kyrgyz NLP bibliography: kyrgyznlp.github.io
- Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
- Apertium's useful list of tools and other resources
- Online dictionaries and other useful resources on el-sozduk.kg
- Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University