Awesome Kyrgyz NLP

A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:

The repository's owners explicitly say that the library is not maintained.
It has had no commits for a long time (2-3 years).

Datasets

Corpora

Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
kyWaC: Kyrgyz corpus from the web, 19M words, Jan 2012 [not open]
Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
TilCorpusu: Kyrgyz corpus, 100M words, news+fiction, made public in July 2023 (just the News part due to legal restrictions)

Raw text

kloop corpus: 16,826 articles (sqlite3 DB file) + crawler code
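
The table layout of the SQLite file is not described in this list, so a cautious first step is to inspect the schema rather than guess table names. A minimal Python sketch using only the standard library; the file name `kloop.db` in the usage note is a placeholder assumption, not the actual name of the released file.

```python
import sqlite3


def describe_sqlite(path):
    """Return {table_name: [column names]} for every table in the database."""
    conn = sqlite3.connect(path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        schema = {}
        for table in tables:
            # Quote the identifier so unusual table names do not break the PRAGMA.
            quoted = '"' + table.replace('"', '""') + '"'
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()
```

Calling `describe_sqlite("kloop.db")` (hypothetical path) returns the tables and their columns, which is enough to write the actual `SELECT` for article texts.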

Character recognition

Kyrgyz language hand-written letters (Kyrgyz MNIST): hand-written Kyrgyz alphabet letters collection for machine learning applications; original images (a total of 80,213) have been transformed to 50x50 images, then to CSV format
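
Since the images are distributed as CSV, each row has to be reshaped back into a 50x50 grid before use. The sketch below shows one way to do that with the standard library. The column layout is an assumption: the `label_column` and `has_header` parameters are hypothetical knobs to adapt once the real file has been inspected.

```python
import csv
import io

SIDE = 50  # images are 50x50 according to the dataset description


def row_to_image(pixels, side=SIDE):
    """Reshape a flat sequence of side*side values into a list of rows."""
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(pixels)}")
    return [list(pixels[r * side:(r + 1) * side]) for r in range(side)]


def read_images(handle, label_column=None, has_header=False):
    """Yield (label, image) pairs from an open CSV file.

    label_column is the index of the label column (e.g. 0 or -1), or None
    if the rows hold pixel values only. ASSUMPTION: check the real layout.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)
    for row in reader:
        if not row:
            continue
        label = row.pop(label_column) if label_column is not None else None
        yield label, row_to_image([int(float(value)) for value in row])
```

With a real file this would be used as `with open(path, newline="") as f: for label, image in read_images(f, label_column=-1): ...`, where `path` and the label position are whatever the dataset actually uses.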

LLM Evaluation Data

KyrgyzLLM-Bench: KyrgyzMMLU (392 tasks) and KyrgyzRC (80 tasks) are the original datasets; GSM8K, HellaSwag, BoolQ, WinoGrande, and TruthfulQA are the translated and manually corrected versions of respective English-language datasets

Morphology & Syntax

UD project's comments on difficulties in Turkish language processing; these may shed light on why parsing Kyrgyz is hard as well
KTMU's UD Treebank: 781 sentences originally; since updated with more sentences and some fixes to the previous version of the dataset
Small UD Treebank: 145 sentences (incl. 20 Cairo sentences), and ~ 100 sentences suggested by UD Turkic Group; a part of UD Turkic Treebank; also note that the translations to English, Azerbaijani and Turkish are available
Verbal paradigms for Kyrgyz (100 Kyrgyz verbs fully conjugated in all tenses) by Aytnatova Alima, annotation for Unimorph by E. Chodroff
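
UD treebanks such as the ones above are distributed in the CoNLL-U format (ten tab-separated columns per token, blank line between sentences). A minimal reader, as a sketch: the sample sentence is hand-written for illustration and is not taken from either treebank.

```python
from collections import Counter


def read_conllu(lines):
    """Yield sentences as lists of token dicts from CoNLL-U formatted lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


# Hand-written illustration, NOT an excerpt from a real treebank.
SAMPLE = (
    "# text = Мен китеп окудум.\n"
    "1\tМен\tмен\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tкитеп\tкитеп\tNOUN\t_\t_\t3\tobj\t_\t_\n"
    "3\tокудум\tоку\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)
```

For a real treebank file, pass the open file handle instead of `SAMPLE.splitlines()`; `len(sentences)` then gives a quick check against the sentence counts quoted above.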

Named Entity Recognition

WikiANN has a Kyrgyz language part
KyrgyzNER: 1,499 news articles from the 24.KG news portal, 10,900 sentences, 39,075 entity mentions, 27 named entity classes

Text Classification

Kyrgyz Multi-Label News Classification: training and evaluation code as well as the dataset of 1000/500 news documents are available

Word Similarity and Word Sense Disambiguation

Kyrgyz Word Embedding Evaluation: [not published yet]; the 2 best models have been released
kyrgyz-nlp/disambiguator: a project studying the ability of popular embedding models to select word senses based on word hints (anchor words); paper

Instructions

Machine-Translated Alpaca: Stanford Alpaca instructions translated into Kyrgyz using ChatGPT and Google Translate

Machine-readable dictionaries

Country names table: Kyrgyz-Russian-English
KyrSpell — Kyrgyz orthographic thesaurus. ⚠️ License caveat: the package's terms appear to prohibit unpacking/extracting its contents — review the license before doing so.
Tatu Ylonen's enwiktionary-based dictionary (also please see the derived En-Ky Anki deck for language learners)

Machine Translation

Kyrgyz Language Seed Dataset OLDI, 6,193 Kyrgyz-English sentences: github, experiments | huggingface PR | WMT2025 paper
TurkLang-7, 426,190 Kyrgyz-Russian pairs (not publicly available): paper
X-WMT, 500 Kyrgyz-English translated pairs: github, data | paper
NLLBv1 (use with care, e.g. non-Kyrgyz sentences detected!), 21,360,637 Kyrgyz-English pairs: data
GoURMET project data, 14,498 Kyrgyz-English and 23,017 Kyrgyz-Russian sentences
FLORES+, professional translation, dev:devtest 997:1012 sentences
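
Given the warning about non-Kyrgyz sentences in mined data, a cheap first-pass filter is to drop lines that are not predominantly Cyrillic. This is a rough sketch, not a method used by any of the datasets above: the 0.8 threshold is an arbitrary assumption, and a script check cannot tell Kyrgyz from Russian, Kazakh, or other Cyrillic-script languages, so a proper language identifier is still needed afterwards.

```python
import unicodedata


def cyrillic_ratio(text):
    """Share of alphabetic characters whose Unicode name starts with CYRILLIC."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith("CYRILLIC")
    )
    return cyrillic / len(letters)


def looks_cyrillic(text, threshold=0.8):
    """True if at least `threshold` of the letters are Cyrillic (assumed cutoff)."""
    return cyrillic_ratio(text) >= threshold
```

A typical use would be `pairs = [(ky, en) for ky, en in pairs if looks_cyrillic(ky)]`, where `pairs` is a hypothetical list of sentence pairs already loaded in memory.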

Speech

Mozilla Common Voice — Kyrgyz: crowd-sourced read speech with transcripts, released under CC-0. Kyrgyz has been part of the project since 2019; recent releases are mirrored on Hugging Face under mozilla-foundation (e.g., common_voice_12_0) and as community mirrors such as fsicoli/common_voice_22_0.
CSLT ASR data: 128h speech, 163 speakers (100m/63f), audio transcriptions, lexicon at the word level. License must be requested by emailing shiying@cslt.org or lilt@cslt.org; original Kyrgyz texts are not shared (only the phonetic transcription).

Pretrained models

Polyglot morfessor — pretrained morfessor model, number 6
fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
compressed fastText — fasttext-ky-mini prepared by Liebl Bernhard in 2021.
fastText trained on Leipzig Corpora — best-performing 100/300-dimensional fastText vectors provided by the authors of the HJ-Ky-0.1 paper.
fastText from Kuriyozov et al. (2020) — trained on SketchEngine's KyWaC
BERT-based NER — bert-base-multilingual-cased fine-tuned on WikiANN for NER on Kyrgyz. The author warns that this model is not usable and was built only as a proof of concept; it is expected to be updated later.
Manas-GPT — Janar Osmonaliev's fun personal project: training nanoGPT on Sayakbai Karalaev's version of Epic of Manas
kyrgyz-tokenizers-collection — pre-trained subword tokenizers for Kyrgyz (by @metinovadilet)
KyrgyzBert — BERT (6 encoders, 8 heads, hidden dim 512) trained on Kyrgyz texts (data is not available) from scratch (by @metinovadilet)
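
The fastText vectors listed above that come in text form use the usual word2vec-style layout: a header line with the vocabulary size and dimension, then one word per line followed by its values. A dependency-free sketch for loading a slice of such a file and comparing two words; the `limit` parameter is an illustrative addition to keep memory use small.

```python
import math


def load_vectors(lines, limit=None):
    """Parse text-format vectors: a 'count dim' header, then 'word v1 v2 ...'."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])
    vectors = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With a downloaded file this would look like `with open("vectors.txt", encoding="utf-8") as f: vecs = load_vectors(f, limit=50000)`, where `vectors.txt` is a placeholder name. The `.bin` files need the fastText library itself and are not covered by this sketch.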

Speech

AkylAI TTS Mini (code) — Matcha-TTS-based Kyrgyz text-to-speech model, trained on ~13h of speech / 7,000 samples; training scripts included.
TurkicTTS — multilingual TTS by IS2AI covering ten Turkic languages including Kyrgyz; built on top of KazakhTTS2.

Methods/Software

spaCy basic support: tokenization, stopwords, like_num
stanza-ky pipeline called 'ktmu'; use with care, as its handling of brackets appears unreliable

Morphology

Kyrgyz for Apertium: morphological analysis and generation, PoS-tagging; installation script: install_apertium_kir.sh. A much, much easier way: import apertium; apertium.installer.install_module("kir").
[DEPRECATED] kymopl: Kyrgyz morphology in Prolog
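
Apertium analysers emit their results in the Apertium stream format, where each lexical unit looks like `^surface/lemma<tag><tag>/...$` and unknown words are marked with `*`. The sketch below parses that format into Python tuples. The sample string is hand-written for illustration; the actual tags produced by the Kyrgyz module may differ, and escaped characters and `+`-joined compound analyses are deliberately ignored.

```python
import re

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")
TAG = re.compile(r"<([^<>]+)>")


def parse_stream(text):
    """Return [(surface, [(lemma, [tags...]), ...]), ...] for each lexical unit."""
    units = []
    for body in LEXICAL_UNIT.findall(text):
        surface, *analyses = body.split("/")
        readings = []
        for analysis in analyses:
            if analysis.startswith("*"):
                continue  # "*" marks a word the analyser does not know
            lemma = analysis.split("<", 1)[0]
            readings.append((lemma, TAG.findall(analysis)))
        units.append((surface, readings))
    return units


# Hand-written illustration, NOT actual output of the Kyrgyz analyser.
SAMPLE = "^китептер/китеп<n><pl><nom>$ ^xyz/*xyz$"
parsed = parse_stream(SAMPLE)
```

Words with an empty list of readings are the ones the analyser could not handle, which makes this a convenient way to measure coverage on a text.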

Hate Speech detection

Jupyter Notebook for hate speech detection

Spelling and orthography

ӨҮҢизатор: a proof-of-concept letter-replacement Telegram bot demo, fixes incorrect usages of 'О', 'У', 'Н' ⇒ 'Ө', 'Ү', 'Ң'.
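
To illustrate the task itself, here is a naive dictionary-based sketch of the 'О', 'У', 'Н' ⇒ 'Ө', 'Ү', 'Ң' restoration. It is not the bot's method, which is not described in this list: it only assumes that a list of correctly spelled words is available, and it leaves a word unchanged whenever more than one spelling is possible.

```python
import re

# Map the Kyrgyz-specific letters to the letters typically typed instead.
FLATTEN = str.maketrans("ӨөҮүҢң", "ОоУуНн")


def build_index(vocabulary):
    """Map each 'flattened' spelling to the set of correct spellings."""
    index = {}
    for word in vocabulary:
        word = word.lower()
        index.setdefault(word.translate(FLATTEN), set()).add(word)
    return index


def restore(text, index):
    """Replace words that have exactly one known correct spelling."""
    def fix(match):
        word = match.group(0)
        candidates = index.get(word.lower().translate(FLATTEN), set())
        if len(candidates) != 1:
            return word  # unknown or ambiguous: leave untouched
        fixed = next(iter(candidates))
        if word.isupper():
            return fixed.upper()
        if word[0].isupper():
            return fixed.capitalize()
        return fixed
    return re.sub(r"\w+", fix, text)


# Tiny illustrative vocabulary; "он" and "оң" collide on purpose.
index = build_index(["өлкө", "бүгүн", "он", "оң"])
restored = restore("Бугун олко он", index)
```

The ambiguous cases are the hard part of the problem and need context to resolve, which a lookup table like this cannot provide.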

Other

Tilchi electronic Russian-Kyrgyz dictionary, open source desktop application
Number-to-words conversion (JavaScript) by @AzamatSooldaev
Number-to-words conversion (TypeScript) by @timursaurus
Telegram bot for Kyrgyz morphological analysis by @sasha-kir based on Apertium data for Kyrgyz
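
For readers who want to see what number-to-words conversion involves, here is a small Python sketch for integers from 0 to 999,999. It is independent of the two libraries above and may differ from their output: in particular, rendering 100 and 1000 as "жүз" and "миң" without a leading "бир" is a choice made for this example.

```python
UNITS = ["", "бир", "эки", "үч", "төрт", "беш", "алты", "жети", "сегиз", "тогуз"]
TENS = ["", "он", "жыйырма", "отуз", "кырк", "элүү",
        "алтымыш", "жетимиш", "сексен", "токсон"]


def _below_thousand(n):
    """Return the words for 1..999 as a list (empty list for 0)."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        if hundreds > 1:
            parts.append(UNITS[hundreds])
        parts.append("жүз")
    tens, units = divmod(rest, 10)
    if tens:
        parts.append(TENS[tens])
    if units:
        parts.append(UNITS[units])
    return parts


def number_to_words(n):
    """Spell out an integer in the range 0..999999 in Kyrgyz."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n == 0:
        return "нөл"
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        if thousands > 1:
            parts.extend(_below_thousand(thousands))
        parts.append("миң")
    parts.extend(_below_thousand(rest))
    return " ".join(parts)
```

Ordinals, case suffixes, and larger numbers are out of scope here; the listed libraries are the better choice for real use.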

Online Demos

Cyrillic-to-Latin online converter based on this resource.

Miscellaneous

Kyrgyz NLP bibliography: kyrgyznlp.github.io
Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
Apertium's useful list of tools and other resources
Online dictionaries and other useful resources on el-sozduk.kg
Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University

Contributions to this list

@golden-ratio
