alexeyev/awesome-kyrgyz-nlp

Awesome Kyrgyz NLP

A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:

  • The repository's owners explicitly state that the library is not maintained.
  • The repository has not received a commit in a long time (2-3 years).

Datasets

Corpora

  • Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
  • kyWaC: Kyrgyz corpus from the web, 19M words, Jan 2012 [not open]
  • Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
  • TilCorpusu: Kyrgyz corpus, 100M words, news+fiction, made public in July 2023 (just the News part due to legal restrictions)

Raw text

  • kloop corpus: 16,826 articles (sqlite3 DB file) + crawler code
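Since the kloop corpus ships as a sqlite3 DB file, a first step is usually to inspect what it contains. The following is a minimal sketch using only Python's standard `sqlite3` module; the schema of the file is not described in this list, so the code discovers table and column names instead of assuming them. The helper names `list_tables` and `preview_rows` are hypothetical, as is any file name you pass in.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in the SQLite file, sorted alphabetically."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def preview_rows(db_path, table, limit=3):
    """Return (column names, first `limit` rows) for `table`."""
    # Table names cannot be bound as SQL parameters, so check the name
    # against the actual schema before interpolating it into the query.
    if table not in list_tables(db_path):
        raise ValueError(f"unknown table: {table}")
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
```

Calling `list_tables` on the downloaded file and then `preview_rows` on each returned table should be enough to locate the article text column before writing any extraction code.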

Character recognition

LLM Evaluation Data

  • KyrgyzLLM-Bench: KyrgyzMMLU (392 tasks) and KyrgyzRC (80 tasks) are the original datasets; GSM8K, HellaSwag, BoolQ, WinoGrande, and TruthfulQA are the translated and manually corrected versions of respective English-language datasets

Morphology & Syntax

Named Entity Recognition

  • WikiANN has a Kyrgyz language part
  • KyrgyzNER: 1,499 news articles from the 24.KG news portal, 10,900 sentences, 39,075 entity mentions, 27 named entity classes

Text Classification

Word Similarity and Word Sense Disambiguation

Instructions

Machine-readable dictionaries

Machine Translation

  • Kyrgyz Language Seed Dataset OLDI, 6,193 Kyrgyz-English sentences: github, experiments | huggingface PR | WMT2025 paper
  • TurkLang-7, 426,190 Kyrgyz-Russian pairs (not publicly available): paper
  • X-WMT, 500 Kyrgyz-English translated pairs: github, data | paper
  • NLLBv1 (use with care, e.g. non-Kyrgyz sentences detected!), 21,360,637 Kyrgyz-English pairs: data
  • GoURMET project data, 14,498 Kyrgyz-English and 23,017 Kyrgyz-Russian sentences
  • FLORES+, professional translation, dev:devtest 997:1012 sentences

Speech

  • Mozilla Common Voice — Kyrgyz: crowd-sourced read speech with transcripts, released under CC-0. Kyrgyz has been part of the project since 2019; recent releases are mirrored on Hugging Face under mozilla-foundation (e.g., common_voice_12_0) and as community mirrors such as fsicoli/common_voice_22_0.
  • CSLT ASR data: 128h speech, 163 speakers (100m/63f), audio transcriptions, lexicon at the word level. License must be requested by emailing shiying@cslt.org or lilt@cslt.org; original Kyrgyz texts are not shared (only the phonetic transcription).

Pretrained models

Speech

  • AkylAI TTS Mini (code) — Matcha-TTS-based Kyrgyz text-to-speech model, trained on ~13h of speech / 7,000 samples; training scripts included.
  • TurkicTTS — multilingual TTS by IS2AI covering ten Turkic languages including Kyrgyz; built on top of KazakhTTS2.

Methods/Software

  • spaCy basic support: tokenization, stopwords, like_num
  • stanza-ky: the Stanza pipeline for Kyrgyz is called 'ktmu'; use with care, as its handling of brackets looks unreliable

Morphology

Hate Speech detection

Spelling and orthography

  • ӨҮҢизатор: a proof-of-concept letter-replacement Telegram bot demo, fixes incorrect usages of 'О', 'У', 'Н' ⇒ 'Ө', 'Ү', 'Ң'.
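To illustrate the kind of letter restoration described above, here is a toy sketch. It is not the bot's actual method, which is not documented in this list. The approach swapped in here is a plain lexicon lookup: generate every spelling obtained by optionally replacing 'О', 'У', 'Н' with 'Ө', 'Ү', 'Ң', and accept a correction only when exactly one alternative is a known word. The function names and the tiny lexicon are hypothetical.

```python
from itertools import product

# Plain Cyrillic letters that are often typed in place of the Kyrgyz-specific ones.
SUBSTITUTES = {"о": "ө", "у": "ү", "н": "ң", "О": "Ө", "У": "Ү", "Н": "Ң"}


def candidate_spellings(word):
    """Yield every spelling obtained by optionally swapping each О/У/Н for Ө/Ү/Ң."""
    options = [(ch, SUBSTITUTES[ch]) if ch in SUBSTITUTES else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def restore_letters(word, lexicon):
    """Return a corrected spelling, or the word unchanged if no unique fix exists.

    `lexicon` is assumed to be a set of lowercase, correctly spelled words.
    """
    if word.lower() in lexicon:
        return word  # already a known word, leave it alone
    matches = [
        c for c in candidate_spellings(word)
        if c != word and c.lower() in lexicon
    ]
    # Only correct when the fix is unambiguous.
    return matches[0] if len(matches) == 1 else word


toy_lexicon = {"кол", "көл", "күн", "жаңы", "өмүр"}
print(restore_letters("кун", toy_lexicon))
print(restore_letters("Омур", toy_lexicon))
```

The number of candidates doubles with every replaceable letter, so this is only practical for single words. It also cannot handle pairs where both spellings are valid words (such as 'кол' and 'көл'), which need sentence context to resolve.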

Other

Online Demos

Miscellaneous

Contributions to this list

@golden-ratio
