Thai News Dataset from Thai government website.
Updated Feb 19, 2025 - Jupyter Notebook
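MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.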
A Python command-line tool to align utterances from subtitle language pairs
Cantonese text filter (粵文語料篩選器, a filter for Cantonese corpus text)
Generate synthetic text using a variety of methods, e.g. context-free grammars (CFGs), with parameterized complexity, to test your NLP methods (such as LLMs)
ParlaMint: Comparable Parliamentary Corpora
Web-based Streamlit application created for a multilingual aligner project.
Corpora for Machine Translation—Tamang to Nepali
Corpus for linguistic study of natural gas pipeline debates.
The most complete corpus of modern Chinese-language poetry on Earth: 3k+ poets, 80K+ poems, 15M+ characters
This project is designed to help manage and analyze large corpora of text data. It provides tools for importing, processing, and querying text data efficiently.
Contains syntactic and semantic annotations of 64 German experiencer-object verbs as well as the data and scripts for publications related to it.
Open-source corpora created, annotated, or maintained by the ACoLi group at the University of Augsburg, Germany.
This is the database of Mishnaic disputes and constituent arguments from Kazhdan and Kay (published in 2022 in JSIJ).
DANTE lexical database of English
General Missives in Text-Fabric
Bengali Natural Language Processing (BengaliNLP)
A parser for annotated MuseScore 3 files.