corpus

Here are 41 public repositories matching this topic...

lil-lab / nlvr

Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.

machine-learning natural-language-processing computer-vision corpus

Updated Aug 18, 2022
HTML

lxs602 / Chinese-Mandarin-Dictionaries

Chinese dictionaries (中文词典 / 中文詞典). Chinese / Chinese-English dictionaries.

unicode dictionaries dictionary corpus english chinese hanzi goldendict chinese-language zhongwen chinese-mandarin-dictionaries handian

Updated Dec 23, 2025
HTML

JiangYanting / Pre-modern_Chinese_corpus_dataset

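Pre-modern Chinese corpus dataset: natural language processing, corpus, Ancient Chinese, Classical Chinese, digital humanities, computational linguistics.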

machine-learning natural-language-processing data-mining corpus dataset

Updated Mar 4, 2025
HTML

ELI-Data-Mining-Group / PELIC-dataset

The University of Pittsburgh English Language Institute Corpus (PELIC) dataset

corpus esl lexical-analysis longitudinal-data concordancer tesol second-language-acquisition learner-corpus intensive-english-program english-for-academic-purposes second-language-writing

Updated Mar 6, 2026
HTML

alitekdemir / Risale-i-Nur-Diyanet

The Diyanet text of the Risale-i Nur Collection, corrected against the original copy, in digital form!

html markdown corpus islam

Updated Dec 16, 2024
HTML

MiMoText / roman18

Collection de romans français du dix-huitième siècle (1751-1800) / Collection of Eighteenth-Century French Novels (1751-1800)

corpus enlightenment novels french literature trier 18th-century

Updated Apr 23, 2024
HTML

tylergneill / pramana-nlp

data, metadata, tools, and LDA experiments on a corpus of Sanskrit philosophy texts

corpus topic-modeling segmentation lda identifiers

Updated Nov 28, 2021
HTML

CuiShaohua / News-Review-Pickup

Automatic extraction of statements made by people in the news: identifies the speaker and what was said.

flask word2vec corpus sbv myproject npy pyc

Updated Jan 17, 2020
HTML

dstl / muc3

Message Understanding Conference 3 Corpus

html corpus tipster

Updated Feb 17, 2021
HTML

sonu-gupta / tosdr-terms-of-service-corpus

This repository contains Python code to create a corpus of 12,215 terms of service documents scraped from TOSDR, intended for legal, privacy, and natural language processing research.

python corpus language-resources tosdr terms-of-service-agreements

Updated Mar 14, 2023
HTML

lungetech / cgc-corpus

DARPA CGC Corpus

corpus cgc

Updated May 1, 2017
HTML

pln-fing-udelar / humor

HUMOR dataset for humor research

nlp machine-learning humor corpus dataset crowdsourcing nlp-machine-learning

Updated Mar 29, 2023
HTML

AndyTheFactory / article-extraction-dataset

Article title, authors, date and body extraction dataset.

text-mining news html-to-markdown scraping corpus news-aggregator text-extraction dataset web-scraping readability datasets scraping-websites html2text news-crawler corpus-builder corpus-tools article-extractor text-cleaning text-preprocessing

Updated Mar 26, 2024
HTML

KurdishBLARK / KurdishLyricsCorpus

A corpus of Kurdish folkloric lyrics

lyrics corpus kurdish folkloristics kurdish-language-processing

Updated Apr 12, 2023
HTML

slack0 / sumspeech

A Text / Speech Summarizer

vocabulary corpus speech matrix-factorization sentence topic-modeling summarization tf-idf topic-extraction topic-distribution

Updated Nov 1, 2025
HTML

Jean-Baptiste-Camps / Geste

A corpus of chansons de geste

corpus corpus-data xml-tei pos-tagging old-french lemmatization

Updated Sep 14, 2021
HTML

burgos2021 / programa

Materials for the summer course «Del corpus a la interpretación: Estilometría con R» ("From Corpus to Interpretation: Stylometry with R"), Burgos, 2021

r corpus stylometry

Updated Sep 11, 2021
HTML

motazsaad / Arabic-Stories-Corpus

Arabic Stories Corpus

stories corpus story arabic arabic-nlp arabic-language

Updated Dec 16, 2021
HTML

Kimonokimo / NLP-comment-project

Toxic Comment Classification Project constructed by Qimo Li, Chen He and Kun Qiu for the course "Introduction to Natural Language Processing in Python" at Brandeis University.

python nlp data-science machine-learning natural-language-processing sentiment-analysis random-forest scikit-learn jupyter-notebook corpus cross-validation text-analysis linguistics spacy nltk classification logistic-regression postagging scattertext

Updated Dec 20, 2019
HTML

mr-segfault / fuzz_corpus_garden

A garden of file formats, gathered from a collection of sources, for use as inputs to fuzzing engines.

input seed corpus fuzzer fuzz fuzz-corpus-garden fuzzing-engines

Updated Oct 4, 2019
HTML
