corpus-tools

Here are 100 public repositories matching this topic...

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping image-classification datasets news-crawler corpus-tools commoncrawl web-corpus news-scraping cc-news image-extraction

Updated Nov 10, 2025
Python

catlism / catlism.github.io

Companion website for "Corpus Approaches to Language in Social Media" - source and build versions

text-mining scraping linguistics digital-humanities text-processing corpus-linguistics corpus-tools corpus-processing scraping-python

Updated Nov 10, 2025
HTML

CSCfi / Kielipankki-utilities

Scripts for data conversion

vrt corpus-tools korp corpus-processing

Updated Nov 7, 2025
Python

timarkh / tsakorpus

Yet another search platform for linguistic corpora.

flask elasticsearch corpus linguistics corpus-linguistics corpus-tools linguistic-corpora language-documentation parallel-corpora media-aligned-corpora

Updated Nov 5, 2025
Python

ilinguistics / common_crawl_corpus

Scripts for building a geo-located web corpus using Common Crawl data

corpora corpus-linguistics web-crawling corpus-tools corpus-processing

Updated Nov 3, 2025
Python

BLKSerene / Wordless

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

translation tokenizer corpus linguistics tagger literature dependency-parser corpus-linguistics lemmatizer corpus-tools corpus-processing corpus-search corpus-statistics stopword corpus-analysis

Updated Oct 30, 2025
Python

czcorpus / kontext

An advanced, extensible web front-end for the Manatee-open corpus search engine

user-interface corpora corpus-linguistics corpus-tools

Updated Oct 21, 2025
TypeScript

czcorpus / depreldb

A fast database for UD dependency relations between lemmas

database linguistics corpus-linguistics universal-dependencies corpus-tools data-retrieval corpus-processing collocation-extraction

Updated Oct 14, 2025
Go

CLARIAH / wp6-missieven

General Missives in Text-Fabric

nlp history dutch corpus-linguistics corpus-data corpus-tools corpus-processing

Updated Oct 10, 2025
Jupyter Notebook

czcorpus / xmlanntools

XML annotation tools

linguistics annotation-processing xml-document tei-xml corpus-tools

Updated Oct 8, 2025
Python

jhlopesalves / CorpusAid

Automated text preprocessing pipeline for large corpora. Features customizable filters for diacritics, stop words, punctuation, and regex.

python natural-language-processing regex corpus-linguistics data-cleaning corpus-builder corpus-tools corpus-processing text-preprocessing data-cleaning-automation

Updated Oct 2, 2025
Python

Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit

nlp natural-language-processing machine-translation parallel-corpus corpus-tools corpus-processing

Updated Oct 1, 2025
Python

adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Updated Sep 12, 2025
Python

johnwdubois / rezonator

Rezonator: Dynamics of human engagement

game-development dialogue text-analysis discourse-analysis corpus-linguistics gamemaker-language linguistic-annotation-framework corpus-tools conversational-ai conversation-analysis gamemaker-studio-2 dialogic-syntax

Updated Sep 4, 2025
Game Maker Language

Signbank / FinSL-signbank

Web-based database for sign language lexicons and corpora. Fork of NGT-signbank (https://github.com/Signbank/Global-signbank).

sign-language linguistics corpus-tools signbank

Updated Nov 5, 2025
Python

gani114433 / OCR_workflow

N8N OCR workflow

nlp pdf opencv workflow ocr markdown-parser celery pdf-files computational-linguistics corpus-tools robotic-process-automation uipath goobi-workflow pdf-to-markdown

Updated Aug 26, 2025

lennes / spect

SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/

annotation analysis speech transcript corpus-linguistics transcription spoken-language praat corpus-tools speech-analysis conversational-speech speech-corpus spect

Updated Aug 20, 2025
Praat

OpenPecha / Toolkit

🛠 Tools to create, edit and export texts and annotations

annotations corpus-tools layered-text

Updated Nov 5, 2025
Python

Xenios91 / BinCorp-Generator

Analyzes binary executables and can generate a test corpus for defined instruction paths, each discovered function, or it can generate a test corpus to reach every basic block detected in non library/shared object parts of the bin's text section.

binary fuzzing binary-analysis corpus-tools

Updated Jun 17, 2025
Python

adbar / simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

nlp tokenizer language-detection wordlist lemmatizer morphological-analysis lemmatiser tokenization lemmatization corpus-tools language-identification low-resource-nlp

Updated Jun 6, 2025
Python
