A very simple news crawler with a funny name
Updated Nov 10, 2025 - Python
Companion website for "Corpus Approaches to Language in Social Media" - source and build versions
Yet another search platform for linguistic corpora.
Scripts for building a geo-located web corpus using Common Crawl data
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
An advanced, extensible web front-end for the Manatee-open corpus search engine
A fast database for UD dependency relations between lemmas
General Missives in Text-Fabric
XML annotation tools
Automated text preprocessing pipeline for large corpora. Features customizable filters for diacritics, stop words, punctuation, and regex.
OpusFilter - Parallel corpus processing toolkit
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Rezonator: Dynamics of human engagement
Web-based database for sign language lexicons and corpora. Fork of NGT-signbank (https://github.com/Signbank/Global-signbank).
N8N OCR workflow
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
🛠 Tools to create, edit and export texts and annotations
Analyzes binary executables and generates a test corpus for defined instruction paths, for each discovered function, or to reach every basic block detected in the parts of the binary's text section that do not belong to libraries or shared objects.
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency