Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Updated Sep 12, 2025 - Python
A very simple news crawler with a funny name
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Bitextor generates translation memories from multilingual websites
Python library for handling audio datasets.
OpusFilter - Parallel corpus processing toolkit
Yet another search platform for linguistic corpora.
Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Utilities for Processing the Switchboard Dialogue Act Corpus
A set of workflows for corpus building through OCR, post-correction and normalisation
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework
Text data analysis (Text-Analysis)
A parser for annotated MuseScore 3 files.
🛠 Tools to create, edit and export texts and annotations
Web-based database for sign language lexicons and corpora. Fork of NGT-signbank (https://github.com/Signbank/Global-signbank).
Utilities for Processing the HCRC Map Task Corpus
Searching in-memory corpus with Corpus Query Language (CQL)
Library for Python to use Korp API