Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Updated Sep 12, 2025 - Python
A very simple news crawler with a funny name
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Bitextor generates translation memories from multilingual websites
Python library for handling audio datasets.
OpusFilter - Parallel corpus processing toolkit
An advanced, extensible web front-end for the Manatee-open corpus search engine
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Yet another search platform for linguistic corpora.
Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Utilities for Processing the Switchboard Dialogue Act Corpus
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
A set of workflows for corpus building through OCR, post-correction and normalisation
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Reading the data from OPIEC - an Open Information Extraction corpus
Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework
Software for multi-level annotation of linguistic corpora
Text data analysis (Text-Analysis)