OpusFilter - Parallel corpus processing toolkit
Updated Nov 12, 2025 - Python
Yet another search platform for linguistic corpora.
A very simple news crawler with a funny name
Scripts for building a geo-located web corpus using Common Crawl data
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
XML annotation tools
Automated text preprocessing pipeline for large corpora. Features customizable filters for diacritics, stop words, punctuation, and regex.
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Web-based database for sign language lexicons and corpora. Fork of NGT-signbank (https://github.com/Signbank/Global-signbank).
🛠 Tools to create, edit and export texts and annotations
Analyzes binary executables and generates a test corpus for defined instruction paths, for each discovered function, or to reach every basic block detected in the non-library/shared-object parts of the binary's text section.
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.
A parser for annotated MuseScore 3 files.
A corpus server library with a document database backend.
An open-source collaborative web-based application for multi-task lexical normalisation
Searching in-memory corpus with Corpus Query Language (CQL)
This project is designed to help manage and analyze large corpora of text data. It provides tools for importing, processing, and querying text data efficiently.
Bitextor generates translation memories from multilingual websites