-
htmldate Public
Fast and robust date extraction from web pages, with Python or on the command-line
-
trafilatura Public
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
-
simplemma Public
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
-
courlan Public
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
-
rust-primes Public
Rust code and command-line utility to compute and visualize prime sequences
-
lichess-bot Public
Forked from lichess-bot-devs/lichess-bot
A bridge between Lichess bots and chess engines
Python · GNU Affero General Public License v3.0 · Updated Jan 15, 2025
-
python-chess Public
Forked from niklasf/python-chess
A chess library for Python, with move generation and validation, PGN parsing and writing, Polyglot opening book reading, Gaviota tablebase probing, Syzygy tablebase probing, and UCI/XBoard engine c…
Python · GNU General Public License v3.0 · Updated Jan 12, 2025
-
haystack-integrations Public
Forked from deepset-ai/haystack-integrations
A list of Haystack Integrations, maintained by the community or deepset.
Updated Jan 7, 2025
-
py3langid Public
Forked from saffsd/langid.py
Faster, modernized fork of the language identification tool langid.py
-
German-NLP Public
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
-
awesome-digital-humanities Public
Forked from dh-tech/awesome-digital-humanities
Software for humanities scholars using quantitative or computational methods.
-
awesome-web-scraping Public
Forked from lorien/awesome-web-scraping
List of libraries, tools and APIs for web scraping and data processing.
-
coronakorpus Public archive
Material for building a German-language web corpus dedicated to COVID-19
-
btw21 Public
Forked from jfilter/btw21
Visualization of the most frequent words in the German federal election in 2021
Jupyter Notebook · MIT License · Updated Sep 24, 2021
-
awesome-crawler Public
Forked from BruceDone/awesome-crawler
A collection of awesome web crawlers and spiders in different languages
-
jparser Public
Forked from fxsjy/jparser
A readability parser that can extract the title, content, and images from HTML pages
-
jlcl-style Public archive
Experiments to modernize the LaTeX class of the JLCL
-
geokelone Public
Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation, and visualization
-
toponyms Public
Old prototype for toponym extraction in historical texts written in German
-
vardial-experiments Public
Experiments conducted on the occasion of the VarDial shared tasks
-
valency-oriented-chunker Public
A one-pass FSA valency-oriented chunker for German (proof of concept)
Perl · GNU Lesser General Public License v3.0 · Updated Oct 14, 2016
-
flux-toolchain Public
Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
-
equipe-crawler Public archive
Automatically exported from code.google.com/p/equipe-crawler
Perl · Updated Jul 3, 2015
-
gps-corpus-builder Public archive
Automatically exported from code.google.com/p/gps-corpus-builder
Perl · Updated Jul 3, 2015
-
zeitcrawler Public archive
Automatically exported from code.google.com/p/zeitcrawler
-
laclos Public
LAnguage-CLassified OpenSubtitles