wumpus

Organizations

@commoncrawl @blekko @cocrawler @olympiag3 @rendance @eventhorizontelescope

Makefile 1 2 Updated Apr 21, 2026

A Jupyter notebook illustrating the basics of Common Crawl's datasets.

Jupyter Notebook 4 1 Updated Dec 31, 2025

CC signals is a framework for a simple pact between those stewarding data, and those reusing it for AI development. CC signals provide a set of shared ground rules for an AI ecosystem that is mutua…

123 25 Updated Dec 4, 2025

A pure Bash script for Linux that blocks IP ranges by Autonomous System Number

Shell 16 2 Updated Mar 11, 2026

A collaborative catalog of NLP resources for Indic languages

636 96 Updated Dec 14, 2024
Python 17 4 Updated Nov 26, 2024

Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

69 94 Updated Jun 13, 2026

Open source project for data preparation for GenAI applications

HTML 939 253 Updated Jun 4, 2026

Burrow is a globally distributed HTTP proxy via AWS Lambda

Go 241 9 Updated Dec 17, 2024

Code for collecting, processing, and preparing datasets for the Common Pile

Python 258 26 Updated Feb 11, 2026

Prototype scripts whose variables are easy to edit to produce different outputs. The example searches one crawl for all .co.uk websites geolocated by postcode to the Bristol area.

Python 5 2 Updated Apr 16, 2025

Quantifying the Commons: measure the size and diversity of the commons, the collection of works that are openly licensed or in the public domain

Python 48 73 Updated Jun 9, 2026

A whirlwind tour of Common Crawl's data using Python

Python 45 9 Updated Apr 13, 2026

Java library for reading and writing WARC files with a typed API

Java 59 18 Updated Apr 27, 2026

A polite and user-friendly downloader for Common Crawl data

Rust 84 6 Updated Jun 11, 2026

A tool for detecting viruses and NSFW material in WARC files

Python 18 1 Updated Jun 9, 2026

How Media Cloud approaches extracting metadata from online news stories

Python 17 6 Updated Apr 15, 2026

A classifier for detecting soft 404 pages

Jupyter Notebook 17 4 Updated Sep 10, 2022

A modern and functional monospaced typeface with a focus on legibility.

Shell 160 3 Updated Apr 22, 2026

A simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system.

Rust 5,710 247 Updated Jun 12, 2026

Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts

Python 7,190 418 Updated Jun 9, 2026

Library for the Streaming Protocol for Exchange of Astronomical Data (SPEAD)

C++ 27 16 Updated Jun 8, 2026

High Availability Shared Pipeline Engine

C 17 19 Updated Sep 15, 2023

VLBI Experiment File Parser and Writer

C 2 3 Updated Feb 7, 2025

A dark and sleek Emacs setup for general purpose editing and programming

Emacs Lisp 967 34 Updated Sep 9, 2024

This Word Does Not Exist

Python 1,020 84 Updated Jun 4, 2026

Fake English word generator for JavaScript/TypeScript

TypeScript 97 6 Updated Mar 6, 2023

CXS: a high performance VLBI correlator written in Python, based on Apache Spark

Python 4 1 Updated Jun 30, 2024

Transparent proxy server that works as a poor man's VPN. Forwards over SSH. Doesn't require admin. Works with Linux and macOS. Supports DNS tunneling.

Python 13,377 790 Updated Jun 13, 2026

App that explores various array choices using a cheap-imaging algorithm.

Python 3 1 Updated Oct 27, 2021