- Common Crawl Foundation
- Palo Alto, CA
- https://www.linkedin.com/in/greglindahl
- https://orcid.org/0000-0002-6100-4772
Stars
A Jupyter notebook illustrating the basics of Common Crawl's datasets.
CC signals is a framework for a simple pact between those stewarding data, and those reusing it for AI development. CC signals provide a set of shared ground rules for an AI ecosystem that is mutua…
A pure Linux Bash script for blocking IP ranges by Autonomous System Number
A collaborative catalog of NLP resources for Indic languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code
Open source project for data preparation for GenAI applications
Burrow is a globally distributed HTTP proxy via AWS Lambda
Code for collecting, processing, and preparing datasets for the Common Pile
Prototype scripts with easy-to-edit variables for producing different outputs. The example searches one crawl for all .co.uk websites geolocated by postcode to the Bristol area.
Quantifying the Commons: measure the size and diversity of the commons--the collection of works that are openly licensed or in the public domain
A whirlwind tour of Common Crawl's data using Python
Java library for reading and writing WARC files with a typed API
A polite and user-friendly downloader for Common Crawl data
A tool for detecting viruses and NSFW material in WARC files
How Media Cloud approaches extracting metadata from online news stories
dogancanbakir / soft-404
Forked from TeamHG-Memex/soft404
A classifier for detecting soft 404 pages
A modern and functional monospaced typeface with a focus on legibility.
A simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system.
Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts
Library for the Streaming Protocol for Exchange of Astronomical Data (SPEAD)
A dark and sleek Emacs setup for general purpose editing and programming
Fake English word generator for JavaScript/TypeScript
ajvazquez / CXS
Forked from MITHaystack/CorrelX
CXS: a high performance VLBI correlator written in Python, based on Apache Spark
Transparent proxy server that works as a poor man's VPN. Forwards over ssh. Doesn't require admin. Works with Linux and MacOS. Supports DNS tunneling.
App that explores various array choices using a cheap-imaging algorithm.