common-crawl

Here are 46 public repositories matching this topic...

SanjamRaj10 / C_Strings

C string exercises cover reversal, length, swap, concatenation, frequency, case analysis, and substring handling, providing practical string manipulation examples for learners 🐙

fast json base64 string ascii bytes dataset sorting-algorithms beautifulsoup pattern-recognition sds vigenere ndjson math-parser substring math-parser-library common-crawl b64

Updated Dec 18, 2025
C

MigoXLab / dingo

Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool

Updated Dec 18, 2025
JavaScript

commoncrawl / cc-webgraph

Tools to construct and process Common Crawl webgraphs

pagerank webgraph commoncrawl common-crawl centrality-measures webgraph-framework

Updated Dec 17, 2025
Java

commoncrawl / cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

statistics commoncrawl common-crawl

Updated Dec 5, 2025
Python

commoncrawl / cc-notebooks

Various Jupyter notebooks about Common Crawl data

jupyter-notebook aws-athena commoncrawl common-crawl webarchiving webgraph-framework

Updated Nov 22, 2025
Jupyter Notebook

commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark

spark pyspark sparksql wet commoncrawl common-crawl warc-files wat-files

Updated Nov 13, 2025
Python

oscar-project / ungoliant

🕷️ The pipeline for the OSCAR corpus

nlp crawler corpus-linguistics fasttext oscar commoncrawl common-crawl language-classification

Updated Nov 9, 2025
Rust

AI-NOSUKE / JPCC-RANDOM-PICKER

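JPCC-RANDOM-PICKER: For people who want results as quickly as possible. Extracts keywords from JPCC using fast random sampling. Substantially faster, with statistically sufficient accuracy. For ordinary use cases, try this one first.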

python text-mining japanese s3 corpus research-tool data-collection boto3 keyword-extraction common-crawl jpcc

Updated Sep 30, 2025
Python

Dr-Istanbul / Project-Daily-Life

DailyLifeAI: Professional ML platform for life task automation using Common Crawl data, BERT models, and AWS infrastructure.

python nlp aws data-science machine-learning natural-language-processing ai data-engineering web-scraping bert common-crawl

Updated Sep 26, 2025
Python

AI-NOSUKE / JPCC-PICKER

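JPCC-PICKER: For researchers and perfectionists. Built for exhaustive keyword extraction from JPCC, for academic research and similar uses. Handles spelling variants and aims for 100% coverage with no missed matches. For research and surveys where nothing may be overlooked.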

python text-mining japanese s3 corpus research-tool data-collection boto3 keyword-extraction common-crawl jpcc

Updated Sep 23, 2025
Python

AI-NOSUKE / JPCC-RAPID-PICKER

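JPCC-RAPID-PICKER: For people who want a thorough search even if it takes time. The standard version for keyword extraction from JPCC. Scans the full dataset, optimized with byte-level regular expressions. For when you need more data or are worried about missed matches.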

python text-mining japanese s3 corpus research-tool data-collection boto3 keyword-extraction common-crawl jpcc

Updated Sep 23, 2025
Python

Jasper0077 / py-search-engine

A lightweight, proof-of-concept, vector-based search engine implementation that uses the Porter stemming algorithm to improve text preprocessing and search accuracy.

python3 text-processing porter-stemmer-algorithm common-crawl vector-search

Updated Jul 28, 2025
Python

alumik / common-crawl-downloader

Distributed download scripts for Common Crawl data

downloader common-crawl

Updated May 12, 2025
Python

mehrantsi / common-crawl-analyzer

Tools to extract and analyze domains and URLs from Common Crawl data files.

stemmer large-dataset common-crawl term-analysis term-frequency-inverse-document

Updated May 6, 2025
Python

cisnlp / GlotCC

🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024

crawler multlingual corpus-linguistics glot language-identification commoncrawl common-crawl glotcc multilingual-dataset glotlid

Updated Apr 6, 2025
Jupyter Notebook

oscar-project / oscar-website

The website of the Oscar Project

nlp website machine-learning hugo language-model common-crawl

Updated Mar 27, 2025
TeX

thunderpoot / cc-getpage

Lightweight Python utility for retrieving individual pages from the Common Crawl archives.

commoncrawl common-crawl common-crawl-with-python common-crawl-python common-crawl-data

Updated Mar 2, 2025
Python

commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC

crawler news web-crawler apache-storm warc commoncrawl common-crawl storm-crawler

Updated Feb 19, 2025
Java

ilyankou / cc-gpx

CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl

gpx hiking common-crawl

Updated Dec 5, 2024
Jupyter Notebook

crissyfield / troll-a

Drill into WARC web archives

security internet-archive command-line-tool warc security-tools common-crawl

Updated Oct 16, 2024
Go
