Code to extract a small amount of ENGLISH Common Crawl data into JSON files with the corresponding URLs and IDs.
Developed by Sean McLeish (University of Maryland)
NOTE: This code is designed for extracting a small amount of data to play with, not for downloading and processing entire crawls.
Go to the Common Crawl Blog and pick your favourite crawl; the examples given are for November/December 2023.
Click on the associated links for `warc.paths.gz`, `wat.paths.gz` and `wet.paths.gz` to download them, then place them in the folder you plan to develop in.
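For reference, each `*.paths.gz` file is just a gzipped text file listing one relative path per line; prefixing a path with `https://data.commoncrawl.org/` gives a downloadable URL. A minimal sketch of inspecting one in Python, assuming `warc.paths.gz` sits in your working directory:

```python
import gzip

# Each *.paths.gz file lists one relative path per line, e.g.
# "crawl-data/CC-MAIN-2023-50/segments/.../warc/....warc.gz".
with gzip.open("warc.paths.gz", "rt") as f:
    paths = [line.strip() for line in f if line.strip()]

print(f"{len(paths)} WARC files listed")
print("First URL:", "https://data.commoncrawl.org/" + paths[0])
```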
Install warcio for reading the files: `$ pip install warcio==1.7.4`. I developed this in Python 3.10.4.
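If you want to poke at a downloaded WARC or WET file directly, warcio's `ArchiveIterator` is the relevant entry point. A minimal sketch (the filename here is just a placeholder, not something this repo produces):

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate over records in a (possibly gzipped) WARC or WET file and pull out
# the target URL, record ID, and payload. "response" records come from WARC
# files; "conversion" records hold the extracted text in WET files.
with open("example.warc.gz", "rb") as stream:  # placeholder filename
    for record in ArchiveIterator(stream):
        if record.rec_type in ("response", "conversion"):
            url = record.rec_headers.get_header("WARC-Target-URI")
            rec_id = record.rec_headers.get_header("WARC-Record-ID")
            payload = record.content_stream().read()
            print(url, rec_id, len(payload))
```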
Run `$ python download.py --path <YOUR PATH> --max_files <NUMBER OF FILES>`. The available flags are:
- `--warc`: only process the WARC files
- `--wet`: only process the WET files
- `--delete_after`: automatically delete the temporary files afterwards
- `--path` (REQUIRED): path to the directory of the `warc.paths.gz` files
- `--max_files` (REQUIRED): maximum number of files to download
- `--offset`: sample files with this offset from the interval, e.g. if `offset=1`, instead of sampling files 0, 100, 200, ... we sample 1, 101, 201, ...
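For context, here is a sketch of how these flags could be declared with argparse and how the offset-based sampling works. This is an illustration under assumed internals (the sampling interval of 100 is taken from the example above), not necessarily how `download.py` is written:

```python
import argparse

# Hypothetical declaration of the flags documented above.
parser = argparse.ArgumentParser(description="Sample a few Common Crawl files.")
parser.add_argument("--path", required=True, help="directory containing the *.paths.gz files")
parser.add_argument("--max_files", type=int, required=True, help="maximum number of files to download")
parser.add_argument("--warc", action="store_true", help="only process the WARC files")
parser.add_argument("--wet", action="store_true", help="only process the WET files")
parser.add_argument("--delete_after", action="store_true", help="delete temporary files when done")
parser.add_argument("--offset", type=int, default=0, help="offset applied to the sampling interval")
args = parser.parse_args()

# Offset-based sampling: with an (assumed) interval of 100, offset=0 picks
# files 0, 100, 200, ... while offset=1 picks 1, 101, 201, ...
step = 100  # hypothetical sampling interval
indices = [args.offset + i * step for i in range(args.max_files)]
```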
You now have two outputs, `warc_json_data` and `wet_json_data`, containing a small sample of Common Crawl data to play with.
`combined_json_data` merges `warc_json_data` and `wet_json_data`, keeping only the files common to both to create a combined set of JSON files.
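A rough sketch of what that merge step could look like, assuming `warc_json_data` and `wet_json_data` are directories of identically named JSON files (the actual layout produced by `download.py` may differ):

```python
import json
import os

# Keep only the JSON files whose names appear in both input directories and
# write a merged record for each into combined_json_data.
warc_dir, wet_dir, out_dir = "warc_json_data", "wet_json_data", "combined_json_data"
os.makedirs(out_dir, exist_ok=True)

common = set(os.listdir(warc_dir)) & set(os.listdir(wet_dir))
for name in sorted(common):
    with open(os.path.join(warc_dir, name)) as f:
        warc_record = json.load(f)
    with open(os.path.join(wet_dir, name)) as f:
        wet_record = json.load(f)
    with open(os.path.join(out_dir, name), "w") as f:
        json.dump({"warc": warc_record, "wet": wet_record}, f)
```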
Please open pull requests and issues to add features or ask questions.
A lot of this work is based on this blog post.