Code to extract a small amount of ENGLISH Common Crawl data into JSON files with the corresponding URLs and IDs.
Developed by Sean McLeish (University of Maryland)
NOTE: This code is designed for extracting a small amount of data to play with, not for downloading and processing entire crawls.
Go to the Common Crawl Blog and pick your favourite crawl; the examples given are for November/December 2023.
Click on the associated links for `warc.paths.gz`, `wat.paths.gz` and `wet.paths.gz` to download them, then place them in the folder you plan to develop in.
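For reference, each `*.paths.gz` file is just a gzipped text file listing one relative path per line; prefixing a path with `https://data.commoncrawl.org/` gives a downloadable URL. A minimal sketch of inspecting one in Python, assuming `warc.paths.gz` sits in your working directory:

```python
import gzip

# Each *.paths.gz file lists one relative path per line, e.g.
# "crawl-data/CC-MAIN-2023-50/segments/.../warc/....warc.gz".
with gzip.open("warc.paths.gz", "rt") as f:
    paths = [line.strip() for line in f if line.strip()]

print(f"{len(paths)} WARC files listed")
print("First URL:", "https://data.commoncrawl.org/" + paths[0])
```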
Install warcio for reading the files: `$ pip install warcio==1.7.4`. I developed this in Python 3.10.4.
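If you want to poke at a downloaded WARC or WET file directly, warcio's `ArchiveIterator` is the relevant entry point. A minimal sketch (the filename here is just a placeholder, not something this repo produces):

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate over records in a (possibly gzipped) WARC or WET file and pull out
# the target URL, record ID, and payload. "response" records come from WARC
# files; "conversion" records hold the extracted text in WET files.
with open("example.warc.gz", "rb") as stream:  # placeholder filename
    for record in ArchiveIterator(stream):
        if record.rec_type in ("response", "conversion"):
            url = record.rec_headers.get_header("WARC-Target-URI")
            rec_id = record.rec_headers.get_header("WARC-Record-ID")
            payload = record.content_stream().read()
            print(url, rec_id, len(payload))
```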
Run `$ python download.py --path <YOUR PATH> --max_files <NUMBER OF FILES>`. The available flags are:
- `--warc`: only process the WARC files
- `--wet`: only process the WET files
- `--delete_after`: automatically delete the temporary files afterwards
- `--path` (REQUIRED): path to the directory of the `warc.paths.gz` files
- `--max_files` (REQUIRED): maximum number of files to download
- `--offset`: sample files with this offset from the interval, e.g. if `offset=1`, instead of sampling files 0, 100, 200, ... we sample 1, 101, 201, ...
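For context, here is a sketch of how these flags could be declared with argparse and how the offset-based sampling works. This is an illustration under assumed internals (the sampling interval of 100 is taken from the example above), not necessarily how `download.py` is written:

```python
import argparse

# Hypothetical declaration of the flags documented above.
parser = argparse.ArgumentParser(description="Sample a few Common Crawl files.")
parser.add_argument("--path", required=True, help="directory containing the *.paths.gz files")
parser.add_argument("--max_files", type=int, required=True, help="maximum number of files to download")
parser.add_argument("--warc", action="store_true", help="only process the WARC files")
parser.add_argument("--wet", action="store_true", help="only process the WET files")
parser.add_argument("--delete_after", action="store_true", help="delete temporary files when done")
parser.add_argument("--offset", type=int, default=0, help="offset applied to the sampling interval")
args = parser.parse_args()

# Offset-based sampling: with an (assumed) interval of 100, offset=0 picks
# files 0, 100, 200, ... while offset=1 picks 1, 101, 201, ...
step = 100  # hypothetical sampling interval
indices = [args.offset + i * step for i in range(args.max_files)]
```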
You now have two outputs, `warc_json_data` and `wet_json_data`, containing a small sample of Common Crawl data to play with.
`combined_json_data` merges `warc_json_data` and `wet_json_data`, keeping only the files common to both to create a combined set of JSON files.
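A rough sketch of what that merge step could look like, assuming `warc_json_data` and `wet_json_data` are directories of identically named JSON files (the actual layout produced by `download.py` may differ):

```python
import json
import os

# Keep only the JSON files whose names appear in both input directories and
# write a merged record for each into combined_json_data.
warc_dir, wet_dir, out_dir = "warc_json_data", "wet_json_data", "combined_json_data"
os.makedirs(out_dir, exist_ok=True)

common = set(os.listdir(warc_dir)) & set(os.listdir(wet_dir))
for name in sorted(common):
    with open(os.path.join(warc_dir, name)) as f:
        warc_record = json.load(f)
    with open(os.path.join(wet_dir, name)) as f:
        wet_record = json.load(f)
    with open(os.path.join(out_dir, name), "w") as f:
        json.dump({"warc": warc_record, "wet": wet_record}, f)
```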
Please open pull requests and issues to add features or ask questions.
A lot of this work is based on this blog post.