A short description of how to set up Common Crawl's fork of Apache Nutch for crawling and storing the crawled content in WARC files.
- Linux (tested on Ubuntu 22.04)
- Java 11 (higher Java versions should also work)
- ant and maven
- Compact Language Detector 2
```
sudo apt install libcld2-0 libcld2-dev ant maven

git clone git@github.com:crawler-commons/crawler-commons.git
cd crawler-commons/
mvn install
cd ..

git clone git@github.com:commoncrawl/language-detection-cld2.git
cd language-detection-cld2/
mvn install
cd ..

git clone https://github.com/commoncrawl/nutch.git nutch-cc
cd nutch-cc/
ant runtime
cd ..
```

Go to the project root folder `nutch-cc` and edit the files in the `conf/` folder, especially `conf/nutch-site.xml`. The URL filter configuration files may also need to be adapted to your use case.
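As an illustration of such an adaptation, the crawl could be restricted to a single host via Nutch's regex URL filter (`conf/regex-urlfilter.txt`, rules are applied top to bottom, first match wins). The host below is a hypothetical example, not part of the shipped configuration:

```
# accept only URLs on nutch.apache.org (example restriction)
+^https?://nutch\.apache\.org/

# reject everything else (replaces the default catch-all "+.")
-.
```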
Notes:
- it is required to configure at least the property `http.agent.name` in the file `conf/nutch-site.xml`
- if the configuration is changed, Nutch needs to be recompiled, because the configuration files are contained in the job file (`runtime/local/apache-nutch-*.job`)
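A minimal `conf/nutch-site.xml` satisfying the first note might look as follows; the agent name is a placeholder you should replace with a value identifying your crawler:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- mandatory: identifies your crawler to web servers (placeholder value) -->
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
</configuration>
```

Per the second note, rerun `ant runtime` after editing this file so the change is picked up in the job file.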
```
echo -e "https://nutch.apache.org/\tnutch.score=1.0" >urls.txt
./crawl.sh crawl 3 urls.txt
```
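The seed file can hold multiple URLs, one per line, each optionally followed by tab-separated metadata such as `nutch.score` (the same format as the single-URL example above). A small sketch; the second seed URL is an illustrative addition, not required:

```shell
# Seed list: one URL per line, optional tab-separated nutch.score
# metadata setting the initial page score.
printf 'https://nutch.apache.org/\tnutch.score=1.0\n' > urls.txt
# hypothetical second seed with a lower initial score
printf 'https://commoncrawl.org/\tnutch.score=0.5\n' >> urls.txt

# Run 3 crawl rounds over these seeds (same invocation as above):
# ./crawl.sh crawl 3 urls.txt
```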