Skip to content

commoncrawl/cc-nutch-example

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Example Usage of Common Crawl's Fork of Apache Nutch to Crawl and Write WARC files

A short description how to set up Common Crawl's Fork of Apache Nutch for crawling and to store the crawled content in WARC files.

Requirements and installation

sudo apt install libcld2-0 libcld2-dev ant maven

Compile Nutch and required projects

git clone git@github.com:crawler-commons/crawler-commons.git
cd crawler-commons/
mvn install
cd ..

git clone git@github.com:commoncrawl/language-detection-cld2.git
cd language-detection-cld2/
mvn install
cd ..

git clone https://github.com/commoncrawl/nutch.git nutch-cc
cd nutch-cc/
ant runtime
cd ..

Configuration

Go to the project root folder nutch-cc and edit the files in the folder conf/ esp. conf/nutch-site.xml. But also the URL filter configuration files may require to be adapted to your use case.

Notes:

  • it's required to configure at least the property http.agent.name in the file conf/nutch-site.xml
  • if the configuration is changed Nutch needs to be recompiled because configuration files are contained in the job file (runtime/local/apache-nutch-*.job)

Run crawl

echo -e "https://nutch.apache.org/\tnutch.score=1.0" >urls.txt

./crawl.sh crawl 3 urls.txt

About

Apache Nutch example project to archive content in WARC files

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages