A short description of how to set up Common Crawl's fork of Apache Nutch for crawling and storing the crawled content in WARC files.
- Linux (tested on Ubuntu 22.04)
- Java 11 (higher Java versions should also work)
- ant and maven
- Compact Language Detector 2
```
sudo apt install libcld2-0 libcld2-dev ant maven

git clone git@github.com:crawler-commons/crawler-commons.git
cd crawler-commons/
mvn install
cd ..

git clone git@github.com:commoncrawl/language-detection-cld2.git
cd language-detection-cld2/
mvn install
cd ..

git clone https://github.com/commoncrawl/nutch.git nutch-cc
cd nutch-cc/
ant runtime
cd ..
```

Go to the project root folder `nutch-cc` and edit the files in the `conf/` folder, especially `conf/nutch-site.xml`. The URL filter configuration files may also need to be adapted to your use case.
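As an illustration of such an adaptation, the crawl could be restricted to a single host via Nutch's regex URL filter (`conf/regex-urlfilter.txt`, rules are applied top to bottom, first match wins). The host below is a hypothetical example, not part of the shipped configuration:

```
# accept only URLs on nutch.apache.org (example restriction)
+^https?://nutch\.apache\.org/

# reject everything else (replaces the default catch-all "+.")
-.
```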
Notes:
- it is required to configure at least the property `http.agent.name` in the file `conf/nutch-site.xml`
- if the configuration is changed, Nutch needs to be recompiled, because the configuration files are contained in the job file (`runtime/local/apache-nutch-*.job`)
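A minimal `conf/nutch-site.xml` satisfying the first note might look as follows; the agent name is a placeholder you should replace with a value identifying your crawler:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- mandatory: identifies your crawler to web servers (placeholder value) -->
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
</configuration>
```

Per the second note, rerun `ant runtime` after editing this file so the change is picked up in the job file.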
```
echo -e "https://nutch.apache.org/\tnutch.score=1.0" >urls.txt
./crawl.sh crawl 3 urls.txt
```
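The seed file can hold multiple URLs, one per line, each optionally followed by tab-separated metadata such as `nutch.score` (the same format as the single-URL example above). A small sketch; the second seed URL is an illustrative addition, not required:

```shell
# Seed list: one URL per line, optional tab-separated nutch.score
# metadata setting the initial page score.
printf 'https://nutch.apache.org/\tnutch.score=1.0\n' > urls.txt
# hypothetical second seed with a lower initial score
printf 'https://commoncrawl.org/\tnutch.score=0.5\n' >> urls.txt

# Run 3 crawl rounds over these seeds (same invocation as above):
# ./crawl.sh crawl 3 urls.txt
```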