Based on the official Apache Nutch release (the current Nutch version is 2.3.1).
- alpine, latest (Dockerfile)
- Nutch 2.3.1
- OpenJDK 8
- Gora 0.6.1
- Gora MongoDB 0.6.1
Use the docker-compose.yml file to run MongoDB and Apache Nutch:
docker-compose up -d
docker-compose logs -f nutch
- Create your own Dockerfile:
FROM pure/nutch-mongo:alpine
ADD urls/ /urls/
ADD conf/ /nutch/conf/
- Create your own configuration files and seed list
- urls/seeds.txt - file containing URLs to crawl
- conf/gora.properties - set MongoDB credentials and database name
- conf/nutch-site.xml - tune your own crawler parameters
- conf/regex-urlfilter.txt - set regular expressions for the URLs to crawl
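A minimal sketch of this step, which creates the seed list, a URL filter, and a Gora properties file next to the Dockerfile. Everything specific here is an assumption for illustration: the example.org URLs, the MongoDB host name `mongo` (presumed to match the service name in docker-compose.yml), the database name `nutch`, and the credentials. The Gora property names are given as I recall them from the Gora MongoDB store documentation, so check them against the configuration shipped in the image before relying on them.

```shell
#!/bin/sh
# Sketch only: all hosts, URLs, and credentials below are placeholders.
set -e

mkdir -p urls conf

# Seed list: one URL per line.
cat > urls/seeds.txt <<'EOF'
https://example.org/
EOF

# URL filter: rules are tried top to bottom, the first match decides.
# '+' accepts a URL, '-' rejects it.
cat > conf/regex-urlfilter.txt <<'EOF'
# accept pages on example.org and its subdomains
+^https?://([a-z0-9-]+\.)*example\.org/
# reject everything else
-.
EOF

# Gora settings for the MongoDB store (assumed property names and values).
cat > conf/gora.properties <<'EOF'
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.servers=mongo:27017
gora.mongodb.db=nutch
gora.mongodb.login=nutch_user
gora.mongodb.secret=change_me
EOF
```

With these files in place, the ADD lines in the Dockerfile copy them into the image at build time.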
- Build your own docker image:
docker build -t my-nutch .
- Run your own Nutch with desired count of iterations:
docker run \
-d \
-e ITERATIONS=5 \
--name my-crawler \
my-nutch
- Check logs:
docker logs -f my-crawler