Dockerfile for fscrawler
Mostly inspired by elasticsearch's alpine dockerfile
Supported tags
- 2.2 with fscrawler version 2.2 and alpine 3.5
- 2.4 with fscrawler 2.4 and alpine 3.5
- 2.5-SNAPSHOT-ubuntu with fscrawler 2.5-SNAPSHOT and ubuntu 16.04 (built from dockerfile in ubuntu folder)
Dockerfile includes tesseract (via alpine 3.5)
PS: The Ubuntu image was added because the alpine image was failing on mvn clean install.
The error said that the initial heap size was larger than the max heap size, and I couldn't figure it out.
The alpine image was 308 MB, whereas the ubuntu image is 1.2 GB (although it also includes tesseract-fra).
It would probably be a good idea to get the alpine image to work.
The image is published on docker hub as shadiakiki1986/fscrawler.
To run it against an elasticsearch instance served locally at port 9200,
docker run -it --rm --name my-fscrawler \
-v <data folder>:/usr/share/fscrawler/data/:ro \
-v <config folder>:/usr/share/fscrawler/config-mount/<project-name>:ro \
shadiakiki1986/fscrawler \
[CLI options]
where
- data folder is the path to the folder with the files to index
- config folder is the path to the host fscrawler config dir
- if the config folder is not mounted from the host, the docker container will have an empty config folder, thus prompting the user for Y/N confirmation before creating the first project file
- CLI options are the fscrawler command-line options, documented in the fscrawler documentation
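As a concrete illustration of the placeholders above, here is a small sketch that assembles the command from a data folder, a config folder, and a project name. The helper name fscrawler_cmd is hypothetical and not part of this repo; it only prints the command so that it can be inspected before running, and it assumes paths without spaces.

```shell
# Hypothetical helper (not part of this repo): print the docker run command
# for a data folder, a config folder, and a project name.
# Any further arguments are passed through as fscrawler CLI options.
fscrawler_cmd() {
  data_dir=$1
  config_dir=$2
  project=$3
  shift 3
  printf 'docker run -it --rm --name my-fscrawler'
  printf ' -v %s:/usr/share/fscrawler/data/:ro' "$data_dir"
  printf ' -v %s:/usr/share/fscrawler/config-mount/%s:ro' "$config_dir" "$project"
  printf ' shadiakiki1986/fscrawler'
  if [ "$#" -gt 0 ]; then
    printf ' %s' "$*"
  fi
  printf '\n'
}
```

For example, fscrawler_cmd "$PWD/test/data" "$PWD/config/myjob" myjob --loop 1 prints the full command for the myjob project used in the examples further down.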
Using docker-compose, start up elasticsearch and run fscrawler on files in test/data every 15 minutes:
docker-compose up elasticsearch1 fscrawler
For the remaining examples, the default config depends on having a running elasticsearch instance on localhost at port 9200. Start one with:
# [Ref](https://github.com/docker-library/elasticsearch/issues/111)
sudo sysctl -w vm.max_map_count=262144
docker-compose run -p 9200:9200 -d elasticsearch1
For the versions of the docker-compose file, docker-compose, and docker, check the travis builds.
Notice that the docker-compose fscrawler service is wired to wait for a healthcheck in elasticsearch.
In the case of a manual launch of elasticsearch:
- wait for around 15 seconds,
- or watch the logs,
- or check http://$host:9200/_cat/health?h=status where you need to wait for yellow or green, depending on your application
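The last option can be scripted. The following is a minimal sketch, not the healthcheck used by the docker-compose file: it polls the health endpoint until the status is good enough or the retries run out. The function names, the retry count, and the SLEEP_SECS variable are assumptions made for illustration, and curl is assumed to be installed.

```shell
# Return 0 when the reported status satisfies the required one.
# "green" satisfies both requirements; "yellow" satisfies only "yellow".
status_ok() {
  case "$2:$1" in
    yellow:yellow|yellow:green|green:green) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch the cluster status ("green", "yellow" or "red") from the given host.
fetch_status() {
  curl -fsS "http://$1:9200/_cat/health?h=status" 2>/dev/null | tr -d '[:space:]'
}

# Poll until the status is good enough, or give up after a number of tries.
# Arguments: host (default localhost), required status (default yellow),
# number of tries (default 30).
wait_for_es() {
  host=${1:-localhost}
  required=${2:-yellow}
  tries=${3:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(fetch_status "$host" || true)
    if status_ok "$status" "$required"; then
      echo "elasticsearch is $status"
      return 0
    fi
    i=$((i + 1))
    sleep "${SLEEP_SECS:-2}"
  done
  echo "gave up waiting for elasticsearch" >&2
  return 1
}
```

Calling wait_for_es localhost green before the docker run examples blocks until the cluster reports green; use yellow if that is enough for your application.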
To index the test files provided in this repo
docker run -it --rm \
--net="host" \
--name my-fscrawler \
-v $PWD/test/data/:/usr/share/fscrawler/data/:ro \
shadiakiki1986/fscrawler
Same example as above, but with loop=1 to run it only once
docker run -it --rm \
--net="host" \
--name my-fscrawler \
-v $PWD/test/data/:/usr/share/fscrawler/data/:ro \
-v $PWD/config/myjob:/usr/share/fscrawler/config-mount/myjob:ro \
shadiakiki1986/fscrawler \
--config_dir /usr/share/fscrawler/config \
--loop 1 \
--trace \
myjob
To build the docker image
git clone https://github.com/shadiakiki1986/docker-fscrawler
cd docker-fscrawler
docker build -t shadiakiki1986/fscrawler:local .
To test against elasticsearch locally, follow steps in .travis.yml
To update fscrawler in this docker container:
- update the version number used in Dockerfile
  - also update the URL to the maven zip file to download
- try to build, e.g. docker build -t shadiakiki1986/fscrawler:2.4 .
- try to run
- commit, tag, push
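The build, commit, tag, and push steps above can be sketched as a small script. This is only an illustration under assumptions: the function names, the commit message, and the use of the bare version number as the git tag are hypothetical, and the "try to run" check is left as a manual step. With DRY_RUN=1 the commands are printed instead of executed.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical release helper: build the image for the given version,
# then commit, tag, and push. Stops at the first failing step.
# Assumes the Dockerfile was already edited by hand.
release() {
  version=$1
  run docker build -t "shadiakiki1986/fscrawler:$version" . &&
    run git commit -am "update fscrawler to $version" &&
    run git tag "$version" &&
    run git push --tags origin HEAD
}
```

Running DRY_RUN=1 release 2.4 first shows what would be executed before doing it for real.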
To update the automated build on hub.docker.com
- the "latest" tag will get re-built automatically with the push above
- to add a new version tag, go to build settings and add it manually, then click save and trigger
To update elasticsearch in the docker-compose for the purpose of testing (e.g. .travis.yml)
- edit build/elasticsearch/Dockerfile by changing the FROM image
- follow steps in .travis.yml
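Changing the FROM image can be done by hand or with a one-liner. The sketch below is an illustration only: the helper name set_from_image is hypothetical, and the image name in the usage note is an assumption about what the Dockerfile points to. It rewrites every line starting with FROM, so it assumes a single-stage Dockerfile and an image name without "|" or "&" characters.

```shell
# Hypothetical helper: replace the FROM line of a Dockerfile with a new image.
# Arguments: path to the Dockerfile, new image reference.
set_from_image() {
  dockerfile=$1
  image=$2
  tmp=$(mktemp)
  # Write the edited copy to a temp file, then copy it back in place
  # so that the original file keeps its permissions.
  sed "s|^FROM .*|FROM $image|" "$dockerfile" > "$tmp" &&
    cat "$tmp" > "$dockerfile"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

For example, set_from_image build/elasticsearch/Dockerfile docker.elastic.co/elasticsearch/elasticsearch:6.1.1 would point the test service at elasticsearch 6.1.1, assuming that registry path is the one in use.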
Version 2.4 (2017-12-27)
- update fscrawler from 2.2 to 2.4
- use config-mount for mounting config folder into fscrawler docker container
- update elasticsearch service from 5.1.2 to 6.1.1
  - elasticsearch 5.1.2 was not working with fscrawler 2.4 anyway because of dadoonet/fscrawler#472
- replace git submodule of my fork of elasticsearch-docker with just build/elasticsearch/Dockerfile
  - the purpose of the fork was to push healthchecks into upstream, but my PR was rejected
  - fork was at https://github.com/shadiakiki1986/elasticsearch-docker
  - PR was at elastic/elasticsearch-docker#27
  - argumentation at elastic/elasticsearch-docker#60
  - proposed solution of just using docker-compose healthcheck would be too long in order to wait for "green" status
Version 2.2 (2017-02-22)
- use fscrawler 2.2
LAST PUSH: 04.07.2019@15:33