A simple web crawler built in Go (powered by goroutines).
- Run `go get -v` to download the dependencies
- Run `go build -o webcrawler` to build the binary
- Run `./webcrawler -baseurl https://golang.org -max-depth 2`
This will start the crawler and generate two files:
- `url-tree.txt`, which shows the links between pages. The default file name can be changed with the `-tree-file-name` flag. (You can disable tree generation by setting the `-show-tree` flag to `false`.)
- `sitemap.xml`, which contains the sitemap in XML format.
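For reference, declarations along the following lines (using the standard library `flag` package) would correspond to the flags described above. The default values shown here are assumptions for illustration and may not match the project's actual defaults.

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical declarations for the flags described above; the defaults
// are assumptions for illustration, not necessarily the project's values.
var (
	baseURL      = flag.String("baseurl", "https://golang.org", "base URL to start crawling from")
	maxDepth     = flag.Int("max-depth", 2, "maximum crawl depth")
	treeFileName = flag.String("tree-file-name", "url-tree.txt", "file the URL tree is written to")
	showTree     = flag.Bool("show-tree", true, "whether to generate the URL tree file")
)

func main() {
	flag.Parse()
	fmt.Printf("crawling %s to depth %d (tree file: %s, tree enabled: %v)\n",
		*baseURL, *maxDepth, *treeFileName, *showTree)
	// ... the crawler itself would be started here.
}
```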
Run `go test -v ./...` to run the tests.

Possible improvements:
- Command line flags: the flag handling could be improved by using https://github.com/spf13/cobra (a rough cobra sketch follows this list).
- Configuration: it would be nice to have some configuration management (this could be done with https://github.com/spf13/viper).
- Performance: all the goroutines currently write to a single shared state. Performance might improve if channels were used instead, although this would have to be benchmarked to measure the actual improvement (a channel-based sketch follows this list).
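As a rough sketch of the first item, the command line handling could be restructured around cobra roughly as follows. The command name, flags, and wiring here are illustrative assumptions, not the repository's current code.

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	// Hypothetical cobra-based CLI; names and defaults are illustrative only.
	rootCmd := &cobra.Command{
		Use:   "webcrawler",
		Short: "A simple concurrent web crawler",
		RunE: func(cmd *cobra.Command, args []string) error {
			baseURL, err := cmd.Flags().GetString("baseurl")
			if err != nil {
				return err
			}
			maxDepth, err := cmd.Flags().GetInt("max-depth")
			if err != nil {
				return err
			}
			fmt.Printf("crawling %s to depth %d\n", baseURL, maxDepth)
			// ... invoke the crawler here.
			return nil
		},
	}
	rootCmd.Flags().String("baseurl", "https://golang.org", "base URL to crawl")
	rootCmd.Flags().Int("max-depth", 2, "maximum crawl depth")

	if err := rootCmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```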
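For the performance item, one common pattern is to have worker goroutines send their results over a channel to a single collector goroutine that owns the state, so no mutex is needed. The sketch below shows the idea with faked data; whether it actually beats the current shared-state approach would have to be benchmarked.

```go
package main

import (
	"fmt"
	"sync"
)

// result is a hypothetical record produced by one crawl worker.
type result struct {
	url   string
	links []string
}

func main() {
	urls := []string{"https://golang.org/a", "https://golang.org/b"}

	results := make(chan result)
	var wg sync.WaitGroup

	// Workers only send on the channel; they never touch shared maps.
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			// ... fetch and parse u here; the links are faked for this sketch.
			results <- result{url: u, links: []string{u + "/child"}}
		}(u)
	}

	// Close the channel once all workers are done.
	go func() {
		wg.Wait()
		close(results)
	}()

	// A single goroutine owns the state, so no locking is needed.
	tree := make(map[string][]string)
	for r := range results {
		tree[r.url] = r.links
	}
	fmt.Println(tree)
}
```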
Please note that the generated tree shows the children of any given node only once. For example, if the crawled pages link to each other as
foo
|-bar
   |-lorem
      |-ipsum
|-lorem
   |-ipsum
the generated tree would look like
foo
|-bar
   |-lorem
      |-ipsum
|-lorem
That is, once a node's children have been printed, later occurrences of that node in the tree are not expanded again.
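A minimal sketch of how this deduplicated printing can be implemented (the project's actual tree writer may differ): a node's children are expanded only the first time the node is encountered.

```go
package main

import (
	"fmt"
	"strings"
)

// printTree prints a URL tree, expanding the children of each node only
// the first time that node is encountered.
func printTree(tree map[string][]string, node string, depth int, seen map[string]bool) {
	if depth == 0 {
		fmt.Println(node)
	} else {
		fmt.Printf("%s|-%s\n", strings.Repeat("   ", depth-1), node)
	}
	if seen[node] {
		return // this node's children were already shown earlier
	}
	seen[node] = true
	for _, child := range tree[node] {
		printTree(tree, child, depth+1, seen)
	}
}

func main() {
	// The example structure from above: foo links to bar and lorem,
	// bar links to lorem, and lorem links to ipsum.
	tree := map[string][]string{
		"foo":   {"bar", "lorem"},
		"bar":   {"lorem"},
		"lorem": {"ipsum"},
	}
	printTree(tree, "foo", 0, map[string]bool{})
}
```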