Spidy

A simple web crawler in Java.

The GUI is made using Swing.

The user enters a URL, a depth, a list of target words and a destination folder to save the results.

The crawler performs a breadth-first search (BFS) up to the given depth and downloads a web page if its URL contains any of the target words. Each downloaded page is saved in the corresponding folder.
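The depth-limited BFS described above can be sketched as follows. This is a minimal illustration, not Spidy's actual source: the class name `BfsCrawlSketch`, the `linkExtractor` parameter (a stand-in for the HTML parsing that Jaunt performs), and the choices to treat the start URL as depth 0 and to match words case-insensitively are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a depth-limited BFS over links; not the project's real class.
final class BfsCrawlSketch {

    /**
     * Returns, in BFS order, every URL within maxDepth link hops of the seed
     * whose text contains at least one target word. The seed counts as depth 0
     * (an assumption). linkExtractor maps a URL to the links found on that page.
     */
    static List<String> crawl(String seed, int maxDepth, List<String> targetWords,
                              Function<String, List<String>> linkExtractor) {
        List<String> matches = new ArrayList<>();
        Set<String> visited = new HashSet<>();   // prevents visiting a URL twice
        Queue<String> frontier = new ArrayDeque<>();
        visited.add(seed);
        frontier.add(seed);

        // Process one BFS level per iteration so the depth is always known.
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Queue<String> next = new ArrayDeque<>();
            for (String url : frontier) {
                if (containsAny(url, targetWords)) {
                    matches.add(url);            // the real crawler would download here
                }
                if (depth == maxDepth) {
                    continue;                    // do not expand past the depth limit
                }
                for (String link : linkExtractor.apply(url)) {
                    if (visited.add(link)) {     // add() returns false if already seen
                        next.add(link);
                    }
                }
            }
            frontier = next;
        }
        return matches;
    }

    /** Case-insensitive substring check (case-insensitivity is an assumption). */
    static boolean containsAny(String url, List<String> words) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String word : words) {
            if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

Passing the link extraction in as a function keeps the traversal logic testable without network access; in a real run it would wrap whatever HTML parser is in use.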

The Jaunt API is used for HTML parsing and for downloading a complete web page along with all its resources (scripts, images, etc.).

For example, to crawl the website of the Times of India, enter "http://www.timesofindia.com" as the URL.

Enter any positive value (say 3) as the depth.

Then enter the target words separated by semicolons. For example, to download the web pages that contain "delhi", "mumbai" or "kolkata" in their URL, enter "delhi;mumbai;kolkata".
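As a rough sketch of how such an input string might be split into individual words, assuming surrounding whitespace is trimmed, empty entries and duplicates are dropped, and words are lower-cased (none of which the project documents; the class name `TargetWordsSketch` is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; the normalisation rules below are assumptions.
final class TargetWordsSketch {

    /** Splits "delhi;mumbai;kolkata" into a list of distinct, lower-cased words. */
    static List<String> parse(String input) {
        List<String> words = new ArrayList<>();
        for (String part : input.split(";")) {
            String word = part.trim().toLowerCase(Locale.ROOT);
            // Skip blanks (e.g. from ";;") and words already seen.
            if (!word.isEmpty() && !words.contains(word)) {
                words.add(word);
            }
        }
        return words;
    }
}
```

With these rules, `TargetWordsSketch.parse("delhi;mumbai;kolkata")` yields the three words in the order entered.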

Also give the path of the destination folder where the results are to be saved, for example "/home/tushar/results/". The program will then create 3 subfolders:

/home/tushar/results/delhi

/home/tushar/results/mumbai

/home/tushar/results/kolkata

These folders are used to save the results. For instance, every crawled page with the word "delhi" in its URL is downloaded to /home/tushar/results/delhi.

If the URL of a page contains more than one of the target words, the page is saved in the folder of each matching word.
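The routing of a page to one or more subfolders can be illustrated with a short sketch. The class name `FolderRoutingSketch` and the case-insensitive matching are assumptions, and the code only computes the target paths; creating the folders (for example with `java.nio.file.Files.createDirectories`) and downloading the page are left out.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: decides which subfolders a page belongs in.
final class FolderRoutingSketch {

    /**
     * Returns one subfolder of destination per target word found in the URL.
     * A URL containing several target words maps to several folders.
     */
    static List<Path> foldersFor(String url, Path destination, List<String> targetWords) {
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        List<Path> folders = new ArrayList<>();
        for (String word : targetWords) {
            if (lowerUrl.contains(word.toLowerCase(Locale.ROOT))) {
                folders.add(destination.resolve(word)); // e.g. <destination>/delhi
            }
        }
        return folders;
    }
}
```

For a URL that contains both "delhi" and "mumbai", with the destination /home/tushar/results and the three target words above, `foldersFor` returns the `delhi` and `mumbai` subfolders and omits `kolkata`.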
