Stars
7 stars written in Java
Free and Open Source, Distributed, RESTful Search Engine
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
(T)he (N)ew (H)otness. Improved full-text search of archival web data.
Builds Lucene/Solr indexes out of NutchWAX segments and revisit records via Hadoop.