Assignment-Ⅰ: Creating a Custom Search Engine with Apache Nutch
-by Ashish Kumar Bostan (2112017)
Abstract
The purpose of this project was to design and implement a functional search engine using Apache Nutch
for web crawling and Apache Tomcat for deployment. By leveraging these open-source technologies, we
aimed to create a search system capable of efficiently crawling, indexing, and retrieving relevant results
from multiple web sources. The project validated the feasibility of using Nutch and Tomcat to construct a
customized search engine that delivers accurate search results based on indexed content.
Project Overview
This project focused on building a search engine using Apache Nutch and Apache Tomcat. Our goal was
to create a web-based system that could effectively index and retrieve web content, providing users with
accurate and relevant search results. By utilizing open-source tools, we demonstrated the practicality of
constructing a tailored search engine solution.
Environment and Tools
    
    ● Operating System: macOS Sonoma 14.1 (Unix-based)
    ● Apache Nutch: Version 0.9
    ● Apache Tomcat: Version 9.0.82
    ● Java SDK: Version 21.0.1
Setting Up the Software
    1. Installing Apache Tomcat
           To enable local deployment and testing of the web application, we downloaded
           "Apache-Tomcat-9.0.82.tar" from Apache’s official site and extracted it in the "Downloads"
           folder. We also set the JAVA_HOME environment variable so that Tomcat could locate the
           Java installation. Tomcat was started using the command:
               /Users/ranjan/Downloads/apache-tomcat-9.0.82/bin/startup.sh
               
    2. Installing and Configuring Apache Nutch
           Apache Nutch was selected for its extensive web-crawling capabilities. After downloading
           nutch-0.9.tar and placing it in /Users/ranjan/Downloads/nutch-0.9, we created a urls
           folder in nutch-0.9/bin, which included a seed.txt file with URLs for crawling. The URL
           http://www.nits.ac.in was used for demonstration.
           Configuration changes were made in nutch-0.9/conf as follows:
               ○ crawl-urlfilter.txt: Added the pattern
                   +^http://([a-z0-9]*\.)*www.nits.ac.in/
               ○ regex-urlfilter.txt: Added the line
                   +^http://([a-z0-9]*\.)*www.nits.ac.in
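The seed file and the URL filter described in step 2 can be sketched in shell as follows. This is a minimal illustration, not the exact commands used in the project: the function names make_seed and url_allowed are hypothetical, and grep -E only approximates the Java regular expressions that Nutch itself applies. The leading "+" of the filter line marks an include rule and is not part of the expression, so it is dropped here.

```shell
#!/bin/sh
# Sketch only: make_seed and url_allowed are illustrative helper names,
# not part of Nutch.

# Create <dir>/urls/seed.txt containing a single seed URL.
make_seed() {
    seed_dir="$1"
    seed_url="$2"
    mkdir -p "$seed_dir/urls" && printf '%s\n' "$seed_url" > "$seed_dir/urls/seed.txt"
}

# Return 0 if the URL matches the crawl-urlfilter.txt include pattern.
# grep -E stands in for Nutch's Java regex matching (an approximation).
url_allowed() {
    printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*www.nits.ac.in/'
}
```

For example, make_seed /Users/ranjan/Downloads/nutch-0.9/bin http://www.nits.ac.in would create the urls folder and seed.txt in the location described above, and url_allowed can be used to check by hand whether a given page URL would pass the filter.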
                   
Starting the Crawling Process
Following configuration, we began crawling by navigating to nutch-0.9/bin in the Terminal and
executing:
    ./nutch crawl urls -dir Crawled_Data -depth 3 -topN 10
The Crawled_Data directory was used to store the crawled data. The -depth option set how many link
levels the crawl followed from the seed URL, and -topN capped the number of pages fetched at each
level.
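The crawl step can be wrapped in a small shell function that runs the command and then confirms that the output directory exists. This is a sketch under stated assumptions: run_crawl is a hypothetical helper name, and the NUTCH variable is an assumed override for the path to the nutch launcher script, defaulting to ./nutch as in the command above.

```shell
#!/bin/sh
# Sketch only: run_crawl and the NUTCH override are illustrative assumptions.

# Run the crawl with the given output directory, depth, and topN,
# then verify that the output directory was actually created.
run_crawl() {
    out_dir="$1"
    depth="$2"
    top_n="$3"
    "${NUTCH:-./nutch}" crawl urls -dir "$out_dir" -depth "$depth" -topN "$top_n" || return 1
    [ -d "$out_dir" ] || { echo "crawl finished but $out_dir is missing" >&2; return 1; }
}
```

Run from nutch-0.9/bin, the call run_crawl Crawled_Data 3 10 corresponds to the command used in this project.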
Deploying the Search Engine on Tomcat
Once the data was crawled, we deployed the search engine by placing nutch-0.9.war in Tomcat’s
webapps directory (/Users/ranjan/Downloads/apache-tomcat-9.0.82/webapps). We then
modified the searcher.dir property in nutch-site.xml to point to Crawled_Data, allowing the search
engine to read the indexed data.
After starting Tomcat, we accessed the search interface at http://localhost:8080/nutch-0.9/. A
sample search for "b.tech" returned nine relevant results from the indexed content, confirming the
system’s functionality.
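The deployment step can be sketched in shell as follows. The helper names deploy_war and searcher_dir_property are hypothetical, and the sketch assumes the property is named searcher.dir, as it appears in Nutch's default configuration. The second function only prints the property entry; merging it into nutch-site.xml is left as a manual step.

```shell
#!/bin/sh
# Sketch only: deploy_war and searcher_dir_property are illustrative helper names.

# Copy the Nutch WAR file into the webapps directory of a Tomcat installation.
deploy_war() {
    war="$1"
    tomcat_home="$2"
    [ -f "$war" ] || { echo "missing WAR: $war" >&2; return 1; }
    cp "$war" "$tomcat_home/webapps/"
}

# Print a nutch-site.xml property entry that points the searcher
# at the given crawl directory (assumed property name: searcher.dir).
searcher_dir_property() {
    cat <<EOF
<property>
  <name>searcher.dir</name>
  <value>$1</value>
</property>
EOF
}
```

For this project the second argument to deploy_war would be /Users/ranjan/Downloads/apache-tomcat-9.0.82. The value passed to searcher_dir_property should be the absolute path of Crawled_Data; since the crawl was run from nutch-0.9/bin, that is presumably /Users/ranjan/Downloads/nutch-0.9/bin/Crawled_Data, though the report does not state the full path.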
Challenges and Solutions
We encountered an error in Search.jsp at line 151, which was resolved by adding an escape sequence
for header.html. Restarting Tomcat after this adjustment corrected the issue.
Results
The search engine was successfully deployed, displaying a homepage with a logo and search bar.
Search queries, such as "practice," returned 42 relevant results, confirming the search engine’s
effectiveness and operational success.