Information Retrieval
Assignment 1
             Session: 2020 – 2024
             Submitted by:
     Saqlain Nawaz           2020-CS-135
             Supervised by:
        Sir Khaldoon Syed Khurshid
     Department of Computer Science
University of Engineering and Technology
Lahore, Pakistan
Introduction
Welcome to the Inverted Indexing and Text Search Manual. This manual provides
comprehensive guidance on utilizing a Python tool designed to create an inverted index from
a collection of text documents and conduct text searches within them. Whether you're an
experienced programmer or have limited coding skills, this manual will help you make the
most of this powerful tool.
Purpose of the Program:
The Inverted Indexing and Text Search Tool is a versatile utility designed to assist you in
various text-related tasks. It allows you to:
   ●   Create an inverted index: Transform a collection of text documents into a structured
       index that facilitates efficient text retrieval.
   ●   Search for specific terms: Locate documents that contain particular words or
       phrases.
   ●   Count word occurrences: Quantify how frequently specific words appear within each
       document.
By the end of this manual, you'll be proficient in using this tool to streamline your text
analysis tasks and extract valuable insights from your documents.
Installation and Setup:
Python: Ensure you have Python 3 installed on your system; the tool is written for Python 3 and will not run under Python 2.
NLTK Library: Install the NLTK library if you haven't already. You can install it using the
following command:
                                    pip install nltk
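NLTK Data: The tool also relies on a few NLTK data packages: the tokenizer models used by word_tokenize and sent_tokenize, the English stopword list, and the part-of-speech tagger used by nltk.pos_tag. If these are not already present on your machine, you can download them once from a Python shell (the exact package names can vary slightly between NLTK versions):

import nltk
nltk.download('punkt')                        # tokenizer models for word/sentence tokenization
nltk.download('stopwords')                    # English stopword list
nltk.download('averaged_perceptron_tagger')   # POS tagger used by nltk.pos_tag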
Running the Tool: Save the code in a Python file (e.g., text_search.py) and place the text documents you want to index (plain .txt files) in the same directory as the script. You can then run the tool by executing the Python script:
                                     python text_search.py
Explanation and Guide
Imports (Libraries)
import os
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import os
  ●   Purpose: The os module provides a way to work with the operating system, allowing
      you to perform various file and directory operations.
  ●   Use in the Program: In the code, os is used to manipulate file paths and interact
      with the filesystem. It's used to list files in a directory, join file paths, and determine
      the script's directory path.
import nltk
  ●   Purpose: The nltk (Natural Language Toolkit) library is a comprehensive library for
      natural language processing tasks.
  ●   Use in the Program: nltk is used extensively for text processing in this code. It
      provides tools for tokenization, part-of-speech tagging, and stemming, which are
      crucial for creating an inverted index and performing text searches.
import string
  ●   Purpose: The string module provides a collection of common string operations,
      including a list of punctuation characters.
  ●   Use in the Program: In the code, string.punctuation is used to filter out
      punctuation characters from the text. This is important when tokenizing sentences
      into words.
from nltk.corpus import stopwords
  ●   Purpose: The NLTK corpus module includes predefined lists of stopwords for various
      languages, including English.
  ●   Use in the Program: The stopwords module is used to access a set of common
      English stopwords. Stopwords are words that are commonly used in text but often do
      not carry significant meaning (e.g., "the," "and"). Filtering out stopwords is a common
      preprocessing step in text analysis.
from nltk.stem import PorterStemmer
  ●   Purpose: The PorterStemmer is a stemming algorithm that reduces words to their base or root form, so that related word forms share a single representation.
  ●   Use in the Program: In the code, the PorterStemmer is used to stem words in text
      documents before they are indexed. This simplifies the process of matching different
      forms of a word during text searches.
from nltk.tokenize import word_tokenize, sent_tokenize
  ●   Purpose: The nltk.tokenize module provides functions for breaking text into
      words or sentences.
  ●   Use in the Program: In the code, word_tokenize and sent_tokenize functions
      are used to tokenize text into words and sentences, respectively. This tokenization is
      essential for processing text at the word and sentence level.
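To illustrate what these two functions produce, here is a small, self-contained example (the exact tokens depend on your NLTK version and downloaded data):

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Inverted indexes speed up search. They map words to documents."
print(sent_tokenize(text))
# ['Inverted indexes speed up search.', 'They map words to documents.']
print(word_tokenize("They map words to documents."))
# ['They', 'map', 'words', 'to', 'documents', '.']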
Variables
# Get the list of English stopwords
stop_words = set(stopwords.words('english'))
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}   # Add more characters if needed
# Initialize a Porter stemmer
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
  ●   Explanation: The variable stop_words is assigned a set of English stopwords
      using NLTK's stopwords.words('english'). These stopwords will be used to
      filter out common words from the text documents being processed. This filtering
      helps reduce the size of the inverted index and focuses on the content-carrying
      words.
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}
  ●   Explanation: This variable unwanted_chars is a set containing characters that are
      considered unwanted and should be removed from the text before processing. The
      characters include various forms of quotes, dashes, and ellipses. If additional
      unwanted characters are identified, they can be added to this set.
stemmer = PorterStemmer()
   ●    Explanation: Here, an instance of the Porter Stemmer is initialized as the variable
        stemmer. The Porter Stemmer is used to reduce words to their root or base form. In
        this code, it is employed so that different inflected forms of a word (e.g., "running"
        and "runs", which both stem to "run") are treated as the same term during indexing
        and searching. This is particularly important for matching query words against the
        words stored in the inverted index.
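The short sketch below shows how these three variables work together on a made-up sentence, mirroring the preprocessing the indexer performs (strip characters, tokenize, drop stopwords, stem); the part-of-speech filter used in the full program is omitted here for brevity, and the stems shown are what NLTK's Porter stemmer typically produces:

sentence = '“running” the search — documents keep running'.lower()
# Remove punctuation and unwanted characters, as the indexer does per sentence
cleaned = "".join(ch for ch in sentence
                  if ch not in string.punctuation and ch not in unwanted_chars)
# Tokenize, drop stopwords, and stem what remains
tokens = [w for w in word_tokenize(cleaned) if w not in stop_words]
print([stemmer.stem(w) for w in tokens])
# e.g. ['run', 'search', 'document', 'keep', 'run']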
Functions
def create_index(dir_path)
def create_index(dir_path):
    # Initialize an empty dictionary for the inverted index
    inverted_index = {}
   1. def create_index(dir_path): This line defines a Python function called
      create_index. It takes one argument, dir_path, which is the path to the directory
      containing the text documents that you want to index. This function will be
      responsible for building the inverted index and word counts for each document.
   2. # Initialize an empty dictionary for the inverted index: This is a
      comment that explains the purpose of the next line of code. It's initializing an empty
      dictionary named inverted_index, which will be used to store the inverted index.
   3. inverted_index = {}: This line creates an empty Python dictionary called
      inverted_index. Inverted indexing is a technique used for text retrieval, where
      words are associated with the documents they appear in. This dictionary will store
      those associations.
def create_index(dir_path):
    # Initialize an empty dictionary for the inverted index
    inverted_index = {}
    # Initialize a dictionary to store word counts per document
    word_counts_per_document = {}
   1. # Initialize a dictionary to store word counts per document:
      This comment explains that the following line initializes a dictionary to store word
      counts for each document in the directory.
   2. word_counts_per_document = {}: This line creates an empty dictionary called
        word_counts_per_document. This dictionary will be used to keep track of the
        frequency of each word within each document, essentially counting how many times
        each word appears in each text file. It is crucial for later search and retrieval
        operations.
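The code excerpt that follows resumes deep inside the function, in the loop that processes tagged words. For context, the surrounding file-reading code described in steps 1 to 10 of the walkthrough below looks roughly like this (reconstructed from that explanation, so details such as indentation are approximate):

    for filename in os.listdir(dir_path):
        if filename.endswith('.txt'):
            try:
                with open(os.path.join(dir_path, filename), 'r', encoding='utf8') as file:
                    # Read the file, lowercase it, and split it into sentences
                    sentences = sent_tokenize(file.read().lower())
                    # Word counts for this document only
                    word_counts = {}
                    for sentence in sentences:
                        # Remove punctuation and unwanted characters from the sentence
                        sentence_without_punctuation = "".join(
                            [char for char in sentence
                             if char not in string.punctuation and char not in unwanted_chars])
                        words = word_tokenize(sentence_without_punctuation)
                        # Tag each word with its part of speech
                        tagged_words = nltk.pos_tag(words)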
                        # For each word, if it's a noun or verb, stem it and add an
                        # entry in the inverted index pointing to this filename
                        for word, pos in tagged_words:
                            if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word not in stop_words:
                                stemmed_word = stemmer.stem(word)
                                if stemmed_word not in inverted_index:
                                    inverted_index[stemmed_word] = []
                                inverted_index[stemmed_word].append(filename)
                                # Update word counts for this document
                                if stemmed_word not in word_counts:
                                    word_counts[stemmed_word] = 1
                                else:
                                    word_counts[stemmed_word] += 1
                    # Store word counts for this document
                    word_counts_per_document[filename] = word_counts
            except UnicodeDecodeError:
                print(f"Skipping file {filename} due to UnicodeDecodeError")
    return inverted_index, word_counts_per_document
   1. for filename in os.listdir(dir_path): This line sets up a loop that
      iterates over each file in the directory specified by dir_path. The os.listdir()
      function returns a list of all files and directories in the given directory, and this loop
      iterates through the file names.
   2. if filename.endswith('.txt'): This line checks if the current filename
      ends with the ".txt" extension, which typically indicates a text file.
   3. try: This line begins a try-except block to handle potential errors during file
      processing.
   4. with open(os.path.join(dir_path, filename), 'r',
       encoding='utf8') as file: Within the try block, this line opens the current text
       file for reading. It uses os.path.join() to create the full path to the file by
      combining dir_path with the filename. The file is opened in text mode ('r') and
      with the 'utf8' encoding to handle text files encoded in UTF-8.
   5. sentences = sent_tokenize(file.read().lower()): This line reads the
       content of the file using file.read(), converts the content to lowercase using
      .lower(), and then uses sent_tokenize (from NLTK) to split the content into a
      list of sentences. This step prepares the text for further processing.
   6. word_counts = {}: This line creates an empty dictionary called word_counts to
      store word frequencies for the current document. This dictionary will be populated in
      the following steps.
   7. for sentence in sentences: This line sets up a loop to iterate over each
       sentence in the sentences list.
   8. sentence_without_punctuation = "".join([char for char in
       sentence if char not in string.punctuation and char not in
       unwanted_chars]): This line removes punctuation and unwanted characters from
       the current sentence. It creates a new string called
   sentence_without_punctuation by joining characters that are not in
   string.punctuation or unwanted_chars.
9. words = word_tokenize(sentence_without_punctuation): This line
    tokenizes the sentence_without_punctuation into a list of words using the
    word_tokenize function from NLTK.
10. tagged_words = nltk.pos_tag(words): This line uses nltk.pos_tag to tag
   each word in words with its part of speech. The result is stored in the
   tagged_words list of word-tag pairs.
11. # For each word, if it's a noun or verb, stem it and add an
    entry in the inverted index pointing to this filename: This
    comment explains that the code will process each word in the current sentence,
    checking if it's a noun or verb, and then stemming it before associating it with the
    current filename in the inverted index.
12. for word, pos in tagged_words: This line sets up a loop to iterate over each
   word and its corresponding part of speech in the tagged_words list.
 13. if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word
     not in stop_words: This line checks two conditions for each word:
        ● Whether the word's part of speech (pos) is in the specified list of noun and
            verb POS tags. If it is, it's considered for further processing.
        ● Whether the word is not in the set of stop_words, which are common words
            that are often filtered out in text analysis.
14. stemmed_word = stemmer.stem(word): If a word passes the previous
    conditions, it is stemmed using the Porter Stemmer. The stemmed word is stored in
    the variable stemmed_word.
15. if stemmed_word not in inverted_index: This line checks if the
   stemmed_word is not already in the inverted_index.
16. inverted_index[stemmed_word] = []: If the word is not in the inverted index,
    it initializes an empty list as the value for that word in the inverted index.
17. inverted_index[stemmed_word].append(filename): Regardless of whether
   the word was already in the inverted index or not, it appends the filename of the
    current document to the list associated with the stemmed_word. This associates the
    word with the document where it appears in the inverted index.
18. if stemmed_word not in word_counts: This line checks if the
   stemmed_word is not in the word_counts dictionary.
19. word_counts[stemmed_word] = 1: If the word is not in word_counts, it
    initializes it with a count of 1, indicating that this word has been found once in the
    current document.
20. else: If the word is already in word_counts, this block of code is executed.
21. word_counts[stemmed_word] += 1: It increments the count for the word in
   word_counts to indicate that the word has been found again in the current
   document.
   22. # Store word counts for this document: This comment explains that the
       code is about to store the word counts for the current document.
   23. word_counts_per_document[filename] = word_counts: This line stores
       the word_counts dictionary (word counts for the current document) in the
       word_counts_per_document dictionary with the filename as the key. This
       associates the word counts with the document.
   24. except UnicodeDecodeError: This is an exception handler that catches
       UnicodeDecodeError exceptions. This exception occurs when a file cannot be
       decoded using the specified encoding, which can happen when processing text files
       with non-standard encodings.
   25. print(f"Skipping file {filename} due to UnicodeDecodeError"): If
       a UnicodeDecodeError is raised, this line prints a message indicating that the file
       is being skipped due to this encoding-related error.
   26. return inverted_index, word_counts_per_document: This line returns two
       values as a tuple:
           ● inverted_index: This is a dictionary containing the inverted index, where
               each stemmed word is associated with a list of filenames where it appears.
           ● word_counts_per_document: This is a dictionary containing word counts
               for each document, showing how many times each word appears in each
               document.
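To make these return values concrete, here is a purely hypothetical example. Suppose the directory contains two small files, doc1.txt and doc2.txt, and that after tagging, stopword removal, and stemming the surviving terms are as shown; the function would then return structures shaped like this:

inverted_index = {
    'search':   ['doc1.txt', 'doc2.txt'],
    'document': ['doc1.txt'],
    'index':    ['doc2.txt', 'doc2.txt'],   # a filename is appended once per occurrence
}
word_counts_per_document = {
    'doc1.txt': {'search': 1, 'document': 1},
    'doc2.txt': {'search': 1, 'index': 2},
}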
def main()
Code
# Now you can create the index, search it, and count word occurrences within documents
def main():
    dir_path = os.path.dirname(os.path.abspath(__file__))
    inverted_index, word_counts_per_document = create_index(dir_path)
    print(inverted_index)

    def search(query):
        # Tokenize and stem the query
        query_words = word_tokenize(query.lower())
        stemmed_query_words = [stemmer.stem(word) for word in query_words]
        # Retrieve the filenames for each query word
        matching_filenames_for_each_word = {word: inverted_index.get(word, []) for word in stemmed_query_words}
        return matching_filenames_for_each_word

    # User-friendly search prompt
    while True:
        user_query = input("Enter a search query (or 'exit' to quit): ")
        if user_query == 'exit':
            break
        results = search(user_query)
        # Collect and count unique filenames
        unique_filenames = set()
        for filenames in results.values():
            unique_filenames.update(filenames)
        for filename in unique_filenames:
            word_count = sum(word_counts_per_document[filename].get(word, 0) for word in results.keys())
            print(f"The word(s) appear in '{filename}' {word_count} time(s):")
        if not unique_filenames:
            print("No matching documents found for the query.")

if __name__ == "__main__":
    main()
Explanation
def main():
    dir_path = os.path.dirname(os.path.abspath(__file__))
   1. def main(): This line defines the main function, which is the entry point of your
      program. It doesn't take any arguments.
  2. dir_path = os.path.dirname(os.path.abspath(__file__)): This line
     sets dir_path to the directory path of the script file itself. It uses os.path to obtain
     the absolute path of the current script (__file__) and then extracts the directory
     path from it. This is used to determine the directory where the text documents are
     located.
    inverted_index, word_counts_per_document = create_index(dir_path)
    print(inverted_index)
  3. inverted_index, word_counts_per_document =
     create_index(dir_path): Here, the code calls the create_index function to
     build the inverted index and word counts for the documents in the directory specified
     by dir_path. It stores the results in inverted_index and
     word_counts_per_document.
  4. print(inverted_index): This line prints the inverted_index to the console. It
     provides a visual representation of the inverted index, showing how words are
     associated with the documents they appear in.
    def search(query):
   5. def search(query): This line defines a new function called search, which takes
      a single argument, query. This function is responsible for searching the inverted
      index based on user queries.
        query_words = word_tokenize(query.lower())
        stemmed_query_words = [stemmer.stem(word) for word in query_words]
  6. The lines within the search function tokenize and stem the user's query:
         ○   query_words = word_tokenize(query.lower()): The query is
             tokenized into individual words using word_tokenize, and all the words are
             converted to lowercase to ensure consistent matching.
         ○   stemmed_query_words = [stemmer.stem(word) for word in
             query_words]: Each tokenized word in the query is stemmed using the
             stemmer.stem function. This ensures that the query words are in the same
             form as the words in the inverted index.
        matching_filenames_for_each_word = {word: inverted_index.get(word, []) for word in stemmed_query_words}
        return matching_filenames_for_each_word
  7. The code retrieves filenames associated with each query word from the inverted
     index. It creates a dictionary, matching_filenames_for_each_word, where
     each query word is the key, and the associated list of filenames is the value. This
     information will be used for search results.
  8. return matching_filenames_for_each_word: The function returns
     matching_filenames_for_each_word, which contains the search results
     indicating which documents contain the query words.
    while True:
        user_query = input("Enter a search query (or 'exit' to quit): ")
  9. This code initiates a user-friendly search interface where users can input search
     queries. It uses a while loop to repeatedly prompt the user for input.
  10. user_query = input("Enter a search query (or 'exit' to quit): "): This line reads
      the user's search query from the console. Users can type a search query or type
      'exit' to quit the search interface.
         if user_query == 'exit':
              break
  11. This conditional statement checks if the user entered 'exit' as the query. If they did,
      the while loop is exited, ending the search interface.
         results = search(user_query)
  12. results = search(user_query): The user's search query is passed to the
     search function, and the results are stored in the results variable.
        unique_filenames = set()
        for filenames in results.values():
            unique_filenames.update(filenames)
  13. This part of the code processes the search results:
  ● unique_filenames is initialized as an empty set to collect unique filenames that
      match the search query.
  ● The for loop iterates over the filenames associated with each query word from the
     results dictionary and updates the unique_filenames set with those filenames.
        for filename in unique_filenames:
            word_count = sum(word_counts_per_document[filename].get(word, 0) for word in results.keys())
            print(f"The word(s) appear in '{filename}' {word_count} time(s):")
        if not unique_filenames:
            print("No matching documents found for the query.")
   14. The code further processes the unique filenames:
       ●   It iterates over the unique filenames.
       ●   For each filename, it calculates the total word count for the query words found in that
           document. This is done by iterating over the query words and summing how many
           times each of them appears in the document.
       ●   It then prints the filename along with the word count for the query words found in that
           document.
   15. If there are no unique filenames (i.e., no matching documents for the query), it prints
       a message indicating that no matching documents were found for the query.
if __name__ == "__main__":
       main()
   16. Finally, this code checks if the script is being run as the main program (not imported
       as a module). If it is, it calls the main function to start the search interface.
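Under the same hypothetical two-document setup used earlier, an interactive session might look like this (the filenames and counts are purely illustrative):

Enter a search query (or 'exit' to quit): searching documents
The word(s) appear in 'doc1.txt' 2 time(s):
The word(s) appear in 'doc2.txt' 1 time(s):
Enter a search query (or 'exit' to quit): exit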
Data Flow Diagram
Block Diagram