Information Retrieval
Assignment 2
Session: 2020 – 2024
Submitted by:
Saqlain Nawaz 2020-CS-135
Supervised by:
Sir Khaldoon Syed Khurshid
Department of Computer Science
University of Engineering and Technology
Lahore, Pakistan
Libraries Used:
❖ os:
➢ Purpose: Provides functions for interacting with the operating system,
specifically used for file operations and directory traversal.
❖ string:
➢ Purpose: The string module provides a collection of string constants
for various character sets and types of characters, including ASCII
letters, digits, punctuation characters, and whitespace characters. It is
often used for text manipulation and character-related operations.
❖ math:
➢ Purpose: The math library provides mathematical functions and
constants used in various calculations in the code.
❖ nltk (Natural Language Toolkit):
➢ Purpose: NLTK is used for natural language processing tasks,
including tokenization, stemming, and part-of-speech tagging.
❖ collections.defaultdict:
➢ Purpose: The defaultdict class from the collections module is
used to create dictionaries with default values for keys. In this code, it
is used to create dictionaries for the inverted index and document
counts.
❖ nltk.corpus.stopwords:
➢ Purpose: NLTK's stopwords corpus provides a list of common English
stopwords. Stopwords are words that are often excluded from text
analysis due to their high frequency and low informativeness.
❖ nltk.stem.PorterStemmer:
➢ Purpose: The PorterStemmer class from NLTK implements the
Porter stemming algorithm. It reduces words to their root or base form,
helping standardize words.
❖ nltk.tokenize.word_tokenize:
➢ Purpose: NLTK's word_tokenize function tokenizes sentences into
individual words, breaking down text into its constituent units.
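The role of defaultdict described above can be shown with a minimal, self-contained sketch (the filenames and terms are made up for illustration): unseen keys get an empty posting list or a zero count automatically, so no key-existence checks are needed.

```python
from collections import defaultdict

# defaultdict(list) supplies an empty list for unseen keys, so a posting
# can be appended without first checking whether the term was seen before.
inverted_index = defaultdict(list)
doc_count = defaultdict(int)  # unseen keys start at 0

for filename, term in [("a.txt", "index"), ("b.txt", "index"), ("b.txt", "term")]:
    inverted_index[term].append(filename)
    doc_count[term] += 1

print(doc_count["index"])      # 2
print(inverted_index["term"])  # ['b.txt']
```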
Code Flow:
1. Preprocessing Function (preprocess):
○ This function takes a document as input and performs text
preprocessing steps.
○ It removes punctuation characters and unwanted special characters
defined in unwanted_chars.
○ Tokenizes the document into words using NLTK's word_tokenize.
○ Tags each word with its part of speech using NLTK's pos_tag.
○ Selects words that are nouns (NN, NNS, NNP, NNPS) or verbs (VB,
VBD, VBG, VBN, VBP) and not in the list of English stopwords.
○ Stems the selected words using the Porter stemmer and returns the
processed document.
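The preprocessing steps can be sketched as follows. To keep the sketch runnable without NLTK's downloadable data, the NLTK calls are replaced with standard-library stand-ins (a whitespace tokenizer, a small fixed stopword set, and a crude suffix-stripping stemmer, with POS filtering omitted), so the exact tokens differ from the real pipeline.

```python
import string

# Small stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "a", "is", "are", "of", "and", "to"}

def crude_stem(word):
    # Very rough stand-in for PorterStemmer: strip a few common suffixes.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    # 1. Remove punctuation (the real code also drops unwanted_chars).
    cleaned = document.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenize (word_tokenize in the real code).
    tokens = cleaned.lower().split()
    # 3. Drop stopwords (the real code also keeps only nouns/verbs via pos_tag).
    kept = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem each remaining token.
    return [crude_stem(t) for t in kept]

print(preprocess("The engines are indexing documents."))
# ['engin', 'index', 'document']
```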
2. Creating the Inverted Index (create_index):
○ This function builds an inverted index for the collection of text
documents in the specified directory.
○ It initializes two defaultdicts: one for the inverted index itself
(inverted_index) and another for counting the number of
documents each word appears in (doc_count).
○ It iterates through each text file in the directory.
○ For each file, it reads the content, preprocesses it using the
preprocess function, and counts the frequency of each word within
the document.
○ Entries are added to the inverted index: the stemmed word is the
key, and a (filename, frequency) pair is appended to that word's
posting list.
○ Document frequency (DF) is also updated in the doc_count
dictionary.
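The index-building steps above can be sketched as below. For a self-contained demo, the NLTK preprocessing is replaced by a simple lowercase/whitespace stand-in, and the two sample documents are hypothetical files written to a temporary directory.

```python
import os
import tempfile
from collections import defaultdict

def preprocess(text):
    # Stand-in for the full NLTK pipeline: lowercase + whitespace split.
    return text.lower().split()

def create_index(directory):
    inverted_index = defaultdict(list)  # term -> [(filename, tf), ...]
    doc_count = defaultdict(int)        # term -> number of docs containing it
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            words = preprocess(f.read())
        freqs = defaultdict(int)
        for w in words:
            freqs[w] += 1  # term frequency within this document
        for w, tf in freqs.items():
            inverted_index[w].append((filename, tf))
            doc_count[w] += 1  # each document counted once per term
    return inverted_index, doc_count

# Demo on a throwaway directory holding two tiny documents.
with tempfile.TemporaryDirectory() as d:
    for name, text in [("a.txt", "search engine index"),
                       ("b.txt", "index term index")]:
        with open(os.path.join(d, name), "w", encoding="utf-8") as f:
            f.write(text)
    idx, dc = create_index(d)

print(dc["index"])  # the term "index" occurs in 2 documents
```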
3. TF-IDF Scoring Function (tf_idf):
○ This function calculates the TF-IDF score for a given query and a
specific document.
○ It applies the TF-IDF formula, deriving the IDF component from the
document frequency (DF) of each query term.
○ The function calculates the TF-IDF score for each query term in the
document and aggregates them to get the final score.
○ It also keeps track of contributing words for each document in the
contributing_words dictionary.
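A sketch of this scoring step, assuming the common tf · log(N/df) weighting (the report does not state the exact IDF variant, so this is one plausible choice); the small hand-built index at the bottom is hypothetical data for illustration.

```python
import math
from collections import defaultdict

def tf_idf(query_terms, doc_name, inverted_index, doc_count, total_docs,
           contributing_words):
    score = 0.0
    for term in query_terms:
        df = doc_count.get(term, 0)
        if df == 0:
            continue  # term appears in no document
        # Term frequency of `term` in this document, read off the posting list.
        tf = next((f for fname, f in inverted_index[term] if fname == doc_name), 0)
        if tf == 0:
            continue  # term not in this document
        idf = math.log(total_docs / df)  # assumed IDF variant
        score += tf * idf
        contributing_words[doc_name].append(term)  # record why this doc matched
    return score

# Tiny hand-built index (hypothetical): 2 documents total.
idx = {"index": [("a.txt", 1), ("b.txt", 2)], "term": [("b.txt", 1)]}
dc = {"index": 2, "term": 1}
cw = defaultdict(list)
print(round(tf_idf(["term"], "b.txt", idx, dc, 2, cw), 3))  # 1 * ln(2) ≈ 0.693
```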
4. Search Function (search):
○ The search function takes a user's search query and directory path as
input.
○ It creates the inverted index and doc_count using the create_index
function.
○ For each document in the directory, it calculates the TF-IDF score for
the query and the document.
○ The scores are normalized using the Euclidean norm and sorted in
descending order.
○ The ranked documents and contributing words are returned.
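The normalize-and-sort step can be sketched in isolation. This assumes Euclidean normalization means dividing each raw score by the norm of the score vector, as the description suggests; the sample scores are made up.

```python
import math

def rank(scores):
    # scores: {filename: raw TF-IDF score for the query}
    norm = math.sqrt(sum(s * s for s in scores.values()))
    if norm == 0:
        return []  # no document matched the query
    normalized = {doc: s / norm for doc, s in scores.items()}
    # Highest-scoring documents first.
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)

print(rank({"a.txt": 3.0, "b.txt": 4.0}))
# [('b.txt', 0.8), ('a.txt', 0.6)]
```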
Main Execution:
○ The script enters a loop where the user can input search queries
interactively.
○ It calls the search function for each query and prints the ranked
documents along with their relevance scores and contributing words.
○ The loop continues until the user enters "exit."
Block Diagram:
A block diagram is a visual representation of the code's structure and key
components.
Data Flow Diagram (DFD):
A DFD illustrates how data moves through the code.