Assignment 3 BIM IR

This document describes an information retrieval assignment that involves creating an inverted index and binary term-document matrix from a collection of text documents. It discusses the libraries used, preprocessing steps like tokenization and stemming, representing a query, scoring and ranking documents, and retrieving the top results. The code flow details creating the index, handling queries, scoring documents based on the query and matrix, ranking results, and returning the top matches to the user interactively.

Uploaded by

Pac SaQii

Information Retrieval

Assignment 3

Session: 2020 – 2024

Submitted by:
Saqlain Nawaz 2020-CS-135

Supervised by:
Sir Khaldoon Syed Khurshid

Department of Computer Science
University of Engineering and Technology
Lahore, Pakistan

Libraries Used:

1. os:
○ Purpose: Provides functions for interacting with the operating system,
particularly used for file operations and directory traversal.
2. nltk:
○ Purpose: The Natural Language Toolkit (NLTK) library is used for
natural language processing tasks such as tokenization, stemming,
and part-of-speech tagging.
3. nltk.corpus.stopwords:
○ Purpose: NLTK's stopwords corpus provides a list of common English
stopwords, which are words typically excluded from text analysis due to
their high frequency and low informativeness.
4. nltk.stem.PorterStemmer:
○ Purpose: The PorterStemmer class from NLTK implements the Porter
stemming algorithm, which reduces words to their root or base form,
standardizing words for analysis.

Code Flow:

Preprocessing and Creating the Inverted Index:

● The code begins by importing necessary libraries and initializing NLTK's
PorterStemmer and English stopwords.
● The create_index function is defined to create an inverted index and a
binary term-document matrix for a collection of text documents in a specified
directory.
● It iterates through each text file in the directory, reads the content, and
tokenizes it into sentences.
● For each sentence, it tokenizes it into words and tags their parts of speech
using NLTK's pos_tag.
● The code identifies words that are nouns (NN, NNS, NNP, NNPS) and not in
the list of English stopwords. These words are stemmed using the Porter
stemmer.
● Entries are added to the inverted index, where the stemmed word is the key,
and a list of filenames where the word appears is the value.
● The binary term-document matrix is also created, where each term is
associated with documents in which it appears with a binary weight of 1.
● UnicodeDecodeError exceptions are handled for files that cannot be decoded.
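
The indexing steps above can be sketched roughly as follows. This is a minimal illustration, not the assignment's actual code: to stay runnable without NLTK data, tokenize, stem, STOPWORDS, and the keep predicate are hypothetical stand-ins for NLTK tokenization, PorterStemmer, the stopwords corpus, and the pos_tag noun filter. Restricting the scan to .txt files is also an assumption.

```python
import os
import re
from collections import defaultdict

# Stand-in stopword list (assumption); the report uses NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "are", "to", "on"}


def tokenize(text):
    """Lowercase word tokenizer (placeholder for NLTK tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Crude plural stripper (placeholder for PorterStemmer.stem)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def create_index(directory, keep=lambda word: True):
    """Return (inverted_index, binary_matrix) for the .txt files in directory.

    keep(word) is a hypothetical hook standing in for the noun filter
    (tags NN, NNS, NNP, NNPS) described in the report.
    """
    inverted_index = defaultdict(list)  # term -> [filenames]
    binary_matrix = defaultdict(dict)   # term -> {filename: 1}
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".txt"):
            continue
        path = os.path.join(directory, filename)
        try:
            with open(path, encoding="utf-8") as handle:
                text = handle.read()
        except UnicodeDecodeError:
            continue  # skip files that cannot be decoded
        for word in tokenize(text):
            if word in STOPWORDS or not keep(word):
                continue
            term = stem(word)
            if filename not in inverted_index[term]:
                inverted_index[term].append(filename)
            binary_matrix[term][filename] = 1  # binary weight: term occurs
    return dict(inverted_index), dict(binary_matrix)
```

Storing the matrix as a dictionary of dictionaries keeps only the 1 entries, so a document that is absent from a term's row implicitly has weight 0.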

Representing a Query and Scoring Documents:

● The represent_query function tokenizes and stems a user's search query
and represents it as a query vector.
● The score_documents function calculates document scores based on the
query vector and the binary term-document matrix.
● Document scores are normalized by dividing them by the number of terms in
each document.
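
A possible shape for these two functions is sketched below. It assumes the matrix is stored as term -> {document: 1}, that "number of terms in each document" means the number of distinct indexed terms, and that the stemmer is passed in as a hypothetical stem argument (identity by default); the report does not spell out these details.

```python
from collections import defaultdict


def represent_query(query, stem=lambda word: word):
    """Map each stemmed query term to a binary weight of 1."""
    return {stem(token): 1 for token in query.lower().split()}


def score_documents(query_vector, binary_matrix):
    """Score = matched query terms / distinct indexed terms in the document."""
    # Count how many indexed terms each document contains.
    doc_lengths = defaultdict(int)
    for postings in binary_matrix.values():
        for doc in postings:
            doc_lengths[doc] += 1
    # Accumulate the dot product of the query vector and each document column.
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc, d_weight in binary_matrix.get(term, {}).items():
            scores[doc] += q_weight * d_weight
    # Normalize by document length so long documents are not favored.
    return {doc: score / doc_lengths[doc] for doc, score in scores.items()}
```

For example, a document with three indexed terms that matches both terms of a two-term query would score 2/3 under this scheme, while a two-term document matching one query term would score 1/2.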

Ranking and Retrieving Documents:

● The rank_documents function sorts documents by their scores in
descending order.
● The retrieve_top_k_documents function retrieves the top-K documents
from the ranked list.
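
Ranking and top-K retrieval might look like the sketch below, assuming scores arrive as a dictionary of filename -> score. Breaking ties by filename is an added assumption to make the order deterministic; the report does not say how ties are handled.

```python
def rank_documents(scores):
    """Return (document, score) pairs sorted by score, highest first."""
    # Negating the score sorts descending; the filename breaks ties.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def retrieve_top_k_documents(ranked, k):
    """Return the first k entries of an already ranked list."""
    return ranked[:k]
```

Slicing handles the case where fewer than k documents match, since it simply returns whatever is available.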

Main Execution:

● The script obtains the directory path of the code file and creates the inverted
index and binary term-document matrix using the create_index function.
● It enters a loop where the user can input search queries interactively.
● For each query, it represents the query, scores documents, ranks them,
retrieves the top 2 documents, and presents the results.
● The loop continues until the user enters "exit".
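
The interactive loop could be structured as in the sketch below. Here search is a hypothetical callable that takes a query string and returns a ranked list of (document, score) pairs; read and write default to input and print and are parameters only so the loop can be exercised without a terminal. The prompt text and output format are illustrative.

```python
def run_search_loop(search, read=input, write=print, k=2):
    """Prompt for queries until the user types 'exit', showing the top k hits."""
    while True:
        query = read("Enter a query (or 'exit' to quit): ").strip()
        if query.lower() == "exit":
            break
        results = search(query)[:k]  # keep only the top-k ranked documents
        if not results:
            write("No matching documents.")
            continue
        for rank, (doc, score) in enumerate(results, start=1):
            write(f"{rank}. {doc} (score {score:.3f})")
```

In a script, the document directory is commonly derived with os.path.dirname(os.path.abspath(__file__)) before building the index, which matches the report's description of using the code file's own directory.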

Block Diagram and Data Flow Diagram (DFD):
