Information Retrieval
Assignment 1
             Session: 2020 – 2024
             Submitted by:
     Saqlain Nawaz           2020-CS-135
             Supervised by:
        Sir Khaldoon Syed Khurshid
     Department of Computer Science
University of Engineering and Technology
Lahore, Pakistan
Introduction
Welcome to the Inverted Indexing and Text Search Manual. This manual provides
comprehensive guidance on utilizing a Python tool designed to create an inverted index from
a collection of text documents and conduct text searches within them. Whether you're an
experienced programmer or have limited coding skills, this manual will help you make the
most of this powerful tool.
Purpose of the Program:
The Inverted Indexing and Text Search Tool is a versatile utility designed to assist you in
various text-related tasks. It allows you to:
   ●   Create an inverted index: Transform a collection of text documents into a structured
       index that facilitates efficient text retrieval.
   ●   Search for specific terms: Locate documents that contain particular words or
       phrases.
   ●   Count word occurrences: Quantify how frequently specific words appear within each
       document.
By the end of this manual, you'll be proficient in using this tool to streamline your text
analysis tasks and extract valuable insights from your documents.
Installation and Setup:
Python: Ensure you have Python 3 installed on your system; the tool is written for Python 3 and will not run under Python 2.
NLTK Library: Install the NLTK library if you haven't already. You can install it using the
following command:
                                    pip install nltk
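NLTK Data: The tool also relies on a few NLTK data packages: the tokenizer models used by word_tokenize and sent_tokenize, the English stopword list, and the part-of-speech tagger used by nltk.pos_tag. If these are not already present on your machine, you can download them once from a Python shell (the exact package names can vary slightly between NLTK versions):

import nltk
nltk.download('punkt')                        # tokenizer models for word/sentence tokenization
nltk.download('stopwords')                    # English stopword list
nltk.download('averaged_perceptron_tagger')   # POS tagger used by nltk.pos_tag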
Running the Tool: Save the code in a Python file (e.g., text_search.py) and place the text documents you want to index (plain .txt files) in the same directory as the script. You can then run the tool by executing the Python script:
                                     python text_search.py
Explanation and Guide
Imports (Libraries)
import os
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import os
  ●   Purpose: The os module provides a way to work with the operating system, allowing
      you to perform various file and directory operations.
  ●   Use in the Program: In the code, os is used to manipulate file paths and interact
      with the filesystem. It's used to list files in a directory, join file paths, and determine
      the script's directory path.
import nltk
  ●   Purpose: The nltk (Natural Language Toolkit) library is a comprehensive library for
      natural language processing tasks.
  ●   Use in the Program: nltk is used extensively for text processing in this code. It
      provides tools for tokenization, part-of-speech tagging, and stemming, which are
      crucial for creating an inverted index and performing text searches.
import string
  ●   Purpose: The string module provides a collection of common string operations,
      including a list of punctuation characters.
  ●   Use in the Program: In the code, string.punctuation is used to filter out
      punctuation characters from the text. This is important when tokenizing sentences
      into words.
from nltk.corpus import stopwords
  ●   Purpose: The NLTK corpus module includes predefined lists of stopwords for various
      languages, including English.
  ●   Use in the Program: The stopwords module is used to access a set of common
      English stopwords. Stopwords are words that are commonly used in text but often do
      not carry significant meaning (e.g., "the," "and"). Filtering out stopwords is a common
      preprocessing step in text analysis.
from nltk.stem import PorterStemmer
  ●   Purpose: The PorterStemmer is a stemming algorithm that reduces words to their base or root form, so that related word forms share a single representation.
  ●   Use in the Program: In the code, the PorterStemmer is used to stem words in text
      documents before they are indexed. This simplifies the process of matching different
      forms of a word during text searches.
from nltk.tokenize import word_tokenize, sent_tokenize
  ●   Purpose: The nltk.tokenize module provides functions for breaking text into
      words or sentences.
  ●   Use in the Program: In the code, word_tokenize and sent_tokenize functions
      are used to tokenize text into words and sentences, respectively. This tokenization is
      essential for processing text at the word and sentence level.
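To illustrate what these two functions produce, here is a small, self-contained example (the exact tokens depend on your NLTK version and downloaded data):

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Inverted indexes speed up search. They map words to documents."
print(sent_tokenize(text))
# ['Inverted indexes speed up search.', 'They map words to documents.']
print(word_tokenize("They map words to documents."))
# ['They', 'map', 'words', 'to', 'documents', '.']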
Variables
# Get the list of English stopwords
stop_words = set(stopwords.words('english'))
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}   # Add more characters if needed
# Initialize a Porter stemmer
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
  ●   Explanation: The variable stop_words is assigned a set of English stopwords
      using NLTK's stopwords.words('english'). These stopwords will be used to
      filter out common words from the text documents being processed. This filtering
      helps reduce the size of the inverted index and focuses on the content-carrying
      words.
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}
  ●   Explanation: This variable unwanted_chars is a set containing characters that are
      considered unwanted and should be removed from the text before processing. The
      characters include various forms of quotes, dashes, and ellipses. If additional
      unwanted characters are identified, they can be added to this set.
stemmer = PorterStemmer()
   ●    Explanation: Here, an instance of the Porter Stemmer is initialized as the variable
        stemmer. The Porter Stemmer is used to reduce words to their root or base form. In
        this code, it is employed so that different inflected forms of a word (e.g., "running"
        and "runs", which both stem to "run") are treated as the same term during indexing
        and searching. This is particularly important for matching query words against the
        words stored in the inverted index.
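The short sketch below shows how these three variables work together on a made-up sentence, mirroring the preprocessing the indexer performs (strip characters, tokenize, drop stopwords, stem); the part-of-speech filter used in the full program is omitted here for brevity, and the stems shown are what NLTK's Porter stemmer typically produces:

sentence = '“running” the search — documents keep running'.lower()
# Remove punctuation and unwanted characters, as the indexer does per sentence
cleaned = "".join(ch for ch in sentence
                  if ch not in string.punctuation and ch not in unwanted_chars)
# Tokenize, drop stopwords, and stem what remains
tokens = [w for w in word_tokenize(cleaned) if w not in stop_words]
print([stemmer.stem(w) for w in tokens])
# e.g. ['run', 'search', 'document', 'keep', 'run']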
Functions
def create_index(dir_path)
def create_index(dir_path):
    # Initialize an empty dictionary for the inverted index
    inverted_index = {}
   1. def create_index(dir_path): This line defines a Python function called
      create_index. It takes one argument, dir_path, which is the path to the directory
      containing the text documents that you want to index. This function will be
      responsible for building the inverted index and word counts for each document.
   2. # Initialize an empty dictionary for the inverted index: This is a
      comment that explains the purpose of the next line of code. It's initializing an empty
      dictionary named inverted_index, which will be used to store the inverted index.
   3. inverted_index = {}: This line creates an empty Python dictionary called
      inverted_index. Inverted indexing is a technique used for text retrieval, where
      words are associated with the documents they appear in. This dictionary will store
      those associations.
def create_index(dir_path):
    # Initialize an empty dictionary for the inverted index
    inverted_index = {}
    # Initialize a dictionary to store word counts per document
    word_counts_per_document = {}
   1. # Initialize a dictionary to store word counts per document:
      This comment explains that the following line initializes a dictionary to store word
      counts for each document in the directory.
   2. word_counts_per_document = {}: This line creates an empty dictionary called
        word_counts_per_document. This dictionary will be used to keep track of the
        frequency of each word within each document, essentially counting how many times
        each word appears in each text file. It is crucial for later search and retrieval
        operations.
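The code excerpt that follows resumes deep inside the function, in the loop that processes tagged words. For context, the surrounding file-reading code described in steps 1 to 10 of the walkthrough below looks roughly like this (reconstructed from that explanation, so details such as indentation are approximate):

    for filename in os.listdir(dir_path):
        if filename.endswith('.txt'):
            try:
                with open(os.path.join(dir_path, filename), 'r', encoding='utf8') as file:
                    # Read the file, lowercase it, and split it into sentences
                    sentences = sent_tokenize(file.read().lower())
                    # Word counts for this document only
                    word_counts = {}
                    for sentence in sentences:
                        # Remove punctuation and unwanted characters from the sentence
                        sentence_without_punctuation = "".join(
                            [char for char in sentence
                             if char not in string.punctuation and char not in unwanted_chars])
                        words = word_tokenize(sentence_without_punctuation)
                        # Tag each word with its part of speech
                        tagged_words = nltk.pos_tag(words)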
                        # For each word, if it's a noun or verb, stem it and add an
                        # entry in the inverted index pointing to this filename
                        for word, pos in tagged_words:
                            if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word not in stop_words:
                                stemmed_word = stemmer.stem(word)
                                if stemmed_word not in inverted_index:
                                    inverted_index[stemmed_word] = []
                                inverted_index[stemmed_word].append(filename)
                                # Update word counts for this document
                                if stemmed_word not in word_counts:
                                    word_counts[stemmed_word] = 1
                                else:
                                    word_counts[stemmed_word] += 1
                    # Store word counts for this document
                    word_counts_per_document[filename] = word_counts
            except UnicodeDecodeError:
                print(f"Skipping file {filename} due to UnicodeDecodeError")
    return inverted_index, word_counts_per_document
   1. for filename in os.listdir(dir_path): This line sets up a loop that
      iterates over each file in the directory specified by dir_path. The os.listdir()
      function returns a list of all files and directories in the given directory, and this loop
      iterates through the file names.
   2. if filename.endswith('.txt'): This line checks if the current filename
      ends with the ".txt" extension, which typically indicates a text file.
   3. try: This line begins a try-except block to handle potential errors during file
      processing.
   4. with open(os.path.join(dir_path, filename), 'r',
       encoding='utf8') as file: Within the try block, this line opens the current text
       file for reading. It uses os.path.join() to create the full path to the file by
      combining dir_path with the filename. The file is opened in text mode ('r') and
      with the 'utf8' encoding to handle text files encoded in UTF-8.
   5. sentences = sent_tokenize(file.read().lower()): This line reads the
       content of the file using file.read(), converts the content to lowercase using
      .lower(), and then uses sent_tokenize (from NLTK) to split the content into a
      list of sentences. This step prepares the text for further processing.
   6. word_counts = {}: This line creates an empty dictionary called word_counts to
      store word frequencies for the current document. This dictionary will be populated in
      the following steps.
   7. for sentence in sentences: This line sets up a loop to iterate over each
       sentence in the sentences list.
   8. sentence_without_punctuation = "".join([char for char in
       sentence if char not in string.punctuation and char not in
       unwanted_chars]): This line removes punctuation and unwanted characters from
       the current sentence. It creates a new string called
   sentence_without_punctuation by joining characters that are not in
   string.punctuation or unwanted_chars.
9. words = word_tokenize(sentence_without_punctuation): This line
    tokenizes the sentence_without_punctuation into a list of words using the
    word_tokenize function from NLTK.
10. tagged_words = nltk.pos_tag(words): This line uses nltk.pos_tag to tag
   each word in words with its part of speech. The result is stored in the
   tagged_words list of word-tag pairs.
11. # For each word, if it's a noun or verb, stem it and add an
    entry in the inverted index pointing to this filename: This
    comment explains that the code will process each word in the current sentence,
    checking if it's a noun or verb, and then stemming it before associating it with the
    current filename in the inverted index.
12. for word, pos in tagged_words: This line sets up a loop to iterate over each
   word and its corresponding part of speech in the tagged_words list.
 13. if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word
     not in stop_words: This line checks two conditions for each word:
        ● Whether the word's part of speech (pos) is in the specified list of noun and
            verb POS tags. If it is, it's considered for further processing.
        ● Whether the word is not in the set of stop_words, which are common words
            that are often filtered out in text analysis.
14. stemmed_word = stemmer.stem(word): If a word passes the previous
    conditions, it is stemmed using the Porter Stemmer. The stemmed word is stored in
    the variable stemmed_word.
15. if stemmed_word not in inverted_index: This line checks if the
   stemmed_word is not already in the inverted_index.
16. inverted_index[stemmed_word] = []: If the word is not in the inverted index,
    it initializes an empty list as the value for that word in the inverted index.
17. inverted_index[stemmed_word].append(filename): Regardless of whether
   the word was already in the inverted index or not, it appends the filename of the
    current document to the list associated with the stemmed_word. This associates the
    word with the document where it appears in the inverted index.
18. if stemmed_word not in word_counts: This line checks if the
   stemmed_word is not in the word_counts dictionary.
19. word_counts[stemmed_word] = 1: If the word is not in word_counts, it
    initializes it with a count of 1, indicating that this word has been found once in the
    current document.
20. else: If the word is already in word_counts, this block of code is executed.
21. word_counts[stemmed_word] += 1: It increments the count for the word in
   word_counts to indicate that the word has been found again in the current
   document.
   22. # Store word counts for this document: This comment explains that the
       code is about to store the word counts for the current document.
   23. word_counts_per_document[filename] = word_counts: This line stores
       the word_counts dictionary (word counts for the current document) in the
       word_counts_per_document dictionary with the filename as the key. This
       associates the word counts with the document.
   24. except UnicodeDecodeError: This is an exception handler that catches
       UnicodeDecodeError exceptions. This exception occurs when a file cannot be
       decoded using the specified encoding, which can happen when processing text files
       with non-standard encodings.
   25. print(f"Skipping file {filename} due to UnicodeDecodeError"): If
       a UnicodeDecodeError is raised, this line prints a message indicating that the file
       is being skipped due to this encoding-related error.
   26. return inverted_index, word_counts_per_document: This line returns two
       values as a tuple:
           ● inverted_index: This is a dictionary containing the inverted index, where
               each stemmed word is associated with a list of filenames where it appears.
           ● word_counts_per_document: This is a dictionary containing word counts
               for each document, showing how many times each word appears in each
               document.
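To make these return values concrete, here is a purely hypothetical example. Suppose the directory contains two small files, doc1.txt and doc2.txt, and that after tagging, stopword removal, and stemming the surviving terms are as shown; the function would then return structures shaped like this:

inverted_index = {
    'search':   ['doc1.txt', 'doc2.txt'],
    'document': ['doc1.txt'],
    'index':    ['doc2.txt', 'doc2.txt'],   # a filename is appended once per occurrence
}
word_counts_per_document = {
    'doc1.txt': {'search': 1, 'document': 1},
    'doc2.txt': {'search': 1, 'index': 2},
}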
def main()
Code
# Now you can create the index, search it, and count word occurrences within documents
def main():
    dir_path = os.path.dirname(os.path.abspath(__file__))
    inverted_index, word_counts_per_document = create_index(dir_path)
    print(inverted_index)

    def search(query):
        # Tokenize and stem the query
        query_words = word_tokenize(query.lower())
        stemmed_query_words = [stemmer.stem(word) for word in query_words]
        # Retrieve the filenames for each query word
        matching_filenames_for_each_word = {word: inverted_index.get(word, []) for word in stemmed_query_words}
        return matching_filenames_for_each_word

    # User-friendly search prompt
    while True:
        user_query = input("Enter a search query (or 'exit' to quit): ")
        if user_query == 'exit':
            break
        results = search(user_query)
        # Collect and count unique filenames
        unique_filenames = set()
        for filenames in results.values():
            unique_filenames.update(filenames)
        for filename in unique_filenames:
            word_count = sum(word_counts_per_document[filename].get(word, 0) for word in results.keys())
            print(f"The word(s) appear in '{filename}' {word_count} time(s):")
        if not unique_filenames:
            print("No matching documents found for the query.")

if __name__ == "__main__":
    main()
Explanation
def main():
    dir_path = os.path.dirname(os.path.abspath(__file__))
   1. def main(): This line defines the main function, which is the entry point of your
      program. It doesn't take any arguments.
  2. dir_path = os.path.dirname(os.path.abspath(__file__)): This line
     sets dir_path to the directory path of the script file itself. It uses os.path to obtain
     the absolute path of the current script (__file__) and then extracts the directory
     path from it. This is used to determine the directory where the text documents are
     located.
    inverted_index, word_counts_per_document = create_index(dir_path)
    print(inverted_index)
  3. inverted_index, word_counts_per_document =
     create_index(dir_path): Here, the code calls the create_index function to
     build the inverted index and word counts for the documents in the directory specified
     by dir_path. It stores the results in inverted_index and
     word_counts_per_document.
  4. print(inverted_index): This line prints the inverted_index to the console. It
     provides a visual representation of the inverted index, showing how words are
     associated with the documents they appear in.
    def search(query):
   5. def search(query): This line defines a new function called search, which takes
      a single argument, query. This function is responsible for searching the inverted
      index based on user queries.
        query_words = word_tokenize(query.lower())
        stemmed_query_words = [stemmer.stem(word) for word in query_words]
  6. The lines within the search function tokenize and stem the user's query:
         ○   query_words = word_tokenize(query.lower()): The query is
             tokenized into individual words using word_tokenize, and all the words are
             converted to lowercase to ensure consistent matching.
         ○   stemmed_query_words = [stemmer.stem(word) for word in
             query_words]: Each tokenized word in the query is stemmed using the
             stemmer.stem function. This ensures that the query words are in the same
             form as the words in the inverted index.
        matching_filenames_for_each_word = {word: inverted_index.get(word, []) for word in stemmed_query_words}
        return matching_filenames_for_each_word
  7. The code retrieves filenames associated with each query word from the inverted
     index. It creates a dictionary, matching_filenames_for_each_word, where
     each query word is the key, and the associated list of filenames is the value. This
     information will be used for search results.
  8. return matching_filenames_for_each_word: The function returns
     matching_filenames_for_each_word, which contains the search results
     indicating which documents contain the query words.
    while True:
        user_query = input("Enter a search query (or 'exit' to quit): ")
  9. This code initiates a user-friendly search interface where users can input search
     queries. It uses a while loop to repeatedly prompt the user for input.
  10. user_query = input("Enter a search query (or 'exit' to quit): "): This line reads
      the user's search query from the console. Users can type a search query or type
      'exit' to quit the search interface.
         if user_query == 'exit':
              break
  11. This conditional statement checks if the user entered 'exit' as the query. If they did,
      the while loop is exited, ending the search interface.
         results = search(user_query)
  12. results = search(user_query): The user's search query is passed to the
     search function, and the results are stored in the results variable.
        unique_filenames = set()
        for filenames in results.values():
            unique_filenames.update(filenames)
  13. This part of the code processes the search results:
  ● unique_filenames is initialized as an empty set to collect unique filenames that
      match the search query.
  ● The for loop iterates over the filenames associated with each query word from the
     results dictionary and updates the unique_filenames set with those filenames.
        for filename in unique_filenames:
            word_count = sum(word_counts_per_document[filename].get(word, 0) for word in results.keys())
            print(f"The word(s) appear in '{filename}' {word_count} time(s):")
        if not unique_filenames:
            print("No matching documents found for the query.")
   14. The code further processes the unique filenames:
       ●   It iterates over the unique filenames.
       ●   For each filename, it calculates the total word count for the query words found in that
           document. This is done by iterating over the query words and summing how many
           times each of them appears in the document.
       ●   It then prints the filename along with the word count for the query words found in that
           document.
   15. If there are no unique filenames (i.e., no matching documents for the query), it prints
       a message indicating that no matching documents were found for the query.
if __name__ == "__main__":
       main()
   16. Finally, this code checks if the script is being run as the main program (not imported
       as a module). If it is, it calls the main function to start the search interface.
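Under the same hypothetical two-document setup used earlier, an interactive session might look like this (the filenames and counts are purely illustrative):

Enter a search query (or 'exit' to quit): searching documents
The word(s) appear in 'doc1.txt' 2 time(s):
The word(s) appear in 'doc2.txt' 1 time(s):
Enter a search query (or 'exit' to quit): exit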
Data Flow Diagram
Block Diagram