Assignment 3 BIM IR

This document describes an information retrieval assignment that involves creating an inverted index and binary term-document matrix from a collection of text documents. It discusses the libraries used, preprocessing steps like tokenization and stemming, representing a query, scoring and ranking documents, and retrieving the top results. The code flow details creating the index, handling queries, scoring documents based on the query and matrix, ranking results, and returning the top matches to the user interactively.

Uploaded by

Pac SaQii


Information Retrieval

Assignment 3

Session: 2020 – 2024

Submitted by:
Saqlain Nawaz 2020-CS-135

Supervised by:
Sir Khaldoon Syed Khurshid

Department of Computer Science

University of Engineering and Technology
Lahore, Pakistan

Libraries Used:

1. os:
○ Purpose: Provides functions for interacting with the operating system,
particularly used for file operations and directory traversal.
2. nltk:
○ Purpose: The Natural Language Toolkit (NLTK) library is used for
natural language processing tasks such as tokenization, stemming,
and part-of-speech tagging.
3. nltk.corpus.stopwords:
○ Purpose: NLTK's stopwords corpus provides a list of common English
stopwords, which are words typically excluded from text analysis due to
their high frequency and low informativeness.
4. nltk.stem.PorterStemmer:
○ Purpose: The PorterStemmer class from NLTK implements the Porter
stemming algorithm, which reduces words to their root or base form,
standardizing words for analysis.
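
As a rough illustration of the file-handling role that os plays here, the sketch below lists a directory and reads each text file, skipping any file that cannot be decoded. The function name read_text_files, the ".txt" filter, and the UTF-8 encoding are assumptions made for this example and are not taken from the assignment's code.

```python
import os


def read_text_files(directory):
    """Return {filename: text} for each .txt file that decodes cleanly.

    Hypothetical helper: the ".txt" filter and UTF-8 encoding are assumptions.
    """
    documents = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".txt"):
            continue  # ignore anything that is not a plain-text file
        path = os.path.join(directory, filename)
        try:
            with open(path, "r", encoding="utf-8") as handle:
                documents[filename] = handle.read()
        except UnicodeDecodeError:
            # Skip files that cannot be decoded instead of aborting the run.
            print(f"Skipping {filename}: could not decode as UTF-8")
    return documents
```

Skipping an undecodable file keeps one bad document from stopping the whole indexing run.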

Code Flow:

Preprocessing and Creating the Inverted Index:

● The code begins by importing the necessary libraries and initializing NLTK's
PorterStemmer and English stopwords.
● The create_index function is defined to create an inverted index and a
binary term-document matrix for a collection of text documents in a specified
directory.
● It iterates through each text file in the directory, reads the content, and
tokenizes it into sentences.
● For each sentence, it tokenizes it into words and tags their parts of speech
using NLTK's pos_tag.
● The code identifies words that are nouns (NN, NNS, NNP, NNPS) and not in
the list of English stopwords. These words are stemmed using the Porter
stemmer.
● Entries are added to the inverted index, where the stemmed word is the key,
and a list of filenames where the word appears is the value.
● The binary term-document matrix is also created, where each term is
associated with documents in which it appears with a binary weight of 1.
● UnicodeDecodeError exceptions are handled for files that cannot be decoded.

Representing a Query and Scoring Documents:

● The represent_query function tokenizes and stems a user's search query
and represents it as a query vector.
● The score_documents function calculates document scores based on the
query vector and the binary term-document matrix.
● Document scores are normalized by dividing them by the number of terms in
each document.
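
A minimal sketch of these two steps is shown below, under stated assumptions: the query is split on whitespace rather than with NLTK's tokenizer, the stemmer is passed in as a callable, and "the number of terms in each document" is interpreted as the number of distinct indexed terms in that document. The function names follow the text, but the signatures and bodies are illustrative only.

```python
def represent_query(query, stem, stop_words=frozenset()):
    """Turn a query string into a binary query vector {stemmed_term: 1}."""
    tokens = query.lower().split()  # whitespace split is a simplification
    return {stem(tok): 1 for tok in tokens if tok not in stop_words}


def score_documents(query_vector, binary_matrix):
    """Score documents by matched query terms, normalized by document size.

    binary_matrix[term][doc] == 1 when the term occurs in the document.
    """
    # Count the distinct indexed terms per document (assumed normalizer).
    doc_lengths = {}
    for postings in binary_matrix.values():
        for doc in postings:
            doc_lengths[doc] = doc_lengths.get(doc, 0) + 1

    # Dot product of the binary query vector with each document column.
    scores = {}
    for term, q_weight in query_vector.items():
        for doc, d_weight in binary_matrix.get(term, {}).items():
            scores[doc] = scores.get(doc, 0) + q_weight * d_weight

    return {doc: total / doc_lengths[doc] for doc, total in scores.items()}
```

With this normalization, a short document that matches every query term outranks a long document that matches only one. Documents that share no term with the query receive no score.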

Ranking and Retrieving Documents:

● The rank_documents function sorts documents by their scores in
descending order.
● The retrieve_top_k_documents function retrieves the top-K documents
from the ranked list.
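
Ranking and top-K retrieval can be sketched in a few lines. The function names come from the text; the signatures, the (document, score) tuple format, and the alphabetical tie-break are assumptions for this example.

```python
def rank_documents(scores):
    """Return (document, score) pairs sorted by score, highest first.

    Ties are broken alphabetically by document name (an assumption here,
    added so that the ordering is deterministic).
    """
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def retrieve_top_k_documents(ranked, k):
    """Return the first k entries of an already ranked list."""
    return ranked[:k]
```

For example, retrieve_top_k_documents(rank_documents(scores), 2) would give the two best-scoring documents, matching the top-2 retrieval used in the main loop.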

Main Execution:

● The script obtains the directory path of the code file and creates the inverted
index and binary term-document matrix using the create_index function.
● It enters a loop where the user can input search queries interactively.
● For each query, it represents the query, scores documents, ranks them,
retrieves the top 2 documents, and presents the results.
● The loop continues until the user enters "exit."
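
The interactive loop might look roughly like the sketch below. The name run_search_loop, the injected read_line and write_line callables (which default to input and print), the prompt text, and the case-insensitive check for "exit" are all assumptions made so the example can be exercised without a terminal. The search argument stands in for the represent, score, and rank pipeline described above.

```python
def run_search_loop(search, read_line=input, write_line=print, k=2):
    """Prompt for queries until the user types "exit".

    search(query) is expected to return a ranked list of (document, score)
    pairs; only the top k are shown.
    """
    # In a script, the collection directory could be resolved with
    # os.path.dirname(os.path.abspath(__file__)) before building the index.
    while True:
        query = read_line("Enter a query (or 'exit' to quit): ").strip()
        if query.lower() == "exit":
            break
        results = search(query)[:k]
        if not results:
            write_line("No matching documents.")
            continue
        for rank, (doc, score) in enumerate(results, start=1):
            write_line(f"{rank}. {doc} (score {score:.3f})")
```

Passing the input and output functions as parameters is only a convenience for testing; calling run_search_loop(search) uses the console directly.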

Block Diagram and Data Flow Diagram (DFD):
