Information Retrieval
Assignment 2
Session: 2020 – 2024
Submitted by:
Saqlain Nawaz 2020-CS-135
Supervised by:
Sir Khaldoon Syed Khurshid
Department of Computer Science
University of Engineering and Technology
Lahore, Pakistan
Libraries Used:
❖ os:
➢ Purpose: Provides functions for interacting with the operating system,
specifically used for file operations and directory traversal.
❖ string:
➢ Purpose: The string module provides a collection of string constants
for various character sets and types of characters, including ASCII
letters, digits, punctuation characters, and whitespace characters. It is
often used for text manipulation and character-related operations.
❖ math:
➢ Purpose: The math library provides mathematical functions and
constants used in various calculations in the code.
❖ nltk (Natural Language Toolkit):
➢ Purpose: NLTK is used for natural language processing tasks,
including tokenization, stemming, and part-of-speech tagging.
❖ collections.defaultdict:
➢ Purpose: The defaultdict class from the collections module is
used to create dictionaries with default values for keys. In this code, it
is used to create dictionaries for the inverted index and document
counts.
❖ nltk.corpus.stopwords:
➢ Purpose: NLTK's stopwords corpus provides a list of common English
stopwords. Stopwords are words that are often excluded from text
analysis due to their high frequency and low informativeness.
❖ nltk.stem.PorterStemmer:
➢ Purpose: The PorterStemmer class from NLTK implements the
Porter stemming algorithm. It reduces words to their root or base form,
helping standardize words.
❖ nltk.tokenize.word_tokenize:
➢ Purpose: NLTK's word_tokenize function tokenizes sentences into
individual words, breaking down text into its constituent units.
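The role of defaultdict described above can be shown with a minimal, self-contained sketch (the filenames and terms are made up for illustration): unseen keys get an empty posting list or a zero count automatically, so no key-existence checks are needed.

```python
from collections import defaultdict

# defaultdict(list) supplies an empty list for unseen keys, so a posting
# can be appended without first checking whether the term was seen before.
inverted_index = defaultdict(list)
doc_count = defaultdict(int)  # unseen keys start at 0

for filename, term in [("a.txt", "index"), ("b.txt", "index"), ("b.txt", "term")]:
    inverted_index[term].append(filename)
    doc_count[term] += 1

print(doc_count["index"])      # 2
print(inverted_index["term"])  # ['b.txt']
```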
Code Flow:
1. Preprocessing Function (preprocess):
○ This function takes a document as input and performs text
preprocessing steps.
○ It removes punctuation characters and unwanted special characters
defined in unwanted_chars.
○ Tokenizes the document into words using NLTK's word_tokenize.
○ Tags each word with its part of speech using NLTK's pos_tag.
○ Selects words that are nouns (NN, NNS, NNP, NNPS) or verbs (VB,
VBD, VBG, VBN, VBP) and not in the list of English stopwords.
○ Stems the selected words using the Porter stemmer and returns the
processed document.
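The preprocessing steps can be sketched as follows. To keep the sketch runnable without NLTK's downloadable data, the NLTK calls are replaced with standard-library stand-ins (a whitespace tokenizer, a small fixed stopword set, and a crude suffix-stripping stemmer, with POS filtering omitted), so the exact tokens differ from the real pipeline.

```python
import string

# Small stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "a", "is", "are", "of", "and", "to"}

def crude_stem(word):
    # Very rough stand-in for PorterStemmer: strip a few common suffixes.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    # 1. Remove punctuation (the real code also drops unwanted_chars).
    cleaned = document.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenize (word_tokenize in the real code).
    tokens = cleaned.lower().split()
    # 3. Drop stopwords (the real code also keeps only nouns/verbs via pos_tag).
    kept = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem each remaining token.
    return [crude_stem(t) for t in kept]

print(preprocess("The engines are indexing documents."))
# ['engin', 'index', 'document']
```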
2. Creating the Inverted Index (create_index):
○ This function builds an inverted index for the collection of text
documents in the specified directory.
○ It initializes two defaultdicts: one for the inverted index itself
(inverted_index) and another for counting the number of
documents each word appears in (doc_count).
○ It iterates through each text file in the directory.
○ For each file, it reads the content, preprocesses it using the
preprocess function, and counts the frequency of each word within
the document.
○ Entries are added to the inverted index: the stemmed word is the
key, and a (filename, frequency) pair is appended to that word's
posting list.
○ Document frequency (DF) is also updated in the doc_count
dictionary.
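The index-building steps above can be sketched as below. For a self-contained demo, the NLTK preprocessing is replaced by a simple lowercase/whitespace stand-in, and the two sample documents are hypothetical files written to a temporary directory.

```python
import os
import tempfile
from collections import defaultdict

def preprocess(text):
    # Stand-in for the full NLTK pipeline: lowercase + whitespace split.
    return text.lower().split()

def create_index(directory):
    inverted_index = defaultdict(list)  # term -> [(filename, tf), ...]
    doc_count = defaultdict(int)        # term -> number of docs containing it
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            words = preprocess(f.read())
        freqs = defaultdict(int)
        for w in words:
            freqs[w] += 1  # term frequency within this document
        for w, tf in freqs.items():
            inverted_index[w].append((filename, tf))
            doc_count[w] += 1  # each document counted once per term
    return inverted_index, doc_count

# Demo on a throwaway directory holding two tiny documents.
with tempfile.TemporaryDirectory() as d:
    for name, text in [("a.txt", "search engine index"),
                       ("b.txt", "index term index")]:
        with open(os.path.join(d, name), "w", encoding="utf-8") as f:
            f.write(text)
    idx, dc = create_index(d)

print(dc["index"])  # the term "index" occurs in 2 documents
```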
3. TF-IDF Scoring Function (tf_idf):
○ This function calculates the TF-IDF score for a given query and a
specific document.
○ It applies the TF-IDF formula, deriving the IDF component from the
document frequency (DF) of each query term.
○ The function calculates the TF-IDF score for each query term in the
document and aggregates them to get the final score.
○ It also keeps track of contributing words for each document in the
contributing_words dictionary.
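A sketch of this scoring step, assuming the common tf · log(N/df) weighting (the report does not state the exact IDF variant, so this is one plausible choice); the small hand-built index at the bottom is hypothetical data for illustration.

```python
import math
from collections import defaultdict

def tf_idf(query_terms, doc_name, inverted_index, doc_count, total_docs,
           contributing_words):
    score = 0.0
    for term in query_terms:
        df = doc_count.get(term, 0)
        if df == 0:
            continue  # term appears in no document
        # Term frequency of `term` in this document, read off the posting list.
        tf = next((f for fname, f in inverted_index[term] if fname == doc_name), 0)
        if tf == 0:
            continue  # term not in this document
        idf = math.log(total_docs / df)  # assumed IDF variant
        score += tf * idf
        contributing_words[doc_name].append(term)  # record why this doc matched
    return score

# Tiny hand-built index (hypothetical): 2 documents total.
idx = {"index": [("a.txt", 1), ("b.txt", 2)], "term": [("b.txt", 1)]}
dc = {"index": 2, "term": 1}
cw = defaultdict(list)
print(round(tf_idf(["term"], "b.txt", idx, dc, 2, cw), 3))  # 1 * ln(2) ≈ 0.693
```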
4. Search Function (search):
○ The search function takes a user's search query and directory path as
input.
○ It creates the inverted index and doc_count using the create_index
function.
○ For each document in the directory, it calculates the TF-IDF score for
the query and the document.
○ The scores are normalized using the Euclidean norm and sorted in
descending order.
○ The ranked documents and contributing words are returned.
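The normalize-and-sort step can be sketched in isolation. This assumes Euclidean normalization means dividing each raw score by the norm of the score vector, as the description suggests; the sample scores are made up.

```python
import math

def rank(scores):
    # scores: {filename: raw TF-IDF score for the query}
    norm = math.sqrt(sum(s * s for s in scores.values()))
    if norm == 0:
        return []  # no document matched the query
    normalized = {doc: s / norm for doc, s in scores.items()}
    # Highest-scoring documents first.
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)

print(rank({"a.txt": 3.0, "b.txt": 4.0}))
# [('b.txt', 0.8), ('a.txt', 0.6)]
```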
Main Execution:
○ The script enters a loop where the user can input search queries
interactively.
○ It calls the search function for each query and prints the ranked
documents along with their relevance scores and contributing words.
○ The loop continues until the user enters "exit."
Block Diagram:
A block diagram is a visual representation of the code's structure and key
components.
Data Flow Diagram (DFD):
A DFD illustrates how data moves through the code.