IR Practical Theory

Ir theory

Uploaded by

shubham8795298332


Information Retrieval

Practical No 1
Aim: Document Indexing and Retrieval
● Implement an inverted index construction algorithm.
● Build a simple document retrieval system using the
constructed index.

Theory :
● An Inverted Index is a data structure used in information retrieval
systems to efficiently retrieve documents or web pages containing
a specific term or set of terms.
● In an inverted index, the index is organised by terms (words), and
each term points to a list of documents or web pages that contain that
term.
● Inverted indexes are widely used in search engines, database systems,
and other applications where efficient text search is required.
● They are especially useful for large collections of documents, where
searching through all the documents would be prohibitively slow.
● More generally, an inverted index is an index data structure storing a
mapping from content, such as words or numbers, to its locations in a
document or a set of documents.

Steps to create an inverted index -

1) The text of each document is first preprocessed by removing stop words.
Stop words are very frequent words that carry little meaning on their own,
such as “I”, “the”, “we”, “is”, and “an”.
2) The text is tokenized, meaning that it is split into individual terms.
3) The terms are then added to the index, with each term pointing to
the documents in which it appears.
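
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the stop-word list, the sample documents, and the function names (`tokenize`, `build_inverted_index`, `retrieve`) are assumptions made for the example. Tokenization and stop-word removal are done in a single pass, and a multi-term query is treated as "all terms must appear".

```python
from collections import defaultdict

# Illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"i", "the", "we", "is", "an", "a", "and", "of", "in", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        term = raw.strip(".,;:!?\"'()")
        if term and term not in STOP_WORDS:
            tokens.append(term)
    return tokens


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def retrieve(index, query):
    """Return IDs of documents that contain every non-stop-word query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample collection: document ID -> text.
documents = {
    1: "The cat sat on the mat.",
    2: "The dog chased the cat.",
    3: "We saw a dog in the park.",
}
index = build_inverted_index(documents)
print(index["cat"])                 # [1, 2]
print(retrieve(index, "cat dog"))   # [2]
```

Storing posting lists as sorted lists of document IDs keeps the index easy to inspect; the lookup converts them to sets only for the intersection.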
Practical No: 2
Aim: Retrieval Models
● Implement the Boolean retrieval model and process queries.
● Implement the vector space model with TF-IDF weighting and
cosine similarity.

Theory :
A) Boolean Retrieval Model -
● A Boolean model is a fundamental concept in Information
Retrieval (IR) that is used to represent and retrieve documents or
information based on Boolean logic.
● In this model, a document is typically represented as a set of terms
(words or phrases), and queries combine terms using Boolean
operators (AND, OR, NOT) to specify the desired information.

Here's how the Boolean model works in IR:

1. Document Representation: Each document in the collection
is represented as a set of terms. These terms can be extracted
from the document's content and can be single words,
phrases, or other units of information.

2. Query Representation: Queries are also represented as sets of
terms, and Boolean operators (AND, OR, NOT) are used to
combine these terms to express the user's information needs. For
example, a query might be "cats AND dogs," meaning the user
wants documents that contain both "cats" and "dogs."

3. Boolean Operators:
● AND: for "cats AND dogs," documents containing
both "cats" and "dogs" will be retrieved.
● OR: for "cats OR dogs," documents containing
"cats" or "dogs" or both will be retrieved.
● NOT: for "cats NOT dogs," documents containing
"cats" but not "dogs" will be retrieved.
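
A small Python sketch may make the three operators concrete. It assumes each term maps to a set of document IDs, so that AND, OR, and NOT become set intersection, union, and difference. The sample documents and the helper names (`postings`, `query_and`, `query_or`, `query_not`) are hypothetical.

```python
# Hypothetical toy collection: document ID -> text.
documents = {
    1: "cats and dogs are popular pets",
    2: "cats sleep most of the day",
    3: "dogs enjoy long walks",
    4: "birds sing in the morning",
}

# Build a term -> set of document IDs mapping.
index = {}
for doc_id, text in documents.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)


def postings(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())


def query_and(a, b):
    """Documents containing both terms."""
    return postings(a) & postings(b)


def query_or(a, b):
    """Documents containing at least one of the terms."""
    return postings(a) | postings(b)


def query_not(a, b):
    """Documents containing the first term but not the second."""
    return postings(a) - postings(b)


print(query_and("cats", "dogs"))  # {1}
print(query_or("cats", "dogs"))   # {1, 2, 3}
print(query_not("cats", "dogs"))  # {2}
```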
B) TF-IDF
● Term Frequency - Inverse Document Frequency (TF-IDF) is
a widely used statistical method in information retrieval.
● It measures how important a term is within a
document relative to a collection of documents.

Term Frequency (TF): TF of a term or word is the number of times the
term appears in a document compared to the total number of words in
the document.

Inverse Document Frequency (IDF):

● IDF of a term reflects how rare the term is across the corpus:
the fewer documents contain the term, the higher its IDF.
IDF = log( N / df ) where,
N = total no. of documents
df = no. of documents containing the term

● The TF-IDF of a term is calculated by multiplying its TF and IDF
scores.
TF-IDF = TF * IDF
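
The formulas above can be sketched in Python. This is an illustrative example only: the three-document corpus and the function names are assumptions, the natural logarithm is used because the notes do not fix a base, and returning an IDF of 0 for a term that appears in no document is a choice made here to avoid division by zero.

```python
import math


def term_frequency(term, tokens):
    """TF: occurrences of the term divided by the total words in the document."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, corpus):
    """IDF = log(N / df), using the natural logarithm (an assumption)."""
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(len(corpus) / df)


def tf_idf(term, tokens, corpus):
    """TF-IDF = TF * IDF."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# Hypothetical corpus: each document is already tokenized.
corpus = [
    "cats chase mice".split(),
    "dogs chase cats".split(),
    "dogs bark loudly".split(),
]

# "mice" occurs in 1 of 3 documents, "chase" in 2 of 3,
# so "mice" receives the higher weight in the first document.
print(tf_idf("mice", corpus[0], corpus))
print(tf_idf("chase", corpus[0], corpus))
```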

C) Cosine Similarity -

● Cosine similarity is a measure of similarity between two
non-zero vectors defined in an inner product space.
● Cosine similarity is the cosine of the angle between the vectors.
● The cosine similarity always belongs to the interval [−1,1]. For
TF-IDF vectors, whose weights are non-negative, it lies in [0,1].
● In cosine similarity, data objects in a dataset are treated
as vectors.
● The formula to find the cosine similarity between two vectors is -
cos(θ) = (A . B) / (||A|| * ||B||)
Here A . B is the dot product of the vectors, and ||A|| and ||B|| are
their magnitudes (Euclidean norms).
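
As a sketch, the cosine similarity of two equal-length vectors can be computed in Python as below. The function name `cosine_similarity` and the sample vectors are illustrative; raising an error for zero vectors is a choice made here, since the measure is only defined for non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction score close to 1,
# orthogonal vectors score 0, and opposite vectors score -1.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
print(cosine_similarity([1, 0], [-1, 0]))
```

In the vector space model, `a` would typically be the TF-IDF vector of the query and `b` the TF-IDF vector of a document, with documents ranked by decreasing similarity.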
Practical No 3

Aim: Spelling Correction in IR Systems

● Develop a spelling correction module using edit distance algorithms.
● Integrate the spelling correction module into an information
retrieval system.

Theory:

Edit Distance:

● Edit distance is a measure of the similarity between two strings by
calculating the minimum number of single-character edits (insertions,
deletions, or substitutions) required to change one string into the other.
● The smaller the edit distance, the more similar the strings are.

Consider two strings str1 and str2 of length M and N respectively.

The following operations are allowed when computing the edit distance -

1. Operation 1 (INSERT): Insert any character before or after
any index value
2. Operation 2 (REMOVE): Remove a character
3. Operation 3 (REPLACE): Replace a character at any index value
with some other character
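
One common way to compute this distance is dynamic programming over prefixes of the two strings (often called Levenshtein distance). The sketch below assumes every operation costs 1. The `correct` helper and its vocabulary are hypothetical additions showing how the distance could drive a simple spelling corrector.

```python
def edit_distance(str1, str2):
    """Minimum number of inserts, removes, and replaces to turn str1 into str2."""
    m, n = len(str1), len(str2)
    # dp[i][j] = edit distance between str1[:i] and str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # remove all i characters
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # characters match, no edit
            else:
                dp[i][j] = 1 + min(
                    dp[i][j - 1],      # insert
                    dp[i - 1][j],      # remove
                    dp[i - 1][j - 1],  # replace
                )
    return dp[m][n]


def correct(word, vocabulary):
    """Return the vocabulary term closest to the word (hypothetical helper)."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))


print(edit_distance("kitten", "sitting"))  # 3
print(correct("retreival", ["retrieval", "relevance", "ranking"]))  # retrieval
```

The table has (M + 1) x (N + 1) cells, so time and memory both grow with M * N. In a retrieval system, `correct` could be applied to each query term that is missing from the index vocabulary before the lookup.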
Practical No: 4

Aim: Evaluation Metrics for IR Systems

A)Calculate precision, recall, and F-measure for a given set of retrieval
results.
B)Use an evaluation toolkit to measure average precision and
other evaluation metrics.

Theory:
1. Precision:
● Precision is the ratio of correctly predicted positive observations
to the total predicted positives.
● It is also called Positive Predictive Value (PPV).
● Precision is calculated using the following formula:
Precision = TP / (TP + FP)
Where:
• TP (True Positives) is the number of instances
correctly predicted as positive.
• FP (False Positives) is the
number of instances incorrectly predicted as positive.
High precision indicates that the model has a low rate of false positives. In
other words, when the model predicts a positive result, it is likely to be
correct.

2. Recall:
• Recall is the ratio of correctly predicted positive observations to
all observations in the actual positive class.
• Recall is calculated using the following formula:
Recall = TP / (TP + FN)
Where:
• TP (True Positives) is the number of instances
correctly predicted as positive.
• FN (False Negatives) is the
number of instances incorrectly predicted as negative.
High recall indicates that the model has a low rate of false negatives. In
other words, the model is effective at capturing all the positive instances.

3. F-measure:
• The F-measure is a metric commonly used in performance evaluation.
• It combines precision and recall into a single value, providing
a balanced measure of a model's performance.
• The formula for the (balanced) F-measure, the harmonic mean of
precision and recall, is:
F-measure = 2 * (Precision * Recall) / (Precision + Recall)
• The F-measure ranges from 0 to 1, where 1 indicates perfect
precision and recall.
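
The three metrics can be sketched in Python for a retrieval setting, where "predicted positive" means "retrieved" and "actually positive" means "relevant". The function name, the document IDs, and the choice to return 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one set of retrieval results."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant documents that were retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: 4 documents retrieved, 6 documents actually relevant.
retrieved = [1, 2, 3, 4]
relevant = [2, 4, 5, 6, 7, 8]
precision, recall, f_measure = precision_recall_f(retrieved, relevant)
print(precision, recall, f_measure)
```

Here TP = 2, FP = 2, and FN = 4, so precision is 2/4, recall is 2/6, and the F-measure works out to 0.4.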

4. Average Precision:
Average Precision averages the precision values obtained at each position
in the ranked result list where a relevant result appears.
• Algorithm:

In order to find Average Precision:

1) Take 2 variables X and Y, both initialised to 0
2) Go through the predictions from left to right (1 = relevant,
0 = not relevant)
3) In case the prediction is 0, increment only Y by 1 and
do not compute a prediction score
4) In case the prediction is 1, increment both X and Y by 1
5) After incrementing, use the formula X/Y to get the prediction
score for the current position
6) Lastly, sum all the prediction scores and divide
the sum by the total number of positive predictions.
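
The algorithm above translates almost line for line into Python. In this sketch the variable names `x` and `y` mirror X and Y from the steps; the function name and the sample prediction list are illustrative, and returning 0.0 when there are no positive predictions is an assumption.

```python
def average_precision(predictions):
    """Average precision for a ranked list of 0/1 relevance predictions."""
    x = 0  # number of positive (relevant) predictions seen so far
    y = 0  # number of positions visited so far
    scores = []
    for prediction in predictions:
        y += 1
        if prediction == 1:
            x += 1
            scores.append(x / y)  # precision at this position
    if x == 0:
        return 0.0  # assumption: no positives means a score of 0
    return sum(scores) / x


# Hypothetical ranked results: positions 1, 3, and 4 are relevant.
# The prediction scores are 1/1, 2/3, and 3/4.
print(average_precision([1, 0, 1, 1, 0]))
```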
Practical No 5
Aim: Text Categorization
A) Implement a text classification algorithm (e.g.,
Naive Bayes or Support Vector Machines).
B) Train the classifier on a labelled dataset
and evaluate its performance.
Theory:
Naive Bayes
• The Naive Bayes algorithm is a supervised learning
algorithm, based on Bayes' theorem, that is used
for solving classification problems.
• It is mainly used in text classification, which
involves high-dimensional training datasets.
• The Naive Bayes classifier is a simple yet effective
classification algorithm that helps in building
fast machine learning models that can make
quick predictions.
• It is a probabilistic classifier, which means it predicts
on the basis of the probability that an object belongs
to a class.
• Some popular applications of the Naive Bayes algorithm are
spam filtering, sentiment analysis, and
classifying articles.

Bayes' Theorem:

• Bayes' theorem is also known as Bayes' Rule or
Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
• The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
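
To show how Bayes' theorem drives text classification, here is a minimal multinomial Naive Bayes sketch in Python. It is illustrative only: the class name, the tiny training set, and the "spam"/"ham" labels are assumptions, Laplace (add-one) smoothing is added so unseen word and class combinations do not give zero probability, and log probabilities are used to avoid numeric underflow. P(B) is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)     # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        words = [w for w in text.lower().split() if w in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log P(class): the prior from class frequencies
            score = math.log(count / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # log P(word | class) with Laplace (add-one) smoothing
                likelihood = (self.word_counts[label][word] + 1) / (
                    total_words + len(self.vocabulary)
                )
                score += math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled training data.
texts = [
    "win free prize now",
    "free money offer",
    "meeting schedule for monday",
    "project report due monday",
]
labels = ["spam", "spam", "ham", "ham"]

classifier = NaiveBayesTextClassifier().fit(texts, labels)
print(classifier.predict("free prize offer"))       # spam
print(classifier.predict("monday meeting report"))  # ham
```

To evaluate performance as the aim requires, the classifier would be trained on one part of a labelled dataset and its predictions on a held-out part compared with the true labels, for example using precision, recall, and F-measure.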
