Information Retrieval
Practical No 1
Aim: Document Indexing andRetrieval
● Implement an inverted index construction algorithm.
● Build a simple document retrieval system using the
constructed index.
Theory :
● An Inverted Index is a data structure used in information retrieval
systems to efficiently retrieve documents or web pages containing
a specific term or set of terms.
● In an inverted index, the index is organised by terms (words), and
each term points to a list of documents or web pages that contain that
term.
● Inverted indexes are widely used in search engines, database systems,
and other applications where efficient text search is required.
● They are especially useful for large collections of documents, where
searching through all the documents would be prohibitively slow. An
inverted index is an index data structure storing a mapping from
content, such as words or numbers, to its locations in a document or a
set of documents.
Rules to create an inverted index -
1) The text of each document is first preprocessed by removing stop words
: Stop words are the most occurring and useless words in documents like
“I”, “the”, “we”, “is”, and “an”.
2) The text is tokenized, meaning that it is split into individual terms.
3) The terms are then added to the index, with each term pointing to
the documents in which it appears.
Practical No: 2
Aim: Retrieval Models
● Implement the Boolean retrieval model and process queries.
● Implement the vector space model with TF-IDF weighting and
cosine similarity.
Theory :
A)Boolean Retrieval Model -
● A Boolean model is a fundamental concept in Information
Retrieval (IR) that is used to represent and retrieve documents or
information based on Boolean logic.
● In this model, a document is typically represented as a set of terms
(words or phrases), and queries are also represented using
Boolean operators (AND, OR, NOT) to specify the desired
information.
Here's how the Boolean model works in IR:
1. Document Representation: Each document in the collection
is represented as a set of terms. These terms can be extracted
from the document's content and can be single words,
phrases, or other units of information.
2. Query Representation: Queries are also represented as sets of
terms, and Boolean operators (AND, OR, NOT) are used to
combine these terms to express the user's information needs. For
example, a query might be "cats AND dogs," meaning the user
wants documents that contain both "cats" and "dogs."
3.Boolean Operators:
●AND: "cats AND dogs,"
both "cats" and "dogs" will be retrieved.
●OR: "cats OR dogs,"
"cats" or "dogs" or both will be retrieved.
● NOT: "cats NOT
dogs" "cats" but not
"dogs."
B)TF-IDF
● Term Frequency - Inverse Document Frequency (TF-IDF) is
a widely used statistical method in information retrieval.
● It measures how important a term is within a
document relative to a collection of documents.
Term Frequency (TF): TF of a term or word is the number of times the
term appears in a document compared to the total number of words in
the document.
Inverse Document Frequency(IDF):
● IDF of a term reflects the proportion of documents in the
corpus that contain the term.
IDF = log( N / df ) where,
N= total no. of documents
df = no. of documents containing a term
● The TF-IDF of a term is calculated by
multiplying TF and IDF scores. TF-IDF =
TF*IDE
C)Cosine Similarity -
● Cosine similarity is a measure of similarity between two
non- zero vectors defined in an inner product space.
● Cosine similarity is the cosine of the angle
between the vectors. ● The cosine similarity always
belongs to the interval [−1,1].
● In cosine similarity, data objects in a dataset are treated
as a vector. ● The formula to find the cosine similarity between
two vectors is -
Here A . B is the product of the vector.
Practical No 3
Aim: Spelling Correction in IR Systems
● Develop a spelling correction module using edit distance algorithms.
● Integrate the spelling correction module into an information
retrieval system.
Theory:
Edit Distance :
● Edit distance is a measure of the similarity between two strings by
calculating the minimum number of single-character edits (insertions,
deletions, or substitutions) required to change one string into the other.
●The smaller the edit distance, the more similar the strings are.
Consider two strings str1 and str2 of length M and N respectively.
For finding edit distance there are performed below operations -
1. Operation 1 (INSERT): Insert any character before or after
any index value
2.Operation 2 (REMOVE): Remove a character
3. Operation 3 (Replace): Replace a character at any index value
with some other character
Practical No: 4
Aim: Evaluation Metrics for IR Systems
A)Calculate precision, recall, and F-measure for a given set of retrieval
results.
B)Use an evaluation toolkit to measure average precision and
other evaluation metrics.
Theory:
1.Precision:
● Precision is the ratio of correctly predicted positive observations
to the total predicted positives.
● It is also called Positive Predictive Value (PPV).
●Precision is calculated using the following formula:
Precision = TP / TP+FP
Where:
• TP (True Positives) is the number of instances
correctly predicted as positive. • FP (False Positives) is the
number of instances incorrectly predicted as positive.
High precision indicates that the model has a low rate of false positives. In
other words, when the model predicts a positive result, it is likely to be
correct.
2.Recall:
• Recall is the ratio of correctly predicted positive observations to
all observations in actual class.
• Recall is calculated using the following
formula: Recall= TP /TP+FN
Where:
• TP (True Positives) is the number of instances
correctly predicted as positive. • FN (False Negatives) is the
number of instances incorrectly predicted as negative.
High recall indicates that the model has a low rate of false negatives. In
other words, the model is effective at capturing all the positive instances.
3.F-measure:
• The F-measure is a metric commonly used in performance evaluation.
• It combines precision and recall into a single value, providing
a balanced measure of a model's performance.
• The formula for F-measure is:
• The F-measure ranges from 0 to 1, where 1 indicates perfect
precision and recall.
4.Average Precision:
Average Precision is used to find the Average of the model precision based
on relevancy of result. • Algorithm:
In order to find Average Precision:
1)Take 2 variables X and Y as 0
2)We will then go through the prediction from left to right:
3) In case the prediction is 0, we will only increment Y by 1 and
not find prediction score
4)In case the prediction is 1, we will increment both X and Y by 1
5) After incrementing, we use the formula X/Y to get the current
position prediction score.
6) Lastly we will find summation of all prediction scores and divide
them by total number of positive predictions.
Practical No 5
Aim: Text Categorization
A)Implement a text classification algorithm (e.g.,
Naive Bayes or Support Vector Machines).
B)Train the classifier on a labelled dataset
and evaluate its performance.
Theory:
Naive Bayes
• The Naïve Bayes algorithm is a supervised learning
algorithm, which is based on Bayes theorem and used
for solving classification problems.
• It is mainly used in text classification that
includes a high- dimensional training
dataset.
• Naive Bayes Classifier is one of the simple and
most effective Classification algorithms which
helps in building the fast machine learning models
that can make quick predictions.
• It is a probabilistic classifier, which means it predicts
on the basis of the probability of an object.
• Some popular examples of Naive Bayes Algorithm are
spam filtration, Sentimental analysis, and
classifying articles.
Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or
Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
• The formula for
Bayes' theorem is
given as: P(B|A) *
P(A)
P(A|B) =
P(B)