Yash Pahlani D17B 49
Aim: To implement an Information Retrieval modeling technique
Theory:
Information Retrieval (IR) modeling techniques are essential for efficient and
accurate extraction of relevant information from vast document repositories. By
analyzing and structuring data, these techniques facilitate the ranking and presentation
of documents in alignment with a user's query. The choice of IR technique depends
on factors such as query complexity, document collection size, and desired
precision-recall trade-offs, highlighting the diverse strategies available to optimize
information retrieval processes.
Here are some commonly used IR modeling techniques:
Boolean Model: The Boolean Model is a fundamental and straightforward
approach to information retrieval. It treats documents and queries as sets of terms
(words), and it uses Boolean operators (AND, OR, NOT) to combine these sets. In
this model, a document is either considered relevant (1) or not relevant (0) to a
query. The Boolean Model provides a way to express complex queries using
logical operators.
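As a minimal sketch of this binary view (the two documents below, represented as term sets, are invented for illustration), a query such as "retrieval AND NOT database" assigns each document a relevance of 1 or 0:

# Hypothetical documents represented as sets of terms (illustrative only)
docs = {
    'd1': {'information', 'retrieval', 'boolean'},
    'd2': {'database', 'retrieval'},
}

# Evaluate the query: retrieval AND NOT database
for name, terms in docs.items():
    relevant = int('retrieval' in terms and 'database' not in terms)
    print(name, '->', relevant)  # 1 = relevant, 0 = not relevant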
Vector Space Model (VSM): The Vector Space Model represents documents and
queries as vectors in a high-dimensional space, where each dimension corresponds
to a term. Terms are typically weighted using techniques like TF-IDF to reflect
their importance in the document. The relevance between a query vector and a
document vector is often computed using the cosine similarity.
Probabilistic Models: Probabilistic models approach information retrieval from a
statistical perspective, estimating the probability that a document is relevant to a
given query. These models aim to find a balance between precision and recall by
ranking documents based on their likelihood of relevance.
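Only the Boolean and vector space models are implemented later in this write-up; as a minimal sketch of the probabilistic idea, the widely used BM25 ranking function can be written directly. The toy corpus and the parameter values k1 = 1.5 and b = 0.75 below are illustrative assumptions, not part of the experiment:

import math

# Toy corpus and query (invented for illustration)
docs = ["information retrieval system",
        "database system design",
        "retrieval of information"]
query = ["information", "retrieval"]

tokenized = [d.split() for d in docs]
N = len(tokenized)
avg_dl = sum(len(d) for d in tokenized) / N
k1, b = 1.5, 0.75  # assumed, commonly used parameter values

def bm25(q_terms, doc):
    # BM25: smoothed IDF times a saturated, length-normalized term frequency
    score = 0.0
    for term in q_terms:
        df = sum(1 for d in tokenized if term in d)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_dl))
    return score

# Rank documents by their estimated likelihood of relevance to the query
for j in sorted(range(N), key=lambda j: -bm25(query, tokenized[j])):
    print(f"{bm25(query, tokenized[j]):.4f}  {docs[j]}")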
Vector Space Model (VSM):
The Vector Space Model (VSM) is a fundamental technique in Information
Retrieval (IR) that transforms textual data into a geometric framework. In this
model, documents and queries are represented as vectors in a high-dimensional
space, with each dimension corresponding to a unique term. By measuring the
similarity between these vectors, the VSM assesses the relevance of documents to
user queries. Originally proposed by Gerard Salton in the 1960s, the VSM has
since become a cornerstone of modern IR systems.
Working of Vector Space Model:
The VSM transforms documents and queries into numerical vectors within a
high-dimensional space. Each dimension corresponds to a unique term in the
vocabulary. The key steps in the VSM's working are described below; a small worked sketch in code follows the steps:
Term Frequency (TF) Calculation: For each document and query, the frequency of
each term is computed. This forms the term frequency vector.
The term frequency for the i-th term and the j-th document is computed as:

\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}

where n_{i,j} is the number of occurrences of term i in document j.
Inverse Document Frequency (IDF) Calculation: The inverse document frequency
of each term is determined, representing its importance in the entire document
collection.
The inverse document frequency takes into consideration the i-th term and all N documents in the collection:

\mathrm{idf}_{i} = \log \frac{N}{\lvert \{\, j : t_{i} \in d_{j} \,\} \rvert}

where the denominator is the number of documents that contain term i.
Term Weighting (TF-IDF): The product of term frequency and inverse document frequency gives the TF-IDF score, which reflects the significance of each term within a document or query:

\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}
Vector Creation: Each document and query is represented as a vector, where each
dimension corresponds to a term and the value is its corresponding TF-IDF score.
Cosine Similarity: The relevance between documents and queries is assessed using
the cosine similarity between their respective vectors. Documents with higher
cosine similarities are considered more relevant.
Cosine similarity between a document vector d and a query vector q is computed as:

\cos(\vec{d}, \vec{q}) = \frac{\vec{d} \cdot \vec{q}}{\lVert \vec{d} \rVert \, \lVert \vec{q} \rVert}
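As a worked sketch of the steps above, the following code computes TF, IDF, TF-IDF vectors, and cosine similarity directly; the three tiny documents and the query are invented for illustration (the experiment's own corpus appears in the Code section below):

import math

# Tiny illustrative corpus and query (invented for this sketch)
docs = [
    ["information", "retrieval", "system"],
    ["database", "system", "design"],
    ["retrieval", "of", "information"],
]
query = ["information", "retrieval"]

vocab = sorted({t for d in docs for t in d} | set(query))
N = len(docs)

def tf(term, tokens):
    # tf_{i,j} = n_{i,j} / sum_k n_{k,j}
    return tokens.count(term) / len(tokens)

def idf(term):
    # idf_i = log(N / df_i), where df_i = number of documents containing the term
    df = sum(1 for d in docs if term in d)
    return math.log(N / df) if df else 0.0

def tfidf_vector(tokens):
    # one dimension per vocabulary term, weighted by tf * idf
    return [tf(t, tokens) * idf(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

q_vec = tfidf_vector(query)
for j, d in enumerate(docs):
    print(f"doc {j}: cosine similarity = {cosine(tfidf_vector(d), q_vec):.4f}")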
Advantages of Vector Space Model
Partial Matching: The VSM accommodates partial keyword matches, allowing
relevant documents to be retrieved even if they share only a subset of terms with
the query.
Term Importance: TF-IDF captures term importance, emphasizing rare and
distinctive terms over common ones.
Ranking: Cosine similarity provides a natural ranking mechanism, presenting the
most relevant documents first.
Disadvantages of Vector Space Model
Semantic Gap: The VSM lacks understanding of word semantics, leading to
challenges in capturing context and meaning.
High-Dimensional Space: As the vocabulary grows, the dimensionality of the
space increases, which can lead to computational complexities.
Query Sparsity: Short queries or those with few relevant terms may result in
imprecise retrieval.
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')
nltk.download('stopwords')

# Sample corpus
corpus = [
    'In computer science artificial intelligence sometimes called machine intelligence is intelligence demonstrated by machines',
    'Experimentation calculation and Observation is called science',
    'Physics is a natural science that involves the study of matter and its motion through space and time, along with related concepts such as energy and force',
    'In mathematics and computer science an algorithm is a finite sequence of well-defined computer-implementable instructions',
    'Chemistry is the scientific discipline involved with elements and compounds composed of atoms, molecules and ions',
    'Biochemistry is the branch of science that explores the chemical processes within and related to living organisms',
    'Sociology is the study of society, patterns of social relationships, social interaction, and culture that surrounds everyday life',
]

# Preprocess the corpus: lowercase, tokenize, and keep alphanumeric tokens
cleaned_corpus = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())
    tokens = [word for word in tokens if word.isalnum()]
    cleaned_corpus.append(' '.join(tokens))

# Create a TF-IDF vectorizer and build the document vectors
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(cleaned_corpus)

# Query: lowercase, tokenize, and drop English stopwords
query = 'computer science'
query_tokens = word_tokenize(query.lower())
query = ' '.join([word for word in query_tokens
                  if word not in stopwords.words('english')])

# Transform the query into a vector using the same vectorizer
query_vector = vectorizer.transform([query])

# Calculate cosine similarity between every document and the query
cosine_similarities = cosine_similarity(doc_vectors, query_vector).flatten()

# Get document indices ranked by decreasing similarity
related_docs_indices = cosine_similarities.argsort()[::-1]

# Print related documents with their cosine similarity values
stop_words = set(stopwords.words('english'))
for i in related_docs_indices:
    tokens = word_tokenize(cleaned_corpus[i])
    filtered_tokens = [word for word in tokens if word not in stop_words]
    data = ' '.join(filtered_tokens)
    print(f"Similarity: {cosine_similarities[i]:.4f} - {data}")
Output:
(Screenshot of the ranked documents with their cosine similarity scores.)
Boolean Information Retrieval Model
The Boolean Information Retrieval Model is a fundamental approach in the field of
information retrieval, which involves searching for and retrieving relevant documents
from a collection based on user-defined queries. It operates on the principles of Boolean
logic, which was developed by George Boole in the mid-19th century. The Boolean
model is particularly effective for precise, structured searches, making it well-suited for
certain types of information retrieval tasks.
Working of Boolean Information Retrieval Model
The Boolean Information Retrieval Model operates based on a set of principles derived
from Boolean logic. The key components that define how the model works include
Boolean operators, queries, and documents. Here's a step-by-step explanation of how the
Boolean Information Retrieval Model works:
Document Indexing: Before searching can begin, a collection of documents is indexed.
This involves parsing each document to extract individual terms (words), removing
stopwords (common words like "and," "the," "is"), and creating an index that maps each
term to the documents where it appears.
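As a small sketch of this indexing step, an inverted index can be built with a plain dictionary; the two documents below reuse samples from the Code section further down, and stopword removal is omitted here for brevity:

import string

# Two documents from the sample corpus used later (1-based IDs)
documents = {
    1: "Taj Mahal is a beautiful monument",
    2: "Victoria Memorial is also a monument",
}

# Build the inverted index: term -> set of document IDs containing it
inverted_index = {}
for doc_id, text in documents.items():
    for term in text.lower().split():
        term = term.strip(string.punctuation)
        inverted_index.setdefault(term, set()).add(doc_id)

print(inverted_index)
# e.g. 'monument' maps to {1, 2}; 'taj' maps to {1}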
Query Formulation: Users create queries by combining terms and Boolean operators
(AND, OR, NOT). The terms are the keywords or phrases that users want to search for
within the document collection. The Boolean operators define how the terms are related
and help narrow down or broaden the search.
Boolean Operators:
AND: When users use the AND operator, they are specifying that documents must
contain all the terms connected by AND. This narrows down the search to documents that
satisfy all the conditions.
OR: The OR operator retrieves documents that contain at least one of the terms connected
by OR. It broadens the search by including documents that meet any of the specified
conditions.
NOT: The NOT operator excludes documents that contain the term following it. It refines
the search results by excluding unwanted documents.
Retrieval of Documents: The indexing structure allows the system to efficiently retrieve documents that match the terms and Boolean operators specified in the query: AND maps to the intersection of posting lists, OR to their union, and NOT to set difference, as the sketch below shows.
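Concretely, when posting lists are stored as sets of document IDs, each operator maps directly onto a Python set operation. The posting lists below are illustrative values consistent with the three sample documents in the Code section:

# Illustrative posting lists for the three sample documents (IDs 1-3)
all_docs = {1, 2, 3}
postings = {'taj': {1}, 'monument': {1, 2}, 'agra': {3}}

# taj AND monument -> intersection of posting lists
print(postings['taj'] & postings['monument'])   # {1}

# monument OR agra -> union of posting lists
print(postings['monument'] | postings['agra'])  # {1, 2, 3}

# NOT monument -> complement with respect to the whole collection
print(all_docs - postings['monument'])          # {3}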
Advantages of Boolean Information Retrieval Model
Precision Control: Users can precisely define search criteria using Boolean operators,
ensuring accurate retrieval of specific information.
Structured Queries: Ideal for systematic searches where exact term matches are critical,
such as legal or scientific research.
Consistent Results: The same query always produces the same results, ensuring
reproducibility.
Disadvantages of Boolean Information Retrieval Model
No Relevance Ranking: Lacks the ability to rank documents by relevance, leading to
potential difficulties in identifying more important results.
Limited Language Handling: Struggles with variations in language, such as synonyms or
related terms, which can result in missed information.
Complex Query Construction: Formulating intricate queries with multiple terms and
operators can be complex and error-prone.
Binary Output: Documents are classified as either relevant or irrelevant, lacking the
nuance of degrees of relevance.
Code:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from collections import OrderedDict
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

# Sample documents in the corpus
documents = [
    "Taj Mahal is a beautiful monument",
    "Victoria Memorial is also a monument",
    "I like to visit Agra"
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
dictionary = {}

# Build the inverted index: stemmed term -> posting list of document IDs
for doc_id, doc_text in enumerate(documents):
    tokens = word_tokenize(doc_text.lower())
    # Remove stopwords and punctuation
    tokens = [t for t in tokens
              if t not in stop_words and t not in string.punctuation]
    stemmed_words = [stemmer.stem(token) for token in tokens]
    for term in stemmed_words:
        if term not in dictionary:
            dictionary[term] = []
        if doc_id + 1 not in dictionary[term]:
            dictionary[term].append(doc_id + 1)  # 1-based document numbering

ordered_dictionary = OrderedDict(sorted(dictionary.items()))

print("Inverted Index:")
for term, posting_list in ordered_dictionary.items():
    print(term, posting_list)

# Read and normalize the query; operator words (and, or, not) pass through
query = input("Enter query: ")
query = word_tokenize(query.lower())
query = [stemmer.stem(word) for word in query]
print("Processed Query:", query)

# Start from the full collection so a query beginning with NOT works
result_set = set(range(1, len(documents) + 1))

i = 0
while i < len(query):
    term = query[i]
    if term == 'and':
        i += 1
        next_term = query[i]
        result_set &= set(ordered_dictionary.get(next_term, []))
    elif term == 'or':
        i += 1
        next_term = query[i]
        result_set |= set(ordered_dictionary.get(next_term, []))
    elif term == 'not':
        i += 1
        next_term = query[i]
        result_set -= set(ordered_dictionary.get(next_term, []))
    else:
        result_set = set(ordered_dictionary.get(term, []))
    i += 1

if result_set:
    print("\nMatching Documents:", result_set)
else:
    print("\nNo matching documents.")
Output:
(Screenshots of sample runs for AND, OR, and NOT queries.)
Conclusion
While the Boolean Information Retrieval Model offers precision and structured searches, it falls short in ranking results by relevance and adapting to contextual nuances. The Vector Space Model, by quantifying text with TF-IDF weights and cosine similarity, addresses these limitations and stands as a more adaptable and nuanced approach, though challenges in contextual understanding and capturing user intent remain open for further development.