Yash Pahlani D17B 49
Aim: To implement an Information Retrieval modeling technique
Theory:
Information Retrieval (IR) modeling techniques are essential for efficient and
accurate extraction of relevant information from vast document repositories. By
analyzing and structuring data, these techniques facilitate the ranking and presentation
of documents in alignment with a user's query. The choice of IR technique depends
on factors such as query complexity, document collection size, and desired
precision-recall trade-offs, highlighting the diverse strategies available to optimize
information retrieval processes.
Here are some commonly used IR modeling techniques:
Boolean Model: The Boolean Model is a fundamental and straightforward
approach to information retrieval. It treats documents and queries as sets of terms
(words), and it uses Boolean operators (AND, OR, NOT) to combine these sets. In
this model, a document is either considered relevant (1) or not relevant (0) to a
query. The Boolean Model provides a way to express complex queries using
logical operators.
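As a minimal sketch of this binary view (the two documents below, represented as term sets, are invented for illustration), a query such as "retrieval AND NOT database" assigns each document a relevance of 1 or 0:

# Hypothetical documents represented as sets of terms (illustrative only)
docs = {
    'd1': {'information', 'retrieval', 'boolean'},
    'd2': {'database', 'retrieval'},
}

# Evaluate the query: retrieval AND NOT database
for name, terms in docs.items():
    relevant = int('retrieval' in terms and 'database' not in terms)
    print(name, '->', relevant)  # 1 = relevant, 0 = not relevant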
Vector Space Model (VSM): The Vector Space Model represents documents and
queries as vectors in a high-dimensional space, where each dimension corresponds
to a term. Terms are typically weighted using techniques like TF-IDF to reflect
their importance in the document. The relevance between a query vector and a
document vector is often computed using the cosine similarity.
Probabilistic Models: Probabilistic models approach information retrieval from a
statistical perspective, estimating the probability that a document is relevant to a
given query. These models aim to find a balance between precision and recall by
ranking documents based on their likelihood of relevance.
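Only the Boolean and vector space models are implemented later in this write-up; as a minimal sketch of the probabilistic idea, the widely used BM25 ranking function can be written directly. The toy corpus and the parameter values k1 = 1.5 and b = 0.75 below are illustrative assumptions, not part of the experiment:

import math

# Toy corpus and query (invented for illustration)
docs = ["information retrieval system",
        "database system design",
        "retrieval of information"]
query = ["information", "retrieval"]

tokenized = [d.split() for d in docs]
N = len(tokenized)
avg_dl = sum(len(d) for d in tokenized) / N
k1, b = 1.5, 0.75  # assumed, commonly used parameter values

def bm25(q_terms, doc):
    # BM25: smoothed IDF times a saturated, length-normalized term frequency
    score = 0.0
    for term in q_terms:
        df = sum(1 for d in tokenized if term in d)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_dl))
    return score

# Rank documents by their estimated likelihood of relevance to the query
for j in sorted(range(N), key=lambda j: -bm25(query, tokenized[j])):
    print(f"{bm25(query, tokenized[j]):.4f}  {docs[j]}")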
Vector Space Model (VSM):
The Vector Space Model (VSM) is a fundamental technique in Information
Retrieval (IR) that transforms textual data into a geometric framework. In this
model, documents and queries are represented as vectors in a high-dimensional
space, with each dimension corresponding to a unique term. By measuring the
similarity between these vectors, the VSM assesses the relevance of documents to
user queries. Originally proposed by Gerard Salton in the 1960s, the VSM has
since become a cornerstone of modern IR systems.
Working of Vector Space Model:
The VSM transforms documents and queries into numerical vectors within a
high-dimensional space. Each dimension corresponds to a unique term in the
vocabulary. The key steps in the VSM's working are described below; a small worked sketch in code follows the steps:
Term Frequency (TF) Calculation: For each document and query, the frequency of
each term is computed. This forms the term frequency vector.
The term frequency for the i-th term and the j-th document is computed as:

\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}

where n_{i,j} is the number of occurrences of term i in document j.
Inverse Document Frequency (IDF) Calculation: The inverse document frequency
of each term is determined, representing its importance in the entire document
collection.
The inverse document frequency takes into consideration the i-th term and all N documents in the collection:

\mathrm{idf}_{i} = \log \frac{N}{\lvert \{\, j : t_{i} \in d_{j} \,\} \rvert}

where the denominator is the number of documents that contain term i.
Term Weighting (TF-IDF): The product of term frequency and inverse document frequency gives the TF-IDF score, which reflects the significance of each term within a document or query:

\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}
Vector Creation: Each document and query is represented as a vector, where each
dimension corresponds to a term and the value is its corresponding TF-IDF score.
Cosine Similarity: The relevance between documents and queries is assessed using
the cosine similarity between their respective vectors. Documents with higher
cosine similarities are considered more relevant.
Cosine similarity between a document vector d and a query vector q is computed as:

\cos(\vec{d}, \vec{q}) = \frac{\vec{d} \cdot \vec{q}}{\lVert \vec{d} \rVert \, \lVert \vec{q} \rVert}
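As a worked sketch of the steps above, the following code computes TF, IDF, TF-IDF vectors, and cosine similarity directly; the three tiny documents and the query are invented for illustration (the experiment's own corpus appears in the Code section below):

import math

# Tiny illustrative corpus and query (invented for this sketch)
docs = [
    ["information", "retrieval", "system"],
    ["database", "system", "design"],
    ["retrieval", "of", "information"],
]
query = ["information", "retrieval"]

vocab = sorted({t for d in docs for t in d} | set(query))
N = len(docs)

def tf(term, tokens):
    # tf_{i,j} = n_{i,j} / sum_k n_{k,j}
    return tokens.count(term) / len(tokens)

def idf(term):
    # idf_i = log(N / df_i), where df_i = number of documents containing the term
    df = sum(1 for d in docs if term in d)
    return math.log(N / df) if df else 0.0

def tfidf_vector(tokens):
    # one dimension per vocabulary term, weighted by tf * idf
    return [tf(t, tokens) * idf(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

q_vec = tfidf_vector(query)
for j, d in enumerate(docs):
    print(f"doc {j}: cosine similarity = {cosine(tfidf_vector(d), q_vec):.4f}")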
Advantages of Vector Space Model
Partial Matching: The VSM accommodates partial keyword matches, allowing
relevant documents to be retrieved even if they share only a subset of terms with
the query.
Term Importance: TF-IDF captures term importance, emphasizing rare and
distinctive terms over common ones.
Ranking: Cosine similarity provides a natural ranking mechanism, presenting the
most relevant documents first.
Disadvantages of Vector Space Model
Semantic Gap: The VSM lacks understanding of word semantics, leading to
challenges in capturing context and meaning.
High-Dimensional Space: As the vocabulary grows, the dimensionality of the
space increases, which can lead to computational complexities.
Query Sparsity: Short queries or those with few relevant terms may result in
imprecise retrieval.
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')
nltk.download('stopwords')

# Sample corpus
corpus = [
    'In computer science artificial intelligence sometimes called machine intelligence is intelligence demonstrated by machines',
    'Experimentation calculation and Observation is called science',
    'Physics is a natural science that involves the study of matter and its motion through space and time, along with related concepts such as energy and force',
    'In mathematics and computer science an algorithm is a finite sequence of well-defined computer-implementable instructions',
    'Chemistry is the scientific discipline involved with elements and compounds composed of atoms, molecules and ions',
    'Biochemistry is the branch of science that explores the chemical processes within and related to living organisms',
    'Sociology is the study of society, patterns of social relationships, social interaction, and culture that surrounds everyday life',
]

# Preprocess the corpus: lowercase, tokenize, and keep alphanumeric tokens
cleaned_corpus = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())
    tokens = [word for word in tokens if word.isalnum()]
    cleaned_corpus.append(' '.join(tokens))

# Create a TF-IDF vectorizer and build the document vectors
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(cleaned_corpus)

# Query: lowercase, tokenize, and drop English stopwords
query = 'computer science'
query_tokens = word_tokenize(query.lower())
query = ' '.join([word for word in query_tokens
                  if word not in stopwords.words('english')])

# Transform the query into a vector using the same vectorizer
query_vector = vectorizer.transform([query])

# Calculate cosine similarity between every document and the query
cosine_similarities = cosine_similarity(doc_vectors, query_vector).flatten()

# Get document indices ranked by decreasing similarity
related_docs_indices = cosine_similarities.argsort()[::-1]

# Print related documents with their cosine similarity values
stop_words = set(stopwords.words('english'))
for i in related_docs_indices:
    tokens = word_tokenize(cleaned_corpus[i])
    filtered_tokens = [word for word in tokens if word not in stop_words]
    data = ' '.join(filtered_tokens)
    print(f"Similarity: {cosine_similarities[i]:.4f} - {data}")
Output:
(Screenshot of the ranked documents with their cosine similarity scores.)
Boolean Information Retrieval Model
The Boolean Information Retrieval Model is a fundamental approach in the field of
information retrieval, which involves searching for and retrieving relevant documents
from a collection based on user-defined queries. It operates on the principles of Boolean
logic, which was developed by George Boole in the mid-19th century. The Boolean
model is particularly effective for precise, structured searches, making it well-suited for
certain types of information retrieval tasks.
Working of Boolean Information Retrieval Model
The Boolean Information Retrieval Model operates based on a set of principles derived
from Boolean logic. The key components that define how the model works include
Boolean operators, queries, and documents. Here's a step-by-step explanation of how the
Boolean Information Retrieval Model works:
Document Indexing: Before searching can begin, a collection of documents is indexed.
This involves parsing each document to extract individual terms (words), removing
stopwords (common words like "and," "the," "is"), and creating an index that maps each
term to the documents where it appears.
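As a small sketch of this indexing step, an inverted index can be built with a plain dictionary; the two documents below reuse samples from the Code section further down, and stopword removal is omitted here for brevity:

import string

# Two documents from the sample corpus used later (1-based IDs)
documents = {
    1: "Taj Mahal is a beautiful monument",
    2: "Victoria Memorial is also a monument",
}

# Build the inverted index: term -> set of document IDs containing it
inverted_index = {}
for doc_id, text in documents.items():
    for term in text.lower().split():
        term = term.strip(string.punctuation)
        inverted_index.setdefault(term, set()).add(doc_id)

print(inverted_index)
# e.g. 'monument' maps to {1, 2}; 'taj' maps to {1}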
Query Formulation: Users create queries by combining terms and Boolean operators
(AND, OR, NOT). The terms are the keywords or phrases that users want to search for
within the document collection. The Boolean operators define how the terms are related
and help narrow down or broaden the search.
Boolean Operators:
AND: When users use the AND operator, they are specifying that documents must
contain all the terms connected by AND. This narrows down the search to documents that
satisfy all the conditions.
OR: The OR operator retrieves documents that contain at least one of the terms connected
by OR. It broadens the search by including documents that meet any of the specified
conditions.
NOT: The NOT operator excludes documents that contain the term following it. It refines
the search results by excluding unwanted documents.
Retrieval of Documents: The indexing structure allows the system to efficiently retrieve documents that match the terms and Boolean operators specified in the query: AND maps to the intersection of posting lists, OR to their union, and NOT to set difference, as the sketch below shows.
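Concretely, when posting lists are stored as sets of document IDs, each operator maps directly onto a Python set operation. The posting lists below are illustrative values consistent with the three sample documents in the Code section:

# Illustrative posting lists for the three sample documents (IDs 1-3)
all_docs = {1, 2, 3}
postings = {'taj': {1}, 'monument': {1, 2}, 'agra': {3}}

# taj AND monument -> intersection of posting lists
print(postings['taj'] & postings['monument'])   # {1}

# monument OR agra -> union of posting lists
print(postings['monument'] | postings['agra'])  # {1, 2, 3}

# NOT monument -> complement with respect to the whole collection
print(all_docs - postings['monument'])          # {3}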
Advantages of Boolean Information Retrieval Model
Precision Control: Users can precisely define search criteria using Boolean operators,
ensuring accurate retrieval of specific information.
Structured Queries: Ideal for systematic searches where exact term matches are critical,
such as legal or scientific research.
Consistent Results: The same query always produces the same results, ensuring
reproducibility.
Disadvantages of Boolean Information Retrieval Model
No Relevance Ranking: Lacks the ability to rank documents by relevance, leading to
potential difficulties in identifying more important results.
Limited Language Handling: Struggles with variations in language, such as synonyms or
related terms, which can result in missed information.
Complex Query Construction: Formulating intricate queries with multiple terms and
operators can be complex and error-prone.
Binary Output: Documents are classified as either relevant or irrelevant, lacking the
nuance of degrees of relevance.
Code:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from collections import OrderedDict
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

# Sample documents in the corpus
documents = [
    "Taj Mahal is a beautiful monument",
    "Victoria Memorial is also a monument",
    "I like to visit Agra"
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
dictionary = {}

# Build the inverted index: stemmed term -> posting list of document IDs
for doc_id, doc_text in enumerate(documents):
    tokens = word_tokenize(doc_text.lower())
    # Remove stopwords and punctuation
    tokens = [t for t in tokens
              if t not in stop_words and t not in string.punctuation]
    stemmed_words = [stemmer.stem(token) for token in tokens]
    for term in stemmed_words:
        if term not in dictionary:
            dictionary[term] = []
        if doc_id + 1 not in dictionary[term]:
            dictionary[term].append(doc_id + 1)  # 1-based document numbering

ordered_dictionary = OrderedDict(sorted(dictionary.items()))

print("Inverted Index:")
for term, posting_list in ordered_dictionary.items():
    print(term, posting_list)

# Read and normalize the query; operator words (and, or, not) pass through
query = input("Enter query: ")
query = word_tokenize(query.lower())
query = [stemmer.stem(word) for word in query]
print("Processed Query:", query)

# Start from the full collection so a query beginning with NOT works
result_set = set(range(1, len(documents) + 1))

i = 0
while i < len(query):
    term = query[i]
    if term == 'and':
        i += 1
        next_term = query[i]
        result_set &= set(ordered_dictionary.get(next_term, []))
    elif term == 'or':
        i += 1
        next_term = query[i]
        result_set |= set(ordered_dictionary.get(next_term, []))
    elif term == 'not':
        i += 1
        next_term = query[i]
        result_set -= set(ordered_dictionary.get(next_term, []))
    else:
        result_set = set(ordered_dictionary.get(term, []))
    i += 1

if result_set:
    print("\nMatching Documents:", result_set)
else:
    print("\nNo matching documents.")
Output:
(Screenshots of sample runs for AND, OR, and NOT queries.)
Conclusion
While the Boolean Information Retrieval Model offers precision and structured searches, it falls short in ranking results by relevance and adapting to contextual nuances. The Vector Space Model, by quantifying text with TF-IDF weights and cosine similarity, addresses these limitations and stands as a more adaptable and nuanced approach, though challenges in contextual understanding and capturing user intent remain open for further development.