
MODULE 5

Q1) Explain Information Retrieval (IR) and the classical problem in Information Retrieval (IR)
systems.
Ans. Information Retrieval (IR) is the process of obtaining relevant information from a large
repository, such as a database or the web, based on user queries.

Key Concepts:
1. Document Collection: A repository of structured or unstructured text documents.
2. Queries: User input specifying the required information.
3. Relevance: Matching documents based on their content and user query.
4. Ranking: Arranging documents in order of relevance to the query.

Example: Search engines like Google use IR techniques to provide relevant results for user
queries.

Design Features of IR:


i. Document Indexing: Efficiently organize documents using structures like inverted
indexes for quick retrieval (a short sketch follows this list).
ii. Query Processing: Parse, normalize, and expand user queries to handle diverse formats
and improve performance.
iii. Relevance Ranking: Rank documents based on relevance using algorithms like TF-IDF
or BM25.
iv. Scalability: Manage large datasets with efficient storage, distributed indexing, and
retrieval mechanisms.
v. Natural Language Support: Process queries and documents using stemming,
lemmatization, and phrase detection.
vi. User Feedback Integration: Enable iterative refinement of results through relevance
feedback.
vii. Semantic Search: Match concepts rather than keywords using contextual understanding.
viii. Multimedia Retrieval: Support various content types, including text, images, videos,
and audio.
ix. Real-Time Updates: Allow dynamic addition or modification of indexed documents.
x. Cross-Language Retrieval: Facilitate multilingual searching using machine translation
or language-independent indexing.
xi. Personalization: Tailor results based on user preferences, search history, and behavior.
xii. Security and Privacy: Ensure data confidentiality and protection for sensitive queries or
documents.
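
A minimal sketch of the inverted index mentioned in feature (i), written in Python; the document texts are invented purely for illustration:

from collections import defaultdict

# Toy document collection (hypothetical texts, for illustration only).
docs = {
    1: "information retrieval finds relevant documents",
    2: "search engines rank documents by relevance",
    3: "users submit queries to search engines",
}

# Build the inverted index: each term maps to the set of document IDs
# that contain it, so query-time lookups avoid scanning every document.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

print(inverted_index["documents"])  # {1, 2}
print(inverted_index["search"])     # {2, 3}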

Classical Problems in IR Systems

Ad-Hoc Retrieval Problem


 Definition: A core problem where the system retrieves a ranked list of documents
relevant to a user's specific query without prior knowledge of the query.
 Characteristics:
 The user query is often vague or ambiguous.
 The document collection is static.
 Relevance ranking is crucial.
 Challenges:
 Determining user intent accurately.
 Ranking documents effectively despite query ambiguity.
 Key Components:
 Query Processing: Parsing and understanding the user query.
 Indexing: Precomputing document features for efficient matching.
 Matching and Ranking: Scoring documents based on relevance metrics like TF-IDF or
BM25 (a BM25 sketch follows the example below).
 Real-World Example: When a user searches for "best phones," the system retrieves
relevant documents from a static collection and ranks them based on factors like reviews,
features, and popularity.
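
Since BM25 is named as a relevance metric both here and in the design features above, here is a minimal, self-contained sketch of Okapi BM25 scoring; the parameter values k1 = 1.5 and b = 0.75 are conventional defaults assumed here, and the toy documents are invented:

import math

def bm25_score(query_terms, doc_terms, all_docs, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for a query (sketch).

    query_terms: list of query tokens
    doc_terms:   list of tokens in the document being scored
    all_docs:    list of token lists for the whole collection
    """
    N = len(all_docs)
    avgdl = sum(len(d) for d in all_docs) / N  # average document length
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in all_docs if q in d)           # docs containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # smoothed IDF
        f = doc_terms.count(q)                             # term frequency
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

docs = [d.lower().split() for d in [
    "best phones of the year reviewed",
    "phone reviews and popular features",
    "cooking recipes for the weekend",
]]
query = "best phones".split()
for d in docs:
    print(round(bm25_score(query, d, docs), 3), d)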

Q2) Types of Information Retrieval (IR) Models


Ans. The various types include:
1. Classical IR Model: Includes Boolean, vector space, and probabilistic models, focusing
on document-term relationships and relevance.
2. Non-Classical IR Model: Explores semantic, fuzzy, and graph-based models to address
challenges of contextual and uncertain data retrieval.
3. Alternative IR Model: Introduces ranking, hybrid, and knowledge-based methods,
combining multiple approaches to improve search accuracy.
4. Boolean Model: Represents queries using logical operators (AND, OR, NOT) to retrieve
exact matches from documents.
5. Vector Space Model: Represents documents and queries as vectors in multi-dimensional
space; uses cosine similarity for ranking relevance.
6. Probabilistic Model: Predicts the probability of a document's relevance to a query using
probabilistic reasoning.
7. Language Model: Uses statistical language models to rank documents based on the
likelihood of generating the user query.
8. Latent Semantic Indexing (LSI): Extracts hidden semantic structures in documents for
improved similarity matching.
9. Extended Boolean Model: Enhances classical Boolean with partial matching and
ranking for more flexible retrieval.
10. Ranking Models: Focus on ranking algorithms (e.g., PageRank) to prioritize
documents based on their importance or authority.
11. Neural Network Models: Use deep learning architectures like BERT and other
transformers to capture semantic relationships and context.
12. Graph-Based Models: Represent documents and terms as nodes in a graph to analyze
relationships and rank documents.
13. Hybrid Models: Combine classical and non-classical models, leveraging their strengths
for enhanced accuracy and versatility in retrieval.

Q3) Boolean Model.


Ans. Boolean Model in Information Retrieval

Concept Meaning:
The Boolean Model represents queries and documents using binary logic, where terms are
matched exactly based on logical operators like AND, OR, and NOT.

Key Components:
1. Documents: A collection of indexed documents containing terms.
2. Queries: Logical expressions using Boolean operators (AND, OR, NOT) to specify
retrieval criteria.
3. Operators:
a. AND: Retrieves documents containing all specified terms.
b. OR: Retrieves documents containing at least one of the specified terms.
c. NOT: Excludes documents containing the specified term.

Algorithm Concept:
The Boolean Model uses set operations to retrieve documents that satisfy a query's logical
conditions. The model assumes exact matching and outputs a binary decision: relevant or not
relevant.

Steps or Procedure:
1. Indexing: Create an inverted index mapping terms to document IDs.
2. Query Parsing: Convert the user query into a logical expression.
3. Set Operations: Perform set-based operations (union, intersection, or complement) on
document IDs based on Boolean operators.
4. Result Retrieval: Return documents satisfying the query criteria without ranking.
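
A minimal sketch of these steps in Python, using set operations on a toy inverted index (the terms, documents, and query are invented for illustration):

# Step 1: toy inverted index (term -> set of document IDs).
index = {
    "apple":  {1, 2, 4},
    "banana": {2, 3},
    "cherry": {1, 3, 4},
}

# Steps 2-3: evaluate the parsed query "apple AND cherry NOT banana"
# via set intersection (AND) and set difference (NOT).
result = (index["apple"] & index["cherry"]) - index["banana"]

# Step 4: return the unranked result set.
print(result)  # {1, 4}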

Advantages:
1. Simplicity: The model is easy to understand and implement, especially for users familiar
with Boolean logic.
2. Precision: Allows users to retrieve specific documents using exact query matching
criteria.
3. Efficiency: Works well for small datasets with straightforward and well-defined queries.
4. Structured Queries: Handles queries involving complex logical combinations using AND,
OR, and NOT operators.
5. Customization: Offers flexibility to craft queries based on specific requirements using
Boolean expressions.
6. Low Computational Requirements: Does not require advanced computational resources,
making it lightweight.

Disadvantages:
1. No Ranking: Fails to rank retrieved documents, making it difficult to prioritize the most
relevant ones.
2. Exact Match Dependency: Ineffective for ambiguous or incomplete queries, as it relies
on exact term matching.
3. No Partial Matching: Does not support fuzzy search or term similarity, limiting retrieval
capabilities.
4. Rigid Query Structure: Users must precisely formulate queries, which can be
challenging for complex or vague information needs.
5. No Semantic Understanding: Fails to capture the relationships or context between
terms.

Q4) Vector Space Model and Cosine Similarity

Ans. Concept Meaning


The Vector Space Model represents documents and queries as vectors in a multi-dimensional
space, enabling similarity measurement using cosine similarity.
Key Components
1. Documents as Vectors: Each document is represented as a vector of terms in a multi-
dimensional space.
2. Queries as Vectors: Queries are treated similarly, represented as vectors of terms.
3. Vector Components: Components are term weights, often calculated using TF-IDF
(Term Frequency-Inverse Document Frequency); a brief sketch follows this list.
4. Cosine Similarity: Measures the cosine of the angle between query and document
vectors to assess similarity.
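
A brief sketch of computing a TF-IDF weight (component 3); the toy corpus is invented and the common log-scaled IDF variant is assumed:

import math

docs = [d.lower().split() for d in [
    "information retrieval ranks documents",
    "retrieval of relevant documents",
    "cats and dogs are popular pets",
]]
N = len(docs)

def tf_idf(term, doc_tokens):
    """TF-IDF weight of a term in one document (log-scaled IDF variant)."""
    tf = doc_tokens.count(term) / len(doc_tokens)  # term frequency
    df = sum(1 for d in docs if term in d)         # document frequency
    idf = math.log(N / df) if df else 0.0          # inverse document frequency
    return tf * idf

print(tf_idf("retrieval", docs[0]))  # nonzero: term appears in 2 of 3 docs
print(tf_idf("cats", docs[0]))       # 0.0: term absent from this document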

Algorithm Concept
The model calculates the cosine of the angle between query and document vectors in a vector
space. Smaller angles (cosine values closer to 1) indicate higher similarity.

Formula

cos(θ) = (Q⃗ · D⃗) / (‖Q⃗‖ ‖D⃗‖)

Where:

 Q⃗ · D⃗ : Dot product of the query (Q⃗) and document (D⃗) vectors.

 ‖Q⃗‖, ‖D⃗‖ : Magnitudes (Euclidean norms) of the query and document vectors.

Steps or Procedure
1. Vector Representation: Represent documents and queries as term-weighted vectors
using TF-IDF or other weighting schemes.
2. Dot Product Calculation: Compute the dot product between the query vector and each
document vector.
3. Magnitude Calculation: Calculate the Euclidean norm (magnitude) of the vectors.
4. Similarity Measurement: Use the cosine similarity formula to compute similarity scores
for each document.
5. Result Ranking: Rank documents based on similarity scores for query relevance.
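
A minimal sketch of this procedure; raw term counts stand in for TF-IDF weights to keep it short, and the texts are invented:

import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)         # dot product
    norm_a = math.sqrt(sum(v * v for v in a.values()))  # ||A||
    norm_b = math.sqrt(sum(v * v for v in b.values()))  # ||B||
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ["information retrieval ranks documents",
        "cats and dogs are popular pets",
        "retrieval of relevant documents"]
query = "relevant documents retrieval"

q_vec = Counter(query.lower().split())
doc_vecs = [Counter(d.lower().split()) for d in docs]

# Step 5: rank documents by similarity to the query.
ranked = sorted(enumerate(doc_vecs),
                key=lambda iv: cosine_similarity(q_vec, iv[1]),
                reverse=True)
for i, vec in ranked:
    print(round(cosine_similarity(q_vec, vec), 3), docs[i])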

Advantages
1. Relevance Ranking: Assigns scores, allowing documents to be ranked by similarity.
2. Partial Matching: Retrieves documents even with partial query matches.
3. Scalability: Works well with large datasets using sparse representations.
4. Mathematical Foundation: Provides a structured, mathematical approach to measuring
relevance.

Disadvantages
1. Dimensionality Issues: High-dimensional space increases computational complexity.
2. Synonym Limitations: Cannot capture semantic similarities or resolve word
ambiguities.
3. Weight Sensitivity: Accuracy depends on term weighting schemes like TF-IDF.
4. No Contextual Understanding: Ignores word order or deeper contextual relationships.

Graph: (vector diagram: query and document vectors plotted in term space; a smaller angle between them corresponds to higher cosine similarity)

Q5) Stemming.
Ans. Stemming in NLP
Concept Meaning:
Stemming in Natural Language Processing (NLP) reduces words to their root or base form by
removing suffixes or prefixes, improving text normalization and search efficiency.

Key Components
1. Word Roots: The base form of a word that retains its core meaning.
2. Affixes Removal: Eliminating prefixes, suffixes, or inflectional endings.
3. Algorithms: Rules or statistical methods used to derive word stems.
Types of Stemming

1. Porter Stemmer: Widely used and removes common suffixes based on a set of rules.
2. Lancaster Stemmer: A more aggressive stemmer, reducing words to shorter roots.
3. Snowball Stemmer: An improved version of Porter with support for multiple languages.
4. Lovins Stemmer: An early rule-based stemmer that removes the longest matching suffix in a single pass.
5. Paice-Husk Stemmer: Iterative rule-based approach for suffix stripping with reversible
operations.
6. Krovetz Stemmer: A light stemmer focusing on linguistic rather than aggressive
reductions.
7. Suffix-Stripping Stemmer: Removes suffixes based on predefined patterns or heuristics.
8. Corpus-Based Stemmer: Utilizes a specific corpus to identify stem patterns statistically.
9. Hybrid Stemmer: Combines rule-based and statistical methods for better performance.
10. Light Stemmer: Focuses on minor affix removal, often used in non-English languages
like Arabic.
11. Inflectional Stemmer: Deals only with inflectional endings like plurals or verb
conjugations.
12. Rule-Based Stemmer: Applies explicit linguistic rules for stripping affixes.
13. Machine Learning Stemmer: Learns stemming patterns using supervised learning
models trained on labeled data.

Algorithm Concept
Stemming algorithms apply a sequence of transformation rules to remove affixes iteratively or
by matching against pre-defined patterns. The goal is to simplify word forms consistently
without altering meaning significantly.

Why Use Stemming?


1. Search Optimization: Reduces query and document terms to common stems for better
matches.
2. Text Normalization: Converts words to a consistent base form, simplifying text
analysis.
3. Storage Efficiency: Reduces storage requirements by collapsing similar words.
4. Improves Recall: Retrieves more documents by matching variations of the same root
word.
5. Language Flexibility: Handles word variations in different grammatical contexts
effectively.
6. Simplifies Preprocessing: Streamlines the text preprocessing pipeline for NLP
applications.
7. Cost-Effective: Decreases computational overhead by reducing vocabulary size.

Example:
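
A brief sketch comparing three of the stemmers listed above, using NLTK (this assumes the nltk package is installed; the word list is illustrative):

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

# Porter typically reduces "running" to "run" and "studies" to "studi";
# Lancaster is more aggressive and may cut stems even shorter.
for word in ["running", "flies", "studies", "happily", "caresses"]:
    print(word, porter.stem(word), lancaster.stem(word), snowball.stem(word))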

Q6)
