CHAPTER-1 Introduction to Information Retrieval System
Q.1 What is Information Retrieval? Explain its characteristics.
Information Retrieval (IR) is the process of searching and finding useful information
from a large collection of data, such as books, documents, websites, or databases.
Imagine you are looking for a book in a library. Instead of going through thousands of
books one by one, you use a catalogue or a computer system where you type the
book’s title or topic. The system then finds the most relevant books for you. This is a
simple example of how information retrieval works.
For information retrieval to work efficiently, the system first collects and organizes
data. This process is called indexing, where all the stored information is arranged in a
structured way so that it can be searched easily. When a user enters a search query,
the system compares it with the indexed data, finds the most relevant matches, and
displays the results in order of relevance.
Information retrieval is not just limited to text-based searches. It is also used in
searching for images, videos, and even voice-based queries. Platforms like YouTube
use information retrieval to suggest videos, while voice assistants like Google
Assistant rely on it to understand and answer questions.
Characteristics of Information Retrieval
1. Query-Based Search: The IR system works based on user queries. Users type a
keyword or a question, and the system finds relevant information. Example: If you search
"how to bake a cake," Google shows recipes, videos, and blog posts related to baking
cakes.
2. Finding Relevant Information: The system finds many documents, but only some
are useful. It ranks results based on relevance so that the most useful ones appear first.
Example: Searching "best smartphones 2025" should show smartphone reviews, not
general mobile-related news.
3. Handling Large Amounts of Data: IR systems work with millions or even billions of
documents. They must be fast and efficient to find the right information quickly.
Example: Google processes billions of searches every day and delivers results in less
than a second.
4. Indexing for Faster Search: Instead of searching through everything every time, IR
systems create an index. Indexing is like making a table of contents to quickly find
information. Example: A library catalogue helps find books by their titles or authors
instead of checking every shelf.
5. Filtering and Categorization: The system organizes search results into categories or
applies filters. This helps users find exactly what they need. Example: Shopping websites
allow filtering by price, brand, and ratings to help you find the perfect product.
Q.2 Components of an Information Retrieval (IR) System
An Information Retrieval (IR) system is made up of several key components that
work together to find and display relevant information based on a user’s query. Let’s
go through these components in simple terms.
1. Document Collection (Database): This is where all the information is stored. It
can be a collection of books, articles, websites, or any other form of data. For
example, Google’s search engine stores billions of web pages in its database. The
quality and size of this collection determine how much useful information can be
retrieved.
2. Query Processor: When a user types a search term (query), the query
processor cleans and refines it to make searching more effective. It removes
unnecessary words and corrects errors. If you search for “best mobiles under 500$,”
the system may adjust the query to “best smartphones under 500 dollars” to get
better results.
3. Indexing Module: The indexing module organizes and structures the stored
documents so that searching is faster. Instead of searching every single document,
the system creates an "index" like a table of contents in a book. For example, if you
search for "Artificial Intelligence," the system will quickly check the index and find
documents where those words appear instead of scanning millions of pages one by
one.
4. Ranking and Matching Module: Not all search results are equally useful. The
ranking and matching module decides which documents are most relevant to the
user’s query and arranges them in order of importance. It uses various factors like
keyword frequency, page popularity, and user behaviour. For example, if you search
for "best laptops 2024," the most relevant and recently updated pages will appear at
the top.
5. User Interface (Search Engine Interface): The user interface is the search
box where users type their queries and view the search results. A good user interface
should be simple, easy to use, and display information in an organized way. Google’s
homepage is a great example of a clean and effective user interface.
6. Feedback Mechanism: A good IR system learns from user behaviour. If people
frequently click on a particular result, the system understands that it is more relevant
and will rank it higher in future searches. Search engines like Google use machine
learning and artificial intelligence to improve results over time based on user
preferences.
Q.3 Architecture of IR System
An Information Retrieval (IR) System is designed to help users search, retrieve,
and rank relevant information from a large collection of data, such as documents, web
pages, or multimedia content. The architecture of an IR system consists of multiple
components that work together to process queries and retrieve relevant documents
efficiently.
1. Components of an IR System Architecture
A. Data Collection & Pre-processing
1. Data Acquisition: The system collects raw data from different sources such as
web pages, documents, or databases. Sources may include structured data
(databases), semi-structured data (XML, JSON), and unstructured data (text
documents).
2. Text Processing & Normalization: The collected data undergoes various pre-
processing steps:
o Tokenization: Splitting text into individual words or terms.
o Stop word Removal: Removing common words (e.g., "the", "is") that do
not add much meaning.
o Stemming & Lemmatization: Reducing words to their root forms (e.g.,
"running" → "run").
o Synonym Handling: Mapping related words to improve search accuracy.
3. Indexing: The processed data is indexed to enable fast retrieval. An Inverted
Index (word-to-document mapping) is created, which maps each word to the
documents it appears in.
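As a rough illustration, the normalization steps above can be sketched in a few lines of Python. The small stop word list and the use of NLTK's PorterStemmer are illustrative choices here, not requirements of any particular IR system:

```python
# A minimal pre-processing sketch: tokenization, stop word removal, stemming.
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "is", "a", "and", "of", "on", "to", "are", "for"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = text.lower().split()                        # tokenization
    tokens = [t.strip(".,!?") for t in tokens]           # strip punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [stemmer.stem(t) for t in tokens]             # stemming

print(preprocess("Running shoes are the best for running on roads"))
# ['run', 'shoe', 'best', 'run', 'road']
```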
B. Query Processing & Retrieval
4. Query Processing: When a user submits a query, the system processes it
using similar text normalization techniques (tokenization, stemming, stopword
removal). The system expands queries using query expansion techniques
(e.g., adding synonyms).
5. Matching & Scoring:
o The system matches the processed query with indexed documents.
o It uses ranking algorithms to score documents based on relevance.
o Common retrieval models:
Boolean Model: Exact match using AND, OR, NOT operators.
Vector Space Model (VSM): Computes document-query similarity
using TF-IDF (Term Frequency-Inverse Document Frequency) and
Cosine Similarity.
Probabilistic Model: Uses probability theory to estimate
relevance.
C. Ranking & Results Display
6. Ranking Algorithm:
o The system sorts the matched documents based on relevance scores.
o Ranking Factors:
Term frequency (TF): Number of times a word appears in a
document.
Inverse document frequency (IDF): Rarity of a word across
documents.
PageRank (for web search): Importance of a page based on
incoming links.
7. Results Presentation: The system returns ranked documents with snippets,
metadata, and URLs. It may include query suggestions or spell corrections.
D. Feedback & Performance Improvement
8. User Feedback & Relevance Feedback: The system learns from user
interactions (clicks, time spent) to improve results. Techniques like pseudo-
relevance feedback refine future searches.
9. Index Updates & Optimization:
o The index is periodically updated as new data arrives.
o Performance optimizations like caching, parallel processing, and
distributed search architectures enhance efficiency.
2. Diagram of IR System Architecture: (diagram not reproduced in these notes)
Q.4 Challenges of Information Retrieval (IR)
1. Handling Large Amounts of Data
Today, the internet has billions of web pages, and digital libraries store millions of
books and research papers. Searching through such huge amounts of data quickly and
efficiently is a big challenge.
For example, when you search for "healthy diet tips" on Google, the system has to
scan and rank millions of websites to show you the best results. Managing and
organizing such vast data efficiently is a difficult task.
2. Understanding User Queries
People use different words and phrases to describe the same thing. One person might
search for "best phones under $500," while another might type "top budget
smartphones." The IR system must understand that both users are looking for the
same type of information.
Additionally, some users type very vague queries like "good movies" or "best food."
The system must guess what they actually mean, which is challenging.
3. Synonyms and Multiple Meanings of Words (Ambiguity)
Words can have different meanings depending on the context. This is known as
ambiguity.
For example:
The word "Apple" could mean the fruit or the technology company.
The word "Java" could refer to a programming language or an Indonesian
island.
If a user searches for "Apple prices," the IR system must figure out whether they are
looking for fruit prices or iPhone prices. This makes search results tricky to manage.
4. Ranking the Most Relevant Results
An IR system must decide which documents, web pages, or books are most relevant
to the user's query. If irrelevant results appear at the top, users will not be satisfied.
For example, if you search for "best laptop for gaming," but the top results show
business laptops, you won’t find what you need. The challenge is to show the most
useful results first.
Search engines use complex ranking algorithms based on keyword relevance,
popularity, and user feedback to solve this problem.
5. Speed and Efficiency
People expect search results instantly. If an IR system takes too long, users may get
frustrated and leave.
For example, Google searches billions of pages and gives results in less than a
second! To achieve such fast performance, the system needs powerful servers,
optimized indexing, and efficient algorithms.
6. Fake or Low-Quality Information
The internet is full of fake news, misleading articles, and low-quality content. An IR
system must filter out unreliable information and provide accurate and trustworthy
results.
Search engines use credibility signals like trusted sources, user reviews, and
fact-checking algorithms to handle this challenge.
Q.5 Applications of Information Retrieval
1. Search Engines (Google, Bing, Yahoo, etc.): The most common application of
IR is search engines. When you search for something on Google, the system
retrieves the most relevant web pages from billions of websites.
2. Digital Libraries (Online Books and Research Papers): IR systems are used in
digital libraries to help students, researchers, and professionals find books, research
papers, and academic articles.
3. E-Commerce Websites (Amazon, eBay, Flipkart, etc.): Online shopping
platforms use IR systems to help users find products easily. If you search for "Nike
running shoes" on Flipkart, the IR system will show Nike shoes, but it may also
suggest related products like socks, sportswear, or alternative brands.
4. Recommendation Systems (Netflix, YouTube, Spotify, etc.): IR systems are
used in recommendation engines to suggest movies, videos, music, or products based
on user preferences.
5. Healthcare and Medical Information Retrieval: Doctors and researchers use IR
systems to find medical studies, patient records, and drug information.
6. Cyber security and Fraud Detection: IR systems help in detecting suspicious
activities by retrieving and analysing security data.
CHAPTER-2 Document Indexing, Storage and Compression
Q.6 Inverted Index
An Inverted Index is a data structure used in Information Retrieval (IR) systems to
help find relevant documents quickly. It is commonly used in search engines like
Google, digital libraries, and databases to improve search speed and efficiency.
Example: Imagine you have a book and you want to find all the pages where the word
"apple" appears. One way is to read the whole book page by page, which would
take a lot of time. A better way is to use an index at the back of the book, which
tells you exactly which pages contain the word "apple."
How to Build an Inverted Index in Information Retrieval
Step 1: Collect Documents
First, we need a set of documents. These could be web pages, articles, books, or any text data.
Example Documents:
Doc 1: "Apple is a tasty fruit."
Doc 2: "Apple makes iPhones and MacBook."
Doc 3: "Eating fruit like apple and banana is healthy."
Step 2: Tokenization (Break Text into Words)
We split each document into individual words (also called tokens).
After Tokenization:
Doc 1: "Apple", "is", "a", "tasty", "fruit"
Doc 2: "Apple", "makes", "iPhones", "and", "MacBook"
Doc 3: "Eating", "fruit", "like", "apple", "and", "banana", "is", "healthy"
Step 3: Remove Stop Words
Common words like "is", "a", "and", "the" do not add much meaning to searches, so we remove them.
After Removing Stop Words:
Doc 1: "Apple", "tasty", "fruit"
Doc 2: "Apple", "makes", "iPhones", "MacBook"
Doc 3: "Eating", "fruit", "like", "apple", "banana", "healthy"
Step 4: Create a Word-to-Document Mapping (Inverted Index)
Now, we create a table where each word is linked to the documents that contain it.
Inverted Index Table:
Word Documents Containing the Word
Apple Doc 1, Doc 2, Doc 3
Tasty Doc 1
Fruit Doc 1, Doc 3
Makes Doc 2
iPhones Doc 2
MacBook Doc 2
Eating Doc 3
Like Doc 3
Banana Doc 3
Healthy Doc 3
Now, if someone searches for "apple", the system instantly knows that the word "apple" appears in Doc 1,
Doc 2, and Doc 3 without scanning all documents.
Step 5: Store the Inverted Index Efficiently
For large-scale systems (like Google), the inverted index is stored in an optimized format to reduce space
and speed up searches. Some techniques include:
Using numbers instead of words (to save space).
Compressing data (to make storage efficient).
Step 6: Search Using the Inverted Index
Now, when a user searches for a word like "fruit", the system:
1.Checks the inverted index for "fruit".
2. Finds the documents where "fruit" appears (Doc 1, Doc 3).
3. Ranks the results (e.g., if "fruit" appears many times in Doc 3, it might be ranked higher).
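The steps above can be sketched in a few lines of Python. This is only a toy illustration of the idea, using the example documents and stop word list from this answer:

```python
# A minimal inverted-index sketch: word -> set of documents containing it.
from collections import defaultdict

docs = {
    "Doc 1": "Apple is a tasty fruit.",
    "Doc 2": "Apple makes iPhones and MacBook.",
    "Doc 3": "Eating fruit like apple and banana is healthy.",
}
STOP_WORDS = {"is", "a", "and", "the"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().replace(".", "").split():   # tokenize
        if token not in STOP_WORDS:                        # remove stop words
            index[token].add(doc_id)                       # word -> documents

print(sorted(index["apple"]))   # ['Doc 1', 'Doc 2', 'Doc 3']
print(sorted(index["fruit"]))   # ['Doc 1', 'Doc 3']
```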
Q.7 Compression Techniques in Information Retrieval (IR)
When working with huge amounts of data (like web pages, books, or documents),
Information Retrieval (IR) systems use compression techniques to reduce storage
space and speed up searches. These techniques help search engines like Google
handle billions of web pages efficiently.
Types of Compression Techniques in IR
1. Lossless Compression (No Data Loss) 2. Lossy Compression (Some Data Loss)
Lossless Compression (No Data Loss)
In lossless compression, data is compressed in a way that allows it to be perfectly
restored when needed. It ensures no loss of information and is ideal for inverted
indexes and metadata storage. Common Lossless Compression Techniques:
A. Huffman Coding: This technique replaces frequently used words with shorter
codes and rare words with longer codes. It is commonly used in search engines
and databases.
📌 Example: If the word "apple" appears 100 times in a document, instead of storing
"apple" each time, we can replace it with a short binary code like "101".
B. Run-Length Encoding (RLE): Used when a word appears multiple times in
a row. Instead of storing "apple, apple, apple, apple", we store "apple (4
times)".
📌 Example: Original: AAAA BBB CC DDDD Compressed: A(4) B(3) C(2) D(4)
This is useful for compressing inverted index postings (when a word appears
in many documents).
C. Variable Byte Encoding
This technique stores numbers using fewer bytes for smaller values and more
bytes for larger values.
Helps in compressing document IDs in inverted indexes.
Example: If document IDs are 1, 2, 3, 1000, instead of using 4 bytes per
number, we can use fewer bytes for small numbers and more bytes for
large numbers.
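As a rough sketch of how variable byte encoding works, the snippet below encodes the gaps between sorted document IDs so that small numbers take a single byte. The exact byte layout (the high bit marking the last byte of each number) is one common convention, assumed here for illustration:

```python
# Variable-byte (VByte) encoding sketch for document ID gaps.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunks = [n & 0x7F]            # low 7 bits
        n >>= 7
        while n:
            chunks.append(n & 0x7F)
            n >>= 7
        chunks.reverse()
        chunks[-1] |= 0x80             # mark the final byte of this number
        out.extend(chunks)
    return bytes(out)

doc_ids = [1, 2, 3, 1000]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]  # [1, 1, 1, 997]
print(len(vbyte_encode(gaps)))   # 5 bytes instead of 16 with fixed 4-byte integers
```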
Lossy Compression (Some Data Loss):
In lossy compression, some less important data is removed to save space.
This is used in search engines when small errors are acceptable, like
reducing image quality or removing uncommon words.
🔹 Common Lossy Compression Techniques:
A. Stop word Removal
Stop words (common words like "the", "is", "and", "of") are removed from
the index to reduce size.
Since stop words appear in almost every document, removing them saves a lot
of space.
Example: Original Sentence: "The apple is a tasty fruit."
After Stop word Removal: "Apple tasty fruit" (smaller but still meaningful).
B. Stemming & Lemmatization
Stemming: Reduces words to their base form (e.g., "running" → "run").
Lemmatization: Converts words into their dictionary form (e.g., "better" →
"good").
📌 Example:
Original words: Running, Runs, Ran
After stemming: Run
This reduces storage size by storing only the root word instead of many variations.
C. Quantization (for Image and Video Search)
Used in multimedia IR systems to compress images and videos.
Reduces the number of colours or pixel details without major loss of quality.
Example: JPEG compression reduces image size by removing unnecessary pixel
details, making it load faster.
Q.8 Document Representation
When you search for something on Google, it looks through millions of
documents to find the best match for your query. But how does Google or any
Information Retrieval (IR) system understand and store documents efficiently?
This is where Document Representation comes in. It is a way of organizing,
structuring, and storing documents so that they can be easily searched and
retrieved.
Think of a document as a book in a library. If books were just thrown into a room
without a title or index, it would be impossible to find the right one. To make
searching easier, books have: A title, an author’s name, a summary, Categories (fiction, history,
science, etc.)
Term Weighting in Information Retrieval
Term weighting techniques are used in Information Retrieval (IR) and Text Mining to
determine the importance of words (terms) in a document or a collection of
documents (corpus). These techniques help search engines, recommendation
systems, and text analysis tools rank documents based on their relevance to a query.
Term weighting assigns numerical values (weights) to words based on their
importance in a document or a set of documents.
Term Weighting Techniques
1. Term Frequency (TF): This method counts how many times a word appears in
a document. More appearances = more importance.
Formula: TF=Number of times the term appears in a document /Total number of words in the document
🔹 Example: Let's say we have a document: "The cat is on the mat. The mat is soft."
It has 10 words in total.
"the" appears 3 times
"mat" appears 2 times
"cat" appears 1 time
"soft" appears 1 time
TF for "mat" = 2/10 = 0.2.
2. Inverse Document Frequency (IDF): This method reduces the importance of
common words that appear in many documents (like "the" or "is"). If a word
appears in fewer documents, it gets a higher weight.
Formula: IDF=log (Total number of documents / Number of documents containing the term)
🔹 Example: Suppose we have 1000 documents, and the word "cat" appears in 10 documents:
IDF=log (1000/10) = log (100) =2
Whereas, if "the" appears in 900 documents:
IDF=log (1000/900) = log (1.11) ≈0.05
Since "the" appears in almost every document, its IDF is low.
3. TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF combines
Term Frequency and Inverse Document Frequency to balance word frequency and word
importance.
🔹 Formula: TF−IDF=TF×IDF
🔹 Example: For the word "cat" in the document above: TF = 1/10 = 0.1, IDF = 2
TF−IDF = 0.1 × 2 = 0.2
A rare word (like "cat") gets a higher score than a common word (like "the"), which helps in ranking
relevant documents.
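A small Python sketch of the same TF-IDF calculation, using the made-up counts from the examples above (a base-10 logarithm is assumed, as in the IDF example):

```python
import math

def tf(term_count, total_terms):
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    return math.log10(total_docs / docs_with_term)

tf_cat = tf(1, 10)                      # "cat" appears once in a 10-word document -> 0.1
idf_cat = idf(1000, 10)                 # "cat" appears in 10 of 1000 documents    -> 2.0
print(tf_cat * idf_cat)                 # TF-IDF = 0.1 * 2.0 = 0.2

idf_the = idf(1000, 900)                # "the" appears in 900 of 1000 documents   -> ~0.046
print(round(tf(3, 10) * idf_the, 3))    # common word gets a much lower score: ~0.014
```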
Chapter-3 Retrieval Models
Q.9 Boolean Retrieval Model
The Boolean Retrieval Model is a simple way to search for documents in a
collection using Boolean logic (AND, OR, NOT). It works on exact matches—either a
document is relevant (1) or not relevant (0). How Does It Work?
Each document is represented as a set of words (terms).
A Boolean query is formed using AND, OR, NOT to filter documents.
The model returns only exact matches (no ranking like in Google search).
Boolean Operators
🔹 OR (Union, ∪) → Finds documents containing at least one of the given words.
🔹 AND (Intersection, ∩) → Finds documents containing all the given words.
🔹 NOT (Exclusion, −) → Excludes documents containing a specific word.
3. Example Scenario: Documents Collection
We have three documents:
1.D1: "Cats love milk."
2. D2: "Dogs love bones and milk."
3. D3: "Milk is healthy."
Example Queries & Results
🔹 Query: cats AND milk
Matches D1 ("Cats love milk."). D2 and D3 do not match because they don't contain
both words.
🔹 Query: dogs OR milk
Matches D1, D2, and D3 because all contain at least one of the words.
🔹 Query: milk NOT dogs
Matches D1 and D3 (D2 is excluded because it contains "dogs").
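A minimal Boolean retrieval sketch in Python, using sets as postings lists for the three example documents (an illustration of the idea, not a production implementation):

```python
docs = {
    "D1": "cats love milk",
    "D2": "dogs love bones and milk",
    "D3": "milk is healthy",
}
postings = {}
for doc_id, text in docs.items():
    for word in text.split():
        postings.setdefault(word, set()).add(doc_id)

print(postings["cats"] & postings["milk"])   # AND -> {'D1'}
print(postings["dogs"] | postings["milk"])   # OR  -> {'D1', 'D2', 'D3'}
print(postings["milk"] - postings["dogs"])   # NOT -> {'D1', 'D3'}
```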
4. Advantages of Boolean Retrieval
Simple and Fast – Just checks for word presence.
Precise Results – Only relevant documents are retrieved.
Easy to Understand – Uses basic logical operations.
5. Disadvantages of Boolean Retrieval
No Ranking – It doesn’t rank documents by relevance.
Exact Matching Only – If a word is missing (e.g., "cat" instead of "cats"), the document
won’t be found.
Complex Queries – Long queries can be difficult to write and understand.
Q.10 Vector Space Model (VSM)
The Vector Space Model (VSM) is a method used in Information Retrieval (IR) and Text
Mining to represent documents and queries as vectors. It helps find the most relevant
document for a search query using Cosine Similarity.
1. What is the Vector Space Model?
Every document is represented as a vector of words (terms).
Words are given weights based on their importance (e.g., TF-IDF).
Documents and queries are compared using angles between their vectors.
2. Cosine Similarity: Cosine Similarity measures how similar two vectors (documents or
query & document) are by calculating the cosine of the angle between them.
If angle = 0° → Similarity = 1 (Perfect Match)
If angle = 90° → Similarity = 0 (No Match)
🔹 Formula: cos(θ) = (A · B) / (||A|| × ||B||) Where:
A · B = Dot product of vectors A and B
||A|| = Magnitude (length) of vector A
||B|| = Magnitude (length) of vector B
3. Example with Simple Calculation
Step 1: Create Document Vectors: Imagine we have two documents:
🔹 Document 1 (D1): "I love cats and dogs"
🔹 Document 2 (D2): "Cats and dogs are cute"
First, we create a vocabulary of unique words:
📌 Vocabulary: [I, love, cats, and, dogs, are, cute]
Now, represent each document as a vector:
Word    D1 (I love cats and dogs)    D2 (Cats and dogs are cute)
I       1                            0
love    1                            0
cats    1                            1
and     1                            1
dogs    1                            1
are     0                            1
cute    0                            1
So, the vectors are:
D1: [1, 1, 1, 1, 1, 0, 0]
D2: [0, 0, 1, 1, 1, 1, 1]
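Step 2: Compute Cosine Similarity. Using the formula above, D1 · D2 = 0+0+1+1+1+0+0 = 3 and ||D1|| = ||D2|| = √5, so cos(θ) = 3 / (√5 × √5) = 3/5 = 0.6. The same calculation as a short Python sketch:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))            # A · B
    norm_a = math.sqrt(sum(x * x for x in a))         # ||A||
    norm_b = math.sqrt(sum(x * x for x in b))         # ||B||
    return dot / (norm_a * norm_b)

d1 = [1, 1, 1, 1, 1, 0, 0]
d2 = [0, 0, 1, 1, 1, 1, 1]
print(round(cosine_similarity(d1, d2), 2))            # 0.6
```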
4. What Does a Similarity of 0.6 Mean?
1.0 → Exactly the same
0.6 → Quite similar
0.0 → Completely different
Since 0.6 is closer to 1, the documents are somewhat similar but not identical.
5. Why Use Cosine Similarity?
Ignores Document Length – Works even if documents have different word counts.
Handles High-Dimensional Data – Useful for large text datasets.
Common in Search Engines – Used in Google Search, Chatbots, and Recommendation
Systems.
Q.11 Bayesian Retrieval
Bayesian Retrieval is an advanced search method in Information Retrieval
(IR) that applies Bayes' Theorem to estimate the probability of a document
being relevant to a user's query.
It is based on probability theory, meaning it tries to predict how likely a document
is to be useful for a search query. The higher the probability, the more relevant the
document is.
1. Bayes’ Theorem: Bayes' Theorem calculates the probability of an event
happening based on prior knowledge. Formula: P(A|B) = P(B|A) · P(A) / P(B)
Where:
P(A | B) → Probability of A happening given B (Posterior Probability)
P(B | A) → Probability of B happening given A (Likelihood)
P(A) → Probability of A happening (Prior Probability)
P(B) → Probability of B happening (Evidence)
2. How Does Bayesian Retrieval Work?
The goal is to find P(R | Q), the probability that a document (R = Relevant) is relevant to a query (Q = Query terms).
Using Bayes' Theorem: P(R|Q) = P(Q|R) · P(R) / P(Q)
Where:
P(R | Q) → Probability that a document is relevant given the query
P(Q | R) → Probability of query terms appearing in relevant documents
P(R) → Probability that any document is relevant in general
P(Q) → Probability of the query occurring in all documents
3. Simple Example
Scenario: You are searching for the topic "Machine Learning" in a library.
We have two types of books:
📖 Relevant Books (R) → Machine Learning books
📖 Non-Relevant Books (¬R) → other books (e.g., Novels, History books)
Step 1: Given Probabilities
P(R) = 0.3 → 30% of books are related to Machine Learning.
P (¬R) = 0.7 → 70% of books are unrelated.
P (Q | R) = 0.8 → If a book is about Machine Learning, it has an 80% chance of containing the query
words ("Machine Learning").
P (Q | ¬R) = 0.2 → If a book is NOT about Machine Learning, it has only a 20% chance of containing
the words ("Machine Learning").
4. What Does 0.63 Mean?
The probability that a book is relevant to the query "Machine Learning" is 63%.
If P(R | Q) is high, the book is likely relevant.
If P(R | Q) is low, the book is not relevant.
5. Advantages of Bayesian Retrieval
Mathematically Sound – Uses probability theory.
Adaptive – Can improve as more data is available.
Works with Uncertainty – Useful when relevance is not clear.
6. Disadvantages of Bayesian Retrieval
Needs Training Data – Requires knowledge of probabilities.
Computationally Expensive – Needs calculations for many documents.
Assumes Independence of Terms – Which is not always true.
Q.12 Relevance Feedback in Information Retrieval
When you search for something online, sometimes the results are not exactly what
you want. Relevance Feedback is a way for the search system to learn from your
choices and improve future search results. How Does It Work?
1. You Search for Something
o Example: You type "best smartphones for gaming" into a search engine.
2. You See the Search Results
o The search engine shows a list of articles, reviews, and websites.
3. You Select Some Results
o If you click on some articles and ignore others, the system learns that the
ones you clicked are more relevant.
4. System Improves Future Searches
o Based on your choices, the search engine refines its understanding and
provides better, more relevant results in the future.
Types of Relevance Feedback
1. Explicit Feedback (Direct user input)
o The user gives feedback manually.
o Example: Clicking a "thumbs up" or "thumbs down" on a search result
or marking something as helpful or not helpful.
2. Implicit Feedback (Automatic learning)
o The system learns from your actions without asking you directly.
o Example: If you spend more time reading an article or click multiple
links from the same source, the system assumes you found it useful.
3. Pseudo-Relevance Feedback (PRF) (Automatic system improvement)
o The system assumes the top search results are relevant and refines the
search automatically.
o Example: If most people searching for "budget gaming laptops" click on
articles about laptops under $1000, the system learns that price is an
important factor and will show more price-related results in future
searches.
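A minimal sketch of pseudo-relevance feedback in Python: it assumes the top-ranked documents are relevant, counts their most frequent content words, and adds a couple of them to the query. The stop word list, the example documents, and the choice of two expansion terms are illustrative assumptions:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "for", "with", "and", "under", "in"}

def expand_query(query, top_documents, extra_terms=2):
    query_terms = set(query.lower().split())
    counts = Counter()
    for doc in top_documents:
        counts.update(
            w for w in doc.lower().split()
            if w not in STOP_WORDS and w not in query_terms
        )
    expansion = [term for term, _ in counts.most_common(extra_terms)]
    return query + " " + " ".join(expansion)

top_docs = [
    "cheap gaming laptops under 1000 with good GPUs",
    "cheap budget gaming laptops under 1000 reviewed",
]
print(expand_query("budget gaming laptops", top_docs))
# e.g. "budget gaming laptops cheap 1000" -> the query now reflects that price matters
```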
Why Is Relevance Feedback Useful?
Improves Search Accuracy – Helps show better results over time.
Learns User Preferences – If you always read articles from a certain website,
it may prioritize results from that site.
Reduces Irrelevant Information – Unhelpful results get pushed down in
rankings.
Real-Life Examples
Google Search – If you often click on technology news, Google may show you
more tech-related articles.
Netflix & YouTube – They recommend videos based on what you watch and
like.
Online Shopping (Amazon, eBay) – If you search for "running shoes" and
mostly look at Nike products, it will show more Nike recommendations.
Chapter-4 Spelling Correction in IR System
Q.13 Spelling Correction
Spelling errors can negatively impact search results. If a user makes a spelling
mistake, the system might fail to retrieve relevant documents or provide
incorrect suggestions. Below are the key challenges in correcting spelling errors in
IR systems:
1.1 Ambiguity in Spelling Errors
Some spelling mistakes have multiple possible corrections, making it difficult
to choose the right one.
Example:
o A user types "thier" instead of "their", but it could also be corrected to
"thief" depending on the context.
o Even a correctly spelled word like "bark" could refer to a dog's bark or tree
bark, so choosing the right correction requires context analysis.
1.2 Out-of-Vocabulary Words
Users sometimes type words that are not in the system's dictionary, such as
slang, new words, or technical terms.
Example: If a user searches for "cryptozoology", but it's not in the dictionary,
the system might incorrectly suggest something like "cryptography"
instead.
1.3 Context Sensitivity
Some misspelled words have multiple valid spellings, depending on the
context in which they appear.
Example:
o "lead" can be corrected to "led" (past tense of lead) or "lead" (a type of
metal).
o "read" can be past or present tense, depending on the sentence.
1.4 Handling Noise in Queries and Documents
Not all variations in spelling are mistakes. Some may be intentional, like:
o Slang: "gonna" instead of "going to"
o Abbreviations: "AI" for "Artificial Intelligence"
o Technical jargon: "Wi-Fi" or "WiFi"
1.5 Homophones and Homographs
Homophones: Words that sound the same but have different meanings.
o Example: "two", "to", and "too"
Homographs: Words that are spelled the same but have different meanings.
o Example: "bow" (to bend) vs. "bow" (used in archery)
1.6 Language Variations
Different regions and dialects use different spellings for the same word.
Example:
o "Colour" (US English) vs. "Colour" (UK English)
o "Organize" (US English) vs. "Organise" (UK English)
Q.14 Form of spelling correction
Spelling correction is an essential component of natural language processing and text
processing, aimed at identifying and rectifying errors in written text. It plays a critical
role in improving readability, ensuring accurate communication, and enhancing the
performance of downstream tasks like search, translation, and sentiment analysis.
There are several forms of spelling correction, each employing distinct techniques and
methodologies to address different types of errors. Below is a detailed explanation of
the primary forms of spelling correction.
1. Manual Spelling Correction: This is when a person checks their own spelling
mistakes and corrects them. People do this by reading their text carefully or using a
dictionary to find the correct spelling. Example: A student writes "recieve" and later
checks the dictionary to correct it to "receive."
2. Automatic Spelling Correction: Many software programs and apps can
automatically fix spelling mistakes as you type. This is common in word processors (like
Microsoft Word), search engines (like Google), and phone keyboards. Example: When you
type "hte" on your phone, it automatically changes to "the."
3. Rule-Based Spelling Correction: This method follows fixed rules of spelling and
grammar. If a word does not follow a correct pattern, the system suggests the correct
spelling. Example: English words usually don’t start with "kz," so if someone types
"kzebra," the system may suggest "zebra."
4. Dictionary-Based Spelling Correction: A dictionary is used to check if a word is
spelled correctly. If a word is not in the dictionary, the system suggests a similar word
that exists in the dictionary. Example: If you type "teh," the system sees that this word is
not in the dictionary and suggests "the."
5. Statistical Spelling Correction: This method looks at common mistakes that people
make and chooses the most likely correct word. It is often used in search engines.
Example: If many people type "definately" instead of "definitely," the system learns that
"definitely" is the correct spelling and suggests it.
6. Context-Based Spelling Correction: Some words are spelled correctly but used in
the wrong place. This method looks at the sentence to decide if the word is correct or not.
Example: "I love to eat stake." → the correct word should be "steak" (food), not "stake" (a
wooden post).
7. AI and Machine Learning-Based Spelling Correction: This is the most advanced
type. Computers learn from a large amount of text and correct spelling mistakes
automatically. Modern autocorrect systems and AI tools like Grammarly use this method.
Example: If you type "I am gona go their," AI knows that "gona" should be "gonna" or
"going to," and "their" should be "there."
Q.15 Evaluation metrics
Evaluation metrics in information retrieval are used to measure how well a search
system or retrieval model performs in finding and ranking relevant documents for a
given query. These metrics help us understand the effectiveness of the system and
compare different retrieval methods. Let’s break down the most common evaluation
metrics in simple terms:
1. Precision: Precision tells us how many of the retrieved documents are actually
relevant. It focuses on the quality of the results. How to calculate:
Precision=Number of Relevant Documents Retrieved /
Total Number of Documents Retrieved
Example: If a search engine retrieves 10 documents and 7 of them are relevant, the
precision is 7/10 = 0.7 (or 70%). A high precision means the system is making fewer
incorrect suggestions.
2. Recall: Recall tells us how many of the total relevant documents in the collection
were successfully retrieved. It focuses on the completeness of the results. How to
calculate:
Recall=Number of Relevant Documents Retrieved /
Total Number of Relevant Documents in the Collection
Example: If there are 20 relevant documents in the collection and the system
retrieves 12 of them, the recall is 12/20 = 0.6 (or 60%). A high recall means the
system finds most of the relevant documents in the collection.
3. F1-Score: The F1-score is a balance between precision and recall. It is the
harmonic mean of the two, giving equal importance to both metrics. How to
calculate:
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Example: If precision is 0.7 and recall is 0.6, the F1-score is:
(2 × 0.7 × 0.6) / (0.7 + 0.6) = 0.84 / 1.3 ≈ 0.646
A high F1-score means a good balance between precision and recall.
4. Accuracy: Accuracy tells us how often the system is correct in retrieving relevant
documents and excluding irrelevant ones. How to calculate:
Accuracy=Number of Correct Decisions / Total Number of Decisions
Example: Suppose there are 100 documents, 20 of which are relevant, and the system
retrieves 15 relevant and 70 irrelevant documents. The correct decisions are the 15
relevant documents retrieved plus the 10 irrelevant documents that were correctly left
out (15 documents were not retrieved, of which 5 are relevant and 10 are irrelevant).
Accuracy = (15 + 10) / 100 = 0.25. Higher accuracy means the system makes more correct
retrieval decisions overall.
5. Mean Average Precision (MAP): MAP is an extension of precision that evaluates
the system’s performance across multiple queries. It calculates the average precision
for each query and then takes the mean of these values. How to calculate:
Compute precision at each point where a relevant document is retrieved.
Average these precision values for each query.
Take the mean of these averages across all queries.
Example: For two queries:
o Query 1: Precision values at relevant documents = [1.0, 0.67, 0.75] →
Average = (1.0 + 0.67 + 0.75)/3 = 0.81
o Query 2: Precision values at relevant documents = [0.5, 0.4] → Average =
(0.5 + 0.4)/2 = 0.45
o MAP = (0.81 + 0.45)/2 = 0.63
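A small Python sketch of these metrics, using the numbers from the examples above (precision 0.7, recall 0.6, and Query 1 of the MAP example with relevant results at ranks 1, 3 and 4):

```python
def precision(relevant_retrieved, total_retrieved):
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, total_relevant):
    return relevant_retrieved / total_relevant

def f1(p, r):
    return 2 * p * r / (p + r)

p = precision(7, 10)            # 0.7 (7 of 10 retrieved documents are relevant)
r = recall(12, 20)              # 0.6 (12 of 20 relevant documents were found)
print(round(f1(p, r), 3))       # ~0.646

def average_precision(ranked_relevance):
    # ranked_relevance: True/False flags for the ranked results of one query
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank        # precision at this relevant result
    return total / hits if hits else 0.0

# Query 1 from the MAP example: relevant results at ranks 1, 3 and 4.
print(round(average_precision([True, False, True, True]), 2))   # ~0.81
```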
Q.16 designing an experiment to evaluate an Information
Retrieval (IR) system
When we create a search engine or an Information Retrieval (IR) system, we
need to test how well it works. Experimental design helps us properly test and
evaluate the system in a structured way. Steps in Experimental Design for IR
Evaluation
1. Define the Objective
Before starting an experiment, we must clearly define:
What are we testing?
What improvements do we expect?
Example: “We want to check if our new search algorithm retrieves more relevant
documents compared to the existing system.”
2. Select a Test Collection
A test collection includes:
A document collection – A set of indexed documents (e.g., news articles, research
papers).
A set of queries – User search queries (e.g., “Best gaming laptops 2024”).
Relevance judgments – Labels indicating which documents are relevant for each
query.
Examples of Standard IR Datasets:
TREC (Text Retrieval Conference)
CLEF (Cross-Language Evaluation Forum)
NTCIR (NII Test Collection for IR Systems)
3. Choose Evaluation Metrics 📏
We need measurement criteria to compare performance.
Common IR evaluation metrics:
o Precision – How many retrieved results are relevant?
o Recall – How many relevant results were retrieved?
o F1-score – A balance of precision and recall.
o Mean Average Precision (MAP) – Measures ranking effectiveness.
o Normalized Discounted Cumulative Gain (NDCG) – Gives more
weight to highly ranked relevant results.
4. Select a Baseline System 🏛
We compare the new IR system with an existing system to check improvements.
Example:
Old search system → Achieves 60% Precision
New search system → Achieves 70% Precision
Improvement? We need statistical tests to confirm.
5. Conducting the Experiment
1. Run search queries on both new and old IR systems.
2. Retrieve and rank documents for each query.
3. Compare performance using evaluation metrics.
4. Store results for statistical analysis.
6. Statistical Significance Testing
To check if improvements are real or just happened by chance, we use
statistical tests.
Common tests:
o T-test (for comparing two systems).
o Wilcoxon test (for ranking-based comparison).
o Chi-square test (for categorical results).
If p-value < 0.05, the new system’s improvement is statistically significant.
Real-Life Example
📌 Google Search Updates – Google runs thousands of experiments before updating
its ranking system.
📌 E-commerce Search Engines – Amazon tests new search models to improve
product recommendations.
📌 Academic Research – Researchers compare IR models using standard datasets like
TREC or CLEF.
Q.17 Significance testing
When we create a new system or algorithm, we need to check if it actually
performs better than the old one. Significance testing helps us determine if the
difference in performance is real or just random chance.
Imagine you launch a new medicine and want to see if it really works better than the
old one.
Group 1 (Old Medicine): 70% of patients recover.
Group 2 (New Medicine): 75% of patients recover.
The new medicine looks better, but is this small 5% difference real or just luck?
Significance testing helps us check if the difference is statistically meaningful.
Key Concepts in Significance Testing
1. Null Hypothesis (H₀):
o This is the default assumption that there is no difference between the
systems being compared.
o Example: "There is no difference in performance between System A and
System B."
2. Alternative Hypothesis (H₁):
o This is the opposite of the null hypothesis. It states that there is a
difference between the systems.
o Example: "System A performs better than System B."
3. p-value:
o The p-value is a probability that measures the strength of evidence
against the null hypothesis.
o A low p-value (typically < 0.05) means the observed difference is unlikely
to have occurred by chance, so we reject the null hypothesis.
o A high p-value means the difference could be due to random variation, so
we fail to reject the null hypothesis.
4. Confidence Level:
o This is the threshold for deciding whether a result is significant. A common
confidence level is 95%, meaning we are 95% confident that the
difference is real.
How Does Significance Testing Work?
Step 1: Define Hypotheses
Null Hypothesis (H₀): The new system is not significantly better than the
old one. Any difference is due to chance.
Alternative Hypothesis (H₁): The new system is significantly better than
the old one.
Step 2: Collect Performance Data
We measure performance using metrics like accuracy, precision, recall, F1-
score, or Mean Average Precision (MAP).
Example: We test both algorithms on 1000 searches and record their accuracy.
Step 3: Choose a Statistical Test
Common tests used in performance evaluation:
o T-test → compares two sets of results.
o Wilcoxon signed-rank test → Used when the data is not normally
distributed.
o Chi-square test → Used for categorical performance data.
Step 4: Compute the p-value
The p-value tells us how likely the difference is just random chance.
If p < 0.05, the new system’s improvement is statistically significant (95%
confidence).
If p > 0.05, the improvement might be due to randomness.
Step 5: Make a Decision
If p < 0.05 → we reject the null hypothesis and say the new system is
significantly better.
If p > 0.05 → we keep the old system because the difference is not meaningful.
Real-Life Examples
✅ Google Search Updates – Google tests new ranking algorithms to see if they really
improve search results.
✅ Self-Driving Cars – Evaluates whether new AI models make fewer driving mistakes.
✅ Medical AI Systems – Tests if a new AI can detect diseases better than an old
one.
Chapter-6-Text Categorization and Filtering
Q.1 Text Classification or Categorization Algorithms
Text classification, also called text categorization, is the process of automatically
assigning labels or categories to a given text. This is widely used in various
applications, such as spam detection in emails, sentiment analysis in social media
posts, and categorizing news articles into topics like sports, politics, or entertainment.
The idea behind text classification is simple: we have a collection of texts, and we
want a computer to learn how to classify them correctly into predefined categories. To
achieve this, we use different types of algorithms that analyse text data and make
predictions. These algorithms can be broadly categorized into rule-based methods,
machine learning models, and deep learning techniques.
Rule-Based Methods: In the early days of text classification, rule-based systems
were commonly used. These systems rely on manually created rules to determine
which category a text belongs to. For example, if an email contains words like
“lottery,” “win,” or “prize,” it might be classified as spam. Similarly, if a news article
contains words like “football,” “goal,” or “match,” it may be classified under sports.
While rule-based methods are easy to understand and implement, they have a major
drawback: they require constant updating. Language changes over time, and new
words and phrases appear frequently.
Machine Learning-Based Methods: As technology advanced, machine learning
became a more effective approach to text classification. Instead of relying on
manually written rules, machine learning models learn from past examples. These
models are trained on a dataset where each text is already labelled with its correct
category. After training, the model can predict the category of new, unseen text.
Naïve Bayes Classifier: One of the simplest and most popular machine learning
algorithms for text classification is the Naïve Bayes classifier. This algorithm is
based on probability theory and assumes that words in a text are independent of each
other. For example, if the word “discount” appears frequently in spam emails but
rarely in regular emails, the Naïve Bayes classifier will consider an email containing
“discount” as more likely to be spam. Despite its simplicity, Naïve Bayes performs well
for tasks like spam filtering and sentiment analysis.
Support Vector Machine (SVM): Another commonly used machine learning
algorithm for text classification is the Support Vector Machine (SVM). SVM works
by finding the best way to separate different categories of text using mathematical
boundaries. It looks at the features (words) in the text and tries to draw a line that
best separates one category from another.
For example, if we want to classify movie reviews as “positive” or “negative,” SVM will
learn from examples and find a way to differentiate between words associated with
positivity (e.g., “great,” “amazing,” “fantastic”) and negativity (e.g., “boring,”
“terrible,” “worst”). Once trained, SVM can classify new reviews with high accuracy.
Deep Learning-Based Methods: With the rise of artificial intelligence, deep
learning models have become the most powerful tools for text classification. These
models, inspired by the human brain, are capable of understanding complex
relationships in text data. They require large amounts of data and computing power,
but they offer the highest accuracy in classification tasks.
Recurrent Neural Networks (RNN) and LSTMs: Recurrent Neural Networks (RNNs)
and their improved version, Long Short-Term Memory (LSTM) networks, are designed
to handle sequential data like sentences or paragraphs. Unlike traditional machine
learning models, which treat each word separately, RNNs and LSTMs remember
previous words in a sentence and use that information to better understand context.
For example, in sentiment analysis, the phrase “not good” should be classified as
negative, even though the word “good” alone is usually positive. RNNs and LSTMs can
capture such dependencies and provide more accurate classifications.
Q.2 Naïve Bayes text classification
Naïve Bayes is a probability-based algorithm that helps classify text by calculating
how likely a piece of text belongs to a certain category. It works like a detective who
looks at clues (words in the text) and decides which category the text belongs to,
based on past experience.
Imagine you are running a book review website, and you want to automatically
categorize each review as either positive or negative. Instead of manually reading
each review, you want a computer program to do it for you. This is where Naïve
Bayes comes in!
For example, let’s say we have the following reviews:
Positive reviews:
1. "This book is amazing and well-written."
2. "A fantastic and enjoyable story."
3. "I loved the book, it was excellent."
Negative reviews:
1. "This book is terrible and boring."
2. "A disappointing and dull story."
3. "I hated the book, it was awful."
Now, suppose a new review appears: "The book is excellent and enjoyable."
Naïve Bayes will check how often the words "excellent" and "enjoyable" appeared in
positive and negative reviews. Since these words were mostly found in positive
reviews, it will classify this new review as positive.
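A minimal sketch of this book-review example using scikit-learn's CountVectorizer and MultinomialNB (one common Naïve Bayes implementation); the training reviews are the ones listed above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "This book is amazing and well-written.",
    "A fantastic and enjoyable story.",
    "I loved the book, it was excellent.",
    "This book is terrible and boring.",
    "A disappointing and dull story.",
    "I hated the book, it was awful.",
]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

vectorizer = CountVectorizer()                 # turn each review into word counts
X = vectorizer.fit_transform(reviews)
model = MultinomialNB().fit(X, labels)         # learn word probabilities per class

new_review = ["The book is excellent and enjoyable."]
print(model.predict(vectorizer.transform(new_review)))   # ['positive']
```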
Real-Life Example: Spam Detection
One of the most common uses of Naïve Bayes is email spam detection. If an email
contains words like "free," "lottery," "win," "money," "urgent", it is more likely to
be spam. The Naïve Bayes classifier learns from past emails and calculates the
probability of a new email being spam based on the words it contains. For example, if
an email says:
"Congratulations! You have won a free iPhone. Click here to claim your prize!"
The algorithm checks how often words like "congratulations," "won," "free," and
"prize" appeared in spam emails before. If these words appeared frequently in past
spam messages, the new email is marked as spam.
Q.3 Support Vector Machine (SVM)
Imagine you work for a company that receives thousands of customer reviews daily,
and you need to sort them into categories like positive or negative. Instead of doing
this manually, you want a machine to do it for you. One of the best tools for this job is
the Support Vector Machine (SVM) algorithm.
SVM is like a super-smart decision-maker that draws a boundary between different
categories of text. It helps classify new text based on what it has learned from past
examples.
How SVM Works with Text
Think of a game where you have a basket full of apples and oranges. Your job is to
draw a straight line on the table to separate apples from oranges. Now, if you get a
new fruit, you can simply check which side of the line it falls on and decide if it’s an
apple or an orange.
SVM does the same thing, but with words instead of fruits. It finds the best possible
boundary (a straight line in simple cases, but more complex shapes if needed) that
separates two different types of text.
For example, let’s say we want to classify customer reviews as positive or negative.
A positive review might contain words like "good," "excellent," "fantastic,"
"loved," etc.
A negative review might have words like "bad," "terrible," "boring," "hated,"
etc.
SVM looks at past reviews and finds the best way to separate these two groups based
on the words they contain. Once it has drawn a clear boundary, it can classify any
new review by checking which side of the boundary it falls on.
Example of SVM in Action: Imagine we have these customer reviews:
Positive reviews:
"The product is excellent and works perfectly."
"I loved the movie, it was fantastic."
"Great service very satisfied!"
Negative reviews:
"The product is terrible and broke in a day."
"I hated the movie, it was so boring."
"Worst service ever, completely disappointed."
Now, a new review comes in: "The product is great but a bit expensive."
SVM will analyse the words in this review and compare them to past reviews. Since it
contains a strong positive word ("great"), it will likely classify this review as positive.
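A minimal sketch of this example with a linear SVM and TF-IDF features in scikit-learn; the training reviews are the ones listed above, and the prediction comment is the expected outcome rather than a guarantee:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

reviews = [
    "The product is excellent and works perfectly.",
    "I loved the movie, it was fantastic.",
    "Great service very satisfied!",
    "The product is terrible and broke in a day.",
    "I hated the movie, it was so boring.",
    "Worst service ever, completely disappointed.",
]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)
classifier = LinearSVC().fit(X, labels)        # learns the separating boundary

new_review = ["The product is great but a bit expensive."]
print(classifier.predict(vectorizer.transform(new_review)))
# expected: ['positive'], driven by the positive word "great"
```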
Where is SVM Used in Real Life?
SVM is widely used in spam detection, sentiment analysis, news classification,
and even medical diagnosis where text-based symptoms are analysed to identify
diseases.
For example, in email spam filtering, if an email contains words like "lottery," "win,"
"free money," it is likely to be spam. SVM will analyse past emails and create a
boundary between spam and non-spam emails, helping your inbox stay clean.
Q.4 Feature selection and dimensionality text classification
Imagine you are trying to sort a huge pile of books into different categories like
science, history, and fiction. If each book has thousands of pages, it would take
forever to read every word to classify it. Instead, you could focus only on the
important words—like "experiment" for science, "king" for history, or "fantasy" for
fiction. This is exactly what feature selection and dimensionality reduction do in
text classification.
When we process text, each word is like a feature. A document (like an email or a
review) can contain thousands of words, but not all words are useful for
classification. Some words, like "the," "and," or "is," appear in almost every document
and don’t help us decide if a text is about sports or politics. Feature selection helps
pick the most important words, while dimensionality reduction helps simplify the
data without losing meaning.
Examples: News Articles
1. "The team won the championship after an intense match."
2. "The government passed a new law regarding education."
Words like "team," "championship," and "match" are clearly related to sports,
while "government," "law," and "education" belong to politics. However, words
like "the," "after," and "a" appears in both articles and do not help classification.
Feature selection removes these common words and keeps only the meaningful ones.
There are different ways to select features:
Removing Stopwords: Words like "the," "is," and "and" are removed because
they appear everywhere.
Term Frequency (TF): Words that appear more often in a specific category are
considered important.
TF-IDF (Term Frequency-Inverse Document Frequency): Words that are
frequent in one category but rare in others get higher importance.
For example, if the word "goal" appears frequently in sports articles but rarely in
political articles, TF-IDF will give it a high weight for sports classification.
Dimensionality Reduction: Simplifying the Data
Imagine you are analysing 100,000 emails to detect spam. Each email contains
hundreds or thousands of words, creating a massive amount of data. If we try to use
all words for classification, the computer will struggle to process so much information.
This is where dimensionality reduction helps.
Instead of using every single word, we reduce the number of features by keeping only
the most relevant patterns and relationships. This makes processing faster and
improves accuracy.
Example: Movie Reviews Sentiment Analysis
Let’s say we are classifying movie reviews as positive or negative. Instead of looking
at all words, we focus on a few key ones:
Positive words: "excellent," "amazing," "loved," "great," "enjoyed"
Negative words: "terrible," "boring," "disappointing," "hated"
Even if a review contains hundreds of words, just looking at these key words can
help us classify it correctly.
There are several techniques for dimensionality reduction:
Principal Component Analysis (PCA): Finds the most important patterns in
the data and removes redundant information.
Latent Semantic Analysis (LSA): Groups similar words together based on
meaning.
Word Embeddings (like Word2Vec or BERT): Convert words into
numerical values based on their relationships and context.
For example, in spam detection, if we notice that emails containing "win" and "free"
often belong to spam, we can reduce the dataset by focusing on just these important
words instead of analysing every single word in the email.
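A minimal sketch combining these ideas in scikit-learn: TF-IDF features, chi-square feature selection, and LSA via TruncatedSVD for dimensionality reduction. The tiny corpus and the choice of k are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD

docs = [
    "The team won the championship after an intense match.",
    "The striker scored a late goal in the match.",
    "The government passed a new law regarding education.",
    "Parliament debated the new education law today.",
]
labels = ["sports", "sports", "politics", "politics"]

tfidf = TfidfVectorizer(stop_words="english")        # also removes stop words
X = tfidf.fit_transform(docs)

# Feature selection: keep the 4 terms most associated with the labels.
selector = SelectKBest(chi2, k=4).fit(X, labels)
print(tfidf.get_feature_names_out()[selector.get_support()])

# Dimensionality reduction: compress the TF-IDF matrix to 2 latent dimensions.
lsa = TruncatedSVD(n_components=2, random_state=0)
print(lsa.fit_transform(X).shape)                    # (4, 2)
```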
Why Are Feature Selection and Dimensionality Reduction Important?
1. Faster Processing: Instead of analysing thousands of words, we only focus on
the most relevant ones, making classification much quicker.
2. Better Accuracy: Removing unnecessary words helps the algorithm make
better decisions.
3. Reduces Overfitting: If we use too many words, the model may learn
irrelevant details that don’t generalize well to new text.
Real-Life Applications
Spam detection: Focusing on words like "free," "win," and "limited offer"
instead of analysing entire emails.
Sentiment analysis: Using key words like "love" and "hate" to classify
customer reviews.
News classification: Selecting words like "match" and "goal" for sports news
and "law" and "election" for political news.
Q.5 Applications of Text Categorization and Filtering
Text categorization and filtering are widely used in many areas of technology,
business, and daily life. These techniques help automatically sort, classify, and
manage large volumes of text data efficiently. Let’s explore some of the key
applications with examples.
1. Email Spam Filtering
One of the most common applications is filtering spam emails from important ones.
Email providers like Gmail, Yahoo, and Outlook use text classification to detect
spam based on the words in the email.
Example: If you receive an email with the subject "Congratulations! You have won
$1,000,000!", the system detects that similar emails were previously marked as spam
and moves it to the spam folder.
2. Sentiment Analysis (Opinion Mining)
Companies use text categorization to analyse customer reviews, social media posts,
and feedback to determine if the sentiment is positive, negative, or neutral.
Example: A company selling smartphones analyses online reviews. If many
customers use words like "battery drains fast," "slow performance," it signals a
problem. The company can use this feedback to improve future products.
3. News Classification and Personalization
News websites and apps use text categorization to group articles into categories like
Politics, Sports, Technology, Entertainment, and Health. Example: Google
News learns that a user reads a lot of sports news. It starts showing them more
sports-related articles and fewer political ones.
4. Social Media Monitoring and Hate Speech Detection
Social media platforms use text classification to filter inappropriate content, hate
speech, and cyber bullying.
Example: If someone posts a comment with hate speech on Twitter, the system flags
it and removes it automatically.
5. Healthcare and Medical Text Analysis
Hospitals and research institutions use text classification to analyse medical
records, research papers, and patient symptoms.
Example: A hospital's AI system detects that multiple patients in a city are reporting
symptoms like "fever, cough, breathing issues", and flags a possible outbreak of a
disease.
6. Search Engine Optimization (SEO) and Ad Targeting
Search engines like Google categorize webpages based on content to show the most
relevant results.
Example: If you search for "best smartphones under $500," Google shows
product listings and ads for smartphones instead of laptops or tablets.
Chapter-7 Text Clustering for Information Retrieval
Q.6 Clustering and its Techniques
Imagine you have a big box of different types of candies, but they are all mixed
together. You want to organize them into groups based on colour, shape, or
flavour. However, you don’t know the exact types of candies present. So, you start
grouping similar candies together based on how they look and taste.
This is exactly what clustering does in machine learning—it automatically groups
similar data points together without any prior labels. Unlike classification (where
we already know the categories), clustering helps us find hidden patterns in data.
For example, if an e-commerce website wants to understand customer behaviour, it
can use clustering to group customers based on their shopping habits. This way,
customers who buy electronics frequently will be in one cluster, while those who shop
for clothing will be in another.
Clustering Techniques:
1. K-Means Clustering
Imagine you are organizing a large group of students in a school. You want to divide
them into different teams based on their height. Instead of sorting them manually, you
decide to use a simple rule: create groups based on similarity. This is exactly how K-
Means Clustering works—it groups similar things together.
K-Means is an unsupervised machine learning algorithm, which means it finds
patterns in data without any predefined labels. It is one of the most commonly
used clustering techniques because it is simple and effective.
K-Means is one of the simplest and most popular clustering algorithms. It works like this:
1. You choose a number, K, which represents the number of clusters you want.
2. The algorithm randomly picks K points as the centre of these clusters.
3. Each data point is assigned to the nearest cluster centre.
4. The cluster centres are updated based on the points assigned to them.
5. This process repeats until the cluster centres don’t change much.
Example: Imagine you are a gym owner and want to categorize your members based
on workout habits. If you choose K=3, K-Means might group them into:
People who visit the gym daily (fitness enthusiasts).
People who come occasionally (casual gym-goers).
People who signed up but rarely come (inactive members).
This helps you create personalized offers for each group.
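A minimal sketch of the gym example in code, assuming scikit-learn is available; the visit counts and the choice of K=3 are made up for illustration.

```python
# Hedged sketch: grouping gym members by visits per month with K-Means.
# The visit counts below are made-up example data.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one member: [visits per month]
visits = np.array([[28], [25], [30],    # near-daily visitors
                   [8], [10], [12],     # occasional visitors
                   [1], [0], [2]])      # inactive members

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(visits)

print("Cluster assignments:", labels)
print("Cluster centres:", kmeans.cluster_centers_.ravel())
```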
2. Hierarchical Clustering
Imagine you are a teacher and you want to group students in your class based on
their skills. However, you don’t know how many groups you need. So, you start by
pairing the most similar students together based on their abilities. Then, you
merge smaller groups into bigger groups until the entire class is grouped into
meaningful clusters. This is how Hierarchical Clustering works—it keeps merging or
splitting groups step by step until a clear structure appears.
Hierarchical Clustering is a type of unsupervised learning algorithm that helps
organize data into a tree-like structure called a dendrogram. Unlike K-Means
Clustering, where you need to decide the number of clusters beforehand, Hierarchical
Clustering automatically finds the structure in the data.
How Hierarchical Clustering Works
Hierarchical Clustering builds clusters gradually. It can be done in two ways:
1. Agglomerative Clustering (Bottom-Up Approach)
o It starts with each data point as its own cluster.
o Then, it merges the closest points into small clusters.
o These small clusters keep merging into bigger clusters until only one big
cluster remains.
o This method is more commonly used.
2. Divisive Clustering (Top-Down Approach)
o It starts with one big cluster containing all data points.
o The cluster is then split into smaller groups step by step.
o It keeps dividing until each point becomes its own separate cluster.
The result of this merging or splitting is represented in a dendrogram, which is a
tree-like diagram showing how data points are grouped.
Example of Hierarchical Clustering
Example: A university can use hierarchical clustering to analyse students' academic
performance. It may first divide students into high, average, and low scorers, and
then further divide them into groups based on subjects.
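A minimal sketch of bottom-up (agglomerative) clustering with SciPy; the exam scores, the Ward linkage choice, and the cut into three clusters are illustrative assumptions.

```python
# Hedged sketch: agglomerative (bottom-up) clustering of students by
# exam scores. The scores are made-up example data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each row is one student: [maths score, science score]
scores = np.array([[92, 88], [85, 90],   # high scorers
                   [65, 70], [60, 72],   # average scorers
                   [35, 40], [30, 45]])  # low scorers

# Build the merge tree (the data behind a dendrogram) using Ward linkage
tree = linkage(scores, method="ward")

# Cut the tree into 3 clusters
groups = fcluster(tree, t=3, criterion="maxclust")
print("Cluster labels:", groups)
```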
Where is Hierarchical Clustering Used?
1. Customer Segmentation: Businesses use it to group customers based on
purchasing behaviour.
2. Image Segmentation: Helps in dividing images into meaningful parts for
analysis.
3. Medical Research: Groups patients based on similar symptoms for better
disease diagnosis.
4. Social Network Analysis: Finds communities within a network by grouping
similar users together.
Q.7 Query Expansion and Result Grouping
Query expansion is a technique that helps improve search results by modifying or
expanding the user's query. When we type a short or vague query, the search
system automatically adds more words, synonyms, or related phrases to refine
the search. This helps in retrieving better and more useful results.
For example, imagine you search for "Apple laptop":
The search engine might expand your query to include "MacBook," "MacBook
Pro," "MacBook Air," because it knows that these are related terms.
If you search for "mobile phone", the system might add terms like
"smartphone," "cell phone," "iPhone," "Android phone" to bring better
results.
Types of Query Expansion
1. Synonym Expansion
o The system finds synonyms of the search term to include in the search.
o Example: Searching for “cheap flights” might also look for “low-cost
flights” or “affordable airfare”.
2. Stemming and Lemmatization
o The system includes different word forms of the same word.
o Example: Searching for “running” might also include “run,” “ran,” and
“runner”.
3. Relevance Feedback
o The system improves searches based on what users previously clicked on.
o If many users searching for “best smartphones” clicked on “iPhone
14”, the system may show iPhone results higher in future searches.
4. Using Related Phrases
o The search system adds related phrases to the query.
o Example: Searching for “COVID vaccine” might also retrieve
“coronavirus immunization” and “COVID-19 shot”.
5. Using User Behaviour and Trends
o Search engines analyze what most people search for and expand
queries accordingly.
o Example: If many users searching for “NBA” also search for “LeBron
James”, the system may add LeBron-related results when someone
searches for NBA.
Query expansion makes search engines more intelligent by understanding the real
intent behind a query and bringing more useful results.
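A minimal sketch of synonym-based query expansion; the tiny synonym dictionary stands in for a real thesaurus or query-log mining component and is purely illustrative.

```python
# Hedged sketch: a toy synonym-based query expansion.
# The synonym dictionary is a made-up stand-in for a real thesaurus.
SYNONYMS = {
    "cheap": ["low-cost", "affordable"],
    "flights": ["airfare", "air tickets"],
    "mobile": ["smartphone", "cell phone"],
}

def expand_query(query: str) -> list[str]:
    """Return the original terms plus any known synonyms."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("cheap flights"))
# ['cheap', 'low-cost', 'affordable', 'flights', 'airfare', 'air tickets']
```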
What is Result Grouping?
Result grouping is another technique that improves search results by organizing
similar results together. Instead of showing all search results in a long and
unstructured list, the search engine groups related results to make it easier for
users to find what they need.
For example, if you search for “Tesla”, there could be multiple meanings:
1. Tesla, the electric car company
2. Nikola Tesla, the scientist
3. Tesla stock (TSLA) news
Instead of mixing up all these results, search engines group them into categories
so that you can quickly find the information you are looking for.
Chapter-8 Web Information Retrieval
Q.8 Web Search or Web Search Architecture
Web search is the process of finding information on the internet using a search engine
like Google, Bing, or Yahoo. When we type a question or keywords into a search
bar, the search engine looks through billions of web pages and shows us the most
relevant results. This process happens within seconds, thanks to advanced
algorithms that organize and rank information.
Main Components of Web Search Architecture
1. Web Crawling (Finding Web Pages)
The first step in web search is discovering web pages. Search engines use Web
Crawlers (also called Bots or Spiders) to browse the internet and collect
information from websites.
These crawlers start with a list of known web pages and follow hyperlinks to
discover new pages.
The bot downloads the page content, reads the text, and saves the links for
future crawling.
Crawling happens regularly to keep search results updated.
Example: Google’s web crawler, Googlebot, visits a new blog, reads its content, and
follows links on the page to find other related blogs.
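A minimal sketch of this crawling step, assuming the requests and BeautifulSoup libraries; it downloads one page and collects its outgoing links. A real crawler would also respect robots.txt, rate limits, and revisit policies, and the URL shown is only a placeholder.

```python
# Hedged sketch: a very small crawling step that downloads one page and
# collects the links on it.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl_page(url: str) -> list[str]:
    """Download one page and return the absolute URLs it links to."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

# Example usage (placeholder URL, not fetched here):
# links = crawl_page("https://example.com")
# print(links[:10])
```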
2. Indexing (Organizing & Storing Information)
After a page is crawled, the next step is indexing, which means storing the
information in a structured way so it can be searched easily.
The search engine analyses the text, images, and other content on the page.
It removes unnecessary words (like "the", "is", "a") to focus on important
keywords.
The processed data is then stored in a large database called the Search
Index.
Think of it as a giant digital library where every webpage is categorized based on
keywords and topics.
Example:
If a webpage talks about “Best Pizza Restaurants”, the search engine will index
words like pizza, best, restaurant, food, delivery, Italian, etc. When someone
searches for “best pizza places”, this page may appear in the results.
3. Query Processing (Understanding User Search)
When a user types a search query (e.g., "best smartphones 2024"), the search
engine must understand what they mean before finding results. This step is called
query processing.
The search engine breaks down the query into important keywords.
It checks for spelling mistakes and suggests corrections (e.g., "Did you
mean...").
It expands the query by adding synonyms (e.g., including "mobile phone" alongside
"smartphone").
It understands user intent – Does the user want to buy a phone or read
reviews?
Example:
If someone searches "cheap flights to Paris", the search engine understands:
“Cheap” means low-cost
“Flights” refers to airlines
“Paris” is a location
The user wants to book a flight, not learn about the history of Paris.
4. Ranking (Choosing the Best Results)
Once the search engine finds matching pages, it must rank them in the best order.
The goal is to show the most relevant and high-quality results first.
Ranking is done using complex algorithms that consider factors like:
Relevance – Does the page contain information matching the search query?
Page Authority – Is the website trusted and popular?
Content Quality – Is the content well-written and useful?
Freshness – Is the page recently updated?
User Experience – Does the page load fast and work well on all devices?
Example:
If you search for “best laptops 2024”, search engines rank:
1. Trusted tech websites (e.g., TechRadar, CNET) higher.
2. Recently updated reviews above older ones.
3. Pages with useful comparisons and detailed reviews above those with
little content.
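As a rough illustration only, the sketch below combines a few of these factors with hand-picked weights; real search engines use many more signals and learn the weights rather than fixing them by hand.

```python
# Hedged sketch: combining several ranking factors into one score with
# made-up weights and made-up pages.
def rank_score(page: dict) -> float:
    """Weighted sum of a few illustrative ranking factors (all in 0..1)."""
    return (0.4 * page["relevance"] +
            0.3 * page["authority"] +
            0.2 * page["freshness"] +
            0.1 * page["user_experience"])

pages = [
    {"name": "Trusted tech review", "relevance": 0.9, "authority": 0.9,
     "freshness": 0.8, "user_experience": 0.9},
    {"name": "Old blog post", "relevance": 0.7, "authority": 0.3,
     "freshness": 0.2, "user_experience": 0.5},
]
for p in sorted(pages, key=rank_score, reverse=True):
    print(p["name"], round(rank_score(p), 2))
```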
5. Retrieval & Display (Showing Results to the User)
The final step is retrieving the top-ranked pages and displaying them in an easy-
to-read format.
Search engines show results in different formats, such as:
Blue links – Standard webpage results.
Featured snippets – Direct answers at the top of search results.
Image & video results – For searches related to media.
News results – For trending and current events.
Local results – Maps and business listings for location-based searches.
Example:
If you search for "How to bake a cake", Google may show:
A featured snippet with step-by-step instructions.
YouTube videos with baking tutorials.
Blog posts with detailed recipes.
Challenges in Web Search
Although search engines are powerful, they still face challenges:
Fake News & Misinformation – Some websites spread false information,
making it hard for search engines to filter reliable sources.
SEO Manipulation – Some websites try to trick search engines by overusing
keywords, making it harder to show truly relevant content.
Privacy Issues – Personalized search results require collecting user data, which
raises concerns about online privacy.
To deal with these challenges, search engines constantly update their algorithms
to improve accuracy and security.
Chapter-9 Learning to rank
Q.1 Learning to Rank (LTR)
Learning to Rank (LTR) is a technique used by search engines and recommendation
systems to improve the order of search results based on what is most relevant to
users. Instead of just using simple rules, LTR uses machine learning to understand
what makes one result better than another.
Think of it like a smart librarian who learns over time which books people find most
useful and starts recommending those first. Similarly, search engines like Google,
Bing, and Amazon use LTR to improve their rankings based on user behaviour and
data.
Why Do We Need Learning to Rank?
Traditional search ranking relies on fixed rules (like counting keyword matches), but
these rules do not always give the best results.
Example:
If you search for "best budget smartphones", a simple ranking system might:
1. Count how many times "best," "budget," and "smartphones" appear in a
webpage.
2. Rank pages based on word frequency, without understanding which reviews
are truly helpful.
Problem:
A bad review saying "This is the worst budget smartphone" may rank high just
because it contains the words "best" and "budget."
Better reviews that actually help users might be ranked lower.
LTR solves this by learning from past user interactions (e.g., which links people
click more often) to improve rankings.
How Learning to Rank Works?
LTR is based on machine learning models that learn from training data. The
process follows these steps:
1. Collect Training Data
o Search engines record data like which links users click, how long they
stay on a page, and which results they ignore.
o This helps understand which results are useful and which ones are not.
2. Extract Features
o LTR looks at different ranking factors (features) such as:
Relevance (Does the page match the query?)
Page Authority (Is the website trusted?)
User Behaviour (Do people click on this result often?)
3. Train a Model
o The search engine feeds the training data to a machine learning
algorithm.
o The model learns how to order results based on real user
preferences.
4. Rank Results in Real Time
o When a new search is performed, the trained model ranks results
dynamically based on what it has learned.
Types of Learning to Rank Approaches
There are three main ways to apply machine learning in ranking:
1. Pointwise Approach (Evaluates One Result at a Time)
This method assigns a score to each search result individually.
The higher the score, the higher the ranking.
It treats ranking as a classification or regression problem (predicting a
score).
Example:
For a search query like "best pizza places in New York", each restaurant website is
given a score, and the one with the highest score ranks first.
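A minimal sketch of the pointwise idea, treating ranking as a regression problem with scikit-learn; the feature values and relevance labels are made-up training data.

```python
# Hedged sketch: a pointwise learning-to-rank model that predicts a
# relevance score for each result from simple features.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features per result: [keyword match, click rate, page authority]
X_train = np.array([[0.9, 0.8, 0.9],
                    [0.7, 0.2, 0.3],
                    [0.5, 0.5, 0.6],
                    [0.2, 0.1, 0.2]])
y_train = np.array([3, 1, 2, 0])   # higher = more relevant

model = LinearRegression().fit(X_train, y_train)

# Score two new results and rank the higher score first
new_results = np.array([[0.8, 0.7, 0.8],
                        [0.6, 0.3, 0.4]])
scores = model.predict(new_results)
order = scores.argsort()[::-1]
print("Ranking order (best first):", order, "scores:", scores.round(2))
```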
2. Pairwise Approach (Compares Two Results at a Time)
This method compares two results side by side to decide which one should
be ranked higher.
The model learns by looking at pairs of results and seeing which one users
prefer more.
Example:
If users tend to click Restaurant A more than Restaurant B, the model learns that A
should rank higher than B.
3. Listwise Approach (Optimizes the Whole List at Once)
Instead of looking at single results or pairs, this method ranks the entire list
at once.
It tries to find the best overall ranking order.
Example:
For "best movies of 2024", the model optimizes the entire ranking to match the
best possible list based on user behaviour.
Where is Learning to Rank Used?
Search Engines (Google, Bing, and Yahoo) – Improves the ranking of web
pages based on user behaviour.
E-commerce (Amazon, Flipkart, eBay) – Helps show the most relevant
products based on past purchases, reviews, and customer clicks.
Video Streaming (YouTube, Netflix, Disney+) – Recommends movies or
videos based on what users have watched and liked before.
Online Advertising (Google Ads, Facebook Ads) – Decides which ads
should be shown to users based on their interests and past clicks.
Challenges in Learning to Rank
1. Handling New Queries
o If a completely new search term appears, the model has no past data
to learn from.
2. Spam and Manipulation
o Some websites try to trick the system by generating fake clicks or
artificial backlinks.
3. User Bias in Data
o If certain types of pages are clicked more due to marketing, the
model may favour them unfairly.
4. Computational Cost
o Running machine learning models for millions of searches every second
requires high processing power.
Q.2 RankSVM – A Pairwise Learning Approach
RankSVM (Ranking Support Vector Machine) is a machine learning algorithm
used for ranking search results or recommendations. It is a pairwise learning
approach, meaning it learns by comparing pairs of items to determine which one
should be ranked higher.
Think of it as a tennis match ranking system. Instead of ranking all players at once,
we compare them in pairs (who won against whom) and then use this information to
create an overall ranking. Similarly, RankSVM compares pairs of search results and
decides which one should appear higher in the final list.
Why Use RankSVM?
Traditional ranking methods (like sorting by keyword matches) do not always
provide the best results. Sometimes, the best page doesn’t contain the exact
keywords but is still highly relevant.
For example, if you search for “best laptops for students”, a webpage titled “Top
Affordable Laptops for College” may be better than another page that simply
repeats the words "best laptops students" multiple times.
RankSVM helps solve this issue by:
Learning from past user preferences.
Comparing search results in pairs rather than scoring them separately.
Improving ranking quality over time by optimizing for real-world user
behavior.
How RankSVM Works?
RankSVM is based on Support Vector Machines (SVMs), which are commonly used
for classification problems. However, instead of classifying objects, RankSVM is used
to rank them.
Here’s how it works step by step:
1. Data Preparation (Creating Pairs of Results)
Search engines collect historical data on which links users click.
They generate pairs of results (A, B) where one result is preferred over the
other.
The goal is to train the model to learn which result should rank higher.
Example:
Suppose a user searches for "best smartphones 2024" and clicks on result A
(Samsung S24 Review) but ignores result B (Old Samsung S21 Review).
The system learns that A is better than B and should rank higher in future
searches.
2. Feature Extraction (Analyzing Differences)
Each webpage (or item) is represented by features such as:
Relevance Score (Does the page match the query well?)
Click Rate (How often do users click this result?)
Page Authority (Is the website trusted?)
User Behavior Metrics (How long do users stay on the page?)
For a pair (A, B), RankSVM calculates the feature differences between them.
Example:
Feature           Page A (Samsung S24)   Page B (Samsung S21)   Difference (A - B)
Relevance Score   0.9                    0.6                    0.3
Click Rate        80%                    30%                    50%
Page Authority    85%                    60%                    25%
These differences help RankSVM determine which page should be ranked
higher.
3. Training the RankSVM Model
The model is trained on many such pairs using a Support Vector Machine
(SVM).
The SVM tries to maximize the margin between preferred results and less
preferred results.
Over time, the model learns a ranking function that can order new search
results.
4. Ranking New Search Results
When a user performs a new search, the trained RankSVM model compares
results pair by pair and orders them based on learned preferences.
The final ranking maximizes user satisfaction by placing the most useful
results at the top.
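A minimal sketch of the pairwise idea behind RankSVM, trained on the feature differences from the illustrative table above using scikit-learn's linear SVM; all the numbers are the same made-up values, not real data.

```python
# Hedged sketch: train a linear SVM on feature *differences* between a
# preferred result and a less-preferred result (the pairwise RankSVM idea).
import numpy as np
from sklearn.svm import LinearSVC

# Difference (A - B) for a pair where A was preferred over B, plus the
# reversed difference (B - A) labelled as the opposite class.
diff_a_minus_b = np.array([[0.3, 0.5, 0.25]])   # relevance, click rate, authority
X = np.vstack([diff_a_minus_b, -diff_a_minus_b])
y = np.array([1, -1])   # 1 = first item preferred, -1 = second preferred

model = LinearSVC(C=1.0).fit(X, y)

def score(features: np.ndarray) -> float:
    """Ranking score: higher means the result should be placed higher."""
    return model.decision_function(features.reshape(1, -1))[0]

page_a = np.array([0.9, 0.8, 0.85])   # newer, well-clicked review
page_b = np.array([0.6, 0.3, 0.60])   # older, less-clicked review
print(score(page_a) > score(page_b))  # True: A ranks above B
```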
Example of RankSVM in Action
Before Using RankSVM
A search for "best budget smartphones" might show:
1. Old Blog Post (2018) – "Top 5 Budget Phones"
2. Latest Review (2024) – "Best Budget Phones Right Now"
3. Random Forum Discussion
This happens because a simple ranking algorithm may rely only on keyword
matches, ignoring the quality and freshness of content.
After Using RankSVM
RankSVM analyzes past user clicks and engagement and improves rankings:
1. Latest Review (2024) – "Best Budget Phones Right Now" ✅ (Users prefer
this)
2. Recent YouTube Video Review (2023) ✅ (Highly clicked)
3. Old Blog Post (2018) ❌ (Less useful)
Where is RankSVM Used?
Search Engines (Google, Bing, Yahoo) – Improves search rankings based on
real user behavior.
E-commerce (Amazon, Flipkart) – Ranks products based on reviews, clicks,
and conversions.
Streaming Services (Netflix, YouTube, Spotify) – Ranks movies, videos,
and songs based on user engagement.
Job Portals (LinkedIn, Indeed) – Ranks job listings based on relevance and
popularity.
Q.3 RankBoost
RankBoost is a machine learning algorithm used for ranking search results or
recommendations. It is based on a method called Boosting, which improves
ranking by combining multiple weak models into a stronger one.
Imagine you have several judges ranking chess players based on different criteria
(e.g., number of wins, player experience, and speed of moves). Individually, each
judge might not be perfect, but if we combine their opinions smartly, we can get a
better ranking.
RankBoost works similarly. Instead of relying on a single weak ranking function, it
boosts the learning process by combining multiple weak rankings into a strong
final ranking.
Why Use RankBoost?
Traditional ranking methods, such as sorting by keyword frequency, often fail to
capture true relevance. RankBoost is useful because:
It learns from past data and continuously improves rankings.
It handles complex ranking problems where simple algorithms fail.
It considers multiple weak ranking factors and combines them into a
powerful ranking model.
For example, if you search for “best budget laptops 2024”, RankBoost helps
ensure that:
Recent and relevant reviews rank higher.
Trusted websites appear before unknown sources.
User engagement (clicks, time spent on page) is considered.
How RankBoost Works?
RankBoost follows a step-by-step approach to improve ranking quality.
1. Preparing Training Data (Pairwise Comparisons)
We collect past search data, where users have clicked on some results but
ignored others.
We create pairs of results (A, B) where we know that A should be ranked
higher than B.
RankBoost learns from these comparisons to improve future rankings.
2. Initializing Weak Rankers (Weak Ranking Functions)
RankBoost starts with simple weak models (each one makes basic ranking
decisions).
These weak rankers use different factors like:
o Keyword relevance (Does the query match the page content?)
o Click rate (Do users often click this page?)
o Freshness (Is the page recent or outdated?)
Individually, these weak rankers are not perfect, but RankBoost improves
them over time.
3. Boosting Process (Iterative Learning)
RankBoost improves rankings through multiple rounds of learning:
In each round, it adjusts ranking weights based on past mistakes.
It gives more importance to incorrectly ranked results, so the model
learns from its errors.
Over time, the ranking quality keeps improving as errors are corrected.
4. Final Ranking Calculation
After many rounds, RankBoost combines all weak ranking models into one
strong ranking function.
This final ranking is much better than what any single weak ranker could do
alone.
When a user performs a new search, RankBoost:
Uses the learned ranking model to compare search results.
Ranks the best pages at the top based on past learning.
Q.4 Evaluation Metrics for Learning to Rank (Easy Explanation)
When we use machine learning to rank search results, we need a way to
measure how good the rankings are. This is done using evaluation metrics that
compare the predicted rankings to the ideal rankings (what users actually prefer).
For example, if you search for “best smartphones 2024”, a good ranking system
should show relevant and high-quality results at the top. If it ranks old,
irrelevant pages higher, the system is not performing well.
Evaluation metrics help us:
Compare different ranking models to see which one works best.
Identify problems (e.g., irrelevant results appearing at the top).
Improve ranking algorithms over time.
Common Evaluation Metrics for Learning to Rank
1. Mean Reciprocal Rank (MRR) – How Early is the First Relevant Result?
MRR measures how early the first relevant result appears in the ranking. If the best
answer is ranked at position 1, MRR is perfect (1.0). If it's at position 5, MRR is
lower (0.2).
Example:
For the query "who won the FIFA World Cup 2022?", the correct answer should be
at the top. If it appears at:
Position 1 → MRR = 1.0 ✅ (Perfect ranking)
Position 2 → MRR = 0.5 ❌ (Not ideal)
Position 5 → MRR = 0.2 ❌ (Bad ranking)
MRR is mostly used for question-answering systems or cases where only one
right answer exists.
2. Precision@K – How Many Top Results Are Relevant?
Precision@K (P@K) measures how many of the top K results are relevant.
Example:
For the search “best programming languages for AI”, if the top 5 results are:
✅ Python
✅ Java
✅ C++
❌ PHP
❌ HTML
Here, 3 out of 5 results are relevant → Precision@5 = 3/5 = 0.6 (or 60%).
Formula: Precision@K = (Relevant results in top K) / K
Precision@K is useful for search engines and recommendation systems.
3. Recall@K – Are We Finding All the Relevant Results?
Recall@K checks if the model finds all relevant documents. If 10 relevant pages
exist but the model only shows 3, recall is low.
Example:
Suppose there are 8 relevant articles on "Best Laptops 2024".
If the search engine shows only 4 relevant ones in the top 10, Recall@10 =
4/8 = 0.5 (50%).
Formula: Recall@K = (Relevant results in top K) / (Total relevant results available)
Recall is useful when finding all relevant results is important, like legal or
medical document searches.
4. Normalized Discounted Cumulative Gain (NDCG) – Are the Best Results at
the Top?
NDCG checks how well a ranking model places the most relevant results at
the top.
Example:
Imagine you search for “best programming languages for AI”, and the most
relevant pages should be:
1. ✅ Python (Best Choice)
2. ✅ Java (Second Best Choice)
3. ✅ C++ (Useful but Less Popular)
If the search engine ranks C++ first, then Java, then Python, NDCG will be low
because the most relevant result (Python) is not at the top.
NDCG is widely used in search engines like Google, Bing, and e-commerce
websites.
5. Mean Average Precision (MAP) – How Good is the Ranking Overall?
MAP averages Precision@K over multiple queries. It is useful when we test a ranking
model on many different searches.
Example:
If Precision@5 for "best laptops" is 0.8 and for "best smartphones" is 0.6,
then:
MAP = (0.8 + 0.6) / 2 = 0.7
MAP is widely used for evaluating ranking systems over multiple queries.
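A minimal sketch that computes Precision@K, Recall@K, and the reciprocal rank for one ranked list; the 1/0 relevance judgements are made up.

```python
# Hedged sketch: computing a few of the metrics above for one ranked list.
# 1 = relevant result, 0 = not relevant.
def precision_at_k(relevance: list[int], k: int) -> float:
    return sum(relevance[:k]) / k

def recall_at_k(relevance: list[int], k: int, total_relevant: int) -> float:
    return sum(relevance[:k]) / total_relevant

def reciprocal_rank(relevance: list[int]) -> float:
    for position, rel in enumerate(relevance, start=1):
        if rel:
            return 1 / position
    return 0.0

# Top-5 results for one query: relevant, relevant, not, relevant, not
ranked = [1, 1, 0, 1, 0]
print(precision_at_k(ranked, 5))                  # 0.6
print(recall_at_k(ranked, 5, total_relevant=8))   # 0.375
print(reciprocal_rank(ranked))                    # 1.0 (first result is relevant)
```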
Which Metric Should You Use?
If only the first correct answer matters → Use MRR (e.g., Q&A systems).
If top results need to be relevant → Use Precision@K (e.g., search
engines).
If we need to find all relevant results → Use Recall@K (e.g., legal/medical
searches).
If ranking order matters → Use NDCG (e.g., recommendation systems).
If evaluating over multiple queries → Use MAP (e.g., large search datasets).
Q.1 Link Analysis
Link analysis is a technique used in search engines, social networks, and
recommendation systems to analyse connections between web pages, people,
or objects. It helps determine the importance, relevance, and relationships
between different entities by studying how they are linked together.
Why is Link Analysis Important?
Link analysis helps in:
Search engines (Google, Bing) – Ranking webpages based on their importance (PageRank).
Social networks (Facebook, Twitter, LinkedIn) – Finding influential users.
Fraud detection (Banks, E-commerce) – Detecting fake transactions or spam networks.
Recommendation systems (Amazon, Netflix, YouTube) – Suggesting similar products, movies, or
videos.
How Link Analysis Works
Link analysis follows these steps:
1. Representing the Network as a Graph
A graph is used to represent links between entities.
Webpages (Nodes) are connected by Hyperlinks (Edges).
People (Nodes) are connected by Friendships or Followers (Edges) in social media.
For example, if Site A links to Site B, we draw a directed edge from A to B.
2. Measuring Importance Using Incoming and Outgoing Links
The more high-quality links a page receives, the more important it is.
Example: If many trusted news websites link to a blog, it is likely credible.
But: If a page is linked only by spammy or unknown sites, it might be low quality.
3. Computing Influence and Ranking
Using mathematical algorithms (like PageRank or HITS), we can determine:
Which pages are the most authoritative.
Which users are most influential in a network.
Example: Link Analysis in Action
How Google Ranks Pages
You search for "best laptops for gaming", and Google must rank millions of webpages.
Using link analysis:
1. Page A (trusted review site) has many backlinks from tech blogs → Higher rank
2. Page B (spammy site) has links only from unknown sites → Lower rank
3. Page C (Amazon product page) has backlinks from trusted forums → Ranks well
Google combines link analysis + AI to provide the best results.
Link Analysis Techniques
PageRank: PageRank is one of the most well-known link analysis techniques. It
was developed by Google to rank web pages based on how many other pages link
to them and how important those linking pages are. The idea is simple: a webpage
that is linked by many trusted pages is likely to be more important than one linked
only by low-quality pages.
For example, if a new website about space research gets linked by NASA,
Harvard University, and National Geographic, it is likely to be important and
should rank higher in search results. However, if the same website is only linked by
unknown or spammy pages, it will not be ranked highly. The algorithm also
considers the quality of links—a link from Wikipedia is more valuable than a link
from a random personal blog.
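A minimal sketch of the PageRank idea on a made-up three-page web, using the usual power-iteration form with a damping factor of 0.85.

```python
# Hedged sketch: a tiny PageRank computation on a made-up link graph.
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}
d = 0.85             # damping factor

for _ in range(50):  # iterate until the ranks settle
    new_rank = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - d) / len(pages) + d * incoming
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})
```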
HITS (Hyperlink-Induced Topic Search): HITS is another algorithm that
classifies web pages into two categories: hubs and authorities. Hubs are pages
that link to many other important pages, and authorities are pages that receive
many links from hubs.
For example, consider a travel blog that links to many airline websites, hotel
booking sites, and travel guides. This blog is a hub because it provides useful links.
The airline websites and hotel booking pages are authorities because they are
being linked by many such travel blogs. HITS is useful for identifying the most
relevant pages when searching for information on a specific topic.
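A minimal sketch of the HITS idea on the same kind of toy link graph; hub and authority scores reinforce each other over a few iterations.

```python
# Hedged sketch: a few HITS iterations producing hub and authority scores.
import math

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # A page's authority grows with the hub scores of pages linking to it
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # A page's hub score grows with the authority of the pages it links to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalise so the scores do not blow up
    a_norm = math.sqrt(sum(v * v for v in auth.values()))
    h_norm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authorities:", {p: round(v, 2) for p, v in auth.items()})
print("hubs:", {p: round(v, 2) for p, v in hub.items()})
```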
Q.3 Application of Link Analysis in Information Retrieval (IR) Systems
Link analysis plays a crucial role in Information Retrieval (IR) systems, especially
in improving search engines, ranking web pages, filtering spam, and recommending
relevant content. Since the internet is filled with billions of web pages, IR systems
need efficient ways to determine which pages are important, relevant, and
trustworthy. Link analysis helps achieve this by studying the relationships between
web pages based on how they link to each other.
1. Web Page Ranking in Search Engines
One of the most significant applications of link analysis in IR systems is determining the
importance of a web page. Search engines like Google, Bing, and Yahoo use link analysis
algorithms like PageRank and HITS to rank pages based on their backlinks.
2. Identifying Hubs and Authorities for Topic-Specific Search
Link analysis helps IR systems categorize web pages into hubs and authorities using the
HITS algorithm. Hubs are pages that link to many important pages, while authorities are
pages that receive many links from hubs.
3. Spam Detection and Web Quality Assessment
One of the major challenges in information retrieval is eliminating spam and low-quality
web pages. Spam websites often try to manipulate search rankings by generating fake
backlinks. Link analysis techniques like TrustRank help in identifying spam pages.
4. Personalized Search and Recommendations
Link analysis helps IR systems personalize search results and recommendations by
analysing user behaviour and link structures. Algorithms like SimRank and Personalized
PageRank are used to suggest content based on similarities between users and webpages.
5. Fraud Detection and Security in IR Systems
IR systems in banking, cybersecurity, and e-commerce use link analysis to detect
fraudulent activities. Co-Citation Analysis and Co-Occurrence Analysis help identify
hidden connections between fake entities.
6. Question Answering and Chatbots
In AI-based question-answering systems like Google Assistant and ChatGPT, link
analysis helps retrieve the most relevant answers by analysing how different
documents are connected.
Chapter-11 Crawling and Near-Duplicate Page Detection
Q.1 Web Crawling
Web crawling is the process of systematically browsing the internet to collect and
index information from web pages. It is done by special programs called web
crawlers, spiders, or bots. These crawlers start from one webpage, extract links,
and move to other pages, repeating the process.
Features of Web Crawling
1. Automated Web Page Discovery
Web crawlers automatically find and visit new web pages by following links. They do
not require manual input to locate new content.
2. Efficient Data Storage and Processing
Crawlers store the collected data in structured formats, making it easy to analyse and
retrieve when needed.
3. Link Following (Recursive Crawling)
Crawlers discover new pages by following hyperlinks from one page to another. This
allows them to navigate the internet efficiently.
4. Indexing of Web Pages
Once the data is collected, it is indexed to make searching easier. This helps search
engines quickly retrieve relevant information when users search for something.
5. Respect for Robots.txt
Crawlers check the robots.txt file of a website to understand which pages they are
allowed or not allowed to crawl.
6. Handling Dynamic and Static Pages
Modern crawlers can handle both static HTML pages and dynamic content generated
by JavaScript.
7. Duplicate Content Detection
Crawlers detect duplicate pages to avoid storing and indexing the same content
multiple times.
Q.2 Web Crawling Techniques
1. Breadth-First Search (BFS)
Breadth-First Search is a web crawling technique that operates by systematically exploring web pages level by level.
It begins from a starting page and visits all reachable pages at the current depth level before moving on to pages at the
next depth level. BFS employs a first-in-first-out (FIFO) queue to manage URLs, ensuring that pages are visited in the
order they were discovered. This systematic and comprehensive approach makes BFS suitable for broad crawling tasks
where the goal is to collect a large number of pages across different parts of the web.
Advantages:
1. Ensures that all pages at a given depth are crawled before moving deeper.
2. Useful for collecting a complete set of URLs.
Disadvantages:
1. Can be slow if there are too many links on a page.
2. Requires a lot of memory to store links.
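A minimal sketch of the FIFO-queue behaviour described above; the link graph is a made-up stand-in for pages discovered on the web, so nothing is actually fetched.

```python
# Hedged sketch: breadth-first crawling order using a FIFO queue.
from collections import deque

links = {
    "home": ["about", "blog"],
    "about": ["team"],
    "blog": ["post1", "post2"],
    "team": [], "post1": [], "post2": [],
}

def bfs_crawl(start: str) -> list[str]:
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        page = queue.popleft()          # FIFO: oldest discovered page first
        order.append(page)
        for link in links[page]:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

print(bfs_crawl("home"))  # ['home', 'about', 'blog', 'team', 'post1', 'post2']
```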
2. Depth-First Search (DFS) Crawling
This is the opposite of BFS. Instead of covering all pages at one level first, it follows
one link deeply before coming back to explore other links. It is based on a Last-In-First-Out
(LIFO) strategy, using a stack to keep track of URLs. Newly discovered URLs are added
to the top of the stack, and the crawler explores them until there are no unexplored links.
This stack-based approach contributes to the memory efficiency of DFS, as it only
needs to store the URLs along the current exploration path. Imagine walking through a city and
following one road all the way to the end before turning back and exploring another
road. This method helps reach deeply hidden pages but can sometimes miss
important links that were left behind in the earlier levels.
Advantages:
Helps reach deeply hidden pages.
Uses less memory than BFS.
Disadvantages:
May miss some links at the top level if the website is too deep.
Can get stuck in loops if links keep pointing back to the same pages.
3. Focused Crawling
For cases where only specific types of data are needed, Focused Crawling is used.
Instead of crawling everything on the internet, a focused crawler only follows links
that are related to a certain topic. For example, if a crawler is programmed to collect
only news articles, it will ignore e-commerce sites or entertainment websites. This
makes the process more efficient because unnecessary pages are skipped, but it also
risks missing some relevant information if the filtering rules are too strict.
Advantages:
Saves time and storage by avoiding irrelevant pages.
Useful for industry-specific or research-focused crawling.
Disadvantages:
Might miss useful information if the filtering rules are too strict.
Requires pre-defined keywords or categories.
Q.3 Near-Duplicate Page Detection Algorithms
Near-Duplicate Page Detection Algorithms are used to identify web pages that are
highly similar or nearly identical to each other. These are called near-duplicate
pages. For example, two news websites might publish the same story, but one might
have a different date or a small additional paragraph. Similarly, an e-commerce
website might have multiple product pages with nearly the same content but with
slight variations in price or descriptions.
Detecting these near-duplicate pages is important because storing and displaying
multiple similar pages is inefficient. It can clutter search results, waste storage space,
and slow down web crawling. To solve this problem, different algorithms are used to
compare and identify pages that are almost the same.
1. Hashing (Exact Duplicate Detection, Not Good for Near-Duplicates)
The simplest method is hashing, where each page's content is converted into a
unique fingerprint, called a hash. This fingerprint is a short string of numbers and
letters that represents the content. If two pages have the same hash, they are
identical. However, if even a single word changes, the hash will be completely
different, making it ineffective for detecting near-duplicates.
Example of Hashing:
Let’s say we have these two texts:
Text 1: "The quick brown fox jumps over the lazy dog."
Text 2: "The quick brown fox jumps over a lazy dog."
Even though only one word changed ("the" → "a"), their hashes will be completely
different, making hashing useless for near-duplicate detection.
Advantages:
Fast and easy to implement.
Works well for detecting exact duplicates.
Disadvantages:
Cannot detect near-duplicates (small changes make a different hash).
Sensitive to even a single character change.
2. Shingling (Comparing Small Parts of a Page for Similarity)
Since hashing fails for near-duplicates, a better method is shingling. In this method, a
web page is broken down into smaller overlapping sections of words, called shingles.
These shingles are then compared between different pages. If two pages share a high
percentage of the same shingles, they are considered near-duplicates.
Example of Shingling:
For the sentence:
"The quick brown fox jumps over the lazy dog."
If we break it into 3-word shingles, we get:
1. "The quick brown"
2. "quick brown fox"
3. "brown fox jumps"
4. "fox jumps over"
5. "jumps over the"
6. "over the lazy"
7. "the lazy dog"
Now, if another page contains similar text with small changes, we compare how many
shingles match. If most shingles are the same, the pages are considered near-
duplicates.
Advantages:
Detects near-duplicates effectively.
Handles minor changes well.
Disadvantages:
Can be slow if the dataset is large.
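A minimal sketch of 3-word shingling with Jaccard similarity, using the two example sentences above.

```python
# Hedged sketch: 3-word shingling with Jaccard similarity to measure how
# close two near-duplicate sentences are.
def shingles(text: str, k: int = 3) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

s1 = shingles("The quick brown fox jumps over the lazy dog")
s2 = shingles("The quick brown fox jumps over a lazy dog")
print(round(jaccard(s1, s2), 2))  # shares 4 of 10 distinct shingles -> 0.4
```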
3. MinHashing (Speeding Up the Shingling Process)
Shingling is effective but can be slow when comparing millions of pages. MinHashing
speeds up the process by selecting only a few important shingles from each document
instead of storing all of them. Instead of comparing every word, MinHashing selects a
small but meaningful portion of the text and checks for similarity.
Example of MinHashing: Imagine you are summarizing a book by selecting only a
few important sentences. If two books have similar key sentences, they likely have
similar content. MinHashing does something similar by selecting important shingles
and comparing them instead of analysing the whole text.
Advantages:
Much faster than full shingling.
Reduces memory usage.
Disadvantages:
Slightly less accurate than computing the full Jaccard similarity.
Q.4 Handling Dynamic Web Content During Crawling
Handling dynamic web content during crawling is essential for accurately extracting
data from modern websites that use client-side scripting, AJAX, or other dynamic
techniques to generate content.
Techniques to Handle Dynamic Content during Crawling
To overcome these challenges, web crawlers use different strategies depending on the
complexity of the website.
1. Headless Browser Crawling (Rendering JavaScript)
A headless browser is a web browser that runs without a graphical user interface,
allowing crawlers to load and execute JavaScript like a real user. Tools like
Puppeteer, Selenium, and Playwright can be used to render JavaScript-heavy
websites and extract the fully loaded content.
2. Using APIs Instead of Crawling
Some websites provide public or private APIs that return structured data in formats
like JSON or XML. Instead of crawling web pages, data can be fetched directly from the
API, which is faster and more reliable.
3. Crawling with AJAX Aware Crawlers
Some modern crawlers are designed to detect and handle AJAX requests. These
crawlers wait for background requests to complete and extract data once the page is
fully loaded.
4. Simulating User Interactions
To handle infinite scrolling or pages with "Load More" buttons, crawlers need to
simulate user actions like scrolling down or clicking buttons. Headless browsers and
automation tools can be programmed to perform these actions.
5. Handling CAPTCHA and Login-Based Restrictions
If content is locked behind a login, crawlers may need to simulate login processes by
submitting user credentials and maintaining session cookies. For CAPTCHAs, human
intervention or CAPTCHA-solving services may be required.
6. Handling Infinite Scrolling: Some websites load more content when a user
scrolls down (e.g., Facebook, Twitter). Crawlers need to simulate scrolling to fetch all
the data.
Chapter-12 Advanced Topics in IR
Q.5 Text Summarization: Extractive and Abstractive Methods
Text summarization is the process of creating a shorter version of a text while keeping
the most important information. It helps people quickly understand long documents,
articles, or reports. Text summarization is widely used in news articles, research papers,
emails, and AI-powered assistants.
Extractive Summarization
Extractive summarization works by selecting key sentences or phrases from the
original text and putting them together to create a summary. It does not generate new
sentences but instead extracts the most important parts from the original content.
How Extractive Summarization Works?
1. Text Processing → The system breaks the text into individual sentences.
2. Sentence Scoring → Each sentence is given a score based on importance
3. Selection of Key Sentences → The highest-scoring sentences are chosen for the
summary.
Techniques Used in Extractive Summarization
TF-IDF (Term Frequency-Inverse Document Frequency) → Measures how
important a word is in a text.
TextRank (Graph-Based Ranking) → Similar to Google’s PageRank algorithm, this
method finds the most important sentences in a text.
Machine Learning Models → AI models learn from human-written summaries and
select the best sentences.
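A minimal sketch of TF-IDF-based sentence scoring for extractive summarization, assuming scikit-learn; the three sentences are made up and the "summary" is simply the highest-scoring one.

```python
# Hedged sketch: extractive summarization by scoring sentences with TF-IDF
# and keeping the highest-scoring one. The short text is made up.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Information retrieval systems index documents for fast search.",
    "The weather was pleasant yesterday.",
    "Indexing and ranking are the core steps of a search engine.",
]

# Score each sentence by the sum of its TF-IDF weights
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(sentences)
scores = tfidf.sum(axis=1).A1  # one score per sentence

# Keep the single best sentence as the summary
best = scores.argmax()
print("Summary:", sentences[best])
```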
Advantages of Extractive Summarization
✔ Keeps the original meaning without changes.
✔ Fast and efficient compared to abstractive methods.
Disadvantages of Extractive Summarization
✖ Can be disjointed because sentences are copied without smooth transitions.
✖ Does not rewrite text, so it may not sound natural.
Abstractive Summarization
Abstractive summarization creates new sentences to summarize the key ideas of a
text. Instead of copying sentences directly, it rewrites them in a shorter and more
meaningful way, similar to how humans summarize information.
How Abstractive Summarization Works?
1. Understanding the Text → The system reads and understands the meaning of the
text.
2. Key Idea Extraction → Important concepts and relationships between ideas are
identified.
3. Rewriting in Shorter Form → New sentences are generated using natural
language processing (NLP).
Techniques Used in Abstractive Summarization
Seq2Seq Models (Sequence-to-Sequence) → AI models process input text and
generate a shorter output.
Transformers (BERT, T5, GPT-3, etc.) → Advanced AI models understand
context and generate summaries.
Attention Mechanism → Helps models focus on important words while
summarizing.
Advantages of Abstractive Summarization
✔ creates more natural and human-like summaries.
✔ can paraphrase and simplify complex information.
✔ Works well for storytelling, reports, and creative writing.
Disadvantages of Abstractive Summarization
✖ Harder to develop because AI needs deep language understanding.
✖ can miss important details or introduce errors.
Q.6 Question Answering: Approaches for Finding Precise Answers
Question Answering (QA) is a technology that allows computers to answer human
questions automatically. Instead of just providing a list of search results like Google, a QA
system understands the question and gives a direct answer. For example:
Question: Who is the president of the USA?
Answer: Joe Biden.
QA systems are used in search engines (Google, Bing), virtual assistants (Siri, Alexa),
and AI chatbots to provide fast and accurate information.
Types of Question Answering Systems:
1. Open-Domain QA → Answers questions about any topic using a large collection of texts, like Wikipedia.
2. Closed-Domain QA → Answers questions related to a specific subject, such as medical or legal topics.
Approaches to Finding Precise Answers
1. Information Retrieval-Based Approach
This approach works like a search engine. It finds relevant documents, extracts key
information, and presents it as an answer. How It Works:
1. The system understands the question (e.g., "Who invented the telephone?").
2. It searches a large database for relevant documents (e.g., Wikipedia).
3. It extracts sentences that contain the answer (e.g., "Alexander Graham Bell invented
the telephone.").
4. It presents the most relevant part as the answer.
Example:
Question: What is the capital of France?
Process: The system searches a knowledge database and finds, "Paris is the capital of
France."
Answer: Paris.
Advantages:
✔ Works well for factual questions.
✔ Fast and efficient.
Disadvantages:
✖ Sometimes returns long documents instead of precise answers.
✖ May not work well for complex questions.
2. Knowledge-Based Approach
This approach uses structured databases, called Knowledge Graphs, to find answers. A Knowledge Graph is like a
giant map of facts stored in a structured format. Google’s "Knowledge Panel" uses this method. How It Works:
1. The system understands the question and converts it into a structured query.
2. It searches a knowledge database like Google Knowledge Graph, DBpedia, or Wolfram Alpha.
3. It retrieves the correct answer.
Example:
Question: Who directed the movie Inception?
Process: The system looks in its Knowledge Graph and finds: "Christopher Nolan directed Inception."
Answer: Christopher Nolan.
Advantages:
✔ Very accurate for factual knowledge.
✔ Can provide structured information (e.g., birth date, location, job title).
Disadvantages:
✖ Requires a well-maintained knowledge base.
✖ Cannot handle opinion-based or subjective questions well.
3. Machine Learning and Deep Learning-Based Approach
This is the most advanced method. It uses Artificial Intelligence (AI) and deep learning models to understand the
question, find relevant information, and generate an answer. It works well for complex and conversational
questions. How It Works:
1. The system uses Natural Language Processing (NLP) to understand the question.
2. A deep learning model (such as BERT, GPT, or T5) searches for answers in large text collections.
3. The model generates a response based on its understanding of the data.
Example:
Question: Why is the sky blue?
Process: The AI model reads multiple articles, understands the physics of light scattering, and generates an answer:
Answer: The sky appears blue because molecules in the air scatter blue light from the sun more than other colours.
Advantages:
✔ Can understand complex and conversational questions.
✔ Works well for long-form answers and explanations.
Disadvantages:
✖ Requires a lot of training data and computational power.
✖ May sometimes generate incorrect or biased answers.
4. Rule-Based Approach
This method uses predefined rules and patterns to answer specific types of questions. It works well for structured
and repetitive questions but struggles with open-ended or complex questions. How It Works?
1. The system analyses the question structure (e.g., recognizing "Who," "What," "Where").
2. It applies predefined rules to find answers.
3. It extracts information from databases, FAQs, or simple lookup tables.
Example:
Question: What is 2 + 2?
The system applies a mathematical rule and provides the answer: 4.
Question: Who is the CEO of Apple?
The system looks up a predefined database and returns Tim Cook.
Advantages:
✔ Fast and efficient for simple questions.
✔ Works well for fixed datasets (like company directories, FAQs).
Disadvantages:
✖ Cannot handle complex or new questions that are not in the rules.
✖ Requires manual updates to keep rules up to date.
5. Graph-Based Approach
This approach represents knowledge as a graph of connected facts. Instead of searching through text, it follows
relationships between concepts to find precise answers. Google’s "Knowledge Graph" uses this technique.
How It Works?
1. The system stores data in a graph where nodes represent entities (people, places, things) and edges
represent relationships between them.
2. When a user asks a question, the system follows connections in the graph to find the correct answer.
Example:
Question: Who is the father of Elon Musk?
The system follows these relationships:
o Elon Musk → Parent → Errol Musk.
Final Answer: Errol Musk.
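A minimal sketch of this kind of graph lookup; the hand-built fact table below is only illustrative and far simpler than a real knowledge graph.

```python
# Hedged sketch: answering a question by following an edge in a tiny,
# hand-built knowledge graph. The facts are only illustrative.
graph = {
    ("Elon Musk", "parent"): "Errol Musk",
    ("Inception", "director"): "Christopher Nolan",
    ("France", "capital"): "Paris",
}

def answer(entity: str, relation: str) -> str:
    return graph.get((entity, relation), "I don't know")

print(answer("Elon Musk", "parent"))      # Errol Musk
print(answer("Inception", "director"))    # Christopher Nolan
```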
Advantages:
✔ Very accurate and structured answers.
✔ Works well for fact-based questions.
✔ Fast and efficient for well-organized knowledge bases.
Disadvantages:
✖ Requires well-maintained structured data.
✖ Does not work well for opinion-based or abstract questions.
Q.7 Recommender System: Collaborative Filtering & Content-
Based Filtering
A Recommender System is a type of artificial intelligence that helps users find relevant products, movies, books,
music, or other content based on their preferences. These systems are widely used by platforms like Netflix,
YouTube, Amazon, and Spotify to provide personalized recommendations.
Collaborative Filtering
Collaborative Filtering works by analyzing the behavior and preferences of many users to make recommendations. It
assumes that if two users liked similar items in the past, they will like similar items in the future.
For example, if User A and User B both watched and liked the same movies, and User A watches a new movie, the
system might recommend that movie to User B because they have similar tastes.
How Collaborative Filtering Works?
1. Collect User Data → The system records users' interactions (likes, ratings, purchases, etc.).
2. Find Similar Users or Items → The system compares users or items based on interactions.
3. Make Predictions → If two users are similar, the system recommends items that one user liked to the other.
Types of Collaborative Filtering
1. User-Based Collaborative Filtering
Finds users who have similar tastes and recommends items based on their preferences.
Example: If Alex and Sam both love action movies, and Alex watches a new action movie, Sam might get a
recommendation for that movie.
2. Item-Based Collaborative Filtering
Finds items that are similar based on users’ interactions.
Example: If many people who watched The Dark Knight also watched Inception, then Inception might be
recommended to someone who liked The Dark Knight.
Example of Collaborative Filtering
Scenario:
User A likes The Avengers and Iron Man.
User B likes The Avengers, Iron Man, and Thor.
Since User A and User B have similar tastes, the system recommends Thor to User A.
Advantages of Collaborative Filtering
✔ Works well for diverse types of content (movies, books, shopping, etc.).
✔ No need to understand the content itself—just uses user preferences.
✔ Can discover new and unexpected recommendations based on user trends.
Disadvantages of Collaborative Filtering
✖ Requires a large amount of user data to work effectively.
✖ Cold Start Problem → Doesn’t work well for new users or new items because there is no interaction data yet.
✖ Can sometimes recommend popular items repeatedly instead of introducing diversity.
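A minimal sketch of user-based collaborative filtering with cosine similarity, assuming scikit-learn; the small user-movie watch matrix mirrors the scenario above and is made up.

```python
# Hedged sketch: user-based collaborative filtering with cosine similarity.
# 1 = the user watched/liked the movie, 0 = they did not.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

movies = ["Avengers", "Iron Man", "Thor", "Titanic"]
ratings = np.array([
    [1, 1, 0, 0],   # User A
    [1, 1, 1, 0],   # User B
    [0, 0, 0, 1],   # User C
])

# How similar is User A to the other users?
sims = cosine_similarity(ratings)[0]
most_similar = sims[1:].argmax() + 1          # skip User A itself
print("Most similar to User A: User", "ABC"[most_similar])

# Recommend what the similar user liked that User A has not seen
recommend = [m for m, a, b in zip(movies, ratings[0], ratings[most_similar])
             if a == 0 and b == 1]
print("Recommend to User A:", recommend)      # ['Thor']
```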
Content-Based Filtering
Content-Based Filtering recommends items based on their features and how they match a user's past preferences.
Instead of relying on what other users liked, it focuses on the characteristics of items.
For example, if a user watches many sci-fi movies, the system will recommend other sci-fi movies based on genre,
actors, or storyline.
How Content-Based Filtering Works?
1. Analyze Item Features → The system identifies important features (e.g., genre, author, keywords).
2. Create a User Profile → The system learns about the user’s preferences based on past interactions.
3. Match Items to User Interests → The system recommends items with similar features to what the user liked
before.
Example of Content-Based Filtering
Scenario:
A user reads a book about Artificial Intelligence.
The system identifies features like technology, science, and AI as important topics.
It recommends other books related to AI and technology.
Advantages of Content-Based Filtering
✔ Works well for new users because it only depends on their own preferences.
✔ Provides personalized recommendations based on a user's specific interests.
✔ Avoids the popularity bias seen in collaborative filtering.
Disadvantages of Content-Based Filtering
✖ Can be too narrow, recommending only similar content instead of diverse suggestions.
✖ Requires detailed item information (tags, descriptions, categories) to work properly.
✖ Struggles with new items that don’t have enough metadata.
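A minimal sketch of content-based matching with TF-IDF and cosine similarity, assuming scikit-learn; the book titles and topics are made up.

```python
# Hedged sketch: content-based filtering that matches candidate books
# against one book the user liked, using their descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

liked = "artificial intelligence and machine learning basics"
candidates = [
    "deep learning and artificial intelligence in practice",
    "a history of french cooking",
    "machine learning for data science",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([liked] + candidates)

# Similarity of each candidate to the liked item, highest first
sims = cosine_similarity(vectors[0], vectors[1:]).ravel()
for title, s in sorted(zip(candidates, sims), key=lambda x: -x[1]):
    print(round(s, 2), title)
```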
Chapter-13 Cross-Lingual and Multilingual Retrieval
Q.8 Cross-Lingual and Multilingual Retrieval
In today's world, information is available in many languages, and users often want to
search for and access content that is not in their native language. Cross-Lingual and
Multilingual Retrieval are techniques used in Information Retrieval (IR) to help users
find relevant documents in different languages.
These retrieval methods help break language barriers and provide access to global
information. They are essential for search engines, academic research, global
businesses, and multilingual databases.
Cross-Lingual Information Retrieval (CLIR)
Cross-Lingual Information Retrieval (CLIR) is the process of allowing users to enter a
search query in one language and retrieve documents written in another
language. This is useful when relevant information exists in a language that the user
does not understand.
For example, if a person searches for "climate change reports" in English, but the most
relevant research papers are written in Spanish, CLIR will help retrieve and translate those
Spanish documents into English.
How CLIR Works?
CLIR involves several key steps:
1. Query Translation → the search query is translated into the target language.
2. Document Retrieval → the system finds documents in the translated language.
3. Document Translation (optional) → If the user does not understand the
retrieved language, the documents are translated back into the user's language.
Methods Used in CLIR
CLIR uses different techniques to handle language differences:
Machine Translation (MT): The query or documents are translated using AI-based
tools like Google Translate.
Dictionary-Based Translation: Words are translated using a bilingual dictionary.
Parallel Corpora: Pre-existing translated texts are used to improve accuracy.
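A minimal sketch of the dictionary-based translation method; the tiny English-to-Spanish dictionary is purely illustrative.

```python
# Hedged sketch: dictionary-based query translation for cross-lingual
# retrieval. Unknown words are kept as-is.
EN_TO_ES = {
    "climate": ["clima"],
    "change": ["cambio"],
    "reports": ["informes"],
}

def translate_query(query: str) -> list[str]:
    """Translate each query word; keep the original if no entry exists."""
    translated = []
    for word in query.lower().split():
        translated.extend(EN_TO_ES.get(word, [word]))
    return translated

print(translate_query("climate change reports"))
# ['clima', 'cambio', 'informes']
```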
Challenges in CLIR
Translation Errors: Some words have multiple meanings, which can lead to
inaccurate translations.
Language Structure Differences: Grammar and sentence structure vary
between languages.
Limited Data: Some languages have fewer resources for accurate translation.
Multilingual Information Retrieval (MLIR)
Multilingual Information Retrieval (MLIR) allows users to search in one or multiple
languages and get results in different languages without needing translation.
The key difference from CLIR is that MLIR does not always require translation; instead, it
retrieves documents in multiple languages and presents them to the user.
A European journalist searches for news on "economic policies in Asia." The MLIR
system retrieves articles in English, Chinese, and Japanese, showing the most relevant
ones at the top. The journalist can then choose which articles to read based on their
language skills.
How MLIR Works?
1. The system identifies which languages the user understands.
2. It searches for relevant documents in multiple languages.
3. The retrieved results are presented to the user in the original languages.
Methods Used in MLIR
Language Identification: The system detects the language of the documents and
the user's preferences.
Multilingual Indexing: The search engine stores and indexes information in multiple languages (a short sketch follows this list).
Ranking Algorithms: Results are ranked based on relevance, regardless of the
language.
Challenges in MLIR
Different Word Meanings: Some words mean different things in different
languages.
Language-Specific Ranking: Some languages may have more content available,
making it harder to balance results.
User Preferences: Some users may want only results in their native language.
Q.9 Methods of User-Based Evolution in Information Retrieval
Information Retrieval (IR) is the process of finding relevant information from large
collections of data. Over time, IR systems need to improve their performance by
adapting to how users interact with them. This process of continuous improvement
based on user behaviour is known as User-Based Evolution in Information
Retrieval. The idea behind this is that search engines and retrieval systems
should learn from users to enhance their accuracy and efficiency.
For example, when a person searches for information online, the system tracks their
behaviour, such as which results they click on, how much time they spend on a page,
and whether they refine their search query. Based on this data, the system makes
changes to improve the quality of search results. If many users find a particular search
result useful, the system may rank it higher in future searches.
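As a toy illustration of this idea, the Python sketch below blends a document's original ranking score with a small boost for every recorded click. All URLs, scores, and the boost weight are invented; real systems combine many behavioural signals and correct for click spam and position bias.

```python
# A minimal sketch of click-based re-ranking (hypothetical data and weights).
from collections import defaultdict

# How many times users clicked each result for one query (hypothetical log).
click_counts = defaultdict(int, {
    "docs.python.org/tutorial": 120,
    "random-blog.example/python": 15,
    "video-course.example/python": 60,
})

# Original relevance scores produced by the ranking algorithm (hypothetical).
base_scores = {
    "docs.python.org/tutorial": 0.70,
    "random-blog.example/python": 0.75,
    "video-course.example/python": 0.65,
}

def rerank(base_scores, click_counts, weight=0.001):
    """Blend the original score with a small boost per recorded click."""
    return sorted(
        base_scores,
        key=lambda url: base_scores[url] + weight * click_counts[url],
        reverse=True,
    )

print(rerank(base_scores, click_counts))
# Heavily clicked results move up even if their original score was slightly lower.
```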
To improve search systems, different evaluation methods are used to study user
behaviour and measure the effectiveness of search results. Some of the most common
methods include user studies, surveys, test collections, and benchmarking.
Each of these methods helps researchers and developers understand how users
interact with an IR system and identify areas for improvement.
User studies involve observing users as they search for information and analysing
their behaviour to understand how they use the system. In this method, participants
are given specific search tasks, and researchers observe how they navigate the
search interface, what types of queries they enter, and how they interact with search
results. By analysing user behaviour, researchers can identify common difficulties and
make improvements to the system. For example, if users struggle to find relevant
information because they do not use the right keywords, the system can be improved
by providing better search suggestions.
Surveys are another important method used to collect feedback from users about
their experience with an IR system. A survey typically includes questions about how
satisfied users are with the search results, whether they found the information they
were looking for, and what difficulties they encountered during the search process.
Surveys help researchers understand user opinions and preferences, which can guide
improvements to the system. For instance, if a large number of users report that
search results are not relevant, developers may adjust the ranking algorithm to
improve accuracy.
Test collections provide a standardized way to evaluate the performance of an IR
system. In this method, a large set of documents is collected, along with predefined
search queries and expert judgments on which documents are most relevant to each
query. The system's performance is measured by comparing its search results with
the expert-judged relevance lists. This method helps researchers test different search
algorithms and determine which one provides the best results. For example, a medical
research database might use a test collection containing thousands of journal articles,
where experts have identified the most relevant papers for specific medical topics. By
testing different search algorithms on this dataset, researchers can improve the
accuracy of search results for medical professionals.
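The Python sketch below shows how such a test collection can be used to score a system: the system's ranked results for each query are compared against expert relevance judgments using precision and recall at rank k. The queries, document IDs, and judgments are invented for illustration.

```python
# A minimal sketch of evaluating a system on a test collection (toy data).

# Expert relevance judgments: for each query, the set of relevant document IDs.
relevance_judgments = {
    "diabetes treatment": {"d1", "d4", "d7"},
    "heart surgery risks": {"d2", "d5"},
}

# Ranked results returned by the system under test (hypothetical output).
system_results = {
    "diabetes treatment": ["d1", "d3", "d4", "d9", "d7"],
    "heart surgery risks": ["d8", "d2", "d6", "d5", "d1"],
}

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that the experts judged relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

for query, ranked in system_results.items():
    relevant = relevance_judgments[query]
    print(f"{query}: P@3={precision_at_k(ranked, relevant, 3):.2f}, "
          f"R@3={recall_at_k(ranked, relevant, 3):.2f}")
```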
Benchmarking is a method used to compare an IR system’s performance against a
standard or a competing system. In this approach, the system is evaluated using
predefined datasets and performance metrics to determine how well it retrieves
relevant information. The results are then compared with other systems to identify
areas for improvement. For example, if a new search engine is being developed, its
performance can be benchmarked against Google or Bing to see how well it ranks
results and how quickly it retrieves information. If the competitor's system performs
better, developers can analyse its features and make improvements to their own
system.
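A minimal benchmarking sketch follows: two hypothetical systems are run over the same queries and compared with Mean Reciprocal Rank (MRR), one common metric for this purpose. All rankings and judgments are invented for illustration.

```python
# A minimal sketch of benchmarking two systems with MRR (toy data).

relevant = {"q1": {"d3"}, "q2": {"d1"}, "q3": {"d5"}}   # key answers per query

results = {
    "System A": {"q1": ["d2", "d3", "d7"], "q2": ["d1", "d4"], "q3": ["d9", "d8", "d5"]},
    "System B": {"q1": ["d3", "d2", "d7"], "q2": ["d4", "d1"], "q3": ["d5", "d8"]},
}

def mean_reciprocal_rank(run, judgments):
    """Average of 1/rank of the first relevant result over all queries."""
    total = 0.0
    for query, ranked in run.items():
        for rank, doc in enumerate(ranked, start=1):
            if doc in judgments[query]:
                total += 1.0 / rank
                break
    return total / len(run)

for name, run in results.items():
    print(f"{name}: MRR = {mean_reciprocal_rank(run, relevant):.2f}")
# The system with the higher MRR places relevant results nearer the top.
```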
Q.10 Online Evaluation Method in Information Retrieval
Information Retrieval (IR) systems, such as search engines (Google, Bing, etc.), are
continuously evolving to provide better and more relevant results to users. Online
Evaluation in Information Retrieval refers to the process of testing and improving search
systems in real time, based on how real users interact with them.
To test and improve IR systems while they are running, different online evaluation
methods are used. Two of the most important methods are A/B Testing and
Interleaving Experiments. These methods help search engines determine whether
a new algorithm or ranking method provides better results compared to the existing
one.
A/B Testing in Information Retrieval
A/B Testing is a method where two different versions of a system (A and B) are tested with
real users to compare their effectiveness. The goal is to see which version provides better
results based on user interactions, such as click-through rate (CTR), time spent on a page, and
user satisfaction.
How A/B Testing Works?
1. Users are randomly divided into two groups.
o Group A sees the current version of the search system.
o Group B sees a new, modified version of the system.
2. Both groups perform searches as usual.
3. The system collects data on user interactions, such as:
o Which results users click on.
o How long users stay on a page.
o Whether users refine their search queries.
4. The results are analysed to determine which version performed better.
5. If the new version (B) performs significantly better, it may replace the existing version (A) permanently (a small sketch of this analysis follows the list).
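The Python sketch below, referenced in step 5, illustrates how the collected interaction data might be analysed: the click-through rate (CTR) of each group is computed, and a standard two-proportion z-score indicates whether the difference is likely to be more than chance. The counts are invented for illustration.

```python
# A minimal sketch of analysing an A/B test on click-through rate (toy counts).
import math

# Group A: current ranking, Group B: new ranking (hypothetical interaction logs).
group_a = {"searches": 10_000, "clicks": 3_200}
group_b = {"searches": 10_000, "clicks": 3_450}

def ctr(group):
    """Click-through rate: clicks divided by searches."""
    return group["clicks"] / group["searches"]

def z_score(a, b):
    """Two-proportion z-score for the difference in CTR between the groups."""
    p_pool = (a["clicks"] + b["clicks"]) / (a["searches"] + b["searches"])
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / a["searches"] + 1 / b["searches"]))
    return (ctr(b) - ctr(a)) / se

print(f"CTR A = {ctr(group_a):.3f}, CTR B = {ctr(group_b):.3f}, "
      f"z = {z_score(group_a, group_b):.2f}")
# A z-score above roughly 1.96 suggests the improvement is unlikely to be chance.
```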
Example of A/B Testing in Search Engines
A search engine wants to test a new ranking algorithm that prioritizes longer, more detailed
articles instead of shorter summaries.
Group A sees search results ranked using the current algorithm.
Group B sees search results ranked using the new algorithm.
After two weeks, the system analyses user behaviour. If users in Group B spend more time
on the recommended articles and click on results more frequently, the new algorithm is
adopted.
Advantages of A/B Testing
✔ Provides real-time feedback from actual users.
✔ Helps identify which version improves user experience.
✔ Allows gradual implementation of changes without affecting all users at once.
Interleaving Experiment in Information Retrieval
Interleaving is another method used to compare two different ranking algorithms. Unlike A/B
Testing, where users are divided into separate groups, Interleaving combines results from two
ranking algorithms and presents them to the same user in a mixed order. This allows for a
more direct comparison of the two algorithms.
How Interleaving Works?
1. A user enters a search query.
2. The system generates two lists of ranked search results:
o List A (using the old ranking algorithm).
o List B (using the new ranking algorithm).
3. Instead of showing one complete list, results from both lists are mixed together in a
single search results page.
4. The system tracks which results the user clicks on.
5. If most of the clicked results come from Algorithm B, this suggests that the new ranking method is better (a simplified sketch of this procedure follows the list).
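The Python sketch below, referenced in step 5, implements a simplified variant of team-draft interleaving: the two ranked lists are merged by coin flips, each merged result is credited to the algorithm that contributed it, and clicks are then tallied per algorithm. The rankings, clicks, and random seed are invented for illustration; real systems balance which list picks first and aggregate over many queries.

```python
# A simplified sketch of interleaving two rankings (toy data).
import random

def interleave(list_a, list_b, seed=0):
    """Build one mixed list, remembering which algorithm contributed each result."""
    rng = random.Random(seed)
    mixed, credit = [], {}
    while True:
        # Candidates: each list's highest-ranked document not yet in the mix.
        next_a = next((d for d in list_a if d not in credit), None)
        next_b = next((d for d in list_b if d not in credit), None)
        if next_a is None and next_b is None:
            break
        # A coin flip decides which algorithm places its candidate next.
        if next_b is None or (next_a is not None and rng.random() < 0.5):
            mixed.append(next_a)
            credit[next_a] = "A"
        else:
            mixed.append(next_b)
            credit[next_b] = "B"
    return mixed, credit

list_a = ["d1", "d2", "d3", "d4"]        # old ranking algorithm
list_b = ["d3", "d5", "d1", "d6"]        # new ranking algorithm
mixed, credit = interleave(list_a, list_b)

clicked = ["d3", "d5"]                   # hypothetical user clicks
wins = {"A": 0, "B": 0}
for doc in clicked:
    wins[credit[doc]] += 1

print(mixed)
print("Clicks credited:", wins)          # more B credits favour the new algorithm
```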
Example of Interleaving in Search Engines
A search engine wants to test a new AI-based ranking algorithm. Instead of showing one ranking to half of the users and a different ranking to the other half (as in A/B Testing), it mixes
results from both algorithms and presents them in one search list. If users click on more
results from the AI-based ranking, it indicates that this method provides better relevance, and
the system might adopt it permanently.
Advantages of Interleaving
✔ Provides a fair and unbiased comparison of two ranking methods.
✔ Users are not aware that they are part of an experiment, leading to natural behaviour.
✔ Allows faster testing since the same users interact with both ranking algorithms at the same
time.