Module 5.
6
Text Indices
1
Portion:
Inverted files, Other indices for text, Boolean Queries, Sequential Searching, Pattern Matching,
Structural Queries, Compression; Multimedia IR: Indexing and Searching:- A Generic Multimedia
indexing approach, , Automatic Feature extraction; Searching Web: Challenges, Characterizing the web,
Search Engines. Browsing, Meta searches, Searching using Hyperlinks. Self-learning Topics: Koha
2
Inverted Index
● An inverted index is a data structure used by search engines to map terms (or words) to the
documents or locations where they appear.
● It's called "inverted" because it is the reverse of a typical index, where you map documents to
their content.
● In an inverted index, the content (terms) is mapped to the documents containing those terms.
● Inverted indexes enable efficient full-text search, as they allow the search engine to quickly locate
all documents containing a particular word or set of words.
● There are two types of inverted indexes:
1. Inverted File: Maps each word (term) to the list of documents where it occurs.
2. Inverted List: Maps each word to both the documents and the positions (locations) where the
word occurs in those documents.
3
Steps to create an Inverted Index
1. Document Collection
The first step involves gathering a set of documents or texts where the search functionality is to be
implemented. Let’s assume we have three documents in our collection:
● Doc 1: "Data structures and algorithms"
● Doc 2: "Algorithms are key to data processing"
● Doc 3: "Efficient data structures enable fast processing"
2. Tokenization
In this step, each document is broken down into individual words, also called tokens. This involves splitting
the text by spaces, punctuation, or other delimiters.
● Doc 1 → ["data", "structures", "and", "algorithms"]
● Doc 2 → ["algorithms", "are", "key", "to", "data", "processing"]
● Doc 3 → ["efficient", "data", "structures", "enable", "fast", "processing"]
4
Steps to create an Inverted Index (Cntd..)
3. Normalization (Optional)
Before building the index, words are often normalized to make searching more efficient. Common normalization tasks include:
● Case folding: Converting all words to lowercase.
● Stemming or Lemmatization: Reducing words to their base forms (e.g., "processing" to "process").
● Stop-word removal: Removing common words like "the," "and," "are," as they don’t add much meaning to the search.
After normalization:
● Doc 1 → ["data", "structure", "algorithm"]
● Doc 2 → ["algorithm", "key", "data", "process"]
● Doc 3 → ["efficient", "data", "structure", "enable", "fast", "process"]
4. Building the Inverted Index
Now, we build the inverted index. For each unique word, we create an entry that lists the document(s) in which it appears.
● "algorithm" → Doc 1, Doc 2
● "data" → Doc 1, Doc 2, Doc 3
● "structure" → Doc 1, Doc 3
● "process" → Doc 2, Doc 3
● "key" → Doc 2
● "efficient" → Doc 3
● "enable" → Doc 3
● "fast" → Doc 3 This is the basic inverted file where each term maps to the document(s) in which it appears.
5
Steps to create an Inverted Index(Cntd..)
5. Positional Inverted Index (Optional)
If you want to track where in the document a word appears, you can create a positional inverted index. This helps
for phrase queries (e.g., searching for "data structures" as a phrase).
Here’s an example of a positional inverted index:
● "algorithm" → Doc 1: [3], Doc 2: [1]
● "data" → Doc 1: [1], Doc 2: [4], Doc 3: [2]
● "structure" → Doc 1: [2], Doc 3: [3]
● "process" → Doc 2: [6], Doc 3: [5]
In this example, the numbers represent the position of the word in each document.
6. Storing the Inverted Index
The inverted index can be stored as a dictionary (or hash table) where each term is the key, and the associated value
is the list of document IDs (or document IDs with positions) where the term appears.
For example:
{"algorithm": {"Doc 1": [3], "Doc 2": [1] },
"data": {"Doc 1": [1], "Doc 2": [4], "Doc 3": [2] },
…}
6
Example
Let’s walk through creating a simplified inverted index for the
three documents.
1. Documents: 4. Build the Inverted Index:
○ Doc 1: "Data structures and algorithms" ○ "algorithms" → [Doc 1, Doc 2]
○ Doc 2: "Algorithms are key to data processing" ○ "data" → [Doc 1, Doc 2, Doc 3]
○ Doc 3: "Efficient data structures enable fast ○ "structures" → [Doc 1, Doc 3]
processing" ○ "processing" → [Doc 2, Doc 3]
○ "key" → [Doc 2]
2. Tokenization: ○ "efficient" → [Doc 3]
○ Doc 1 → ["data", "structures", "and", ○ "enable" → [Doc 3]
"algorithms"] ○ "fast" → [Doc 3]
○ Doc 2 → ["algorithms", "are", "key", "to", "data",
"processing"] 5. Positional Index (if required):
○ Doc 3 → ["efficient", "data", "structures", ○ "algorithms" → Doc 1: [3], Doc 2: [1]
"enable", "fast", "processing"] ○ "data" → Doc 1: [1], Doc 2: [4], Doc 3: [2]
○ "structures" → Doc 1: [2], Doc 3: [3]
3. Normalization (lowercasing, stemming, stop-word ○ "processing" → Doc 2: [6], Doc 3: [5]
removal):
○ Doc 1 → ["data", "structures", "algorithms"]
○ Doc 2 → ["algorithms", "key", "data",
"processing"]
○ Doc 3 → ["efficient", "data", "structures", 7
How is the Inverted Index Used in Search?
Let’s say the user searches for the term "data" and "algorithms". The search engine looks up these terms
in the inverted index and returns the documents containing both terms.
● "data" is in Doc 1, Doc 2, and Doc 3.
● "algorithms" is in Doc 1 and Doc 2.
Since both terms are in Doc 1 and Doc 2, these documents are returned to the user.
If phrase searching is needed (e.g., "data structures"), the positional index will be used to ensure that
"data" is followed by "structures" in the same document.
8
How can the process of creating Inverted Index be optimized using block
addressing?
● Block addressing is an optimization technique used in inverted indexing to make the search process more
efficient, especially for large-scale document collections.
● It involves grouping postings (i.e., the document IDs where a term appears) into blocks and associating
each term with a pointer to the blocks that contain the postings for that term.
● This reduces the size of the inverted index and improves search speed.
How Block Addressing Works
In a basic inverted index, each term is directly associated with a list of document IDs (or a list of documents with
positional data). As the number of documents increases, the list of postings for each term can become very large.
Block addressing reduces the size and complexity of managing these postings by grouping the postings into
smaller, more manageable blocks.
● Each term points to a series of blocks, where each block contains a range of document IDs.
● The postings are stored in these blocks, and instead of pointing directly to each document, the inverted
index points to the blocks that store document IDs.
9
Example: Create an Inverted Index with Block Addressing
Step 1: Document Collection
Let’s assume we have the following small collection of documents:
● Doc 1: "Data structures and algorithms"
● Doc 2: "Algorithms are key to data processing"
● Doc 3: "Efficient data structures enable fast processing"
● Doc 4: "Data structures allow efficient retrieval"
Step 2: Tokenization and Normalization
We tokenize and normalize each document by converting the words to lowercase and removing stop words (like
"and," "to," etc.). The resulting terms for each document are as follows:
● Doc 1: ["data", "structures", "algorithms"]
● Doc 2: ["algorithms", "key", "data", "processing"]
● Doc 3: ["efficient", "data", "structures", "enable", "processing"]
● Doc 4: ["data", "structures", "efficient", "retrieval"]
10
Example: Create an Inverted Index with Block Addressing (Cntd..)
Step 3: Create Postings Lists
After tokenization, the next step is to create a posting list for each term. A posting list contains the
document IDs where the term appears. Here’s what the posting lists look like:
● "data" → [Doc 1, Doc 2, Doc 3, Doc 4]
● "structures" → [Doc 1, Doc 3, Doc 4]
● "algorithms" → [Doc 1, Doc 2]
● "processing" → [Doc 2, Doc 3]
● "efficient" → [Doc 3, Doc 4]
● "retrieval" → [Doc 4]
● "enable" → [Doc 3]
● "key" → [Doc 2]
11
Example: Create an Inverted Index with Block Addressing (Cntd..)
Step 4: Block Addressing
Instead of storing the full posting list as a single long list, we will divide the postings into blocks. Each block will store a
fixed number of document IDs. For simplicity, let’s assume each block can hold two document IDs.
Breaking Postings into Blocks:
Inverted Index with Block Addressing looks as follows:
● "data" → [Doc 1, Doc 2, Doc 3, Doc 4]
○ Block 1: [Doc 1, Doc 2]
Each term points to a list of blocks, and each block contains a
○ Block 2: [Doc 3, Doc 4]
● "structures" → [Doc 1, Doc 3, Doc 4] set of document IDs:
○ Block 1: [Doc 1, Doc 3]
● "data" → Block 1: [Doc 1, Doc 2], Block 2: [Doc 3, Doc
○ Block 2: [Doc 4]
● "algorithms" → [Doc 1, Doc 2] 4]
○ Block 1: [Doc 1, Doc 2] ● "structures" → Block 1: [Doc 1, Doc 3], Block 2: [Doc 4]
● "processing" → [Doc 2, Doc 3] ● "algorithms" → Block 1: [Doc 1, Doc 2]
○ Block 1: [Doc 2, Doc 3] ● "processing" → Block 1: [Doc 2, Doc 3]
● "efficient" → [Doc 3, Doc 4] ● "efficient" → Block 1: [Doc 3, Doc 4]
○ Block 1: [Doc 3, Doc 4]
● "retrieval" → Block 1: [Doc 4]
● "retrieval" → [Doc 4]
○ Block 1: [Doc 4] ● "enable" → Block 1: [Doc 3]
● "enable" → [Doc 3] ● "key" → Block 1: [Doc 2]
○ Block 1: [Doc 3]
● "key" → [Doc 2]
○ Block 1: [Doc 2] 12
Step 5: Query Processing with Block Addressing
Now, let’s look at how block addressing improves search performance using the following query: "data AND structures".
Step 5.1: Retrieve Postings for ● With block addressing, we only need to Step 5.3: Efficiency of Block
"data" and "structures" intersect the document IDs in the Addressing
corresponding blocks.
● For "data", the inverted index ● With block addressing, we only
points to: Intersecting the Blocks: needed to compare small
○ Block 1: [Doc 1, Doc 2] subsets of document IDs
○ Block 2: [Doc 3, Doc 4] ● Block 1 for "data" ([Doc 1, Doc 2]) and (within blocks) instead of
● For "structures", the inverted Block 1 for "structures" ([Doc 1, Doc 3]): comparing the full posting lists.
index points to: ○ Intersection → [Doc 1] (Doc 1 ● This reduces the time
○ Block 1: [Doc 1, Doc 3] appears in both Block 1 for "data" complexity of intersecting long
○ Block 2: [Doc 4] and Block 1 for "structures") posting lists and minimizes the
● Block 2 for "data" ([Doc 3, Doc 4]) and amount of data that needs to
Block 2 for "structures" ([Doc 4]): be loaded into memory for
Step 5.2: Intersect the Postings ○ Intersection → [Doc 4] (Doc 4 large-scale document
appears in both Block 2 for "data"
● To answer the query "data AND collections.
and Block 2 for "structures")
structures," we need to find
documents that appear in both The final result for the query "data AND
the posting lists for "data" and structures" is [Doc 1, Doc 4].
"structures." 13
Advantages of Block Addressing in Our Example
Faster Query Processing:
● Instead of processing the entire posting lists for both terms, block addressing allowed us to focus only on the relevant
blocks. This reduces the number of comparisons needed, especially for larger datasets.
Optimized I/O Operations:
● Only the blocks that are needed for a particular query are accessed, reducing the number of disk reads. In this example,
instead of reading the full postings for "data" and "structures," we only had to retrieve two small blocks for each term.
Better Cache Utilization:
● Block addressing helps in caching blocks. When the same term is searched frequently, its blocks can be cached,
improving performance for repeated queries.
Efficient Memory Usage:
● Blocks can be compressed (e.g., using delta encoding). For instance, in Block 1 for "data" ([Doc 1, Doc 2]), instead of
storing the full document IDs, we can store the difference between Doc IDs (i.e., 1 and 1), which reduces the memory
footprint.
Scalability:
● As the document collection grows, block addressing becomes more effective, since only a small number of blocks need
to be processed for each query.
14
Module 5
Indexing and Searching
15
1. Describe Metasearchers and its merits with example
2. Describe the process of creating Inverted index with example. How can this process be optimized using
block addressing?
3. Describe sequential searching. Explain any one algorithm for the same.
4. Write short notes on Multimedia indexing approach, interface support search process
5. Describe concept of text search engine.
6. Explain inverted file indexing with suitable examples.
7. Search engine architecture.
16
MetaSearchers
● A metasearch engine (also called a meta-searcher) is a type of search tool that sends a user's query to multiple search
engines and databases, aggregates the results, and presents them in a unified, ranked list.
● Unlike traditional search engines like Google or Bing, which maintain their own databases of indexed content,
metasearch engines rely on the results from other search engines. They don't store web pages but instead retrieve data
in real time from external search engines.
How Metasearch engine works?
1. Query Submission:
A user enters a search term or phrase into the metasearch engine.
2. Query Dispatching:
The metasearch engine sends this query to several other search engines (e.g., Google, Bing, Yahoo), specialized
databases (such as academic repositories), or even other metasearch engines.
3. Result Aggregation:
Once the external search engines provide their results, the metasearch engine aggregates them. The process
typically involves removing duplicate results and then ranking the results based on relevance.
4. Display of Results:
The results from the various search engines are then presented in a single, unified list, allowing users to see the
combined information in one place. 17
Types of Metasearch Engines
1. General Metasearch Engines:
a. These search engines pull data from a variety of search engines that cater to general web searches.
b. Examples include:
i. DuckDuckGo: Known for its privacy, DuckDuckGo aggregates results from sources like Yahoo, Bing, and
its own crawler.
ii. Dogpile: Queries multiple search engines like Google, Yahoo, and Yandex.
2. Vertical Metasearch Engines (Domain-Specific):
a. These are focused on specific industries or types of content, such as travel, shopping, or academic research.
b. Examples include:
i. Kayak or Skyscanner (travel): Aggregates flight, hotel, and car rental prices from various booking
websites.
ii. Indeed (job search): Aggregates job postings from various job boards and company websites.
iii. Jstor or Google Scholar (academic): Aggregates academic papers from different journals and
repositories.
3. Metasearch Engines for E-commerce:
a. These focus on comparing product prices and reviews across various online stores.
b. Examples include:
i. PriceGrabber: Compares prices from multiple online stores.
ii. Shopping.com: Aggregates product results across a range of e-commerce sites. 18
Benefits of Metasearch Engines
● Access to a Broader Range of Results:
○ Since metasearch engines pull data from multiple search engines, users have access to more comprehensive and
diverse information sources.
● Time Efficiency:
○ Users save time by not needing to visit multiple search engines or platforms individually. The metasearch engine
collects and presents all the information in one place.
● Mitigation of Algorithm Bias:
○ Traditional search engines use specific algorithms that influence the ranking of search results. By aggregating
from different sources, metasearch engines mitigate the influence of any one search engine’s ranking algorithm,
offering more balanced results.
● Privacy:
○ Some metasearch engines, such as DuckDuckGo or StartPage, are designed to provide enhanced privacy by
not tracking user activity or using personalized search algorithms, which are common in traditional search
engines.
● Diverse Sources and Perspectives:
○ By pulling results from a variety of platforms, metasearch engines can expose users to diverse opinions, sources,
and perspectives, which can be particularly useful in academic research, political analysis, or news aggregation.
19
Challenges and Drawbacks
● Lack of Original Data:
○ Metasearch engines depend entirely on other search engines or data sources. This makes them vulnerable to
changes in how third-party search engines function or what data they allow metasearch engines to access.
● Redundancy in Results:
○ While metasearch engines remove duplicate results to some extent, they can still sometimes return redundant
results or show lower quality pages.
● Limited Advanced Features:
○ Many traditional search engines provide advanced search options such as filtering by date, image search, or
filtering by specific file types. Metasearch engines typically lack these advanced capabilities because they don’t
have full control over the data they pull from other search engines.
● Inconsistent Performance:
○ Because metasearch engines rely on third-party platforms, if one of these platforms restricts access or changes
how results are delivered, it can affect the performance and relevance of the metasearch engine.
● Results Quality:
○ Aggregating from multiple sources doesn't always mean better quality. If the external sources have irrelevant or
low-quality results, these can still appear in the meta search engine's list.
20
Examples of Metasearch Engines
1. DuckDuckGo (Privacy-Oriented)
● How it works: DuckDuckGo pulls search results from Yahoo, Bing, and its own index. It’s popular for its strict privacy
policy—it doesn’t track user behavior or use personalized search algorithms.
● Use case: Ideal for users who prioritize privacy and want an alternative to Google’s personalized results.
2. Kayak (Travel Metasearch)
● How it works: Kayak aggregates flight, hotel, and car rental options from various travel booking websites such as
Expedia, Priceline, and airline websites. It also offers filters for comparing prices, layovers, and flight duration.
● Use case: For travelers looking to compare prices from multiple travel websites to find the best deal.
3. Dogpile (General Web Search)
● How it works: Dogpile combines results from Google, Yahoo, Yandex, and others. It attempts to remove duplicate
listings and aggregates the top results from each engine.
● Use case: Useful for users who want the breadth of multiple search engines without using them individually.
4. Indeed (Job Metasearch)
● How it works: Indeed aggregates job postings from company career pages, job boards like Monster and LinkedIn, and
other employment platforms.
● Use case: Job seekers can quickly access job listings from various platforms in one search.
21
Real-World Example of a Metasearch Engine in Action
Let’s say a user is planning a vacation. They want to compare prices for flights and hotels but don’t have time to visit
individual travel websites like Expedia, Orbitz, or airline websites. Using Kayak, the user can:
1. Enter their travel dates and destination.
2. Kayak will query multiple travel sites, including airline websites and online travel agencies.
3. Kayak will aggregate the flight and hotel prices and show them in one list, allowing the user to filter based on price,
travel duration, number of stops, and more.
4. The user can then make a decision based on the most comprehensive comparison, saving time and potentially finding
a better deal.
FOR SEARCH ENGINE ARCHITECTURE :Link
22
23
Search Engine - Introdcution, Architecture & Working - GeeksforGeeks 24