
Information Retrieval Techniques

Unit - 1

1.List any four applications of Information retrieval.

 Blog search.
 Image retrieval.
 3D retrieval.
 Music retrieval.
 News search.
 Speech retrieval.
 Video retrieval.

2. Write about GREP command.


Grep is a useful command for searching for lines that match a pattern in a file; grep is short for
"global regular expression print". Whether you are a system administrator who needs to scrape
through log files or a developer trying to find certain occurrences in a code file,
grep is a powerful command to use.
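
As a rough sketch of what grep does (not the grep implementation itself; the file name and pattern below are just examples), the same behaviour can be mimicked in a few lines of Python with the re module:

import re
import sys

def grep(pattern, path):
    # Print every line of the file that matches the pattern,
    # roughly what `grep pattern path` does on the command line.
    regex = re.compile(pattern)
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if regex.search(line):
                print(f"{lineno}:{line}", end="")

if __name__ == "__main__":
    # e.g. python grep_sketch.py "error" server.log
    grep(sys.argv[1], sys.argv[2])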

3. Outline the purpose of Lemmatize.


The goal of lemmatization is to reduce a word to its root form, also called a lemma.
For example, the verb "running" would be reduced to "run." Lemmatization relies on
morphological (i.e., structural) and contextual analysis of words.
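
A small example using NLTK's WordNet lemmatizer (this assumes the nltk package and its WordNet data are installed; the sample words are illustrative):

from nltk.stem import WordNetLemmatizer

# Requires the WordNet data, e.g. nltk.download("wordnet") on first use.
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running", pos="v"))  # -> run
print(lemmatizer.lemmatize("mice"))              # -> mouse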

4. Mention the two important parts of an inverted index with example.


o Dictionary (the vocabulary of terms)
o Postings lists (for each term, the sorted list of docIDs of the documents in which it occurs); a small example is sketched below.
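
A minimal sketch of these two parts, using a Python dict as the dictionary and sorted lists of docIDs as the postings lists (the three toy documents are invented for illustration):

from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = defaultdict(list)           # term -> postings list of docIDs
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):  # each docID is added at most once per term
        index[term].append(doc_id)

print(sorted(index))                # the dictionary (vocabulary of terms)
print(index["sales"])               # postings list for "sales": [1, 2, 3]
print(index["july"])                # postings list for "july":  [2, 3]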
5. Define tokenization and tokenize the following document.

Tokenization is the process of chopping character streams into tokens; linguistic
preprocessing then deals with building equivalence classes of tokens, which are the
set of terms that are indexed.
6. Define Bag of words model.
The exact ordering of the terms in a document is ignored but the number of
occurrences of each term is material. A bag-of-words is a representation of text that
describes the occurrence of words within a document.
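
A quick bag-of-words sketch using collections.Counter, which discards word order and keeps only the number of occurrences of each term (the sample sentence is made up for illustration):

from collections import Counter

text = "to be or not to be"
bag_of_words = Counter(text.split())

print(bag_of_words)        # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
print(bag_of_words["be"])  # 2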

7. State the need for Query optimization.


Query optimization is used to access and modify the database in the most efficient
way possible. It is the art of obtaining necessary information in a predictable,
reliable, and timely manner.

8. Differentiate between normalization and case folding.


In IR, token normalization is the process of canonicalizing tokens so that matches occur
despite superficial differences in the character sequences of the tokens; for example,
U.S.A. and USA can be normalized to the same term.

Case-folding is a particular kind of normalization in which all letters are reduced to a
single case, most commonly lowercase, so that, for example, Automobile and automobile match.

9. Differentiate between type and token in IR document.


A token is an instance of a sequence of characters in some particular document that
are grouped together as a useful semantic unit for processing. A type is the class of
all tokens containing the same character sequence. A term is a (perhaps normalized)
type that is included in the IR system's dictionary.

10. List and define two effectiveness measures of an IR search results.


Precision and recall.

Precision is the fraction of retrieved documents that are relevant, and recall is the
fraction of relevant documents that are retrieved. Precision measures the accuracy of
the returned results, while recall measures their completeness.

11. Identify the possible issues in tokenization.


Issues include deciding what counts as a token: apostrophes (e.g., O'Neill, aren't),
hyphenation (e.g., co-education, Hewlett-Packard), and whitespace inside what is
conceptually one token (e.g., San Francisco); handling numbers, dates, and other special
formats; and language dependence, for instance compound nouns in German or languages
such as Chinese and Japanese that are written without spaces between words.
12. Write down the entries in the permuterm index dictionary that are
generated by the term mama.
mama$, ama$m, ma$ma, a$mam, $mama.
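
The entries are simply all rotations of the term with the end-of-word marker $ appended; a short sketch that generates them:

def permuterm_entries(term):
    # All rotations of term + '$' for a permuterm index.
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_entries("mama"))
# ['mama$', 'ama$m', 'ma$ma', 'a$mam', '$mama']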

13. Explain the use of proximity operators in extended Boolean retrieval models.
A proximity operator is a way of specifying that two terms in a query must occur
close to each other in a document, where closeness may be measured by limiting the
allowed number of intervening words or by reference to a structural unit such as a
sentence or paragraph.

14. Outline normalization and suggest what normalized form should be used for these words.

A) Because

B) Shouldn't or Shiite which is a branch of Islam

C) Continued

D) Hawaii

E) Karaoke or O'Rourke (Beto O'Rourke is the name of a famous American politician).

The forms above are chosen as the closest standard-English equivalents of the given words;
they are the most plausible normalized forms that can be derived from each hint.
Unit – 2

1. Write basic idea of BSBI algorithm


The blocked sort-based indexing (BSBI) algorithm (i) segments the collection into parts of
equal size, (ii) sorts the termID-docID pairs of each part in memory, (iii) stores
intermediate sorted results on disk, and (iv) merges all intermediate results into the final index.
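
A simplified sketch of the four steps, with in-memory lists standing in for the disk files (the block size and the heapq-based merge are illustrative choices, not the textbook implementation):

import heapq
from itertools import groupby

def bsbi(term_doc_pairs, block_size=4):
    # (i) + (ii): split the pairs into equal-size blocks and sort each block
    blocks = [sorted(term_doc_pairs[i:i + block_size])
              for i in range(0, len(term_doc_pairs), block_size)]
    # (iii): each sorted block stands in for an intermediate result on disk
    # (iv): merge all intermediate results into the final index
    merged = heapq.merge(*blocks)
    return {term: sorted({doc for _, doc in group})
            for term, group in groupby(merged, key=lambda p: p[0])}

pairs = [(2, 1), (1, 1), (2, 2), (3, 2), (1, 3), (3, 3), (2, 3)]
print(bsbi(pairs))   # {1: [1, 3], 2: [1, 2, 3], 3: [2, 3]}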

2. List two key ideas in constructing single-pass in-memory indexing.


In addition to constructing a new dictionary structure for each block and eliminating
the expensive sorting step, SPIMI has a third important component: compression.
Both the postings and the dictionary terms can be stored compactly on disk if we
employ compression.
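
A toy sketch of the SPIMI-Invert idea for one block: postings are added directly to a per-term list in a fresh dictionary (terms instead of termIDs, no global sort of pairs), and the terms are sorted only when the block is written out. The token stream below is invented for illustration:

from collections import defaultdict

def spimi_invert(token_stream):
    # token_stream yields (term, docID) pairs for one block
    dictionary = defaultdict(list)           # new dictionary for this block
    for term, doc_id in token_stream:
        postings = dictionary[term]          # created on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)          # add the posting directly to its list
    # sort the terms only when the block is written out
    return dict(sorted(dictionary.items()))

stream = [("new", 1), ("york", 1), ("new", 2), ("post", 2), ("los", 3)]
print(spimi_invert(stream))
# {'los': [3], 'new': [1, 2], 'post': [2], 'york': [1]}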

3. Write about two properties of gamma (γ) codes in posting file compression.
A γ code represents a gap G as a pair of length and offset: the offset is G in binary with
the leading 1 chopped off, and the length encodes the number of offset bits in unary. A γ
code is decoded by first reading the unary code up to the 0 that terminates it; for example,
the four bits 1110 when decoding 1110101 tell us the offset is 3 bits long (101), so G is
1101 in binary, i.e. 13. Two properties that make γ codes useful for compressing postings
are that they are prefix free (a concatenation of codes can be decoded unambiguously) and
parameter free (no parameters need to be estimated or stored for the collection).
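
A minimal sketch of γ encoding and decoding as bit strings, reproducing the 1110101 example for the gap 13:

def gamma_encode(gap):
    # Encode a positive integer gap as a gamma code bit string.
    offset = bin(gap)[3:]               # binary form with the leading 1 chopped off
    length = "1" * len(offset) + "0"    # length of the offset, in unary
    return length + offset

def gamma_decode(bits):
    # Decode one gamma code from the front of a bit string.
    n = bits.index("0")                 # number of leading 1s = offset length
    offset = bits[n + 1 : n + 1 + n]
    return int("1" + offset, 2), bits[n + 1 + n:]

print(gamma_encode(13))         # 1110101
print(gamma_decode("1110101"))  # (13, '')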

4. Define document frequency (df) and inverse document frequency (idf)


The document frequency df_t of a term t is the number of documents in the collection that
contain t. The inverse document frequency is idf_t = log(N / df_t), where N is the total
number of documents; it is high for rare terms and low for very frequent ones. Combined
with term frequency, it gives the TF-IDF weighting, a simple but intuitive way of
determining how relevant a word is to a given document.

5. Give the significance of Tf-Idf weighting for a term t in a document d.


Because TF-IDF gives the highest weights to the terms that are most characteristic of a
document (tf-idf(t, d) = tf(t, d) × idf(t)), the top-weighted terms can be taken as the
most important ones. This can be used to summarize articles more efficiently or simply
to determine keywords (or even tags) for a document.
6. Compute tf-idf for the following documents

Tf values

Each term occurs at most once in each document, so tf is 1 where a term appears in a
document and 0 otherwise.

Some terms appear in two documents, some appear in only one document. The total
number of documents is N = 3. Therefore, the idf values for the terms are:

angeles  log2(3/1) = 1.584
los      log2(3/1) = 1.584
new      log2(3/2) = 0.584
post     log2(3/1) = 1.584
times    log2(3/2) = 0.584
york     log2(3/2) = 0.584

TF-IDF

     angeles  los    new    post   times  york
d1   0        0      0.584  0      0.584  0.584
d2   0        0      0.584  1.584  0      0.584
d3   1.584    1.584  0      0      0.584  0
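
A short sketch that reproduces the tables above. The three documents are assumed to be d1 = "new york times", d2 = "new york post", and d3 = "los angeles times" (reconstructed from the idf values; they are not stated in the question as reproduced here). Note that log2(3) ≈ 1.585 and log2(3/2) ≈ 0.585 when rounded to three decimals; the table above truncates these to 1.584 and 0.584.

import math

docs = {"d1": "new york times".split(),
        "d2": "new york post".split(),
        "d3": "los angeles times".split()}
N = len(docs)

terms = sorted({t for words in docs.values() for t in words})
df = {t: sum(t in words for words in docs.values()) for t in terms}
idf = {t: math.log2(N / df[t]) for t in terms}

for d, words in docs.items():
    row = {t: round(words.count(t) * idf[t], 3) for t in terms}
    print(d, row)
# d1 {'angeles': 0.0, 'los': 0.0, 'new': 0.585, 'post': 0.0, 'times': 0.585, 'york': 0.585}
# d2 {'angeles': 0.0, 'los': 0.0, 'new': 0.585, 'post': 1.585, 'times': 0.0, 'york': 0.585}
# d3 {'angeles': 1.585, 'los': 1.585, 'new': 0.0, 'post': 0.0, 'times': 0.585, 'york': 0.0}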

7. State the need for dictionary compression.


The main goal of compressing the dictionary is to fit it in main memory, or at least a
large portion of it, in order to support high query throughput. Other reasons for wanting to
conserve memory are fast startup time and having to share resources with other
applications.
8. List the Parameters in calculating a weight for a document term or query
term.
 term frequency,
 inverse document frequency,
 document length.

9. List the characteristics of Hardware basics needed for IR system.


 Caching
 Seek time
 Buffering
10. Give the significance of SPIMI.
SPIMI uses terms instead of termIDs,
writes each block’s dictionary to disk, and then starts a new dictionary for the
next block. SPIMI can index collections of any size as long as there is enough
disk space available.
11. What are the advantages and disadvantages of query processing?
Advantages
The searching process is easy to understand; current information is available in the
storage database; users can access multiple databases and use multiple keywords/concepts
at the same time; and multiple users can be served at the same time.
Disadvantages
1) Expense. 2) Reduction in jobs. 3) Security breaches.
12. What is the difference between BSBI and SPIMI?
A difference between BSBI and SPIMI is that SPIMI adds each posting directly to its
postings list as the collection is processed, whereas BSBI first collects all
termID-docID pairs and then sorts them.
14. List the advantages of index construction.
The main advantage of this method is its simplicity: during index construction, we can
simply assign successive integers to each new document when it is first encountered.
15. Estimate the space usage of the Reuters dictionary with blocks of size k = 8 and
k = 16 in blocked dictionary storage.
Solution:
k = 8: we save (8 - 1) × 3 = 21 bytes of term pointers per block but need an additional
k = 8 bytes for term lengths, so space is reduced by 13 bytes per 8-term block.
Total space saved = 400,000 × 13 / 8 = 0.65 MB, so the total space is 7.6 - 0.65 = 6.95 MB.
k = 16: we save (16 - 1) × 3 = 45 bytes of term pointers per block but need an additional
k = 16 bytes for term lengths, so space is reduced by 29 bytes per 16-term block.
Total space saved = 400,000 × 29 / 16 = 0.725 MB, so the total space is 7.6 - 0.725 = 6.875 MB.
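
The same arithmetic as a small sketch (the 400,000 terms, 3 bytes per term pointer, one extra length byte per term, and 7.6 MB baseline are taken from the worked example above):

def blocked_dictionary_space(k, num_terms=400_000, base_mb=7.6, ptr_bytes=3):
    # One term pointer is saved for k-1 of every k terms,
    # at the cost of one length byte per term.
    saved_per_block = (k - 1) * ptr_bytes - k * 1    # bytes saved per block
    saved_mb = num_terms * saved_per_block / k / 1_000_000
    return saved_mb, round(base_mb - saved_mb, 3)

print(blocked_dictionary_space(k=8))    # (0.65, 6.95)
print(blocked_dictionary_space(k=16))   # (0.725, 6.875)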

UNIT -3

1.List any four standard test collections for information retrieval system
evaluation.
 Cranfield
 TREC (Text Retrieval Conference)
 GOV2
 NTCIR
 CLEF (Cross-Language Evaluation Forum)
 Reuters
 Newsgroups

2. Identify the three important things to measure ad hoc IR effectiveness.

1. A document collection
2. A test suite of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment of
either relevant or nonrelevant for each query-document pair.
3. Define interpolated precision.
The interpolated precision at a recall level r is defined as the highest precision found
for any recall level r' ≥ r.

4. Define precision and recall.

Precision is the fraction of retrieved documents that are relevant: P = tp / (tp + fp).
Recall is the fraction of relevant documents that are retrieved: R = tp / (tp + fn).
Precision can be seen as a measure of quality, and recall as a measure of quantity.

5. Define Kappa statistics.


The Kappa statistic, or Cohen's kappa, is a statistical measure of inter-rater reliability
for categorical variables; in IR evaluation it is used to measure the agreement between
relevance judges.

6. Translate information given below into a query


Drinking lemon tea is more effective at reducing your risk of heart attacks than
drinking coffee.
The user task: it all starts with the user converting the information need into a query. In an
information retrieval system, a collection of words is used to convey the semantics
of the information that is requested, whereas in a data retrieval system a query
phrase is used to convey the constraints that the objects must satisfy. For example, the
statement above might be translated into the keyword query: lemon tea coffee heart attack risk.

7. Define Cosine Similarity.


Cosine similarity measures the similarity between two vectors of an inner product
space. It is measured by the cosine of the angle between two vectors and determines
whether two vectors are pointing in roughly the same direction. It is often used to
measure document similarity in text analysis.
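
A small sketch of cosine similarity between two term-weight vectors (the vectors are toy values chosen for illustration):

import math

def cosine_similarity(u, v):
    # Cosine of the angle between vectors u and v.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = [1, 1, 1, 0]   # toy term weights for one document
d2 = [1, 1, 0, 1]   # toy term weights for another
print(round(cosine_similarity(d1, d2), 3))   # 0.667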

8. Define MapReduce function.


MapReduce is a Java-based, distributed execution framework within the Apache
Hadoop Ecosystem. It takes away the complexity of distributed programming by
exposing two processing steps that developers implement: 1) Map and 2) Reduce.
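
MapReduce itself runs on a Hadoop cluster, but the two processing steps can be illustrated with a plain-Python word-count sketch (no Hadoop involved; the documents are invented):

from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in text.split()]

def reduce_phase(word, counts):
    # Reduce: sum all the counts emitted for one word.
    return word, sum(counts)

docs = {1: "new york times", 2: "new york post"}

grouped = defaultdict(list)          # the shuffle step: group mapped pairs by key
for doc_id, text in docs.items():
    for word, count in map_phase(doc_id, text):
        grouped[word].append(count)

print(dict(reduce_phase(w, c) for w, c in grouped.items()))
# {'new': 2, 'york': 2, 'times': 1, 'post': 1}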

9. State Mean Average Precision.


Mean average precision (MAP) is the mean, over a set of queries, of the average precision
of each query, where average precision is the average of the precision values obtained at
the ranks at which relevant documents are retrieved. It is a standard single-figure measure
of the accuracy of information retrieval models.
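
A short sketch computing the average precision of each query and the mean over queries (the ranked lists and relevance judgments are invented for illustration):

def average_precision(ranked_docs, relevant):
    # Average of the precision values at the ranks where relevant docs occur.
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: a list of (ranked_docs, relevant_set) pairs, one per query.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [(["d3", "d1", "d7"], {"d1", "d3"}),    # AP = (1/1 + 2/2) / 2 = 1.0
        (["d2", "d5", "d4"], {"d4"})]          # AP = (1/3) / 1 = 0.333...
print(round(mean_average_precision(runs), 3))  # 0.667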
10. Discuss the two ideas in impact ordering to lower the number of documents
for accumulate scores.

(1) when traversing the postings list for a query term t, we stop after considering only a
prefix of the postings list, either after a fixed number of documents have been seen or
after the weight wf(t, d) has dropped below a threshold;
(2) when accumulating scores in the outer loop, we consider the query terms in
decreasing order of idf, so that the query terms likely to contribute the most to the
final scores are considered first.

11. Define metadata.


Metadata is a set of data that provides information about other data.

12. Write Relationship between the value of F1 and the break-even point

The precision-recall break-even point (BEP) is the value at which precision equals recall.
Because F1 is the harmonic mean of precision and recall, when precision and recall are equal
F1 takes exactly that common value, so the F1 score at the break-even point equals the
break-even point itself.

13. Write short notes on vector space model.


Vector space model or term vector model is an algebraic model for representing text
documents (and any objects, in general) as vectors of identifiers (such as index
terms).

15. Define Accumulation.


In IR query processing, accumulation refers to collecting the partial score of each candidate
document in an accumulator as the postings lists of the query terms are traversed; the
accumulated scores are then used to rank the documents.
