
Information Retrieval Techniques

Unit - 1

1.List any four applications of Information retrieval.

 Blog search.
 Image retrieval.
 3D retrieval.
 Music retrieval.
 News search.
 Speech retrieval.
 Video retrieval.

2. Write about GREP command.


Grep is a useful command for searching for lines that match a pattern in a file; grep is short for
"global regular expression print". Whether you are a system administrator who needs to scrape
through log files or a developer trying to find certain occurrences in a code file,
grep is a powerful command to use.
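
As a rough sketch of what grep does (not the grep implementation itself; the file name and pattern below are just examples), the same behaviour can be mimicked in a few lines of Python with the re module:

import re
import sys

def grep(pattern, path):
    # Print every line of the file that matches the pattern,
    # roughly what `grep pattern path` does on the command line.
    regex = re.compile(pattern)
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if regex.search(line):
                print(f"{lineno}:{line}", end="")

if __name__ == "__main__":
    # e.g. python grep_sketch.py "error" server.log
    grep(sys.argv[1], sys.argv[2])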

3. Outline the purpose of Lemmatize.


The goal of lemmatization is to reduce a word to its root form, also called a lemma.
For example, the verb "running" would be reduced to "run." Lemmatization relies on
morphological (i.e., structural) and contextual analysis of words.
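
A small example using NLTK's WordNet lemmatizer (this assumes the nltk package and its WordNet data are installed; the sample words are illustrative):

from nltk.stem import WordNetLemmatizer

# Requires the WordNet data, e.g. nltk.download("wordnet") on first use.
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running", pos="v"))  # -> run
print(lemmatizer.lemmatize("mice"))              # -> mouse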

4. Mention the two important parts of an inverted index with example.


o Dictionary (the vocabulary of terms)
o Postings lists (for each term, the sorted list of docIDs of the documents in which it occurs); a small example is sketched below.
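
A minimal sketch of these two parts, using a Python dict as the dictionary and sorted lists of docIDs as the postings lists (the three toy documents are invented for illustration):

from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = defaultdict(list)           # term -> postings list of docIDs
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):  # each docID is added at most once per term
        index[term].append(doc_id)

print(sorted(index))                # the dictionary (vocabulary of terms)
print(index["sales"])               # postings list for "sales": [1, 2, 3]
print(index["july"])                # postings list for "july":  [2, 3]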
5. Define tokenization and tokenize the following document.

Tokenization is the process of chopping character streams into tokens; linguistic
preprocessing then deals with building equivalence classes of tokens, which are the
set of terms that are indexed.
6. Define Bag of words model.
The exact ordering of the terms in a document is ignored but the number of
occurrences of each term is material. A bag-of-words is a representation of text that
describes the occurrence of words within a document.
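
A quick bag-of-words sketch using collections.Counter, which discards word order and keeps only the number of occurrences of each term (the sample sentence is made up for illustration):

from collections import Counter

text = "to be or not to be"
bag_of_words = Counter(text.split())

print(bag_of_words)        # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
print(bag_of_words["be"])  # 2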

7. State the need for Query optimization.


Query optimization is used to access and modify the database in the most efficient
way possible. It is the art of obtaining necessary information in a predictable,
reliable, and timely manner.

8. Differentiate between normalization and case folding.


In IR, token normalization is the process of canonicalizing tokens so that matches occur
despite superficial differences in the character sequences of the tokens; for example,
U.S.A. and USA can be normalized to the same term.

Case-folding is a particular kind of normalization in which all letters are reduced to a
single case, most commonly lowercase, so that, for example, Automobile and automobile match.

9. Differentiate between type and token in IR document.


A token is an instance of a sequence of characters in some particular document that
are grouped together as a useful semantic unit for processing. A type is the class of
all tokens containing the same character sequence. A term is a (perhaps normalized)
type that is included in the IR system's dictionary.

10. List and define two effectiveness measures of an IR search results.


Precision and recall.

Precision is the fraction of retrieved documents that are relevant, and recall is the
fraction of relevant documents that are retrieved. Precision measures the accuracy of
the returned results, while recall measures their completeness.

11. Identify the possible issues in tokenization.


Issues include deciding what counts as a token: apostrophes (e.g., O'Neill, aren't),
hyphenation (e.g., co-education, Hewlett-Packard), and whitespace inside what is
conceptually one token (e.g., San Francisco); handling numbers, dates, and other special
formats; and language dependence, for instance compound nouns in German or languages
such as Chinese and Japanese that are written without spaces between words.
12. Write down the entries in the permuterm index dictionary that are
generated by the term mama.
mama$, ama$m, ma$ma, a$mam, $mama.
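
The entries are simply all rotations of the term with the end-of-word marker $ appended; a short sketch that generates them:

def permuterm_entries(term):
    # All rotations of term + '$' for a permuterm index.
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_entries("mama"))
# ['mama$', 'ama$m', 'ma$ma', 'a$mam', '$mama']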

13. Explain the use of proximity operators in extended Boolean retrieval models.
A proximity operator is a way of specifying that two terms in a query must occur
close to each other in a document, where closeness may be measured by limiting the
allowed number of intervening words or by reference to a structural unit such as a
sentence or paragraph.

14. Outline normalization and suggest what normalized form should be used for these words.

A) Because

B) Shouldn't or Shiite which is a branch of Islam

C) Continued

D) Hawaii

E) Karaoke or O'Rourke (Beto O'Rourke is the name of a famous American politician).

The forms above are chosen as the closest standard-English equivalents of the given words;
they are the most plausible normalized forms that can be derived from each hint.
Unit – 2

1. Write basic idea of BSBI algorithm


The blocked sort-based indexing (BSBI) algorithm (i) segments the collection into parts of
equal size, (ii) sorts the termID-docID pairs of each part in memory, (iii) stores
intermediate sorted results on disk, and (iv) merges all intermediate results into the final index.
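
A simplified sketch of the four steps, with in-memory lists standing in for the disk files (the block size and the heapq-based merge are illustrative choices, not the textbook implementation):

import heapq
from itertools import groupby

def bsbi(term_doc_pairs, block_size=4):
    # (i) + (ii): split the pairs into equal-size blocks and sort each block
    blocks = [sorted(term_doc_pairs[i:i + block_size])
              for i in range(0, len(term_doc_pairs), block_size)]
    # (iii): each sorted block stands in for an intermediate result on disk
    # (iv): merge all intermediate results into the final index
    merged = heapq.merge(*blocks)
    return {term: sorted({doc for _, doc in group})
            for term, group in groupby(merged, key=lambda p: p[0])}

pairs = [(2, 1), (1, 1), (2, 2), (3, 2), (1, 3), (3, 3), (2, 3)]
print(bsbi(pairs))   # {1: [1, 3], 2: [1, 2, 3], 3: [2, 3]}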

2. List two key ideas in constructing single-pass in-memory indexing.


In addition to constructing a new dictionary structure for each block and eliminating
the expensive sorting step, SPIMI has a third important component: compression.
Both the postings and the dictionary terms can be stored compactly on disk if we
employ compression.
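
A toy sketch of the SPIMI-Invert idea for one block: postings are added directly to a per-term list in a fresh dictionary (terms instead of termIDs, no global sort of pairs), and the terms are sorted only when the block is written out. The token stream below is invented for illustration:

from collections import defaultdict

def spimi_invert(token_stream):
    # token_stream yields (term, docID) pairs for one block
    dictionary = defaultdict(list)           # new dictionary for this block
    for term, doc_id in token_stream:
        postings = dictionary[term]          # created on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)          # add the posting directly to its list
    # sort the terms only when the block is written out
    return dict(sorted(dictionary.items()))

stream = [("new", 1), ("york", 1), ("new", 2), ("post", 2), ("los", 3)]
print(spimi_invert(stream))
# {'los': [3], 'new': [1, 2], 'post': [2], 'york': [1]}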

3. Write about two properties of gamma (γ) codes in posting file compression.
A γ code represents a gap G as a pair of length and offset: the offset is G in binary with
the leading 1 chopped off, and the length encodes the number of offset bits in unary. A γ
code is decoded by first reading the unary code up to the 0 that terminates it; for example,
the four bits 1110 when decoding 1110101 tell us the offset is 3 bits long (101), so G is
1101 in binary, i.e. 13. Two properties that make γ codes useful for compressing postings
are that they are prefix free (a concatenation of codes can be decoded unambiguously) and
parameter free (no parameters need to be estimated or stored for the collection).
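
A minimal sketch of γ encoding and decoding as bit strings, reproducing the 1110101 example for the gap 13:

def gamma_encode(gap):
    # Encode a positive integer gap as a gamma code bit string.
    offset = bin(gap)[3:]               # binary form with the leading 1 chopped off
    length = "1" * len(offset) + "0"    # length of the offset, in unary
    return length + offset

def gamma_decode(bits):
    # Decode one gamma code from the front of a bit string.
    n = bits.index("0")                 # number of leading 1s = offset length
    offset = bits[n + 1 : n + 1 + n]
    return int("1" + offset, 2), bits[n + 1 + n:]

print(gamma_encode(13))         # 1110101
print(gamma_decode("1110101"))  # (13, '')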

4. Define document frequency (df) and inverse document frequency (idf)


The document frequency df_t of a term t is the number of documents in the collection that
contain t. The inverse document frequency is idf_t = log(N / df_t), where N is the total
number of documents; it is high for rare terms and low for very frequent ones. Combined
with term frequency, it gives the TF-IDF weighting, a simple but intuitive way of
determining how relevant a word is to a given document.

5. Give the significance of Tf-Idf weighting for a term t in a document d.


Because TF-IDF gives the highest weights to the terms that are most characteristic of a
document (tf-idf(t, d) = tf(t, d) × idf(t)), the top-weighted terms can be taken as the
most important ones. This can be used to summarize articles more efficiently or simply
to determine keywords (or even tags) for a document.
6. Compute tf-idf for the following documents

Tf values

Each term occurs at most once in each document, so tf is 1 where a term appears in a
document and 0 otherwise.

Some terms appear in two documents, some appear in only one document. The total
number of documents is N = 3. Therefore, the idf values for the terms are:

angeles  log2(3/1) = 1.584
los      log2(3/1) = 1.584
new      log2(3/2) = 0.584
post     log2(3/1) = 1.584
times    log2(3/2) = 0.584
york     log2(3/2) = 0.584

TF-IDF

     angeles  los    new    post   times  york
d1   0        0      0.584  0      0.584  0.584
d2   0        0      0.584  1.584  0      0.584
d3   1.584    1.584  0      0      0.584  0
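
A short sketch that reproduces the tables above. The three documents are assumed to be d1 = "new york times", d2 = "new york post", and d3 = "los angeles times" (reconstructed from the idf values; they are not stated in the question as reproduced here). Note that log2(3) ≈ 1.585 and log2(3/2) ≈ 0.585 when rounded to three decimals; the table above truncates these to 1.584 and 0.584.

import math

docs = {"d1": "new york times".split(),
        "d2": "new york post".split(),
        "d3": "los angeles times".split()}
N = len(docs)

terms = sorted({t for words in docs.values() for t in words})
df = {t: sum(t in words for words in docs.values()) for t in terms}
idf = {t: math.log2(N / df[t]) for t in terms}

for d, words in docs.items():
    row = {t: round(words.count(t) * idf[t], 3) for t in terms}
    print(d, row)
# d1 {'angeles': 0.0, 'los': 0.0, 'new': 0.585, 'post': 0.0, 'times': 0.585, 'york': 0.585}
# d2 {'angeles': 0.0, 'los': 0.0, 'new': 0.585, 'post': 1.585, 'times': 0.0, 'york': 0.585}
# d3 {'angeles': 1.585, 'los': 1.585, 'new': 0.0, 'post': 0.0, 'times': 0.585, 'york': 0.0}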

7. State the need for dictionary compression.


The main goal of compressing the dictionary is to fit it in main memory, or at least a
large portion of it, in order to support high query throughput. Other reasons for wanting to
conserve memory are fast startup time and having to share resources with other
applications.
8. List the Parameters in calculating a weight for a document term or query
term.
 term frequency,
 inverse document frequency,
 document length.

9. List the characteristics of Hardware basics needed for IR system.


 Caching
 Seek time
 Buffering
10. Give the significance of SPIMI.
SPIMI uses terms instead of termIDs,
writes each block’s dictionary to disk, and then starts a new dictionary for the
next block. SPIMI can index collections of any size as long as there is enough
disk space available.
11. What are the advantages and disadvantages of query processing?
Advantages
The searching process is easy to understand; current information is available in the
storage database; users can access multiple databases and use multiple keywords/concepts
at the same time; and multiple users can be served at the same time.
Disadvantages
1) Expense. 2) Reduction in jobs. 3) Security breaches.
12. What is the difference between BSBI and SPIMI?
A difference between BSBI and SPIMI is that SPIMI adds each posting directly to its
postings list as the collection is processed, whereas BSBI first collects all
termID-docID pairs and then sorts them.
14. List the advantages of index construction.
The main advantage of this method is its simplicity: during index construction, we can
simply assign successive integers to each new document when it is first encountered.
15. Estimate the space usage of the Reuters dictionary with blocks of size k = 8 and
k = 16 in blocked dictionary storage.
Solution:
k = 8: we save (8 - 1) × 3 = 21 bytes of term pointers per block but need an additional
k = 8 bytes for term lengths, so space is reduced by 13 bytes per 8-term block.
Total space saved = 400,000 × 13 / 8 = 0.65 MB, so the total space is 7.6 - 0.65 = 6.95 MB.
k = 16: we save (16 - 1) × 3 = 45 bytes of term pointers per block but need an additional
k = 16 bytes for term lengths, so space is reduced by 29 bytes per 16-term block.
Total space saved = 400,000 × 29 / 16 = 0.725 MB, so the total space is 7.6 - 0.725 = 6.875 MB.
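
The same arithmetic as a small sketch (the 400,000 terms, 3 bytes per term pointer, one extra length byte per term, and 7.6 MB baseline are taken from the worked example above):

def blocked_dictionary_space(k, num_terms=400_000, base_mb=7.6, ptr_bytes=3):
    # One term pointer is saved for k-1 of every k terms,
    # at the cost of one length byte per term.
    saved_per_block = (k - 1) * ptr_bytes - k * 1    # bytes saved per block
    saved_mb = num_terms * saved_per_block / k / 1_000_000
    return saved_mb, round(base_mb - saved_mb, 3)

print(blocked_dictionary_space(k=8))    # (0.65, 6.95)
print(blocked_dictionary_space(k=16))   # (0.725, 6.875)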

UNIT -3

1.List any four standard test collections for information retrieval system
evaluation.
 Cranfield
 TREC (Text Retrieval Conference)
 GOV2
 NTCIR
 CLEF (Cross-Language Evaluation Forum)
 Reuters
 Newsgroups

2. Identify the three important things to measure ad hoc IR effectiveness.

1. A document collection
2. A test suite of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment of
either relevant or nonrelevant for each query-document pair.
3. Define interpolated precision.
The interpolated precision at a recall level r is defined as the highest precision found
for any recall level r' ≥ r.

4. Define precision and recall.

Precision is the fraction of retrieved documents that are relevant: P = tp / (tp + fp).
Recall is the fraction of relevant documents that are retrieved: R = tp / (tp + fn).
Precision can be seen as a measure of quality, and recall as a measure of quantity.

5. Define Kappa statistics.


The Kappa statistic, or Cohen's kappa, is a statistical measure of inter-rater reliability
for categorical variables; in IR evaluation it is used to measure the agreement between
relevance judges.

6. Translate information given below into a query


Drinking lemon tea is more effective at reducing your risk of heart attacks than
drinking coffee.
The user task: it all starts with the user converting the information need into a query. In an
information retrieval system, a collection of words is used to convey the semantics
of the information that is requested, whereas in a data retrieval system a query
phrase is used to convey the constraints that the objects must satisfy. For example, the
statement above might be translated into the keyword query: lemon tea coffee heart attack risk.

7. Define Cosine Similarity.


Cosine similarity measures the similarity between two vectors of an inner product
space. It is measured by the cosine of the angle between two vectors and determines
whether two vectors are pointing in roughly the same direction. It is often used to
measure document similarity in text analysis.
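
A small sketch of cosine similarity between two term-weight vectors (the vectors are toy values chosen for illustration):

import math

def cosine_similarity(u, v):
    # Cosine of the angle between vectors u and v.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = [1, 1, 1, 0]   # toy term weights for one document
d2 = [1, 1, 0, 1]   # toy term weights for another
print(round(cosine_similarity(d1, d2), 3))   # 0.667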

8. Define MapReduce function.


MapReduce is a Java-based, distributed execution framework within the Apache
Hadoop Ecosystem. It takes away the complexity of distributed programming by
exposing two processing steps that developers implement: 1) Map and 2) Reduce.
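
MapReduce itself runs on a Hadoop cluster, but the two processing steps can be illustrated with a plain-Python word-count sketch (no Hadoop involved; the documents are invented):

from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in text.split()]

def reduce_phase(word, counts):
    # Reduce: sum all the counts emitted for one word.
    return word, sum(counts)

docs = {1: "new york times", 2: "new york post"}

grouped = defaultdict(list)          # the shuffle step: group mapped pairs by key
for doc_id, text in docs.items():
    for word, count in map_phase(doc_id, text):
        grouped[word].append(count)

print(dict(reduce_phase(w, c) for w, c in grouped.items()))
# {'new': 2, 'york': 2, 'times': 1, 'post': 1}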

9. State Mean Average Precision.


Mean average precision (MAP) is the mean, over a set of queries, of the average precision
of each query, where average precision is the average of the precision values obtained at
the ranks at which relevant documents are retrieved. It is a standard single-figure measure
of the accuracy of information retrieval models.
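
A short sketch computing the average precision of each query and the mean over queries (the ranked lists and relevance judgments are invented for illustration):

def average_precision(ranked_docs, relevant):
    # Average of the precision values at the ranks where relevant docs occur.
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: a list of (ranked_docs, relevant_set) pairs, one per query.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [(["d3", "d1", "d7"], {"d1", "d3"}),    # AP = (1/1 + 2/2) / 2 = 1.0
        (["d2", "d5", "d4"], {"d4"})]          # AP = (1/3) / 1 = 0.333...
print(round(mean_average_precision(runs), 3))  # 0.667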
10. Discuss the two ideas in impact ordering to lower the number of documents
for accumulate scores.

(1) when traversing the postings list for a query term t, we stop after considering only a
prefix of the postings list, either after a fixed number of documents have been seen or
after the weight wf(t, d) has dropped below a threshold;
(2) when accumulating scores in the outer loop, we consider the query terms in
decreasing order of idf, so that the query terms likely to contribute the most to the
final scores are considered first.

11. Define metadata.


Metadata is a set of data that provides information about other data.

12. Write Relationship between the value of F1 and the break-even point

The precision-recall break-even point (BEP) is the value at which precision equals recall.
Because F1 is the harmonic mean of precision and recall, when precision and recall are equal
F1 takes exactly that common value, so the F1 score at the break-even point equals the
break-even point itself.

13. Write short notes on vector space model.


Vector space model or term vector model is an algebraic model for representing text
documents (and any objects, in general) as vectors of identifiers (such as index
terms).

15. Define Accumulation.


In IR query processing, accumulation refers to collecting the partial score of each candidate
document in an accumulator as the postings lists of the query terms are traversed; the
accumulated scores are then used to rank the documents.
