Introduction to Information Retrieval

Syllabus : Basic Concepts of IR, Data Retrieval & Information Retrieval, Text mining and IR relation, IR system block diagram, Automatic Text Analysis : Luhn's ideas, Conflation Algorithm, Indexing and Index Term Weighting, Probabilistic Indexing, Automatic Classification, Measures of Association, Different Matching Coefficients, Cluster Hypothesis, Clustering Techniques : Rocchio's Algorithm, Single pass algorithm, Single Link algorithm.
1.1 Basic Concepts of IR

Q. Explain the basic concept of Information Retrieval.
* This is the information era. We handle vast amounts of information. The purpose of maintaining such a huge amount of information is that when we need any information, we should get it as early as possible. We require speedy and accurate access whenever we need it.
* One method to get relevant information is to read all the documents and then decide which are the relevant and which are the non-relevant documents. This is the manual method.
* The second method is the automatic method, in which we store all the information in a computer and ask it to find the relevant information. Information retrieval handles the representation, storage, organization and access to information items. The representation of information should require less space. The organization of information should be such that the system requires less time to access the items of information which satisfy user needs.
1.2 Data Retrieval and Information Retrieval
Q. Differentiate between data retrieval and information retrieval.
* Data retrieval is mainly concerned with determining which documents contain the specific words in the query.
* In information retrieval, the user is interested in information relevant to the query.
* Table 1.2.1 gives the comparison of data retrieval and information retrieval.

Table 1.2.1 : Data Retrieval vs. Information Retrieval

Sr. No. | Parameter           | Data Retrieval (DR) | Information Retrieval (IR)
1.      | Matching            | Exact match         | Partial match, best match
2.      | Inference           | Deduction           | Induction
3.      | Model               | Deterministic       | Probabilistic
4.      | Classification      | Monothetic          | Polythetic
5.      | Query language      | Artificial          | Natural
6.      | Query specification | Complete            | Incomplete
7.      | Items wanted        | Matching            | Relevant
8.      | Error response      | Sensitive           | Insensitive
1. Matching
* In data retrieval, we normally search for an exact match, e.g. whether a file contains a particular word or not.
* In information retrieval, we normally find documents which partially match the request and then select the best ones out of them.
2. Inference
* The inference used in data retrieval is deductive, e.g. if a → b and b → c then a → c.
* In Information Retrieval we follow inductive inference, i.e. relations are specified with a degree of certainty or uncertainty.
3. Model
* As Data Retrieval uses deductive inference, deterministic models can be helpful for finding the relevant documents.
* As Information Retrieval uses inductive inference, we can use models which give a probabilistic output.
4. Classification
* In Data Retrieval we are interested in monothetic classification, i.e. one with classes defined by objects possessing attributes, both necessary and sufficient, to belong to a class.
* In Information Retrieval monothetic classification is not required.
* In Information Retrieval polythetic classification gets used.
* In such a classification each individual in a class will possess only a proportion of all the attributes possessed by members of that class. Hence no attribute is either necessary or sufficient for membership of a class.
5. Query language
* The query language which is used in Data Retrieval is mostly artificial, with restricted syntax and vocabulary, e.g. we can write a query in SQL in its fixed format with fixed words.
* In Information Retrieval, query will have no restriction related to syntax or vocabulary.
User can provide a query in natural language format. The Information Retrieval system should be able to
handle such queries.
6. Query specification
* As Data Retrieval finds exact match and the query follows a restricted format, the query must be complete.
User should provide the exact query for his interest.
* In Information Retrieval, the user can use natural language to specify the query and hence it is possible that the query may be incomplete, e.g. the user may not follow the standard grammar of the language. The Information Retrieval system can handle such queries.
7. Items wanted
* In Data Retrieval, the user specifies the exact query and hence the list of items will contain those items which exactly match the query.
* In Information Retrieval, the query gets specified in natural language and the models used for finding the items are probabilistic, so the system will find items which are relevant to the query. The user then decides on the best one from the listed output.
8. Error response
* In Data Retrieval, the query must be complete with proper syntax. Hence if there is any error while specifying the query, the meaning will be totally different and we may get wrong items.
* In Information Retrieval, the query gets specified in natural language, hence some relaxation can be handled by the system.
1.2.1 Text Mining and IR Relation
* Information Retrieval is related to text, images, audio, video or object oriented types of information. IR deals with the efficient storage of the information and various methods of searching the information based on the user's interest. Handling textual information is a subdomain of IR.
* IR has more to do with search engines, where we have a large amount of information and, based on the user's requirement, specific information is extracted from the collection. IR is a hybrid topic which combines machine learning techniques and Natural Language Processing techniques. Nowadays, the main focus of IR is on search engines.
1.3 Information Retrieval System : Block Diagram

Q. Draw and explain IR system block diagram.
Q. Draw IR system block diagram.
Q. What is a document representative? Explain with a suitable example.
* An information retrieval system deals with the representation, storage, organization and access of information.
* Fig. 1.3.1 shows the block diagram of a typical Information Retrieval system.
* The input for this system is the set of documents which contain the information, and the query given by the user. The Information Retrieval system will find the list of documents which are relevant to the query.

Fig. 1.3.1 : Information Retrieval system block diagram (input documents and query, feedback loop, output list of relevant documents)
* The main problem here is to obtain a representation of each document and query suitable for the computer to use. In the first step the documents are converted into their representation. The documents in their natural language form are one of the representations. But if we store these documents in natural language format, the space required gets increased, and the time to retrieve the items which are relevant to the query is also large.
* Hence most computer based retrieval systems store only a representation of the document. A document representative could be a list of extracted words considered to be significant. These words are called keywords. The text of a document is lost once it has been converted into its document representation.
* Fig. 1.3.2 shows the logical view of the document. A full text can be a document representative, a list of keywords (index terms) can be a document representative, or any intermediate status of the document can be the document representative.

Fig. 1.3.2 : Logical view of the document (from the structure and full text, through text operations such as handling accents and spacing and grouping, to indexing and index terms)
* As the document gets converted into its internal representation, the queries given by the user are also converted in the same fashion.
* The Information Retrieval system will internally find the relevant documents compared with the query, and the output, i.e. the list of relevant documents, will be provided to the user.
* The user can stop here or, if the user wants to refine the query, he can provide feedback based on which the query gets modified.
* The same process will be done with the modified query, and finally the output, i.e. the list of relevant documents, is given to the user.
"Automatic Text Analysis
trieval systems are of two types, one is manual retrieval system and another is automatic retrieval
Mam ve dacssing about automat reveal system
—— trioval system computer searches for the relevant documents related to the given query.
Ben, fe ed formation Retrieval system can actually operate the documents to retrieve information, the
BRP ci be stored ince th co the documents is in its natural format i.e. text
documents must be stored inside the computer one way to store
ee disadvantage of this method is requires more space in memory to store the documents. And the second is
Which searching for the query relevant documents, system require more time.
Jution for this problem is to find the document representative for each document. It can be a title, an abstract,
jon for this problem is
ei of words from that document. Mostly the lists of words from the documents are used as the document
ora li r 3
representative.
The words are those words from documents which contain the semantic of the document. These words are called
= s
as keywords.
1.5 Luhn's Ideas

Q. Explain Luhn's idea for understanding the context of the document.
Q. Explain Luhn's idea in detail.

* A document can be represented as a list of words. But here the question arises : which words can be picked from the documents to create the document representative? Luhn has given the basic idea for this selection.
* Luhn states that the frequency of word occurrence in an article furnishes a useful measurement of word significance. The relative position within a sentence of words having given values of significance furnishes a useful measurement for determining the significance of sentences.
* The significance factor of a sentence will therefore be based on a combination of these two measurements.
* In short, he states that the frequency of the words can be used to extract words and sentences to represent a document. Luhn used Zipf's law for stating the idea. Zipf's law states that the product of the frequency of use of words and the rank order is approximately constant.
* Luhn has specified the following idea. Let,
  f : the frequency of occurrence of various word types in a given piece of text.
  r : the rank order of these words, i.e. the order of their frequency of occurrence.
  Then plot a graph relating f and r, which looks like a hyperbolic curve.
* Here we are interested in finding the significant words from the document.
* Fig. 1.5.1 shows a plot of the hyperbolic curve relating f, the frequency of occurrence, and r, the rank order. Luhn has stated two cut-offs : an upper cut-off and a lower cut-off.
* The upper cut-off is used for excluding common words. The words whose frequency is greater than the upper cut-off are the common words. These words do not contain the semantics of the documents. Hence these words are not considered in the list.
* The words having less frequency as compared to the lower cut-off are the rare words. Hence they get discarded.
* Thus, the words whose frequency values lie in the range between the upper and lower cut-offs are considered as the significant words. These words become part of the document representative.

Fig. 1.5.1 : Luhn's idea (frequency of words plotted against words by rank order, with the upper and lower cut-offs bounding the region of highest resolving power of significant words)
* There is no thumb rule for deciding the upper and lower cut-offs. They have to be established by trial and error.
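The word-selection step can be sketched in a few lines of Python. This is only an illustration of Luhn's idea, assuming whitespace tokenisation and hand-picked cut-off values (the function name and cut-off values are assumptions, not taken from the text):

```python
# A minimal sketch of Luhn's significant-word selection.
from collections import Counter

def significant_words(text, upper_cutoff=50, lower_cutoff=2):
    """Return words whose frequency lies between the lower and upper cut-offs."""
    freq = Counter(text.lower().split())   # f : frequency of each word type
    ranked = freq.most_common()            # words ordered by rank r (most frequent first)
    return [w for w, f in ranked if lower_cutoff <= f <= upper_cutoff]

# Usage: words above the upper cut-off (common words) and below the lower
# cut-off (rare words) are discarded; the rest are treated as significant.
# doc_text = open("some_document.txt").read()
# print(significant_words(doc_text, upper_cutoff=100, lower_cutoff=3))
```

In practice the two cut-offs would be tuned against the collection by trial and error, as noted above.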
1.6 Conflation Algorithm

Q. How to generate the document representatives using the conflation algorithm?
Q. Explain the steps in the conflation algorithm using a suitable example.
Q. List and explain the steps of the conflation algorithm.
Q. You are developing a text processing system for use in an automatic retrieval system. Explain the following parts :
   (i) Removal of high frequency words
   (ii) Suffix stripping
   (iii) Detecting equivalent stems.
* The conflation algorithm is the method to find the document representative. It is based on Luhn's idea.
* The algorithm consists of three parts, as shown in Fig. 1.6.1 :
  1. Removal of high frequency words
  2. Suffix stripping
  3. Detecting equivalent stems

Fig. 1.6.1 : Parts of the Conflation algorithm
Removal of high frequency words
* The removal of high frequency words, i.e. 'stop' words or 'fluff' words, is one way of implementing Luhn's upper cut-off. This is done by comparing each word from the document with a list of high frequency words. High frequency words are those words which occur many times in the text.
* These words do not contain the meaning or semantics of the text. For e.g. 'is', 'am', 'I', 'are', 'the', etc. are some of these words.
* The advantage of this process is that not only are the non-significant words removed, but the size of the document can also be reduced by 30 to 50%.
Suffix stripping
* The second step is suffix stripping. In this step, each word from the output of the first step is handled. If a word has a suffix, the suffix gets removed and the word is converted into its original form.
* For e.g. if the word is 'killed' it will be converted into 'kill'. Other examples are :

  Original word | Word in original form
  1. Processes  | Process
  2. Repeated   | Repeat
  3. Kidding    | Kid
  4. National   | Nation
* Unfortunately, context free removal leads to a significant error rate. For e.g. we may want UAL removed from FACTUAL but not from EQUAL.
* To avoid erroneously removing suffixes, some rules can be followed. For e.g. :
  1. The length of the remaining stem must exceed a given number; the default is usually 2.
  2. The stem-ending must satisfy a certain condition, e.g. it does not end with Q.
* For removing the suffixes, the rules of grammar of the language can be used. For the English language, Porter's algorithm is one of the algorithms which helps in the removal of suffixes.
* The process is called stemming. An advantage of stemming is that it reduces the size of the text. However, too much stemming is not practical and annoys users.
Detecting equivalent stems
* After suffix stripping, we will have a list of words. Only one occurrence of each word is kept in the list. For e.g. if the two words 'processing' and 'processed' get converted into 'process', only one occurrence of 'process' will be part of the list. Each such word is called a 'stem'.
* If two words have the same underlying stem then they refer to the same concept and should be indexed as such. This is obviously an over-simplification, since words with the same stem, such as NEUTRON and NEUTRALISE, sometimes need to be distinguished.
* The final output from a conflation algorithm is a set of classes, one for each stem detected. A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
* A document representative then becomes a list of class names. These are referred to as the document index terms or keywords.
* Queries are also treated in the same way. Thus each query is converted into a query representative.
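The three steps can be illustrated with a minimal Python sketch. The stop list and suffix rules below are tiny, made-up examples; a real system would use a full stop list and a proper stemmer such as Porter's algorithm:

```python
# A minimal sketch of the three conflation steps.
STOP_WORDS = {"is", "am", "i", "are", "the", "a", "an", "and", "of", "to"}  # step 1
SUFFIXES = ["ing", "ed", "es", "s", "al"]                                   # step 2 (illustrative only)

def strip_suffix(word, min_stem_len=2):
    """Remove one suffix, respecting the stem-length and stem-ending rules above."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) > min_stem_len and not stem.endswith("q"):
            return stem
    return word

def conflate(text):
    words = [w for w in text.lower().split() if w not in STOP_WORDS]  # removal of high frequency words
    stems = [strip_suffix(w) for w in words]                          # suffix stripping
    return sorted(set(stems))                                         # step 3: one class per equivalent stem

print(conflate("The user is processing and processed the repeated national data"))
# -> ['data', 'nation', 'process', 'repeat', 'user']
```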
1.7 Indexing and Index Term Weighting

1.7.1 Indexing
* During 'Indexing', documents are prepared for use by the Information Retrieval system.
* This means preparing the raw document collection into an easily accessible representation of documents. This transformation from a document text into a representation of the text is known as indexing the documents.
* Transforming a document into an indexed form involves the use of :
  o A library or set of regular expressions
  o Parsers
  o A library of stop words (a stop list)
  o Other miscellaneous filters
* The conflation algorithm is used for converting a document into its representation, i.e. indexing. Each element, i.e. word, of the index language is referred to as an index term.
* An index language may be pre-coordinate or post-coordinate. In pre-coordinate indexing, the terms are coordinated at the time of indexing. A logical combination of any index terms may be used as a label to identify a class of documents.
* In post-coordinate indexing, the terms are coordinated at the time of searching. The same class would be identified at search time by combining the classes of documents labelled with the individual index terms.
* The vocabulary of an index language may be controlled or uncontrolled. The former refers to a list of approved index terms that an indexer may use.
* The controls on the language may also include hierarchic relationships between the index terms. Or, one may insist that certain terms can only be used as adjectives. There is really no limit to the kind of syntactic controls one may put on a language.
* The index language which comes out of the conflation algorithm is uncontrolled, post-coordinate and derived. The vocabulary of index terms at any stage in the evolution of the document collection is just the set of all conflation class names.
1.7.2 Index Term Weighting

Q. Describe index term weighting.

* Weighting is the final stage in most Information Retrieval indexing applications. For each index term a weight value gets assigned which indicates the significance of that index term w.r.t. the document.
* The two important factors governing the effectiveness of an index language are the exhaustivity of indexing and the specificity of the index language.
Two important factors governing the effectiveness of an index language :
1. Indexing exhaustivity
2. Language specificity

Fig. 1.7.1 : Factors governing the effectiveness of an index language
Indexing exhaustivity
* It is defined as the number of different topics indexed. A high level of exhaustivity of indexing leads to high recall and low precision.
* A low level of exhaustivity leads to low recall and high precision. In short, exhaustivity is the number of index terms assigned to the document.
Language specificity
* It is the ability of the index language to describe topics precisely. It is the level of precision with which a document is actually indexed.
* A high level of specificity leads to high precision and low recall. Specificity means the number of documents to which a given term is assigned in a given collection.
* Refer to Luhn's idea. He has mentioned the discrimination power of index terms as a function of the rank order of their frequency of occurrence. The highest discrimination power is associated with the middle frequencies.
* Considering this idea, each index term is assigned a weight value which indicates the significance of the index term in the document. The frequency of occurrence of the index term within the document can be used as the weight value.
* Another way is based on the index term distribution in the entire collection. Let,
  N = total number of documents
  n = the number of documents in which an index term occurs
  Then the index term can be weighted inversely by the number of documents in which it occurs, i.e. its weight is taken proportional to log(N/n).
* If we compare the two methods, the document frequency weighting places emphasis on content description, whereas the second method, i.e. weighting by specificity, attempts to stress the ability of terms to discriminate one document from another.
* Salton and Yang combined both the methods of weighting, considering inter-document and intra-document frequencies. They examined both the total frequency of occurrence of a term and its distribution over the documents, i.e. how many times it occurs in each document. They concluded many things, like :
  1. A term with a high total frequency of occurrence is not very useful in retrieval, irrespective of its distribution.
  2. Middle frequency terms are most useful, particularly if the distribution is skewed.
  3. Rare terms with a skewed distribution are likely to be useful, but less so than the middle frequency ones.
  4. Very rare terms are also quite useful but come bottom of the list, except for the ones with a high total frequency.
* A 'good' index term is one which, when assigned as an index term to a collection of documents, renders the documents as dissimilar as possible, whereas a 'bad' index term is one which renders the documents more similar.
* This is quantified through a term discrimination value which, for a particular term, measures the increase or decrease in the average dissimilarity between documents on the removal of that term. Therefore, a good term is one which on removal from the collection of documents leads to a decrease in average dissimilarity, whereas a bad term is one which leads on removal to an increase.
* The idea is that a greater separation between documents will enhance retrieval effectiveness, whereas a lesser separation will depress retrieval effectiveness.
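A minimal sketch of such a weighting scheme, combining the within-document frequency with the collection-wide log(N/n) factor discussed above, might look as follows (the function name and toy documents are illustrative assumptions, not part of the text):

```python
# A minimal sketch of index term weighting over documents already reduced
# to lists of index terms (e.g. by the conflation algorithm above).
import math
from collections import Counter

def term_weights(documents):
    """documents: list of lists of index terms -> list of {term: weight} dicts."""
    N = len(documents)                        # total number of documents
    doc_freq = Counter()                      # n: number of documents containing each term
    for doc in documents:
        doc_freq.update(set(doc))

    weighted = []
    for doc in documents:
        tf = Counter(doc)                     # within-document frequency
        weighted.append({t: tf[t] * math.log(N / doc_freq[t]) for t in tf})
    return weighted

docs = [["information", "retrieval", "index"],
        ["index", "term", "weight", "index"],
        ["cluster", "retrieval"]]
# Terms occurring in every document get weight 0; rarer terms get higher weights.
print(term_weights(docs))
```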
1.8 Probabilistic Indexing

* Probabilistic indexing is based on the probability model for Information Retrieval.
* This model considers the difference in the distributional behaviour of words as a guide to whether a word should be assigned as an index term.
* The statistical behaviour of 'speciality' words is different from that of 'function' words.
* Function words are closely modelled by a Poisson distribution over all documents, whereas speciality words do not follow a Poisson distribution.
* Let ω be a function word over a set of texts, and n the number of occurrences of the word ω. Then f(n), the probability that a text will have n occurrences of the function word ω, is given by

  f(n) = e^(-x) x^n / n!

* Here x will vary from word to word and, for a given word, should be proportional to the length of the text. We can interpret x as the mean number of occurrences of ω in the set of texts.
* 'Speciality' words are content bearing, whereas function words are not. A word randomly distributed according to a Poisson distribution is not informative about the document in which it occurs.
* A word which does not follow a Poisson distribution is assumed to indicate that it conveys information about what a document is about.
* For e.g. WAR is a speciality word; it occurs in the documents relevant to it, whereas 'FOR' is a function word, which can be randomly distributed.
* This model also assumes that a document can be about a word to some degree. A document collection can be broken up into subsets, each subset being made up of documents that are about a given word to the same degree.
* A content-bearing word is a word that distinguishes more than one class of documents w.r.t. the extent to which the topic referred to by the word is treated in the documents in each class.
* These are candidates for index terms. These content bearing words can be mechanically detected by measuring the extent to which their distributions deviate from that expected under a Poisson process.
* In this model the status of one of these content words within a subset of documents of the same 'aboutness' is one of non-content-bearing, that is, within the given subset it does not discriminate between further subsets.
* The assumptions based on which a word can be considered as an index term for the document are :
  o The probability that a document will be found relevant to a request for information on a subject is a function of the relative extent to which the topic is treated in the document.
  o The number of tokens (occurrences of the word) in a document is a function of the extent to which the subject referred to by the word is treated in the document.
* The indexing rule based on these assumptions indexes a document with the word ω if the probability exceeds some cost function.
* If there are only two subsets differing in the extent to which they are about a word ω, then the distribution of ω can be described by a mixture of two Poisson distributions. Thus,

  f(n) = P1 · e^(-x1) x1^n / n! + (1 − P1) · e^(-x2) x2^n / n!

* Here, P1 is the probability of a random document belonging to one of the subsets, and x1 and x2 are the mean occurrences of ω in the two classes.
* This model is called the 2-Poisson model. It describes the statistical behaviour of a content-bearing word over two classes which are 'about' that word to different extents. These classes are not necessarily the relevant and non-relevant documents, although by assumption (1) we can calculate the probability of relevance for any document from one of these classes.
* It is the ratio

  P1 e^(-x1) x1^k / [ P1 e^(-x1) x1^k + (1 − P1) e^(-x2) x2^k ]

  that is used to make the decision whether to assign an index term ω that occurs k times in a document.
* This ratio is in fact the probability that the particular document belongs to the class which treats ω to an average extent of x1, given that it contains exactly k occurrences of ω.
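The indexing decision can be illustrated with a short Python sketch of the 2-Poisson ratio above. The parameter values are made up purely for illustration; in practice P1, x1 and x2 would be estimated from the collection:

```python
# A minimal sketch of the 2-Poisson indexing decision.
import math

def poisson(n, x):
    """Probability of n occurrences under a Poisson distribution with mean x."""
    return math.exp(-x) * x ** n / math.factorial(n)

def prob_treats_word(k, p1, x1, x2):
    """Probability that a document with k occurrences of the word belongs to the
    class that treats the word to an average extent x1 (the ratio in the text)."""
    a = p1 * poisson(k, x1)
    b = (1 - p1) * poisson(k, x2)
    return a / (a + b)

# Assign the word as an index term when this probability exceeds a chosen threshold.
for k in range(6):
    print(k, round(prob_treats_word(k, p1=0.3, x1=3.0, x2=0.5), 3))
```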
1.9 Automatic Classification

* Classification is the process of categorizing the given documents into different groups. Here we make subsets of the given objects.
* There are two main areas of application of classification methods in Information Retrieval, as shown in Fig. 1.9.1 :
  (1) Keyword clustering
  (2) Document clustering

Fig. 1.9.1 : Areas of application of classification methods in IR
1. Keyword clustering
* Many automatic retrieval systems rely on thesauri to modify queries and document representatives to improve the chance of retrieving relevant documents. In practice many of these thesauri are constructed manually.
* They have mainly been constructed in two ways :
  1. Words which are deemed to be about the same topic are linked.
  2. Words which are deemed to be about related things are linked.
2. Document clustering
* Document clustering is to group the documents in the collection. The purpose may be to group the documents in such a way that retrieval will be faster, or alternatively it may be to construct a thesaurus automatically. Whatever the purpose, the 'goodness' of the classification can finally only be measured by its performance during retrieval.
* Considering the collection, the given documents will be divided into different groups (or subsets). Each group will be considered as a single cluster.
* A document can become part of a particular cluster if it is logically related to the other members of the cluster. Thus a single cluster will contain all those documents which are semantically related to each other.
* The purpose of clustering is to increase the speed of searching. In practice it is not possible to match each analysed document with each analysed search request, because the time consumed by such an operation would be excessive.
* The clustering process will have the following steps :
  o For the given collection, using some algorithm, the documents will be divided into different groups. Each group or cluster will contain the semantically related documents.
  o For each cluster, one cluster representative will be decided. The cluster representative may be such a document which is semantically near to all other documents of that cluster.
  o When the user fires a query, it will first be matched with the cluster representatives. If the query has a relation with a cluster representative, then it indicates that the documents which are members of that particular cluster may be part of the answer set with respect to the query.
  o If there is no match, the cluster is not handled further.
  o Once there is a match between the query and a cluster representative, each document from that cluster is checked for a match with the query. The documents which are logically near to the query become part of the answer set.
* Thus, using clustering, the searching is done at two different levels :
  1. Level 1 : Comparing the query and the cluster representatives.
  2. Level 2 : Comparing the query and the actual documents.
* In clustering, documents are grouped because they are related to one another in some sense; basically, they are grouped because they are likely to be wanted together, and a logical relationship among the documents is the means of measuring this likelihood. One way of quantifying this likelihood is through a measure of closeness between the documents.
1.10 Measures of Association

Q. List, with definitions, different measures of association.

* To distribute the objects into different groups, the relationship between each pair of documents is considered.
* The relationship will indicate whether a particular document is semantically related to the second document, compared with the other documents, or not.
* The relationship between the documents can be expressed using three different methods, as shown in Fig. 1.10.1 :
  1. Similarity
  2. Association
  3. Dissimilarity

Fig. 1.10.1 : Methods of expressing relationships between documents
1. Similarity
   The similarity value indicates how near two documents or objects are to each other.
2. Association
   Association is the same as similarity, but the difference is that the objects which are considered for comparison are objects characterized by discrete-state attributes.
3. Dissimilarity
   Dissimilarity values show how far apart the objects are. Thus, the similarity value indicates the likeness of two objects. If someone wants to find the group of documents which are similar to each other, the similarity value can be considered.
* For the information retrieval system, we are interested in finding subsets of the given documents. Documents in the collection are described using lists of index terms.
* Here, for information retrieval systems, each pair of documents is considered. Two documents will be similar to each other if they have a larger number of common index terms.
* If two documents have a smaller number of common index terms then obviously they will be semantically far from each other. Hence we will not include such documents in the same group.
* To define the relation between the objects, various measures of association are defined. The value of the measure will be higher if two objects have more similar attributes.
* It follows that a cluster method depending only on the rank-ordering of the association values would give identical clusterings for all these measures.
* There are five commonly used measures of association. As we are using these measures for an information retrieval system, we should consider the representation of the document inside the system.
* Here the assumption is that each document is represented by a list of keywords. A query is also represented by a list of keywords. Each list of keywords is considered as one set.
* Thus the terms which are assumed here are :
  1. X : the set of index terms related to a document, e.g. Document 1.
  2. Y : the set of index terms related to a document, e.g. Document 2.
  3. |X| : the number of index terms present in Document 1.
  4. |Y| : the number of index terms present in Document 2.
  5. ∩ : intersection of two sets.
For example :
X : The list of index terms related to Document 1.
  1. Bat   2. Ball   3. Stump   4. Pen   5. Pencil
  6. Night   7. Dog   8. Cat   9. Coat   10. Fur
Y : The list of index terms related to Document 2.
  1. Pencil   2. Paper   3. Rubber   4. Cat   5. Mouse
  6. Book   7. Face   8. Nose   9. Heart   10. Dark
Here I am considering only nouns. In a real scenario the actual index terms may have the original representation of nouns, verbs, etc.
Thus,
|X| = 10
|Y| = 10
* Now we will define the different measures of association.

1.11 Different Matching Coefficients

Q. Describe different matching coefficients.
Q. Write a short note on matching coefficients.

1. Simple Matching Coefficient
2. Dissimilarity Coefficients

Fig. 1.11.1 : Different matching coefficients
1.11.1 Simple Matching Coefficient

* It is the number of shared index terms. Thus, we can calculate the simple matching coefficient as |X ∩ Y|.
* This method does not consider the sizes of X and Y.
* In our example, the common terms are :
  1. Pencil
  2. Cat
* Thus, the value of the simple matching coefficient is 2.
1. Dice's coefficient
   |X ∩ Y| / (|X| + |Y|)
   It is the number of shared index terms divided by the sum of the sizes of both sets X and Y.

2. Jaccard's coefficient
   |X ∩ Y| / |X ∪ Y|
   It is calculated as the number of shared index terms divided by the size of the union of set X and set Y.

3. Cosine coefficient
   |X ∩ Y| / (|X|^(1/2) × |Y|^(1/2))
   The cosine coefficient is calculated as the number of common index terms divided by the product of the square roots of the sizes of set X and set Y (i.e. their geometric mean).

4. Overlap coefficient
   |X ∩ Y| / min(|X|, |Y|)
   The overlap coefficient is calculated as the number of common index terms divided by the size of whichever of the sets X and Y has fewer entries.

* The Dice's, Jaccard's, cosine and overlap coefficients are normalized versions of the simple matching coefficient. The values of these coefficients range from 0 to 1.
* It is necessary that the values of the coefficients be normalized. The following example presents the importance of normalized values.
  1. S1(X, Y) = |X ∩ Y| : simple matching coefficient, which is not normalized.
  2. S2(X, Y) = 2|X ∩ Y| / (|X| + |Y|) : normalized coefficient.

Case 1
Let,
  1. |X| = 1
  2. |Y| = 1 and
  3. |X ∩ Y| = 1
Then, S1(X, Y) = 1
      S2(X, Y) = 2 × 1 / (1 + 1) = 1

Case 2
Let,
  1. |X| = 10
  2. |Y| = 10
  3. |X ∩ Y| = 1
Then, S1(X, Y) = 1
      S2(X, Y) = 2 × 1 / (10 + 10) = 1/10

In the first case, both the coefficients have the same value, i.e. 1, which indicates that there is an exact match. But in the second case, even though there is only a single common term present in both the sets X and Y, coefficient S1 still has the value 1, which doesn't reflect any difference between case 1 and case 2, whereas the value of the S2 coefficient is 1/10, which gives a comparatively realistic picture.
Ex. 1.11.1 : Let,
Document 1 = {CPU, Keyboard, RAM, VGA, SMPS, USB, CD-ROM, Printer}
Document 2 = {CPU, VGA, Simulator, OS, Video, USB, Printer, Scanner, Compiler}
Find the similarity between the two documents using the different matching coefficients.
Soln. :
Let
X = Document 1 = {CPU, Keyboard, RAM, VGA, SMPS, USB, CD-ROM, Printer} and
Y = Document 2 = {CPU, VGA, Simulator, OS, Video, USB, Printer, Scanner, Compiler}
Then (X ∩ Y) = {CPU, VGA, USB, Printer} and
(X ∪ Y) = {CPU, Keyboard, RAM, VGA, SMPS, USB, CD-ROM, Printer, Simulator, OS, Video, Scanner, Compiler}
Hence,
|X| = 8 and |Y| = 9
|X ∩ Y| = 4
|X ∪ Y| = 13
Following are the similarity coefficients :
(i)   Simple matching coefficient = |X ∩ Y| = 4
(ii)  Dice's coefficient = |X ∩ Y| / (|X| + |Y|) = 4 / (8 + 9) = 4/17 = 0.2353
(iii) Jaccard's coefficient = |X ∩ Y| / |X ∪ Y| = 4/13 = 0.3077
(iv)  Cosine coefficient = |X ∩ Y| / (|X|^(1/2) × |Y|^(1/2)) = 4 / (√8 × √9) = 0.4714
(v)   Overlap coefficient = |X ∩ Y| / min(|X|, |Y|) = 4/8 = 0.5
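The same coefficients can be verified with a short Python sketch, using the definitions exactly as given in this section (note that Dice's coefficient is often defined elsewhere with a factor of 2 in the numerator):

```python
# A minimal sketch recomputing the matching coefficients of Ex. 1.11.1.
import math

def coefficients(X, Y):
    X, Y = set(X), set(Y)
    common = len(X & Y)
    return {
        "simple":  common,
        "dice":    common / (len(X) + len(Y)),
        "jaccard": common / len(X | Y),
        "cosine":  common / math.sqrt(len(X) * len(Y)),
        "overlap": common / min(len(X), len(Y)),
    }

doc1 = {"CPU", "Keyboard", "RAM", "VGA", "SMPS", "USB", "CD-ROM", "Printer"}
doc2 = {"CPU", "VGA", "Simulator", "OS", "Video", "USB", "Printer", "Scanner", "Compiler"}
print(coefficients(doc1, doc2))
# -> simple 4, dice 0.235, jaccard 0.308, cosine 0.471, overlap 0.5
```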
1.11.2 Dissimilarity Coefficients

Q. Explain the properties of dissimilarity coefficients used in information retrieval.
* The coefficients explained in the previous section are based on similarity values.
* There are also some coefficients defined which are based on dissimilarity values between the documents.

Properties of dissimilarity coefficients
* Any dissimilarity function D can be transformed into a similarity function by a simple transformation of the form S = 1 / (1 + D), but the reverse is not always true.
* If P is the set of objects to be clustered, a pairwise dissimilarity coefficient D is a function from P × P to the non-negative real numbers.
D (i.e. the dissimilarity coefficient) satisfies the following conditions :
  1. D(X, Y) ≥ 0 for all X, Y ∈ P
  2. D(X, X) = 0 for all X ∈ P
  3. D(X, Y) = D(Y, X) for X, Y ∈ P
  4. D(X, Y) ≤ D(X, Z) + D(Y, Z)

Fig. 1.11.2 : Conditions for D
1. D(X, Y) ≥ 0 for all X, Y ∈ P
   This says that the dissimilarity coefficient should be non-negative.
2. D(X, X) = 0 for all X ∈ P
   If we find the dissimilarity value by comparing the same document with itself, the dissimilarity coefficient should have the value 0, because there is an exact match.
3. D(X, Y) = D(Y, X) for X, Y ∈ P
   The dissimilarity value should not depend on the sequence in which we handle the documents. Thus the dissimilarity coefficient must be the same between two documents, irrespective of the order in which they are handled.
4. D(X, Y) ≤ D(X, Z) + D(Y, Z)
   This is based on the triangle inequality from Euclidean geometry, which states that the sum of the lengths of two sides of a triangle is always greater than the length of the third side.
Dissimilarity coefficient examples
* Examples of dissimilarity coefficients which satisfy the above conditions are :

1. D(X, Y) = |X Δ Y| / (|X| + |Y|)

   where |X Δ Y| = |X ∪ Y| − |X ∩ Y| is the symmetric difference of the sets X and Y.
   This coefficient is simply related to Dice's coefficient by

   |X Δ Y| / (|X| + |Y|) = 1 − 2 |X ∩ Y| / (|X| + |Y|)

   The same coefficient can be represented in another form. If each document is represented as a binary string, where the absence or presence of the i-th keyword is indicated by a zero or one in the i-th position, the above dissimilarity coefficient can be written as

   ( Σ x_i (1 − y_i) + Σ y_i (1 − x_i) ) / ( Σ x_i + Σ y_i )

   where the summation is over the total number of different keywords in the document collection.
2. Document representatives can also be considered as binary vectors embedded in an n-dimensional Euclidean space, where n = total number of index terms. Thus

   |X ∩ Y| / (|X|^(1/2) × |Y|^(1/2))

   can then be interpreted as the cosine of the angular separation of the two binary vectors X and Y,

   cos θ = (X, Y) / (||X|| · ||Y||)

   where (X, Y) = inner product and ||·|| = length of a vector (see Fig. 1.11.3). If the space is Euclidean, then for X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) we get

   cos θ = Σ x_i y_i / ( (Σ x_i²)^(1/2) × (Σ y_i²)^(1/2) )

Fig. 1.11.3 : Angular separation of two document vectors
3. Expected mutual information measure
   A measure of association can also be defined based on a probabilistic model. It measures the association between two objects by the extent to which their distributions deviate from stochastic independence. For two discrete probability distributions P(x_i) and P(x_j), the expected mutual information measure can be defined as follows :

   I(x_i, x_j) = Σ over values of x_i, x_j of  P(x_i, x_j) log [ P(x_i, x_j) / ( P(x_i) P(x_j) ) ]

   Properties of the function :
   i.   When x_i and x_j are independent, P(x_i, x_j) = P(x_i) P(x_j), so I(x_i, x_j) = 0.
   ii.  I(x_i, x_j) = I(x_j, x_i), which shows it is symmetric.
   iii. It is invariant under one-to-one transformations of the coordinates.
   iv.  I(x_i, x_j) is often interpreted as a measure of the statistical information contained in x_i about x_j.

   When we apply this function to measure the association between two index terms, say i and j, then x_i and x_j are binary variables. Thus P(x_i = 1) will be the probability of occurrence of term i and similarly P(x_i = 0) will be the probability of its non-occurrence. The extent to which two index terms i and j are associated is then measured by I(x_i, x_j) : it measures the extent to which their distributions deviate from stochastic independence (a small computational sketch is given at the end of this section).
4. Information radius (dissimilarity between two classes of objects)
   The dissimilarity between two classes of objects can also be defined with this kind of function. For example, we can discriminate two classes on the basis of their probability distributions over a simple two point space {1, 0}.
   Let,
   P_1(1), P_1(0) : probability distribution associated with class I.
   P_2(1), P_2(0) : probability distribution associated with class II.
   On the basis of the difference between them, we measure the dissimilarity between class I and class II by the information radius :

   R = u P_1(1) log [ P_1(1) / (u P_1(1) + v P_2(1)) ] + v P_2(1) log [ P_2(1) / (u P_1(1) + v P_2(1)) ]
     + u P_1(0) log [ P_1(0) / (u P_1(0) + v P_2(0)) ] + v P_2(0) log [ P_2(0) / (u P_1(0) + v P_2(0)) ]

   Here, u and v are positive weights adding to unity.
   Properties
   Under a suitable interpretation, the expected mutual information measure is a special case of the information radius.
   Let P_1 and P_2 be the two conditional distributions P(x / w_1) and P(x / w_2), and let
   u = P(w_1)
   v = P(w_2)
   Then for the information radius we get
   P(x) = P(x / w_1) P(w_1) + P(x / w_2) P(w_2)
   i.e. the mixture appearing in the denominators becomes the marginal distribution P(x), and we arrive at the expected mutual information measure I(x, w).
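A minimal sketch of the expected mutual information measure defined above, estimated for two index terms from a tiny, made-up binary document-term incidence matrix:

```python
# A minimal sketch of the expected mutual information measure (EMIM)
# between two index terms i and j over a set of documents.
import math

def emim(occ_i, occ_j):
    """occ_i, occ_j: lists of 0/1 flags, one per document (term present or not)."""
    N = len(occ_i)
    total = 0.0
    for a in (0, 1):                      # value of x_i
        for b in (0, 1):                  # value of x_j
            p_ab = sum(1 for x, y in zip(occ_i, occ_j) if x == a and y == b) / N
            p_a = sum(1 for x in occ_i if x == a) / N
            p_b = sum(1 for y in occ_j if y == b) / N
            if p_ab > 0:                  # zero-probability cells contribute nothing
                total += p_ab * math.log(p_ab / (p_a * p_b))
    return total

# The two terms below tend to occur in the same documents, so I(x_i, x_j) > 0;
# independent terms would give a value close to 0.
term_i = [1, 1, 0, 0, 1, 0]
term_j = [1, 1, 0, 0, 0, 0]
print(emim(term_i, term_j))
```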
1.12 Cluster Hypothesis

* Cluster hypothesis : Closely related documents tend to be relevant to the same requests.
* A basic assumption in retrieval systems is that documents relevant to a request are separated from those which are non-relevant.
* The relevant documents are more like one another than like the non-relevant documents.
* This can be tested as follows : compute the association between all pairs of documents,
  (a) both of which are relevant to a request, and
  (b) one of which is relevant and the other non-relevant.
* Based on a set of requests, the relative distributions of relevant-relevant (R-R) and relevant-non-relevant (R-N-R) associations of a collection can be defined (a small sketch of this test is given at the end of this section).
* Plotting the relative frequency against the strength of association for two hypothetical collections X and Y, we get distributions as shown in Fig. 1.12.1.
* In Fig. 1.12.1, R-R is the distribution of relevant-relevant associations and R-N-R is the distribution of relevant-non-relevant associations. From the graph we can conclude that :
  (a) The separation for collection X is good, while for Y it is poor.
  (b) The strength of association between relevant documents is greater for X than for Y.
* A linear search ignores the relationships that exist between documents. Hence structuring a collection in such a way that relevant documents will be part of one class will speed up the process of retrieval of the documents.

Fig. 1.12.1 : Relative frequency vs. strength of association of R-R and R-N-R pairs for collections X and Y
* The searching will be more effective, since the classes will contain only relevant documents and no non-relevant documents.
* The cluster hypothesis is based on the document descriptions. Hence the objects should be described in such a way that we can increase the distance between the two distributions R-R and R-N-R.
* We want to make it more likely that we will retrieve relevant documents and less likely that we will retrieve non-relevant ones.
* Thus, the cluster hypothesis is a convenient way of expressing the aim of such operations as document clustering. It does not say anything about how the separation is to be exploited.
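A minimal sketch of the test described above: for a single request, collect the associations of relevant-relevant pairs and of relevant/non-relevant pairs so that the two distributions can be compared. The similarity measure and the toy data are placeholder assumptions:

```python
# A minimal sketch of the cluster hypothesis test for one request.
from itertools import combinations

def association_distributions(docs, relevant_ids, similar):
    """docs: {doc_id: set of index terms}; relevant_ids: ids relevant to the request."""
    rr, rnr = [], []
    for a, b in combinations(docs, 2):
        s = similar(docs[a], docs[b])
        if a in relevant_ids and b in relevant_ids:
            rr.append(s)                   # relevant - relevant pair
        elif a in relevant_ids or b in relevant_ids:
            rnr.append(s)                  # relevant - non-relevant pair
    return rr, rnr

dice = lambda x, y: 2 * len(x & y) / (len(x) + len(y))
docs = {1: {"cluster", "retrieval"}, 2: {"cluster", "index"},
        3: {"poisson", "weight"}, 4: {"cluster", "retrieval", "index"}}
rr, rnr = association_distributions(docs, relevant_ids={1, 2, 4}, similar=dice)
print(rr, rnr)   # the hypothesis holds if R-R values tend to exceed R-N-R values
```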
1.12.1 Clustering in Information Retrieval

* Cluster analysis is a statistical technique used to generate a category structure which fits a set of observations. The groups which are formed should have a high degree of association between members of the same group and a low degree between members of different groups.
* Cluster analysis can be performed on documents in several ways :
  (i) Documents may be clustered on the basis of the terms that they contain. The aim of this approach is to provide more efficient and more effective retrieval.
  (ii) Documents may be clustered based on co-occurring citations in order to provide insights into the nature of the literature of a field.
  (iii) Terms may be clustered on the basis of the documents in which they co-occur. This is useful in the construction of a thesaurus or in the enhancement of queries.
* Although cluster analysis can be easily implemented with available software packages, it may have some problems, like :
  o Selecting the attributes on which items are to be clustered and their representation.
  o Selecting an appropriate clustering method and similarity measure from those available.
  o Creating clusters or cluster hierarchies, which can be expensive in terms of computational resources.
  o Assessing the validity of the results obtained.
  o If the collection to be clustered is dynamic, the requirements for update must be considered.
  o If the aim is to use the clustered collection as the basis of information retrieval, a method for searching the clusters or cluster hierarchy must be selected.
Criteria for choosing a clustering method
While choosing the clustering method, two criteria have been used :
1. Theoretical soundness
2. Efficiency

Fig. 1.12.2 : Criteria for choosing a clustering method
1) Theoretical soundness
The clustering method should satisfy some constraints like :
(a) The method produces a clustering which is unlikely to be altered drastically when further objects are incorporated, i.e. it is stable under growth.
(b) The method is stable in the sense that small errors in the description of the objects lead to small changes in the clustering.
(c) The method is independent of the initial ordering of the objects.
2) Efficiency
The method should be efficient in terms of speed requirement and storage requirement.
1.13 Clustering Algorithm

* Clustering methods are usually categorized according to the type of cluster they produce. Thus, the clustering methods can be categorized as :
  i. Hierarchical methods
  ii. Non-hierarchical methods
* Hierarchical methods produce the output as an ordered list of clusters, whereas non-hierarchical methods produce unordered lists.
* Another categorization of the methods is :
  i. Methods producing exclusive clusters
  ii. Methods producing overlapping clusters
* Here are some definitions related to clustering methods. While discussing the different algorithms we will use these terms.
1.13.1 Definitions
1. Cluster : A cluster is an ordered list of objects which have some common characteristics.
2. Distance between two clusters : The distance between two clusters involves some or all elements of the two
clusters. The clustering method determines how the distance should be computed.
3. Similarity : A similarity measure SIMILAR(d_i, d_j) can be used to represent the similarity between two documents. Typically, similarity is a normalized value which ranges from 0 to 1. The similarity is 0 for documents exhibiting no agreement among the assigned index terms and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement.
4. Threshold : The lowest possible input value of similarity required to join two objects in one cluster.
5. Similarity matrix : The similarities between objects, calculated by the function SIMILAR(d_i, d_j) and represented in the form of a matrix, form a similarity matrix.
6. Dissimilarity coefficient : The dissimilarity of two clusters is defined to be the distance between them. The smaller
the value of dissimilarity coefficient, the more similar two clusters are.
7. Cluster representative (seed) : The representative of the cluster. Every incoming object's similarity is compared with the cluster representative. A clustering method can have predetermined parameters like :
  1. The number of clusters desired.
  2. A minimum and maximum size for each cluster.
  3. A threshold value on the matching function, below which an object will not be included in a cluster.
  4. The control of overlap between clusters.
  5. An arbitrarily chosen objective function which is optimized.
Now, let us discuss different clustering algorithms.
1.14 Rocchio's Algorithm

* Rocchio developed a clustering algorithm in 1966. It was developed on the SMART project.
* Several parameters which are defined as the input for this algorithm are as follows :
  * Minimum and maximum number of documents per cluster.
  * A lower bound on the correlation between an item and a cluster, below which an item will not be placed in the cluster. This is a threshold that is used in the final clean-up phase for unclustered items.
  * A similarity coefficient.
* The algorithm operates in three stages.
* Stage 1 : The algorithm selects (by some criterion) a number of objects as cluster centres. The remaining objects are assigned to the centres or to a temporary 'rag-bag' cluster that holds the objects not close enough to any centre.
1.15 Single Pass Algorithm

(Worked example, continued. Threshold = 0.89; objects 1, 2 and 3 have each started their own cluster, so C1 = {1}, C2 = {2}, C3 = {3}.)

Object 4
a. Check similarity (1, 4) > threshold : 0.9 > 0.89
Object 4 becomes part of cluster 1.
C1 = {1, 4}
C2 = {2}
C3 = {3}
As the new element is added to the cluster, the cluster representative is calculated again. In this example there is no change in the cluster representative.

Object 5
a. Check similarity (1, 5) > threshold : 0.9 > 0.89
As the new element is added to cluster 1, the cluster representative is calculated again. Based on the similarity values of objects 1, 4 and 5, all objects are equidistant, hence there is no change in the cluster representative.

Object 6
a. Check similarity (1, 6) < threshold : 0.5 < 0.89
b. Check similarity (2, 6) < threshold : 0.5 < 0.89
c. Check similarity (3, 6) > threshold : 0.9 > 0.89
Hence, object 6 becomes part of cluster 3.
C1 = {1, 4, 5}
C2 = {2}
C3 = {3, 6}
Here again, for C3, objects 3 and 6 are equidistant, hence no change in the cluster representative.
Thus the output of the single pass algorithm is as follows :
C1 = {1, 4, 5}
C2 = {2}
C3 = {3, 6}

Fig. 1.15.1 : Clusters formed by the single pass algorithm
* When the clustering method is exclusive, once an object becomes part of a single cluster, handling of that object is stopped.
* When the clustering method is overlapping, each object is compared with the cluster representatives of every cluster.

1. Object 1
   Since no cluster is present, create a new cluster whose cluster representative is object 1 itself.
   C1 = {1}
2. Object 2
   a. Compare similarity (1, 2) < threshold. Create a new cluster.
   C1 = {1}
   C2 = {2}
3. Object 3
   a. Compare similarity (1, 3) < threshold : 0.6 < 0.89
   b. Compare similarity (2, 3) < threshold : 0.8 < 0.89
   Create a new cluster.
   C1 = {1}
   C2 = {2}
   C3 = {3}
4. Object 4
   a. Compare similarity (1, 4) > threshold : 0.9 > 0.89
      Hence object 4 can become part of cluster 1. As the method is overlapping, go on checking object 4's similarity value with all the cluster representatives.
   b. Compare similarity (2, 4) > threshold : 0.9 > 0.89
      Object 4 can become part of cluster C2.
   c. Compare similarity (3, 4) < threshold : 0.7 < 0.89
   C1 = {1, 4}
   C2 = {2, 4}
   C3 = {3}
5. Object 5
   a. Compare similarity (1, 5) > threshold : 0.9 > 0.89
   b. Compare similarity (2, 5) < threshold : 0.6 < 0.89
   c. Compare similarity (3, 5) < threshold : 0.6 < 0.89
   C1 = {1, 4, 5}
   C2 = {2, 4}
   C3 = {3}
6. Object 6
   a. Compare similarity (1, 6) < threshold : 0.5 < 0.89
   b. Compare similarity (2, 6) < threshold : 0.5 < 0.89
   c. Compare similarity (3, 6) > threshold : 0.9 > 0.89
   C1 = {1, 4, 5}
   C2 = {2, 4}
   C3 = {3, 6}

Advantage of the Single Pass Algorithm : It is simple to implement.
Disadvantage of the Single Pass Algorithm : The output depends on the sequence in which the objects are handled.
Introduction to Information Retrieval
\ formation Storage and Retrieval (SPPU) 1-29
jnformatio
¥ “a1 cwsters the dacuments using single pass clustering algorithm for the folowing example. Threshold value s 10.
mast:
' ‘Terms In document
1 2 13 ™ 15
o ° 1
Ee Doc 4 1 2
i Doc 2 3 1 2 3 o
: Doc 3 3 ° o © 4
Ep Doc 4 2 1 0 3 0
E Doo § 2 2 1 5 4
; Saas
Soln. :
Step 1 : Start with Doc 1. As initially no cluster is present, Doc 1 introduces Cluster 1, i.e. C1. Hence,
C1 = {Doc 1} ; Centroid of C1 : <1, 2, 0, 0, 1>

Step 2 : Now we need to make a decision for Doc 2. Either it can become part of the first cluster or it can introduce a new cluster.
For making the decision, we need to find the similarity between the centroid of the first cluster and Doc 2. We will use the dot product for simplicity.
Centroid of C1 : <1, 2, 0, 0, 1>
Doc 2 : <3, 1, 2, 3, 0>
SIM(Doc 2, C1) = 1*3 + 2*1 + 0*2 + 0*3 + 1*0 = 5
Now compare the threshold value and SIM(Doc 2, C1) : 10 > 5
Hence Doc 2 can't become part of the first cluster, and a new cluster is introduced.
C1 = {Doc 1} ; Centroid of C1 : <1, 2, 0, 0, 1>
C2 = {Doc 2} ; Centroid of C2 : <3, 1, 2, 3, 0>
Step 3 : Make a decision for Doc 3.
Doc 3 : <3, 0, 0, 0, 1>
Centroid of C1 : <1, 2, 0, 0, 1>
SIM(Doc 3, C1) = 3*1 + 0*2 + 0*0 + 0*0 + 1*1 = 4
10 > 4, hence Doc 3 can't be part of C1.
Now, check whether Doc 3 can become part of C2. Hence,
Doc 3 : <3, 0, 0, 0, 1>
Centroid of C2 : <3, 1, 2, 3, 0>
SIM(Doc 3, C2) = 3*3 + 0*1 + 0*2 + 0*3 + 1*0 = 9
Threshold 10 > 9, hence Doc 3 can't become part of cluster 2 either. Hence, introduce a new cluster,
C3 = {Doc 3}
C1 = {Doc 1} ; Centroid of C1 : <1, 2, 0, 0, 1>
C2 = {Doc 2} ; Centroid of C2 : <3, 1, 2, 3, 0>
C3 = {Doc 3} ; Centroid of C3 : <3, 0, 0, 0, 1>
Step 4 : Now make a decision for Doc 4 : <2, 1, 0, 3, 0>
SIM(Doc 4, C1) = 2*1 + 1*2 + 0*0 + 3*0 + 0*1 = 4
Threshold 10 > 4 ; Doc 4 cannot become part of C1.
SIM(Doc 4, C2) = 2*3 + 1*1 + 0*2 + 3*3 + 0*0 = 16
Threshold 10 < 16, hence Doc 4 will become part of C2.
C2 = {Doc 2, Doc 4}
As a new document is included in the cluster, the centroid of the cluster is recalculated, which is the average of Doc 2 and Doc 4 :
Doc 2 : <3, 1, 2, 3, 0>
Doc 4 : <2, 1, 0, 3, 0>
Centroid of C2 = <5/2, 2/2, 2/2, 6/2, 0/2> = <2.5, 1, 1, 3, 0>
Thus the clusters available are :
C1 = {Doc 1} ; Centroid of C1 : <1, 2, 0, 0, 1>
C2 = {Doc 2, Doc 4} ; Centroid of C2 : <2.5, 1, 1, 3, 0>
C3 = {Doc 3} ; Centroid of C3 : <3, 0, 0, 0, 1>
Step 5 : Finally we need to find where Doc 5 will fit.
Doc 5 : <2, 2, 1, 5, 1>
SIM(Doc 5, C1) = 1*2 + 2*2 + 0*1 + 0*5 + 1*1 = 7
Threshold 10 > 7, hence Doc 5 cannot be part of C1.
SIM(Doc 5, C2) = 2.5*2 + 1*2 + 1*1 + 3*5 + 0*1 = 5 + 2 + 1 + 15 + 0 = 23
Threshold 10 < 23, hence Doc 5 becomes part of C2.
Thus, C2 = {Doc 2, Doc 4, Doc 5}
The centroid of C2 will be recalculated as the average of Doc 2, Doc 4 and Doc 5 :
Doc 2 : <3, 1, 2, 3, 0>
Doc 4 : <2, 1, 0, 3, 0>
Doc 5 : <2, 2, 1, 5, 1>
Centroid of C2 = <7/3, 4/3, 1, 11/3, 1/3>
Thus the final clusters are :
C1 = {Doc 1}, C2 = {Doc 2, Doc 4, Doc 5}, C3 = {Doc 3}
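The whole procedure of this example can be expressed as a short Python sketch of single pass (exclusive) clustering, using dot-product similarity against the cluster centroid and a threshold of 10; the data structure and function names are illustrative assumptions:

```python
# A minimal sketch of the single pass (exclusive) clustering used in the example.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def single_pass(docs, threshold):
    clusters = []                                 # each cluster: {"members": [...], "vectors": [...]}
    for doc_id, vec in docs:
        for c in clusters:
            if dot(vec, centroid(c["vectors"])) >= threshold:
                c["members"].append(doc_id)       # join the first matching cluster
                c["vectors"].append(vec)          # centroid is recomputed on demand
                break
        else:                                     # no cluster matched: start a new one
            clusters.append({"members": [doc_id], "vectors": [vec]})
    return clusters

docs = [("Doc1", [1, 2, 0, 0, 1]), ("Doc2", [3, 1, 2, 3, 0]), ("Doc3", [3, 0, 0, 0, 1]),
        ("Doc4", [2, 1, 0, 3, 0]), ("Doc5", [2, 2, 1, 5, 1])]

for c in single_pass(docs, threshold=10):
    print(c["members"], [round(x, 2) for x in centroid(c["vectors"])])
# -> ['Doc1'], ['Doc2', 'Doc4', 'Doc5'], ['Doc3']  (matching the worked example)
```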