Identifying Term Importance
(Term Weighting)
Chapter Three
(assigning importance levels to index terms)
Objectives
Understand why query terms and document terms
are weighted in modern information retrieval
Describe the various methods for weighting terms
Discuss the problems with the existing term
weighting methods in information retrieval
Write programs to implement the various term
weighting schemes
Motivation
The indexing process so far has generated a set of
natural language index terms as a representation
of the text
Although these terms belong to the general class of
content words, they are not equally important with
respect to the content of the text
Not all terms are equally capable of conveying the
semantics of the document/item
Cont…
Thus assigning an importance indicator to each term, i.e.,
term weighting, is important.
It enhances the performance of an information
retrieval system by improving precision.
But what do you think is the impact of it on recall?
Parameters playing an important
role in weight computation are
The index term itself
The text to be indexed: length of text and number of
different terms in the text
The relationship between the index term and the
text to be indexed: frequency of term in a text
The relationship between the index term and the document
collection (its frequency in the whole collection)
Cont…
Remark:
Most weighting functions rely on the distribution pattern of
the terms in the text to be indexed and/or in a reference
collection and use statistics to compute the weights.
The weight of an index term is usually a numerical value.
Term weights have a value of zero or above; in the case of
normalized weights they vary between 0 and 1,
with values closer to one indicating very important index
terms and
values closer to zero very weak terms.
Term weighting is a crucial part of an automatic information
retrieval system.
Cont…
In conventional retrieval systems,
a term is either used to identify a given item (in
which case it is assumed to carry weight 1) or it is
not (in which case it is assumed to carry a weight
of 0).
Such a system is called a binary system and has
proved to be of limited importance.
For example no distinction is made within the set of
retrieved documents.
That is, all retrieved documents are considered to be
equally important for the query
Cont…
Term weighting based systems (e.g., systems that
use statistical weighting schemes) are designed to
overcome these shortcomings
This is done by assigning numerical values to each of the
index terms in a query or a document, reflecting their
relative importance.
A term with a high weight is assumed to be very
relevant to the document or query
A term with low weight on the other hand indicates
little relevance to the content of the document or
query
Cont…
These importance indicator values (weights) can then be used to
define a function, which measures the similarity or closeness
between query and documents
This in turn makes it possible to retrieve documents in the
decreasing order of query-document similarity, the most similar
(presumably relevant) being retrieved first
There are two difficulties in this respect
The difficulty in deciding how to allocate the weights
The difficulty in actually calculating the values which are to
be used (computation intensive)
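To make the ranking idea above concrete, here is a rough sketch in Python (all names and weight values are illustrative, not from the slides): documents are scored by a simple dot product between their term weights and the query term weights, then sorted in decreasing order of that score.

```python
# Illustrative sketch: rank documents by a dot-product similarity between
# document term weights and query term weights.

def score(doc_weights, query_weights):
    """Dot product over the terms shared by the document and the query."""
    return sum(w * query_weights[t] for t, w in doc_weights.items() if t in query_weights)

def rank(docs, query_weights):
    """Return document ids sorted by decreasing query-document similarity."""
    return sorted(docs, key=lambda d: score(docs[d], query_weights), reverse=True)

# Toy data: two documents and one query, with made-up weights.
docs = {
    "d1": {"retrieval": 0.8, "index": 0.3},
    "d2": {"retrieval": 0.2, "weather": 0.9},
}
query = {"retrieval": 1.0, "index": 0.5}
print(rank(docs, query))  # ['d1', 'd2']
```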
Some of the term weighting schemes (or
functions or methods) suggested are:
The term frequency (tf) weights
The inverse document frequency (IDF, or collection
frequency) weights
The composite weight (tf*idf)
The signal-noise ratio
Early proposal to term weighting
It is instructive to consider an early proposal for term weighting
(the assignment of importance indicators). It follows these basic
steps (a code sketch follows the list):
Calculate the frequency of each unique term in each document of a
given collection of n documents; the frequency of term k in
document i is denoted FREQik
Determine the total collection frequency TOTFREQk for each term by
summing its frequencies across all n documents:
TOTFREQk = ∑i FREQik
Arrange the words in decreasing order according to their collection
frequency. Decide on some threshold value and remove all words
with a collection frequency above this threshold.
Do the same for low-frequency words.
The remaining medium-frequency words are now used for
assignment to the documents as index terms
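A minimal sketch of this early procedure, assuming simple whitespace tokenization; the cutoff values and the function name are illustrative choices, not part of the original proposal.

```python
# Sketch of the early (Luhn-style) procedure: count FREQ_ik per document,
# sum to TOTFREQ_k, then keep only medium-frequency terms.
from collections import Counter

def medium_frequency_terms(documents, high_cutoff=50, low_cutoff=2):
    """Keep terms whose total collection frequency lies between the two cutoffs."""
    freq = [Counter(doc.lower().split()) for doc in documents]  # FREQ_ik
    totfreq = Counter()                                         # TOTFREQ_k
    for f in freq:
        totfreq.update(f)
    return {t for t, c in totfreq.items() if low_cutoff <= c <= high_cutoff}

docs = ["information retrieval ranks documents",
        "retrieval of relevant documents",
        "the the the the"]
print(medium_frequency_terms(docs, high_cutoff=3, low_cutoff=2))
# e.g. {'retrieval', 'documents'}
```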
Term frequency (tf) weights
Motivation:
The frequency of occurrence of a term is a useful
indication of its relative importance in describing
(defining) a document
In other words, term importance is related to its
frequency of occurrence.
If term A is mentioned more than term B, then the
document is more about A than about B
(assuming A and B to be content bearing terms)
Cont…
Supporting idea:
“Authors tend to repeat certain words as they advance or
vary the argument on an aspect of the subject”
One such measure assumes that the value, importance, or
weight, of a term assigned to a document is simply
proportional to the term frequency (i.e., the frequency of
occurrence of that particular term in that particular document)
Thus the assumption here is that,
The more frequently a term occurs in a document the
more likely it is to be of value in describing the
content of the document.
Do you agree with this?
Cont…
Accordingly, the weight of term k in document i,
denoted by wik, might be determined by
wik = FREQik
Where, FREQik is the frequency of term k in document i
Cont…
Remarks:
It is a simple count of the number of occurrences of a
term in a particular document (or query).
It is a measure of term density in a document.
Experiments have shown that, despite its weaknesses, this
method gives better results than Boolean (binary) systems.
The basic idea is to differentiate terms within a document
Problems with Term frequency (tf)
weights
Such a weighting system sometimes does not perform as
expected, especially in cases where the high frequency words
are equally distributed throughout the collection
Since it does not take into account the role of term k in any
document other than document i, it does not consider the
importance of term k in the collection.
This simple measure is not normalized to account for variances
in the length of documents
A one-page document with 10 mentions of A is “more about
A” than a 100-page document with 20 mentions of A
Used alone, it favors common words and long documents.
How?
Solutions to the Problems with
Term frequency (tf) weights
Divide each frequency count by the length of the
document, in terms of words (length Normalization)
Divide each frequency count by the maximum frequency
count of any term in the document (frequency
normalization).
In this case the normalized frequency fij is used
instead of FREQij
Cont…
The normalized tf is given by
fij = FREQij / maxm(FREQim)
Where
fij is the normalized frequency of term j in document i
maxm(FREQim) is the maximum frequency of any term in document i
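A small sketch of the two normalizations in Python (the function name and scheme labels are mine):

```python
# Term-frequency weights with maximum-frequency or length normalization.
from collections import Counter

def tf_weights(doc_tokens, scheme="max"):
    freq = Counter(doc_tokens)            # FREQ_ij
    if scheme == "max":                   # divide by the largest count in the document
        denom = max(freq.values())
    elif scheme == "length":              # divide by the document length in words
        denom = len(doc_tokens)
    else:                                 # raw counts
        denom = 1
    return {t: c / denom for t, c in freq.items()}

tokens = "term weighting weights index terms by term frequency".split()
print(tf_weights(tokens, "max"))     # 'term' -> 1.0, all other words -> 0.5
print(tf_weights(tokens, "length"))  # 'term' -> 0.25, all other words -> 0.125
```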
Inverse Document Frequency (IDF)
weights
Also called collection frequency weighting,
introduced by Karen Spärck Jones, another key figure in
Information Retrieval.
According to this measure, the importance of a term in a
document is measured (weighted) by the number of
documents in a collection that contain the term.
The basic idea here is to differentiate terms in queries.
Accordingly the assumption is:
If a term occurs in many of the documents in the
collection then it doesn’t serve well as a document
identifier and should be given a low importance (weight)
as a potential index term.
Cont…
Assuming that term k occurs in at least one
document (dk ≠ 0), a possible measure of the inverse
document frequency is defined by
wk = log2(N / dk)
Where,
N is the total number of documents in the collection
dk the number of documents in which k occurs
Wk is the weight assigned to term k
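A short sketch of computing wk = log2(N / dk) over a toy collection (the helper names are illustrative):

```python
# IDF weights: w_k = log2(N / d_k), where d_k is the number of documents
# that contain term k.
import math
from collections import Counter

def idf_weights(documents):
    N = len(documents)
    d = Counter()                                   # d_k for every term k
    for doc in documents:
        d.update(set(doc.lower().split()))
    return {t: math.log2(N / dk) for t, dk in d.items()}

docs = ["holiday season travel", "holiday prices", "rainy season", "rainy day"]
print(idf_weights(docs))
# holiday, season, rainy -> log2(4/2) = 1.0; travel, prices, day -> log2(4/1) = 2.0
```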
Cont…
That is, the weight of a term in a document is the logarithm of the
number of documents in the collection divided by the number of
documents in the collection that contain the term (with 2 as the
base of the logarithm)
The log is used to make the values of tf and idf comparable.
It can be interpreted as the amount of information associated with
term ki
IDF measures the rarity of a term across the whole document collection.
According to this measure, i.e., IDF, The more a term t occurs
throughout all documents, the more poorly that term t
discriminates between documents.
As the collection frequency of a term decreases its weight increases.
Emphasis is on terms exhibiting the lowest document frequency.
Cont…
Term importance is inversely proportional to the total
number of documents to which the term is assigned; higher
weights are associated with terms appearing in fewer
documents or items
Tests with IDF scheme show that it consistently
produces substantial performance improvements
compared to unweighted (binary) systems
Problems with IDF weights
It identifies a term that appears in many documents as
not very useful for distinguishing relevant documents
from non-relevant ones
But this function does not take into account the
frequency of a term in a given document (i.e.,
FREQik ).
That is, a term may occur in only a few documents of a
collection and, at the same time, only a small number of
times within those documents; IDF alone would still rate
such a term highly, even though authors tend to repeat the
terms that really matter to them
Solution to the problem of IDF
weights
Term importance or weights, should combine two
measurements
The direct proportion to the frequency of the term in
a document (which quantifies how well the term
describes the document or the content)
The inverse proportion to the number of documents in
the collection in which the term appears
This takes us to the next and best term importance
measure, the composite measure (tf*idf)
The composite measure (tf*idf)
Is a measure that combines term frequency and inverse document
frequency
Rationale for this approach:
A high occurrence frequency in a particular document
indicates that the term carries a great deal of importance in
that document.
A low overall collection frequency (the number of documents in the
collection to which the term is assigned) indicates at the same time
that the importance of the term in the remainder of the collection
is relatively small so that the term can actually distinguish the
documents to which it is assigned from the remainder of the
collection.
Thus, such a term can be considered as being of potentially greater
importance for retrieval purposes
Cont…
This scheme assigns a weight to each term (vocabulary
word) in a given document by
combining term frequency and inverse document
frequency. And it is defined by
tf*idf: WEIGHTik = wik = FREQik × (log2 N − log2 dk)
or, with the normalized term frequency,
tf*idf: WEIGHTik = wik = (FREQik / maxj{FREQij}) × (log2 N − log2 dk)
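A brief sketch combining the normalized term frequency with IDF, as in the second formula above; the toy collection and function name are illustrative.

```python
# Composite tf*idf weight: (FREQ_ik / max_j FREQ_ij) * log2(N / d_k).
import math
from collections import Counter

def tf_idf(doc_tokens, collection):
    """collection is a list of token lists; doc_tokens is one document's tokens."""
    N = len(collection)
    freq = Counter(doc_tokens)
    max_freq = max(freq.values())
    df = Counter()
    for tokens in collection:
        df.update(set(tokens))
    return {t: (c / max_freq) * math.log2(N / df[t]) for t, c in freq.items()}

collection = [["holiday", "season", "holiday"], ["season", "greetings"], ["holiday"]]
print(tf_idf(collection[0], collection))
# holiday: (2/2)*log2(3/2) ≈ 0.585, season: (1/2)*log2(3/2) ≈ 0.292
```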
According to this function
Weight of term k in a given document i would
increase as the frequency of the term in the
document (FREQik ) increases but decreases as the
document frequency dk increases
Example on the composite measure (tf*idf):
Consider the following scenario in identifying the
importance of the terms in the sample collection
Computing TF-IDF: An Example
Assume collection contains 10,000 documents and
statistical analysis shows that document frequencies
(DF) of three terms are: A(50), B(1300), C(250). And
also term frequencies (TF) of these terms in a
document are: A(3), B(2), C(1).
Compute TF*IDF for each term?
Solution
A: tf = 3/3=1.00; idf = log2(10000/50)= 7.644;
tf*idf = 7.644
B: tf = 2/3=0.67; idf = log2(10000/1300)= 2.943;
tf*idf = 1.962
C: tf = 1/3=0.33; idf = log2(10000/250)= 5.322;
tf*idf = 1.774
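The same figures can be checked with a few lines of Python (base-2 logarithms, maximum-frequency normalization for tf):

```python
# Verifying the worked example: tf normalized by the largest term count (3).
import math

N = 10_000
data = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}   # term: (count in doc, df)
max_count = max(count for count, _ in data.values())
for term, (count, df) in data.items():
    tf = count / max_count
    idf = math.log2(N / df)
    print(term, round(tf, 2), round(idf, 3), round(tf * idf, 3))
# A 1.0 7.644 7.644
# B 0.67 2.943 1.962
# C 0.33 5.322 1.774
```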
Exercise - 1
A document contains, and only contains, the phrase
“being Ethiopian and not being Ethiopian”. Suppose
every word is indexed.
The document collection contains 1000 documents
and every word has equal document frequency of 100.
What is the weight of each term according to the
tf.idf weighting formula using a normalized
(frequency) term weight?
Exercise-2
A database collection consists of 1 million documents,
of which 200,000 contain the term holiday while
250,000 contain the term season.
A document repeats holiday 7 times and season 5
times. It is known that holiday is repeated more than
any other term in the document.
Calculate the weight of both terms in this document
using three different term weight methods
Solution
W(holiday)
tf = 7/7 = 1
idf = log2(1,000,000 / 200,000) = log2(5) = 2.32
tf*idf = 1 × 2.32 = 2.32
W(season)
tf = 5/7 = 0.71
idf = log2(1,000,000 / 250,000) = log2(4) = 2
tf*idf = 0.71 × 2 = 1.42
More Example (length normalization)
Consider a document containing 100 words wherein the
word computer appears 3 times. Now, assume we have
10, 000, 000 documents and computer appears in 1, 000
of these.
The term frequency (TF) for computer :
3/100 = 0.03
The inverse document frequency is
log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.29
The TF*IDF score is the product of these two values: 0.03 ×
13.29 ≈ 0.399
Exercise (use length normalization for tf)
Let:
C = number of times a given word appears in a document
TW = total number of words in a document
TD = total number of documents in the corpus
DF = total number of documents containing a given word
Compute the TF, IDF and TF*IDF score for each term.

Word       C   TW   TD   DF   TF   IDF   TFIDF
airplane   5   46   3    1
blue       1   46   3    1
chair      7   46   3    3
computer   3   46   3    1
forest     2   46   3    1
justice    7   46   3    3
love       2   46   3    1
might      2   46   3    1
perl       5   46   3    2
rose       6   46   3    3
shoe       4   46   3    1
thesis     2   46   3    2
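One possible way to fill in the table programmatically (length-normalized tf, base-2 idf); the code only restates the exercise data given above.

```python
# TF = C / TW (length normalization), IDF = log2(TD / DF), TFIDF = TF * IDF.
import math

rows = {  # word: (C, TW, TD, DF)
    "airplane": (5, 46, 3, 1), "blue": (1, 46, 3, 1), "chair": (7, 46, 3, 3),
    "computer": (3, 46, 3, 1), "forest": (2, 46, 3, 1), "justice": (7, 46, 3, 3),
    "love": (2, 46, 3, 1), "might": (2, 46, 3, 1), "perl": (5, 46, 3, 2),
    "rose": (6, 46, 3, 3), "shoe": (4, 46, 3, 1), "thesis": (2, 46, 3, 2),
}
for word, (c, tw, td, df) in rows.items():
    tf = c / tw
    idf = math.log2(td / df)
    print(f"{word:10s} TF={tf:.3f} IDF={idf:.3f} TFIDF={tf * idf:.3f}")
```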
The Signal-Noise Ratio Approach
It is an approach to term weighting based on Shannon's
Information Theory, which states that the information content
of a message, or term, is measured as an inverse function of
the probability of occurrence of the term in a given text.
As a result, the higher the probability of a word in a document
(e.g., “the”), the lower the information it carries.
Accordingly the information content of a word or term is
measured using the formula
INFORMATION = −log2(p)
Where
p is the probability of occurrence of the word
Examples
Finding the information content of the following terms gives
the following results:
“T1”, if it occurs once in every 10,000 words of the
collection: using the above formula the result is ≈ 13.29
“T2”, if it occurs once in every 10 terms: the result is ≈ 3.32
(That is, the probability of occurrence increases from 0.0001 to
0.1)
Cont…
The above values can be regarded as a measure of
reduced uncertainty
Accordingly, from the above example, knowing the term T2
reduces uncertainty only slightly, while T1 reduces the
uncertainty by about 13.29 bits
In other words, the information content decreases from about
13.29 to 3.32, so it makes little sense to look for T2 to
identify the document (that is why it has the low value 3.32)
Extending the idea
Suppose we have t number of terms selected to
represent a document,
Let pk be the probability of each term, then the average
information content (i.e., average reduction in uncertainty
about the document) can be given by Shannon's formula
AVERAGE INFORMATION = −∑k=1..t pk × log2(pk)
Example (Average information content)
If the terms ALPHA, BETA, GAMA and DELTA are
expected to occur with the probabilities 0.5, 0.2,
0.2, and 0.1 respectively, find the average
information
Using the above formula the result will be about 1.76,
and this is the average reduction of uncertainty
that we will achieve by selecting any term from
the above.
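The single-term and average information values above can be checked with a short script (base-2 logarithms):

```python
# Information content of a single probability and average information (entropy).
import math

def information(p):
    return -math.log2(p)

def average_information(probs):
    return -sum(p * math.log2(p) for p in probs)

print(information(0.0001))                        # ≈ 13.29 (term T1)
print(information(0.1))                           # ≈ 3.32  (term T2)
print(average_information([0.5, 0.2, 0.2, 0.1]))  # ≈ 1.76  (ALPHA/BETA/GAMA/DELTA)
print(average_information([0.25] * 4))            # 2.0, the maximum log2(4)
```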
Cont…
It is known that the average information is maximized
when the occurrence probabilities of the terms are all
equal to 1/t for t distinct terms, and the maximum
value is log2(t)
For example, if all the above 4 (four) terms are
expected to occur equally often, that is, one-fourth of
the time each, the average information value will be 2
(= log2 4) instead of about 1.76; this is the maximum value.
Noise
Salton assumes that a complete picture of term
behavior can be obtained by considering the
frequency characteristics of each term not only in the
particular document whose term weights are currently
under consideration but also in all other documents
in the collection
One such measure, which is derived from Shannon's
communication theory, is the Signal-Noise Ratio
Cont…
Noise
A disturbance in doing something
Of terms, in our case (in IR)
Noisy terms vs. good terms
Cont…
The noise of an index term k, Nk or NOISEk, for a
collection of N documents can be defined by
analogy to Shannon's information measure and is given by
NOISEk = Nk = ∑i=1..N (FREQik / TOTFREQk) × log2(TOTFREQk / FREQik)
Cont…
Nk or NOISEk
Is a function that measures the noise of the index
term k for a collection of N documents and
relates the noise to the spread of an index term
throughout the document collection
This measure of noise varies inversely with the
“concentration” of a term in the document
collection
Cont…
For a perfectly even distribution, when a term
occurs an identical number of times in every
document of the collection, the noise is maximized
If we have an even distribution (i.e., if the term
appears almost equally in all documents, or it is a
non-specific term) the noise is higher
That is, the term is not useful for selecting
documents, and thus noise is maximized
Example
If a term k occurs exactly once in each document
(a perfectly even distribution, i.e., FREQik = 1 for
i = 1, 2, 3, …, N),
the noise of term k is NOISEk = log2 N, which is the
maximum noise
On the other hand, the noise of a term k which
appears in only one document with
frequency = TOTFREQk will be zero
(If a term appears in only one document with its total
frequency, then the Noise is zero)
Cont…
If the noise is maximal, or the term does not occur in a
document, the weight of the term is zero; the term does not
discriminate among the documents
Thus there is a relation between noise and term
specificity.
Broad, nonspecific terms tend to have a more even
distribution across the documents of a collection, and hence
a high noise
That is, broad nonspecific terms have high noise
An inverse function of the noise might be used as a possible
function of term value (measure of term importance)
One such function is known as the SIGNAL of term k
Signal
Amount of information
The signal of a term k, Sk or SIGNALk, is defined by
SIGNALk = log2(TOTFREQk) − NOISEk
Cont…
For the maximum noise case previously discussed
(where each FREQik is equal to 1) the SIGNAL is equal
to 0, because TOTFREQk in that case equals N
On the other hand, when the term occurs in only one
document (NOISEk = 0), the maximum signal of the term
will be obtained:
SIGNALk = Sk = log2(TOTFREQk)
Cont…
In principle, it is possible to rank the index words
extracted from documents of a collection in
decreasing order of signal value
Alternatively, the importance, or weight, of term k
in a document i can be computed as a composite
function taking into account FREQik as well as
SIGNALk.
Cont…
A possible measure of this type, analogous to the earlier
tf*idf weighting function
wik = FREQik × (log2 N − log2 dk),
is
wik = FREQik × SIGNALk = FREQik × Sk
Cont…
That is, one computes the signal following the approach
discussed and then multiplies it by the appropriate
FREQik
This is the signal-noise ratio approach
It is suggested that even the simple ranking in order
of decreasing signal is sufficient to indicate the
“better” index terms
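A compact sketch of the noise and signal measures for a single term, given its frequency in each document of the collection; the two toy distributions reproduce the limiting cases discussed above.

```python
# NOISE_k and SIGNAL_k for a term, from its per-document frequencies.
import math

def noise(freqs):
    """NOISE_k = sum_i (FREQ_ik / TOTFREQ_k) * log2(TOTFREQ_k / FREQ_ik)."""
    tot = sum(freqs)
    return sum((f / tot) * math.log2(tot / f) for f in freqs if f > 0)

def signal(freqs):
    """SIGNAL_k = log2(TOTFREQ_k) - NOISE_k."""
    return math.log2(sum(freqs)) - noise(freqs)

even = [1, 1, 1, 1]            # perfectly even distribution over 4 documents
concentrated = [4, 0, 0, 0]    # the whole frequency in a single document
print(noise(even), signal(even))                  # 2.0 (= log2 N) and 0.0
print(noise(concentrated), signal(concentrated))  # 0.0 and 2.0 (= log2 TOTFREQ_k)
```

The composite weight wik = FREQik × SIGNALk then only requires multiplying a document's frequency for the term by the term's signal value.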
Term Discrimination Value (TDV)
(The Discrimination Model)
As discussed, the major use of indexing is to identify
sets of documents that are relevant to the user's
information need
Thus, an index term should provide a means of
separating documents into two sets
Those to be retrieved and
Those to be ignored
Cont…
The term discrimination value is proposed to measure
the degree to which the use of a term will help to
distinguish (or discriminate) the documents from each
other
It is an elegant mathematical model for the indexing
process and provides facilities for
Selecting index terms and
Weighting index terms
It is designed to rate a given index term in accordance
with its usefulness as a discriminator among the
documents of a collection
That is, how well it is able to discriminate the
documents of a collection from each other
Cont…
It takes as its base the idea of document space
configuration
where a document may be considered as a vector,
the elements of which are weights of every indexing
term to describe the document, and
where retrieval is considered as a vector matching
operation between a query vector (defined, similarly,
as a vector the elements of which are the weights of
the terms used to describe the query) and a document
vector
Cont…
Requires first a way to measure the similarity of two documents
Dot product, Cosine, Dice, Jaccard or overlapping coefficient
similarity measures
Consider a collection D of n documents each indexed by a set of t
terms
A particular document in D can be represented by a vector as
di=(wi1, wi2, …, wit)
Where
wij represents the weight, or degree of importance, of the j-th
term in document i
The wij may be weighted according to their importance using
one of the term weighting schemes
Cont…
These vectors (vector dj) can then be represented as
belonging to a t-dimensional vector space
A document is thus a vector in this space
(multidimensional space)
The main assumption here is that we can talk about
density of documents and compactness of space
Cont…
In the way described above, each document may be
represented by a single point whose position is
specified by the location where the corresponding
document vector touches the sphere
It is then possible to compute a similarity coefficient,
S(di , dj ), between any two documents by comparing
the corresponding vector pairs,
The similarity coefficient S(di , dj ) reflects the degree
of similarity or closeness between the two documents
Overall procedure for computing DV
of a term
Compute the average inter-document similarity in the
collection using some appropriate function
Next, the term k being evaluated is removed from the
index vocabulary and the same average document
similarity is computed
The DV for the term is then computed as
DVk= average similarity without k – average
similarity with k
DVk gives the difference in space density before and after
the assignment of term k as an index term
A higher value is better because including the keyword
will result in better information retrieval.
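A rough sketch of this procedure, using cosine similarity (one of the measures mentioned earlier) and toy document vectors; all weights and names are illustrative.

```python
# DV_k = average pairwise similarity without term k - average similarity with it.
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_similarity(docs):
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(docs, term):
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return avg_similarity(without) - avg_similarity(docs)

docs = [{"apple": 3, "fruit": 1}, {"apple": 2, "pie": 2}, {"apple": 1, "fruit": 2}]
print(discrimination_value(docs, "apple"))
# negative: "apple" occurs in every document, so it is a poor discriminator
```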
Calculating the DV of a Term
For each potential index term, a discrimination value,
DISCVALUEk or DVk, can be computed as the difference
in space density before and after assignments of that
term
The greater the difference in space densities, the
more the space will be spread after assignment of
a given term, and therefore the better this
term will function as a discriminator
Indexing structures
Last thing before retrieval
How Current Search Engines
index?
Search engines build indexes using a web crawler,
which gathers each page on the Web for indexing.
The pages are then organized with the help of the
selected indexing structure.
Once the pages are indexed, the local copy of each
page is discarded, unless stored in a cache.
Cont…
Some of the search engines, such as Google, AltaVista,
Excite, HotBot, InfoSeek, Lycos, automatically index
pages.
Others, such as Yahoo, Magellan, Galaxy and WWW
Virtual Library, semi-automatically index; that means
there is partial human involvement during indexing.
Building Index file
Given text document collections, they can be
described by a set of representative keywords called
index terms.
An index file of a document collection is therefore a file
consisting of a list of index terms, each with a link to one
or more documents that contain the index term.
Cont…
A good index file maps each keyword Ki to a set of
documents Di that contain the keyword.
Index file usually has index terms in a sorted order.
The sort order of the terms in the index file provides
an order on a physical file.
Cont…
An index file contains a list of search terms that are
organized for associative look-up, i.e., to answer a
user's query.
Once index file is constructed it becomes easier to
answer in which documents a specified search term
appears.
Where there are several occurrences, the index file
also makes it possible to know the positions within each
document at which the terms appear.
Cont…
For organizing the index file of a collection of
documents, there are various techniques available.
The common indexing structures are the sequential file
and the inverted file.
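A minimal sketch of an inverted file of the kind described here: each term maps to the documents (and positions) in which it occurs. Document ids and helper names are illustrative.

```python
# Build a term -> {doc_id: [positions]} inverted index, with terms kept sorted.
from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(sorted(index.items()))

docs = {"d1": "term weighting in information retrieval",
        "d2": "retrieval of weighted index terms"}
index = build_inverted_index(docs)
print(index["retrieval"])   # {'d1': [4], 'd2': [0]}
```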
Next on
Modeling Modern IR Systems
But, remember two key issues in IR
Organizing (indexing)-index
Retrieval (Searching)- responding to query
Quiz
Describe the four basic weighting mechanisms:
their assumptions, how they work, and their limitations
End
Questions, comments and
reflections