Identifying Term Importance
(Term Weighting)
Chapter Three
(assigning importance levels to index terms)
Objectives
Understand why query terms and document terms
are weighted in modern information retrieval
Describe the various methods for weighting terms
Discuss the problems with the existing term
weighting methods in information retrieval
Write programs to implement the various term
weighting schemes
Motivation
The indexing process so far has generated a set of
natural language index terms as a representation
of the text
Although these terms belong to the general class of
content words, they are not equally important with
respect to the content of the text
Not all terms are equally capable of conveying the
semantics of the document/item
Cont…
Thus assigning an importance indicator to each term, i.e.,
term weighting, is important.
It enhances the performance of an information
retrieval system by improving precision.
But what do you think is the impact of it on recall?
Parameters playing an important
role in weight computation are
The index term itself
The text to be indexed: length of text and number of
different terms in the text
The relationship between the index term and the
text to be indexed: frequency of term in a text
The relationship between the index term and the document
collection (its frequency in the whole collection)
Cont…
Remark:
Most weighting functions rely on the distribution pattern of
the terms in the text to be indexed and/or in a reference
collection and use statistics to compute the weights.
The weight of an index term is usually a numerical value.
Term weights have a value of zero or above; in the case of
normalized weights they vary between 0 and 1,
with values closer to one indicating very important index
terms and
values closer to zero very weak terms.
Term weighting is a crucial part of an automatic information
retrieval system.
Cont…
In conventional retrieval systems,
a term is either used to identify a given item (in
which case it is assumed to carry weight 1) or it is
not (in which case it is assumed to carry a weight
of 0).
Such a system is called a binary system and has
proved to be of limited importance.
For example no distinction is made within the set of
retrieved documents.
That is, all retrieved documents are considered to be
equally important for the query
Cont…
Term weighting based systems (e.g., systems that
use statistical weighting schemes) are designed to
overcome these shortcomings
This is done by assigning numerical values to each of the
index terms in a query or a document, reflecting their
relative importance.
A term with a high weight is assumed to be very
relevant to the document or query
A term with low weight on the other hand indicates
little relevance to the content of the document or
query
Cont…
These importance indicator values (weights) can then be used to
define a function, which measures the similarity or closeness
between query and documents
This in turn makes it possible to retrieve documents in the
decreasing order of query-document similarity, the most similar
(presumably relevant) being retrieved first
There are two difficulties in this respect
The difficulty in deciding how to allocate the weights
The difficulty in actually calculating the values which are to
be used (computation intensive)
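To make the ranking idea above concrete, here is a rough sketch in Python (all names and weight values are illustrative, not from the slides): documents are scored by a simple dot product between their term weights and the query term weights, then sorted in decreasing order of that score.

```python
# Illustrative sketch: rank documents by a dot-product similarity between
# document term weights and query term weights.

def score(doc_weights, query_weights):
    """Dot product over the terms shared by the document and the query."""
    return sum(w * query_weights[t] for t, w in doc_weights.items() if t in query_weights)

def rank(docs, query_weights):
    """Return document ids sorted by decreasing query-document similarity."""
    return sorted(docs, key=lambda d: score(docs[d], query_weights), reverse=True)

# Toy data: two documents and one query, with made-up weights.
docs = {
    "d1": {"retrieval": 0.8, "index": 0.3},
    "d2": {"retrieval": 0.2, "weather": 0.9},
}
query = {"retrieval": 1.0, "index": 0.5}
print(rank(docs, query))  # ['d1', 'd2']
```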
Some of the term weighting schemes (or
functions or methods) suggested are:
The term frequency (tf) weights
The inverse document frequency (IDF, or collection
frequency) weights
The composite weight (tf*idf)
The signal-noise ratio
Early proposal to term weighting
It is instructive to consider an early proposal for term weighting
(the assignment of importance indicators). It follows these basic
steps (a code sketch follows the list):
Calculate the frequency of each unique term in each document of a
given collection of n documents; the frequency of term k in
document i is denoted FREQik
Determine the total collection frequency TOTFREQk for each term by
summing its frequencies across all n documents:
TOTFREQk = ∑i FREQik
Arrange the words in decreasing order according to their collection
frequency. Decide on some threshold value and remove all words
with a collection frequency above this threshold.
Do the same for low-frequency words.
The remaining medium-frequency words are now used for
assignment to the documents as index terms
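A minimal sketch of this early procedure, assuming simple whitespace tokenization; the cutoff values and the function name are illustrative choices, not part of the original proposal.

```python
# Sketch of the early (Luhn-style) procedure: count FREQ_ik per document,
# sum to TOTFREQ_k, then keep only medium-frequency terms.
from collections import Counter

def medium_frequency_terms(documents, high_cutoff=50, low_cutoff=2):
    """Keep terms whose total collection frequency lies between the two cutoffs."""
    freq = [Counter(doc.lower().split()) for doc in documents]  # FREQ_ik
    totfreq = Counter()                                         # TOTFREQ_k
    for f in freq:
        totfreq.update(f)
    return {t for t, c in totfreq.items() if low_cutoff <= c <= high_cutoff}

docs = ["information retrieval ranks documents",
        "retrieval of relevant documents",
        "the the the the"]
print(medium_frequency_terms(docs, high_cutoff=3, low_cutoff=2))
# e.g. {'retrieval', 'documents'}
```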
Term frequency (tf) weights
Motivation:
The frequency of occurrence of a term is a useful
indication of its relative importance in describing
(defining) a document
In other words, term importance is related to its
frequency of occurrence.
If term A is mentioned more than term B, then the
document is more about A than about B
(assuming A and B to be content bearing terms)
Cont…
Supporting idea:
“Authors tend to repeat certain words as they advance or
vary the argument on an aspect of the subject”
One such measure assumes that the value, importance, or
weight, of a term assigned to a document is simply
proportional to the term frequency (i.e., the frequency of
occurrence of that particular term in that particular document)
Thus the assumption here is that,
The more frequently a term occurs in a document the
more likely it is to be of value in describing the
content of the document.
Do you agree with this?
Cont…
Accordingly, the weight of term k in document i,
denoted by wik, might be determined by
wik = FREQik
Where, FREQik is the frequency of term k in document i
Cont…
Remarks:
It is a simple count of the number of occurrences of a
term in a particular document (or query).
It is a measure of term density in a document.
Experiments have shown that, despite its weaknesses, this
method gives better results than Boolean (binary) systems.
The basic idea is to differentiate terms within a document
Problems with Term frequency (tf)
weights
Such a weighting system sometimes does not perform as
expected, especially in cases where the high frequency words
are equally distributed throughout the collection
Since it does not take into account the role of term k in any
document other than document i, it does not consider the
importance of term k in the collection.
This simple measure is not normalized to account for variances
in the length of documents
A one-page document with 10 mentions of A is “more about
A” than a 100-page document with 20 mentions of A
Used alone, it favors common words and long documents.
How?
Solutions to the Problems with
Term frequency (tf) weights
Divide each frequency count by the length of the
document, in terms of words (length Normalization)
Divide each frequency count by the maximum frequency
count of any term in the document (frequency
normalization).
In this case the normalized frequency fij is used
instead of FREQij
Cont…
The normalized tf is given by
fij = FREQij / maxm(FREQim)
Where
fij is the normalized frequency of term j in document i
maxm(FREQim) is the maximum frequency of any term in document i
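A small sketch of the two normalizations in Python (the function name and scheme labels are mine):

```python
# Term-frequency weights with maximum-frequency or length normalization.
from collections import Counter

def tf_weights(doc_tokens, scheme="max"):
    freq = Counter(doc_tokens)            # FREQ_ij
    if scheme == "max":                   # divide by the largest count in the document
        denom = max(freq.values())
    elif scheme == "length":              # divide by the document length in words
        denom = len(doc_tokens)
    else:                                 # raw counts
        denom = 1
    return {t: c / denom for t, c in freq.items()}

tokens = "term weighting weights index terms by term frequency".split()
print(tf_weights(tokens, "max"))     # 'term' -> 1.0, all other words -> 0.5
print(tf_weights(tokens, "length"))  # 'term' -> 0.25, all other words -> 0.125
```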
Inverse Document Frequency (IDF)
weights
Also called collection frequency weighting,
introduced by Karen Spärck Jones, another key figure in
Information Retrieval.
According to this measure, the importance of a term in a
document is measured (weighted) by the number of
documents in a collection that contain the term.
The basic idea here is to differentiate terms in queries.
Accordingly the assumption is:
If a term occurs in many of the documents in the
collection then it doesn’t serve well as a document
identifier and should be given a low importance (weight)
as a potential index term.
Cont…
Assuming that term k occurs in at least one
document (dk ≠ 0), a possible measure of the inverse
document frequency is defined by
wk = log2(N / dk)
Where,
N is the total number of documents in the collection
dk the number of documents in which k occurs
Wk is the weight assigned to term k
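A short sketch of computing wk = log2(N / dk) over a toy collection (the helper names are illustrative):

```python
# IDF weights: w_k = log2(N / d_k), where d_k is the number of documents
# that contain term k.
import math
from collections import Counter

def idf_weights(documents):
    N = len(documents)
    d = Counter()                                   # d_k for every term k
    for doc in documents:
        d.update(set(doc.lower().split()))
    return {t: math.log2(N / dk) for t, dk in d.items()}

docs = ["holiday season travel", "holiday prices", "rainy season", "rainy day"]
print(idf_weights(docs))
# holiday, season, rainy -> log2(4/2) = 1.0; travel, prices, day -> log2(4/1) = 2.0
```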
Cont…
That is, the weight of a term in a document is the logarithm of the
number of documents in the collection divided by the number of
documents in the collection that contain the term (with 2 as the
base of the logarithm)
The log is used to make the values of tf and idf comparable.
It can be interpreted as the amount of information associated with
term ki
IDF measures the rarity of a term across the whole document collection.
According to this measure, i.e., IDF, The more a term t occurs
throughout all documents, the more poorly that term t
discriminates between documents.
As the collection frequency of a term decreases its weight increases.
Emphasis is on terms exhibiting the lowest document frequency.
Cont…
Term importance is inversely proportional to the total
number of documents to which the term is assigned; higher
weights are associated with terms appearing in fewer
documents or items
Tests with IDF scheme show that it consistently
produces substantial performance improvements
compared to unweighted (binary) systems
Problems with IDF weights
It identifies a term that appears in many documents as
not very useful for distinguishing relevant documents
from non-relevant ones
But this function does not take into account the
frequency of a term in a given document (i.e.,
FREQik ).
That is, a term may occur in only a few documents of a
collection and, at the same time, only a small number of
times within those documents; IDF alone would still rate
such a term highly, even though authors tend to repeat the
terms that really matter to them
Solution to the problem of IDF
weights
Term importance or weights, should combine two
measurements
The direct proportion to the frequency of the term in
a document (which quantifies how well the term
describes the document or the content)
The inverse proportion to the number of documents in
the collection in which the term appears
This takes us to the next and best term importance
measure, the composite measure (tf*idf)
The composite measure (tf*idf)
Is a measure that combines term frequency and inverse document
frequency
Rationale for this approach:
A high occurrence frequency in a particular document
indicates that the term carries a great deal of importance in
that document.
A low overall collection frequency (the number of documents in the
collection to which the term is assigned) indicates at the same time
that the importance of the term in the remainder of the collection
is relatively small so that the term can actually distinguish the
documents to which it is assigned from the remainder of the
collection.
Thus, such a term can be considered as being of potentially greater
importance for retrieval purposes
Cont…
This scheme assigns a weight to each term (vocabulary
word) in a given document by
combining term frequency and inverse document
frequency. And it is defined by
tf*idf: WEIGHTik = wik = FREQik × (log2 N − log2 dk)
or, with the normalized term frequency,
tf*idf: WEIGHTik = wik = (FREQik / maxj{FREQij}) × (log2 N − log2 dk)
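A brief sketch combining the normalized term frequency with IDF, as in the second formula above; the toy collection and function name are illustrative.

```python
# Composite tf*idf weight: (FREQ_ik / max_j FREQ_ij) * log2(N / d_k).
import math
from collections import Counter

def tf_idf(doc_tokens, collection):
    """collection is a list of token lists; doc_tokens is one document's tokens."""
    N = len(collection)
    freq = Counter(doc_tokens)
    max_freq = max(freq.values())
    df = Counter()
    for tokens in collection:
        df.update(set(tokens))
    return {t: (c / max_freq) * math.log2(N / df[t]) for t, c in freq.items()}

collection = [["holiday", "season", "holiday"], ["season", "greetings"], ["holiday"]]
print(tf_idf(collection[0], collection))
# holiday: (2/2)*log2(3/2) ≈ 0.585, season: (1/2)*log2(3/2) ≈ 0.292
```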
According to this function
Weight of term k in a given document i would
increase as the frequency of the term in the
document (FREQik ) increases but decreases as the
document frequency dk increases
Example on the composite measure (tf*idf):
Consider the following scenario in identifying the
importance of the terms in the sample collection
Computing TF-IDF: An Example
Assume collection contains 10,000 documents and
statistical analysis shows that document frequencies
(DF) of three terms are: A(50), B(1300), C(250). And
also term frequencies (TF) of these terms in a
document are: A(3), B(2), C(1).
Compute TF*IDF for each term?
Solution
A: tf = 3/3=1.00; idf = log2(10000/50)= 7.644;
tf*idf = 7.644
B: tf = 2/3=0.67; idf = log2(10000/1300)= 2.943;
tf*idf = 1.962
C: tf = 1/3=0.33; idf = log2(10000/250)= 5.322;
tf*idf = 1.774
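The same figures can be checked with a few lines of Python (base-2 logarithms, maximum-frequency normalization for tf):

```python
# Verifying the worked example: tf normalized by the largest term count (3).
import math

N = 10_000
data = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}   # term: (count in doc, df)
max_count = max(count for count, _ in data.values())
for term, (count, df) in data.items():
    tf = count / max_count
    idf = math.log2(N / df)
    print(term, round(tf, 2), round(idf, 3), round(tf * idf, 3))
# A 1.0 7.644 7.644
# B 0.67 2.943 1.962
# C 0.33 5.322 1.774
```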
Exercise - 1
A document contains, and only contains, the phrase
“being Ethiopian and not being Ethiopian”. Suppose
every word is indexed.
The document collection contains 1000 documents
and every word has equal document frequency of 100.
What is the weight of each term according to the
tf.idf weighting formula using a normalized
(frequency) term weight?
Exercise-2
A database collection consists of 1 million documents,
of which 200,000 contain the term holiday while
250,000 contain the term season.
A document repeats holiday 7 times and season 5
times. It is known that holiday is repeated more than
any other term in the document.
Calculate the weight of both terms in this document
using three different term weight methods
Solution
W(holiday)
tf = 7/7 = 1
idf = log2(1,000,000 / 200,000) = log2(5) = 2.32
tf*idf = 1 × 2.32 = 2.32
W(season)
tf = 5/7 = 0.71
idf = log2(1,000,000 / 250,000) = log2(4) = 2
tf*idf = 0.71 × 2 = 1.42
More Example (length normalization)
Consider a document containing 100 words wherein the
word computer appears 3 times. Now, assume we have
10, 000, 000 documents and computer appears in 1, 000
of these.
The term frequency (TF) for computer :
3/100 = 0.03
The inverse document frequency is
log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.29
The TF*IDF score is the product of these two values: 0.03 ×
13.29 ≈ 0.399
Exercise (use length normalization for tf)
Let:
C = number of times a given word appears in a document
TW = total number of words in a document
TD = total number of documents in the corpus
DF = total number of documents containing a given word
Compute the TF, IDF and TF*IDF score for each term.

Word       C   TW   TD   DF   TF   IDF   TFIDF
airplane   5   46   3    1
blue       1   46   3    1
chair      7   46   3    3
computer   3   46   3    1
forest     2   46   3    1
justice    7   46   3    3
love       2   46   3    1
might      2   46   3    1
perl       5   46   3    2
rose       6   46   3    3
shoe       4   46   3    1
thesis     2   46   3    2
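One possible way to fill in the table programmatically (length-normalized tf, base-2 idf); the code only restates the exercise data given above.

```python
# TF = C / TW (length normalization), IDF = log2(TD / DF), TFIDF = TF * IDF.
import math

rows = {  # word: (C, TW, TD, DF)
    "airplane": (5, 46, 3, 1), "blue": (1, 46, 3, 1), "chair": (7, 46, 3, 3),
    "computer": (3, 46, 3, 1), "forest": (2, 46, 3, 1), "justice": (7, 46, 3, 3),
    "love": (2, 46, 3, 1), "might": (2, 46, 3, 1), "perl": (5, 46, 3, 2),
    "rose": (6, 46, 3, 3), "shoe": (4, 46, 3, 1), "thesis": (2, 46, 3, 2),
}
for word, (c, tw, td, df) in rows.items():
    tf = c / tw
    idf = math.log2(td / df)
    print(f"{word:10s} TF={tf:.3f} IDF={idf:.3f} TFIDF={tf * idf:.3f}")
```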
The Signal-Noise Ratio Approach
It is an approach to term weighting based on Shannon's
Information Theory, which states that the information content
of a message, or term, is measured as an inverse function of
the probability of occurrence of the term in a given text.
As a result, the higher the probability of a word in a document
(e.g., “the”), the lower the information it carries.
Accordingly the information content of a word or term is
measured using the formula
INFORMATION = −log2(p)
Where
p is the probability of occurrence of the word
Examples
Finding the information content of the following terms gives
the following results:
“T1”, if it occurs once in every 10,000 words of the
collection: using the above formula the result is ≈ 13.29
“T2”, if it occurs once in every 10 terms: the result is ≈ 3.32
(That is, the probability of occurrence increases from 0.0001 to
0.1)
Cont…
The above values can be regarded as a measure of
reduced uncertainty
Accordingly, from the above example, knowing the term T2
reduces uncertainty only slightly, while T1 reduces the
uncertainty by about 13.29 bits
In other words, the information content decreases from about
13.29 to 3.32, so it makes little sense to look for T2 to
identify the document (that is why it has the low value 3.32)
Extending the idea
Suppose we have t number of terms selected to
represent a document,
Let pk be the probability of each term, then the average
information content (i.e., average reduction in uncertainty
about the document) can be given by Shannon's formula
AVERAGE INFORMATION = −∑k=1..t pk × log2(pk)
Example (Average information content)
If the terms ALPHA, BETA, GAMA and DELTA are
expected to occur with the probabilities 0.5, 0.2,
0.2, and 0.1 respectively, find the average
information
Using the above formula the result will be about 1.76,
and this is the average reduction of uncertainty
that we will achieve by selecting any term from
the above.
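The single-term and average information values above can be checked with a short script (base-2 logarithms):

```python
# Information content of a single probability and average information (entropy).
import math

def information(p):
    return -math.log2(p)

def average_information(probs):
    return -sum(p * math.log2(p) for p in probs)

print(information(0.0001))                        # ≈ 13.29 (term T1)
print(information(0.1))                           # ≈ 3.32  (term T2)
print(average_information([0.5, 0.2, 0.2, 0.1]))  # ≈ 1.76  (ALPHA/BETA/GAMA/DELTA)
print(average_information([0.25] * 4))            # 2.0, the maximum log2(4)
```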
Cont…
It is known that the average information is maximized
when the occurrence probabilities of the terms are all
equal to 1/t for t distinct terms, and the maximum
value is log2(t)
For example, if all the above 4 (four) terms are
expected to occur equally often, that is, one-fourth of
the time each, the average information value will be 2
(= log2 4) instead of about 1.76; this is the maximum value.
Noise
Salton assumes that a complete picture of term
behavior can be obtained by considering the
frequency characteristics of each term not only in the
particular document whose term weights are currently
under consideration but also in all other documents
in the collection
One such measure, which is derived from Shannon's
communication theory, is the Signal-Noise Ratio
Cont…
Noise
A disturbance in doing something
Of terms, in our case (in IR)
Noisy terms vs. good terms
Cont…
The noise of an index term k, Nk or NOISEk, for a
collection of N documents can be defined by
analogy to Shannon's information measure and is given by
NOISEk = Nk = ∑i=1..N (FREQik / TOTFREQk) × log2(TOTFREQk / FREQik)
Cont…
Nk or NOISEk
Is a function that measures the noise of the index
term k for a collection of N documents and
relates the noise to the spread of an index term
throughout the document collection
This measure of noise varies inversely with the
“concentration” of a term in the document
collection
Cont…
For a perfectly even distribution, when a term
occurs an identical number of times in every
document of the collection, the noise is maximized
If we have an even distribution (i.e., if the term
appears almost equally in all documents, or it is a
non-specific term) the noise is higher
That is, the term is not useful for selecting
documents, and thus noise is maximized
Example
If a term k occurs exactly once in each document
(a perfectly even distribution, i.e., FREQik = 1 for
i = 1, 2, 3, …, N),
the noise of term k is NOISEk = log2 N, which is the
maximum noise
On the other hand, the noise of a term k which
appears in only one document with
frequency = TOTFREQk will be zero
(If a term appears in only one document with its total
frequency, then the Noise is zero)
Cont…
If the noise is maximal, or the term does not occur in a
document, the weight of the term is zero; the term does not
discriminate among the documents
Thus there is a relation between noise and term
specificity.
Broad, nonspecific terms tend to have a more even
distribution across the documents of a collection, and hence
a high noise
That is, broad nonspecific terms have high noise
An inverse function of the noise might be used as a possible
function of term value (measure of term importance)
One such function is known as the SIGNAL of term k
Signal
Amount of information
The signal of a term k, Sk or SIGNALk, is defined by
SIGNALk = log2(TOTFREQk) − NOISEk
Cont…
For the maximum noise case previously discussed
(where each FREQik is equal to 1) the SIGNAL is equal
to 0, because TOTFREQk in that case equals N
On the other hand, when the term occurs in only one
document (NOISEk = 0), the maximum signal of the term
will be obtained:
SIGNALk = Sk = log2(TOTFREQk)
Cont…
In principle, it is possible to rank the index words
extracted from documents of a collection in
decreasing order of signal value
Alternatively, the importance, or weight, of term k
in a document i can be computed as a composite
function taking into account FREQik as well as
SIGNALk.
Cont…
A possible measure of this type, analogous to the earlier
tf*idf weighting function
wik = FREQik × (log2 N − log2 dk),
is
wik = FREQik × SIGNALk = FREQik × Sk
Cont…
That is, one computes the signal following the approach
discussed and then multiplies it by the appropriate
FREQik
This is the signal-noise ratio approach
It is suggested that even the simple ranking in order
of decreasing signal is sufficient to indicate the
“better” index terms
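A compact sketch of the noise and signal measures for a single term, given its frequency in each document of the collection; the two toy distributions reproduce the limiting cases discussed above.

```python
# NOISE_k and SIGNAL_k for a term, from its per-document frequencies.
import math

def noise(freqs):
    """NOISE_k = sum_i (FREQ_ik / TOTFREQ_k) * log2(TOTFREQ_k / FREQ_ik)."""
    tot = sum(freqs)
    return sum((f / tot) * math.log2(tot / f) for f in freqs if f > 0)

def signal(freqs):
    """SIGNAL_k = log2(TOTFREQ_k) - NOISE_k."""
    return math.log2(sum(freqs)) - noise(freqs)

even = [1, 1, 1, 1]            # perfectly even distribution over 4 documents
concentrated = [4, 0, 0, 0]    # the whole frequency in a single document
print(noise(even), signal(even))                  # 2.0 (= log2 N) and 0.0
print(noise(concentrated), signal(concentrated))  # 0.0 and 2.0 (= log2 TOTFREQ_k)
```

The composite weight wik = FREQik × SIGNALk then only requires multiplying a document's frequency for the term by the term's signal value.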
Term Discrimination Value (TDV)
(The Discrimination Model)
As discussed, the major use of indexing is to identify
sets of documents that are relevant to the user's
information need
Thus, an index term should provide a means of
separating documents into two sets
Those to be retrieved and
Those to be ignored
Cont…
The term discrimination value is proposed to measure
the degree to which the use of a term will help to
distinguish (or discriminate) the documents from each
other
It is an elegant mathematical model for the indexing
process and provides facilities for
Selecting index terms and
Weighting index terms
It is designed to rate a given index term in accordance
with its usefulness as a discriminator among the
documents of a collection
That is, how well it is able to discriminate the
documents of a collection from each other
Cont…
It takes as its base the idea of document space
configuration
where a document may be considered as a vector,
the elements of which are weights of every indexing
term to describe the document, and
where retrieval is considered as a vector matching
operation between a query vector (defined, similarly,
as a vector the elements of which are the weights of
the terms used to describe the query) and a document
vector
Cont…
Requires first a way to measure the similarity of two documents
Dot product, Cosine, Dice, Jaccard or overlapping coefficient
similarity measures
Consider a collection D of n documents each indexed by a set of t
terms
A particular document in D can be represented by a vector as
di=(wi1, wi2, …, wit)
Where
wij represents the weight, or degree of importance, of the j-th
term in document i
The wij may be weighted according to their importance using
one of the term weighting schemes
Cont…
These vectors (vector dj) can then be represented as
belonging to a t-dimensional vector space
A document is thus a vector in this space
(multidimensional space)
The main assumption here is that we can talk about
density of documents and compactness of space
Cont…
In the way described above, each document may be
represented by a single point whose position is
specified by the location where the corresponding
document vector touches the sphere
It is then possible to compute a similarity coefficient,
S(di , dj ), between any two documents by comparing
the corresponding vector pairs,
The similarity coefficient S(di , dj ) reflects the degree
of similarity or closeness between the two documents
Overall procedure for computing DV
of a term
Compute the average inter-document similarity in the
collection using some appropriate function
Next, the term k being evaluated is removed from the
index vocabulary and the same average document
similarity is computed
The DV for the term is then computed as
DVk= average similarity without k – average
similarity with k
DVk gives the difference in space density before and after
the assignment of term k as an index term
A higher value is better because including the keyword
will result in better information retrieval.
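A rough sketch of this procedure, using cosine similarity (one of the measures mentioned earlier) and toy document vectors; all weights and names are illustrative.

```python
# DV_k = average pairwise similarity without term k - average similarity with it.
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_similarity(docs):
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(docs, term):
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return avg_similarity(without) - avg_similarity(docs)

docs = [{"apple": 3, "fruit": 1}, {"apple": 2, "pie": 2}, {"apple": 1, "fruit": 2}]
print(discrimination_value(docs, "apple"))
# negative: "apple" occurs in every document, so it is a poor discriminator
```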
Calculating the DV of a Term
For each potential index term, a discrimination value,
DISCVALUEk or DVk, can be computed as the difference
in space density before and after assignments of that
term
The greater the difference in space densities, the
more the space will be spread after assignment of
a given term, and therefore the better this
term will function as a discriminator
Indexing structures
Last thing before retrieval
How Current Search Engines
index?
Search engines build indexes using a web crawler,
which gathers each page on the Web for indexing.
The pages are then organized with the help of the
selected indexing structure.
Once the pages are indexed, the local copy of each
page is discarded, unless stored in a cache.
Cont…
Some of the search engines, such as Google, AltaVista,
Excite, HotBot, InfoSeek, Lycos, automatically index
pages.
Others, such as Yahoo, Magellan, Galaxy and WWW
Virtual Library, semi-automatically index; that means
there is partial human involvement during indexing.
Building Index file
Given text document collections, they can be
described by a set of representative keywords called
index terms.
An index file of a document collection is therefore a file
consisting of a list of index terms, each with a link to one
or more documents that contain the index term.
Cont…
A good index file maps each keyword Ki to a set of
documents Di that contain the keyword.
Index file usually has index terms in a sorted order.
The sort order of the terms in the index file provides
an order on a physical file.
Cont…
An index file contains a list of search terms that are
organized for associative look-up, i.e., to answer a
user's query.
Once index file is constructed it becomes easier to
answer in which documents a specified search term
appears.
Where there are several occurrences, the index file
also makes it possible to know the positions within each
document at which the terms appear.
Cont…
For organizing the index file of a collection of
documents, there are various techniques available.
The common indexing structures are the sequential file
and the inverted file.
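A minimal sketch of an inverted file of the kind described here: each term maps to the documents (and positions) in which it occurs. Document ids and helper names are illustrative.

```python
# Build a term -> {doc_id: [positions]} inverted index, with terms kept sorted.
from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(sorted(index.items()))

docs = {"d1": "term weighting in information retrieval",
        "d2": "retrieval of weighted index terms"}
index = build_inverted_index(docs)
print(index["retrieval"])   # {'d1': [4], 'd2': [0]}
```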
Next on
Modeling Modern IR Systems
But, remember two key issues in IR
Organizing (indexing)-index
Retrieval (Searching)- responding to query
Quiz
Describe the four basic weighting mechanisms:
their assumptions, how they work, and their limitations
End
Questions, comments and
reflections