VECTOR SPACE MODEL
The VSM is an algebraic model used for Information Retrieval. It represents natural
language documents in a formal manner through vectors in a multidimensional space.
The Vector Space Model (VSM) is a way of representing documents through the words
that they contain. The idea behind vector space modeling is that by placing terms,
documents, and queries in a common term-document space, it is possible to compute the
similarity between a query and the terms or documents, and to rank the results of the
computation according to that similarity measure. The VSM thus allows decisions to be
made about which documents are similar to each other and to queries.
(a) How it works
Each document is broken down into a word frequency table
The tables are called vectors and can be stored as arrays
A vocabulary is built from all the words in all documents in the system
Each document and user query is represented as a vector over the vocabulary
A similarity measure is calculated between the query vector and each document vector
The documents are ranked by relevance to the query
The vector space model provides the user with a guide to documents that may be more
similar and of greater significance, by calculating a distance or angle measure between
the query and the terms or documents. Vector space modeling is based on the assumption that
the meaning of a document can be understood from the document’s constituent terms.
Documents are represented as “vectors of terms d = (t1, t2, …, tn) where ti (1 <= i <= n) is
a non-negative value denoting the single or multiple occurrences of term i in document
d.” Each unique term represents a dimension in the space. “Similarly, a
query is represented as a vector Q = (t1, t2, …, tn) where term ti (1 <= i <= n) is a non-
negative value denoting the number of occurrences of ti (or, merely a 1 to signify the
occurrence of the term) in the query”. Once both the documents and the query have their
respective vectors calculated, it is possible to compute the distance between the objects in
the space and the query, so that objects with semantic content similar to the query
are retrieved. Vector space models that do not account for the distance between the
objects within the space treat each term independently. Using various similarity measures,
it is possible to compare queries to terms and documents in order to emphasize or de-
emphasize properties of the document collection. A good example of this is, “the dot
product (or, inner product) similarity measure finds the Euclidean distance between the
query and a term or document in the space”.
Consider the following two documents:
Document A: “A man and a woman.”
Document B: “A baby.”
Step-1: Each document is broken down into a word frequency table
The tables are called vectors and can be stored as arrays
Step-2: A vocabulary is built from all the words in all documents in the system. The
vocabulary contains all words used: a, man, and, woman, baby
Step-3: The vocabulary is put in a fixed order: a, and, man, woman, baby
Step-4: Each document is represented as a vector over the vocabulary:
Document A = (2, 1, 1, 1, 0); Document B = (1, 0, 0, 0, 1)
Step-5: Queries can be represented as vectors in the same way as documents. For
example, Woman = (0, 0, 0, 1, 0). A small code sketch of these steps follows.
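As an illustration, here is a minimal Python sketch of Steps 1 to 5. The function
build_vector and the variable names are ours, not part of any standard library, and the
tokenization (lower-casing and stripping the full stop) is deliberately naive:

    from collections import Counter

    def build_vector(text, vocabulary):
        # Lower-case the text, strip the full stop, and count term occurrences
        words = text.lower().replace(".", "").split()
        counts = Counter(words)
        # Missing terms count as 0 occurrences
        return [counts[term] for term in vocabulary]

    vocabulary = ["a", "and", "man", "woman", "baby"]       # the fixed order above
    doc_a = build_vector("A man and a woman.", vocabulary)  # [2, 1, 1, 1, 0]
    doc_b = build_vector("A baby.", vocabulary)             # [1, 0, 0, 0, 1]
    query = build_vector("Woman", vocabulary)               # [0, 0, 0, 1, 0]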
(b) Similarity measures/coefficient
Using a similarity measure, a set of documents can be compared to a query and the most
similar documents returned. Similarity in the VSM is determined using associative
coefficients based on the inner product of the document vector and the query vector, where
word overlap indicates similarity. There are many different ways to measure how similar
two vectors are, such as the Inner Product, the Cosine Measure, the Dice Coefficient, and
the Jaccard Coefficient. The most popular similarity measure is the cosine coefficient,
which measures the angle between a document vector and the query vector.
(c) The cosine measure
The cosine measure calculates the angle between two vectors in a high-dimensional
virtual space. For two vectors d and d’, the cosine similarity is given by:
cos(d, d’) = (d . d’) / (|d| x |d’|)
Here d . d’ is the dot (inner) product of d and d’, calculated by multiplying
corresponding frequencies together and summing the results; |d| and |d’| are the
Euclidean lengths of the vectors.
Step-6: Calculate the similarity measure of the query with every document in the collection
For Document A, d = (2, 1, 1, 1, 0) and d’ = (0, 0, 0, 1, 0)
d . d’ = 2x0 + 1x0 + 1x0 + 1x1 + 0x0 = 1
|d| = sqrt(2^2 + 1^2 + 1^2 + 1^2 + 0^2) = sqrt(7) = 2.646
|d’| = sqrt(0^2 + 0^2 + 0^2 + 1^2 + 0^2) = sqrt(1) = 1
Similarity = 1 / (2.646 x 1) = 0.378
For Document B, d = (1, 0, 0, 0, 1) and d’ = (0, 0, 0, 1, 0)
d . d’ = 1x0 + 0x0 + 0x0 + 0x1 + 1x0 = 0
|d| = sqrt(1^2 + 0^2 + 0^2 + 0^2 + 1^2) = sqrt(2) = 1.414
|d’| = sqrt(0^2 + 0^2 + 0^2 + 1^2 + 0^2) = sqrt(1) = 1
Similarity = 0 / (1.414 x 1) = 0
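The same calculation can be written in Python; cosine_similarity below is our own
helper, not a library function:

    import math

    def cosine_similarity(d, q):
        # Dot product of corresponding frequencies
        dot = sum(x * y for x, y in zip(d, q))
        # Euclidean lengths |d| and |q|
        norm_d = math.sqrt(sum(x * x for x in d))
        norm_q = math.sqrt(sum(y * y for y in q))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    print(cosine_similarity([2, 1, 1, 1, 0], [0, 0, 0, 1, 0]))  # 0.378 (Document A)
    print(cosine_similarity([1, 0, 0, 0, 1], [0, 0, 0, 1, 0]))  # 0.0   (Document B)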
(d) Ranking documents
A user enters a query
The query is compared to all documents using a similarity measure
The user is shown the documents in decreasing order of similarity to the query
Step-7: Rank the documents in descending order of similarity and display them to the
user (see the sketch below)
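Putting the pieces together, a sketch of the ranking step; it reuses build_vector and
cosine_similarity from the sketches above:

    # Rank the example documents by descending similarity to the query
    docs = {"Document A": doc_a, "Document B": doc_b}
    scores = {name: cosine_similarity(vec, query) for name, vec in docs.items()}
    for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(name, round(score, 3))
    # Document A 0.378
    # Document B 0.0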
Variations in VSM
(a) Stop words
Commonly occurring words are unlikely to give useful information and may be
removed from the vocabulary to speed processing
Stop word lists contain frequent words to be excluded
The top 20 stop words according to their average frequency per 1000 words are: the, of,
and, to, a, in, that, is, was, he, for, it, with, as, his, on, be, at, by, I. A filtering
sketch follows.
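In code, stop word removal is a simple set-membership filter; the sketch below uses the
20 words listed above:

    STOP_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
                  "for", "it", "with", "as", "his", "on", "be", "at", "by", "i"}

    def remove_stop_words(tokens):
        # Keep only tokens that are not in the stop word list
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words("a man and a woman".split()))  # ['man', 'woman']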
Term weighting
Not all words are equally useful.
A word is most likely to be highly relevant to document A if it is infrequent in other
documents and frequent in document A
The cosine measure needs to be modified to reflect this
This can be done by weighting each word by its frequency in the document
The term frequency is given by tf = log(1 + n(d, t) / n(d)), where n(d, t) is the number of
occurrences of term t in document d, and n(d) is the number of terms in document d.
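For example, for Document A (“A man and a woman.”), n(d) = 5 and n(d, man) = 1, so
tf = log(1 + 1/5) = log(1.2) ≈ 0.079 using base-10 logarithms; the choice of base only
rescales all weights uniformly.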
Normalized term frequency (tf)
A normalized measure of the importance of a word to a document is its frequency
divided by the maximum frequency of any term in the document
This is known as the tf factor.
Document A: raw frequency vector: (2, 1, 1, 1, 0); tf vector: (1, 0.5, 0.5, 0.5, 0)
This prevents long documents from scoring higher merely because they contain more
words (see the one-line sketch below)
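As a short Python sketch of this normalization:

    raw = [2, 1, 1, 1, 0]                  # raw frequency vector for Document A
    tf_vec = [c / max(raw) for c in raw]   # [1.0, 0.5, 0.5, 0.5, 0.0]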
Inverse document frequency (idf)
A calculation designed to make rare words more important than common words
The idf of word i is given by idf(i) = log(N / n(i)), where N is the number of documents
in the collection and n(i) is the number of documents that contain word i
For example, in the two-document collection above, idf(a) = log(2/2) = 0, while
idf(baby) = log(2/1) = log 2
idf provides high values for rare words and low values for common words
tf-idf
The tf-idf weighting scheme multiplies the weight of each word in each document by its
tf factor and its idf factor
Different schemes are usually used for query vectors
Different variants of tf-idf are also used
The resulting weight increases with the number of occurrences of the term within a document
It also increases with the rarity of the term across the whole corpus (a sketch follows)
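A minimal sketch of tf-idf weighting over the two-document example, using the log-based
tf and idf formulas given above. The helper names are ours; production systems typically
rely on a library implementation, and the exact tf-idf variant differs between them:

    import math

    def tf(term_count, doc_length):
        return math.log(1 + term_count / doc_length)   # tf = log(1 + n(d,t) / n(d))

    def idf(num_docs, docs_with_term):
        return math.log(num_docs / docs_with_term)     # idf = log(N / n(i))

    docs = {"A": {"a": 2, "man": 1, "and": 1, "woman": 1},
            "B": {"a": 1, "baby": 1}}
    N = len(docs)
    doc_freq = {}                                      # n(i) for every term i
    for counts in docs.values():
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    for name, counts in docs.items():
        n_d = sum(counts.values())                     # number of terms in the document
        weights = {t: tf(c, n_d) * idf(N, doc_freq[t]) for t, c in counts.items()}
        print(name, weights)                           # "a" gets weight 0: idf = log(2/2)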
(b) Stemming
Stemming is the process of removing suffixes from words to obtain a common root form. In
statistical analysis, it greatly helps, when comparing texts, to be able to identify words
with a common meaning and form as being identical. For example, we would like to
count the words stopped and stopping as being the same, both derived from stop.
Stemming identifies these common forms.
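For example, using NLTK’s implementation of the Porter stemmer (this assumes the nltk
package is installed):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["stopped", "stopping", "stops", "stop"]:
        print(word, "->", stemmer.stem(word))   # all four map to "stop"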
(c) Synonyms and multiple meanings of words
There are many ways to describe the same thing. For example, car and automobile may
describe the same object, yet a purely term-based model treats them as unrelated
dimensions.
Words often have multiple meanings, so the same term may match documents about
entirely different topics.
(d) Concept based VSM
This variation considers the semantics of the document instead of only the terms
contained in the document.
(e) Proximity
If the query terms occur close to each other in a document, the document is ranked
higher than if they occur far apart. A document in which the words “Information” and
“Retrieval” appear together is more likely to be about “Information Retrieval” than a
document in which the two words occur scattered far apart. One simple proximity score
is sketched below.
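One simple way to quantify proximity is the minimum distance, in tokens, between
occurrences of two query terms; min_distance below is our own illustrative helper, not a
standard algorithm:

    def min_distance(tokens, term1, term2):
        # Positions of each term in the token stream
        pos1 = [i for i, t in enumerate(tokens) if t == term1]
        pos2 = [i for i, t in enumerate(tokens) if t == term2]
        if not pos1 or not pos2:
            return None                    # a term is missing entirely
        return min(abs(i - j) for i in pos1 for j in pos2)

    close = "information retrieval is the topic of this survey".split()
    far = "information about databases and techniques for data retrieval".split()
    print(min_distance(close, "information", "retrieval"))  # 1 (adjacent)
    print(min_distance(far, "information", "retrieval"))    # 7 (scattered)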
(f) File Attribute
For temporal data, file attributes such as the last-modified date play an important role
in deciding a document’s relevance.
(g) Hyperlink
The hyperlinks coming into a page and going out of it (in-links in particular) can give
useful information about the page.
(h) Position of word
A word in a header (for example, inside an HTML heading tag) can be considered more
content-bearing than a word in the body of the document. Similarly, a word occurring at
the beginning of a document can be more indicative of the document’s content than a word
appearing later in the document; and a word found near the word ‘Abstract’ may describe
the content of the document well.
(i) User Profile
A user profile can be used to improve the process using an adaptive approach. The
responses of users can also be considered to improve the process. This is possible only
if we can measure the trustworthiness (the dependent variable) of a user’s response based
on independent variables such as education, age, gender, income, number of times the
user visited the page, etc.
(j) User defined weight to terms in query
The system can provide some way for the user to assign weights to the terms in his/her
query, so that terms the user considers important contribute more to the similarity score.
Major Problems with VSM
There is no real theoretical basis for the assumption of a term space
It is more a tool for visualization than a model with any real basis
Most similarity measures work about the same regardless of the model
Terms are not really orthogonal dimensions
Terms are not independent of all other terms, as the model assumes