Vector Model Sum

The document outlines the process of ranking three documents (D1, D2, D3) based on their relevance to a given query using the vector model. It details steps including term frequency calculation, inverse document frequency computation, TF-IDF weight calculation, and cosine similarity measurement. Ultimately, the documents are ranked with D2 being the most relevant, followed by D1 and D3.


Q. Consider the following documents and query:

D1: I went to the park to play
D2: park is nearby to play
D3: going to the park is fun

Q: park nearby play

Apply the vector model to rank the above documents.

Ans:
Step 1: Represent the Documents and Query by their index terms
• D1: I, went, to, the, park, to, play
• D2: park, is, nearby, to, play
• D3: going, to, the, park, is, fun
• Q: park, nearby, play

Step 2: Create the Vocabulary


Combine unique terms from all documents and the query.
Vocabulary:
{ I, went, to, the, park, play, is, nearby, going, fun }

Step 3: Construct Term Frequency (TF) Matrix


Count the frequency of each term from the vocabulary in each document and query.
Term     D1 (f_ij)  D2 (f_ij)  D3 (f_ij)  Q (f_ij)
i        1          0          0          0
went     1          0          0          0
to       2          1          1          0
the      1          0          1          0
park     1          1          1          1
play     1          1          0          1
is       0          1          1          0
nearby   0          1          0          1
going    0          0          1          0
fun      0          0          1          0
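Steps 1–3 can be sketched in Python (a minimal illustration; the tokenizer simply lowercases and splits on whitespace, and the variable names are my own):

```python
from collections import Counter

# The three documents and the query from the problem statement.
texts = {
    "D1": "I went to the park to play",
    "D2": "park is nearby to play",
    "D3": "going to the park is fun",
    "Q":  "park nearby play",
}

# Step 1: index terms -- lowercase and split on whitespace.
tokens = {name: text.lower().split() for name, text in texts.items()}

# Step 2: vocabulary -- unique terms across all documents and the query.
vocab = sorted({term for toks in tokens.values() for term in toks})

# Step 3: term-frequency counts per document/query.
tf = {name: Counter(toks) for name, toks in tokens.items()}

print(len(vocab))       # 10 terms in the vocabulary
print(tf["D1"]["to"])   # 2 ("to" appears twice in D1)
```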

Step 4: Compute Inverse Document Frequency (IDF)


IDF is calculated as (all logarithms here are base 10):

    idf_i = log(N / n_i)

where:
• N: total number of documents (N = 3)
• n_i: number of documents containing term i

Term     n_i  IDF = log(3/n_i)
i        1    log(3/1) = log(3) ≈ 0.477
went     1    log(3/1) = log(3) ≈ 0.477
to       3    log(3/3) = log(1) = 0.000
the      2    log(3/2) = log(1.5) ≈ 0.176
park     3    log(3/3) = log(1) = 0.000
play     2    log(3/2) = log(1.5) ≈ 0.176
is       2    log(3/2) = log(1.5) ≈ 0.176
nearby   1    log(3/1) = log(3) ≈ 0.477
going    1    log(3/1) = log(3) ≈ 0.477
fun      1    log(3/1) = log(3) ≈ 0.477
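As a sketch of the IDF computation above (base-10 logarithm; the dictionaries mirror the TF matrix from Step 3, with the query excluded from the document counts):

```python
import math

# Document term frequencies from Step 3 (the query is not a document).
doc_tf = {
    "D1": {"i": 1, "went": 1, "to": 2, "the": 1, "park": 1, "play": 1},
    "D2": {"park": 1, "is": 1, "nearby": 1, "to": 1, "play": 1},
    "D3": {"going": 1, "to": 1, "the": 1, "park": 1, "is": 1, "fun": 1},
}
N = len(doc_tf)  # total number of documents, N = 3

vocab = ["i", "went", "to", "the", "park", "play", "is", "nearby", "going", "fun"]

# idf_i = log10(N / n_i), where n_i is the number of documents containing term i.
idf = {}
for term in vocab:
    n_i = sum(1 for counts in doc_tf.values() if term in counts)
    idf[term] = math.log10(N / n_i)

print(round(idf["nearby"], 3))  # 0.477
print(round(idf["to"], 3))      # 0.0
```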

Step 5: Calculate TF-IDF Weights (w_ij)

The weight of term i in document j is:

    w_ij = (1 + log f_ij) × idf_i   if f_ij > 0, otherwise 0

Term     D1 (w_ij)  D2 (w_ij)  D3 (w_ij)  Q (w_ij)
i        0.477      0          0          0
went     0.477      0          0          0
to       0          0          0          0
the      0.176      0          0.176      0
park     0          0          0          0
play     0.176      0.176      0          0.176
is       0          0.176      0.176      0
nearby   0          0.477      0          0.477
going    0          0          0.477      0
fun      0          0          0.477      0

For example, "i" in D1: (1 + log 1) × 0.477 = 1 × 0.477 = 0.477, while "to" in D1: (1 + log 2) × 0.000 = 1.301 × 0 = 0, and "park" everywhere: (1 + log 1) × 0.000 = 0.
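The weighting rule used in the table can be written as a one-line helper (a sketch; tfidf_weight is a name introduced here for illustration):

```python
import math

def tfidf_weight(f, idf_val):
    """w = (1 + log10 f) * idf when the term occurs (f > 0), else 0."""
    return (1 + math.log10(f)) * idf_val if f > 0 else 0.0

print(tfidf_weight(1, 0.477))  # 0.477  (e.g. "nearby" in D2)
print(tfidf_weight(2, 0.0))    # 0.0    (e.g. "to" in D1: tf = 2 but idf = 0)
print(tfidf_weight(0, 0.477))  # 0.0    (term absent)
```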
Step 6: Represent Documents and Query as Vectors
Using the TF-IDF weights, we form vectors for each document and the query, in the order of
our vocabulary: {i, went, to, the, park, play, is, nearby, going, fun}.
• D1 Vector: [0.477, 0.477, 0, 0.176, 0, 0.176, 0, 0, 0, 0]
• D2 Vector: [0, 0, 0, 0, 0, 0.176, 0.176, 0.477, 0, 0]
• D3 Vector: [0, 0, 0, 0.176, 0, 0, 0.176, 0, 0.477, 0.477]
• Q Vector: [0, 0, 0, 0, 0, 0.176, 0, 0.477, 0, 0]

Step 7: Calculate Cosine Similarity:


Cosine similarity is given by:

    sim(Q, D) = (Q · D) / (||Q|| × ||D||)

Calculate the dot products (Q · D):


• Q · D1 = (0 × 0.477) + (0 × 0.477) + (0 × 0) + (0 × 0.176) + (0 × 0) + (0.176 × 0.176) + (0 × 0) + (0.477 × 0) + (0 × 0) + (0 × 0) = 0.030976
• Q · D2 = (0 × 0) + (0 × 0) + (0 × 0) + (0 × 0) + (0 × 0) + (0.176 × 0.176) + (0 × 0.176) + (0.477 × 0.477) + (0 × 0) + (0 × 0) = 0.030976 + 0.227529 = 0.258505
• Q · D3 = (0 × 0) + (0 × 0) + (0 × 0) + (0 × 0.176) + (0 × 0) + (0.176 × 0) + (0 × 0.176) + (0.477 × 0) + (0 × 0.477) + (0 × 0.477) = 0

Calculate Magnitudes:
• ||D1|| = sqrt(0.477^2 + 0.477^2 + 0.176^2 + 0.176^2) = sqrt(0.227529 + 0.227529 + 0.030976 + 0.030976) = sqrt(0.517010) ≈ 0.7190
• ||D2|| = sqrt(0.176^2 + 0.176^2 + 0.477^2) = sqrt(0.030976 + 0.030976 + 0.227529) = sqrt(0.289481) ≈ 0.5380
• ||D3|| = sqrt(0.176^2 + 0.176^2 + 0.477^2 + 0.477^2) = sqrt(0.030976 + 0.030976 + 0.227529 + 0.227529) = sqrt(0.517010) ≈ 0.7190
• ||Q|| = sqrt(0.176^2 + 0.477^2) = sqrt(0.030976 + 0.227529) = sqrt(0.258505) ≈ 0.5084

Calculate Cosine Similarities:


• Sim(Q, D1) = 0.030976 / (0.5084 × 0.7190) = 0.030976 / 0.365540 ≈ 0.0847
• Sim(Q, D2) = 0.258505 / (0.5084 × 0.5380) = 0.258505 / 0.273519 ≈ 0.9451
• Sim(Q, D3) = 0 / (0.5084 × 0.7190) = 0
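The whole of Step 7 can be checked with a short script (a sketch using the Step 6 vectors; note that computing with unrounded magnitudes gives Sim(Q, D2) ≈ 0.9450 rather than 0.9451, a difference due only to rounding the magnitudes by hand):

```python
import math

# TF-IDF vectors from Step 6, in vocabulary order:
# {i, went, to, the, park, play, is, nearby, going, fun}
D1 = [0.477, 0.477, 0, 0.176, 0, 0.176, 0, 0, 0, 0]
D2 = [0, 0, 0, 0, 0, 0.176, 0.176, 0.477, 0, 0]
D3 = [0, 0, 0, 0.176, 0, 0, 0.176, 0, 0.477, 0.477]
Q  = [0, 0, 0, 0, 0, 0.176, 0, 0.477, 0, 0]

def cosine(q, d):
    """sim(q, d) = (q . d) / (||q|| * ||d||)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

for name, d in (("D1", D1), ("D2", D2), ("D3", D3)):
    print(name, round(cosine(Q, d), 4))
```

Either way the ordering of the three scores is the same, so the ranking in Step 8 is unaffected.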

Step 8: Rank Documents


Based on the cosine similarity scores (higher score means more similar):
1. D2 (Cosine Similarity ≈ 0.9451)
2. D1 (Cosine Similarity ≈ 0.0847)
3. D3 (Cosine Similarity = 0)
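The final ranking step amounts to sorting by score in descending order (a sketch; the scores are the hand-computed values from Step 7):

```python
scores = {"D1": 0.0847, "D2": 0.9451, "D3": 0.0}

# Rank documents by descending cosine similarity.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D2', 'D1', 'D3']
```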
