Q: Consider the following documents.
D1: I went to the park to play
D2: park is nearby to play
D3: going to the park is fun
Q:  park nearby play
Apply the vector model to rank the above documents.
Ans:
Step 1: Represent Documents and Query by the index terms
   •   D1: I, went, to, the, park, to, play
   •   D2: park, is, nearby, to, play
   •   D3: going, to, the, park, is, fun
   •   Q: park, nearby, play
Step 2: Create the Vocabulary
Combine unique terms from all documents and the query.
Vocabulary:
{ i, went, to, the, park, play, is, nearby, going, fun }
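As a quick cross-check of Steps 1 and 2, the tokenization and vocabulary can be reproduced in Python. This is only a sketch; the names docs, tokens and vocabulary are illustrative (not part of the original answer), and it assumes simple lower-cased whitespace tokenization.

    # Sketch only: lower-cased whitespace tokenization, then the vocabulary
    # collected in first-seen order (matching the order used in this answer).
    docs = {
        "D1": "I went to the park to play",
        "D2": "park is nearby to play",
        "D3": "going to the park is fun",
        "Q":  "park nearby play",
    }

    tokens = {name: text.lower().split() for name, text in docs.items()}

    vocabulary = []
    for toks in tokens.values():
        for term in toks:
            if term not in vocabulary:
                vocabulary.append(term)

    print(vocabulary)
    # ['i', 'went', 'to', 'the', 'park', 'play', 'is', 'nearby', 'going', 'fun']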
Step 3: Construct Term Frequency (TF) Matrix
Count the frequency of each term from the vocabulary in each document and query.
Term       D1 (f_ij)   D2 (f_ij)   D3 (f_ij)   Q (f_ij)
i              1           0           0          0
went           1           0           0          0
to             2           1           1          0
the            1           0           1          0
park           1           1           1          1
play           1           1           0          1
is             0           1           1          0
nearby         0           1           0          1
going          0           0           1          0
fun            0           0           1          0
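Continuing the sketch above (it reuses tokens and vocabulary), the raw term frequencies can be tabulated with collections.Counter; again, this is only an illustrative check.

    from collections import Counter

    # Continues the earlier sketch: raw term frequencies f_ij, printed in
    # vocabulary order for D1, D2, D3 and the query Q.
    tf = {name: Counter(toks) for name, toks in tokens.items()}

    for term in vocabulary:
        print(f"{term:>8}", [tf[name][term] for name in ("D1", "D2", "D3", "Q")])
    # e.g.  to   [2, 1, 1, 0]
    #       park [1, 1, 1, 1]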
Step 4: Compute Inverse Document Frequency (IDF)
IDF is calculated as IDF_i = log(N / n_i), using the base-10 logarithm, where:
   •   N: total number of documents (N = 3)
   •   n_i: number of documents containing term i
Term      n_i (documents containing term)      IDF_i = log(3 / n_i)
i                        1                     log(3/1) = log(3)   ≈ 0.477
went                     1                     log(3/1) = log(3)   ≈ 0.477
to                       3                     log(3/3) = log(1)   = 0.000
the                      2                     log(3/2) = log(1.5) ≈ 0.176
park                     3                     log(3/3) = log(1)   = 0.000
play                     2                     log(3/2) = log(1.5) ≈ 0.176
is                       2                     log(3/2) = log(1.5) ≈ 0.176
nearby                   1                     log(3/1) = log(3)   ≈ 0.477
going                    1                     log(3/1) = log(3)   ≈ 0.477
fun                      1                     log(3/1) = log(3)   ≈ 0.477
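The IDF column can be recomputed the same way (a continuation of the sketch above; note that only D1–D3 count as documents, the query does not).

    import math

    # Continues the sketch: n_i = number of documents (D1-D3 only) containing
    # term i, and IDF_i = log10(N / n_i) with N = 3.
    N = 3
    doc_names = ("D1", "D2", "D3")
    idf = {term: math.log10(N / sum(1 for d in doc_names if tf[d][term] > 0))
           for term in vocabulary}

    print(round(idf["nearby"], 3), round(idf["the"], 3), round(idf["to"], 3))
    # 0.477 0.176 0.0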
Step 5: Calculate TF-IDF Weights (w_ij)
Using w_ij = (1 + log f_ij) * IDF_i when f_ij > 0, and w_ij = 0 otherwise. The only term with
f_ij = 2 is "to" in D1, giving a factor (1 + log 2) = 1.301, but its IDF is 0, so the weight is
still 0; every other non-zero frequency is 1, giving a factor of exactly 1.

Term       D1 (w_ij)   D2 (w_ij)   D3 (w_ij)   Q (w_ij)
i            0.477       0           0           0
went         0.477       0           0           0
to           0           0           0           0
the          0.176       0           0.176       0
park         0           0           0           0
play         0.176       0.176       0           0.176
is           0           0.176       0.176       0
nearby       0           0.477       0           0.477
going        0           0           0.477       0
fun          0           0           0.477       0
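The weighting w_ij = (1 + log f_ij) * IDF_i for f_ij > 0 (and 0 otherwise) can be checked with a small helper; weight is an illustrative name rather than something from the answer, and the snippet continues the sketch above.

    # Continues the sketch: (1 + log10 f_ij) * IDF_i when f_ij > 0, else 0.
    def weight(f_ij, idf_i):
        return (1 + math.log10(f_ij)) * idf_i if f_ij > 0 else 0.0

    weights = {name: {term: weight(tf[name][term], idf[term]) for term in vocabulary}
               for name in ("D1", "D2", "D3", "Q")}

    print(round(weights["D1"]["i"], 3))      # 0.477
    print(round(weights["D1"]["to"], 3))     # 0.0  (factor 1 + log 2 = 1.301, but IDF = 0)
    print(round(weights["Q"]["nearby"], 3))  # 0.477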
Step 6: Represent Documents and Query as Vectors
Using the TF-IDF weights, we form vectors for each document and the query, in the order of
our vocabulary: {i, went, to, the, park, play, is, nearby, going, fun}.
    • D1 Vector: [0.477, 0.477, 0, 0.176, 0, 0.176, 0, 0, 0, 0]
    • D2 Vector: [0, 0, 0, 0, 0, 0.176, 0.176, 0.477, 0, 0]
    • D3 Vector: [0, 0, 0, 0.176, 0, 0, 0.176, 0, 0.477, 0.477]
    • Q Vector: [0, 0, 0, 0, 0, 0.176, 0, 0.477, 0, 0]
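In code (continuing the sketch), these vectors are simply the weights laid out in vocabulary order.

    # Continues the sketch: each document/query as a list of weights in
    # vocabulary order {i, went, to, the, park, play, is, nearby, going, fun}.
    vectors = {name: [weights[name][t] for t in vocabulary]
               for name in ("D1", "D2", "D3", "Q")}

    print([round(w, 3) for w in vectors["Q"]])
    # [0.0, 0.0, 0.0, 0.0, 0.0, 0.176, 0.0, 0.477, 0.0, 0.0]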
Step 7: Calculate Cosine Similarity
Cosine similarity is given by:
   Sim(Q, Dj) = (Q · Dj) / (||Q|| * ||Dj||)
Calculate the dot products (Q · Dj):
   • Q · D1 = (0 * 0.477) + (0 * 0.477) + (0 * 0) + (0 * 0.176) + (0 * 0) + (0.176 * 0.176) +
     (0 * 0) + (0.477 * 0) + (0 * 0) + (0 * 0) = 0.030976
   • Q · D2 = (0 * 0) + (0 * 0) + (0 * 0) + (0 * 0) + (0 * 0) + (0.176 * 0.176) + (0 * 0.176) +
     (0.477 * 0.477) + (0 * 0) + (0 * 0) = 0.030976 + 0.227529 = 0.258505
   • Q · D3 = (0 * 0) + (0 * 0) + (0 * 0) + (0 * 0.176) + (0 * 0) + (0.176 * 0) + (0 * 0.176) +
     (0.477 * 0) + (0 * 0.477) + (0 * 0.477) = 0
Calculate Magnitudes:
   • ||D1|| = sqrt(0.477^2 + 0.477^2 + 0.176^2 + 0.176^2) = sqrt(0.227529 + 0.227529 +
      0.030976 + 0.030976) = sqrt(0.51701) ≈ 0.7190
   • ||D2|| = sqrt(0.176^2 + 0.176^2 + 0.477^2) = sqrt(0.030976 + 0.030976 + 0.227529) =
      sqrt(0.289481) ≈ 0.5380
   • ||D3|| = sqrt(0.176^2 + 0.176^2 + 0.477^2 + 0.477^2) = sqrt(0.030976 + 0.030976 +
      0.227529 + 0.227529) = sqrt(0.51701) ≈ 0.7190
   • ||Q|| = sqrt(0.176^2 + 0.477^2) = sqrt(0.030976 + 0.227529) = sqrt(0.258505) ≈ 0.5084
Calculate Cosine Similarities:
   • Sim(Q, D1) = 0.030976 / (0.5084 * 0.7190) = 0.030976 / 0.36554 ≈ 0.0847
   • Sim(Q, D2) = 0.258505 / (0.5084 * 0.5380) = 0.258505 / 0.27352 ≈ 0.9451
   • Sim(Q, D3) = 0 / (0.5084 * 0.7190) = 0
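The same similarities can be verified numerically (continuing the sketch). Because the hand calculation rounds the IDF values to three decimals, the last decimal place may differ slightly.

    # Continues the sketch: cosine similarity between the query and each document.
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    sims = {name: cosine(vectors["Q"], vectors[name]) for name in ("D1", "D2", "D3")}
    print({name: round(s, 3) for name, s in sims.items()})
    # {'D1': 0.085, 'D2': 0.945, 'D3': 0.0}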
Step 8: Rank Documents
Based on the cosine similarity scores (higher score means more similar):
   1. D2 (Cosine Similarity ≈ 0.9451)
   2. D1 (Cosine Similarity ≈ 0.0847)
   3. D3 (Cosine Similarity = 0)
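Finally, the ranking is just a descending sort of the similarity scores (continuing the sketch).

    # Continues the sketch: rank documents by decreasing cosine similarity.
    ranking = sorted(sims, key=sims.get, reverse=True)
    print(ranking)   # ['D2', 'D1', 'D3']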