Unit 3: NLP
Srilakshmi Ch
September 2025
1 Introduction
Word Embeddings: Count Vector
Definition
Count Vectorization is one of the simplest methods for representing text as
numerical vectors. It converts a document into a fixed-length vector by counting
the frequency of each word in the document based on a given vocabulary.
Process
1. Build a vocabulary of all unique words in the corpus.
2. Count the occurrence of each word in every document.
3. Represent each document as a vector of these counts.
Example
Consider the corpus with two documents:
• Doc1: “NLP is fun”
• Doc2: “NLP is easy”
Vocabulary = {NLP, is, fun, easy}
Document   NLP   is   fun   easy
Doc1        1     1    1     0
Doc2        1     1    0     1
Thus:
Doc1 → [1, 1, 1, 0], Doc2 → [1, 1, 0, 1]
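A minimal Python sketch of this process is shown below. The corpus, the fixed vocabulary order, and the count_vector helper are illustrative choices for this example, not a standard API.

```python
from collections import Counter

# Toy corpus from the example above
docs = ["NLP is fun", "NLP is easy"]

# Step 1: vocabulary of unique words (order fixed to match the table)
vocabulary = ["NLP", "is", "fun", "easy"]

# Steps 2-3: count each vocabulary word in every document
def count_vector(doc, vocab):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

for doc in docs:
    print(doc, "->", count_vector(doc, vocabulary))
# NLP is fun -> [1, 1, 1, 0]
# NLP is easy -> [1, 1, 0, 1]
```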
Advantages:
• Simple and intuitive representation.
• No training is required.
• Works well as a baseline model.
Disadvantages:
• High-dimensional vectors (equal to vocabulary size).
• Sparse representation.
• Does not capture semantic similarity (e.g., “fun” and “enjoyable” are treated as unrelated words).
• Word order is lost (bag-of-words assumption).
Word Embeddings: Frequency-Based Embedding
Definition
Frequency-based embeddings represent text documents by using the relative
frequency of words instead of raw counts. This method normalizes the occurrence of words in a document so that longer documents do not dominate the
representation.
Mathematical Representation
For a document $d_i$ with vocabulary size $n$:
$$d_i = [f_{i1}, f_{i2}, \ldots, f_{in}]$$
where
$$f_{ij} = \frac{c_{ij}}{\sum_{k=1}^{n} c_{ik}}$$
Here:
• $c_{ij}$ = count of word $j$ in document $i$
• $f_{ij}$ = normalized frequency of word $j$ in document $i$
Example
Consider the corpus:
• Doc1: “NLP is fun and NLP is easy”
Vocabulary = {NLP, is, fun, easy, and}
Raw counts for Doc1:
{NLP: 2, is: 2, fun: 1, easy: 1, and: 1}
Total word count = 7
Normalized frequencies:
{NLP: 2/7, is: 2/7, fun: 1/7, easy: 1/7, and: 1/7}
Thus:
Doc1 → [2/7, 2/7, 1/7, 1/7, 1/7]
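A small Python sketch of this normalization, reproducing the Doc1 vector above. Whitespace tokenization and the vocabulary order are assumptions made for illustration.

```python
from collections import Counter

doc = "NLP is fun and NLP is easy"
vocabulary = ["NLP", "is", "fun", "easy", "and"]

tokens = doc.split()                  # whitespace tokenization (7 tokens)
counts = Counter(tokens)              # c_ij: raw count of each word
total = len(tokens)                   # sum_k c_ik = 7

# f_ij = c_ij / sum_k c_ik
frequencies = [counts[w] / total for w in vocabulary]
print(frequencies)  # [0.2857..., 0.2857..., 0.1428..., 0.1428..., 0.1428...]
```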
Properties
Advantages:
• Normalizes word counts to avoid bias toward longer documents.
• Still simple and easy to implement.
Disadvantages:
• High-dimensional and sparse representation.
• Does not capture semantic meaning.
• Loses word order information.
Word Embeddings: Prediction-Based Embedding
Definition
Prediction-based embeddings learn word representations by predicting a word
given its context or predicting the context given a word. Instead of counting
word occurrences, a neural network is trained to generate dense, low-dimensional
vectors that capture semantic and syntactic relationships between words.
Concept
The key idea is:
Similar words appear in similar contexts.
Thus, embeddings are learned such that words with similar meaning have
vectors that are close in the embedding space.
Popular Models
1. Word2Vec
• CBOW (Continuous Bag-of-Words): Predicts the current word from
surrounding context words.
$$P(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m})$$
• Skip-Gram: Predicts surrounding words given the current word.
$$P(w_{t-m}, \ldots, w_{t+m} \mid w_t)$$
2. GloVe (Global Vectors): Learns embeddings by factorizing the co-occurrence
matrix, combining both global word statistics and prediction.
3. FastText: Extension of Word2Vec that considers subword information
(character n-grams).
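As a rough illustration of the subword idea behind FastText, the sketch below uses gensim's FastText class. The availability of gensim, the toy corpus, and all parameter values are assumptions made purely for this example.

```python
from gensim.models import FastText

# Tiny illustrative corpus: each sentence is a list of tokens
sentences = [["nlp", "is", "fun"],
             ["nlp", "is", "easy"],
             ["learning", "nlp", "is", "enjoyable"]]

# Train FastText embeddings, which are built from character n-grams
model = FastText(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# A word never seen in training still gets a vector from its subword n-grams
print(model.wv["funny"][:5])
print(model.wv.similarity("fun", "funny"))
```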
Example (Word2Vec Skip-Gram)
Sentence: “The cat sits on the mat”
Target word: “cat”
Context window size m = 2
Training samples:
(cat → The), (cat → sits), (cat → on)
The model learns embeddings such that:
vec(cat) ≈ vec(dog), vec(king) − vec(man) + vec(woman) ≈ vec(queen)
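The same toy sentence can be fed to gensim's Word2Vec in Skip-Gram mode (sg=1). With such a tiny corpus the learned vectors are not meaningful, so this is only a sketch of the call pattern; gensim availability and the parameter values are assumptions.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sits", "on", "the", "mat"]]

# sg=1 selects Skip-Gram; window=2 matches the context size m in the example
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=100)

vec_cat = model.wv["cat"]                    # dense 50-dimensional vector for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours (not meaningful on a toy corpus)
# On a large corpus, analogy queries such as
# model.wv.most_similar(positive=["king", "woman"], negative=["man"]) tend to return "queen".
```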
Properties
Advantages:
• Produces dense, low-dimensional vectors.
• Captures semantic and syntactic meaning.
• Supports analogy and similarity tasks.
Disadvantages:
• Requires training with large corpora.
• Computationally more expensive than count/frequency methods.
• Static embeddings (same vector for a word regardless of context).
Word2Vec
Definition
Word2Vec is a neural network-based model introduced by Mikolov et al. (2013)
that learns dense vector representations of words. It is based on the principle
that “words appearing in similar contexts have similar meanings”.
Architecture
Word2Vec has two main architectures:
1. Continuous Bag-of-Words (CBOW)
Predicts the current word $w_t$ from its surrounding context words.
$$P(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m})$$
2. Skip-Gram
Predicts surrounding context words from the current word.
$$P(w_{t-m}, \ldots, w_{t+m} \mid w_t)$$
Mathematical Model
For a sequence of training words $w_1, w_2, \ldots, w_T$, the Skip-Gram objective is to maximize the average log probability:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t)$$
where $m$ is the context window size.
The conditional probability is defined using the softmax function:
$$P(w_O \mid w_I) = \frac{\exp\left(v'_{w_O} \cdot v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left(v'_{w} \cdot v_{w_I}\right)}$$
Here:
• $v_{w_I}$ = input vector of word $w_I$
• $v'_{w_O}$ = output vector of word $w_O$
• $W$ = vocabulary size
Training Optimizations
• Hierarchical Softmax: Reduces computation using a binary tree structure.
• Negative Sampling: Updates only a small number of negative samples
instead of all words in the vocabulary.
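In gensim's Word2Vec these two optimizations are selected through the hs and negative parameters. The sketch below only shows the call pattern on a toy corpus; the parameter values are illustrative, and real training requires a much larger corpus.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sits", "on", "the", "mat"]]

# Negative sampling: hs=0, with `negative` noise words drawn per positive pair
model_ns = Word2Vec(sentences, sg=1, window=2, min_count=1, hs=0, negative=5)

# Hierarchical softmax: hs=1 with negative sampling disabled
model_hs = Word2Vec(sentences, sg=1, window=2, min_count=1, hs=1, negative=0)
```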
Example
Sentence: “The cat sits on the mat”
With window size m = 2, Skip-Gram generates training pairs such as:
(cat → The), (cat → sits), (cat → on)
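A short Python sketch that enumerates these (center, context) pairs for the sentence above with window size m = 2; the pair format is an illustrative choice.

```python
# Enumerate (center, context) training pairs with window size m = 2
sentence = ["The", "cat", "sits", "on", "the", "mat"]
m = 2

pairs = []
for t, center in enumerate(sentence):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(sentence):
            pairs.append((center, sentence[t + j]))

print([p for p in pairs if p[0] == "cat"])
# [('cat', 'The'), ('cat', 'sits'), ('cat', 'on')]
```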
Properties
Advantages:
• Produces dense, low-dimensional vectors.
• Captures semantic relationships (e.g., “king - man + woman ≈ queen”).
• Efficient to train on large corpora.
Disadvantages:
• Same vector for a word regardless of context (polysemy problem).
• Requires large training data for good performance.
CBOW (Continuous Bag-of-Words) Vectorization
Definition
CBOW is a prediction-based word embedding model where the objective is to
predict the target word given its surrounding context words.
$$P(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m})$$
Here:
• $w_t$ = target word
• $m$ = context window size
Vectorization Process
1. Each context word is represented as a one-hot vector of vocabulary size $V$.
2. These vectors are projected into a hidden layer (embedding dimension $N$) using a shared weight matrix $W \in \mathbb{R}^{V \times N}$.
3. The embeddings of all context words are averaged:
$$h = \frac{1}{2m} \sum_{j=-m,\ j \ne 0}^{m} v_{w_{t+j}}$$
where $v_{w_{t+j}}$ is the embedding vector of the context word $w_{t+j}$.
4. This hidden representation $h$ is used to predict the target word $w_t$ using a softmax layer:
$$P(w_t \mid \text{context}) = \frac{\exp(u_{w_t} \cdot h)}{\sum_{w=1}^{V} \exp(u_w \cdot h)}$$
where $u_w$ is the output embedding of word $w$.
Example
Sentence: “The cat sits on the mat”
Target word: “sits”, context window m = 2
Context = {“The”, “cat”, “on”, “the”}
Steps:
• Represent each context word as one-hot vector.
• Project into embedding space (dimension N , e.g., 100).
• Compute average embedding h of these context words.
• Predict target word “sits” using softmax classifier.
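The steps above can be sketched in NumPy as follows. The toy vocabulary, the embedding dimension N = 4, and the random (untrained) weight matrices are assumptions made purely to show the shapes and the forward pass, not a trained model.

```python
import numpy as np

np.random.seed(0)

vocab = ["the", "cat", "sits", "on", "mat"]           # toy vocabulary, V = 5
word2idx = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 4                                  # embedding dimension N = 4

W_in = np.random.randn(V, N) * 0.1    # shared input embeddings (rows = context word vectors)
W_out = np.random.randn(V, N) * 0.1   # output embeddings u_w

context = ["the", "cat", "on", "the"]                 # context of target "sits", m = 2

# Steps 1-3: look up and average the context embeddings to get h
h = np.mean([W_in[word2idx[w]] for w in context], axis=0)

# Step 4: softmax over the vocabulary predicts the target word
scores = W_out @ h                                    # u_w . h for every word w
probs = np.exp(scores) / np.sum(np.exp(scores))
print(vocab[int(np.argmax(probs))])                   # arbitrary here, since the weights are untrained
```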
Properties
Advantages:
• Efficient and faster to train than Skip-Gram on large datasets.
• Works well for frequent words.
Disadvantages:
• Less effective for rare words compared to Skip-Gram.
• Averaging context embeddings may lose order information.
Skip-Gram Model
Definition
The Skip-Gram model is a prediction-based embedding technique in Word2Vec
where the current (center) word is used to predict its surrounding context words.
$$P(w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m} \mid w_t)$$
Here:
• $w_t$ = input (center) word
• $m$ = context window size
Vectorization Process
1. Represent the input word $w_t$ as a one-hot vector of size $V$ (vocabulary size).
2. Project it into the embedding space of dimension $N$ using a weight matrix $W \in \mathbb{R}^{V \times N}$.
3. The resulting vector $v_{w_t}$ is used to predict each context word individually.
4. The conditional probability of a context word $w_c$ given the input word $w_t$ is:
$$P(w_c \mid w_t) = \frac{\exp(u_{w_c} \cdot v_{w_t})}{\sum_{w=1}^{V} \exp(u_w \cdot v_{w_t})}$$
where $u_w$ is the output vector of word $w$.
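Mirroring the CBOW sketch earlier, the NumPy fragment below evaluates this conditional probability for a toy vocabulary. The random, untrained matrices W_in and W_out are illustrative stand-ins for the learned input and output embeddings.

```python
import numpy as np

np.random.seed(0)

vocab = ["the", "cat", "sits", "on", "mat"]
word2idx = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 4

W_in = np.random.randn(V, N) * 0.1    # input embeddings v_w
W_out = np.random.randn(V, N) * 0.1   # output embeddings u_w

v_cat = W_in[word2idx["cat"]]         # v_{w_t} for the center word "cat"

# P(w_c | w_t) = exp(u_{w_c} . v_{w_t}) / sum_w exp(u_w . v_{w_t})
scores = W_out @ v_cat
probs = np.exp(scores) / np.sum(np.exp(scores))

for w in ["the", "sits", "on"]:       # probabilities assigned to the true context words
    print(w, probs[word2idx[w]])
```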
Training Objective
For a sequence of words $w_1, w_2, \ldots, w_T$, the Skip-Gram model maximizes:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t)$$
Example
Sentence: “The cat sits on the mat”
Target word: “cat”, context window m = 2
Generated training pairs:
(cat → The), (cat → sits), (cat → on)
Thus, “cat” is the input, and the model tries to predict its context words.
Properties
Advantages: