Unit 3: NLP
Srilakshmi Ch
September 2025
1 Introduction
Word Embeddings: Count Vector
Definition
Count Vectorization is one of the simplest methods for representing text as
numerical vectors. It converts a document into a fixed-length vector by counting
the frequency of each word in the document based on a given vocabulary.
Process
1. Build a vocabulary of all unique words in the corpus.
2. Count the occurrence of each word in every document.
3. Represent each document as a vector of these counts.
Example
Consider the corpus with two documents:
• Doc1: “NLP is fun”
• Doc2: “NLP is easy”
Vocabulary = {NLP, is, fun, easy}
Document   NLP   is   fun   easy
Doc1        1     1    1     0
Doc2        1     1    0     1
Thus:
Doc1 → [1, 1, 1, 0], Doc2 → [1, 1, 0, 1]
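A minimal Python sketch of this process is shown below. The corpus, the fixed vocabulary order, and the count_vector helper are illustrative choices for this example, not a standard API.

```python
from collections import Counter

# Toy corpus from the example above
docs = ["NLP is fun", "NLP is easy"]

# Step 1: vocabulary of unique words (order fixed to match the table)
vocabulary = ["NLP", "is", "fun", "easy"]

# Steps 2-3: count each vocabulary word in every document
def count_vector(doc, vocab):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

for doc in docs:
    print(doc, "->", count_vector(doc, vocabulary))
# NLP is fun -> [1, 1, 1, 0]
# NLP is easy -> [1, 1, 0, 1]
```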
Advantages:
• Simple and intuitive representation.
• No training is required.
• Works well as a baseline model.
Disadvantages:
• High-dimensional vectors (equal to vocabulary size).
• Sparse representation.
• Does not capture semantic similarity (e.g., “fun” and “enjoyable” are treated as unrelated words).
• Word order is lost (bag-of-words assumption).
Word Embeddings: Frequency-Based Embedding
Definition
Frequency-based embeddings represent text documents by using the relative
frequency of words instead of raw counts. This method normalizes the occurrence of words in a document so that longer documents do not dominate the
representation.
Mathematical Representation
For a document $d_i$ with vocabulary size $n$:
$$d_i = [f_{i1}, f_{i2}, \ldots, f_{in}]$$
where
$$f_{ij} = \frac{c_{ij}}{\sum_{k=1}^{n} c_{ik}}$$
Here:
• $c_{ij}$ = count of word $j$ in document $i$
• $f_{ij}$ = normalized frequency of word $j$ in document $i$
Example
Consider the corpus:
• Doc1: “NLP is fun and NLP is easy”
Vocabulary = {NLP, is, fun, easy, and}
Raw counts for Doc1:
{NLP: 2, is: 2, fun: 1, easy: 1, and: 1}
Total word count = 7
Normalized frequencies:
{NLP: 2/7, is: 2/7, fun: 1/7, easy: 1/7, and: 1/7}
Thus:
Doc1 → [2/7, 2/7, 1/7, 1/7, 1/7]
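A small Python sketch of this normalization, reproducing the Doc1 vector above. Whitespace tokenization and the vocabulary order are assumptions made for illustration.

```python
from collections import Counter

doc = "NLP is fun and NLP is easy"
vocabulary = ["NLP", "is", "fun", "easy", "and"]

tokens = doc.split()                  # whitespace tokenization (7 tokens)
counts = Counter(tokens)              # c_ij: raw count of each word
total = len(tokens)                   # sum_k c_ik = 7

# f_ij = c_ij / sum_k c_ik
frequencies = [counts[w] / total for w in vocabulary]
print(frequencies)  # [0.2857..., 0.2857..., 0.1428..., 0.1428..., 0.1428...]
```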
Properties
Advantages:
• Normalizes word counts to avoid bias toward longer documents.
• Still simple and easy to implement.
Disadvantages:
• High-dimensional and sparse representation.
• Does not capture semantic meaning.
• Loses word order information.
Word Embeddings: Prediction-Based Embedding
Definition
Prediction-based embeddings learn word representations by predicting a word
given its context or predicting the context given a word. Instead of counting
word occurrences, a neural network is trained to generate dense, low-dimensional
vectors that capture semantic and syntactic relationships between words.
Concept
The key idea is:
Similar words appear in similar contexts.
Thus, embeddings are learned such that words with similar meaning have
vectors that are close in the embedding space.
Popular Models
1. Word2Vec
• CBOW (Continuous Bag-of-Words): Predicts the current word from
surrounding context words.
$$P(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m})$$
• Skip-Gram: Predicts surrounding words given the current word.
$$P(w_{t-m}, \ldots, w_{t+m} \mid w_t)$$
2. GloVe (Global Vectors): Learns embeddings by factorizing the co-occurrence
matrix, combining both global word statistics and prediction.
3. FastText: Extension of Word2Vec that considers subword information
(character n-grams).
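As a rough illustration of the subword idea behind FastText, the sketch below uses gensim's FastText class. The availability of gensim, the toy corpus, and all parameter values are assumptions made purely for this example.

```python
from gensim.models import FastText

# Tiny illustrative corpus: each sentence is a list of tokens
sentences = [["nlp", "is", "fun"],
             ["nlp", "is", "easy"],
             ["learning", "nlp", "is", "enjoyable"]]

# Train FastText embeddings, which are built from character n-grams
model = FastText(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# A word never seen in training still gets a vector from its subword n-grams
print(model.wv["funny"][:5])
print(model.wv.similarity("fun", "funny"))
```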
Example (Word2Vec Skip-Gram)
Sentence: “The cat sits on the mat”
Target word: “cat”
Context window size m = 2
Training samples:
(cat → The), (cat → sits), (cat → on)
The model learns embeddings such that:
vec(cat) ≈ vec(dog), vec(king) − vec(man) + vec(woman) ≈ vec(queen)
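The same toy sentence can be fed to gensim's Word2Vec in Skip-Gram mode (sg=1). With such a tiny corpus the learned vectors are not meaningful, so this is only a sketch of the call pattern; gensim availability and the parameter values are assumptions.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sits", "on", "the", "mat"]]

# sg=1 selects Skip-Gram; window=2 matches the context size m in the example
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=100)

vec_cat = model.wv["cat"]                    # dense 50-dimensional vector for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours (not meaningful on a toy corpus)
# On a large corpus, analogy queries such as
# model.wv.most_similar(positive=["king", "woman"], negative=["man"]) tend to return "queen".
```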
Properties
Advantages:
• Produces dense, low-dimensional vectors.
• Captures semantic and syntactic meaning.
• Supports analogy and similarity tasks.
Disadvantages:
• Requires training with large corpora.
• Computationally more expensive than count/frequency methods.
• Static embeddings (same vector for a word regardless of context).
Word2Vec
Definition
Word2Vec is a neural network-based model introduced by Mikolov et al. (2013)
that learns dense vector representations of words. It is based on the principle
that “words appearing in similar contexts have similar meanings”.
Architecture
Word2Vec has two main architectures:
1. Continuous Bag-of-Words (CBOW)
Predicts the current word $w_t$ from its surrounding context words.
$$P(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m})$$
2. Skip-Gram
Predicts surrounding context words from the current word.
$$P(w_{t-m}, \ldots, w_{t+m} \mid w_t)$$
Mathematical Model
For a sequence of training words $w_1, w_2, \ldots, w_T$, the Skip-Gram objective is to maximize the average log probability:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t)$$
where $m$ is the context window size.
The conditional probability is defined using the softmax function:
$$P(w_O \mid w_I) = \frac{\exp\left(v'_{w_O} \cdot v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left(v'_{w} \cdot v_{w_I}\right)}$$
Here:
• $v_{w_I}$ = input vector of word $w_I$
• $v'_{w_O}$ = output vector of word $w_O$
• $W$ = vocabulary size
Training Optimizations
• Hierarchical Softmax: Reduces computation using a binary tree structure.
• Negative Sampling: Updates only a small number of negative samples
instead of all words in the vocabulary.
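In gensim's Word2Vec these two optimizations are selected through the hs and negative parameters. The sketch below only shows the call pattern on a toy corpus; the parameter values are illustrative, and real training requires a much larger corpus.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sits", "on", "the", "mat"]]

# Negative sampling: hs=0, with `negative` noise words drawn per positive pair
model_ns = Word2Vec(sentences, sg=1, window=2, min_count=1, hs=0, negative=5)

# Hierarchical softmax: hs=1 with negative sampling disabled
model_hs = Word2Vec(sentences, sg=1, window=2, min_count=1, hs=1, negative=0)
```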
Example
Sentence: “The cat sits on the mat”
With window size m = 2, Skip-Gram generates training pairs such as:
(cat → The), (cat → sits), (cat → on)
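A short Python sketch that enumerates these (center, context) pairs for the sentence above with window size m = 2; the pair format is an illustrative choice.

```python
# Enumerate (center, context) training pairs with window size m = 2
sentence = ["The", "cat", "sits", "on", "the", "mat"]
m = 2

pairs = []
for t, center in enumerate(sentence):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(sentence):
            pairs.append((center, sentence[t + j]))

print([p for p in pairs if p[0] == "cat"])
# [('cat', 'The'), ('cat', 'sits'), ('cat', 'on')]
```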
Properties
Advantages:
• Produces dense, low-dimensional vectors.
• Captures semantic relationships (e.g., “king - man + woman ≈ queen”).
• Efficient to train on large corpora.
Disadvantages:
• Same vector for a word regardless of context (polysemy problem).
• Requires large training data for good performance.
CBOW (Continuous Bag-of-Words) Vectorization
Definition
CBOW is a prediction-based word embedding model where the objective is to
predict the target word given its surrounding context words.
$$P(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m})$$
Here:
• $w_t$ = target word
• $m$ = context window size
Vectorization Process
1. Each context word is represented as a one-hot vector of vocabulary size $V$.
2. These vectors are projected into a hidden layer (embedding dimension $N$) using a shared weight matrix $W \in \mathbb{R}^{V \times N}$.
3. The embeddings of all context words are averaged:
$$h = \frac{1}{2m} \sum_{j=-m,\ j \ne 0}^{m} v_{w_{t+j}}$$
where $v_{w_{t+j}}$ is the embedding vector of the context word $w_{t+j}$.
4. This hidden representation $h$ is used to predict the target word $w_t$ using a softmax layer:
$$P(w_t \mid \text{context}) = \frac{\exp(u_{w_t} \cdot h)}{\sum_{w=1}^{V} \exp(u_w \cdot h)}$$
where $u_w$ is the output embedding of word $w$.
Example
Sentence: “The cat sits on the mat”
Target word: “sits”, context window m = 2
Context = {“The”, “cat”, “on”, “the”}
Steps:
• Represent each context word as one-hot vector.
• Project into embedding space (dimension N , e.g., 100).
• Compute average embedding h of these context words.
• Predict target word “sits” using softmax classifier.
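The steps above can be sketched in NumPy as follows. The toy vocabulary, the embedding dimension N = 4, and the random (untrained) weight matrices are assumptions made purely to show the shapes and the forward pass, not a trained model.

```python
import numpy as np

np.random.seed(0)

vocab = ["the", "cat", "sits", "on", "mat"]           # toy vocabulary, V = 5
word2idx = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 4                                  # embedding dimension N = 4

W_in = np.random.randn(V, N) * 0.1    # shared input embeddings (rows = context word vectors)
W_out = np.random.randn(V, N) * 0.1   # output embeddings u_w

context = ["the", "cat", "on", "the"]                 # context of target "sits", m = 2

# Steps 1-3: look up and average the context embeddings to get h
h = np.mean([W_in[word2idx[w]] for w in context], axis=0)

# Step 4: softmax over the vocabulary predicts the target word
scores = W_out @ h                                    # u_w . h for every word w
probs = np.exp(scores) / np.sum(np.exp(scores))
print(vocab[int(np.argmax(probs))])                   # arbitrary here, since the weights are untrained
```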
Properties
Advantages:
• Efficient and faster to train than Skip-Gram on large datasets.
• Works well for frequent words.
Disadvantages:
• Less effective for rare words compared to Skip-Gram.
• Averaging context embeddings may lose order information.
Skip-Gram Model
Definition
The Skip-Gram model is a prediction-based embedding technique in Word2Vec
where the current (center) word is used to predict its surrounding context words.
$$P(w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m} \mid w_t)$$
Here:
• $w_t$ = input (center) word
• $m$ = context window size
Vectorization Process
1. Represent the input word $w_t$ as a one-hot vector of size $V$ (vocabulary size).
2. Project it into the embedding space of dimension $N$ using a weight matrix $W \in \mathbb{R}^{V \times N}$.
3. The resulting vector $v_{w_t}$ is used to predict each context word individually.
4. The conditional probability of a context word $w_c$ given the input word $w_t$ is:
$$P(w_c \mid w_t) = \frac{\exp(u_{w_c} \cdot v_{w_t})}{\sum_{w=1}^{V} \exp(u_w \cdot v_{w_t})}$$
where $u_w$ is the output vector of word $w$.
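Mirroring the CBOW sketch earlier, the NumPy fragment below evaluates this conditional probability for a toy vocabulary. The random, untrained matrices W_in and W_out are illustrative stand-ins for the learned input and output embeddings.

```python
import numpy as np

np.random.seed(0)

vocab = ["the", "cat", "sits", "on", "mat"]
word2idx = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 4

W_in = np.random.randn(V, N) * 0.1    # input embeddings v_w
W_out = np.random.randn(V, N) * 0.1   # output embeddings u_w

v_cat = W_in[word2idx["cat"]]         # v_{w_t} for the center word "cat"

# P(w_c | w_t) = exp(u_{w_c} . v_{w_t}) / sum_w exp(u_w . v_{w_t})
scores = W_out @ v_cat
probs = np.exp(scores) / np.sum(np.exp(scores))

for w in ["the", "sits", "on"]:       # probabilities assigned to the true context words
    print(w, probs[word2idx[w]])
```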
Training Objective
For a sequence of words $w_1, w_2, \ldots, w_T$, the Skip-Gram model maximizes:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t)$$
Example
Sentence: “The cat sits on the mat”
Target word: “cat”, context window m = 2
Generated training pairs:
(cat → The), (cat → sits), (cat → on)
Thus, “cat” is the input, and the model tries to predict its context words.
Properties
Advantages: