Artificial Intelligence for Natural Language Processing (NLP)
Part II – From Word to Numerical Analysis
                                        Dr. Eng. Wael Ouarda
                    Assistant Professor, CRNS, Higher Education Ministry, Tunisia
Centre de Recherche en Numérique de Sfax, Route de Tunis km 10, Sakiet Ezzit, 3021 Sfax, Tunisia
1. Machine Learning algorithm for NLP
The end-to-end pipeline, illustrated on a corpus of 100 persons labelled with 7 emotions:
1. Data Scraping
2. Data Cleaning
3. Data Representation: Word Embedding -> Embedding Model
4. Data Partitioning: 85 persons for Train & Validation, 15 persons for Test; the 85 are then split into Train (85 * 0.8) and Validation (85 * 0.2)
5. Training: a Machine Learning algorithm (Algorithm, Options) is fitted on the train data (X_train, Y_train) and produces a Model
6. Validation: Y_Val' = Model.predict(X_Val) on (X_Val, Y_Val), followed by Performance Evaluation
7. Test: Y_Test' = Model.predict(X_Test) on (X_Test, Y_Test), followed by Performance Evaluation
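A minimal sketch of this pipeline with scikit-learn; the dummy data and the logistic-regression classifier are illustrative assumptions, only the 85/15 and 80/20 splits come from the slide:

  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import accuracy_score

  X = np.random.rand(100, 50)        # 100 persons, 50-dim embeddings (dummy data)
  Y = np.random.randint(0, 7, 100)   # 7 emotion classes

  # 85 persons for Train & Validation, 15 persons for Test
  X_trainval, X_test, Y_trainval, Y_test = train_test_split(X, Y, test_size=15, random_state=0)
  # 85 * 0.8 for Train, 85 * 0.2 for Validation
  X_train, X_val, Y_train, Y_val = train_test_split(X_trainval, Y_trainval, test_size=0.2, random_state=0)

  model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
  print("Validation accuracy:", accuracy_score(Y_val, model.predict(X_val)))
  print("Test accuracy:", accuracy_score(Y_test, model.predict(X_test)))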
2. Web Scraping Tools
• Open-source Python libraries and frameworks for web scraping:
  • Textual content:
    • Newspaper3k: sends an HTTP request to the website's server to retrieve the data displayed on the target web page;
    • BeautifulSoup: a Python library designed to parse data, i.e., to extract data from HTML or XML documents (see the sketch below);
    • Selenium: a web driver designed to render web pages the way your browser would, for the purpose of automated testing of web applications;
    • Scrapy: a complete web scraping framework designed explicitly for the job of scraping the web.
  • Visual content:
    • MechanicalSoup: a Python library designed to parse data, i.e., to extract URLs and hypertext from web pages.
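A minimal BeautifulSoup sketch; the URL and the choice of the requests library for the HTTP call are illustrative assumptions:

  import requests
  from bs4 import BeautifulSoup

  url = "https://example.com/article"          # placeholder target page
  html = requests.get(url, timeout=10).text    # retrieve the page

  soup = BeautifulSoup(html, "html.parser")    # parse the HTML document
  paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
  print(paragraphs)                            # textual content of every <p> tag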
3. Libraries & Frameworks
• Newspaper3k: scraping data;
• Facebook Scraper;
• Pandas: file I/O;
• Seaborn: statistics;
• NumPy: array handling;
• NLTK: Natural Language Toolkit (dictionary (graph = WordNet), stopwords, punctuation, etc.);
• re: regular expressions.
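A minimal setup sketch for these libraries; the nltk.download calls fetch the resources (stopwords, WordNet) used in the cleaning process below:

  import re
  import numpy as np
  import pandas as pd
  import seaborn as sns
  import nltk

  nltk.download("punkt")       # tokenizer models
  nltk.download("stopwords")   # stop-word lists
  nltk.download("wordnet")     # WordNet dictionary, used for lemmatization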
4. Cleaning process
1. Tokenization: split the document into a list of words
2. Lower casing: transform upper case to lower case
3. Stop-word removal: stop words are a predefined list of words, e.g. ["when", "i", "how", ...] (it can be modified by removing some words or adding other ones)
4. Special character removal: @ # ' " etc.
5. Punctuation removal: : , ; - ? ! etc.
6. Stemming: keep the base of the word: player, players, played, plays -> play
7. Lemmatization: have and had will be considered as have; plays and played will be considered as play
8. Spell check
9. Translation
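A minimal sketch of steps 1-7 with NLTK; the sample sentence is the one used in the worked example later in this section:

  import string
  from nltk.tokenize import word_tokenize
  from nltk.corpus import stopwords
  from nltk.stem import PorterStemmer, WordNetLemmatizer

  text = "Hi? How are you, I am very content to see you today :)!"

  tokens = word_tokenize(text)                                         # 1. tokenization
  tokens = [t.lower() for t in tokens]                                 # 2. lower casing
  tokens = [t for t in tokens if t not in stopwords.words("english")]  # 3. stop words
  tokens = [t for t in tokens if t not in string.punctuation]          # 4-5. special chars & punctuation
  stems  = [PorterStemmer().stem(t) for t in tokens]                   # 6. stemming
  lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]          # 7. lemmatization
  print(lemmas)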
4. Cleaning process: Regular Expression (re)
Examples: @ali, @ahmed, #, 'e', 'A12', 'A13', ... cannot be removed using NLTK functions.
re processes the text shared on the web or on social media as a string.
• \d: Matches any decimal digit; this is equivalent to the class [0-9].
• \D: Matches any non-digit character; this is equivalent to the class [^0-9].
• \s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
• \S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
• \w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
• \W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
• Example: re.sub(r"[^@]", " ", text) replaces every character except '@' with a space, leaving "@    @    @    @".
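A minimal sketch using these character classes to clean a tweet; the mention/hashtag/code patterns are common practice, not prescribed by the slide:

  import re

  tweet = "@ali and @ahmed share #news: A12, A13!"

  t = re.sub(r"@\w+", "", tweet)          # drop @mentions (\w = [a-zA-Z0-9_])
  t = re.sub(r"#\w+", "", t)              # drop #hashtags
  t = re.sub(r"\b[A-Za-z]\d+\b", "", t)   # drop codes such as A12, A13
  t = re.sub(r"\s+", " ", t).strip()      # collapse whitespace (\s)
  print(t)                                # -> "and share : , !"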
4. Cleaning process: Regular Expression (re)
Pattern   Description
^         Matches the beginning of the line (^ab means the string starts with ab).
$         Matches the end of the line (a$ means the string ends with a).
.         Matches any single character except newline (with the DOTALL option it matches newline as well).
[...]     Matches any single character in the brackets.
[^...]    Matches any single character not in the brackets.
4. Cleaning process
Worked example: "Hi? How are you, I am very content to see you today :)!"

Tokenization ->
[Hi, ?, How, are, you, ',', I, am, very, content, to, see, you, today, :, ), !]
Punctuation removal ->
[Hi, How, are, you, I, am, very, content, to, see, you, today, )]
Special character removal ->
[Hi, How, are, you, I, am, very, content, to, see, you, today]
Lower casing ->
[hi, how, are, you, i, am, very, content, to, see, you, today]
Translation & spell check ->
[hi, how, are, you, i, am, very, happy, to, see, you, today]
Stop words removal ->
[very, happy, see, today] or [very, happiness, see, today]
5. Sample of NLP Libraries for sentiment analysis
A sentiment is a tuple (Polarity, Subjectivity):
• Polarity in [-1 (negative), 1 (positive)]: the orientation of the opinion behind the text;
• Subjectivity in [0, 1]: the weight of subjectivity of the text.

Pipeline: Data Collection -> Data Cleaning -> Data Representation -> Data Classification
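A minimal sketch with TextBlob, whose sentiment property returns exactly such a (polarity, subjectivity) tuple:

  from textblob import TextBlob

  blob = TextBlob("I am very happy to see you today")
  print(blob.sentiment)               # Sentiment(polarity=..., subjectivity=...)
  print(blob.sentiment.polarity)      # orientation of the opinion, in [-1, 1]
  print(blob.sentiment.subjectivity)  # weight of subjectivity, in [0, 1]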
6. Word Embedding Techniques (TF-IDF)
TF-IDF: Term Frequency – Inverse Document Frequency

Terminology:
• t — term (word)
• d — document (set of words)
• N — number of documents in the corpus
• Corpus — the total document set

TF(t, d) = count of t in d / number of words in d
DF(t) = occurrence of t in the documents (IDF = N/DF)
TF-IDF(t, d) = TF(t, d) * log(N / (DF + 1))

Example:
user   Tweets                                                    Label
Id1    Tweet 11 = ["word 111", "word 112"] -> TF = [0.5, 0.5]    +
Id1    Tweet 12                                                  +
Id2    Tweet 21                                                  -
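A minimal sketch implementing the slide's formulas directly; note that DF counts occurrences across the whole corpus and that the IDF term uses the DF + 1 smoothing:

  import math

  def tf(term, doc):
      # TF(t, d) = count of t in d / number of words in d
      return doc.count(term) / len(doc)

  def tf_idf(term, doc, corpus):
      # DF(t) = occurrence of t in the documents of the corpus
      df = sum(d.count(term) for d in corpus)
      # TF-IDF(t, d) = TF(t, d) * log(N / (DF + 1))
      return tf(term, doc) * math.log(len(corpus) / (df + 1))

  corpus = [["bonjour", "ali", "bienvenue"], ["bonsoir", "ali", "ahmed"]]
  print(tf_idf("ali", corpus[0], corpus))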
6. Word Embedding Techniques (TF-IDF)

Activity

TF(t, d) = count of t in d / number of words in d
DF(t) = occurrence of t in the documents (IDF = N/DF)
TF-IDF(t, d) = TF(t, d) * log(N / (DF + 1))

user   Tweets                                                  Label
Id1    [bonjour, ali, bienvenue, leaders]                      +
Id1    [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed]   +
Id2    [bonsoir, ali, ahmed]                                   -

N-grams to include context (N = 3):
[bonjour, ali, bienvenue] [ali, bienvenue, leaders]
[bonsoir, ahmed, leaders] [ahmed, leaders, souhaite] [leaders, souhaite, bienvenue] [souhaite, bienvenue, ahmed]
[bonsoir, ali, ahmed]

With N = 7 trigrams:
TF-IDF('bonjour', id1) = tf('bonjour', id1) * log(N / (DF + 1)) = 1 * log(7/2)
TF-IDF('ali', id1) = tf('ali', id1) * log(7 / (DF('ali') + 1)) = 1 * log(7/3)
TF-IDF('ali', id2) = tf('ali', id2) * log(7 / (DF('ali') + 1)) = 1 * log(7/3)
TF-IDF('ahmed', id1) = 2 * log(7/4)
TF-IDF('ahmed', id2) = ?
TF-IDF('bonsoir') = ?
TF-IDF('leaders') = 1 * log(7/3)
TF-IDF('souhaite') = ?
TF-IDF('bienvenue') = 1 * log(7/3)

TF-IDF vectors of the first trigrams:
[bonjour, ali, bienvenue] -> [log(7/2), log(7/3), log(7/3)]
[ali, bienvenue, leaders] -> [log(7/3), log(7/3), log(7/3)]
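A minimal sketch generating these trigrams with a sliding window:

  def ngrams(tokens, n=3):
      # Slide a window of size n over the token list.
      return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

  tweet = ["bonsoir", "ahmed", "leaders", "souhaite", "bienvenue", "ahmed"]
  print(ngrams(tweet))
  # [['bonsoir', 'ahmed', 'leaders'], ['ahmed', 'leaders', 'souhaite'],
  #  ['leaders', 'souhaite', 'bienvenue'], ['souhaite', 'bienvenue', 'ahmed']]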
6. Word Embedding Techniques (Word2Vec)

Pipeline for a term (e.g. "machine"):
1. Word identification in the vocabulary (Yes/No): WordNet is the default dictionary (size N); a term outside the vocabulary raises an out-of-vocabulary error.
2. Bag of words: the term is encoded as a one-hot vector of size N (1 at the position of "machine", 0 elsewhere).
3. Neural network training: a network with input weight matrix W and output weight matrix V is trained on a prediction task; its hidden layer provides the features vector.
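A minimal sketch with gensim, a common Word2Vec implementation; the corpus and hyper-parameters are illustrative:

  from gensim.models import Word2Vec

  sentences = [["bonjour", "ali", "bienvenue", "leaders"],
               ["bonsoir", "ahmed", "leaders", "souhaite", "bienvenue", "ahmed"],
               ["bonsoir", "ali", "ahmed"]]

  # vector_size = W, the size of the features vector
  model = Word2Vec(sentences, vector_size=20, window=3, min_count=1, epochs=50)
  print(model.wv["ali"])    # features vector of "ali"
  # Asking for a word absent from the vocabulary raises a KeyError (out of vocabulary).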
6. Word Embedding Techniques (Word2Vec)

Some facts about the autoencoder:
• It represents the input in a low-dimensional space
• It is an unsupervised learning algorithm (like PCA)
• It minimizes the same objective function as PCA
• It is a neural network
• The neural network's target output is its input

z = f(Wx)   (encoder)
y = g(Vz)   (decoder)
X = input vector, X' = output vector; training enforces X = X'

Possible derivatives of the autoencoder: Stacked Autoencoder, Sparse Autoencoder
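A minimal Keras sketch of such an autoencoder; the layer sizes and data are illustrative assumptions:

  import numpy as np
  from tensorflow import keras

  X = np.random.rand(100, 50)   # dummy data: 100 samples, 50 features

  # Encoder z = f(Wx), decoder X' = g(Vz); the target output is the input itself.
  model = keras.Sequential([
      keras.layers.Input(shape=(50,)),
      keras.layers.Dense(10, activation="relu"),     # z: low-dimensional code
      keras.layers.Dense(50, activation="linear"),   # X': reconstruction of X
  ])
  model.compile(optimizer="adam", loss="mse")
  model.fit(X, X, epochs=10, verbose=0)              # unsupervised: X = X'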
6. Word Embedding Techniques (Word2Vec)

Activity: N = 4 is the size of the vocabulary; W is the size of the features vector.

user   Tweets                                                  Label
Id1    [bonjour, ali, bienvenue, leaders]                      +
Id1    [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed]   +
Id2    [bonsoir, ali, ahmed]                                   -

N-grams to include context (N = 3). For the trigram [bonjour, ali, bienvenue]:
• each word is one-hot encoded over the vocabulary (a single 1 at the word's position, 0 elsewhere);
• each one-hot vector is multiplied by the input weight matrix (4, W), giving V1 = (V11, ..., V1W), V2 = (V21, ..., V2W) and V3 = (V31, ..., V3W);
• the final features vector of the trigram is the component-wise average:
  ((V11 + V21 + V31)/3, ..., (V1W + V2W + V3W)/3)
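A minimal NumPy sketch of this averaging; the random weight matrix stands in for a trained one:

  import numpy as np

  vocab = ["bonjour", "ali", "bienvenue", "leaders"]   # N = 4
  W_dim = 5                                            # W: size of the features vector
  weights = np.random.rand(len(vocab), W_dim)          # input weight matrix (4, W)

  def one_hot(word):
      v = np.zeros(len(vocab))
      v[vocab.index(word)] = 1                         # single 1 at the word's position
      return v

  trigram = ["bonjour", "ali", "bienvenue"]
  vectors = [one_hot(w) @ weights for w in trigram]    # V1, V2, V3
  features = np.mean(vectors, axis=0)                  # ((V11+V21+V31)/3, ..., (V1W+V2W+V3W)/3)
  print(features)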
7. Features Selection, Analysis and Transformation
 • Transformation
    • Linear Transformation: Principal Component Analysis (PCA)
• Non-Linear Transformation: Autoencoder
 • Selection
    • Heuristic Methods: Genetic Algorithm, Particle Swarm Optimization, Ant Colony
      Optimization, etc.
    • Statistical Methods: Correlation Matrix
     7. Features Selection, Analysis and Transformation
A given dataset of N features and M samples.

The Correlation Matrix is based on the Pearson moment:
M(feature I, feature J) = covariance(I, J) / (std(I) * std(J))

M is in [-1; 1]:
M(I,J) in [-1; -0.5] : I & J are highly inversely correlated
M(I,J) in ]-0.5; 0]  : I & J are not highly inversely correlated
M(I,J) in ]0; 0.5]   : I & J are not highly correlated
M(I,J) in ]0.5; 1]   : I & J are highly correlated

Example with N = 3:

               Feature I     Feature II     Feature III
Feature I      M(I,I) = 1    0.6            -0.2
Feature II     0.6           M(II,II) = 1   0.001
Feature III    -0.2          0.001          M(III,III) = 1

Features I & II are highly correlated (0.6), so we can drop one of them and keep N = 2: (Features I & III) or (Features II & III).
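A minimal sketch with pandas; the dummy features and the 0.5 threshold follow the table above:

  import numpy as np
  import pandas as pd

  rng = np.random.default_rng(0)
  f1 = rng.normal(size=100)                              # M = 100 samples
  df = pd.DataFrame({
      "I": f1,
      "II": 0.8 * f1 + rng.normal(scale=0.5, size=100),  # correlated with I
      "III": rng.normal(size=100),                       # independent
  })

  corr = df.corr()   # Pearson correlation matrix, values in [-1, 1]
  print(corr)
  # Any pair with |M(I,J)| > 0.5 is highly correlated: drop one feature of the pair.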
7. Features Selection, Analysis and Transformation
Principal Component Analysis:
1. Compute the average vector of the dataset {Vi}: A = 1/N * Sum(Vi)
2. Adjust the dataset: for i = 1..N, Va = Vi - A, giving the adjusted dataset {Vai}
3. Transform the adjusted dataset into an N*N matrix (N features)
4. Apply Singular Value Decomposition to compute the N eigenvectors ("proper vectors") vi; each vector of the old dataset can be described as a weighted sum of the eigenvectors, e.g. Vector1 = a1*v1 + a2*v2 + ... + an*vn
5. Sort the eigenvectors by their weight (e.g. V1 = 3/8, V2 = 8/8, V3 = 2/8, V4 = 7/8, ..., Vn) and keep the strongest ones, carrying e.g. 85% of the information
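A minimal NumPy sketch of these steps; the 85% threshold matches the slide, the data is dummy:

  import numpy as np

  X = np.random.rand(100, 4)              # dummy dataset: 100 samples, 4 features

  A = X.mean(axis=0)                      # 1. average vector
  Xa = X - A                              # 2. adjusted dataset: Va = Vi - A
  # 3-4. SVD; the rows of Vt are the eigenvectors of the covariance matrix
  U, S, Vt = np.linalg.svd(Xa, full_matrices=False)
  explained = (S ** 2) / np.sum(S ** 2)   # weight of each eigenvector
  k = int(np.searchsorted(np.cumsum(explained), 0.85)) + 1   # 5. keep ~85% of the information
  X_reduced = Xa @ Vt[:k].T               # project onto the k strongest eigenvectors
  print(X_reduced.shape)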
8. NLP Applications
• NLP Classification
   • Spam & Ham Detector
• Fake News Detector
   • Sentiment Analysis
• NLP Topic Modeling
   • Word Cloud Visualisation
• Clustering data/users -> Communities
• Chatbot
• Natural Language Processing (NLP): to process the natural language input by the human
• Natural Language Generation (NLG): to generate a response to the human