Artificial Intelligence for Natural Language Processing (NLP)
Part II – From Word to Numerical Analysis
                                        Dr. Eng. Wael Ouarda
                    Assistant Professor, CRNS, Higher Education Ministry, Tunisia
Centre de Recherche en Numérique de Sfax, Route de Tunis km 10, Sakiet Ezzit, 3021 Sfax, Tunisia
1. Machine Learning algorithm for NLP
The end-to-end pipeline, illustrated on a corpus of 100 persons labelled with 7 emotions:
1. Data Scraping
2. Data Cleaning
3. Data Representation: Word Embedding -> Embedding Model
4. Data Partitioning: 85 persons for Train & Validation, 15 persons for Test; the 85 are then split into Train (85 * 0.8) and Validation (85 * 0.2)
5. Training: a Machine Learning algorithm (Algorithm, Options) is fitted on the train data (X_train, Y_train) and produces a Model
6. Validation: Y_Val' = Model.predict(X_Val) on (X_Val, Y_Val), followed by Performance Evaluation
7. Test: Y_Test' = Model.predict(X_Test) on (X_Test, Y_Test), followed by Performance Evaluation
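A minimal sketch of this pipeline with scikit-learn; the dummy data and the logistic-regression classifier are illustrative assumptions, only the 85/15 and 80/20 splits come from the slide:

  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import accuracy_score

  X = np.random.rand(100, 50)        # 100 persons, 50-dim embeddings (dummy data)
  Y = np.random.randint(0, 7, 100)   # 7 emotion classes

  # 85 persons for Train & Validation, 15 persons for Test
  X_trainval, X_test, Y_trainval, Y_test = train_test_split(X, Y, test_size=15, random_state=0)
  # 85 * 0.8 for Train, 85 * 0.2 for Validation
  X_train, X_val, Y_train, Y_val = train_test_split(X_trainval, Y_trainval, test_size=0.2, random_state=0)

  model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
  print("Validation accuracy:", accuracy_score(Y_val, model.predict(X_val)))
  print("Test accuracy:", accuracy_score(Y_test, model.predict(X_test)))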
2. Web Scraping Tools
• Open-source Python libraries and frameworks for web scraping:
  • Textual content:
    • Newspaper3k: sends an HTTP request to the website's server to retrieve the data displayed on the target web page;
    • BeautifulSoup: a Python library designed to parse data, i.e., to extract data from HTML or XML documents (see the sketch below);
    • Selenium: a web driver designed to render web pages the way your browser would, for the purpose of automated testing of web applications;
    • Scrapy: a complete web scraping framework designed explicitly for the job of scraping the web.
  • Visual content:
    • MechanicalSoup: a Python library designed to parse data, i.e., to extract URLs and hypertext from web pages.
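A minimal BeautifulSoup sketch; the URL and the choice of the requests library for the HTTP call are illustrative assumptions:

  import requests
  from bs4 import BeautifulSoup

  url = "https://example.com/article"          # placeholder target page
  html = requests.get(url, timeout=10).text    # retrieve the page

  soup = BeautifulSoup(html, "html.parser")    # parse the HTML document
  paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
  print(paragraphs)                            # textual content of every <p> tag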
3. Libraries & Frameworks
• Newspaper3k: scraping data;
• Facebook Scraper;
• Pandas: file I/O;
• Seaborn: statistics;
• NumPy: array handling;
• NLTK: Natural Language Toolkit (dictionary (graph = WordNet), stopwords, punctuation, etc.);
• re: regular expressions.
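A minimal setup sketch for these libraries; the nltk.download calls fetch the resources (stopwords, WordNet) used in the cleaning process below:

  import re
  import numpy as np
  import pandas as pd
  import seaborn as sns
  import nltk

  nltk.download("punkt")       # tokenizer models
  nltk.download("stopwords")   # stop-word lists
  nltk.download("wordnet")     # WordNet dictionary, used for lemmatization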
4. Cleaning process
1. Tokenization: split the document into a list of words
2. Lower casing: transform upper case to lower case
3. Stop-word removal: stop words are a predefined list of words, e.g. ["when", "i", "how", ...] (it can be modified by removing some words or adding other ones)
4. Special character removal: @ # ' " etc.
5. Punctuation removal: : , ; - ? ! etc.
6. Stemming: keep the base of the word: player, players, played, plays -> play
7. Lemmatization: have and had will be considered as have; plays and played will be considered as play
8. Spell check
9. Translation
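A minimal sketch of steps 1-7 with NLTK; the sample sentence is the one used in the worked example later in this section:

  import string
  from nltk.tokenize import word_tokenize
  from nltk.corpus import stopwords
  from nltk.stem import PorterStemmer, WordNetLemmatizer

  text = "Hi? How are you, I am very content to see you today :)!"

  tokens = word_tokenize(text)                                         # 1. tokenization
  tokens = [t.lower() for t in tokens]                                 # 2. lower casing
  tokens = [t for t in tokens if t not in stopwords.words("english")]  # 3. stop words
  tokens = [t for t in tokens if t not in string.punctuation]          # 4-5. special chars & punctuation
  stems  = [PorterStemmer().stem(t) for t in tokens]                   # 6. stemming
  lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]          # 7. lemmatization
  print(lemmas)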
4. Cleaning process: Regular Expression (re)
Examples: @ali, @ahmed, #, 'e', 'A12', 'A13', ... cannot be removed using NLTK functions.
re processes the text shared on the web or on social media as a string.
• \d: Matches any decimal digit; this is equivalent to the class [0-9].
• \D: Matches any non-digit character; this is equivalent to the class [^0-9].
• \s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
• \S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
• \w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
• \W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
• Example: re.sub(r"[^@]", " ", text) replaces every character except '@' with a space, leaving "@    @    @    @".
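A minimal sketch using these character classes to clean a tweet; the mention/hashtag/code patterns are common practice, not prescribed by the slide:

  import re

  tweet = "@ali and @ahmed share #news: A12, A13!"

  t = re.sub(r"@\w+", "", tweet)          # drop @mentions (\w = [a-zA-Z0-9_])
  t = re.sub(r"#\w+", "", t)              # drop #hashtags
  t = re.sub(r"\b[A-Za-z]\d+\b", "", t)   # drop codes such as A12, A13
  t = re.sub(r"\s+", " ", t).strip()      # collapse whitespace (\s)
  print(t)                                # -> "and share : , !"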
4. Cleaning process: Regular Expression (re)
Pattern   Description
^         Matches the beginning of the line (^ab means the string starts with ab).
$         Matches the end of the line (a$ means the string ends with a).
.         Matches any single character except newline (with the DOTALL option it matches newline as well).
[...]     Matches any single character in the brackets.
[^...]    Matches any single character not in the brackets.
4. Cleaning process
Worked example: "Hi? How are you, I am very content to see you today :)!"

Tokenization ->
[Hi, ?, How, are, you, ',', I, am, very, content, to, see, you, today, :, ), !]
Punctuation removal ->
[Hi, How, are, you, I, am, very, content, to, see, you, today, )]
Special character removal ->
[Hi, How, are, you, I, am, very, content, to, see, you, today]
Lower casing ->
[hi, how, are, you, i, am, very, content, to, see, you, today]
Translation & spell check ->
[hi, how, are, you, i, am, very, happy, to, see, you, today]
Stop words removal ->
[very, happy, see, today] or [very, happiness, see, today]
5. Sample of NLP Libraries for sentiment analysis
A sentiment is a tuple (Polarity, Subjectivity):
• Polarity in [-1 (negative), 1 (positive)]: the orientation of the opinion behind the text;
• Subjectivity in [0, 1]: the weight of subjectivity of the text.

Pipeline: Data Collection -> Data Cleaning -> Data Representation -> Data Classification
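A minimal sketch with TextBlob, whose sentiment property returns exactly such a (polarity, subjectivity) tuple:

  from textblob import TextBlob

  blob = TextBlob("I am very happy to see you today")
  print(blob.sentiment)               # Sentiment(polarity=..., subjectivity=...)
  print(blob.sentiment.polarity)      # orientation of the opinion, in [-1, 1]
  print(blob.sentiment.subjectivity)  # weight of subjectivity, in [0, 1]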
6. Word Embedding Techniques (TF-IDF)
TF-IDF: Term Frequency – Inverse Document Frequency

Terminology:
• t — term (word)
• d — document (set of words)
• N — number of documents in the corpus
• Corpus — the total document set

TF(t, d) = count of t in d / number of words in d
DF(t) = occurrence of t in the documents (IDF = N/DF)
TF-IDF(t, d) = TF(t, d) * log(N / (DF + 1))

Example:
user   Tweets                                                    Label
Id1    Tweet 11 = ["word 111", "word 112"] -> TF = [0.5, 0.5]    +
Id1    Tweet 12                                                  +
Id2    Tweet 21                                                  -
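A minimal sketch implementing the slide's formulas directly; note that DF counts occurrences across the whole corpus and that the IDF term uses the DF + 1 smoothing:

  import math

  def tf(term, doc):
      # TF(t, d) = count of t in d / number of words in d
      return doc.count(term) / len(doc)

  def tf_idf(term, doc, corpus):
      # DF(t) = occurrence of t in the documents of the corpus
      df = sum(d.count(term) for d in corpus)
      # TF-IDF(t, d) = TF(t, d) * log(N / (DF + 1))
      return tf(term, doc) * math.log(len(corpus) / (df + 1))

  corpus = [["bonjour", "ali", "bienvenue"], ["bonsoir", "ali", "ahmed"]]
  print(tf_idf("ali", corpus[0], corpus))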
6. Word Embedding Techniques (TF-IDF)

Activity

TF(t, d) = count of t in d / number of words in d
DF(t) = occurrence of t in the documents (IDF = N/DF)
TF-IDF(t, d) = TF(t, d) * log(N / (DF + 1))

user   Tweets                                                  Label
Id1    [bonjour, ali, bienvenue, leaders]                      +
Id1    [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed]   +
Id2    [bonsoir, ali, ahmed]                                   -

N-grams to include context (N = 3):
[bonjour, ali, bienvenue] [ali, bienvenue, leaders]
[bonsoir, ahmed, leaders] [ahmed, leaders, souhaite] [leaders, souhaite, bienvenue] [souhaite, bienvenue, ahmed]
[bonsoir, ali, ahmed]

With N = 7 trigrams:
TF-IDF('bonjour', id1) = tf('bonjour', id1) * log(N / (DF + 1)) = 1 * log(7/2)
TF-IDF('ali', id1) = tf('ali', id1) * log(7 / (DF('ali') + 1)) = 1 * log(7/3)
TF-IDF('ali', id2) = tf('ali', id2) * log(7 / (DF('ali') + 1)) = 1 * log(7/3)
TF-IDF('ahmed', id1) = 2 * log(7/4)
TF-IDF('ahmed', id2) = ?
TF-IDF('bonsoir') = ?
TF-IDF('leaders') = 1 * log(7/3)
TF-IDF('souhaite') = ?
TF-IDF('bienvenue') = 1 * log(7/3)

TF-IDF vectors of the first trigrams:
[bonjour, ali, bienvenue] -> [log(7/2), log(7/3), log(7/3)]
[ali, bienvenue, leaders] -> [log(7/3), log(7/3), log(7/3)]
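A minimal sketch generating these trigrams with a sliding window:

  def ngrams(tokens, n=3):
      # Slide a window of size n over the token list.
      return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

  tweet = ["bonsoir", "ahmed", "leaders", "souhaite", "bienvenue", "ahmed"]
  print(ngrams(tweet))
  # [['bonsoir', 'ahmed', 'leaders'], ['ahmed', 'leaders', 'souhaite'],
  #  ['leaders', 'souhaite', 'bienvenue'], ['souhaite', 'bienvenue', 'ahmed']]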
6. Word Embedding Techniques (Word2Vec)

Pipeline for a term (e.g. "machine"):
1. Word identification in the vocabulary (Yes/No): WordNet is the default dictionary (size N); a term outside the vocabulary raises an out-of-vocabulary error.
2. Bag of words: the term is encoded as a one-hot vector of size N (1 at the position of "machine", 0 elsewhere).
3. Neural network training: a network with input weight matrix W and output weight matrix V is trained on a prediction task; its hidden layer provides the features vector.
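A minimal sketch with gensim, a common Word2Vec implementation; the corpus and hyper-parameters are illustrative:

  from gensim.models import Word2Vec

  sentences = [["bonjour", "ali", "bienvenue", "leaders"],
               ["bonsoir", "ahmed", "leaders", "souhaite", "bienvenue", "ahmed"],
               ["bonsoir", "ali", "ahmed"]]

  # vector_size = W, the size of the features vector
  model = Word2Vec(sentences, vector_size=20, window=3, min_count=1, epochs=50)
  print(model.wv["ali"])    # features vector of "ali"
  # Asking for a word absent from the vocabulary raises a KeyError (out of vocabulary).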
6. Word Embedding Techniques (Word2Vec)

Some facts about the autoencoder:
• It represents the input in a low-dimensional space
• It is an unsupervised learning algorithm (like PCA)
• It minimizes the same objective function as PCA
• It is a neural network
• The neural network's target output is its input

z = f(Wx)   (encoder)
y = g(Vz)   (decoder)
X = input vector, X' = output vector; training enforces X = X'

Possible derivatives of the autoencoder: Stacked Autoencoder, Sparse Autoencoder
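A minimal Keras sketch of such an autoencoder; the layer sizes and data are illustrative assumptions:

  import numpy as np
  from tensorflow import keras

  X = np.random.rand(100, 50)   # dummy data: 100 samples, 50 features

  # Encoder z = f(Wx), decoder X' = g(Vz); the target output is the input itself.
  model = keras.Sequential([
      keras.layers.Input(shape=(50,)),
      keras.layers.Dense(10, activation="relu"),     # z: low-dimensional code
      keras.layers.Dense(50, activation="linear"),   # X': reconstruction of X
  ])
  model.compile(optimizer="adam", loss="mse")
  model.fit(X, X, epochs=10, verbose=0)              # unsupervised: X = X'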
6. Word Embedding Techniques (Word2Vec)

Activity: N = 4 is the size of the vocabulary; W is the size of the features vector.

user   Tweets                                                  Label
Id1    [bonjour, ali, bienvenue, leaders]                      +
Id1    [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed]   +
Id2    [bonsoir, ali, ahmed]                                   -

N-grams to include context (N = 3). For the trigram [bonjour, ali, bienvenue]:
• each word is one-hot encoded over the vocabulary (a single 1 at the word's position, 0 elsewhere);
• each one-hot vector is multiplied by the input weight matrix (4, W), giving V1 = (V11, ..., V1W), V2 = (V21, ..., V2W) and V3 = (V31, ..., V3W);
• the final features vector of the trigram is the component-wise average:
  ((V11 + V21 + V31)/3, ..., (V1W + V2W + V3W)/3)
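A minimal NumPy sketch of this averaging; the random weight matrix stands in for a trained one:

  import numpy as np

  vocab = ["bonjour", "ali", "bienvenue", "leaders"]   # N = 4
  W_dim = 5                                            # W: size of the features vector
  weights = np.random.rand(len(vocab), W_dim)          # input weight matrix (4, W)

  def one_hot(word):
      v = np.zeros(len(vocab))
      v[vocab.index(word)] = 1                         # single 1 at the word's position
      return v

  trigram = ["bonjour", "ali", "bienvenue"]
  vectors = [one_hot(w) @ weights for w in trigram]    # V1, V2, V3
  features = np.mean(vectors, axis=0)                  # ((V11+V21+V31)/3, ..., (V1W+V2W+V3W)/3)
  print(features)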
7. Features Selection, Analysis and Transformation
 • Transformation
    • Linear Transformation: Principal Component Analysis (PCA)
• Non-Linear Transformation: Autoencoder
 • Selection
    • Heuristic Methods: Genetic Algorithm, Particle Swarm Optimization, Ant Colony
      Optimization, etc.
    • Statistical Methods: Correlation Matrix
     7. Features Selection, Analysis and Transformation
A given dataset of N features and M samples.

The Correlation Matrix is based on the Pearson moment:
M(feature I, feature J) = covariance(I, J) / (std(I) * std(J))

M is in [-1; 1]:
M(I,J) in [-1; -0.5] : I & J are highly inversely correlated
M(I,J) in ]-0.5; 0]  : I & J are not highly inversely correlated
M(I,J) in ]0; 0.5]   : I & J are not highly correlated
M(I,J) in ]0.5; 1]   : I & J are highly correlated

Example with N = 3:

               Feature I     Feature II     Feature III
Feature I      M(I,I) = 1    0.6            -0.2
Feature II     0.6           M(II,II) = 1   0.001
Feature III    -0.2          0.001          M(III,III) = 1

Features I & II are highly correlated (0.6), so we can drop one of them and keep N = 2: (Features I & III) or (Features II & III).
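A minimal sketch with pandas; the dummy features and the 0.5 threshold follow the table above:

  import numpy as np
  import pandas as pd

  rng = np.random.default_rng(0)
  f1 = rng.normal(size=100)                              # M = 100 samples
  df = pd.DataFrame({
      "I": f1,
      "II": 0.8 * f1 + rng.normal(scale=0.5, size=100),  # correlated with I
      "III": rng.normal(size=100),                       # independent
  })

  corr = df.corr()   # Pearson correlation matrix, values in [-1, 1]
  print(corr)
  # Any pair with |M(I,J)| > 0.5 is highly correlated: drop one feature of the pair.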
7. Features Selection, Analysis and Transformation
Principal Component Analysis:
1. Compute the average vector of the dataset {Vi}: A = 1/N * Sum(Vi)
2. Adjust the dataset: for i = 1..N, Va = Vi - A, giving the adjusted dataset {Vai}
3. Transform the adjusted dataset into an N*N matrix (N features)
4. Apply Singular Value Decomposition to compute the N eigenvectors ("proper vectors") vi; each vector of the old dataset can be described as a weighted sum of the eigenvectors, e.g. Vector1 = a1*v1 + a2*v2 + ... + an*vn
5. Sort the eigenvectors by their weight (e.g. V1 = 3/8, V2 = 8/8, V3 = 2/8, V4 = 7/8, ..., Vn) and keep the strongest ones, carrying e.g. 85% of the information
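A minimal NumPy sketch of these steps; the 85% threshold matches the slide, the data is dummy:

  import numpy as np

  X = np.random.rand(100, 4)              # dummy dataset: 100 samples, 4 features

  A = X.mean(axis=0)                      # 1. average vector
  Xa = X - A                              # 2. adjusted dataset: Va = Vi - A
  # 3-4. SVD; the rows of Vt are the eigenvectors of the covariance matrix
  U, S, Vt = np.linalg.svd(Xa, full_matrices=False)
  explained = (S ** 2) / np.sum(S ** 2)   # weight of each eigenvector
  k = int(np.searchsorted(np.cumsum(explained), 0.85)) + 1   # 5. keep ~85% of the information
  X_reduced = Xa @ Vt[:k].T               # project onto the k strongest eigenvectors
  print(X_reduced.shape)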
8. NLP Applications
• NLP Classification
   • Spam & Ham Detector
• Fake News Detector
   • Sentiment Analysis
• NLP Topic Modeling
   • Word Cloud Visualisation
• Clustering data/users -> Communities
• Chatbot
• Natural Language Processing (NLP): to process the natural language input by the human
• Natural Language Generation (NLG): to generate a response to the human