NLP Module 1
Natural Language Processing – CSE3015
▪ Preprocessing techniques
▪ Tokenization, stemming, lemmatization, stop word removal, rare word removal, spell correction.
    Basics of NLP
History of NLP
▪ ELIZA was an early natural language processing system capable of carrying on a limited form of conversation with a user.
History of NLP
▪ Mid 1950's – Mid 1960's: Birth of NLP and Linguistics
    ▪ At first, people thought NLP was easy! Researchers predicted that "machine translation" could be solved in 3 years or so.
    ▪ Mostly hand-coded rules / linguistically oriented approaches.
    ▪ The 3-year project continued for 10 years, but still with no good result, despite the significant amount of expenditure.
▪ Mid 1960's – Mid 1970's: A Dark Era
    ▪ After the initial hype, a dark era followed.
    ▪ People started believing that machine translation was impossible, and most abandoned research on NLP.
▪ 1970's and early 1980's – Slow Revival of NLP
    ▪ Some research activities revived, but the emphasis was still linguistically oriented, working on small toy problems with weak empirical evaluation.
▪ Late 1980's and 1990's: Statistical Revolution
    ▪ Computing power increased substantially.
    ▪ Data-driven statistical approaches with simple representations won over complex hand-coded linguistic rules.
▪ 2000's – Statistics powered by Linguistic Insights
    ▪ With more sophisticated statistical models, richer linguistic representations started finding new value.
▪ 2010's – Emergence of embedding models and deep neural networks
    ▪ Several models: Word2Vec, GloVe, fastText, ELMo, BERT, ColBERT, GPT[1-3.5]
    ▪ New techniques brought attention to more complex tasks.
Challenges in NLP
         Challenges contd..
▪   Ambiguity - sentences and phrases that potentially have two or more possible interpretations.
     ▪ Lexical
          ▪ The ambiguity of a single word is called lexical ambiguity: a word that could be used as a verb, noun, or adjective.
          ▪ Ex: "bat" (noun or object?); "I made it" ("made" → created or cooked?)
          ▪ Can be resolved by part-of-speech tagging and word-sense disambiguation.
      ▪   Semantic
           ▪ This kind of ambiguity occurs when the meaning of the words themselves can be misinterpreted.
           ▪ "The car hit the pole while it was moving."
           ▪ "It" → the car or the pole? Ambiguity in entities.
           ▪ Can be resolved by probabilistic parsing.
     ▪   Syntactic
          ▪ when a sentence is parsed in different ways
          ▪ “The man saw the girl with the telescope”
     ▪   Anaphoric
          ▪ the use of anaphora entities in discourse
          ▪ “the horse ran up the hill. It was very steep. It soon got tired”
     ▪   Pragmatic
           ▪ knowledge of the relationship of meaning to the goals and intentions of the speaker
           ▪ situation where the context of a phrase gives it multiple interpretations
            ▪ arises when the statement is not specific. Ex: "I like you too"
Challenges - Ambiguity
▪   Include your children when baking
    cookies
▪   Local High School Dropouts Cut in
    Half
▪   Hospitals are Sued by 7 Foot Doctors
▪   Iraqi Head Seeks Arms
▪   Safety Experts Say School Bus
    Passengers Should Be Belted
▪   Teacher Strikes Idle Kids
Challenges - Ambiguity
▪ Pronoun Reference Ambiguity
        Challenges contd..
▪ Errors in text or speech
    ▪ Misspelled or misused words can create problems for text analysis.
▪ Domain-specific language
   ▪ Different businesses and industries often use very different language.
▪ Low-resource languages
    ▪ Many languages, especially those spoken by people with less access to technology, often go overlooked and under-processed.
Applications of NLP
▪ Sentiment Analysis
▪ Question Answering
▪ Spam Detection
▪ Google Home, Alexa
▪ Spelling correction
▪ Chatbot
▪ Machine Translation
Machine Translation
Dialog Systems
Sentiment or Twitter analysis
Text Classification
Question & Answer
Digital Personal Assistant
Information Extraction – Unstructured text to database entries
Language Comprehension
Introduction to NLTK
▪ Toolkit required: NLTK
▪ Programming language: Python
▪ Installation: pip install nltk
▪ A variety of tasks can be performed using NLTK: tokenization, stemming, lemmatization
▪ Packages: nltk.classify, nltk.cluster, nltk.parse, nltk.stem
           Text wrangling
Text wrangling is the pre-processing work that is done to prepare raw text data for training.
Simply put, it is the process of cleaning your data to make it readable by your program, and then
formatting it as such.
It includes:
  •   Tokenization
  •   Stop word removal
  •   Stemming
  •   Lemmatization
  •   Rare word removal
  •   Spell correction
Tokenization
▪ Breaking the raw text into small chunks, called tokens.
▪ These tokens help in understanding the context or developing the model for the NLP task.
▪ 2 types (see the usage sketch below):
    ▪ Sentence-level
        ▪ from nltk.tokenize import sent_tokenize
    ▪ Word-level
        ▪ from nltk.tokenize import word_tokenize
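A minimal usage sketch of the two tokenizers above (the example sentence and the one-time 'punkt' download are illustrative assumptions):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')                      # tokenizer models used by both functions
text = "NLP is fun. It has many applications."
print(sent_tokenize(text))                  # ['NLP is fun.', 'It has many applications.']
print(word_tokenize(text))                  # ['NLP', 'is', 'fun', '.', 'It', 'has', 'many', 'applications', '.']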
Stop word removal
▪ The words which are generally filtered out before processing a natural language are called stop words.
▪ These are actually the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.
▪ By removing commonly used words that do not contribute much to the context, search systems are able to process data more quickly and accurately. Removing stop words eliminates low-information words from the text, allowing NLP algorithms to focus on the words that are more significant and provide context.
▪ Examples of a few stop words in English are "the", "a", "an", "so", "what".
import nltk; nltk.download('stopwords')   # stop word lists (one-time download)
from nltk.corpus import stopwords
text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [w for w in text_tokens if w.lower() not in stopwords.words('english')]
print(tokens_without_sw)
Stemming
▪ Stemming essentially strips affixes from words, leaving only the base form.
▪ Issues:
    ▪ Over-stemming (two semantically distinct words are reduced to the same root and so conflated). Ex: wander → wand
    ▪ Under-stemming (two semantically related words are not reduced to the same root). Ex: knavish → knavish and knave → knave (dishonest)
▪ Types: Lovins Stemmer, Porter Stemmer, Snowball Stemmer, Lancaster Stemmer, Regexp Stemmer
Implementation

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)

word_list = ["friend", "friendship", "friends", "friendships"]
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word", "Porter Stemmer", "Snowball Stemmer", "Lancaster Stemmer", "Regexp Stemmer"))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word, porter.stem(word), snowball.stem(word), lancaster.stem(word), regexp.stem(word)))
Problems in stemming
Lemmatization
▪ Lemmatization takes a word and breaks it down to its lemma.
▪ For example, the verb "walk" might appear as "walking", "walks" or "walked". Inflectional endings such as "s", "ed" and "ing" are removed. Lemmatization groups these words under their lemma, "walk".
▪ The word "saw" might be interpreted differently, depending on the sentence.
▪ For example, "saw" can be broken down into the lemma "see" or "saw".
▪ In these cases, lemmatization attempts to select the right lemma depending on the context of the word, the surrounding words and the sentence.
▪ Other words, such as "better", might be broken down to a lemma such as "good".
▪ Search engine algorithms use lemmatization; the user can query any inflectional form of a word and get relevant results.
    ▪ For example, if the user queries the plural form of a word such as "routers", the search engine knows to also return relevant content that uses the singular form of the same word -- "router".
▪ Lemmatization is used in: artificial intelligence (AI), big data analytics, chatbots, machine learning (ML), NLP, search queries, sentiment analysis.
▪ Stemming operates without any contextual knowledge, meaning that it can't discern between similar words with different meanings.
▪ Stemming is less complex than lemmatization and faster than lemmatization.
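A minimal sketch of lemmatization with NLTK's WordNetLemmatizer (the chosen words and part-of-speech tags are illustrative; the WordNet data download is assumed):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')                          # lexical database used by the lemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("walking", pos="v"))   # walk
print(lemmatizer.lemmatize("saw", pos="v"))       # see
print(lemmatizer.lemmatize("better", pos="a"))    # good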
Rare word removal
▪ Sometimes we need to remove words that are very unique in nature, like names, brands, product names, and some noise characters, such as HTML leftovers.

from nltk import FreqDist
tokens = ['hi', 'i', 'am', 'am', 'whatever', 'this', 'is', 'just', 'a', 'test', 'test', 'java', 'python', 'java']
freq_dist = FreqDist(tokens)
sorted_tokens = dict(sorted(freq_dist.items(), key=lambda x: x[1]))
rare_words = [t for t, count in freq_dist.items() if count == 1]   # words occurring only once
final_tokens = [t for t in tokens if t not in rare_words]
Spell correction
▪ A few methods are available in the nltk library to correct the spelling of incorrect words:
    ▪ Cosine Similarity
    ▪ Euclidean Distance
    ▪ Hamming Distance
import nltk
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.corpus import words

nltk.download('words')
correct_words = words.words()

# list of incorrect spellings that need to be corrected
incorrect_words = ['happpy', 'azmaing', 'intelliengt']

# loop for finding correct spellings based on Jaccard distance and printing the correct word
for word in incorrect_words:
    temp = [(jaccard_distance(set(ngrams(word, 2)), set(ngrams(w, 2))), w) for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])
                     Jaccard Distance
▪ The complement of the Jaccard coefficient, used to measure the dissimilarity between two sample sets.
▪ We get the Jaccard distance by subtracting the Jaccard coefficient from 1.
▪ We can also get it by dividing the difference between the sizes of the union and the intersection of two sets by the size of the union.
▪ We work with Q-grams (equivalent to N-grams), which are sequences of characters instead of tokens.
▪ Jaccard distance is given by the following formula:
    d_J(A, B) = 1 − |A ∩ B| / |A ∪ B| = (|A ∪ B| − |A ∩ B|) / |A ∪ B|
Example
▪ Doc_1 = "educative is the best platform out there"
▪ Doc_2 = "educative is a new platform"
▪ ***Tokenizing the sentences***
▪ words_doc_1 = {'educative', 'is', 'the', 'best', 'platform', 'out', 'there'}
▪ words_doc_2 = {'educative', 'is', 'a', 'new', 'platform'}
▪ The intersection, or the common words between the documents: {'educative', 'is', 'platform'}
    ▪ 3 words are common.
▪ The union, or all the words in the documents: {'educative', 'is', 'the', 'best', 'platform', 'out', 'there', 'a', 'new'}
    ▪ In total, there are 9 words.
▪ Hence, the Jaccard similarity is 3/9 = 0.333, and the Jaccard distance is 1 − 0.333 = 0.667.
▪ Application:
    ▪ Netflix could represent customers as multisets of movies watched.
    ▪ It uses Jaccard distance to measure the similarity between two customers, i.e. how close their tastes are.
    ▪ Then, based on the preferences of two users and their similarity, we could potentially make recommendations to one or the other.
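A small sketch reproducing the arithmetic above in plain Python (set contents follow the slide):

words_doc_1 = {'educative', 'is', 'the', 'best', 'platform', 'out', 'there'}
words_doc_2 = {'educative', 'is', 'a', 'new', 'platform'}
intersection = words_doc_1 & words_doc_2
union = words_doc_1 | words_doc_2
jaccard_similarity = len(intersection) / len(union)    # 3 / 9 ≈ 0.333
jaccard_distance = 1 - jaccard_similarity               # ≈ 0.667
print(jaccard_similarity, jaccard_distance)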
         2. Edit distance method
▪ Edit distance measures the dissimilarity between two strings by finding the minimum number of operations needed to transform one string into the other.
▪ The transformations that can be performed are: insertion, deletion, and substitution of a single character (see the sketch below).
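A minimal sketch of spell correction with edit distance, mirroring the earlier Jaccard-based corrector (the word list and first-letter candidate filter are the same assumptions as before):

import nltk
from nltk.metrics.distance import edit_distance
from nltk.corpus import words

nltk.download('words')
correct_words = words.words()

incorrect_words = ['happpy', 'azmaing', 'intelliengt']
for word in incorrect_words:
    # rank dictionary words starting with the same letter by edit distance
    temp = [(edit_distance(word, w), w) for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])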
▪   To understand and generate text, NLP-powered systems must be able to recognize words,
    grammar, and a whole lot of language nuances. For computers, this is easier said than done
    because they can only comprehend numbers.
▪   To bridge the gap, NLP experts developed a technique called word embeddings that convert
    words into their numerical representations. Once converted, NLP algorithms can easily
    digest these learned representations to process textual information.
▪   Word embeddings map the words as real-valued numerical vectors. It does so by tokenizing
    each word in a sequence (or sentence) and converting them into a vector space. Word
    embeddings aim to capture the semantic meaning of words in a sequence of text. It assigns
    similar numerical representations to words that have similar meanings.
Why?
▪ Capturing semantic meaning: Word embeddings allow us to quantify and categorize semantic similarities between linguistic items. They provide a rich representation of words where the semantics are embedded in the dimensions of the vector space, making it possible for algorithms to understand the relationships between words.
▪ Dimensionality reduction: In contrast to traditional bag-of-words models, where each unique word in the corpus is assigned a unique dimension, word embeddings map words into a lower-dimensional space where the dimensions represent semantic features. This makes word embeddings more computationally efficient.
Types
▪ One Hot Encoding
▪ TF-IDF
▪ Word2Vec
▪ GloVe
▪ FastText
          1. One hot encoding
▪ Sentence: I am teaching NLP in Python
▪ A dictionary is defined as the list of all unique words present in the sentence, so the dictionary may look like:
▪ Dictionary: ['I', 'am', 'teaching', 'NLP', 'in', 'Python']
▪ Therefore, the vector representation in this format, according to the above dictionary, is (see the sketch below):
▪ Vector for NLP: [0,0,0,1,0,0]
▪ Vector for Python: [0,0,0,0,0,1]
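A minimal sketch of the encoding above in plain Python (the dictionary order follows the slide):

sentence = "I am teaching NLP in Python"
dictionary = sentence.split()               # ['I', 'am', 'teaching', 'NLP', 'in', 'Python']
one_hot = {w: [1 if i == j else 0 for j in range(len(dictionary))]
           for i, w in enumerate(dictionary)}
print(one_hot["NLP"])      # [0, 0, 0, 1, 0, 0]
print(one_hot["Python"])   # [0, 0, 0, 0, 0, 1]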
        Disadvantages
▪ The size of the vector is equal to the count of unique words in the vocabulary.
▪ One-hot encoding does not capture the relationships
  between different words. Therefore, it does not convey
  information about the context
                 2. Bag-of-Words
▪ One of the popular word embedding techniques for text, where each value in the vector represents the count of a word in a document/sentence.
▪ In other words, it extracts features from the text, which we also refer to as vectorization.
▪ 2 approaches:
    ▪ Tokenization
    ▪ Vectorization
    Working of BOW
▪ Next, the sentences tokenized in the first step are further tokenized into words.
▪   The idea is to treat each document as a bag, or a collection, of
    words, and then count the frequency of each word in the document.
▪ Review 1: This movie is very scary and long
▪ Review 2: This movie is not scary and is slow
▪ Review 3: This movie is spooky and good
▪ The vocabulary consists of 11 words:
    ▪ 'This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'
▪ We can now take each of these words and mark their occurrence in the three movie reviews above with 1s and 0s. This gives us 3 vectors for the 3 reviews, e.g. Review 1 → [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0].
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()          # build the bag-of-words model
X = vectorizer.fit_transform(corpus)    # document-term count matrix

print(X.toarray())
print(vectorizer.get_feature_names_out())
3. TF-IDF
▪ Terminology
▪ Term frequency (TF)
3.1 Terminology
      ▪   t — term (word).
      ▪   d — document (set of words).
      ▪   N — number of documents in the corpus.
      ▪   corpus — the total document set.
▪   Term Frequency (TF):
    ▪ The number of times a term occurs in a document is called its term frequency.
    ▪ The weight of a term that occurs in a document is simply proportional to the term frequency: tf(t, d) = (count of t in d) / (number of words in d).
▪   Document Frequency (DF):
    ▪ The only difference is that TF is a frequency counter for a term t in document d, whereas DF is the number of documents in the set N that contain the term t.
▪   Inverse Document Frequency (IDF):
    ▪ While computing TF, all terms are considered equally important.
    ▪ However, certain terms, such as "is", "of", and "that", may appear many times but have little importance.
    ▪ We need to weigh down the frequent terms while scaling up the rare ones.
    ▪ When we compute IDF, an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
    ▪ IDF is the inverse of the document frequency, which measures the informativeness of term t. When we calculate IDF, it will be very low for the most frequent words, such as stop words like "is". That is because those words are present in almost all of the documents, and N/df will give a very low value for words like that.
    ▪ idf(t) = N/df
    ▪ If you have a large corpus, say 100,000,000 documents, the IDF value explodes.
    ▪ To avoid this, we take the log of the IDF. At query time, when a word that is not in the vocabulary occurs, its DF will be 0. Since we can't divide by 0, we smooth the value by adding 1 to the denominator: idf(t) = log(N / (df + 1)).
                TF-IDF Implementation
▪ TF-IDF is a measure used to evaluate how important a word is to a document in a collection or corpus.
▪ Imagine the term t appears 20 times in a document that contains a total of 100 words. The term frequency (TF) of t can be calculated as: tf(t) = 20 / 100 = 0.2.
▪ Assume a collection of related documents contains 10,000 documents. If 100 documents out of the 10,000 contain the term t, the inverse document frequency (IDF) of t can be calculated as: idf(t) = log(10,000 / 100) = log(100) = 2 (using log base 10).
▪ Using these two quantities, we can calculate the TF-IDF score of the term t for the document: tf-idf(t) = 0.2 × 2 = 0.4.
import pandas as pd
import numpy as np

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

# build the vocabulary from the corpus
words_set = set()
for doc in corpus:
    words_set = words_set.union(set(doc.split()))

# computing TF: one row per document, one column per word
df_tf = pd.DataFrame(np.zeros((len(corpus), len(words_set))), columns=list(words_set))
for i, doc in enumerate(corpus):
    words = doc.split()
    for w in words:
        df_tf[w][i] = df_tf[w][i] + (1 / len(words))
print(df_tf)

# computing IDF ...
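For comparison, a minimal sketch using scikit-learn's TfidfVectorizer, which computes TF-IDF scores directly (scikit-learn is an assumption here, and its exact weighting formula differs slightly from the one above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)        # documents x terms TF-IDF matrix
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))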
Continuous Bag-of-Words (CBOW)
▪ CBOW is a technique where, given the neighboring words, the center word is predicted.
▪ If our input sentence is "I am reading the book.", then the input pairs and labels for a window size of 3 would be pairs of context words (e.g. (I, reading)) labelled with the center word (am).
▪ We start with the one-hot encodings of "I" and "reading" (shape 1x5), multiplying those encodings with an encoding matrix of shape 5x3. The result is a 1x3 hidden layer.
▪ This hidden layer is then multiplied by a 3x5 decoding matrix to give us our prediction of shape 1x5. This is compared to the one-hot encoding of the actual label ("am"), of the same shape, to complete the architecture (see the sketch below).
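A minimal numpy sketch of the CBOW forward pass described above (the random matrices are untrained placeholders, so the predicted word is not meaningful yet):

import numpy as np

vocab = ["I", "am", "reading", "the", "book"]
V, D = len(vocab), 3
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(V, D))   # encoding (input) matrix, 5x3
W_dec = rng.normal(size=(D, V))   # decoding (output) matrix, 3x5

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# context words "I" and "reading" are used to predict the center word "am"
context = (one_hot("I") + one_hot("reading")) / 2   # averaged one-hot context, 1x5
hidden = context @ W_enc                            # 1x3 hidden layer
scores = hidden @ W_dec                             # 1x5 scores over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()       # softmax
print(vocab[int(np.argmax(probs))])                 # the (untrained) model's guess for the center word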
Skip-Gram Model
▪ Given the center word, we have to predict its neighboring words. Quite literally the opposite of CBOW, but more efficient.
▪ Let our given input sentence be "I am reading the book." The corresponding Skip-Gram pairs for a window size of 3 would be center-word/context-word pairs such as (am → I) and (am → reading).
▪ Vocabulary size = 5, and we will assume there are 3 embedding dimensions for simplicity.
▪ Starting with the encoding matrix, we grab the vector located at the index of our center word ("am" in this case). Transposing it, we now have a 3x1 vector representation of the word "am" (since we are directly grabbing a row of the encoding matrix, this WILL NOT be a one-hot encoding).
▪ Multiply this vector representation with the decoding matrix of shape 5x3, giving us the predicted output of shape 5x1. This vector is essentially a softmax distribution over the whole vocabulary, pointing to the indices belonging to the neighboring words of our input center word. In this case, the output should point to the indices of "I" and "reading".
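A matching numpy sketch of the Skip-Gram forward pass (again with untrained random matrices, purely to show the shapes):

import numpy as np

vocab = ["I", "am", "reading", "the", "book"]
V, D = len(vocab), 3
rng = np.random.default_rng(1)
W_enc = rng.normal(size=(V, D))                  # encoding matrix, 5x3
W_dec = rng.normal(size=(D, V))                  # decoding matrix, 3x5

center = W_enc[vocab.index("am")]                # row for "am": a dense 3-dim vector, not one-hot
scores = center @ W_dec                          # 5 scores, one per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()    # softmax over the vocabulary
# after training, the largest probabilities should fall on the context words "I" and "reading"
print(sorted(zip(probs, vocab), reverse=True)[:2])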
                                               Training Word2Vec
1.    Initialization of vectors
      1.    Initially high-dimensional, up to 1000-D.
      2.    Random initialization breaks symmetry and ensures that the model learns something useful as it starts training.
      3.    During training, based on the objective function, vectors of contextually similar words are positioned nearer to each other.
2.    Optimization techniques and Backpropagation
     1.    To capture linguistic context of words
     2.    To iteratively adjust the word vectors so that the model’s predictions align more closely with the actual context words.
     3.    Backpropagation is a method used in neural networks to calculate the gradient of the loss function with respect to the weights of the
           network. In the context of Word2Vec, backpropagation adjusts the word vectors based on the errors in predicting context words. Through
           successive iterations, the model becomes increasingly accurate in its predictions, leading to optimized word vectors.
3.    Window size
     1.   Words within the window are considered as context words, while those outside are ignored.
     2.   A smaller window size results in learning more about the word’s syntactic roles, while a larger window size helps the model understand
          the broader semantic context.
4.    Negative Sampling and Subsampling of frequent words
     1.    Negative sampling addresses the issue of computational efficiency by updating only a small percentage of the model’s weights at each
           step rather than all of them. This is done by sampling a small number of “negative” words (words not in the context) to update for each
           target word.
     2.    Subsampling of frequent words helps in improving the quality of word vectors. The basic idea is to reduce the impact of high-frequency
           words in the training process as they often carry less meaningful information compared to rare words.
      3.    By randomly discarding some instances of frequent words, the model is forced to focus more on the rare words, leading to more balanced and meaningful word vectors.
        Things to remember
GloVe Embeddings (Global Vectors)
▪   It is an unsupervised learning algorithm developed by researchers at Stanford
    University aiming to generate word embeddings by aggregating global word co-
    occurrence matrices from a given corpus.
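A minimal sketch of loading pretrained GloVe vectors through gensim's downloader (gensim and the model name "glove-wiki-gigaword-50" are assumptions, not part of the slides):

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")     # 50-dimensional pretrained GloVe vectors
print(glove["king"][:5])                       # first few components of the vector for "king"
print(glove.most_similar("king", topn=3))      # nearest neighbours in the embedding space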
FastText
▪ Word2Vec and GloVe provide distinct vector representations for the words in the vocabulary.
▪ FastText provides embeddings for character n-grams, representing words as the average of these embeddings.
▪ The Word2Vec model provides embeddings for words, whereas fastText provides embeddings for character n-grams. Like the Word2Vec model, fastText uses CBOW and Skip-gram to compute the vectors.
▪ FastText can also handle out-of-vocabulary (OOV) words, i.e., fastText can find word embeddings for words that were not present at training time.
▪ Out-of-vocabulary (OOV) words are words that do not occur in the training data and are not present in the model's vocabulary.
▪ In fastText, each word is represented as the average of the vector representations of its character n-grams along with the word itself.
▪ Consider the word "equal" and n = 3; then the word will be represented by the character n-grams (see the sketch below):
▪ < eq, equ, qua, ual, al > and < equal >
▪ The word embedding for the word "equal" can be given as the sum of all vector representations of its character n-grams and the word itself.
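A small sketch of the character n-gram extraction described above (the boundary markers "<" and ">" follow the slide's example):

def char_ngrams(word, n=3):
    padded = "<" + word + ">"                                    # add boundary markers
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("equal"))   # ['<eq', 'equ', 'qua', 'ual', 'al>']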
                                    FastText - CBOW
                       Word2vec vs Fasttext
▪   Word2Vec works on the word level, while fastText works on the character n-grams.
▪   FastText uses a hierarchical softmax classifier to train the model; hence it is faster than word2vec.
           Thank you
Dr D Paul Joseph,
Asst Prof, Sr Gr-I,
Department of Network and Security,
School of Computer Science and Engineering,
VIT-Amaravathi
pauljoseph.d@vitap.ac.in