Unit 2A
WHAT IS TOKENIZATION?
Tokenization is the process of breaking down a piece of text, like a sentence or a paragraph, into smaller units
or “tokens.” These tokens are the basic building blocks of language, and tokenization helps computers understand
and process human language by splitting it into manageable units.
For example, word-tokenizing the sentence “I love ice cream” would result in four tokens: “I,” “love,” “ice,” and “cream.”
It’s a fundamental step in natural language processing and text analysis tasks.
   •   Word Tokenization: Common for languages with clear word separation (e.g., English).
   •   Character Tokenization: Useful for languages without clear separation, or for very fine-grained analysis.
   •   Subword Tokenization: Units smaller than words but bigger than characters; useful for morphologically complex
       languages or unknown words (see the sketch below).
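A minimal Python sketch of the three granularities. The toy subword vocabulary below is hypothetical; real systems learn their vocabularies with algorithms such as BPE or WordPiece.

sentence = "I love ice cream"

# Word tokenization: split on whitespace (real tokenizers also handle punctuation).
word_tokens = sentence.split()                 # ['I', 'love', 'ice', 'cream']

# Character tokenization: every character becomes a token.
char_tokens = list(sentence.replace(" ", ""))  # ['I', 'l', 'o', 'v', 'e', ...]

# Subword tokenization (illustrative only): greedy longest-match against a toy vocabulary,
# the way BPE/WordPiece-style tokenizers split unknown words into known pieces.
vocab = {"ice", "cream", "lov", "e", "I"}

def toy_subword(word):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j]); i = j; break
        else:
            pieces.append(word[i]); i += 1     # fall back to a single character
    return pieces

print(word_tokens, char_tokens[:5], toy_subword("love"))  # toy_subword -> ['lov', 'e']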
Stemming is the process of removing the last few characters of a given word to obtain a shorter form, even if that
form is not itself a meaningful word.
Stemming and lemmatization are text preprocessing techniques in natural language processing (NLP). Specifically,
they reduce the inflected forms of words across a text data set to one common root word or dictionary form, also
known as a “lemma” in computational linguistics.
Stemming and lemmatization are particularly helpful in information retrieval systems like search engines where
users may submit a query with one word (for example, meditate) but expect results that use any inflected form of
the word (for example, meditates, meditation, etc.). Stemming and lemmatization further aim to improve text
processing in machine learning algorithms.
Lemmatization takes more time than stemming because it must find a meaningful word form, whereas stemming
only needs to strip the word down to a base string and is therefore faster. Stemming is often applied in sentiment
analysis, while lemmatization is used in applications such as chatbots and question answering. A quick comparison
in code follows.
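A side-by-side sketch using NLTK (this assumes the nltk package is installed and the WordNet data has been downloaded, e.g. via nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["meditates", "meditating"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
# Stems may not be valid words (here 'medit'); lemmas are dictionary forms ('meditate').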
STEMMING VS LEMMATIZATION
Stemming: a process that stems or removes the last few characters from a word, often leading to incorrect
meanings and spellings. For instance, stemming the word ‘Caring’ would return ‘Car’. Stemming is used on large
datasets where performance is an issue.
Lemmatization: considers the context and converts the word to its meaningful base form, which is called the
lemma. For instance, lemmatizing the word ‘Caring’ would return ‘Care’. Lemmatization is computationally
expensive since it involves look-up tables and additional context analysis.
INTRODUCTION TO MORPHOLOGICAL ANALYSIS
Morphology is the branch of linguistics concerned with the structure and form of words in a language.
Morphological analysis, in the context of NLP, refers to the computational processing of word structures. It aims to
break down words into their constituent parts, such as roots, prefixes, and suffixes, and understand their roles and
meanings. This process is essential for various NLP tasks, including language modeling, text analysis, and machine
translation.
   •   Understanding Word Formation: It helps in identifying the basic building blocks of words, which is crucial
       for language comprehension.
   •   Improving Text Analysis: By breaking down words into their roots and affixes, it enhances the accuracy of
       text analysis tasks like sentiment analysis and topic modeling.
   •   Enhancing Language Models: Morphological analysis provides detailed insights into word formation,
       improving the performance of language models used in tasks like speech recognition and text generation.
   •   Facilitating Multilingual Processing: It aids in handling the morphological diversity of different languages,
       making NLP systems more robust and versatile.
Morphological analysis involves breaking down words into their constituent morphemes (the smallest units of
meaning) and understanding their structure and formation. Various techniques can be employed to perform
morphological analysis, each with its own strengths and applications.
1. Stemming
Stemming reduces words to their base or root form, usually by removing suffixes. The resulting stems are not
necessarily valid words but are useful for text normalization.
   •   Porter Stemmer: One of the most popular stemming algorithms, known for its simplicity and efficiency.
   •   Snowball Stemmer: An improvement over the Porter Stemmer, supporting multiple languages.
   •   Lancaster Stemmer: A more aggressive stemming algorithm, often resulting in shorter stems.
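The three stemmers can be compared side by side with NLTK (assuming nltk is installed); the outputs are indicative, with Lancaster usually producing the shortest stems:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for word in ["generously", "caring", "happiness"]:
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))
# The three stems often differ; Lancaster is typically the most aggressive.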
2. Lemmatization
Lemmatization reduces words to their base or dictionary form (the lemma). It considers the context and part of
speech, producing valid words. In Python, lemmatization is commonly implemented with the WordNet Lemmatizer,
which leverages the WordNet lexical database to find the base form of words.
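A minimal sketch with NLTK's WordNetLemmatizer (assumes the WordNet corpus has been downloaded); note how the part-of-speech argument changes the result:

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("caring", pos="v"))  # 'care'   -- treated as a verb
print(lem.lemmatize("caring", pos="n"))  # 'caring' -- no noun lemma change
print(lem.lemmatize("better", pos="a"))  # 'good'   -- adjective lemma from WordNet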
3. Morphological Parsing
Morphological parsing involves analyzing the structure of words to identify their morphemes (roots, prefixes,
suffixes). It requires knowledge of morphological rules and patterns. Finite-State Transducers (FSTs) are commonly
used as a tool for morphological parsing.
Finite-State Transducers (FSTs)
FSTs are computational models used to represent and analyze the morphological structure of words. They consist
of states and transitions, capturing the rules of word formation.
4. Neural Network-Based Methods
Neural approaches learn morphological patterns directly from data rather than from hand-written rules. Commonly
used architectures include:
   •   Recurrent Neural Networks (RNNs): Useful for sequential data like text.
   •   Convolutional Neural Networks (CNNs): Can capture local patterns in the text.
   •   Transformers: Advanced models like BERT and GPT that understand context and semantics.
5. Rule-Based Methods
Rule-based methods rely on manually defined linguistic rules for morphological analysis. These rules can handle
specific language patterns and exceptions.
Applications:
   •   Affix Stripping: Removing known prefixes and suffixes to find the root form.
   •   Inflectional Analysis: Identifying grammatical variations like tense, number, and case.
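As a toy illustration of affix stripping, the following sketch removes one known prefix and one known suffix; the rule lists are hypothetical and far smaller than a real rule-based analyzer's:

PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ment", "ing", "ed", "s"]

def strip_affixes(word):
    # Remove at most one known prefix, then one known suffix, longest match first,
    # keeping at least three characters of stem.
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(strip_affixes("unhappiness"))  # 'happi' -- crude, like a stemmer
print(strip_affixes("replacement"))  # 'place'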
More broadly, the techniques above are applied to tasks such as:
    •   Morphological Segmentation: Breaking words into morphemes.
    •   Part-of-Speech Tagging: Assigning parts of speech to each word in a sentence.
    •   Sequence Prediction: Predicting the most likely sequence of morphemes for a given word.
INFLECTIONAL MORPHOLOGY
Inflectional morphology creates new forms of the same word, whereby the new forms agree with the tense, case,
voice, aspect, person, number, gender and the mood of an utterance.
The inflection of verbs is called conjugation, whereas the inflection of nouns, adjectives, prepositions, adverbs
and articles is called declension. The inflected form of a word often contains both free morphemes (units of
meaning which can stand by themselves as words) and bound morphemes (units of meaning which cannot exist
independently as words). Words which never undergo the process of inflection are said to be
invariant; for example, the English verb must is an invariant item: it never takes a suffix or changes form to signify a
different grammatical category. Its categories can be determined only from its context.
Consider five example sentences built around forms of the verb to perform. In sentences 1, 2, 3 and 4 the verb changes its form in accordance with tense, aspect and mood.
In sentence 1, suffix -ed is added to the root perform to express past tense.
In sentence 2, the suffix -s is added to the root perform because it is grammatically conditioned by the
third-person singular subject, the violinist.
In sentence 4, -ing form is attached to the root of the verb to express an action which is going on.
On the other hand, in sentence 5, the word performance is not a verb; it is a noun. In the first four sentences, the
grammatical category (verb) remains constant, whereas in the last sentence the grammatical category changes
from verb to noun.
Sentences 1 to 4 are instances of inflection, while sentence 5 is an instance of derivation. We will now discuss
inflectional morphology in detail. We will come back to derivational morphology later.
DERIVATIONAL MORPHOLOGY
Derivational morphemes are affixes which are added to a lexeme to change its meaning or function. A major
difference between inflectional and derivational morphology is that the latter does not only change the form of a
word; it also changes its category and meaning. Derivation involves two significant processes: affixation and
compounding. Compounding occurs when two or more words are joined together to form a meaningful longer
word. For example: scare + crow = scarecrow; post + office = post office; maid + servant = maidservant;
bitter + sweet = bittersweet. Semantically speaking, compound words can be of four types: exocentric,
endocentric, copulative and appositional.
One common application of Finite-State Transducers (FSTs) in Natural Language Processing (NLP) is
morphological analysis, which involves analyzing the structure and meaning of words at the morpheme level.
Here is an explanation of the application of FSTs in morphological analysis, with an example of stemming using a
finite-state transducer for English.
Let’s consider an English stemming example where we want to reduce words to their stems. We’ll build a simple
FST for English stemming. Our FST will have states representing the process of removing common English
suffixes.
Step-by-Step Explanation:
Start by defining the states of the FST, representing different stages of stemming.
Example transitions:
State 0: Initial state
Transition: If the input ends with “ing”, remove “ing” and transition to state 1.
Transition: If the input ends with “ly”, remove “ly” and transition to state 2.
Based on the defined states and transitions, construct the FST using a tool like OpenFST or write code to
implement the FST.
The FST traverses through the states according to the input word and transitions until it reaches a final state,
outputting the stemmed word.
1. Input: “running”
FST transitions: State 0 (input: “running”) → State 1 (remove “ing”) → State 2 (output: “run”)
2. Input: “quickly”
FST transitions: State 0 (input: “quickly”) → State 1 (no “ing”) → State 2 (remove “ly”) → State 3 (output:
“quick”)
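The toy FST above can be simulated in a few lines of Python. This sketch implements only the two suffix transitions described, so unlike the worked example it does not repair the doubled consonant in "running":

def fst_stem(word):
    # State 0: initial state; inspect the word ending.
    if word.endswith("ing"):
        return word[:-3]   # transition: remove "ing", move to a final state
    if word.endswith("ly"):
        return word[:-2]   # transition: remove "ly", move to a final state
    return word            # no transition fires; output the word unchanged

print(fst_stem("running"))  # 'runn' -- a fuller FST would also undouble the n
print(fst_stem("quickly"))  # 'quick'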
REGULAR EXPRESSIONS
A regular expression (RE) is a language for specifying text search strings. RE helps us to match or find other strings
or sets of strings, using a specialized syntax held in a pattern. Regular expressions are used to search text in UNIX
tools as well as in applications such as MS Word, and many search engines support a number of RE features.
    •   ε is a Regular Expression, which indicates that the language is having an empty string.
    •   φ is a Regular Expression which denotes that it is an empty language.
    •   If X and Y are Regular Expressions, then the following are also Regular Expressions:
    •     X.Y (concatenation of X and Y)
    •     X+Y (union of X and Y)
    •     X*, Y* (Kleene closure of X and Y)
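For instance, Python's re module (standard library) supports these operations, with | for union and * for the Kleene closure:

import re

text = "meditate meditates meditation meditating"
# 'meditat' followed by zero or more word characters:
print(re.findall(r"meditat\w*", text))
# ['meditate', 'meditates', 'meditation', 'meditating']

# Union (|) and Kleene star (*) mirror the X+Y and X* operations above:
print(re.fullmatch(r"(ab)*", "ababab") is not None)  # True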
An automaton having a finite number of states is called a Finite Automaton (FA) or Finite-State Automaton (FSA).
TYPES OF FINITE STATE AUTOMATA (FSA): Finite state automata are of two types: Deterministic Finite Automata
(DFA) and Non-deterministic Finite Automata (NDFA).
Graphically, a DFA can be represented by digraphs called state diagrams, in which the vertices represent the states
and the arcs, labeled with input symbols, represent the transitions.
Example of DFA
Suppose a DFA is given by
   •   Q = {a, b, c},
   •   Σ = {0, 1},
   •   q0 = a,
   •   F = {c},
with the transition function δ shown in the table below.

 Current State    Next State for Input 0    Next State for Input 1
 a                a                         b
 b                b                         a
 c                c                         c
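A short Python simulation of this DFA using the transition table above; note that state c is unreachable from a, so this particular machine accepts no strings:

# Transition function delta as a dictionary: (state, input symbol) -> next state.
delta = {
    ("a", "0"): "a", ("a", "1"): "b",
    ("b", "0"): "b", ("b", "1"): "a",
    ("c", "0"): "c", ("c", "1"): "c",
}

def accepts(string, start="a", finals=("c",)):
    state = start
    for symbol in string:
        state = delta[(state, symbol)]  # deterministic: exactly one next state
    return state in finals

print(accepts("0110"))  # False -- no input ever reaches the final state c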
Graphically (as with a DFA), an NDFA can be represented by digraphs called state diagrams, with the difference
that a state may have several possible next states for the same input symbol.
Example of NDFA
Suppose an NDFA is given by
   •   Q = {a, b, c},
   •   Σ = {0, 1},
   •   q0 = a,
   •   F = {c},
with the transition function δ shown in the table below.

 Current State    Next State for Input 0    Next State for Input 1
 a                a, b                      b
 b                c                         a, c
 c                b, c                      c
In linguistics (the study of language and its structure), a stem is the part of a word that is common to
all of its inflected variants.
   •   CONNECT
   •   CONNECTED
   •   CONNECTION
   •   CONNECTING
Above words are inflected variants of CONNECT. Hence, CONNECT is a stem. To this stem we
can add different suffixes to form different words.
The process of reducing such inflected (or sometimes derived) words to their word stem is known
as Stemming. For example, CONNECTED, CONNECTION and CONNECTING can be reduced to
the stem CONNECT.
The Porter Stemming algorithm (or Porter Stemmer) is used to remove the suffixes from an
English word and obtain its stem which becomes very useful in the field of Information
Retrieval (IR). This process reduces the number of terms kept by an IR system, which is
advantageous both in terms of space and time complexity. First, a few terms and expressions will
be introduced, which will be helpful for the ease of explanation.
A list of one or more consecutive consonants (ccc…) will be denoted by C, and a list of one or
more consecutive vowels (vvv…) will be denoted by V. Any word, or part of a word, therefore has
one of the four forms given below.
                                           [C]VCVC … [V]
Here the square brackets denote arbitrary presence of consonants or vowels.
(VC)^m denotes VC repeated m times. So the above expression can be written as

                                           [C](VC)^m[V]
What is m?
The value m found in the above expression is called the measure of any word or word part when
represented in the form [C](VC)^m[V]. Here are some examples for different values of m:

m = 0: TR, EE, TREE, Y, BY
m = 1: TROUBLE, OATS, TREES, IVY
m = 2: TROUBLES, PRIVATE, OATEN, ORRERY
Rules
The rules for replacing (or removing) a suffix will be given in the form as shown below.
                                      (condition) S1 → S2
This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given
condition, S1 is replaced by S2. The condition is usually given in terms of m in regard to the stem
before S1.
(m > 1) EMENT →
Here S1 is ‘EMENT’ and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a
word part for which m = 2.
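As a concrete check of the measure, here is a small Python sketch (treating a, e, i, o, u as vowels; Porter's context-dependent handling of 'y' is omitted for simplicity):

import re

def measure(stem):
    # Map letters to C/V, collapse runs (e.g. 'CCVVC' -> 'CVC'), then count VC pairs.
    forms = "".join("V" if ch in "aeiou" else "C" for ch in stem.lower())
    collapsed = re.sub(r"(.)\1+", r"\1", forms)
    return collapsed.count("VC")

print(measure("tr"), measure("trouble"), measure("troubles"))  # 0 1 2
print(measure("replac"))  # 2 -> so rule (m > 1) EMENT applies to REPLACEMENT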
Conditions
The conditions may contain the following:
   •   *S   –   the stem ends with S (and similarly for the other letters)
   •   *v* –    the stem contains a vowel
   •   *d   –   the stem ends with a double consonant (e.g. -TT, -SS)
   •   *o   –   the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)
And the condition part may also contain expressions with and, or and not.
(m>1 and (*S or *T)) tests for a stem with m>1 ending in S or T.
(*d and not (*L or *S or *Z)) tests for a stem ending with a double consonant and does not end
with letters L, S or Z.
   1. SSES        →       SS
   2. IES         →       I
   3. SS          →       SS
   4. S           →
(Here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for
S1. Equally CARESS maps to CARESS (since S1=”SS”) and CARES to CARE (since S1=”S”).
The Algorithm
Step 1a
   5. SSES        →       SS
   6. IES         →       I
   7. SS          →       SS
   8. S           →
Step 1b
   9. (m>0) EED           →      EE
   10. (*v*) ED           →
   11. (*v*) ING          →
If the second or third of the rules in Step 1b is successful, the following is performed.
 12. AT        →        ATE
 13. BL        →        BLE
 14. IZ        →        IZE
 15. (*d and not (*L or *S or *Z))        →    single letter
 16. (m=1 and *o)        →            E
Step 1c
 17. (*v*) Y        →         I
Step 2
 18. (m>0) ATIONAL                →           ATE
 19. (m>0) TIONAL                 →           TION
 20. (m>0) ENCI                   →           ENCE
 21. (m>0) ANCI                   →           ANCE
 22. (m>0) IZER                   →           IZE
 23. (m>0) ABLI                   →           ABLE
 24. (m>0) ALLI                   →           AL
 25. (m>0) ENTLI                  →           ENT
 26. (m>0) ELI                    →           E
 27. (m>0) OUSLI                  →           OUS
 28. (m>0) IZATION                →            IZE
 29. (m>0) ATION                  →           ATE
 30. (m>0) ATOR                   →           ATE
 31. (m>0) ALISM                  →           AL
 32. (m>0) IVENESS                →           IVE
 33. (m>0) FULNESS                →           FUL
 34. (m>0) OUSNESS                →           OUS
 35. (m>0) ALITI                  →           AL
 36. (m>0) IVITI                  →           IVE
 37. (m>0) BILITI                 →           BLE
Step 3
 38. (m>0) ICATE                  →           IC
 39. (m>0) ATIVE                  →
 40. (m>0) ALIZE                  →           AL
 41. (m>0) ICITI                  →           IC
   42. (m>0) ICAL                 →               IC
   43. (m>0) FUL                  →
   44. (m>0) NESS                 →
Step 4
   45. (m>1) AL                   →
   46. (m>1) ANCE                 →
   47. (m>1) ENCE                 →
   48. (m>1) ER                   →
   49. (m>1) IC                   →
   50. (m>1) ABLE                 →
   51. (m>1) IBLE                 →
   52. (m>1) ANT                  →
   53. (m>1) EMENT                →
   54. (m>1) MENT                 →
   55. (m>1) ENT                  →
   56. (m>1 and (*S or *T)) ION        →
   57. (m>1) OU                 →
   58. (m>1) ISM                  →
   59. (m>1) ATE                  →
   60. (m>1) ITI                  →
   61. (m>1) OUS                  →
   62. (m>1) IVE                  →
   63. (m>1) IZE                  →
Step 5a
   64. (m>1) E                    →
   65. (m=1 and not *o) E         →
Step 5b
   66. (m > 1 and *d and *L)      →         single letter
For each word you input to the algorithm, all the steps from 1 to 5 will be executed and the output
will be produced at the end.
Example Inputs
Let’s consider a few example inputs and check what their stem outputs will be.
Example 1
In the first example, we input the word MULTIDIMENSIONAL to the Porter Stemming algorithm.
Let’s see what happens as the word goes through steps 1 to 5.
   •   The suffix will not match any of the cases found in steps 1, 2 and 3.
   •   Then it comes to step 4.
   •   The stem of the word has m > 1 (since m = 5) and ends with “AL”.
   •   Hence in step 4, “AL” is deleted (replaced with null).
   •   Calling step 5 will not change the stem further.
   •   Finally the output will be MULTIDIMENSION.
             MULTIDIMENSIONAL → MULTIDIMENSION
Example 2
In the second example, we input the word CHARACTERIZATION to the Porter Stemming
algorithm. Let’s see what happens as the word goes through steps 1 to 5.
   •   The suffix will not match any of the cases found in step 1.
   •   So it will move to step 2.
   •   The stem of the word has m > 0 (since m = 3) and ends with “IZATION”.
   •   Hence in step 2, “IZATION” will be replaced with “IZE”.
   •   Then the new stem will be CHARACTERIZE.
   •   Step 3 will not match any of the suffixes and hence will move to step 4.
   •   Now m > 1 (since m = 3) and the stem ends with “IZE”.
   •   So in step 4, “IZE” will be deleted (replaced with null).
   •   No change will happen to the stem in other steps.
   •   Finally the output will be CHARACTER.
Recall that rules are written in the form (condition) S1 → S2, and that within each rule set, if more than one rule
can apply, only the one with the longest matching suffix (S1) is followed. Two further sample mappings:
ATOR → ATE turns OPERATOR into OPERATE, and FULNESS → FUL turns GRATEFULNESS into GRATEFUL.
WHAT IS TF-IDF?
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language
processing and information retrieval. It measures how important a term is within a document relative to a
collection of documents (i.e., relative to a corpus).
Words within a text document are transformed into importance numbers by a text vectorization process. There are
many different text vectorization scoring schemes, with TF-IDF being one of the most common.
Term Frequency: TF of a term or word is the number of times the term appears in a document compared to the
total number of words in the document.
        TF = (number of times the term appears in the document) / (total number of terms in the document)
Inverse Document Frequency: IDF of a term reflects the proportion of documents in the corpus that contain the
term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance
values than words common across all documents (e.g., a, the, and).
        IDF = log( (total number of documents in the corpus) / (number of documents in the corpus that contain the term) )

        TF-IDF = TF × IDF
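A from-scratch sketch that follows the formulas above (library implementations such as scikit-learn's TfidfVectorizer use smoothed variants, so exact values differ):

import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

def tf(term, doc):
    # Term count relative to document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Log of total documents over documents containing the term (natural log here).
    df = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / df)

print(tf("cat", docs[0]) * idf("cat", docs))  # > 0: 'cat' is somewhat distinctive
print(tf("the", docs[0]) * idf("the", docs))  # 0.0: 'the' occurs in every document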
N-GRAM
N-grams, a fundamental concept in NLP, play a pivotal role in capturing patterns and relationships within a
sequence of words. In this section, we look at their significance, their applications, and how they contribute to
enhancing language processing tasks.
Understanding N-grams:
Definition:
N-grams are contiguous sequences of ’n’ items, typically words in the context of NLP. These items can be
characters, words, or even syllables, depending on the granularity desired. The value of ’n’ determines the order of
the N-gram.
Examples:
For the sentence “Natural Language Processing”, the unigrams (n = 1) are “Natural”, “Language” and “Processing”;
the bigrams (n = 2) are “Natural Language” and “Language Processing”; and the only trigram (n = 3) is
“Natural Language Processing” itself.
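N-grams are easy to extract with a sliding window, as in this minimal sketch:

def ngrams(tokens, n):
    # All contiguous windows of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Natural Language Processing".split()
print(ngrams(tokens, 1))  # [('Natural',), ('Language',), ('Processing',)]
print(ngrams(tokens, 2))  # [('Natural', 'Language'), ('Language', 'Processing')]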
N-gram language models are evaluated using measures such as entropy, cross-entropy and perplexity. The
cross-entropy is always greater than or equal to the entropy, i.e. the model’s uncertainty can be no less than
the true uncertainty.
• Let’s take an example with the sentence ‘Natural Language Processing’. For predicting the first word, suppose
   the candidate words have the following probabilities:

 word          P(word)
 The           0.4
 Processing    0.3
 Natural       0.12
 Language      0.18
• The probability of the first word being ‘Natural’ is therefore 0.12. Next, what is the probability of the word
   ‘Language’ coming after the word ‘Natural’?

 word          P(word | ‘Natural’)
 The           0.05
 Processing    0.3
 Natural       0.15
 Language      0.5
• After generating the words ‘Natural Language’, what is the probability of getting ‘Processing’?

 word          P(word | ‘Natural Language’)
 The           0.1
 Processing    0.7
 Natural       0.1
 Language      0.1
The probability of the whole sentence is P = 0.12 × 0.5 × 0.7 = 0.042, so the perplexity is
(1 / 0.042)^(1/3) ≈ 2.876, and the entropy is log2(2.876) ≈ 1.524.
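The arithmetic above can be reproduced in a few lines (the word probabilities are the illustrative values from the tables, not from a real model):

import math

p_sentence = 0.12 * 0.5 * 0.7            # P('Natural Language Processing')
perplexity = (1 / p_sentence) ** (1 / 3)  # geometric-mean inverse probability over 3 words
entropy = math.log2(perplexity)           # entropy in bits per word

print(p_sentence, perplexity, entropy)
# 0.042, approximately 2.876 and 1.524 (up to rounding), matching the values above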