
UNIT - 2

WHAT IS TOKENIZATION?
Tokenization is the process of breaking down a piece of text, like a sentence or a paragraph, into individual words
or “tokens.” These tokens are the basic building blocks of language, and tokenization helps computers understand
and process human language by splitting it into manageable units.

For example, tokenizing the sentence "I love ice cream" by words would result in four tokens: "I," "love," "ice," and "cream." It's a fundamental step in natural language processing and text analysis tasks.

TYPES OF TOKENIZATION IN NLP

Here are the types of tokenization in NLP (a brief word-tokenization sketch follows the list):

• Word Tokenization: Common for languages with clear separation (e.g., English).
• Character Tokenization: Useful for languages without clear separation or for very detailed analysis.
• Subword Tokenization: Units smaller than words but bigger than characters (useful for complex languages or unknown words).
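
As a quick illustration, here is a minimal word- and character-tokenization sketch in Python, assuming NLTK is installed and its punkt tokenizer data has been downloaded:

```python
# Word and character tokenization (a sketch; assumes nltk is installed
# and nltk.download("punkt") has been run).
from nltk.tokenize import word_tokenize

sentence = "I love ice cream"
print(word_tokenize(sentence))  # ['I', 'love', 'ice', 'cream']
print(list(sentence))           # character tokens (spaces included)
```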

WHAT IS STEMMING IN NLP?


It is the process of reducing inflected words to their stem. For instance, stemming will replace the words "history" and "historical" with "histori"; similarly for the words "finally" and "final".

Stemming is the process of removing the last few characters of a given word to obtain a shorter form, even if that form is not itself a meaningful word.
Stemming and lemmatization are text preprocessing techniques in natural language processing (NLP). Specifically, they reduce the inflected forms of words across a text data set to one common root word or dictionary form, also known as a "lemma" in computational linguistics.

Stemming and lemmatization are particularly helpful in information retrieval systems like search engines where
users may submit a query with one word (for example, meditate) but expect results that use any inflected form of
the word (for example, meditates, meditation, etc.). Stemming and lemmatization further aim to improve text
processing in machine learning algorithms.

WHAT IS LEMMATIZATION IN NLP?


The purpose of lemmatization is the same as that of stemming, but it overcomes the drawbacks of stemming. For some words, stemming may not produce a meaningful representation, such as "histori". This is where lemmatization comes into the picture, since it always yields a meaningful word.

Lemmatization takes more time than stemming because it must find a meaningful word or dictionary form; stemming only needs to strip a word down to a base string and is therefore faster. Stemming is commonly applied in sentiment analysis, while lemmatization is used in chatbots and question-answering systems.

STEMMING VS LEMMATIZATION
Stemming:
• Stems a word by removing its last few characters, often leading to incorrect meanings and spellings.
• For instance, stemming the word 'Caring' would return 'Car'.
• Used on large datasets where performance is an issue.

Lemmatization:
• Considers the context and converts the word to its meaningful base form, which is called the lemma.
• For instance, lemmatizing the word 'Caring' would return 'Care'.
• Computationally more expensive, since it involves dictionary look-ups.

INTRODUCTION TO MORPHOLOGICAL ANALYSIS
Morphology is the branch of linguistics concerned with the structure and form of words in a language.
Morphological analysis, in the context of NLP, refers to the computational processing of word structures. It aims to
break down words into their constituent parts, such as roots, prefixes, and suffixes, and understand their roles and
meanings. This process is essential for various NLP tasks, including language modeling, text analysis, and machine
translation.

Importance of Morphological Analysis

Morphological analysis is a critical step in NLP for several reasons:

• Understanding Word Formation: It helps in identifying the basic building blocks of words, which is crucial
for language comprehension.
• Improving Text Analysis: By breaking down words into their roots and affixes, it enhances the accuracy of
text analysis tasks like sentiment analysis and topic modeling.
• Enhancing Language Models: Morphological analysis provides detailed insights into word formation,
improving the performance of language models used in tasks like speech recognition and text generation.
• Facilitating Multilingual Processing: It aids in handling the morphological diversity of different languages,
making NLP systems more robust and versatile.

Key Techniques used in Morphological Analysis for NLP Tasks

Morphological analysis involves breaking down words into their constituent morphemes (the smallest units of
meaning) and understanding their structure and formation. Various techniques can be employed to perform
morphological analysis, each with its own strengths and applications.

Here are some of the key techniques used in morphological analysis:

1. Stemming

Stemming reduces words to their base or root form, usually by removing suffixes. The resulting stems are not
necessarily valid words but are useful for text normalization.

Common ways to implement stemming in Python (a short comparison sketch follows this list):

• Porter Stemmer: One of the most popular stemming algorithms, known for its simplicity and efficiency.
• Snowball Stemmer: An improvement over the Porter Stemmer, supporting multiple languages.
• Lancaster Stemmer: A more aggressive stemming algorithm, often resulting in shorter stems.
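
A minimal sketch comparing the three stemmers, assuming NLTK is installed (exact outputs can vary slightly between NLTK versions):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

word = "historical"
print(PorterStemmer().stem(word))             # e.g. 'histor'
print(SnowballStemmer("english").stem(word))  # e.g. 'histor'
print(LancasterStemmer().stem(word))          # e.g. 'hist' (more aggressive)
```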

2. Lemmatization
Lemmatization reduces words to their base or dictionary form (lemma). It considers the context and part of speech, producing valid words. In Python, lemmatization is commonly implemented with the WordNet Lemmatizer, which leverages the WordNet lexical database to find the base form of words.
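
A minimal sketch, assuming NLTK is installed and the WordNet corpus has been downloaded via nltk.download("wordnet"):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("caring", pos="v"))  # 'care' (pos="v" marks a verb)
print(lemmatizer.lemmatize("histories"))        # 'history' (default pos is noun)
```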

3. Morphological Parsing
Morphological parsing involves analyzing the structure of words to identify their morphemes (roots, prefixes, suffixes). It requires knowledge of morphological rules and patterns. Finite-State Transducers (FSTs) are commonly used as a tool for morphological parsing.
Finite-State Transducers (FSTs)

FSTs are computational models used to represent and analyze the morphological structure of words. They consist
of states and transitions, capturing the rules of word formation.

Applications:

• Morphological Analysis: Parsing words into their morphemes.


• Morphological Generation: Generating word forms from morphemes.

4. Neural Network Models


Neural network models, especially deep learning models, can be trained to perform morphological analysis by
learning patterns from large datasets.

Types of Neural Networks

• Recurrent Neural Networks (RNNs): Useful for sequential data like text.
• Convolutional Neural Networks (CNNs): Can capture local patterns in the text.
• Transformers: Advanced models like BERT and GPT that understand context and semantics.

5. Rule-Based Methods

Rule-based methods rely on manually defined linguistic rules for morphological analysis. These rules can handle
specific language patterns and exceptions.

Applications:

• Affix Stripping: Removing known prefixes and suffixes to find the root form.
• Inflectional Analysis: Identifying grammatical variations like tense, number, and case.

6. Hidden Markov Models (HMMs)


Hidden Markov Models (HMMs) are probabilistic models that can be used to analyze sequences of data, such as
morphemes in words. HMMs consist of a set of hidden states, each representing a possible state of the system, and
observable outputs generated from these states. In the context of morphological analysis, HMMs can be used to
model the probabilistic relationships between sequences of morphemes, helping to predict the most likely sequence
of morphemes for a given word.

Components of Hidden Markov Models (HMMs):

• States: Represent different parts of words (e.g., prefixes, roots, suffixes).


• Observations: The actual characters or morphemes in the words.
• Transition Probabilities: Probabilities of moving from one state to another.
• Emission Probabilities: Probabilities of an observable output being generated from a state.

Applications:
• Morphological Segmentation: Breaking words into morphemes.
• Part-of-Speech Tagging: Assigning parts of speech to each word in a sentence.
• Sequence Prediction: Predicting the most likely sequence of morphemes for a given word.
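
To make these components concrete, here is a minimal sketch that scores one hypothetical prefix/root/suffix segmentation of the word "unkindly"; every probability in it is invented purely for illustration:

```python
# Toy HMM components for morphological segmentation (all numbers made up).
start_p = {"prefix": 0.4, "root": 0.6}                        # initial state probabilities
trans_p = {("prefix", "root"): 0.9, ("root", "suffix"): 0.7}  # transition probabilities
emit_p = {("prefix", "un"): 0.5, ("root", "kind"): 0.2,
          ("suffix", "ly"): 0.6}                              # emission probabilities

def sequence_probability(morphemes, states):
    # P(states, morphemes) = start * emit * product of (transition * emit)
    p = start_p[states[0]] * emit_p[(states[0], morphemes[0])]
    for i in range(1, len(morphemes)):
        p *= trans_p[(states[i - 1], states[i])] * emit_p[(states[i], morphemes[i])]
    return p

print(sequence_probability(["un", "kind", "ly"],
                           ["prefix", "root", "suffix"]))  # 0.01512
```

In a real system these probabilities would be estimated from a segmented corpus, and the Viterbi algorithm would search over all candidate state sequences rather than scoring just one.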

INFLECTIONAL MORPHOLOGY
Inflectional morphology creates new forms of the same word, whereby the new forms agree with the tense, case, voice, aspect, person, number, gender and mood of an utterance.

The inflection of verbs is called conjugation, whereas the inflection of nouns, adjectives, prepositions, adverbs and articles is called declension. The inflected form of a word often contains both one or more free morphemes
(a unit of meaning which can stand by itself as a word), and one or more bound morphemes (a unit of meaning
which cannot exist independently as a word). Words which never undergo the process of inflection are said to be
invariant; for example, the English verb must is an invariant item: it never takes a suffix or changes form to signify a
different grammatical category. Its categories can be determined only from its context.

1. The violinist performed in the plaza yesterday.

2. The violinist performs in the plaza everyday.

3. The violinist will perform in the plaza next week.

4. The violinist is performing in the plaza right now.

5. The performance of the violinist was magnificent.

In sentences 1, 2, 3 and 4, the verb to perform changes its form in accordance with tense, aspect and mood.

In sentence 1, suffix -ed is added to the root perform to express past tense.

In sentence 2, the suffix -s is added to the root perform because it is grammatically conditioned by the third-person singular subject, the violinist.

In sentence 3, a modal + verb construction is introduced to express an act in the future.

In sentence 4, the -ing form is attached to the root of the verb to express an action which is going on.

On the other hand, in sentence 5, the word performance is not a verb; it is a noun. In the first four sentences the grammatical category (verb) remains constant, whereas in the last sentence the grammatical category changes from verb to noun.

Sentences 1 to 4 are instances of inflection, while sentence 5 is an instance of derivation. We discuss derivational morphology next.
DERIVATIONAL MORPHOLOGY
Derivational morphemes are affixes which are added to a lexeme to change its meaning or function. One major difference which distinguishes inflectional morphology from derivational morphology is that the latter does not only change the form of a word; it can also change the category and the meaning of the word. Derivation basically involves two significant processes: affixation and compounding. Compounding occurs when two or more words are joined together to form a meaningful longer word. For example: scare + crow = scarecrow; post + office = post office; maid + servant = maidservant; bitter + sweet = bittersweet. Semantically speaking, compound words can be of four types: exocentric, endocentric, copulative and appositional.

FINITE STATE TRANSDUCERS


Finite State Transducers (FSTs) are like smart helpers that work with words and sentences. Imagine you’re typing
on your phone and make a mistake. The autocorrect feature that suggests the right word? That’s thanks to FSTs.
They also help virtual assistants like Siri or Alexa understand what you’re asking them to do. Another cool thing
they do is translate languages in apps like Google Translate.

Step by Step working of Finite State Transducer in NLP

One common application of Finite-State Transducers (FSTs) in Natural Language Processing (NLP) is
morphological analysis, which involves analyzing the structure and meaning of words at the morpheme level.
Here is an explanation of the application of FSTs in morphological analysis, with an example of stemming using a finite-state transducer for English.

Stemming with FSTs


Stemming is the process of reducing words to their root or base form, often by removing affixes such as prefixes
and suffixes. FSTs can be used to perform stemming efficiently by defining rules for stripping affixes and producing
the stem of a word.

Example: English Stemming with an FST

Let's consider an English stemming example where we want to reduce words to their stems. We'll build a simple FST for English stemming. Our FST will have states representing the process of removing common English suffixes.

Step-by-Step Explanation:

Step 1. Define the FST’s States and Transitions

Start by defining the states of the FST, representing different stages of stemming.

Define transitions between states based on rules for removing suffixes.

Example transitions:
State 0: Initial state

Transition: If the input ends with “ing”, remove “ing” and transition to state 1.

State 1: “ing” suffix removed

Transition: If the input ends with “ly”, remove “ly” and transition to state 2.

State 2: “ly” suffix removed

Final state: Output the stemmed word

Step 2. Construct the FST

Based on the defined states and transitions, construct the FST using a tool like OpenFST or write code to
implement the FST.

Step 3. Apply the FST to Input Words:

Given an input word, apply the FST to find the stem.

The FST traverses through the states according to the input word and transitions until it reaches a final state,
outputting the stemmed word.

Example Input and Output:

1. Input: “running”

FST transitions: State 0 (input: "running") → State 1 (remove "ing"; with double-letter cleanup the stem becomes "run") → final state (output: "run")

2. Input: "quickly"

FST transitions: State 0 (input: "quickly") → State 1 (no "ing" to remove) → State 2 (remove "ly") → final state (output: "quick")
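
A minimal Python sketch of this cascaded suffix-stripper; the double-letter cleanup is an added assumption so that "running" yields "run" as in the example above:

```python
def fst_stem(word):
    # State 0 -> State 1: strip a trailing "ing" if present
    if word.endswith("ing"):
        word = word[:-3]
        if len(word) > 1 and word[-1] == word[-2]:  # cleanup: "runn" -> "run"
            word = word[:-1]
    # State 1 -> State 2: strip a trailing "ly" if present
    if word.endswith("ly"):
        word = word[:-2]
    return word

print(fst_stem("running"))  # 'run'
print(fst_stem("quickly"))  # 'quick'
```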

REGULAR EXPRESSIONS
A regular expression (RE) is a language for specifying text search strings. REs help us match or find strings, or sets of strings, using a specialized syntax held in a pattern. Regular expressions are used to search text in UNIX as well as in MS Word in much the same way, and many search engines support a number of RE features.
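
For instance, a minimal sketch of RE-based search with Python's re module (the pattern and text are only illustrative):

```python
import re

text = "She walked, walking past the walkway, and walks on."
print(re.findall(r"walk\w*", text))
# ['walked', 'walking', 'walkway', 'walks']
```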

A Regular Expression can be defined as follows −

• ε is a Regular Expression, which denotes the language containing only the empty string.
• φ is a Regular Expression, which denotes the empty language.
• If X and Y are Regular Expressions, then the following are also Regular Expressions:
• X.Y (concatenation of X and Y)
• X+Y (union of X and Y)
• X*, Y* (Kleene closure of X and Y)

Examples of Regular Expressions


The following table shows a few examples of Regular Expressions −

Regular Expression → Regular Set

• (0 + 10*) → {0, 1, 10, 100, 1000, 10000, …}
• (0*10*) → {1, 01, 10, 010, 0010, …}
• (0 + ε)(1 + ε) → {ε, 0, 1, 01}
• (a+b)* → the set of all strings of a's and b's of any length, including the null string: {ε, a, b, aa, ab, bb, ba, aaa, …}
• (a+b)*abb → the set of strings of a's and b's ending with the string abb: {abb, aabb, babb, aaabb, ababb, …}
• (11)* → the set consisting of an even number of 1's, including the empty string: {ε, 11, 1111, 111111, …}
• (aa)*(bb)*b → the set of strings consisting of an even number of a's followed by an odd number of b's: {b, aab, aabbb, aabbbbb, aaaab, aaaabbb, …}
• (aa + ab + ba + bb)* → the set of strings of a's and b's of even length obtained by concatenating any combination of aa, ab, ba and bb, including the null string: {ε, aa, ab, ba, bb, aaab, aaba, …}

FINITE STATE AUTOMATA


The term automata, derived from the Greek word "αὐτόματα" meaning "self-acting", is the plural of automaton
which may be defined as an abstract self-propelled computing device that follows a predetermined sequence of
operations automatically.

An automaton having a finite number of states is called a Finite Automaton (FA) or Finite State automata (FSA).

Mathematically, an automaton can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

• Q is a finite set of states.


• Σ is a finite set of symbols, called the alphabet of the automaton.
• δ is the transition function
• q0 is the initial state from where any input is processed (q0 ∈ Q).
• F is a set of final state/states of Q (F ⊆ Q).

TYPES OF FINITE STATE AUTOMATA (FSA): Finite state automata are of two types.

DETERMINISTIC FINITE AUTOMATON (DFA)


It may be defined as the type of finite automaton wherein, for every input symbol, we can determine the state to which the machine will move. Because the next state is uniquely determined, the machine is called a Deterministic Finite Automaton (DFA).

Mathematically, a DFA can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

• Q is a finite set of states.


• Σ is a finite set of symbols, called the alphabet of the automaton.
• δ is the transition function where δ: Q × Σ → Q.
• q0 is the initial state from where any input is processed (q0 ∈ Q).
• F is a set of final state/states of Q (F ⊆ Q).

Whereas graphically, a DFA can be represented by digraphs called state diagrams, where −

• The states are represented by vertices.


• The transitions are shown by labeled arcs.
• The initial state is represented by an empty incoming arc.
• The final state is represented by a double circle.

Example of DFA

Suppose a DFA is defined by

• Q = {a, b, c},
• Σ = {0, 1},
• q0 = a,
• F = {c}.

Transition function δ is shown in the table as follows −

Current State   Next State for Input 0   Next State for Input 1

a               a                        b
b               b                        a
c               c                        c

Graphically, this DFA can be drawn as a state diagram with vertices a, b and c and arcs labelled 0 and 1.
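
A minimal Python sketch simulating this DFA (note that with this particular transition function, state c is unreachable from a, so no input string is accepted):

```python
delta = {
    ("a", "0"): "a", ("a", "1"): "b",
    ("b", "0"): "b", ("b", "1"): "a",
    ("c", "0"): "c", ("c", "1"): "c",
}
final_states = {"c"}

def accepts(string, start="a"):
    state = start
    for symbol in string:
        state = delta[(state, symbol)]  # exactly one next state per symbol
    return state in final_states

print(accepts("0110"))  # False (the run ends in state 'a')
```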


NON-DETERMINISTIC FINITE AUTOMATON (NDFA)
It may be defined as the type of finite automaton where, for every input symbol, we cannot determine the state to which the machine will move, i.e. the machine can move to any combination of the states. Because the next state is not uniquely determined, the machine is called a Non-deterministic Finite Automaton (NDFA).

Mathematically, NDFA can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

• Q is a finite set of states.


• Σ is a finite set of symbols, called the alphabet of the automaton.
• δ is the transition function where δ: Q × Σ → 2^Q.
• q0 is the initial state from where any input is processed (q0 ∈ Q).
• F is a set of final state/states of Q (F ⊆ Q).

Whereas graphically (same as for a DFA), an NDFA can be represented by digraphs called state diagrams, where −

• The states are represented by vertices.


• The transitions are shown by labeled arcs.
• The initial state is represented by an empty incoming arc.
• The final state is represented by a double circle.

Example of NDFA

Suppose an NDFA is defined by

• Q = {a, b, c},
• Σ = {0, 1},
• q0 = a,
• F = {c}.

Transition function δ is shown in the table as follows −

Current State   Next State for Input 0   Next State for Input 1

a               {a, b}                   {b}
b               {c}                      {a, c}
c               {b, c}                   {c}

Graphically, this NDFA can be drawn as a state diagram with vertices a, b and c and arcs labelled with the input symbols.
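
A minimal sketch simulating this NDFA by tracking the set of states the machine could be in at each step:

```python
delta = {
    ("a", "0"): {"a", "b"}, ("a", "1"): {"b"},
    ("b", "0"): {"c"},      ("b", "1"): {"a", "c"},
    ("c", "0"): {"b", "c"}, ("c", "1"): {"c"},
}
final_states = {"c"}

def accepts(string, start="a"):
    current = {start}
    for symbol in string:
        current = set().union(*(delta[(s, symbol)] for s in current))
    return bool(current & final_states)

print(accepts("01"))  # True: a -0-> {a, b} -1-> {a, b, c}, which contains c
```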

PORTER STEMMING ALGORITHM – BASIC INTRO

In linguistics (the study of language and its structure), a stem is the part of a word that is common to all of its inflected variants.

• CONNECT
• CONNECTED
• CONNECTION
• CONNECTING
The above words are inflected variants of CONNECT. Hence, CONNECT is a stem. To this stem we can add different suffixes to form different words.

The process of reducing such inflected (or sometimes derived) words to their word stem is known
as Stemming. For example, CONNECTED, CONNECTION and CONNECTING can be reduced to
the stem CONNECT.

The Porter Stemming algorithm (or Porter Stemmer) is used to remove the suffixes from an
English word and obtain its stem which becomes very useful in the field of Information
Retrieval (IR). This process reduces the number of terms kept by an IR system which will be
advantageous both in terms of space and time complexity. First, a few terms and expressions will
be introduced, which will be helpful for the ease of explanation.

Consonants and Vowels


A consonant is a letter other than the vowels and other than a letter “Y” preceded by a consonant.
So in “TOY” the consonants are “T” and “Y”, and in “SYZYGY” they are “S”, “Z” and “G”.

If a letter is not a consonant it is a vowel.

A consonant will be denoted by c and a vowel by v.

A list of one or more consecutive consonants (ccc…) will be denoted by C, and a list of one or
more consecutive vowels (vvv…) will be denoted by V. Any word, or part of a word, therefore has
one of the four forms given below.

• CVCV … C → collection, management


• CVCV … V → conclude, revise
• VCVC … C → entertainment, illumination
• VCVC … V → illustrate, abundance
All of these forms can be represented using a single form as,

[C]VCVC … [V]
Here the square brackets denote arbitrary presence of consonants or vowels.

(VC)^m denotes VC repeated m times. So the above expression can be written as,
[C](VC)^m[V]

What is m?
The value m found in the above expression is called the measure of any word or word part when represented in the form [C](VC)^m[V]. Here are some examples for different values of m (a small sketch computing m follows the list):

• m=0 → TREE, TR, EE, Y, BY


• m=1 → TROUBLE, OATS, TREES, IVY
• m=2 → TROUBLES, PRIVATE, OATEN, ROBBERY
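
A minimal sketch computing m in Python, directly from the consonant/vowel definition above (y counts as a vowel when it follows a consonant):

```python
import re

def measure(word):
    forms = []
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou" or (ch == "y" and i > 0 and forms[i - 1] == "c"):
            forms.append("v")  # vowel
        else:
            forms.append("c")  # consonant
    # m = number of VC blocks in the form [C](VC)^m[V]
    return len(re.findall(r"v+c+", "".join(forms)))

for w in ["tree", "trouble", "oats", "troubles", "private"]:
    print(w, measure(w))  # tree 0, trouble 1, oats 1, troubles 2, private 2
```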

Rules
The rules for replacing (or removing) a suffix will be given in the form as shown below.

(condition) S1 → S2
This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given
condition, S1 is replaced by S2. The condition is usually given in terms of m in regard to the stem
before S1.

(m > 1) EMENT →
Here S1 is ‘EMENT’ and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a
word part for which m = 2.

Conditions
The conditions may contain the following:
• *S – the stem ends with S (and similarly for the other letters)
• *v* – the stem contains a vowel
• *d – the stem ends with a double consonant (e.g. -TT, -SS)
• *o – the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)
And the condition part may also contain expressions with and, or and not.

(m>1 and (*S or *T)) tests for a stem with m>1 ending in S or T.
(*d and not (*L or *S or *Z)) tests for a stem ending with a double consonant and does not end
with letters L, S or Z.

How are rules obeyed?


In a set of rules written beneath each other, only one is obeyed, and this will be the one with the
longest matching S1 for the given word. For example, with the following rules,

1. SSES → SS
2. IES → I
3. SS → SS
4. S →
(Here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for
S1. Equally CARESS maps to CARESS (since S1=”SS”) and CARES to CARE (since S1=”S”).

The Algorithm
Step 1a
SSES → SS
IES → I
SS → SS
S →

Step 1b
(m>0) EED → EE
(*v*) ED →
(*v*) ING →
If the second or third of the rules in Step 1b is successful, the following is performed.
AT → ATE
BL → BLE
IZ → IZE
(*d and not (*L or *S or *Z)) → single letter
(m=1 and *o) → E

Step 1c
(*v*) Y → I

Step 2
(m>0) ATIONAL → ATE
(m>0) TIONAL → TION
(m>0) ENCI → ENCE
(m>0) ANCI → ANCE
(m>0) IZER → IZE
(m>0) ABLI → ABLE
(m>0) ALLI → AL
(m>0) ENTLI → ENT
(m>0) ELI → E
(m>0) OUSLI → OUS
(m>0) IZATION → IZE
(m>0) ATION → ATE
(m>0) ATOR → ATE
(m>0) ALISM → AL
(m>0) IVENESS → IVE
(m>0) FULNESS → FUL
(m>0) OUSNESS → OUS
(m>0) ALITI → AL
(m>0) IVITI → IVE
(m>0) BILITI → BLE

Step 3
(m>0) ICATE → IC
(m>0) ATIVE →
(m>0) ALIZE → AL
(m>0) ICITI → IC
(m>0) ICAL → IC
(m>0) FUL →
(m>0) NESS →

Step 4
(m>1) AL →
(m>1) ANCE →
(m>1) ENCE →
(m>1) ER →
(m>1) IC →
(m>1) ABLE →
(m>1) IBLE →
(m>1) ANT →
(m>1) EMENT →
(m>1) MENT →
(m>1) ENT →
(m>1 and (*S or *T)) ION →
(m>1) OU →
(m>1) ISM →
(m>1) ATE →
(m>1) ITI →
(m>1) OUS →
(m>1) IVE →
(m>1) IZE →

Step 5a
(m>1) E →
(m=1 and not *o) E →

Step 5b
(m > 1 and *d and *L) → single letter
For each word you input to the algorithm, all the steps from 1 to 5 will be executed and the output
will be produced at the end.

Example Inputs

Let's consider a few example inputs and check what their stem outputs will be. 🙂
Example 1

In the first example, we input the word MULTIDIMENSIONAL to the Porter Stemming algorithm.
Let’s see what happens as the word goes through steps 1 to 5.

• The suffix will not match any of the cases found in steps 1, 2 and 3.
• Then it comes to step 4.
• The stem of the word has m > 1 (since m = 5) and ends with “AL”.
• Hence in step 4, “AL” is deleted (replaced with null).
• Calling step 5 will not change the stem further.
• Finally the output will be MULTIDIMENSION.

MULTIDIMENSIONAL → MULTIDIMENSION
Example 2

In the second example, we input the word CHARACTERIZATION to the Porter Stemming
algorithm. Let’s see what happens as the word goes through steps 1 to 5.
• The suffix will not match any of the cases found in step 1.
• So it will move to step 2.
• The stem of the word has m > 0 (since m = 3) and ends with “IZATION”.
• Hence in step 2, “IZATION” will be replaced with “IZE”.
• Then the new stem will be CHARACTERIZE.
• Step 3 will not match any of the suffixes and hence will move to step 4.
• Now m > 1 (since m = 3) and the stem ends with “IZE”.
• So in step 4, “IZE” will be deleted (replaced with null).
• No change will happen to the stem in other steps.
• Finally the output will be CHARACTER.

CHARACTERIZATION → CHARACTERIZE → CHARACTER
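
Both worked examples can be checked against NLTK's implementation of the Porter Stemmer (assuming NLTK is installed; NLTK applies some minor extensions to the original algorithm, so a few words may stem differently):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("multidimensional"))  # 'multidimension'
print(stemmer.stem("characterization"))  # 'character'
```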

LEXICON-FREE FST FOR STEMMING


Based on a series of simple cascaded rewrite rules

– (condition) S1--> S2

– Seven sets of rules, applied in order

– Within each set, if more than one of the rules can apply, only the one with the longest matching suffix (S1) is
followed.

The seven rule sets:

1. Plural nouns / third person singular verbs (4 rules)

sses --> ss (possesses --> possess)
ies --> i (ties --> ti)

2. Verbal past tense and progressives (3 rules)

(*v*) ed --> null (walked --> walk)

plus cleanup rules to remove double letters and add back e's:

at --> ate (conflat(ed) --> conflate)

3. (*v*) y --> i (happy --> happi)

4. Derivational morphology I: multiple suffixes

ator --> ate (operator --> operate)
fulness --> ful (gratefulness --> grateful)

5. Derivational morphology II: more multiple suffixes

ful --> null (grateful --> grate)

6. Derivational morphology III: single suffixes

ous --> null (analogous --> analog)

7. Cleanup (3 rules)

(m>1) e --> null (probate --> probat, rate --> rate)
dropping double letters (controll --> control)

WHAT IS TF-IDF?
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language
processing and information retrieval. It measures how important a term is within a document relative to a
collection of documents (i.e., relative to a corpus).

Words within a text document are transformed into importance numbers by a text vectorization process. There are
many different text vectorization scoring schemes, with TF-IDF being one of the most common.

Term Frequency: TF of a term or word is the number of times the term appears in a document compared to the
total number of words in the document.

(number of times the term appears in the document)

TF = ----------------------------------------------------------------
(total number of terms in the document)
Inverse Document Frequency: IDF of a term reflects the proportion of documents in the corpus that contain the
term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance
values than words common across all documents (e.g., a, the, and).

(number of documents in the corpus)
IDF = log -------------------------------------------------------------------
(number of documents in the corpus that contain the term)

The TF-IDF of a term is calculated by multiplying TF and IDF scores.

TF-IDF = TF × IDF
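
A minimal sketch of these formulas in plain Python (using the natural logarithm; libraries such as scikit-learn use smoothed variants of IDF):

```python
import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]
docs = [d.split() for d in corpus]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)  # assumes term occurs somewhere
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(round(tf_idf("cat", docs[0], docs), 3))  # 0.183 - rare term, higher weight
print(round(tf_idf("the", docs[0], docs), 3))  # 0.135 - common term, lower weight
```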

N-GRAM
N-grams, a fundamental concept in NLP, play a pivotal role in capturing patterns and relationships within a
sequence of words. In this blog post, we’ll delve into the world of N-grams, exploring their significance,
applications, and how they contribute to enhancing language processing tasks.

Understanding N-grams:

Definition:

N-grams are contiguous sequences of ’n’ items, typically words in the context of NLP. These items can be
characters, words, or even syllables, depending on the granularity desired. The value of ’n’ determines the order of
the N-gram.

Examples:

Unigrams (1-grams): Single words, e.g., “cat,” “dog.”


Bigrams (2-grams): Pairs of consecutive words, e.g., “natural language,” “deep learning.”
Trigrams (3-grams): Triplets of consecutive words, e.g., “machine learning model,” “data science approach.”

4-grams, 5-grams, etc.: Sequences of four, five, or more consecutive words.
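
A minimal sketch extracting word n-grams from a token list:

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams, e.g. ('natural', 'language')
print(ngrams(tokens, 3))  # trigrams
```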


• Cross-Entropy: It measures the ability of the trained model to represent test data (w_1 … w_n):

H(p) = (1/n) Σ_{i=1..n} −log2 p(w_i | w_1 … w_{i−1})

The cross-entropy is always greater than or equal to Entropy i.e the model uncertainty
can be no less than the true uncertainty.

• Perplexity: Perplexity is a measure of how well a probability distribution predicts a sample. It can be understood as a measure of uncertainty. Perplexity is calculated as 2 raised to the power of the cross-entropy:

PP(W) = 2^H(p)
For Example:

• Let's take an example of the sentence: 'Natural Language Processing'. For predicting the first word, suppose the candidate words have the following probabilities:

word P(word | <start>)

The 0.4

Processing 0.3

Natural 0.12

Language 0.18

• Now we know the probability of getting the first word, natural. But what is the probability of getting the word 'Language' after the word 'Natural'?
word P(word | ‘Natural’ )

The 0.05

Processing 0.3

Natural 0.15

Language 0.5

• After getting the probability of generating the words 'Natural Language', what is the probability of getting 'Processing'?

word P(word | ‘Language’ )

The 0.1

Processing 0.7

Natural 0.1

Language 0.1

• Now, the perplexity can be calculated as:

PP(W) = ( Π_{i=1..N} 1 / P(w_i | w_{i−1}) )^(1/N) = ( 1 / (0.12 × 0.5 × 0.7) )^(1/3) ≈ 2.876

• From that we can also calculate entropy:

Entropy = log2(2.876) ≈ 1.524
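
A minimal sketch reproducing this calculation (math.prod requires Python 3.8+):

```python
import math

# P(Natural | <start>), P(Language | Natural), P(Processing | Language)
probs = [0.12, 0.5, 0.7]
n = len(probs)

perplexity = (1 / math.prod(probs)) ** (1 / n)
entropy = math.log2(perplexity)

print(perplexity)  # ≈ 2.8769 (the text rounds this to 2.876)
print(entropy)     # ≈ 1.524
```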
