Natural Language
Processing
UNIT-1
___
Notes
What is NLP?
NLP stands for Natural Language Processing, a field at the intersection of Computer Science, Human
Language, and Artificial Intelligence. It is the technology used by machines to
understand, analyse, manipulate, and interpret human languages. It helps developers organize
knowledge for performing tasks such as translation, automatic summarization, Named Entity
Recognition (NER), speech recognition, relationship extraction, and topic segmentation.
History of NLP
1948 - The first recognisable NLP application was introduced at Birkbeck College, London.
1950s - There was a conflicting view between linguistics and computer science. Chomsky
published his first book, Syntactic Structures, and claimed that language is generative in nature.
In 1957, Chomsky also introduced the idea of Generative Grammar, which gives rule-based
descriptions of syntactic structures.
Case Grammar
Case Grammar was developed by the linguist Charles J. Fillmore in 1968. Case Grammar uses
languages such as English to express the relationship between nouns and verbs by means of
prepositions.
In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.
For example: "Neha broke the mirror with the hammer". In this example, case grammar
identifies Neha as an agent, mirror as a theme, and hammer as an instrument. Between 1960 and
1980, the key systems were:
SHRDLU
LUNAR
LUNAR is the classic example of a Natural Language database interface system; it used
ATNs and Woods' Procedural Semantics. It was capable of translating elaborate natural language
expressions into database queries and handled 78% of requests without errors.
1980 - Current
Until 1980, natural language processing systems were based on complex sets of
hand-written rules. After 1980, NLP introduced machine learning algorithms for language
processing.
In the beginning of the 1990s, NLP started growing faster and achieved good processing accuracy,
especially for English grammar. In the 1990s, electronic text also became available, which provided a
good resource for training and examining natural language programs. Other factors include
the availability of computers with fast CPUs and more memory. The major factor behind the
advancement of natural language processing was the Internet.
Modern NLP consists of various applications, like speech recognition, machine
translation, and machine text reading. When we combine all these applications, artificial
intelligence can gain knowledge of the world. Let's consider the example of Amazon
Alexa: you can ask Alexa a question, and it will reply to you.
Challenges of NLP
Natural Language Processing (NLP) faces various challenges due to the complexity and diversity
of human language.
1. Language differences
Human language is rich and intricate, and there are thousands of languages spoken
around the world, each with its own grammar, vocabulary, and cultural nuances. No single person
understands all of these languages, and the productivity of human language is high. There is also
ambiguity in natural language, since the same words and phrases can have different meanings in
different contexts. This is the major challenge in understanding natural language.
2. Training Data
Training data is a curated collection of input-output pairs, where the input represents the features
or attributes of the data and the output is the corresponding label or target. For NLP,
features might include text data, and labels could be categories, sentiments, or any other relevant
annotations.
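As a minimal illustration (the texts and labels below are hypothetical, not from any particular dataset), NLP training data is often held as (text, label) pairs:

```python
# Hypothetical sentiment-classification training data: each example pairs
# an input text (the features) with its label (the target).
training_data = [
    ("The movie was absolutely wonderful", "positive"),
    ("I waited an hour and the food was cold", "negative"),
    ("The package arrived on time", "positive"),
]

# Split into features (inputs) and labels (outputs), as most ML libraries expect.
texts = [text for text, label in training_data]
labels = [label for text, label in training_data]
print(texts)
print(labels)
```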
3. Development Time and Resource Requirements
Development time and resource requirements for Natural Language Processing (NLP)
projects depend on various factors, including the task complexity, the size and quality of the data,
the availability of existing tools and libraries, and the expertise of the team involved.
4. Misspellings and Grammatical Errors
Overcoming misspellings and grammatical errors is a basic challenge in NLP, as these are
forms of linguistic noise that can impact the accuracy of understanding and analysis.
5. Mitigating Innate Biases
Mitigating innate biases in NLP algorithms is a crucial step toward ensuring fairness,
equity, and inclusivity in natural language processing applications. Here are some key points for
mitigating biases in NLP algorithms:
● Data collection and annotation: It is very important to ensure that the training
data used to develop NLP algorithms is diverse, representative, and free from biases.
● Bias analysis and detection: Apply bias detection and analysis methods to the
training data to find biases based on demographic factors such as race,
gender, and age.
● Data preprocessing: Data preprocessing is an important step for mitigating biases,
for example by debiasing word embeddings, balancing class distributions, and
augmenting underrepresented samples.
● Fair representation learning: Natural Language Processing models are trained to
learn fair representations that are invariant to protected attributes such as race or gender.
● Model auditing and evaluation: Natural language models are evaluated for
fairness and bias with the help of metrics and audits. NLP models are evaluated on
diverse datasets, and post-hoc analyses are performed to find and mitigate innate biases in
NLP algorithms.
6. Words with Multiple Meanings
Words with multiple meanings pose a lexical challenge in Natural Language Processing because
of their ambiguity. Such words are known as polysemous or homonymous and take different
meanings depending on the context in which they are used.
7. Addressing Multilingualism
Supporting many languages, each with its own grammar, vocabulary, and script, is a further
challenge for NLP systems.
8. Reducing Uncertainty and False Positives
It is a very crucial task to reduce uncertainty and false positives in Natural Language Processing
(NLP) in order to improve the accuracy and reliability of NLP models.
9. Facilitating Continuous Conversations
Facilitating continuous conversations with NLP requires developing systems that
understand and respond to human language in real time, enabling seamless interaction
between users and machines. Implementing real-time natural language processing pipelines gives
the capability to analyze and interpret user input as it is received; this involves algorithms and
systems optimized for low-latency processing so that responses to user queries and inputs are quick.
It also requires building NLP models that can maintain context throughout a conversation.
Understanding context enables systems to interpret user intent, track conversation history,
and generate relevant responses based on the ongoing dialogue. Intent recognition
algorithms are applied to find the underlying goals and intentions expressed by users in their messages.
Key considerations for overcoming these challenges include:
● Quantity and quality of data: High-quality, diverse data is needed to train the
NLP algorithms effectively. Data augmentation, data synthesis, and crowdsourcing are
techniques for addressing data scarcity.
● Ambiguity: The NLP algorithm should be trained to disambiguate words and
phrases.
● Out-of-vocabulary words: Techniques such as tokenization, character-level modeling,
and vocabulary expansion are used to handle out-of-vocabulary words.
● Lack of annotated data: Techniques such as transfer learning and pre-training can be
used to transfer knowledge from a large dataset to specific tasks with limited labeled data.
● Algorithm selection and model development: It is difficult to choose the machine
learning algorithms that are best suited to a given Natural Language Processing task.
● Training and evaluation: Training requires powerful computational resources,
including powerful hardware (GPUs or TPUs), and time for the algorithm's training
iterations. It is also important to evaluate the performance of the model with suitable
metrics and validation techniques to confirm the quality of the results.
Language modelling:
Language modelling is the way of determining the probability of any sequence of words.
Language modelling is used in various applications such as Speech Recognition, Spam filtering,
etc. Language modelling is the key aim behind implementing many state-of-the-art Natural
Language Processing models.
Due to the smoothing techniques, bigram and trigram language models are robust and have
been successfully used more widely in speech recognition than conventional grammars like
context free or even context sensitive grammars. Although these grammars are expected to
better capture the inherent structures of the language, they have a couple of problems:
● robustness: the grammar must be able to handle a vocabulary of 10,000 or more words,
and ultimately a non-zero probability must be assigned to each possible word sequence.
● ambiguity: while m-gram language models avoid any ambiguity in parsing, context-free
grammars are typically ambiguous and thus produce more than a single parse tree.
N-Gram: This is one of the simplest approaches to language modeling. Here, a probability
distribution is created for a sequence of 'n' words, where 'n' can be any number and defines the size of
the gram (the sequence of words being assigned a probability). If n = 4, a gram may look like: "can
you help me". Basically, 'n' is the amount of context that the model is trained to consider. There
are different types of N-Gram models, such as unigrams, bigrams, trigrams, etc.
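As a rough illustration of the idea (a minimal sketch on a made-up toy corpus, not a production model), a bigram model can be estimated from counts with add-one smoothing:

```python
from collections import Counter, defaultdict

# Toy corpus; <s> and </s> mark sentence boundaries (an assumed convention here).
corpus = [
    "<s> can you help me </s>",
    "<s> can you call me </s>",
    "<s> you can help </s>",
]

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

vocab_size = len(unigram_counts)

def bigram_prob(prev, curr):
    # Add-one (Laplace) smoothing so unseen bigrams still get non-zero probability.
    return (bigram_counts[prev][curr] + 1) / (unigram_counts[prev] + vocab_size)

# P(help | you) vs P(call | you): "help" follows "you" more often in this toy corpus.
print(bigram_prob("you", "help"))
print(bigram_prob("you", "call"))
```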
Unigram: The unigram is the simplest type of language model. It doesn't look at any
conditioning context in its calculations. It evaluates each word or term independently. Unigram
models commonly handle language processing tasks such as information retrieval. The unigram
is the foundation of a more specific model variant called the query likelihood model, which uses
information retrieval to examine a pool of documents and match the most relevant one to a
specific query.
Bidirectional: Unlike n-gram models, which condition only on the preceding words (a single
direction), bidirectional models analyze text in both directions, backwards and forwards. These
models can predict any word in a sentence or body of text by using all of the other words in the
text. Examining text bidirectionally increases result accuracy. This type is often utilized in
machine learning and speech generation applications. For example, Google uses a bidirectional
model to process search queries.
Exponential: This type of statistical model evaluates text using an equation that combines
n-grams and feature functions. Here the features and parameters of the desired results are
specified in advance. The model is based on the principle of maximum entropy, which states that the
probability distribution with the most entropy (subject to the constraints) is the best choice.
Exponential models make fewer statistical assumptions, which means the results are more likely
to be accurate.
Continuous Space: In this type of model, words are represented as a non-linear
combination of weights in a neural network. The process of assigning a weight vector to a word is
known as word embedding. This type of model proves helpful in scenarios where the data set of
words continues to grow and includes unique words.
In cases where the data set is large and consists of rarely used or unique words, models such as
n-grams do not work well. This is because, as the number of words increases, the number of possible
word sequences increases, and thus the patterns predicting the next word become weaker.
Regular Expressions
Regular Expressions are used to denote regular languages. An expression is regular if:
● it is the empty set ∅, the empty string ε, or a single symbol from the alphabet Σ;
● it is the union (r1 | r2), the concatenation (r1 r2), or the Kleene star (r1*) of regular
expressions r1 and r2.
Finite Automata
Finite automata (FA) are the machines that recognize regular languages, i.e., languages
represented by regular expressions (RE). Hence there is a chain of implications, FA → Regular
Language → Regular Expression → FA; in other words, the three formalisms are equivalent. Consider
the talk, or sound, of a sheep (the sheeptalk language S), which can be represented by the set of
strings S = {baa!, baaa!, baaaa!, ...}, where each sentence of the language differs only in the
length of the 'a' sound. The set of strings S can be represented by the regular expression baa+!;
however, we will prefer to write it as /baa+!/, a pronunciation-style format. The sounds in the set
S can be recognized by the FA shown in Fig. 5.1.
[Figure 5.1: Finite automata. Transition diagram with states q0-q4 reading b, a, a, a, !, together
with the finite control, tape, and read head.]
Formally, an FA is represented by M = (Q, Σ, q0, δ, F), where
● Q is a finite set of states; here Q = {q0, q1, q2, q3, q4},
● Σ is a finite alphabet; here Σ = {a, b, !},
● δ is the transition function, δ : Q × Σ → Q,
● q0 is the start state,
● F is the set of final (accepting) states; here F = {q4},
so that S = L(M) = {baa!, baaa!, baaaa!, ...}.
The recognition of a language string by the FA is described by Algorithm 1. For this, the FA's
tape is divided into squares, called index positions, each holding one symbol from the alphabet Σ.
We take w as the input string, with |w| = n its length [Jurafsky]. Recognizing a string with a
(non-deterministic) FA amounts to searching a tree: for a string of length n over the alphabet
Σ = {a, b}, the worst-case search takes O(2^n) space and time.
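For concreteness, here is a minimal sketch (not from the original notes) of the sheeptalk FA of Fig. 5.1, implemented as a transition table in Python, alongside the equivalent regular-expression check:

```python
import re

# Transition table for the sheeptalk FA: states q0..q4, alphabet {b, a, !}.
# Missing entries mean the machine rejects (no transition).
delta = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",  # loop: any further 'a's stay in q3
    ("q3", "!"): "q4",
}
FINAL_STATES = {"q4"}

def accepts(word: str) -> bool:
    """Run the FA over the word, symbol by symbol."""
    state = "q0"
    for symbol in word:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state in FINAL_STATES

for w in ["baa!", "baaaa!", "ba!", "baa"]:
    # The FA and the regular expression /baa+!/ accept exactly the same strings.
    print(w, accepts(w), bool(re.fullmatch(r"baa+!", w)))
```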
A finite state transducer (FST) is a finite state machine with two tapes, an input tape and an
output tape, and a finite number of states. Fig. 5.3 shows a diagram in which the input and output
strings are written on the transitions, separated by ":". This FST has been used as a translator, as
it translates the input sentence "Hello World" into "Hey there krc".
[Figure 5.3: An FST as a translator, with transitions Hello:Hey, ε:there, and World:krc between
states 0-3.]
Thus, an FST is a directed graph, like a finite automaton, with:
● edges/transitions that carry input/output labels,
● sometimes empty labels, indicated by ε,
● the property that traversing the FST to its end implies the translation of one string into
another, the generation of two strings, or the relation of one string to another,
● a designated "start" state and one or more "final" states.
We define the FST here as a Mealy machine, which is an extension of the normal finite state (FS)
machine. The formal representation of a Mealy machine is given by
M = (Q, Σ, q0, δ, F)      (5.1)
where Q = {q0, q1, ..., qN−1} is a finite set of states, Σ is a finite alphabet of complex symbols
with Σ ⊆ I × O, q0 is the start state, and δ : Q × Σ → Q is the transition function, for example
δ(qi, i:o) = qj. Here I and O are the input and output symbol sets, respectively, and both include
the symbol ε. For Σ = {a, b, !}, corresponding to the language discussed earlier, the FST has the
i:o set {a:a, b:b, !:!, a:!, a:ε, ε:!}.
FSTs are useful for a variety of applications:
● Word inflections: for example, finding the plural of words, cat → cats, dog → dogs,
goose → geese, etc.
● Morphological parsing: extracting the properties of a word, e.g., cats → cat + [noun] + [plural].
● Simple word translations: for example, US English to UK English.
● Simple commands to the computer.
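A tiny sketch (illustrative only; the state names and transition table are assumptions, not the exact machine of Fig. 5.3) of an FST-style transducer that performs the "Hello World" → "Hey there krc" translation:

```python
# Each transition maps (state, input symbol) -> (next state, output symbols).
# EPS stands for the empty input label ε: the machine emits output without reading input.
EPS = ""

transitions = {
    (0, "Hello"): (1, ["Hey"]),
    (1, EPS):     (2, ["there"]),   # ε:there
    (2, "World"): (3, ["krc"]),
}
FINALS = {3}

def transduce(tokens):
    state, output, i = 0, [], 0
    while i < len(tokens) or state not in FINALS:
        if i < len(tokens) and (state, tokens[i]) in transitions:
            state, out = transitions[(state, tokens[i])]
            output.extend(out)
            i += 1
        elif (state, EPS) in transitions:          # take an ε-transition
            state, out = transitions[(state, EPS)]
            output.extend(out)
        else:
            raise ValueError("input not accepted by the transducer")
    return " ".join(output)

print(transduce("Hello World".split()))   # -> "Hey there krc"
```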
Morphology is the study of how words are constructed. Construction of English words
through the attachment of prefixes and suffixes (together called affixes) is called
concatenative morphology, because a word is composed of a number of morphemes
concatenated together. A word may have more than one affix, for example rewrites (re+write+s),
unlikely (un+like+ly), etc. There are broadly two ways to form words using morphemes:
1. Inflection: Inflectional morphology forms words from the same word stem, e.g.,
write+s, work+ed, etc. Table 5.1 shows words constructed using inflectional
morphology.
2. Derivation: Derivational morphology forms new words by adding affixes that change the
meaning or the part of speech of the stem, e.g., like → unlikely (un+like+ly).
We come to know the structure of a word when we perform morphological parsing for that word.
Given a surface form (input form), e.g., "going", we might produce the parsed form: verb-go +
gerund-ing. Morphological parsing can be done with the help of finite-state transducers. A
morpheme is a meaning-bearing unit of a language. For example, fox has a single morpheme,
fox, while cats has two morphemes, cat and -s. Similarly, eat, eats, eating, ate, and eaten are made of
different morphemes. Some examples of the mapping of words to their corresponding morphemes are
given in Table 5.3. This mapping of input to output corresponds to the input and output of a finite
state machine. In speech recognition, when a word has been identified, like cats or dogs, it
becomes necessary to produce its morphological parse to find its true meaning, in the form
of its structure, and to know how it is organised. The parse includes features like N (noun) and
V (verb), which specify additional information about the word stem, e.g., +N means the word is a noun,
+SG means singular, +PL means plural, etc.
To build a morphological parser we need:
1. Lexicon: the list of stems and affixes, together with basic information about them (for
example, whether a stem is a noun stem or a verb stem).
2. Morphotactics rules: rules about the ordering of morphemes in a word, e.g., -ed follows a
verb (as in worked, studied), while un- precedes a verb, as in unlock, untie, etc.
3. Orthographic rules (spelling): rules for combining morphemes, e.g., city + -s gives cities and not
citys.
We can use the lexicon together with the morphotactics rules to recognize words with the
help of finite automata in the form stem + affix + part-of-speech (N, V, etc.). Fig. 5.7 shows
the basic idea of parsing nouns using morphological parsing. Recognition of a noun by the FA requires
reaching the final state (marked by a double circle in the figure). Table 5.4 shows
some examples of regular and irregular nouns.
Morphological Parsing of nouns
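A very small sketch (the lexicon and suffix rules below are illustrative assumptions, not the FST of Fig. 5.7) of recognizing nouns in the stem + affix form described above:

```python
# Tiny lexicon: regular noun stems plus irregular singular/plural forms.
regular_noun_stems = {"cat", "dog", "fox"}
irregular_nouns = {"goose": ("goose", "+N +SG"), "geese": ("goose", "+N +PL")}

def parse_noun(word):
    """Return a 'stem +N +SG/+PL' style parse, or None if not recognized."""
    if word in irregular_nouns:
        stem, feats = irregular_nouns[word]
        return f"{stem} {feats}"
    if word in regular_noun_stems:
        return f"{word} +N +SG"
    if word.endswith("es") and word[:-2] in regular_noun_stems:
        return f"{word[:-2]} +N +PL"          # e.g., foxes -> fox +N +PL
    if word.endswith("s") and word[:-1] in regular_noun_stems:
        return f"{word[:-1]} +N +PL"          # e.g., cats -> cat +N +PL
    return None

for w in ["cats", "foxes", "goose", "geese", "walk"]:
    print(w, "->", parse_noun(w))
```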
A similar arrangement is possible for verb morphological parsing (see Fig. 5.8 and Table 5.5).
The lexicon for verbal inflection has three stem classes (reg-verb stem, irreg-verb stem, and
irreg-past-verb), with the affix classes: -ed for past tense and past participle, -ing for continuous,
and -s for third person singular. Adjectives can be parsed in a similar manner to nouns and verbs.
Some of the adjectives of the English language are: big, bigger, biggest, clean, cleaner, cleanest,
happy, unhappy, happier, happiest, real, really, unreal, etc. The finite automaton in Fig. 5.9
shows the morphological parsing of adjective words. At the next stage, the lexicon can be
expanded into sub-lexicons, i.e., individual letters, to be recognized by the finite automata. For
example, regular-noun in Fig. 5.7 can be expanded to the letters "f o x" connected by three states in a
transition diagram. Similarly, the regular verb stem in Fig. 5.8 can be expanded to the letters "w a l
k", and so on, as shown in Fig. 5.10. Note that in the parsing of N, V, ADJ, and ADV discussed
above, for the sake of simplicity we have not shown the transitions separated by a colon (":");
however, the FST has two tapes as usual, for input and output.
[Table 5.5: Regular and irregular verbs (columns: Reg-verb, Past, Irreg-verb, Irreg-past-verb, Continuous).]
Tokenization
Tokenization is one of the first steps in any NLP pipeline. Tokenization is nothing but splitting
the raw text into small chunks of words or sentences, called tokens. If the text is split into words,
it is called 'Word Tokenization', and if it is split into sentences, it is called 'Sentence
Tokenization'. Generally, whitespace is used to perform word tokenization, while characters like
periods, exclamation points, and newline characters are used for sentence tokenization. We have to
choose the appropriate method as per the task at hand. While performing tokenization, a
few characters like spaces and punctuation marks are ignored and will not be part of the final list of
tokens.
Tokenization is Required
Every sentence gets its meaning by the words present in it. So by analyzing the words present in
the text we can easily interpret the meaning of the text. Once we have a list of words we can also
use statistical tools and methods to get more insights into the text. For example, we can use word
count and word frequency to find out the importance of words in that sentence or document.
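A minimal sketch of word and sentence tokenization plus a word-frequency count, using only the Python standard library (the sample sentence is made up for illustration):

```python
import re
from collections import Counter

text = "NLP is fun. Tokenization splits text into tokens! Tokens make analysis easy."

# Sentence tokenization: split after '.', '!' or '?' followed by whitespace.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Word tokenization: keep alphanumeric runs, drop punctuation, lowercase.
words = re.findall(r"\w+", text.lower())

# Word frequency gives a rough signal of which words matter in the text.
print(sentences)
print(Counter(words).most_common(3))
```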
Spelling Error Detection
Spelling errors in NLP can be detected using techniques such as spell checking, phonetic
matching, and language models that handle out-of-vocabulary words effectively.
1. Spell Checking:
● Dictionary-Based Approaches: Utilize a dictionary or lexicon to check whether
each word in the text is spelled correctly. If a word is not found in the
dictionary, it is considered a potential spelling error (a small sketch of this
approach follows this list).
2. Phonetic Matching:
● Soundex and Metaphone: Phonetic algorithms map words to phonetic
representations based on their pronunciation. Words with similar phonetic
representations sound alike even if they are spelled
differently. This technique helps in identifying spelling errors where words
sound alike but are spelled differently.
3. Language Models:
● Statistical Language Models: Use statistical models trained on large text
corpora to estimate the probability of a word sequence. Language models
can help in identifying likely corrections for misspelled words based on
the context of surrounding words.
● Neural Language Models: Modern neural language models like
Transformer-based models (e.g., BERT, GPT) are effective at predicting
and correcting spelling errors by considering the context of the entire
sentence. Fine-tuning these models on spelling correction tasks can yield
highly accurate results.
4. Rule-Based Approaches:
● Pattern Matching: Apply regular expressions or pattern-matching rules to
detect common types of spelling errors, such as repeated characters,
missing characters, or transposed letters.
● Language-Specific Rules: Develop language-specific spelling correction
rules based on common misspellings, phonetic patterns, or morphological
rules.
5. Ensemble Methods:
● Combining Multiple Approaches: Combine the outputs of different
spelling correction methods, such as spell checking, phonetic matching,
and language models, using ensemble techniques to improve accuracy and
robustness.
6. User Feedback:
● Interactive Correction: Allow users to provide feedback on suggested
corrections and incorporate this feedback to improve the spelling
correction system over time. This can be achieved through interactive
interfaces or feedback mechanisms in applications.
7. Domain-Specific Customization:
● Custom Dictionaries: Create domain-specific dictionaries or lexicons
containing relevant terms and vocabulary to improve the accuracy of
spelling correction in specific domains or industries.
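As an illustration of the dictionary-based approach (the word list below is a tiny stand-in for a real lexicon), Python's standard-library difflib can rank close matches to a misspelled word:

```python
import difflib

# Tiny stand-in dictionary; a real spell checker would load a full lexicon.
DICTIONARY = ["spell", "spill", "speak", "special", "hello", "world", "checker"]

def check(word: str):
    """Flag the word if it is not in the dictionary and suggest close matches."""
    if word in DICTIONARY:
        return word, []                      # spelled correctly, no suggestions
    suggestions = difflib.get_close_matches(word, DICTIONARY, n=3, cutoff=0.6)
    return word, suggestions

print(check("spel"))     # -> ('spel', ['spell', ...]) ranked by similarity
print(check("world"))    # -> ('world', [])
```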
Minimum Edit Distance
The minimum edit distance has important applications in areas like spell checking, DNA
sequence analysis, and natural language processing (NLP). It is usually calculated using a
dynamic programming approach, in which a table is constructed to compare substrings
of the two words being analyzed. This method ensures that the solution is computed efficiently.
The time complexity of computing the minimum edit distance is O(n × m),
where n and m are the lengths of the two strings. It provides a useful metric for understanding
the similarity between two strings and can help in tasks like text correction or similarity-based
searches.
It helps to determine how many single-character edits are needed to convert one string into
another:
1. insertions,
2. deletions,
3. substitutions.
Example of calculating the minimum edit distance between two words: "kitten" and "sitting."
We want to transform "kitten" into "sitting" using the fewest possible single-character edits
(insertions, deletions, or substitutions).
Here, three edits are required: two substitutions and one insertion. Therefore, the minimum
edit distance between "kitten" and "sitting" is 3.
Calculation Example:
For the words "intention" and "execution", the minimum edit distance involves the
following edits:
1. delete i (intention → ntention),
2. substitute n with e (ntention → etention),
3. substitute t with x (etention → exention),
4. insert c (exention → execntion),
5. substitute n with u (execntion → execution),
giving a minimum edit distance of 5.
This is an example of the Levenshtein distance, where each operation has a cost of 1. The
dynamic programming approach typically uses a matrix to track the optimal number of
edits for each pair of substrings of the two words, ensuring an efficient calculation of the total edit
distance.
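A minimal sketch of the dynamic programming computation described above (Levenshtein costs: insert, delete, and substitute each cost 1), verifying the two worked examples:

```python
def min_edit_distance(source: str, target: str) -> int:
    """Fill an (n+1) x (m+1) table where cell [i][j] is the minimum number of
    edits needed to turn source[:i] into target[:j]."""
    n, m = len(source), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                  # delete all i characters of the source
    for j in range(m + 1):
        dp[0][j] = j                  # insert all j characters of the target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,             # deletion
                           dp[i][j - 1] + 1,             # insertion
                           dp[i - 1][j - 1] + sub_cost)  # substitution or match
    return dp[n][m]

print(min_edit_distance("kitten", "sitting"))        # 3
print(min_edit_distance("intention", "execution"))   # 5
```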
Applications of Minimum Edit Distance
1. Spell Checkers
In spell checkers, the minimum edit distance is used to find the closest valid words to a
misspelled word. For example, if a user types "spel", the system may suggest "spell" by
calculating the minimum edit distance between the input and possible dictionary words.
2. Machine Translation
In machine translation, minimum edit distance can be used to evaluate the accuracy of
translations. It compares the machine-generated translation to the human reference
translation by calculating the number of edits required to convert one sentence to another.
3. Text Similarity
It helps measure similarity between two text sequences, which is important in tasks like
paraphrase detection, text clustering, and plagiarism detection.
4. Speech Recognition
In automatic speech recognition, the minimum edit distance helps quantify the differences
between recognized words and the correct transcription.