Natural Language
Processing
UNIT-1
___
Notes
What is NLP?
NLP stands for Natural Language Processing, a field at the intersection of Computer Science, Human
Language, and Artificial Intelligence. It is the technology used by machines to
understand, analyse, manipulate, and interpret human languages. It helps developers organize
knowledge for performing tasks such as translation, automatic summarization, Named Entity
Recognition (NER), speech recognition, relationship extraction, and topic segmentation.
History of NLP
1948 - The first recognisable NLP application was introduced at Birkbeck College, London.
1950s - There was a conflicting view between linguistics and computer science. Chomsky
published his first book, Syntactic Structures, and claimed that language is generative in nature.
In 1957, Chomsky also introduced the idea of Generative Grammar, which gives rule-based
descriptions of syntactic structures.
Case Grammar
Case Grammar was developed by the linguist Charles J. Fillmore in 1968. Case Grammar uses
languages such as English to express the relationship between nouns and verbs by means of
prepositions.
In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.
For example: "Neha broke the mirror with the hammer". In this example, case grammar
identifies Neha as an agent, mirror as a theme, and hammer as an instrument. Between 1960 and
1980, the key systems were:
SHRDLU
LUNAR
LUNAR is the classic example of a Natural Language database interface system; it used
ATNs and Woods' Procedural Semantics. It was capable of translating elaborate natural language
expressions into database queries and handled 78% of requests without errors.
1980 - Current
Until 1980, natural language processing systems were based on complex sets of
hand-written rules. After 1980, NLP introduced machine learning algorithms for language
processing.
In the beginning of the 1990s, NLP started growing faster and achieved good processing accuracy,
especially for English grammar. In the 1990s, electronic text also became available, which provided a
good resource for training and examining natural language programs. Other factors include
the availability of computers with fast CPUs and more memory. The major factor behind the
advancement of natural language processing was the Internet.
Modern NLP consists of various applications, like speech recognition, machine
translation, and machine text reading. When we combine all these applications, artificial
intelligence can gain knowledge of the world. Let's consider the example of Amazon
Alexa: you can ask Alexa a question, and it will reply to you.
Challenges of NLP
Natural Language Processing (NLP) faces various challenges due to the complexity and diversity
of human language.
1. Language differences
Human language is rich and intricate, and there are thousands of languages spoken
around the world, each with its own grammar, vocabulary, and cultural nuances. No single person
understands all of these languages, and the productivity of human language is high. There is also
ambiguity in natural language, since the same words and phrases can have different meanings in
different contexts. This is the major challenge in understanding natural language.
2. Training Data
Training data is a curated collection of input-output pairs, where the input represents the features
or attributes of the data and the output is the corresponding label or target. For NLP,
features might include text data, and labels could be categories, sentiments, or any other relevant
annotations.
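As a minimal illustration (the texts and labels below are hypothetical, not from any particular dataset), NLP training data is often held as (text, label) pairs:

```python
# Hypothetical sentiment-classification training data: each example pairs
# an input text (the features) with its label (the target).
training_data = [
    ("The movie was absolutely wonderful", "positive"),
    ("I waited an hour and the food was cold", "negative"),
    ("The package arrived on time", "positive"),
]

# Split into features (inputs) and labels (outputs), as most ML libraries expect.
texts = [text for text, label in training_data]
labels = [label for text, label in training_data]
print(texts)
print(labels)
```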
3. Development Time and Resource Requirements
Development time and resource requirements for Natural Language Processing (NLP)
projects depend on various factors, including the task complexity, the size and quality of the data,
the availability of existing tools and libraries, and the expertise of the team involved.
4. Misspellings and Grammatical Errors
Overcoming misspellings and grammatical errors is a basic challenge in NLP, as these are
forms of linguistic noise that can impact the accuracy of understanding and analysis.
5. Mitigating Innate Biases
Mitigating innate biases in NLP algorithms is a crucial step toward ensuring fairness,
equity, and inclusivity in natural language processing applications. Here are some key points for
mitigating biases in NLP algorithms:
● Data collection and annotation: It is very important to ensure that the training
data used to develop NLP algorithms is diverse, representative, and free from biases.
● Bias analysis and detection: Apply bias detection and analysis methods to the
training data to find biases based on demographic factors such as race,
gender, and age.
● Data preprocessing: Data preprocessing is an important step for mitigating biases,
for example by debiasing word embeddings, balancing class distributions, and
augmenting underrepresented samples.
● Fair representation learning: Natural Language Processing models are trained to
learn fair representations that are invariant to protected attributes such as race or gender.
● Model auditing and evaluation: Natural language models are evaluated for
fairness and bias with the help of metrics and audits. NLP models are evaluated on
diverse datasets, and post-hoc analyses are performed to find and mitigate innate biases in
NLP algorithms.
6. Words with Multiple Meanings
Words with multiple meanings pose a lexical challenge in Natural Language Processing because
of their ambiguity. Such words are known as polysemous or homonymous and take different
meanings depending on the context in which they are used.
7. Addressing Multilingualism
Supporting many languages, each with its own grammar, vocabulary, and script, is a further
challenge for NLP systems.
8. Reducing Uncertainty and False Positives
It is a very crucial task to reduce uncertainty and false positives in Natural Language Processing
(NLP) in order to improve the accuracy and reliability of NLP models.
9. Facilitating Continuous Conversations
Facilitating continuous conversations with NLP requires developing systems that
understand and respond to human language in real time, enabling seamless interaction
between users and machines. Implementing real-time natural language processing pipelines gives
the capability to analyze and interpret user input as it is received; this involves algorithms and
systems optimized for low-latency processing so that responses to user queries and inputs are quick.
It also requires building NLP models that can maintain context throughout a conversation.
Understanding context enables systems to interpret user intent, track conversation history,
and generate relevant responses based on the ongoing dialogue. Intent recognition
algorithms are applied to find the underlying goals and intentions expressed by users in their messages.
Key considerations for overcoming these challenges include:
● Quantity and quality of data: High-quality, diverse data is needed to train the
NLP algorithms effectively. Data augmentation, data synthesis, and crowdsourcing are
techniques for addressing data scarcity.
● Ambiguity: The NLP algorithm should be trained to disambiguate words and
phrases.
● Out-of-vocabulary words: Techniques such as tokenization, character-level modeling,
and vocabulary expansion are used to handle out-of-vocabulary words.
● Lack of annotated data: Techniques such as transfer learning and pre-training can be
used to transfer knowledge from a large dataset to specific tasks with limited labeled data.
● Algorithm selection and model development: It is difficult to choose the machine
learning algorithms that are best suited to a given Natural Language Processing task.
● Training and evaluation: Training requires powerful computational resources,
including powerful hardware (GPUs or TPUs), and time for the algorithm's training
iterations. It is also important to evaluate the performance of the model with suitable
metrics and validation techniques to confirm the quality of the results.
Language modelling:
Language modelling is the way of determining the probability of any sequence of words.
Language modelling is used in various applications such as Speech Recognition, Spam filtering,
etc. Language modelling is the key aim behind implementing many state-of-the-art Natural
Language Processing models.
Due to the smoothing techniques, bigram and trigram language models are robust and have
been successfully used more widely in speech recognition than conventional grammars like
context free or even context sensitive grammars. Although these grammars are expected to
better capture the inherent structures of the language, they have a couple of problems:
● robustness: the grammar must be able to handle a vocabulary of 10,000 or more words,
and ultimately a non-zero probability must be assigned to each possible word sequence.
● ambiguity: while m-gram language models avoid any ambiguity in parsing, context-free
grammars are typically ambiguous and thus produce more than a single parse tree.
N-Gram: This is one of the simplest approaches to language modeling. Here, a probability
distribution is created for a sequence of 'n' words, where 'n' can be any number and defines the size of
the gram (the sequence of words being assigned a probability). If n = 4, a gram may look like: "can
you help me". Basically, 'n' is the amount of context that the model is trained to consider. There
are different types of N-Gram models, such as unigrams, bigrams, trigrams, etc.
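As a rough illustration of the idea (a minimal sketch on a made-up toy corpus, not a production model), a bigram model can be estimated from counts with add-one smoothing:

```python
from collections import Counter, defaultdict

# Toy corpus; <s> and </s> mark sentence boundaries (an assumed convention here).
corpus = [
    "<s> can you help me </s>",
    "<s> can you call me </s>",
    "<s> you can help </s>",
]

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

vocab_size = len(unigram_counts)

def bigram_prob(prev, curr):
    # Add-one (Laplace) smoothing so unseen bigrams still get non-zero probability.
    return (bigram_counts[prev][curr] + 1) / (unigram_counts[prev] + vocab_size)

# P(help | you) vs P(call | you): "help" follows "you" more often in this toy corpus.
print(bigram_prob("you", "help"))
print(bigram_prob("you", "call"))
```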
Unigram: The unigram is the simplest type of language model. It doesn't look at any
conditioning context in its calculations. It evaluates each word or term independently. Unigram
models commonly handle language processing tasks such as information retrieval. The unigram
is the foundation of a more specific model variant called the query likelihood model, which uses
information retrieval to examine a pool of documents and match the most relevant one to a
specific query.
Bidirectional: Unlike n-gram models, which condition only on the preceding words (a single
direction), bidirectional models analyze text in both directions, backwards and forwards. These
models can predict any word in a sentence or body of text by using all of the other words in the
text. Examining text bidirectionally increases result accuracy. This type is often utilized in
machine learning and speech generation applications. For example, Google uses a bidirectional
model to process search queries.
Exponential: This type of statistical model evaluates text using an equation that combines
n-grams and feature functions. Here the features and parameters of the desired results are
specified in advance. The model is based on the principle of maximum entropy, which states that the
probability distribution with the most entropy (subject to the constraints) is the best choice.
Exponential models make fewer statistical assumptions, which means the results are more likely
to be accurate.
Continuous Space: In this type of model, words are represented as a non-linear
combination of weights in a neural network. The process of assigning a weight vector to a word is
known as word embedding. This type of model proves helpful in scenarios where the data set of
words continues to grow and includes unique words.
In cases where the data set is large and consists of rarely used or unique words, models such as
n-grams do not work well. This is because, as the number of words increases, the number of possible
word sequences increases, and thus the patterns predicting the next word become weaker.
Regular Expressions
Regular Expressions are used to denote regular languages. An expression is regular if:
● it is the empty set ∅, the empty string ε, or a single symbol from the alphabet Σ;
● it is the union (r1 | r2), the concatenation (r1 r2), or the Kleene star (r1*) of regular
expressions r1 and r2.
Finite Automata
Finite automata (FA) are the machines that recognize regular languages, i.e., languages
represented by regular expressions (RE). Hence there is a chain of implications, FA → Regular
Language → Regular Expression → FA; in other words, the three formalisms are equivalent. Consider
the talk, or sound, of a sheep (the sheeptalk language S), which can be represented by the set of
strings S = {baa!, baaa!, baaaa!, ...}, where each sentence of the language differs only in the
length of the 'a' sound. The set of strings S can be represented by the regular expression baa+!;
however, we will prefer to write it as /baa+!/, a pronunciation-style format. The sounds in the set
S can be recognized by the FA shown in Fig. 5.1.
[Figure 5.1: Finite automata. Transition diagram with states q0-q4 reading b, a, a, a, !, together
with the finite control, tape, and read head.]
Formally, an FA is represented by M = (Q, Σ, q0, δ, F), where
● Q is a finite set of states; here Q = {q0, q1, q2, q3, q4},
● Σ is a finite alphabet; here Σ = {a, b, !},
● δ is the transition function, δ : Q × Σ → Q,
● q0 is the start state,
● F is the set of final (accepting) states; here F = {q4},
so that S = L(M) = {baa!, baaa!, baaaa!, ...}.
The recognition of a language string by the FA is described by Algorithm 1. For this, the FA's
tape is divided into squares, called index positions, each holding one symbol from the alphabet Σ.
We take w as the input string, with |w| = n its length [Jurafsky]. Recognizing a string with a
(non-deterministic) FA amounts to searching a tree: for a string of length n over the alphabet
Σ = {a, b}, the worst-case search takes O(2^n) space and time.
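For concreteness, here is a minimal sketch (not from the original notes) of the sheeptalk FA of Fig. 5.1, implemented as a transition table in Python, alongside the equivalent regular-expression check:

```python
import re

# Transition table for the sheeptalk FA: states q0..q4, alphabet {b, a, !}.
# Missing entries mean the machine rejects (no transition).
delta = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",  # loop: any further 'a's stay in q3
    ("q3", "!"): "q4",
}
FINAL_STATES = {"q4"}

def accepts(word: str) -> bool:
    """Run the FA over the word, symbol by symbol."""
    state = "q0"
    for symbol in word:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state in FINAL_STATES

for w in ["baa!", "baaaa!", "ba!", "baa"]:
    # The FA and the regular expression /baa+!/ accept exactly the same strings.
    print(w, accepts(w), bool(re.fullmatch(r"baa+!", w)))
```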
A finite state transducer (FST) is a finite state machine with two tapes, an input tape and an
output tape, and a finite number of states. Fig. 5.3 shows a diagram in which the input and output
strings are written on the transitions, separated by ":". This FST has been used as a translator, as
it translates the input sentence "Hello World" into "Hey there krc".
[Figure 5.3: An FST as a translator, with transitions Hello:Hey, ε:there, and World:krc between
states 0-3.]
Thus, an FST is a directed graph, like a finite automaton, with:
● edges/transitions that carry input/output labels,
● sometimes empty labels, indicated by ε,
● the property that traversing the FST to its end implies the translation of one string into
another, the generation of two strings, or the relation of one string to another,
● a designated "start" state and one or more "final" states.
We define the FST here as a Mealy machine, which is an extension of the normal finite state (FS)
machine. The formal representation of a Mealy machine is given by
M = (Q, Σ, q0, δ, F)      (5.1)
where Q = {q0, q1, ..., qN−1} is a finite set of states, Σ is a finite alphabet of complex symbols
with Σ ⊆ I × O, q0 is the start state, and δ : Q × Σ → Q is the transition function, for example
δ(qi, i:o) = qj. Here I and O are the input and output symbol sets, respectively, and both include
the symbol ε. For Σ = {a, b, !}, corresponding to the language discussed earlier, the FST has the
i:o set {a:a, b:b, !:!, a:!, a:ε, ε:!}.
FSTs are useful for a variety of applications:
● Word inflections: for example, finding the plural of words, cat → cats, dog → dogs,
goose → geese, etc.
● Morphological parsing: extracting the properties of a word, e.g., cats → cat + [noun] + [plural].
● Simple word translations: for example, US English to UK English.
● Simple commands to the computer.
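A tiny sketch (illustrative only; the state names and transition table are assumptions, not the exact machine of Fig. 5.3) of an FST-style transducer that performs the "Hello World" → "Hey there krc" translation:

```python
# Each transition maps (state, input symbol) -> (next state, output symbols).
# EPS stands for the empty input label ε: the machine emits output without reading input.
EPS = ""

transitions = {
    (0, "Hello"): (1, ["Hey"]),
    (1, EPS):     (2, ["there"]),   # ε:there
    (2, "World"): (3, ["krc"]),
}
FINALS = {3}

def transduce(tokens):
    state, output, i = 0, [], 0
    while i < len(tokens) or state not in FINALS:
        if i < len(tokens) and (state, tokens[i]) in transitions:
            state, out = transitions[(state, tokens[i])]
            output.extend(out)
            i += 1
        elif (state, EPS) in transitions:          # take an ε-transition
            state, out = transitions[(state, EPS)]
            output.extend(out)
        else:
            raise ValueError("input not accepted by the transducer")
    return " ".join(output)

print(transduce("Hello World".split()))   # -> "Hey there krc"
```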
Morphology is the study of how words are constructed. Construction of English words
through the attachment of prefixes and suffixes (together called affixes) is called
concatenative morphology, because a word is composed of a number of morphemes
concatenated together. A word may have more than one affix, for example rewrites (re+write+s),
unlikely (un+like+ly), etc. There are broadly two ways to form words using morphemes:
1. Inflection: Inflectional morphology forms words from the same word stem, e.g.,
write+s, work+ed, etc. Table 5.1 shows words constructed using inflectional
morphology.
2. Derivation: Derivational morphology forms new words by adding affixes that change the
meaning or the part of speech of the stem, e.g., like → unlikely (un+like+ly).
We come to know the structure of a word when we perform morphological parsing for that word.
Given a surface form (input form), e.g., "going", we might produce the parsed form: verb-go +
gerund-ing. Morphological parsing can be done with the help of finite-state transducers. A
morpheme is a meaning-bearing unit of a language. For example, fox has a single morpheme,
fox, while cats has two morphemes, cat and -s. Similarly, eat, eats, eating, ate, and eaten are made of
different morphemes. Some examples of the mapping of words to their corresponding morphemes are
given in Table 5.3. This mapping of input to output corresponds to the input and output of a finite
state machine. In speech recognition, when a word has been identified, like cats or dogs, it
becomes necessary to produce its morphological parse to find its true meaning, in the form
of its structure, and to know how it is organised. The parse includes features like N (noun) and
V (verb), which specify additional information about the word stem, e.g., +N means the word is a noun,
+SG means singular, +PL means plural, etc.
To build a morphological parser we need:
1. Lexicon: the list of stems and affixes, together with basic information about them (for
example, whether a stem is a noun stem or a verb stem).
2. Morphotactics rules: rules about the ordering of morphemes in a word, e.g., -ed follows a
verb (as in worked, studied), while un- precedes a verb, as in unlock, untie, etc.
3. Orthographic rules (spelling): rules for combining morphemes, e.g., city + -s gives cities and not
citys.
We can use the lexicon together with the morphotactics rules to recognize words with the
help of finite automata in the form stem + affix + part-of-speech (N, V, etc.). Fig. 5.7 shows
the basic idea of parsing nouns using morphological parsing. Recognition of a noun by the FA requires
reaching the final state (marked by a double circle in the figure). Table 5.4 shows
some examples of regular and irregular nouns.
Morphological Parsing of nouns
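A very small sketch (the lexicon and suffix rules below are illustrative assumptions, not the FST of Fig. 5.7) of recognizing nouns in the stem + affix form described above:

```python
# Tiny lexicon: regular noun stems plus irregular singular/plural forms.
regular_noun_stems = {"cat", "dog", "fox"}
irregular_nouns = {"goose": ("goose", "+N +SG"), "geese": ("goose", "+N +PL")}

def parse_noun(word):
    """Return a 'stem +N +SG/+PL' style parse, or None if not recognized."""
    if word in irregular_nouns:
        stem, feats = irregular_nouns[word]
        return f"{stem} {feats}"
    if word in regular_noun_stems:
        return f"{word} +N +SG"
    if word.endswith("es") and word[:-2] in regular_noun_stems:
        return f"{word[:-2]} +N +PL"          # e.g., foxes -> fox +N +PL
    if word.endswith("s") and word[:-1] in regular_noun_stems:
        return f"{word[:-1]} +N +PL"          # e.g., cats -> cat +N +PL
    return None

for w in ["cats", "foxes", "goose", "geese", "walk"]:
    print(w, "->", parse_noun(w))
```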
A similar arrangement is possible for verb morphological parsing (see Fig. 5.8 and Table 5.5).
The lexicon for verbal inflection has three stem classes (reg-verb stem, irreg-verb stem, and
irreg-past-verb), with the affix classes: -ed for past tense and past participle, -ing for continuous,
and -s for third person singular. Adjectives can be parsed in a similar manner to nouns and verbs.
Some of the adjectives of the English language are: big, bigger, biggest, clean, cleaner, cleanest,
happy, unhappy, happier, happiest, real, really, unreal, etc. The finite automaton in Fig. 5.9
shows the morphological parsing of adjective words. At the next stage, the lexicon can be
expanded into sub-lexicons, i.e., individual letters, to be recognized by the finite automata. For
example, regular-noun in Fig. 5.7 can be expanded to the letters "f o x" connected by three states in a
transition diagram. Similarly, the regular verb stem in Fig. 5.8 can be expanded to the letters "w a l
k", and so on, as shown in Fig. 5.10. Note that in the parsing of N, V, ADJ, and ADV discussed
above, for the sake of simplicity we have not shown the transitions separated by a colon (":");
however, the FST has two tapes as usual, for input and output.
[Table 5.5: Regular and irregular verbs (columns: Reg-verb, Past, Irreg-verb, Irreg-past-verb, Continuous).]
Tokenization
Tokenization is one of the first steps in any NLP pipeline. Tokenization is nothing but splitting
the raw text into small chunks of words or sentences, called tokens. If the text is split into words,
it is called 'Word Tokenization', and if it is split into sentences, it is called 'Sentence
Tokenization'. Generally, whitespace is used to perform word tokenization, while characters like
periods, exclamation points, and newline characters are used for sentence tokenization. We have to
choose the appropriate method as per the task at hand. While performing tokenization, a
few characters like spaces and punctuation marks are ignored and will not be part of the final list of
tokens.
Tokenization is Required
Every sentence gets its meaning by the words present in it. So by analyzing the words present in
the text we can easily interpret the meaning of the text. Once we have a list of words we can also
use statistical tools and methods to get more insights into the text. For example, we can use word
count and word frequency to find out the importance of words in that sentence or document.
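A minimal sketch of word and sentence tokenization plus a word-frequency count, using only the Python standard library (the sample sentence is made up for illustration):

```python
import re
from collections import Counter

text = "NLP is fun. Tokenization splits text into tokens! Tokens make analysis easy."

# Sentence tokenization: split after '.', '!' or '?' followed by whitespace.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Word tokenization: keep alphanumeric runs, drop punctuation, lowercase.
words = re.findall(r"\w+", text.lower())

# Word frequency gives a rough signal of which words matter in the text.
print(sentences)
print(Counter(words).most_common(3))
```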
Spelling Error Detection
Spelling errors in NLP can be detected using techniques such as spell checking, phonetic
matching, and language models that handle out-of-vocabulary words effectively.
1. Spell Checking:
● Dictionary-Based Approaches: Utilize a dictionary or lexicon to check whether
each word in the text is spelled correctly. If a word is not found in the
dictionary, it is considered a potential spelling error (a small sketch of this
approach follows this list).
2. Phonetic Matching:
● Soundex and Metaphone: Phonetic algorithms map words to phonetic
representations based on their pronunciation. Words with similar phonetic
representations sound alike even if they are spelled
differently. This technique helps in identifying spelling errors where words
sound alike but are spelled differently.
3. Language Models:
● Statistical Language Models: Use statistical models trained on large text
corpora to estimate the probability of a word sequence. Language models
can help in identifying likely corrections for misspelled words based on
the context of surrounding words.
● Neural Language Models: Modern neural language models like
Transformer-based models (e.g., BERT, GPT) are effective at predicting
and correcting spelling errors by considering the context of the entire
sentence. Fine-tuning these models on spelling correction tasks can yield
highly accurate results.
4. Rule-Based Approaches:
● Pattern Matching: Apply regular expressions or pattern-matching rules to
detect common types of spelling errors, such as repeated characters,
missing characters, or transposed letters.
● Language-Specific Rules: Develop language-specific spelling correction
rules based on common misspellings, phonetic patterns, or morphological
rules.
5. Ensemble Methods:
● Combining Multiple Approaches: Combine the outputs of different
spelling correction methods, such as spell checking, phonetic matching,
and language models, using ensemble techniques to improve accuracy and
robustness.
6. User Feedback:
● Interactive Correction: Allow users to provide feedback on suggested
corrections and incorporate this feedback to improve the spelling
correction system over time. This can be achieved through interactive
interfaces or feedback mechanisms in applications.
7. Domain-Specific Customization:
● Custom Dictionaries: Create domain-specific dictionaries or lexicons
containing relevant terms and vocabulary to improve the accuracy of
spelling correction in specific domains or industries.
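As an illustration of the dictionary-based approach (the word list below is a tiny stand-in for a real lexicon), Python's standard-library difflib can rank close matches to a misspelled word:

```python
import difflib

# Tiny stand-in dictionary; a real spell checker would load a full lexicon.
DICTIONARY = ["spell", "spill", "speak", "special", "hello", "world", "checker"]

def check(word: str):
    """Flag the word if it is not in the dictionary and suggest close matches."""
    if word in DICTIONARY:
        return word, []                      # spelled correctly, no suggestions
    suggestions = difflib.get_close_matches(word, DICTIONARY, n=3, cutoff=0.6)
    return word, suggestions

print(check("spel"))     # -> ('spel', ['spell', ...]) ranked by similarity
print(check("world"))    # -> ('world', [])
```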
Minimum Edit Distance
The minimum edit distance has important applications in areas like spell checking, DNA
sequence analysis, and natural language processing (NLP). It is usually calculated using a
dynamic programming approach, in which a table is constructed to compare substrings
of the two words being analyzed. This method ensures that the solution is computed efficiently.
The time complexity of computing the minimum edit distance is O(n × m),
where n and m are the lengths of the two strings. It provides a useful metric for understanding
the similarity between two strings and can help in tasks like text correction or similarity-based
searches.
It helps to determine how many single-character edits are needed to convert one string into
another:
1. insertions,
2. deletions,
3. substitutions.
Example of calculating the minimum edit distance between two words: "kitten" and "sitting."
We want to transform "kitten" into "sitting" using the fewest possible single-character edits
(insertions, deletions, or substitutions).
Here, three edits are required: two substitutions and one insertion. Therefore, the minimum
edit distance between "kitten" and "sitting" is 3.
Calculation Example:
For the words "intention" and "execution", the minimum edit distance involves the
following edits:
1. delete i (intention → ntention),
2. substitute n with e (ntention → etention),
3. substitute t with x (etention → exention),
4. insert c (exention → execntion),
5. substitute n with u (execntion → execution),
giving a minimum edit distance of 5.
This is an example of the Levenshtein distance, where each operation has a cost of 1. The
dynamic programming approach typically uses a matrix to track the optimal number of
edits for each pair of substrings of the two words, ensuring an efficient calculation of the total edit
distance.
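A minimal sketch of the dynamic programming computation described above (Levenshtein costs: insert, delete, and substitute each cost 1), verifying the two worked examples:

```python
def min_edit_distance(source: str, target: str) -> int:
    """Fill an (n+1) x (m+1) table where cell [i][j] is the minimum number of
    edits needed to turn source[:i] into target[:j]."""
    n, m = len(source), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                  # delete all i characters of the source
    for j in range(m + 1):
        dp[0][j] = j                  # insert all j characters of the target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,             # deletion
                           dp[i][j - 1] + 1,             # insertion
                           dp[i - 1][j - 1] + sub_cost)  # substitution or match
    return dp[n][m]

print(min_edit_distance("kitten", "sitting"))        # 3
print(min_edit_distance("intention", "execution"))   # 5
```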
Applications of Minimum Edit Distance
1. Spell Checkers
In spell checkers, the minimum edit distance is used to find the closest valid words to a
misspelled word. For example, if a user types "spel", the system may suggest "spell" by
calculating the minimum edit distance between the input and possible dictionary words.
2. Machine Translation
In machine translation, minimum edit distance can be used to evaluate the accuracy of
translations. It compares the machine-generated translation to the human reference
translation by calculating the number of edits required to convert one sentence to another.
3. Text Similarity
It helps measure similarity between two text sequences, which is important in tasks like
paraphrase detection, text clustering, and plagiarism detection.
4. Speech Recognition
In automatic speech recognition, the minimum edit distance helps quantify the differences
between recognized words and the correct transcription.