NLP Merged

The document discusses syntactic ambiguity in natural language processing (NLP), detailing types such as structural, lexical, and attachment ambiguity, and the challenges they pose for parsing algorithms. It explains the role of Context-Free Grammar (CFG) in syntactic analysis, including its limitations and the importance of parsing techniques like shallow parsing and chunking. Additionally, the document introduces Conditional Random Fields (CRFs) as a statistical modeling method for structured prediction tasks in NLP.



Syntax Analysis
Ambiguity in syntactic analysis refers to situations where a sentence or phrase can be interpreted in
more than one way due to its structure. This ambiguity can arise from the way words are grouped and
related to each other in a sentence, leading to multiple possible syntactic trees or parses. Ambiguity in
syntactic analysis is a significant challenge in natural language processing (NLP) and computational
linguistics.

Types of Syntactic Ambiguity:


● Structural Ambiguity (Syntactic Ambiguity):
Occurs when a sentence can be parsed in multiple ways, resulting in different syntactic
structures or parse trees. The ambiguity arises because the sentence's structure is not clear
without additional context.
● Lexical Ambiguity:
Involves words that have more than one meaning (polysemy) or different interpretations
depending on the context. While this is primarily a semantic issue, it can lead to syntactic
ambiguity if the word's role in the sentence changes based on its meaning.
● Attachment Ambiguity:
Occurs when it is unclear how a particular phrase or clause should attach to the rest of the
sentence. This is often seen with prepositional phrases or relative clauses.

Challenges of Syntactic Ambiguity in NLP:


● Parsing Complexity:
Ambiguity increases the complexity of parsing algorithms because they must consider
multiple potential parse trees for a given sentence, leading to higher computational costs.
● Disambiguation:
Determining the correct interpretation of an ambiguous sentence often requires additional
contextual information, semantic analysis, or probabilistic methods. Disambiguating
sentences automatically is a challenging task, especially in real-time applications.
● Impact on Downstream Tasks:
Ambiguity in syntactic analysis can affect other NLP tasks, such as machine translation,
information extraction, and question answering, by leading to incorrect interpretations of the
input text.
Context-Free Grammar
Context-Free Grammar (CFG) plays a crucial role in syntactic analysis, which is the process of
analyzing the structure of sentences in a language according to its grammar. In computational
linguistics and natural language processing (NLP), CFGs are foundational tools used to represent and
parse the syntactic structure of sentences.
A context-free grammar is a formal grammar used to generate all possible strings of a given formal
language.
A context-free grammar G is defined as a 4-tuple:
G = (V, T, P, S)
where
V is a finite set of non-terminal symbols,
T is a finite set of terminal symbols,
P is a set of production rules, and
S is the start symbol.

Role of CFG in Syntactic Analysis:


● Defining Sentence Structure:
CFGs define the hierarchical structure of sentences, specifying how different parts of a
sentence (e.g., noun phrases, verb phrases) are organized and related to each other. This
structure is essential for understanding how sentences are constructed and how different
elements interact.
Example: A simple CFG might have rules like:
S → NP VP (A sentence consists of a noun phrase followed by a verb phrase)
NP → Det N (A noun phrase consists of a determiner followed by a noun)
VP → V NP (A verb phrase consists of a verb followed by a noun phrase)
● Parsing Sentences:
CFGs are used in parsing algorithms to analyze and derive the syntactic structure of
sentences. Parsing is the process of taking a sentence and constructing its parse tree, which
represents the hierarchical structure as defined by the CFG (a code sketch follows this list).
● Disambiguation:
In cases where a sentence can be interpreted in multiple ways (syntactic ambiguity),
CFG-based parsers generate multiple parse trees. These trees represent the different possible
interpretations of the sentence's structure. The parser can then choose the most appropriate
parse based on additional information, such as probabilities or context.
● Grammar Checking:
CFGs can be used to check whether a given sentence is grammatically correct according to
the rules of the language. If a sentence cannot be generated by the CFG, it is considered
ungrammatical within the context of that grammar.
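To make the parsing role concrete, here is a minimal sketch of the toy grammar above in NLTK (the lexical rules for Det, N and V and the example sentence are illustrative assumptions added so the grammar can actually be parsed; they are not part of the original notes):

import nltk

# Toy grammar: the three structural rules above plus an assumed mini-lexicon.
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N  -> 'cat' | 'mouse'
V  -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat chased a mouse".split()):
    print(tree)
# (S (NP (Det the) (N cat)) (VP (V chased) (NP (Det a) (N mouse))))

If the grammar is extended with ambiguous rules (e.g. PP attachment), the same loop enumerates one tree per reading, which is exactly the multiple-parse situation described under Disambiguation above.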

Example: Sue, mouse & the cat:



SMC Example: Top-down parsing (A):

● Top-down parsing starts with the S symbol and tries to rewrite it into the sentence.

SMC Example: Bottom-up parsing (A):

● Bottom-up parsing starts with the words and tries to find symbols that generate them.

SMC Example: Top-down parsing (B) - using parse tree



SMC Example: Bottom-up parsing (B) - using parse tree

Limitations and Issues of CFG:


● Ambiguity
○ Syntactic Ambiguity: CFGs can produce multiple valid parse trees for the same
sentence, leading to ambiguity in interpretation. For example, the sentence "I saw the
man with the telescope" can have different parse trees depending on whether "with
the telescope" modifies "the man" or "saw."
○ Structural Ambiguity: CFGs struggle with sentences that can be parsed in more than
one way due to ambiguous grammar rules.
● Overgeneration
CFGs can sometimes generate grammatically correct but semantically nonsensical sentences,
leading to overgeneration. This happens when the rules allow for combinations of words that,
while syntactically valid, do not make sense.
● Inflexibility in Grammar Development
Developing a CFG for a natural language is a labor-intensive process that requires extensive
linguistic knowledge. Moreover, because natural languages evolve, maintaining and updating
CFGs to accommodate new linguistic phenomena can be challenging.
● Difficulty in Modeling Natural Language Phenomena
○ Ellipsis and Anaphora: CFGs do not handle cases where parts of a sentence are
omitted but implied, as in "John can play the guitar, and Mary can too."
To mitigate these issues, more advanced parsing techniques are often used, such as:
○ Dependency Parsing: Focuses on the relationships between words, often providing a
more accurate representation of sentence structure.
○ Neural Network-Based Parsing: Modern NLP approaches use deep learning models to
parse sentences, allowing for more flexibility and better handling of context and
ambiguity.

SHALLOW PARSING

Parsing
Parsing is the process of analyzing sentences syntactically. It breaks down a sentence into its
constituent parts (like nouns, verbs, and phrases).
Without proper parsing, NLP systems would struggle to understand the grammatical structure of
language.
Parsing is usually an intermediate stage
– Used to uncover the structures that are used by later stages of processing
Full parsing is a sufficient but not a necessary intermediate stage for many NLP tasks.
Full parsing often provides more information than we need.

Advantages of Full Parsing


● Syntactic rules are general, hence this process is domain-independent.
● So a syntactic analyzer of one domain can be easily ported to another.
● Efficient and well-elaborated analysis algorithms exist.
● Good for languages with fixed word order.
Disadvantages of Full Parsing:
● Searching for the most likely correct parse of a sentence is hard, as natural-language sentences are
structurally ambiguous.
● Difficulty in translation into semantics
● Long-distance dependencies – example: wh-extraction
● Should Peter buy a book? Which book should Peter buy?
● There is a long-distance dependency between buy and which book
● Bad when it comes to freer word order – even in English!
● Sometimes very slow for real-time applications.

Shallow Parsing
Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the
constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor
their role in the main sentence.
In shallow parsing, there is at most one level between the root and the leaves, while deep parsing
comprises more than one level. Shallow parsing is also called light parsing or chunking.

Chunking is the process of grouping similar words together based on the nature of the word.

Chunking (a.k.a. shallow parsing) analyzes a sentence to identify its constituents (noun groups,
verbs, verb groups, etc.). However, it does not specify their internal structure, nor their role in the
main sentence.
"The smaller boxes show the word-level tokenization and part-of-speech tagging, while the large
boxes show higher-level chunking. Each of these larger boxes is called a chunk. Like tokenization,
which omits whitespace, chunking usually selects a subset of the tokens. Also like tokenization, the
pieces produced by a chunker do not overlap in the source text."

Difference Between POS Tagging and Shallow Parsing?


POS tagging is the process of deciding the type of every token in a text, e.g. NOUN, VERB,
DETERMINER, etc. A token can be a word or a punctuation mark.
Shallow parsing (chunking), by contrast, divides a text into syntactically related groups of words.
POS tagging output:
My/PRP$ dog/NN likes/VBZ his/PRP$ food/NN ./.
Chunking output:
[NP My dog] [VP likes] [NP his food]
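A small sketch contrasting the two outputs in spaCy (an assumption for illustration: the notes do not prescribe a library, and the en_core_web_sm model must be downloaded first):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("My dog likes his food.")

# Word-level POS tags (PRP$, NN, VBZ, ...)
print([(token.text, token.tag_) for token in doc])

# Shallow noun-phrase chunks, expected to be "My dog" and "his food"
print([chunk.text for chunk in doc.noun_chunks])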

Shallow Parsing Approaches


● Light (or “partial”) parsing
● Chunk parsing (a type of light parsing)
○ Not all NLP applications require a complete syntactic analysis.
○ IR: sufficient to find NPs and VPs.
○ IE, summary generation, QA: we are interested in specific syntactic information and specific
syntactic-semantic relations such as agent, object, location, time, etc. (who did what to whom,
when, where, and why).

Light Parsing
Goal: To assign a partial structure to a sentence.
● Simpler solution space
● Local context
● Non-recursive
● Restricted (local) domain

Output from Light Parsing:


● What kind of partial structures should light parsing construct?
● Different structures are useful for different tasks:
○ Partial constituent structure
[NP I] [VP saw [NP a tall man in the park]].
○ Prosodic segments (The way a speaker's voice rises and falls)
[I saw] [a tall man] [in the park].
○ Content word groups
[saw] [a tall man] [in the park].

Chunk Parsing
Goal: To divide a sentence into a sequence of chunks
Group together connected items or words (called chunks) so that they can be stored or processed as
single concepts.
● Chunks are non-overlapping regions of a text
saw [a tall man] in [the park]
● Chunks are non-recursive
– A chunk cannot contain other chunks
● Chunks are non-exhaustive
– Not all words are included in the chunks
Examples:
● Noun-phrase chunking:
[I] saw [a tall man] in [the park].
● Verb-phrase chunking:
The man who [was in the park] [saw me].
● Prosodic chunking:
[I saw] [a tall man] [in the park].

Chunks and Constituency:


Constituents: [[a tall man] [ in [the park]]].
Chunks:[a tall man] in [the park].
● A constituent is part of some higher unit in the hierarchical syntactic parse
● Chunks are not constituents
○ Constituents are recursive
● But, chunks are typically sub-sequences of constituents
○ Chunks do not cross major constituent boundaries

Chunk Parsing: Accuracy


● Chunk parsing achieves high accuracy
● Small solution space
● Less word-order flexibility within chunks than between chunks
○ Fewer long-range dependencies
○ Less context dependence
● Better locality
● No need to resolve ambiguity
● Less error propagation

Chunk Parsing: Domain Specificity


● Chunk parsing is less domain specific
● Dependencies on lexical/semantic information tend to occur at levels “higher” than chunks:
○ Attachment (I saw a girl with a telescope)
○ Argument selection
○ Movement (noun phrase)
● Fewer stylistic differences with chunks
Psycholinguistic Motivations
● Chunks are processing units
○ Humans tend to read texts one chunk at a time
○ Eye movement tracking studies
● Chunks are phonologically marked
○ Pauses
○ Stress patterns
● Chunking might be a first step in full parsing
○ Integration of shallow and deep parsing

Chunk Parsing: Efficiency


● Smaller solution space
● Relevant context is small and local
● Chunks are non-recursive
● Chunk parsing can be implemented with a finite state machine
○ Fast (linear)
○ Low memory requirement (no stacks)
● Chunk parsing can be applied to very large text sources (e.g., the web)

Chunk Parsing Techniques


Chunk parsers usually ignore lexical content
● Only need to look at part-of-speech tags
● Techniques for implementing chunk parsing
○ Regular expression matching
○ Chinking
○ Cascaded Finite state transducers

Regular Expression Matching


● Define a regular expression that matches the sequences of tags in a chunk
○ A simple noun phrase chunk regular expression using Determiner, Adjective and
Singular Noun:
■ <DT>? <JJ>* <NN.?>   (? = zero or one occurrence, * = zero or more occurrences;
see the NLTK sketch after this list)
● Chunk all matching subsequences:
○ In:
The/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
○ Out:
[The/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
● If matching subsequences overlap, the first one gets priority.
● Regular expressions can be cascaded.
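A minimal NLTK sketch of the chunk rule above, applied to the same tagged sentence (the RegexpParser grammar string is one standard way of writing such tag patterns):

import nltk

tagged = [("The", "DT"), ("little", "JJ"), ("cat", "NN"),
          ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# NP chunk rule over POS tags: optional determiner, any adjectives, a noun
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.?>}")
print(chunker.parse(tagged))
# (S (NP The/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))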

Chinking
● A chink is a subsequence of the text that is not a chunk.
● Define a regular expression that matches the sequences of tags in a chink.
○ A simple chink regexp for finding NP chunks:
(<VB.?> | <IN>)+
● Chunk anything that is not a matching subsequence:
the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
(chunk) (chink) (chunk)

Chinking is the process of removing a sequence of tokens from a chunk.

This means we remove from the chunk one or more verbs, prepositions, determiners, or the word
'to' (see the NLTK sketch below).
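A minimal chinking sketch along the same lines, again with NLTK's RegexpParser (the exact chink pattern is an illustrative assumption): first chunk everything, then chink out verbs and prepositions.

import nltk

grammar = r"""
NP:
  {<.*>+}          # chunk everything
  }<VB.?|IN>+{     # then chink sequences of verbs and prepositions
"""

tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
          ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

print(nltk.RegexpParser(grammar).parse(tagged))
# (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))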

Finite State Approaches to Shallow Parsing


● Shallow Parsing with FSMs:
○ FSMs can be designed to recognize specific patterns or sequences of words.
○ For shallow parsing, we define FSMs to identify chunks like noun phrases, verb
phrases, or prepositional phrases.
● Benefits of Finite State Approaches:
○ Speed: FSM-based parsing is fast and suitable for real-time applications.
○ Memory Efficiency: FSMs require minimal memory compared to more complex
parsing techniques.
○ Customization: We can tailor FSMs to specific linguistic patterns or languages.
● Applications:
○ Shallow parsing using FSMs is commonly used in information extraction, named
entity recognition, and text mining.
● Finite-state approximation of sentence structures (Abney 1995)
○ finite-state cascades: sequences of levels of regular expressions
○ recognition approximation: tail-recursion replaced by iteration
○ interpretation approximation: embedding replaced by fixed levels
● Finite-state approximation of phrase structure grammars (Pereira/Wright 1997)
○ flattening of shift-reduce-recogniser
○ no interpretation structure (acceptor only)
○ used in speech recognition where syntactic parsing serves to rank hypotheses for
acoustic sequences
● Finite-state approximation (Sproat 2002)
○ bounding of centre embedding
○ reduction of recognition capacity
○ flattening of interpretation structure

Deep Vs Shallow Parsing



Conditional Random Fields


Conditional Random Fields (CRFs) are a type of statistical modeling method used primarily in natural
language processing (NLP) for tasks involving structured prediction. They are particularly effective
for problems where the output is a sequence of labels, such as part-of-speech tagging, named entity
recognition, or syntactic parsing.

Def: A CRF is a probabilistic graphical model used to model the conditional probability of a label
sequence given a sequence of input features. CRFs consider the dependencies between labels in a
sequence, which helps capture the context and relationships more effectively.

What is CRF?
● A CRF performs task-specific prediction: given an input vector X, it predicts a label sequence y
drawn from a predefined label set.
● CRF is a probabilistic discriminative model that has a wide range of applications in Natural
Language Processing, Computer Vision and Bioinformatics.
● A conditional random field predicts sequences by using contextual information, which helps the
model make correct predictions.

A discriminative model makes predictions based on conditional probability and is used for either
classification or regression.

Goal: Sequence segmentation and labeling


● Computational linguistics
● Computer science
● Computational biology

Overview
● Generative models
○ HMM
● Discriminative model
○ Conditional models
■ Maximum entropy Markov models
■ Conditional random fields – Predict the seq. (PoS tag)
● Experiments

How CRFs differ from Hidden Markov Models


● Hidden Markov Models are generative, and give output by modeling the joint probability
distribution.
● Conditional Random Fields are discriminative, and model the conditional probability
distribution.
● CRFs don’t rely on the independence assumption (that the labels are independent of each
other), and avoid label bias.
Applications of CRFs
● Parts-of-Speech tagging
○ Parts of speech of a sentence rely on previous words, and by using feature functions
that take advantage of this, we can use CRFs to learn how to distinguish which words
of a sentence correspond to which POS.
● Named Entity recognition –
○ Extracting Proper nouns from sentences. Conditional Random Fields can be used to
predict any sequence in which multiple variables depend on each other.
○ NER is a problem of identifying the entities from the text and classifying the entities
into a person, location, organization and so on.
● Other applications include part recognition in images, gene prediction, and object detection.

Generative Models
● HMMs and stochastic grammars
● Assign a joint probability to paired observation and label sequences
● Parameters are trained to maximize joint likelihood of training examples

● Need to enumerate all possible observation sequences


● To ensure tractability of the inference problem, must make strong independence assumptions (i.e.,
conditional independence given the labels)

Conditional models
● Specify probabilities of label sequences given an observation sequence
● Does not expend modelling effort on the observations which are fixed at test time
● Conditional probability can be dependent on arbitrary, non-independent features of the
observation sequence

Example: MEMMs
● Maximum entropy Markov models
● Each source state has an exponential model that takes the observation feature as input and
outputs a distribution over possible next states
● Weakness: the label bias problem (the transition probabilities leaving a given state are
normalized only over that state).
● The big difference between MEMMs and CRFs is that an MEMM is locally normalized and
suffers from the label bias problem, while CRFs are globally normalized. So CRFs avoid
the label bias problem.

Label Bias Problem


● Per-state normalization of transition scores implies “conservation of score mass”
● Bias towards states with fewer outgoing transitions
● State with single outgoing transition effectively ignores observation

Solving Label Bias


● Collapse states, and delay branching until get a discriminating observation
○ Not always possible or may lead to combinatorial explosion

● Start with fully-connected model and let training procedure figure out a good structure
○ Precludes use of prior structure knowledge
● CRFs use a globally normalized approach. The key difference is that CRFs compute all state
transitions jointly in a single model.
● The weights of different features in different states compete against each other, ensuring a
more balanced distribution.
● CRFs consider the entire sequence of labels, not just local context, which mitigates label bias.

Conditional Random Fields


● Undirected graph (random field)
● Construct conditional model p(Y|X)
● Does not explicitly model the marginal p(X) (the unconditional/prior probability)
● Assumption: graph is fixed

CRF - Formulation

CRF Formulation for POS Tagging


This will involve feature extraction, model training, and inference
using the CRF model.
1. Feature Extraction
2. Model Training
3. Inference

3. CRF Model: The conditional probability of the tag sequence y given the word sequence x is given by:
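The formula image is not reproduced in these notes; for reference, a standard linear-chain CRF (notation assumed here, consistent with the description above) is:

P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

where the f_k are feature functions over adjacent tags and the input sentence, the λ_k are learned weights, and Z(x) is the partition function that normalizes over all possible tag sequences (this global normalization is what avoids the label bias problem discussed earlier).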

Training:
Objective: Train the CRF model to maximize the likelihood of the training data. This
involves optimizing the weights to best fit the observed data.
Testing:
Given a sequence of words x, find the most likely sequence of tags y.
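A minimal end-to-end sketch of these three steps using the third-party sklearn-crfsuite package (the package choice, the tiny training set and the feature names are assumptions for illustration; the notes do not prescribe an implementation):

import sklearn_crfsuite

# 1. Feature extraction: each token becomes a dictionary of simple features.
train_sents = [[("My", "PRP$"), ("dog", "NN"), ("likes", "VBZ"), ("his", "PRP$"), ("food", "NN")]]

def word_features(sent, i):
    word = sent[i][0]
    return {
        "word.lower": word.lower(),
        "suffix3": word[-3:],
        "is_capitalized": word[0].isupper(),
        "prev_word": sent[i - 1][0].lower() if i > 0 else "<BOS>",
    }

X_train = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
y_train = [[tag for _, tag in s] for s in train_sents]

# 2. Model training: maximize the (regularized) conditional likelihood.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)

# 3. Inference: decode the most likely tag sequence for new sentences.
print(crf.predict(X_train))

In a real experiment the training data would be a full tagged corpus, and the feature set would include the orthographic features mentioned in the experiments below (capitalization, suffixes such as -ing and -ly).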

Problem Description
Given x (observations), find y (predictions)
For example,

● The relational connection occurs in many applications, NLP, Computer Vision, Signal
Processing, ….

● Traditionally in graphical models,
○ modeling the joint distribution p(y, x) can lead to difficulties:
○ rich local features occur in relational data,
○ features may have complex dependencies,
○ so constructing a probability distribution over them is difficult.

● Solution: directly model the conditional p(y|x),
○ which is sufficient for classification!

● A CRF is simply a conditional distribution p(y|x) with an associated graphical structure.

Discriminative Vs. Generative

Generative Model: A model that generates observed data randomly.

Naïve Bayes: once the class label is known, all the features are independent.

Discriminative: Directly estimate the posterior probability; aim at modeling the "discrimination"
between different outputs.

MaxEnt classifier: linear combination of feature functions in the exponent.

Both generative models and discriminative models describe distributions over (y , x), but they work in
different directions.

Generative Vs Discriminative
CRFs vs. MEMM vs. HMM

CRFs: Example Features

● Corresponding parameters λ and μ similar to the (logarithms of the) HMM parameters p(y’|y)
and p(x|y)

CRFs: Parameter Estimation


● Maximize log-likelihood objective function

● The original CRF paper (Lafferty et al., 2001) uses iterative scaling to find the optimal parameter vector

Experiment :
Part-of-speech Tagging
● Each word to be labeled with one of 45 syntactic tags.
● 50%-50% train-test split
● out-of-vocabulary (oov) words: not observed in the training set
● Second set of experiments: add small set of orthographic features (whether word is
capitalized, whether word ends in –ing, -ogy, -ed, -s, -ly …)
● Overall error rate reduced by 25% and oov error reduced by around 50%
● Usually start training with zero parameter vector (corresponds to uniform distribution)
● Use optimal MEMM parameter vector as starting point for training corresponding CRF
● MEMM+ trained to convergence in around 100 iterations; CRF+ took additional 1,000
iterations
● When starting from uniform distribution, CRF+ had not converged after 2,000 iterations
Further Aspects of CRFs
● Automatic feature selection
○ Start from feature-generating rules and evaluate the benefit of the generated features
automatically on data

Conclusions
● CRFs do not suffer from the label bias problem!
● Parameter estimation guaranteed to find the global optimum
● Limitation: Slow convergence of the training algorithm

Dependency Parsing
● Dependency parsing is a broader term for analysing a sentence's grammatical structure by
establishing relationships between words. In a dependency parse, each word in the sentence is
linked to other words, indicating how they relate to each other in terms of syntax and
grammar.
● Purpose: To determine the syntactic structure of a sentence by identifying dependencies
between words.
● Output: A dependency tree or graph where nodes represent words and edges represent
grammatical relations (e.g., subject, object, modifier).
● Types: There are various methods for dependency parsing, including rule-based, statistical,
and neural network approaches.

Why dependency parsing is essential


1. Understanding Sentence Structure
a. Grammatical Relationships: Dependency parsing helps in understanding the
grammatical relationships between words in a sentence. It identifies which words
govern or modify other words, providing insights into how different parts of a
sentence fit together.
2. Improving Text Analysis Tasks
a. Named Entity Recognition (NER): Dependency parsing can enhance NER by
providing context about the relationships between entities in a sentence, leading to
more accurate identification and classification of named entities.
b. Sentiment Analysis: By understanding the syntactic structure, dependency parsing
can improve sentiment analysis by helping to determine how different parts of a
sentence contribute to the overall sentiment.
3. Enhancing Machine Translation
● Contextual Understanding: In machine translation, understanding the syntactic
structure helps in translating sentences more accurately by preserving grammatical
relationships between words in different languages.
4. Information Extraction
● Relationship Extraction: Dependency parsing is used to extract relationships between
entities in a text. For example, it can identify relationships between people, places,
and organizations in news articles.
5. Question Answering Systems
● Understanding Queries: Dependency parsing helps in understanding the syntactic
structure of questions and queries, improving the accuracy of answers provided by
question-answering systems.
6. Text Generation
● Grammar Consistency: For text generation tasks, dependency parsing ensures that the
generated text maintains grammatical consistency and adheres to the correct syntactic
structure.
7. Syntax-Based Text Summarization
● Effective Summarization: Dependency parsing helps in generating summaries by
identifying the core components of a text and their relationships, leading to more
coherent and meaningful summaries.
8. Speech Recognition
● Contextual Understanding: In speech recognition systems, understanding the
grammatical relationships between words can help improve the accuracy of
transcriptions by providing context.

Dependency relations

Each dependency relationship is labelled with a specific tag that describes the nature of the
relationship. These labels are essential for understanding the syntactic structure of sentences
and can be accessed through the token.dep_ attribute in spaCy.

Dependency relations – Universal Dependency Relation


Constituency vs Dependency structure

Dependency structure

● Consists of relations between lexical items, normally binary, asymmetric relations (“arrows”)
called dependencies
● The arrows are commonly typed with the name of grammatical relations (subject,
prepositional object, apposition, etc)
● The arrow connects a head (governor) and a dependent (modifier)
● Usually, dependencies form a tree (single-head, connected, acyclic)

Advantages of dependency structure


● More suitable for free word order languages
Dependency parsing

1. Understanding Dependency Parsing


● Dependencies: In dependency parsing, dependencies represent the relationships
between words in a sentence.
● For example, in the sentence "She eats an apple," "eats" is the main verb and "She" is
the subject, while "apple" is the object of the verb "eats."
● Dependency Tree: A dependency tree visualizes these relationships. Each word is a
node, and directed edges indicate dependencies between words.
2. Steps to Find Dependencies
● Tokenization : Tokenize the Sentence: Break down the sentence into individual words
or tokens. For example, "She eats an apple" becomes ["She", "eats", "an", "apple"].
● POS Tagging: Part-of-Speech Tagging: Assign POS tags to each token. For "She eats
an apple," the tags might be ["PRP", "VBZ", "DT", "NN"].
3. Parsing: Parse the Sentence: Use a dependency parser to analyze the syntactic structure.
● There are several tools available for dependency parsing, such as:
○ spaCy: A popular library in Python for NLP that includes pre-trained models
for dependency parsing.
○ Stanford NLP: Provides models and tools for dependency parsing.
○ NLTK: Offers some functionality for parsing and analyzing syntactic
structure.
4. Extract Dependencies
● Identify Dependencies: After parsing, each word will have a head (the word it
depends on) and a dependency relation (the type of relationship). For example, in the
sentence "She eats an apple," the word "eats" is the root (its head is the root of the tree)
and it has the dependents:
"She" (subject of "eats")
"apple" (object of "eats")
Analyze the Output
Dependency Relations: Review the output from the parser to understand how words relate to each
other. The result is often represented as a tree or graph.

ADP – Adposition - Preposition

● For the sentence "The cat sat on the mat", the dependency relations might look like this:


| Token | Dependency Relation | Head | Head POS |
| The | det | cat | NOUN |
| cat | nsubj | sat | VERB |
| sat | ROOT | sat | VERB |
| on | prep | sat | VERB |
| the | det | mat | NOUN |
| mat | pobj | on | ADP |

● Explanation of Dependencies
● The (DT) -> cat (NN): "The" is a determiner (det) for "cat".
● cat (NN) -> sat (VBD): "cat" is the subject (nsubj) of "sat".
● sat (VBD) -> sat (VBD): "sat" is the root of the sentence (ROOT).
● on (IN) -> sat (VBD): "on" is a preposition (prep) modifying "sat".
● the (DT) -> mat (NN): "The" is a determiner (det) for "mat".
● mat (NN) -> on (IN): "mat" is the object of the preposition "on" (pobj).
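A short spaCy sketch (assuming the en_core_web_sm model is installed) that reproduces this kind of table:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

for token in doc:
    # token text, dependency label, head word, head POS
    print(f"{token.text:<5} {token.dep_:<6} {token.head.text:<5} {token.head.pos_}")

spaCy marks the root by making it its own head, which is why the head of "sat" is "sat" in the table above.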

In dependency parsing, the verb is often the root of a sentence because it serves as the central
element that anchors the sentence's structure. Here’s why the verb typically plays this central role:
1. Central to Sentence Structure
● Core Predicate: The verb functions as the core predicate of the sentence. It describes
the main action or state that the sentence is about. The other elements in the
sentence—such as subjects, objects, and complements—are organized around the
verb, making it the focal point of the sentence's meaning.
2. Syntactic Dependency
● Head of Clauses: In traditional syntactic theory, sentences are structured around a
head, and the verb is usually this head. All other elements in the sentence (like the
subject or object) depend on the verb. This makes the verb the "root" because it is the
central element upon which the grammatical structure of the sentence hinges.
3. Grammatical Relationships
● Subject-Verb Agreement: The verb often determines the agreement in number and
person with the subject. This relationship is crucial for understanding who is
performing the action and who is affected by it.
● Object and Complement Relationships: The verb also governs how objects and
complements are connected to the main action or state described by the sentence.
4. Dependency Tree Construction
● Root Node: In a dependency tree, the root node is the head of the main clause. For
simple sentences, this is usually the main verb. For example, in the sentence "She eats
an apple," "eats" is the verb and the root because it connects the subject ("She") and
the object ("an apple") through its dependencies.
Consider the sentence "She writes a letter":
1. Sentence Analysis:
"She": Subject (nsubj) of the verb.
"writes": Main verb (ROOT) of the sentence.
"a letter": Direct object (dobj) of the verb.
2. Dependency Parsing:
"writes" is the ROOT because it is the central action or state around which the sentence is
built.
"She" (subject) depends on "writes" because it describes who is performing the action.
"a letter" (object) depends on "writes" because it describes what is being written.

Visual Representation
● In a dependency tree, the verb "writes" would be the root node with branches extending to the
subject
"She" and the object "a letter":
        writes
       /      \
    She        letter
                 /
                a
● This tree structure illustrates that "writes" is central, and all other elements are related to it.

Special Cases -
● Complex Sentences: In complex sentences with multiple clauses, each clause might have its
own root verb. For instance, in "She writes a letter because she wants to inform him," "writes"
is the root of the main clause, while "wants" is the root of the subordinate clause.
● Imperative Sentences: For imperative sentences (e.g., "Close the door"), the verb is still the
root even though the subject may be implied (often "you").
● In summary, the verb is typically the root in dependency parsing because it is the main action
or state around which the sentence is organized. It serves as the focal point of the sentence’s
structure, and all other elements are linked to it through various grammatical dependencies.

Transition-based parsing
● Transition-based dependency parsing is a popular approach for dependency parsing that uses
a sequence of transitions to build a dependency parse tree incrementally. It is particularly
known for its efficiency and effectiveness in parsing sentences.
● Transition-based parsing involves processing a sentence word-by-word and applying a
sequence of actions (transitions) to construct the dependency tree. The key idea is to
transform a data structure (usually a stack and a buffer) step-by-step until the entire sentence
is parsed.

Key Components
● Data Structures:
○ Stack: Holds partially parsed words or constituents.
○ Buffer: Contains the remaining words yet to be processed.
○ Parse Tree: The final structure that will be built by applying transitions.
● Transitions: Transition-based parsers use a predefined set of actions to modify the stack,
buffer, and dependency structure. Common transitions include:
○ Shift: Move the next word from the buffer to the stack.
○ Left Arc (or Left Attachment): Create a dependency from the top word on the stack (the head) to
the word below it (the dependent), then remove the dependent (the word below the top) from the stack.
○ Right Arc (or Right Attachment): Create a dependency from the word below the top word on the
stack (the head) to the top word (the dependent), then remove the top word from the stack.
● Transition Sequence: The goal is to find a sequence of transitions that correctly builds the
dependency structure for the entire sentence. This is typically formulated as a search problem,
where the parser explores possible sequences of transitions to achieve the desired structure.

Transition-based parsing algorithm - Steps

Example 1: Transition-based dependency parsing

Example 2: Transition-based dependency parsing


Consider the sentence: “John eats an apple.”
The transition-based parser might process it as follows:
1. Initial State:
Stack: []
Buffer: [John, eats, an, apple]
2. Shift: Move “John” to the stack.
Stack: [John]
Buffer: [eats, an, apple]
3. Shift: Move “eats” to the stack.
Stack: [John, eats]
Buffer: [an, apple]
4. Left Arc: Create a dependency from "eats" to "John" ("John" is the subject of "eats") and remove "John".
Stack: [eats]
Buffer: [an, apple]
Dependencies: {eats → John}
5. Shift: Move "an" to the stack.
Stack: [eats, an]
Buffer: [apple]
6. Shift: Move "apple" to the stack.
Stack: [eats, an, apple]
Buffer: []
7. Left Arc: Create a dependency from "apple" to "an" ("an" is the determiner of "apple") and remove "an".
Stack: [eats, apple]
Dependencies: {eats → John, apple → an}
8. Right Arc: Create a dependency from "eats" to "apple" ("apple" is the object of "eats") and remove "apple".
Stack: [eats]
Dependencies: {eats → John, apple → an, eats → apple}
9. Right Arc (root attachment): Create a dependency from the root to "eats" and remove "eats".
Stack: []
Dependencies: {eats → John, apple → an, eats → apple, root → eats}
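A minimal, self-contained sketch of this style of transition-based parsing. A dummy ROOT token is pushed first so the final root attachment is also a Right Arc, and the oracle action sequence is hard-coded for this one sentence; in a real parser each action would be predicted by a trained classifier from features of the stack and buffer.

def parse(words, actions):
    stack, buffer, deps = [], list(words), []
    for action in actions:
        if action == "SHIFT":            # move the next word from the buffer to the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC":       # head = top of stack, dependent = word below it
            dep, head = stack[-2], stack[-1]
            deps.append((head, dep))
            del stack[-2]                # remove the dependent
        elif action == "RIGHT_ARC":      # head = word below the top, dependent = top of stack
            head, dep = stack[-2], stack[-1]
            deps.append((head, dep))
            del stack[-1]                # remove the dependent
    return deps

words = ["ROOT", "John", "eats", "an", "apple"]
actions = ["SHIFT", "SHIFT", "SHIFT", "LEFT_ARC",   # eats -> John (subject)
           "SHIFT", "SHIFT", "LEFT_ARC",            # apple -> an (determiner)
           "RIGHT_ARC",                             # eats -> apple (object)
           "RIGHT_ARC"]                             # ROOT -> eats (root)

print(parse(words, actions))
# [('eats', 'John'), ('apple', 'an'), ('eats', 'apple'), ('ROOT', 'eats')]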

Projectivity
• Arc from head to dependent is projective
• If there is a path from head to every word between head and dependent
• Dependency tree is projective
• If all arcs are projective
• Or equivalently, if it can be drawn with no crossing edges
• Projective trees make computation easier
• But most theoretical frameworks do not assume projectivity
• Need to capture long-distance dependencies, free word order

Advantages
• Efficiency: Transition-based parsers are typically efficient and can handle large datasets due to their
linear time complexity in practice.
• Simplicity: The transition-based approach is conceptually simple and easy to implement.
• Flexibility: Can be adapted to different types of dependency grammars and can be enhanced with
various features (e.g., part-of-speech tags, word embeddings) to improve performance.

Semantics

What does "semantic" mean?


Semantic:
Of or pertaining to meaning, esp. meaning in language
Of, relating to, or according to the science of semantics
Semantics:
Linguistics: The study or science of meaning in language forms, esp. with regard to its historical
change.
Logic: The study of relationships between signs and symbols and what they represent.

How do you do semantics?

● Generalization – occurs when one specific experience represents a whole class of experiences
- organizing concepts by kind
● Aggregation
- Aggregating complexes into simpler concepts
● Common Properties
- Relationships (connecting properties)
- Attributes (flat properties)
● Naming Conventions
- Terms / Phrases
- Language

What else do semantics provide?

● Contextual Meaning
● Inferred Relationships
● Causality (causal relations between words)
● Granularity - Breaking down an event into smaller parts

The Problem of Semantic Ambiguity

Did you say you were looking for mixed nuts? (context can be food or hardware)
People use context to derive the correct meaning.

Definition

Semantic processing deals with rewriting a parse tree into a "meaning representation"
(logic, SQL, a knowledge base).
It is poorly understood compared to syntax: applications that need complex semantics, like database
front ends or high-quality machine translation, had limited success in the past.
How can we represent the meaning of an English sentence?
Programming languages: "meaning" is the equivalent machine code
a = b + c means
load a, b, c
add b, c
store a

Meanings of words

Lexemes - basic unit of meaning in the lexicon


Examples:
- Red, n: the color of blood or a ruby
- Blood, n: the red liquid that circulates in the heart, arteries and veins of animals
- Right, adj: located nearer the right hand esp. being on the right when facing the same
direction as the observer
Do dictionaries give us definitions?

Three Perspectives on Meaning

1. Lexical Semantics
The meanings of individual words
2. Formal Semantics (or Compositional Semantics or Sentential Semantics)
How those meanings combine to make meanings for
individual sentences or utterances (manner of speaking; power of speaking)
3. Discourse or Pragmatics
How those meanings combine with each other and with other facts about various kinds of context to
make meanings for a text or discourse
Dialog or Conversation is often lumped together with Discourse

Relations among words


• Homonymy (words with identical forms but different meanings)
– Example: "pair" (a couple) vs. "pear" (a type of fruit).
– Other examples: be/bee, wood/would
• Homophones: (sum, some), (son, sun)
• Homographs: lead (to go first with followers behind) vs. lead (a type of metal)
• Applications: spelling correction, speech recognition, text-to-speech

Relationships between word meanings


• Homonymy
• Polysemy – a word having several related meanings.
• Synonymy
• Antonymy
• Hypernymy – a word with a broad meaning constituting a category into which words with more
specific meanings fall. For example, colour is a hypernym of red.
• Hyponymy – the relationship of a more specific term to a more general one. E.g., red is a hyponym
of colour; car is a hyponym of vehicle.
• Meronymy – a meronym refers to a part of a whole. Engines have parts: carburetor, headlights, ...;
'finger' is a meronym of 'hand'.

Homonymy
• Homonymy:
– Lexemes that share a form
• Phonological, orthographic or both
– But have unrelated, distinct meanings
– Can be homophones, homographs, or both:
• Homophones:
– Write and right
– Piece and peace

Homonymy causes problems for NLP applications


• Text-to-Speech
– Same orthographic form but different phonological form
• bass vs bass
• Bass, when pronounced as base, means the lowest pitch, an
adult male who sings in the lowest pitch, any number of
instruments that play the lowest pitch.
• Information retrieval
– Different meanings same orthographic form
• QUERY: bat care
• Basic bat care. Any bat that is found on the ground, or in an
exposed area, especially during the day, is likely to need help.
• Machine Translation
• Speech recognition

Polysemy
• Polysemy is the property of a word having several meanings.
• The bank is constructed from red brick
I withdrew the money from the bank
• Are those the same sense?
• He left the bank five minutes ago.
• He caught a fish at the bank.
• A single lexeme with multiple related meanings
(bank the building, bank the financial institution)
• Most non-rare words have multiple meanings
– The number of meanings is related to its frequency
– Verbs tend more to polysemy
– Distinguishing polysemy from homonymy isn’t always
easy (or necessary)
Metaphor and Metonymy
• Metaphor and metonymy are similar in various aspects but the major difference is that if a metaphor
substitutes a concept with another, a metonymy selects a related term. So, if metaphor is for
substitution, metonymy is for association.
• For example, the sentence 'he has a heart of stone' is a metaphor. Here the word heart is used as a
substitution to convey the unyielding or unfeeling character of the person.
• The sentence ‘the tiger called his employees to the meeting room’ is a metonymy. Here there is no
substitution; instead the person is associated with a tiger for his nature.
• Metaphor and Metonymy are specific Types of
Polysemy.
• Metaphor: A metaphor is a figure of speech containing an implied comparison. With metaphors,
words or phrases that are ordinarily applied to one thing are applied to something you wouldn't
necessarily pair them with. Here's a metaphor example:
“The curtain of night fell upon us.”
• Her eyes were like diamonds
• Metonymy (Metonymy is a figure of speech that replaces the name of a thing with the name of
something else with which it is closely associated).
A famous example is, "The pen is mightier than the sword," from Edward Bulwer Lytton's play
Richelieu.
This sentence has two metonyms:
"Pen" stands for "the written word."
"Sword" stands for "military aggression."

Synonyms
• Words that have the same meaning in some or all contexts.
– Look / see
– Happy / Joyful
– big / large
– automobile / car
– Beautiful / Pretty/Attractive
– Water / H2O
• Two lexemes are synonyms if they can be successfully substituted
for each other in all situations
– If so they have the same propositional meaning
• But there are few (or no) examples of
perfect synonymy.
– Why should that be?
– Even if many aspects of meaning are identical
– Still may not preserve the acceptability based
on notions of politeness, slang, register,
genre, etc.
• Example:
– Water and H2O
Synonymy
• Principle of substitutability
• How big is this plane?
• Would I be flying on a large or small plane?
• Miss Nelson, for instance, became a kind of big sister to Mrs. Van Tassel’s son, Benjamin.
• ?? Miss Nelson, for instance, became a kind of large sister to Mrs. Van Tassel’s son, Benjamin.

Synonymy is a relation between senses rather than words


• Consider the words big and large
• Are they synonyms?
– How big is that plane?
– Would I be flying on a large or small plane?
• How about here:
– Miss Nelson, for instance, became a kind of big sister to Benjamin.
– ?Miss Nelson, for instance, became a kind of large sister to Benjamin.
• Why?
– big has a sense that means being older, or grown up
– large lacks this sense

Antonyms
• Senses that are opposites with respect to
one feature of their meaning
• Otherwise, they are very similar!
– dark / light
– short / long
– hot / cold
– up / down
– in / out
• More formally: antonyms can
– define a binary opposition or sit at opposite ends of a scale

Hyponymy
• A term that denotes a subcategory of a more general class:
“Chair” and “table” are hyponyms of “furniture.”
• One sense is a hyponym of another if the first sense is
more specific, denoting a subclass of the other
– car is a hyponym of vehicle
– dog is a hyponym of animal
– mango is a hyponym of fruit
• Conversely
– vehicle is a hypernym/superordinate of car
– animal is a hypernym of dog
– fruit is a hypernym of mango
superordinate: vehicle | fruit | furniture | mammal
hyponym:       car     | mango | chair     | dog

Hypernymy more formally


• A hypernym describes a more broad term
• Extensional:
– The class denoted by the superordinate
– extensionally includes the class denoted by
the hyponym
• Entailment:
– A sense A is a hyponym of sense B if being
an A entails being a B
• Hyponymy is usually transitive
– (A hypo B and B hypo C entails A hypo C)

EXAMPLE OF HYPONYMS AND HYPERONYMS



Semantic ambiguity
• A sentence may have a single syntactic structure, but
multiple semantic structures
• “The park is near the bank.”
• Ambiguous between a financial institution and the side of a river.
• “He heard her voice through the door.”
• Unclear whether “through the door” refers to hearing from inside or outside.
Semantics
Mary untangled a library.
• What’s wrong with this sentence?
• It is syntactically correct, but semantically weird
• since "untangled" means to free something from a tangled or twisted state.
Pragmatics
• The context disambiguates the
meaning.
• This is a complicated cognitive feat!
Mary was hired as the new library director.
She had a hard job because the library was a mess.
Mary untangled a library.
Discourse
• The prior discourse provides references for
the pronouns.
• In general, temporal sequences require
discourse knowledge to be understood.
Mary was hired as the new library director.
Mary had a hard job because the library was a mess.
She untangled it.
World Knowledge
• World knowledge allows you to understand what this conversation is about.
• It includes knowledge about jobs, office politics, the structure of argument, and more.
Speaker A:
Mary was hired as the new library director.
She had a hard job because the library was a mess.
Mary untangled a library.
Speaker B:
It doesn’t matter.
She’s still fired.

Semantic Analysis
To understand the meaning of a sentence, there are three types of semantic analysis that we can use:
1. We can determine the Semantic Features of key words in the sentence.
2. We can analyze the Lexical Relations of these words in the sentence.
3. We can analyze the Semantic Roles (Thematic Roles / Theta Roles) of key words in the sentence
(i.e. the roles they fulfill in the situation described in the sentence).

List of Basic Thematic Roles


• AGENT: the initiator of some action, capable of acting with volition.
– Jack ate the beans.
• PATIENT: the entity undergoing the effect of some action, often
undergoing some change of state.
– Sue mowed the lawn.
• THEME: the entity which is moved by an action, or whose location is
described.
– Fred threw the rock.
• EXPERIENCER: the entity which is aware of the action or state
described by the predicate but which is not in control of the action or
state.
– Kim saw the deer.
• BENEFICIARY: the entity for whose benefit the
action was performed.
– Mary studied hard for her mother.
• INSTRUMENT: the means by which an action is
performed or something comes about.
– Fred opened the lock with a paper clip.
• LOCATION: the place in which something is situated
or takes place.
– The picture hangs above the fireplace.
• GOAL: the entity towards which something moves,
either literally or metaphorically.
– Lee walked to school.
• SOURCE: the entity from which something moves,
either literally or metaphorically.
– Sue ran from the policeman.
• ACTOR: the entity which performs, effects,
instigates, or controls the situation denoted by the
predicate (supertype of AGENT):
– The bus hit a pedestrian.
• RECIPIENT: a subtype of GOAL involved in actions
describing changes of possession.
– Bill sold the car to Mary
• PERCEPT/STIMULUS: the entity which is perceived
or experienced.
– Mary fears thunder.

Semantic Roles (Thematic Roles)


Agent = the animate entity that performs the action
Theme /Patient= the entity that undergoes (or receives) the
action.

Basim kicked the ball

Although agents are typically human, they can also be non-human


forces, machines or creatures.

The dog caught the ball

The ball was kicked by Basim.



Word Sense Disambiguation

Word Sense Disambiguation (WSD) is a crucial task in natural language processing (NLP) that
involves determining which meaning of a word is being used in a particular context. Given that
many words have multiple meanings (polysemy) or can refer to different concepts (homonymy), WSD
aims to resolve this ambiguity to improve understanding and processing of text.
Applications:
WSD is critical for various NLP applications, including:
• Information Retrieval: Improving search engine accuracy by
understanding user queries.
• Machine Translation: Ensuring correct translation of words based
on their intended meanings.
• Text Analysis: Enhancing sentiment analysis and summarization
tasks by accurately interpreting text.
In one sentence, the word bank may refer to a commercial (finance) bank, while in another it refers to
the bank of a river. The ambiguity that arises due to this is tough for a machine to detect and resolve.
Detecting the ambiguity is the first issue; resolving it and displaying the correct output is the second.
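A minimal WSD sketch using NLTK's implementation of the simplified Lesk algorithm, one classical approach to this problem (the example sentences are illustrative, and the WordNet corpus must be downloaded first with nltk.download('wordnet')):

from nltk.wsd import lesk

sent1 = "I deposited the cheque at the bank".split()
sent2 = "They sat on the bank of the river and fished".split()

# Lesk picks the WordNet sense whose dictionary gloss overlaps most with the context words.
print(lesk(sent1, "bank", pos="n"))
print(lesk(sent2, "bank", pos="n"))

Ideally the two calls return different Synsets (a financial-institution sense vs. a sloping-land sense); gloss-overlap methods are heuristic, though, so the chosen sense depends on the glosses and the context words supplied.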

WordNet Dictionary is used


WordNet

Major Lexical Relations


• Homonymy
• Polysemy – a word having several related meanings.
• Synonymy
• Antonymy
• Hypernymy – a word with a broad meaning constituting a category into which words with more
specific meanings fall. For example, colour is a hypernym of red.
• Hyponymy – the relationship of a more specific term to a more general one. E.g., red is a hyponym
of colour; car is a hyponym of vehicle.
• Meronymy – a meronym refers to a part of a whole. Engines have parts: carburetor, headlights;
'finger' is a meronym of 'hand'.
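These relations can be explored programmatically through NLTK's WordNet interface (a small sketch; requires nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")
print(car.hypernyms())           # broader terms (hypernyms), e.g. motor vehicle
print(car.hyponyms()[:3])        # more specific kinds of car (hyponyms)
print(car.part_meronyms()[:3])   # parts of a car (meronyms)

print(wn.synsets("bank")[:3])    # multiple senses of an ambiguous word
print(wn.synsets("happy"))       # synsets group near-synonymous senses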

Uses:
* Word sense disambiguation.
* Information retrieval.
* Automatic text classification.
* Automatic text summarization.
* Machine translation.
* Automatic crossword puzzle generation.
* Improve search engine results.

Semantic Role Labeling (SRL)

• Semantic Role Labeling is the process that assigns labels to words or phrases in a sentence that
indicates their semantic role in the sentence, such as that of an agent, goal.
• It is the task of automatically finding the thematic roles for each predicate in a sentence.
• It answers the who did what to whom, when, where, why, how and so on. Finding these relations is
preliminary to question answering and information extraction.
• For each clause, determine the semantic role played by each noun phrase that is an argument to the
verb.
(agent, patient, source, destination, instrument)
– John drove Mary from Austin to Dallas in his Toyota Prius.
– The hammer broke the window.
• Also referred to a “case role analysis,” “thematic analysis,” and “shallow semantic parsing”

Semantic Roles
• Origins in the linguistic notion of case (Fillmore,1968)
• A variety of semantic role labels have been
proposed, common ones are:
– Agent: Actor of an action
– Patient: Entity affected by the action
– Instrument: Tool used in performing action.
– Beneficiary: Entity for whom action is performed
– Source: Origin of the affected entity
– Destination: Destination of the affected entity

Use of Semantic Roles


• Semantic roles are useful for various tasks.
• Question Answering
– “Who” questions usually use Agents
– “What” question usually use Patients
– “How” and “with what” questions usually use Instruments
– “Where” questions frequently use Sources and Destinations.
– “For whom” questions usually use Beneficiaries
– “To whom” questions usually use Destinations
• Machine Translation Generation
– Semantic roles are usually expressed using particular, distinct
syntactic constructions in different languages.

SRL and Syntactic Cues


• Frequently semantic role is indicated by a particular syntactic position (e.g. object of a particular
preposition).
– Agent: subject
– Patient: direct object
– Instrument: object of “with” PP
– Beneficiary: object of “for” PP
– Source: object of “from” PP
– Destination: object of “to” PP
• However, these are preferences at best:
– The hammer hit the window.
– The book was given to Mary by John.
– John went to the movie with Mary.
– John bought the car for $21K.
– John went to work by bus.

Selectional Restrictions
• Selectional restrictions are constraints that certain verbs place on the filler of certain semantic roles.
• Sub-categorize the thematic roles:
– Agents should be animate
– Beneficiaries should be animate
– Instruments should be tools
– Patients of “eat” should be edible
– Sources and Destinations of “go” should be places.
– Sources and Destinations of “give” should be animate.
• Taxonomic abstraction hierarchies or ontologies (e.g.
hypernym links in WordNet) can be used to determine whether
such constraints are met (see the sketch after this list).
– “John” is a “Human” which is a “Mammal” which is a
“Vertebrate” which is an “Animate”
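A rough sketch of how such a check might look using WordNet hypernym links (approximating "animate" with the person.n.01 synset is an assumption for illustration, as are the example words):

from nltk.corpus import wordnet as wn

def is_a(word, ancestor):
    """Return True if some noun sense of `word` has `ancestor` in its hypernym closure."""
    target = wn.synset(ancestor)
    for sense in wn.synsets(word, pos="n"):
        if target in sense.closure(lambda s: s.hypernyms()):
            return True
    return False

print(is_a("chef", "person.n.01"))    # True  -> acceptable as an animate Agent
print(is_a("knife", "person.n.01"))   # False -> better analysed as an Instrument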

Use of Selectional Restrictions


• Selectional restrictions can help rule in or out certain semantic role assignments.
– “John bought the car for $21K”
• Beneficiaries should be Animate
• Instrument of a “buy” should be Money
– “John went to the movie with Mary”
• Instrument should be Inanimate
“The chef cut the vegetables with a knife.”
Predicate: cut
Agent: The chef (the one performing the action)
Theme: the vegetables (the entity affected by the action)
Instrument: a knife (the tool used to perform the action)
– “John drove Mary to school in the van”
“John drove the van to work with Mary.”
• Instrument of a “drive” should be a Vehicle

Selectional Restrictions and Syntactic Ambiguity


• Many syntactic ambiguities like PP attachment can be resolved using selectional
restrictions.
– “John ate the spaghetti with meatballs.”
“John ate the spaghetti with chopsticks.”
• Instruments should be tools
• Patients of “eat” must be edible
– “John hit the man with a dog.”
“John hit the man with a hammer.”
• Instruments should be tools

Selectional Restrictions and Word Sense Disambiguation


• Word Sense Disambiguation (WSD) is the problem of determining which "sense" (meaning) of a
word is activated by the use of the word in a particular context
• Many lexical ambiguities can be resolved using selectional restrictions.
• Ambiguous nouns
– “John wrote it with a pen.”
• Instruments of “write” should be WritingImplements
– “The bat ate the bug.”
• Agents (particularly of “eat”) should be animate
• Patients of “eat” should be edible
• Agent: “The bat” (the one performing the action of eating)
• Patient: “the bug” (the entity affected by the action, being eaten)
• Ambiguous verbs
– “John fired the secretary.”
“John fired the rifle.”
• Patients of DischargeWeapon should be Weapons
• Patients of CeaseEmployment should be Human

Empirical Methods for SRL


• Difficult to acquire all of the selectional restrictions and taxonomic knowledge needed for SRL.
• Difficult to efficiently and effectively apply knowledge in an integrated fashion to simultaneously
determine correct parse trees, word senses, and semantic roles.
• Statistical/empirical methods can be used to automatically acquire and apply the knowledge needed
for effective and efficient SRL.

SRL as Sequence Labeling


• SRL can be treated as a sequence labeling
problem.
• For each verb, try to extract a value for each
of the possible semantic roles for that verb.
• Employ any of the standard sequence
labeling methods
– Token classification
– HMMs
– CRFs

SRL with Parse Trees


• Parse trees help identify semantic roles through exploiting syntactic clues like “the agent is usually
the subject of the verb”.
• Parse tree is needed to identify the true subject.
“The man by the store near the dog ate an apple.”
“The man” is the agent of “ate” not “the dog”.

• Assume that a syntactic parse is available.


• For each predicate (verb), label each node in the parse tree as either not-a-role or one of the possible
semantic roles.

SRL as Parse Node Classification


• Treat problem as classifying parse-tree nodes.
• Can use any machine-learning classification method.
• Critical issue is engineering the right set of features for the classifier to use.

Features for SRL


• Phrase type: The syntactic label of the candidate role filler (e.g. NP).
• Parse tree path: The path in the parse tree between the predicate and the candidate role filler.

Parse Tree Path Feature: Example 1

Parse Tree Path Feature: Example 2



Features for SRL


• Phrase type: The syntactic label of the candidate role filler (e.g. NP).
• Parse tree path: The path in the parse tree between the predicate and the candidate role filler.
• Position: Does candidate role filler precede or follow the predicate in the sentence?
• Voice: Is the predicate an active or passive verb?
• Head Word: What is the head word of the candidate role filler?
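As an illustration of the parse tree path feature, here is a small sketch that computes a path string such as VBD↑VP↑S↓NP from an NLTK constituency tree; the helper name and arrow notation are conventional illustrations, not a library API:

from nltk import Tree

def parse_tree_path(tree, pred_leaf_idx, arg_pos):
    """Path from the predicate's preterminal up to the lowest common ancestor,
    then down to the candidate constituent at tree position `arg_pos`."""
    pred_pos = tree.leaf_treeposition(pred_leaf_idx)[:-1]   # node just above the verb leaf
    lca = 0                                                 # length of the common prefix
    while (lca < min(len(pred_pos), len(arg_pos))
           and pred_pos[lca] == arg_pos[lca]):
        lca += 1
    up = [tree[pred_pos[:k]].label() for k in range(len(pred_pos), lca - 1, -1)]
    down = [tree[arg_pos[:k]].label() for k in range(lca + 1, len(arg_pos) + 1)]
    return "↑".join(up) + ("↓" + "↓".join(down) if down else "")

t = Tree.fromstring(
    "(S (NP (DT The) (NN chef)) (VP (VBD cut) (NP (DT the) (NNS vegetables))))")
print(parse_tree_path(t, 2, (0,)))     # VBD↑VP↑S↓NP  (path to the subject NP)
print(parse_tree_path(t, 2, (1, 1)))   # VBD↑VP↓NP    (path to the object NP)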

Head Word Feature Example


• There are standard syntactic rules for determining which word in a phrase is the head.

Complete SRL Example

Issues in Parse Node Classification


• Many other useful features have been proposed.
– If the parse-tree path goes through a PP, what is the preposition?
• Results may violate constraints like “an action has at most one agent”?
– Use some method to enforce constraints when making final decisions. i.e. determine the most likely
assignment of roles that also satisfies a set of known constraints.
• Due to errors in syntactic parsing, the parse tree is likely to be incorrect.
– Try multiple top-ranked parse trees and somehow combine results.
– Integrate syntactic parsing and SRL.

More Issues in Parse Node Classification


• Break labeling into two steps:
– First decide if node is an argument or not.
– If it is an argument, determine the type.

SRL Datasets
• FrameNet:
– Developed at Univ. of California at Berkeley
– Based on notion of Frames
• PropBank:
– Developed at Univ. of Pennsylvania
– Based on elaborating their Treebank
• Salsa:
– Developed at Universität des Saarlandes
– German version of FrameNet

FrameNet
• Project at UC Berkeley led by Chuck Fillmore for developing a
database of frames, general semantic concepts with an associated set of
roles.
• Roles are specific to frames, which are “invoked” by multiple words,
both verbs and nouns.
– JUDGEMENT frame
• Invoked by: V: blame, praise, admire; N: fault, admiration
• Roles: JUDGE, EVALUEE, and REASON
• Specific frames chosen, and then sentences that employed these frames
selected from the British National Corpus and annotated by linguists
for semantic roles.
• Initial version: 67 frames, 1,462 target words,
49,013 sentences, 99,232 role fillers
• Frames are an artificial intelligence data structure used to divide
knowledge into substructures by representing "stereotyped situations".

FrameNet Results
• Gildea and Jurafsky (2002) performed SRL experiments with initial FrameNet data.
• Assumed correct frames were identified and the task was to fill their roles.
• Automatically produced syntactic analyses using Collins (1997) statistical parser.
• Used simple Bayesian method with smoothing to classify parse nodes.
• Achieved 80.4% correct role assignment.
Increased to 82.1% when frame-specific roles
were collapsed to 16 general thematic categories.

PropBank
• Project at U Penn led by Martha Palmer to add
semantic roles to the Penn treebank.
• Roles (Arg0 to ArgN) specific to each individual
verb to avoid having to agree on a universal set.
– Arg0 basically “agent”
– Arg1 basically “patient”
• Annotated over 1M words of Wall Street Journal
text with existing gold-standard parse trees.
• Statistics:
– 43,594 sentences, 99,265 propositions (verbs + roles)
– 3,324 unique verbs, 262,281 role assignments
CONLL SRL Shared Task
• CONLL (Conference on Computational Natural Language Learning) is the annual meeting for the
SIGNLL (Special Interest Group on Natural Language Learning) of ACL.
• Each year, CONLL has a “Shared Task” competition.
• PropBank semantic role labeling was used as the Shared Task for CONLL-04 and CONLL-05.
• In CONLL-05, 19 teams participated.

CONLL-05 Learning Approaches


• Maximum entropy (8 teams)
• SVM (7 teams)
• SNoW (1 team) (ensemble of enhanced Perceptrons)
• Decision Trees (1 team)
• AdaBoost (2 teams) (ensemble of decision trees)
• Nearest neighbor (2 teams)
• Tree CRF (1 team)
• Combination of approaches (2 teams)

CONLL Experimental Method


• Trained on 39,832 WSJ sentences
• Tested on 2,416 WSJ sentences
• Also tested on 426 Brown corpus sentences to test generalizing beyond financial news.
• Metrics:
– Precision: (# roles correctly assigned) / (# roles assigned)
– Recall: (# roles correctly assigned) / (total # of roles)
– F-measure: harmonic mean of precision and recall

Best Result from CONLL-05


• Univ. of Illinois system based on SNoW with global constraints enforced using Integer Linear
Programming.

Issues in SRL
• How to properly integrate syntactic parsing, WSD, and role assignment so they all aid each other.
• How can SRL be used to aid end-use applications:
– Question answering
– Machine Translation
– Text Mining
Word Embedding

● Word embedding is a technique in natural language processing (NLP) that represents words as
vectors in a continuous vector space. This allows for capturing semantic relationships and
similarities between words based on their context in large text corpora.
● Some word embedding models are Word2Vec (Google), GloVe (Stanford), and fastText
(Facebook).

● Word embedding is also called a distributed semantic model, distributed representation,
semantic vector space, or vector space model.
● The similar words can be grouped together. For example, fruits like apples, mango, and
banana should be placed close whereas books will be far away from these words.
● In a broader sense, word embedding will create the vector of fruits which will be placed far
away from the vector representation of books.

Importance - Word Embedding

● Semantic Understanding - semantic meaning of words


● Contextual Relationships - Embeddings can capture the context in which words appear.
● Dimensionality Reduction - Word embeddings reduce the dimensionality, making
computations more efficient and models faster.
● Transfer Learning: Pre-trained word embeddings can be used.
● Improved Performance: Using word embeddings often leads to better performance in various
NLP tasks such as text classification, sentiment analysis, and machine translation. They help
models generalize better by capturing nuanced relationships between words.

Applications

● NLP Tasks: Used in sentiment analysis, machine translation, information retrieval, and more.
● Transfer Learning: Pre-trained models can be fine-tuned for specific tasks, leveraging
knowledge from vast corpora.
● Word Embedding Limitations
○ Out-of-Vocabulary Words: Word2Vec does not handle words that were not present in
the training data well.
○ Lack of Contextual Awareness: It generates a single vector for each word regardless
of its context (e.g., "bank" in "river bank" vs. "financial bank").

Word2Vec

Word2Vec is a popular technique for creating word embeddings, developed by a team at Google led
by Tomas Mikolov. It represents words in a continuous vector space, allowing machines to understand
their meanings based on context. Here are the main components of Word2Vec:

Word2Vec - Architectures

● Continuous Bag of Words (CBOW):


Predicts the target word from the surrounding context words.
For example, given the context words "the," "sat," "on," it predicts "cat."
Generally faster and more suitable for smaller datasets.
● Skip-gram Model:
Predicts surrounding context words given a target word.
For example, given the target word "cat," it tries to predict words like "sat," "on," "the," etc.
Effective for capturing semantic relationships and works well with large datasets.
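A minimal usage sketch, assuming gensim (version 4 or later) is installed; the toy corpus and hyperparameters are illustrative only, and the sg flag switches between the two architectures:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # Skip-gram

print(cbow.wv["cat"][:5])                        # first 5 dimensions of the "cat" embedding
print(skipgram.wv.most_similar("cat", topn=3))   # nearest neighbours (not meaningful on a toy corpus)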

Architecture of the CBOW model

Architecture of Skip-gram Model



CBOW - Example

Step 7: Identify the Predicted Word


The predicted word is the one with the highest probability. In this case, if "king" has the highest
probability (approximately 0.276), it will be the predicted target word.
Summary of Steps
● Identify the Target: Target word "king" and context words "man" and "woman."
● Vector Representation: Assign 4D vectors to all words.
● Average Context: Calculate the average of context word vectors.
● Score Calculation: Compute dot products with weights for each word.
● Softmax: Convert scores to probabilities.
● Prediction: Identify the word with the highest probability.
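The same steps can be traced in a few lines of NumPy; the 4-dimensional vectors below are made up for illustration (they are not the numbers from the worked example), so with these untrained weights the highest-probability word need not be the true target:

import numpy as np

vocab = ["king", "queen", "man", "woman"]
W_in = np.array([[0.9, 0.8, 0.1, 0.2],    # "king"   (hypothetical input embeddings)
                 [0.8, 0.9, 0.2, 0.1],    # "queen"
                 [0.7, 0.1, 0.8, 0.1],    # "man"
                 [0.1, 0.7, 0.7, 0.2]])   # "woman"
W_out = W_in.copy()                       # output weights, tied to the input weights purely for brevity

context = ["man", "woman"]                                 # context words for the target "king"
h = W_in[[vocab.index(w) for w in context]].mean(axis=0)   # average the context vectors
scores = W_out @ h                                         # dot-product score for every vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()              # softmax turns scores into probabilities
print(dict(zip(vocab, probs.round(3))))
print("predicted:", vocab[int(np.argmax(probs))])          # highest-probability word is the prediction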

Summary:
● In CBOW, context words are chosen based on proximity to the target word, which is
determined by the window size.
● Including additional words like "queen" depends on expanding the context window.
● By adjusting the context, you can influence the prediction, capturing more semantic
relationships.

Skip-gram model - Ex

Identify the Predicted Context Words:

Threshold: You can set a threshold to consider only those context words whose probabilities exceed a
certain value.
Top-N Selection: Alternatively, you can select the top-N context words with the highest probabilities.
This is common when you expect multiple context words.
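Both selection strategies are one-liners once the softmax output is available; the vocabulary and probabilities below are made up for illustration:

import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
probs = np.array([0.30, 0.05, 0.28, 0.22, 0.15])   # hypothetical softmax output for a target word

top_n = 3
top_idx = np.argsort(probs)[::-1][:top_n]                    # top-N selection
print([vocab[i] for i in top_idx])                           # ['the', 'sat', 'on']

threshold = 0.20                                             # threshold-based selection
print([w for w, p in zip(vocab, probs) if p >= threshold])   # ['the', 'sat', 'on']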

Pragmatic & Discourse Analysis

Stages in NLP

Discourse Analysis (Before Pragmatic Analysis)


Objective: Understand how sentences and phrases connect for coherent meaning.
Focus on tasks like anaphora resolution and maintaining coherence.
Analyze how meaning is structured over larger sections of text.
Why It Comes First: Links different parts of the text to ensure meaning flow.
Establishes textual context before analyzing intent or context.
Example: "John went to the store. He bought milk." Discourse analysis links "He" to "John."

Pragmatic Analysis (After Discourse Analysis)

Objective: Understand meaning based on situational context, speaker intentions, and social factors.
Focus on implicature (implied meaning) and speech acts (e.g., requests, promises).
Analyze how language is used in different contexts.
Why It Comes After: Pragmatic analysis requires a higher level of understanding because it often
depends on resolving the structure and coherence of the discourse first. Once the text is analyzed for
structural and relational meaning (discourse), pragmatics can then interpret speaker intent and context
to understand deeper, implied meanings.
Interprets speaker intent and context to uncover deeper meanings.
Example: "Can you close the door?" interpreted as a request, not a literal question about ability.

Pragmatic analysis and semantic analysis


Pragmatic analysis and semantic analysis are two approaches to studying language that focus on
different aspects of meaning.
1. Semantic Analysis
Focus: Semantics deals with the literal meaning of words, phrases, and sentences. It is concerned with
meaning as it is encoded in the structure of the language itself, without considering the external
context in which language is used.

Semantic Analysis
Key Concepts:
Word Meaning (Lexical Semantics): The study of individual word meanings and how they relate to
one another (e.g., synonyms, antonyms).
Sentence Meaning (Compositional Semantics): How the meaning of individual words combines to
form the meaning of larger units, such as phrases and sentences.
Truth Conditions: Whether a sentence can be judged as true or false based on its semantic content.
Ambiguity: How a word or sentence can have multiple meanings depending on its linguistic
structure (e.g., "bank" can mean a financial institution or the side of a river).

Key Differences Between Pragmatics and Semantics:

Differences Between Pragmatics and Semantics:


• Examples to Highlight Differences:
• Sentence: "It’s raining."
• Semantic Analysis: This sentence means that water is falling from the sky. It
can be judged true or false based on whether rain is actually occurring.
• Pragmatic Analysis: In one context, “It’s raining” could be a simple weather
report. In another context, it could imply that someone should bring an
umbrella or postpone outdoor plans. Pragmatics looks at why the speaker is
saying this.

• Semantic analysis is concerned with the literal meaning of words and


sentences, focusing on truth conditions and how meaning is
constructed from linguistic structures.
• Pragmatic analysis goes beyond literal meaning and looks at context,
considering speaker intentions, social norms, and how meaning is
shaped by the interaction between speakers and their environment.

Pragmatic Analysis
• Pragmatic analysis examines how context influences the
interpretation of meaning in language use. It deals with the
relationship between language and its users.
• Context: A conversation between two people at a restaurant.
Speaker A: "Can you pass the salt?"
Speaker B: Passes the salt.
Literal meaning: Speaker A is asking if Speaker B is physically capable of passing
the salt (a yes/no question).
Pragmatic meaning: Speaker A is making a polite request, which Speaker B correctly
interprets by passing the salt.
Discourse Analysis
• Discourse analysis looks at language use across longer stretches of text (written or spoken) to
examine how meaning is constructed in context. It emphasizes the structure and patterns of
communication in social and cultural contexts.
Context: A political speech advocating for education reform.
Speaker: "Education is the backbone of our nation’s future.
Without strong schools, we risk everything. We must invest in teachers
and give every child a chance to succeed."

Discourse analysis
• Discourse analysis in NLP involves examining how sentences and utterances form coherent and
meaningful texts. This process is crucial for understanding and generating natural language, as it helps
models grasp the context and relationships between different parts of a text.

• Here are some key aspects of discourse analysis in NLP:

• Coherence: Ensuring that the text makes sense as a whole. Coherence involves
logical connections between sentences and ideas, making the text flow naturally.

• Coreference Resolution: Identifying when different expressions refer to the same
entity. For example, recognizing that “John” and “he” in a text refer to the same
person.

• Discourse Structure: Analyzing the organization of a text, including how it is


segmented into paragraphs or sections and how these segments relate to each
other.
• Topic Structure: Understanding the main topics discussed in a text and how they
shift throughout the discourse.
• Conversation Structure: In dialogues, this involves understanding the turn-taking
and the flow of conversation between participants.

Summary of Differences:
• Pragmatic Analysis
Example: Focuses on the implied meaning of a single sentence within a
conversation, emphasizing how context shapes interpretation (e.g., a polite
request in conversation).
• Discourse Analysis
Example: Looks at a larger text, examining how various elements (cohesion,
power, social context) work together to create an argument and convey
meaning over the course of a speech or extended discourse.

Anaphora resolution
• Anaphora resolution is a process in linguistics and natural language
processing (NLP) that involves identifying and linking anaphoric
expressions (like pronouns or noun phrases) to the correct
antecedents (the words or phrases they refer to) within a text. This is
essential for understanding the meaning of sentences and
maintaining coherence in both written and spoken discourse.

Anaphora resolution in Discourse Analysis:

• Anaphora resolution is generally considered a part of discourse


analysis, but it also overlaps with pragmatic analysis in some cases.

• Discourse analysis focuses on how meaning is constructed across


sentences, paragraphs, or longer texts. Anaphora resolution fits
within this because it deals with cohesion, which is crucial for
understanding how sentences are linked to one another.

• In discourse analysis, anaphora resolution is used to track referents


(e.g., who or what "he," "she," "it," or "they" refers to) across a text,
ensuring that the text flows coherently. For example, resolving the
reference of pronouns or noun phrases helps maintain coherence
within conversations, essays, stories, etc.

• Example: In the sentences "John went to the park. He sat on a


bench," discourse analysis would resolve that "he" refers to "John,"
establishing a continuous and meaningful link between the two
sentences.

Anaphora Resolution in Pragmatic Analysis:


• Pragmatic analysis focuses on how meaning is influenced by context,
including the relationship between speakers, their intentions, and the
situation.

• Anaphora resolution intersects with pragmatics when contextual


clues are needed to resolve ambiguous references. Sometimes, the
resolution of an anaphor (like a pronoun) depends not just on the text
but on shared knowledge or external context.

• Example: In the sentence "John told Mark that he should leave," the
antecedent for "he" (John or Mark) might depend on the
conversational context or the speaker’s intentions. Pragmatic analysis
would help by considering social or contextual cues outside the
immediate text.

Key Distinctions: Anaphora Resolution in Pragmatic Analysis and Discourse Analysis:

• Discourse Analysis: Focuses on text-level coherence and how anaphors


maintain connection between sentences in a larger discourse.

• Pragmatic Analysis: Focuses on how context beyond the text helps resolve
references, especially when there is ambiguity or the meaning relies on
shared knowledge or inference.

• Anaphora resolution primarily falls under discourse analysis because it


deals with maintaining cohesion within a text. However, it can involve
pragmatic analysis when the resolution depends heavily on external
context, shared knowledge, or speaker intention. Therefore, it can be seen
as bridging both areas depending on the complexity and context of the
language being analyzed.

Example: Anaphora resolution


• Sentence: John was hungry. He went to the kitchen.
• Anaphora: "He"
• Antecedent: "John"
• Anaphora Resolution: Linking "He" to "John" to understand that John is the
person being referred to.

Types of Anaphora:
• Pronominal Anaphora: The most common type, where pronouns like
he, she, it, they are used.
• Example: Sarah found her keys. She was relieved.
• Anaphora: "She"
• Antecedent: "Sarah
• Nominal Anaphora: When a noun phrase is repeated or replaced with
another noun phrase.
• Example: The dog was barking loudly. The animal seemed upset.
• Anaphora: "The animal"
• Antecedent: "The dog"
• Zero Anaphora: This occurs when a referent is implied but not
explicitly stated, common in languages like Japanese.
• Example: Atsui desu ne? (It's hot, isn't it?) Here, the subject ("the
weather") is not explicitly mentioned but understood through
context.

Importance of Anaphora Resolution:

• Coherence and Understanding: Resolving anaphora is crucial for


understanding the relationships between sentences and maintaining
the coherence of the text. Example: Mary saw Jane at the park. She
waved at her.

• Without anaphora resolution, it could be unclear who waved at whom.

• Natural Language Processing (NLP): Anaphora resolution is a key


component in tasks like machine translation, information retrieval,
question answering, and text summarization. Correctly identifying the
referent of a pronoun or noun phrase helps algorithms to make sense
of texts and interact with users in a meaningful way.

Challenges in Anaphora Resolution:


• Ambiguity: Sometimes it’s unclear which antecedent a pronoun
refers to, especially when there are multiple possible candidates.
• Example: John told Mark that he should leave. (Who should leave,
John or Mark?)

• Long-Distance Anaphora: When the antecedent is far from the


anaphor, it can be challenging to maintain the link.
• Example: In a long paragraph, "he" might refer back to someone
mentioned much earlier.

• Complex Sentences: In more complicated structures, resolving


anaphora can be more difficult, especially when multiple potential
antecedents are present or when ellipsis occurs (i.e., some parts of
the sentence are omitted).
• Example: The boys wanted to play soccer, but they couldn’t because
the field was closed.
• "They" refers to "the boys," but the resolution might be less clear in longer,
more complex discourse.

Coreference Phenomena: Linguistic Background


• Types of Referring Expressions
• Indefinite Noun Phrases
• Indefinite reference generally introduces into the discourse context entities that are new to
the hearer
• Ex: He had gone round one day to bring her some walnuts.

• Definite Noun Phrases


• Definite reference, such as via NPs that use the English article the, refers to an entity that
is identifiable to the hearer. An entity can be identifiable to the hearer because it has
been mentioned previously in the text and thus is already represented in the discourse
model.
• Ex: It concerns a white stallion which I have sold to an officer. But the pedigree of the
white stallion was not fully established.

• Pronouns:
• Another form of definite reference is pronominalization, used for entities that
are extremely salient in the discourse
• Emma smiled and chatted as cheerfully as she could
• Pronouns can also participate in cataphora, in which they are mentioned
before their referents are
• Even before she saw it, Dorothy had been thinking about the Emerald City
• Pronouns also appear in quantified contexts in which they are considered to
be bound, as in
• Every dancer brought her left arm forward
• Under the relevant reading, her does not refer to some woman in context, but instead
behaves like a variable bound to the quantified expression every dancer
Linguistic Properties of the Coreference Relation
• Number Agreement
• Referring expressions and their referents must generally
• agree in number; English she/her/he/him/his/it are singular,
we/us/they/them are plural, and you is unspecified for number.
• So a plural antecedent like the chefs cannot generally corefer with a singular
anaphor like she
• semantically plural entities can be referred to by either it or they:
• IBM announced a new machine translation product yesterday. They have been
working on it for 20 years.
• Person Agreement:
• English distinguishes between first, second, and third person, and a pronoun’s
antecedent must agree with the pronoun in person
• Thus a third person pronoun should have a third person antecedent
• Quotations can cause exceptions
• “I voted for Nader because he was most aligned with my values,” she said.
• Gender or Noun Class Agreement:
• Maryam has a theorem. She is exciting. (she=Maryam, not the theorem)
• Maryam has a theorem. It is exciting. (it=the theorem, not Maryam)
• Binding Theory Constraints:
• The binding theory is a name for syntactic constraints on the relations
between a mention and an antecedent in the same sentence
• Oversimplifying a bit, reflexive pronouns like himself and herself corefer with
the subject of the most immediate clause that contains them, whereas
nonreflexives cannot corefer with this subject
• Janet bought herself a bottle of fish sauce. [herself=Janet]
• Janet bought her a bottle of fish sauce. [her ≠ Janet]
• Grammatical Role:
• Entities mentioned in subject position are more salient than those in object position
• Billy Bones went to the bar with Jim Hawkins. He called for a glass of rum. [ he =
Billy ]
• Jim Hawkins went to the bar with Billy Bones. He called for a glass of rum. [ he = Jim ]
• Verb Semantics:
• Some verbs semantically emphasize one of their arguments, biasing the
interpretation of subsequent pronouns
• John telephoned Bill. He lost the laptop.
• John criticized Bill. He lost the laptop.
• Selectional Restrictions:
• the selectional restrictions that a verb places on its arguments can help
eliminate referents
• I ate the soup in my new bowl after cooking it for hours

Mention detection: not so simple


• Marking all pronouns, named entities, and noun phrases as mentions over-generates mentions

Methods for Anaphora Resolution (in NLP):

• Rule-Based Approaches: These methods rely on predefined linguistic rules, such as the proximity
between the anaphor and its antecedent, syntactic constraints, or grammatical agreements (e.g., gender
and number).

• Statistical/Probabilistic Models: These use large datasets to calculate the probability of various
antecedents being linked to an anaphor based on patterns found in the data.

• Machine Learning Approaches: Modern approaches use neural networks and deep learning to
automatically learn the patterns for anaphora resolution from large corpora, often outperforming
traditional rule-based or statistical methods.

Four kinds of coreference models


● Rule-based (pronominal anaphora resolution)
● Machine learning models:
○ Mention pair
○ Mention ranking
○ Clustering-based
Hobbs algorithm

● An algorithm that searches parse trees for antecedents of a pronoun


○ Starting at the NP node immediately dominating the pronoun
○ Search previous trees, in order of recency, left-to-right,
breadth-first
○ Looking for the first match of the correct gender and number
(male vs female, singular vs plural)

● 72.7% accuracy on the 100 examples Hobbs analyzed

• We start at the target pronoun in the tree, and crawl up the parse tree to the root node(S).
• It performs a breadth first left to right search of the node’s children to the left of the target for
each noun phrase or ‘S’ node it finds.
• The algorithm begins with the parse tree of sentence 2 and works its way up to the root node
S2.
• After that, it performs a breadth-first search to locate the noun phrase (NP). The algorithm
discovers its first noun phrase for the word ‘Jill’ here

Because Jill is the subject of “likes” in the same clause as the pronoun, the pronoun obviously cannot refer to Jill.
• Binding theory states that a reflexive can refer to the subject of the most immediate clause in which
it appears, whereas a nonreflexive cannot corefer with this subject. Words such as himself, herself,
themselves, etc. are known as reflexives.

• So, in our scenario, ‘him’ will not refer to Jill because of the binding theory limitation

• Also, Jill will not be the accepted referent for the pronoun ‘him’ due to the gender agreement
requirement.
• As a result, the algorithm now begins its search in the previous sentence’s syntax tree.

• The Hobbs distance property states that entities in subject position are more likely antecedents
for a pronoun than entities in object position.
• Following the Hobbs distance property, we perform a breadth first left to right search of
the node’s children for each noun phrase it discovers.
• As a result, the sentence’s subject Jack, who is an engineer, is explored before the object engineer,
and Jack is finally the resolved referent for the pronoun him.

The detailed algorithm is as follows


1. Begin at the noun phrase (NP) node immediately dominating the pronoun.
2. Go up the tree to the first NP or sentence (S) node encountered.
3. Call this node X, and call the path used to reach it p.
4. Traverse all branches below node X to the left of path p in a left-to-right, breadth-first fashion.
5. Propose as the antecedent any NP node that is encountered which has an NP or S node between it
and X.
6. If node X is the highest S node in the sentence, traverse the surface parse trees of previous
sentences in the text in order of recency, the most recent first; each tree is traversed in a left-to-right,
breadth-first manner, and when an NP node is encountered, it is proposed as antecedent.
7. If X is not the highest S node in the sentence, continue to step 5.
8. From node X, go up the tree to the first NP or S node encountered. Call this new node X, and call
the path traversed to reach it p.
9. If X is an NP node and if the path p to X did not pass through the Nominal node that X immediately
dominates, propose X as the antecedent.
10. Traverse all branches below node X to the left of path p in a left-to-right, breadth-first manner.
Propose any NP node encountered as the antecedent.
11. If X is an S node, traverse all branches of node X to the right of path p in a left-to-right,
breadth-first manner, but do not go below any NP or S node encountered. Propose any NP node
encountered as the antecedent.
12. Go to Step 4.
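A heavily simplified sketch of step 6 only (searching the parse trees of previous sentences breadth-first, left to right, for the first acceptable NP), assuming NLTK trees; gender and number agreement is abstracted into a caller-supplied filter, so this is an illustration rather than a full Hobbs implementation:

from collections import deque
from nltk import Tree

def nps_breadth_first(tree):
    """Yield NP subtrees in left-to-right, breadth-first order."""
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, Tree):
            if node.label() == "NP":
                yield node
            queue.extend(node)          # children, in left-to-right order

def search_previous_sentences(previous_trees, agrees):
    """Step 6 only: scan earlier sentences, most recent first, and propose the
    first NP that passes the gender/number filter `agrees`."""
    for tree in previous_trees:         # assumed to be ordered most recent first
        for np in nps_breadth_first(tree):
            if agrees(np):
                return np
    return None

jack = Tree.fromstring("(S (NP (NNP Jack)) (VP (VBZ is) (NP (DT an) (NN engineer))))")
antecedent = search_previous_sentences([jack], lambda np: "Jack" in np.leaves())
print(antecedent)                       # (NP (NNP Jack))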

The Mention-Pair Models


The mention-pair architecture is based around a classifier that is given a pair of mentions, a candidate
anaphor and a candidate antecedent, and makes a binary classification decision: coreferring or not

For each prior mention (Victoria Chen, Megabucks Banking, her, etc.), the binary classifier computes
a probability: whether or not the mention is the antecedent of she.
• We want this probability to be high for actual antecedents (Victoria Chen, her, the 38-year-old) and
low for non-antecedents (Megabucks Banking, her pay).
• Early classifiers used hand-built features; more recent classifiers use neural representation learning

• For training, we need a heuristic for selecting training samples; since most pairs of mentions in a
document are not coreferent, selecting every pair would lead to a massive overabundance of negative
samples.
• The most common heuristic, is to choose the closest antecedent as a positive example, and all pairs
in between as the negative examples.
• More formally, for each anaphor mention mi we create
• one positive instance (mi ;mj ) where mj is the closest antecedent to mi, and
• negative instance (mi ;mk ) for each mk between mj and mi

• Thus for the anaphor she, we would choose (she, her) as the positive example and no negative
examples.
• Similarly, for the anaphor the company we would choose
• (the company, Megabucks) as the positive example and
• (the company, she) (the company, the 38-year-old) (the company, her pay)
and (the company, her) as negative examples.
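A minimal sketch of this closest-antecedent sampling heuristic; mentions are plain strings in document order and the gold antecedent map is supplied by hand (a simplified version of the example above):

def mention_pair_training_examples(mentions, antecedent_of):
    """mentions: mention strings in document order.
    antecedent_of: dict mapping an anaphor index to the index of its closest gold antecedent."""
    positives, negatives = [], []
    for i, j in antecedent_of.items():
        positives.append((mentions[i], mentions[j]))        # closest antecedent -> positive pair
        for k in range(j + 1, i):                           # every mention strictly in between
            negatives.append((mentions[i], mentions[k]))    # -> negative pair
    return positives, negatives

mentions = ["Victoria Chen", "Megabucks Banking", "her", "she"]   # toy mention list
pos, neg = mention_pair_training_examples(mentions, {3: 2})       # "she" -> closest antecedent "her"
print(pos)   # [('she', 'her')]
print(neg)   # []  (no mentions between "her" and "she")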

• Once the classifier is trained, it is applied to each test sentence in a


clustering step.
• For each mention i in a document, the classifier considers each of the
prior i-1 mentions.
• In closest-first clustering, the classifier is run right to left (from
mention i-1 down to mention 1) and the first antecedent with
probability > .5 is linked to i. If no antecedent has probability > 0.5, no
antecedent is selected for i.
• In best-first clustering, the classifier is run on all i-1 antecedents and
the most probable preceding mention is chosen as the antecedent for
i.
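Minimal sketches of the two decoding strategies, assuming a trained pairwise scorer score(i, j) that returns the probability that mention j is an antecedent of mention i; applying the same 0.5 threshold in best-first decoding is an assumption:

def closest_first(mentions, score):
    """score(i, j): probability that mention j is an antecedent of mention i (j < i)."""
    links = {}
    for i in range(1, len(mentions)):
        for j in range(i - 1, -1, -1):        # scan candidates from right to left
            if score(i, j) > 0.5:
                links[i] = j                  # first antecedent above 0.5 wins
                break                         # otherwise i gets no antecedent
    return links

def best_first(mentions, score):
    links = {}
    for i in range(1, len(mentions)):
        j_best = max(range(i), key=lambda j: score(i, j))
        if score(i, j_best) > 0.5:            # threshold assumed, mirroring closest-first
            links[i] = j_best                 # most probable antecedent wins
    return links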

• Advantage
• Simplicity
• Disadvantages
• The classifier doesn’t directly compare candidate antecedents to each other,
so it’s not trained to decide, between two likely antecedents, which one is in
fact better.
• It ignores the discourse model, looking only at mentions, not entities. Each
classifier decision is made completely locally to the pair, without being able to
take into account other mentions of the same entity

The Mention-Ranking Architecture


• The mention ranking model directly compares candidate antecedents
to each other, choosing the highest-scoring antecedent for each
anaphor
• In early formulations, for mention i, the classifier decides which of the
{1,…,i-1} prior mentions is the antecedent. But suppose i is in fact not
anaphoric, and none of the antecedents should be chosen?
• Such a model would need to run a separate anaphoricity classifier on
i. Instead, it turns out to be better to jointly learn anaphoricity
detection and coreference together with a single loss

● Current candidate cluster merges depend on previous


ones it already made
○ We can’t use supervised learning.
○ Researchers used reinforcement learning to train
the model
○ Reward function is the change in a coreference
evaluation metric
● Some nice attempts, but NOT the dominating approach

Testing time:
• O(T^2) spans of text in a document
• O(T^4) possible pairs of spans — too expensive!
• Use sm(i) for aggressive pruning
Vector Semantics & Embeddings

Computational models of word meaning


Can we build a theory of how to represent word meaning, that accounts for at least some key aspects?
We'll introduce vector semantics
The standard model in language processing!
Handles many of our goals!

Ludwig Wittgenstein
PI #43:
"The meaning of a word is its use in the language"

Let's define words by their usages


One way to define "usage":
words are defined by their environments (the words around them)

Zellig Harris (1954):


If A and B have almost identical environments we say that they
are synonyms.

What does recent English borrowing ongchoi mean?


Suppose you see these sentences:
• Ong choi is delicious sautéed with garlic.
• Ong choi is superb over rice
• Ong choi leaves with salty sauces
And you've also seen these:
• …spinach sautéed with garlic over rice
• Chard stems and leaves are delicious
• Collard greens and other salty leafy greens
Conclusion:
◦ Ongchoi is a leafy green like spinach, chard, or collard greens
◦ We could conclude this based on words like "leaves" and "delicious" and "sauteed"

Idea 1: Defining meaning by linguistic distribution


Let's define the meaning of a word by its distribution in language use, meaning its
neighboring words or grammatical environments

Idea 2: Meaning as a point in space (Osgood et al. 1957)


3 affective dimensions for a word
◦ valence: pleasantness
◦ arousal: intensity of emotion
◦ dominance: the degree of control exerted

Hence the connotation of a word is a vector in 3-space

Defining meaning as a point in space based on distribution
◦ Each word = a vector (not just "good" or "w45")
◦ Similar words are "nearby in semantic space"
◦ We build this space automatically by seeing which words are nearby in text

We define meaning of a word as a vector


Called an "embedding" because it's embedded into a space (see textbook)

The standard way to represent meaning in NLP


Every modern NLP algorithm uses embeddings as the representation of word meaning
Fine-grained model of meaning for similarity

Intuition: why vectors?


Consider sentiment analysis:
◦ With words, a feature is a word identity
◦ Feature 5: 'The previous word was "terrible"'
◦ requires exact same word to be in training and test

◦ With embeddings:
◦ Feature is a word vector
◦ 'The previous word was vector [35,22,17…]
◦ Now in the test set we might see a similar vector [34,21,14]
◦ We can generalize to similar but unseen words!!!

We'll discuss 2 kinds of embeddings


tf-idf
◦ Information Retrieval workhorse!
◦ A common baseline model
◦ Sparse vectors
◦ Words are represented by (a simple function of) the counts of nearby
words
Word2vec
◦ Dense vectors
◦ Representation is created by training a classifier to predict whether a
word is likely to appear nearby
◦ Later we'll discuss extensions called contextual embeddings

From now on: Computing with meaning representations instead of string representations

Vectors are similar for the two comedies


But comedies are different than the other two
Comedies have more fools and wit and fewer battles.

Idea for word meaning: Words can be vectors too!!!

battle is "the kind of word that occurs in Julius Caesar and Henry V" fool is "the kind of word that
occurs in comedies, especially Twelfth Night"
Computing word similarity: Dot product and cosine

The dot product between two vectors is a scalar:

The dot product tends to be high when the two vectors have large values in the same dimensions
Dot product can thus be a useful similarity metric between vectors

Problem with raw dot-product


Dot product favors long vectors
Dot product is higher if a vector is longer (has higher
values in many dimensions)
Vector length:

Frequent words (of, the, you) have long vectors (since


they occur many times with other words).
So dot product overly favors frequent words

Alternative: cosine for computing word similarity

Cosine as a similarity metric



But since raw frequency values are non-negative, the cosine for term-term matrix vectors ranges from
0–1
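In symbols, cos(v, w) = (v · w) / (|v| |w|), where |v| = sqrt(Σ v_i²) is the vector length. A small NumPy sketch with hypothetical term-term counts (four context dimensions, standing in for the four plays) shows the high/low contrast described above:

import numpy as np

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

battle = np.array([1, 0, 7, 13])     # hypothetical counts, one dimension per play
fool   = np.array([36, 58, 1, 4])
wit    = np.array([20, 15, 2, 3])

print(round(cosine(fool, wit), 2))      # high: similar count profiles
print(round(cosine(fool, battle), 2))   # low: different profiles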
Word Embedding
• Word embedding is a technique in natural language processing (NLP) that represents words as
vectors in a continuous vector space. This allows for capturing semantic relationships and similarities
between words based on their context in large text corpora.
• Some word embedding models are Word2Vec (Google), GloVe (Stanford), and fastText (Facebook).
• Word embedding is also called a distributed semantic model, distributed representation, semantic
vector space, or vector space model.
• The similar words can be grouped together. For example, fruits like apples, mango, and banana
should be placed close whereas books will be far away from these words.
• In a broader sense, word embedding will create the vector of fruits which will be placed far away
from the vector representation of books.

fastText
• fastText introduces a pivotal shift by considering words as composed of character n-grams, enabling
it to build representations for words based on these subword units
• This approach allows the model to understand and generate embeddings for words not seen in the
training data, offering a substantial advantage in handling morphologically rich languages and
rare words.

Difference Between fastText and Word2Vec


• Handling of Out-of-Vocabulary (OOV) Words
• Word2Vec: Word2Vec operates at the word level, generating embeddings for
individual words. It struggles with out-of-vocabulary words as it cannot
represent words it hasn’t seen during training.

• fastText: In contrast, fastText introduces subword embeddings by considering


words to be composed of character n-grams. This enables it to handle out-of-vocabulary words
effectively by breaking terms into subword units and generating embeddings for these units, even
for unseen words. This capability makes fastText more robust in dealing with rare or
morphologically complex expressions.
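A rough illustration of the subword idea; real fastText uses n-grams of length 3 to 6 with word-boundary markers and also keeps the whole word as one extra unit, so this helper is a simplification:

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers '<' and '>' (simplified; real fastText
    also keeps the whole '<word>' token as one extra unit)."""
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

print(char_ngrams("where", 3, 4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']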

Representation of Words
• Word2Vec: Word2Vec generates word embeddings based solely on the words without considering
internal structure or morphological information.
• fastText: fastText captures subword information, allowing it to understand word meanings based on
their constituent character ngrams. This enables fastText to represent words by considering their
morphological makeup, providing a richer representation, especially for morphologically rich
languages or domains with specialised jargon.

Training Efficiency
• Word2Vec: The training process in Word2Vec is relatively faster than older methods but might be
slower than fastText due to its word-level approach.
• fastText: fastText is known for its exceptional speed and scalability, especially when dealing with
large datasets, as it operates efficiently at the subword level.

Use Cases
• Word2Vec: Word2Vec’s word-level embeddings are well-suited for
tasks like finding similar words, understanding relationships between
words, and capturing semantic similarities.
• fastText: fastText’s subword embeddings make it more adaptable in
scenarios involving out-of-vocabulary words, sentiment analysis,
language identification, and tasks requiring a deeper understanding of
morphology.

Language models
A language model is the core component of modern Natural Language Processing (NLP). It’s a
statistical tool that analyzes the pattern of human language for the prediction of words.
Language models analyze bodies of text data to provide a basis for their word predictions. They are
used in natural language processing (NLP) applications
Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the
probability of a given sequence of words occurring in a sentence.
A language model is a way of representing text in a form that is understandable from the machine's point of
view.
The goal of a language model is to compute the probability of a token sequence (e.g. a sentence or a sequence of
words).
A language model (LM) acts much like a grammar of a language, as it gives the probability of the word that will
follow.
Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences in a
language.
For NLP, a probabilistic model of a language that gives a probability that a string is a member of a
language is more useful.
To specify a correct probability distribution, the probability of all sentences in a language must sum to
1.

Uses of Language Models


Speech recognition
“I ate a cherry” is a more likely sentence than “Eye eight uh Jerry”
OCR & Handwriting recognition
More probable sentences are more likely correct readings.
Machine translation
More likely sentences are probably better translations.
Ex: P(High wind) > P(large wind)
Generation
More likely sentences are probably better NL generations.
Context sensitive spelling correction
P(about fifteen minutes from)>P(about fifteen minuets from)
Completion Prediction

A language model also supports predicting the completion of a sentence.


Please turn off your cell _____
Your program does not ______
Predictive text input systems can guess what you are typing and give choices on how to complete
it.

Estimate probability of each word given prior context.


P(phone | Please turn off your cell)
Number of parameters required grows exponentially with the number of words of prior context.
An N-gram model uses only N−1 words of prior context.
Unigram: P(phone)
Bigram: P(phone | cell)
Trigram: P(phone | your cell)
The Markov assumption is the presumption that the future behavior of a dynamical system only
depends on its recent history. In particular, in a kth-order Markov model, the next state only
depends on the k most recent states; therefore an N-gram model is an (N−1)-order Markov model

Probabilistic Language Model

Chain Rule

N-Gram Model Formulas

Word sequences

Chain rule of probability

Bigram approximation

N-gram approximation

Estimating Probabilities

N-gram conditional probabilities can be estimated from raw text based on the relative frequency of
word sequences.

Bigram:

N-gram:

To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every
sentence and treat these as additional words.
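In standard notation, the bigram approximation and the relative-frequency (maximum likelihood) estimates referred to above can be written as:

P(w_1 \ldots w_n) = \prod_{k=1}^{n} P(w_k \mid w_1 \ldots w_{k-1}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}
\qquad
P(w_n \mid w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\, w_n)}{C(w_{n-N+1:n-1})}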

Word Prediction
Guess the next word...
... I notice three guys standing on the ???

There are many sources of knowledge that could be helpful, including world knowledge.
But we can do pretty well by simply looking at the preceding words and keeping track of simple
counts
Formalize this task using N-gram models. N-grams are token sequences of length N.
Given N-grams counts, we can guess likely next words in a sequence.

More formally, we will assess the conditional probability of candidate words.


Also, we will assess the probability of an entire sequence of words.
Being able to predict the next word (or any linguistic unit) in a sequence is very common in many
applications

It lies at the core of the following applications


● Automatic speech recognition
● Handwriting and character recognition
● Spelling correction
● Machine translation
● Augmentative communication
● Word similarity, generation, POS tagging, etc.

Counting
He stepped out into the hall, was delighted to encounter a water brother.
13 tokens, 15 if we include “,” and “.” as separate tokens.
Assuming we include the comma and period, how many bigrams are there? 14
There are 12 bigrams without punctuation and non-alphanumeric characters:
he stepped
stepped out
out into
into the
the hall
hall was
was delighted
delighted to
to encounter
encounter a
a water
water brother

Counting: Types and Tokens

They picnicked by the pool, then lay back on the grass and looked at the stars.
18 tokens (again counting punctuation)
only 16 types (again counting punctuation)
only 14 types (excluding punctuation)
In going forward, we’ll sometimes count types and sometimes tokens.

Word Types - A unique word form regardless of how many times it appears in a text

Counting: Wordforms
Should “cats” and “cat” count as the same?
How about “geese” and “goose”?
● Lemma: cats & cat and geese & goose have the same lemmas
● Wordform: fully inflected surface form
Again, we’ll have occasion to count both lemmas and wordforms

Counting: Corpora

Corpus: body of text


Google
● Crawl of 1,024,908,267,229 English tokens
● 13,588,391 wordform types
● That seems like a lot of types... Even large dictionaries of English have only around 500k
types. Why so many here?

● Numbers
● Misspellings
● Names
● Acronyms
● etc

Language Modelling
To Perform word prediction:
Assess the conditional probability of a word given the previous words in the sequence
P(wn|w1,w2…wn-1).
We’ll call a statistical model that can assess this a Language Model.
One way to calculate them is to use the definition of conditional probabilities, and get the counts from
a large corpus.
Counts are used to estimate the probabilities.
Unfortunately, for most sequences and for most text collections we won’t get good estimates from this
method.
We’ll use the chain rule of probability
and an independence assumption.

P(its water was so transparent)=


P(its)*
P(water|its)*
P(was|its water)*
P(so|its water was)*
P(transparent|its water was so)
So, that’s the chain rule
We still have the problem of estimating probabilities of word sequences from counts; we’ll
never have enough data.
The other thing we mentioned an independence assumption
What type of assumption would make sense?

Independence Assumption

Make the simplifying assumption


P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|a)
Or maybe
P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|saw,a)

This kind of independence assumption is called a Markov assumption after the Russian mathematician
Andrei Markov.

For each component in the product replace with the approximation (assuming a prefix of N)

Bigram Version

The Maximum Likelihood Estimate (MLE)

Example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Bigram Simple Example

Assume here we have a very small corpus from an author XYZ


<s> I am a human </s>
<s> I am not a stone </s>
<s>I I live in Lahore </s>
What is probability that the following sentence was also written by the same author XYZ
“I I am not”

P(I I am not) = P(I|<s>) * P(I|I) * P(am|I) * P(not|am) * P(</s>|not)
= 3/3 * 1/4 * 2/4 * 1/2 * 0/1
= 1 * 0.25 * 0.5 * 0.5 * 0
= 0
where, for example, P(I|<s>) = count(<s> I) / count(<s>) = 3/3
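The same computation can be reproduced with a few lines of plain Python (counts only, no smoothing), which makes the source of the zero obvious:

from collections import Counter

corpus = ["<s> I am a human </s>",
          "<s> I am not a stone </s>",
          "<s> I I live in Lahore </s>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(w, prev):                               # maximum likelihood bigram estimate
    return bigrams[(prev, w)] / unigrams[prev]

sentence = "<s> I I am not </s>".split()
prob = 1.0
for prev, w in zip(sentence, sentence[1:]):
    prob *= p(w, prev)
print(prob)   # 0.0 -- the bigram "not </s>" never occurs in the corpus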

Training corpus:
<s> I am from Vellore </s>
<s> I am a teacher </s>
<s> students are good and are from various cities</s>
<s> students from Vellore do engineering</s>
Test data:
<s> students are from Vellore </s>

Method 1
P(<s> students are from Vellore </s>)
= P(students | <s>) * P(are | students) * P(from | are)
* P(Vellore | from) * P(</s> | Vellore)

P(<s> students are from Vellore </s>)


= P(students | <s>) * P(are | students) * P(from | are)
* P(Vellore | from) * P(</s> | Vellore)
= 1/4 * 1/2 * 1/2 * 2/3 * 1/2 = 0.0208

Method 2

Formal way of estimating the bigram probability of a word sequence:


The bigram probabilities of the test sentence can be calculated by constructing Unigram and bigram
probability count matrices and bigram probability matrix as follows;
Unigram count matrix

Bigram count matrix

Bigram probability matrix (normalized by unigram counts)

P(<s> students are from Vellore </s>)


= P(students | <s>) * P(are | students) * P(from | are)
* P(Vellore | from) * P(</s> | Vellore)
= 1/4 * 1/2 * 1/2 * 2/3 * 1/2 = 0.0208

Berkeley Restaurant Project Sentences


can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Bigram Counts

Out of 9222 sentences


Eg. “I want” occurred 827 times

Bigram Probabilities

Divide bigram counts by prefix unigram counts to get probabilities.

827/2533=0.326=>0.33

What is the probability of the sentence “I want want to”


P(I want want to) = P(want|I) * P(want|want) * P(to|want)
= 827/2533 * 0/927 * 608/927
= 0

Bigram Estimates of Sentence Probabilities


P(i|<s>) = 0.25 P(english|want) = 0.0011
P(want|I)=.33 P(food|english) = 0.5 P(</s>|food) = 0.68
P(<s> I want english food </s>) =
= P(I|<s>)* P(want|I)* P(english|want)* P(food|english)* P(</s>|food)
= 0.25*0.33*0.0011*0.5*0.68
=.000031

Kinds of Knowledge

As crude as they are, N-gram probabilities capture a range of interesting facts about language.

These probabilities reflect world knowledge, syntax, and discourse:

P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (i | <s>) = .25

The meaning of any sentence depends upon the meaning of the sentence just before it. In addition, it
also contributes to the meaning of the immediately succeeding sentence.


Shannon’s Visualization Method

Let’s turn the model around and use it to generate random sentences that are like the sentences from
which the model was derived.
Illustrates:
Dependence of N-gram models on the specific training corpus
Greater N → better model
Generally attributed to
Claude Shannon.

Sample a random bigram (<s>, w) according to its probability


Now sample a random bigram (w, x) according to its probability
And so on until we randomly choose a (y, </s>)
Then string the words together
I want to eat Chinese food

<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>

Shakespeare as a Corpus
N=884,647 tokens, V=29,066 (word types)
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams...
So, 99.96% of the possible bigrams were never seen (have zero entries in the table)
This is the biggest problem in language modeling; we’ll come back to it.
Quadrigrams (4-grams) are worse: what's coming out looks like Shakespeare because it
is Shakespeare

The Wall Street Journal is Not Shakespeare

Evaluation

How do we know if one model is better than another?


Shannon’s game gives us an intuition.
The generated texts from the higher order models sound more like the text the model was
obtained from.
Standard method
Train parameters of our model on a training set.
Look at the model’s performance on some new data: a test set. A dataset which is different
than our training set, but is drawn from the same source
Then we need an evaluation metric to tell us how well our model is doing on the test set.
One such metric is perplexity (to be introduced below)

Lower perplexity means a better model

Training 38 million words, test 1.5 million words, WSJ

Perplexity

The best language model is one that best predicts an unseen test set.
Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words.

Perplexity is the probability of the test set (assigned by the language model), normalized by the
number of words:

Chain rule:

For bigrams:
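Written out for a test set W = w_1 w_2 … w_N, the definitions referred to above are:

PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
\qquad
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \ \ \text{(bigram case)}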

Minimizing perplexity is the same as maximizing probability


The best language model is one that best predicts an unseen test set

Assume here we have a very small corpus from an author XYZ


<s> I am a here </s>
<s> Who am I </s>
<s>I would like to know </s>
<s> I I am not </s>
Using bigram model calculate the perplexity of the sentence
“I I am not”

P(I I am not) = P(I|<s>) * P(I|I)* P(am|I) * P(not|am)* P(</s>|not)


= 3/4 * 1/5 * 2/5 * 1/3 * 1/1
= 0.02

PP(I I am not) = (0.02)^(−1/4)    (N = 4, the number of tokens)


= 2.659
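A quick numeric check of this example in plain Python, using the counts already worked out above:

prob = (3/4) * (1/5) * (2/5) * (1/3) * (1/1)     # P(I I am not) from the bigram counts above
n_tokens = 4                                      # "I I am not"
perplexity = prob ** (-1 / n_tokens)
print(round(prob, 3), round(perplexity, 3))       # 0.02 2.659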

Evaluating N-Gram Models

Best evaluation for a language model


Put model A into an application
For example, a speech recognizer
Evaluate the performance of the application with model A
Put model B into the application and evaluate
Compare performance of the application with the two models
Extrinsic evaluation

Difficulty of extrinsic (in-vivo) evaluation of N-gram models


Extrinsic evaluation
This is really time-consuming
Can take days to run an experiment
So
As a temporary solution, in order to run experiments
To evaluate N-grams we often use an intrinsic evaluation, an approximation called perplexity
But perplexity is a poor approximation unless the test data looks just like the training data
So is generally only useful in pilot experiments (generally is not sufficient to publish)
But is helpful to think about.

Smoothing in NLP

Unknown Words = Zero probabilities

Language models may still be subject to the problem of Sparsity.


For any n-gram that occurred a sufficient number of times, we might have a good estimate of
its probability. But because any corpus is limited, some perfectly acceptable English word
sequences are bound to be missing from it.

That is, we’ll have many cases of putative “zero probability n-grams” that should really have
some non- zero probability.

Zero Counts

Back to Shakespeare
Recall that Shakespeare produced 300,000 bigram types out of V² = 844 million possible
bigrams...
So, 99.96% of the possible bigrams were never seen (have zero entries in the table)
Does that mean that any sentence that contains one of those bigrams should have a
probability of 0?

Smoothing
Smoothing in natural language processing (NLP) refers to a set of techniques used to adjust
probability estimates in models to handle the problem of zero probabilities for unseen events. This is
especially important in the context of language modeling, where we often deal with n-grams.

This modification is called smoothing or discounting.


Add-one smoothing (Laplace smoothing),
Add-k smoothing,
Backoff, and Kneser-Ney smoothing.

Add-One Smoothing

When constructing a probabilistic language model, particularly with n-grams, you often encounter
situations where certain word sequences (n-grams) have not been observed in the training data.
Without any adjustments, the probability of these unseen sequences would be zero, which can
severely impact the performance of the model.

Add-One Smoothing modifies the probability estimation to ensure that no sequence has a zero
probability. It does this by adding one to the count of each n-gram before normalizing:

For unigrams:
Add 1 to every word (type) count
Normalize by N (tokens) + V (types)
Smoothed count (adjusted for additions to N) is

Normalize by N to get the new unigram probability:

Add delta smoothing
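Written out, the add-one (Laplace) formulas just described and the more general add-k (add-delta) variant are, where V is the vocabulary size, N the number of tokens, and c_i the count of word w_i:

P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}
\qquad
c_i^{*} = (c_i + 1)\,\frac{N}{N + V}

P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
\qquad
P_{\text{Add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}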



Berkeley Restaurant Project Sentences

can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day

Note that add-one smoothing has made a very big change to the counts: C(want to) changed from 608
to 238!
We can see this in probability space as well: P(to|want) decreases from .66 in the unsmoothed case to
.26 in the smoothed case.
Chunking or shallow parsing

What is Chunking?

Chunk represents a span of words.

Chunking is a process of extracting phrases from unstructured text. It divides the text into

syntactically correlated parts of words. Instead of just simple tokens which may not represent

the actual meaning of the text, it's advisable to use phrases such as “South Africa” as a single

word instead of ‘South’ and ‘Africa’ separate words.

Chunking works on top of POS tagging, it uses pos-tags as input and provides chunks as output.

Similar to POS tags, there are a standard set of Chunk tags like Noun Phrase(NP), Verb Phrase

(VP), etc. Chunking is very important when you want to extract information from text such as

Locations, Person Names etc. In NLP it is called Named Entity Extraction.

Unlike full parsing or deep parsing, which involves analyzing the grammatical structure of a

sentence, shallow parsing focuses on identifying individual phrases or constituents, such as

noun phrases, verb phrases, and prepositional phrases.

For example, in the sentence “The black cat sat on the mat,” the noun phrase “the black cat”

can be identified and extracted using noun phrase chunking. Another common type of shallow

parsing is verb phrase chunking, which involves identifying and extracting all the verb phrases

in a sentence.

Chunking or shallow parsing is sufficient for closely related languages rather than full parsing.

Rule based chunking:

Let’s consider Noun Phrase Chunking, where we search for chunks corresponding to individual
noun phrases. In order to create NP chunks, we define the chunk grammar using POS tags. We
will define this using a single regular expression rule.

The rule states that whenever the chunker finds an optional determiner (DT) followed by any
number of adjectives (JJ) and then a noun (NN), a Noun Phrase (NP) chunk should be formed.

grammar for NP: {<DT>?<JJ>*<NN>}

Neural Network based chunking:

Examples:

1) [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]

2) [NP Mr. John] [VP had been] [NP organizing chair] [PP of] [NP NLP Conference]

The chunking task, as shown in the above examples, can be transformed into a sequence labelling
task so that RNNs and other similar networks can be used.

For this, we use BIO notation to represent a chunk.

A chunk NP is represented by a begin-of-chunk tag (B-NP) and inside-of-chunk tags (I-NP).
Tokens that do not belong to any chunk are given the O label (to denote ‘outside’).

The tagging for example 2 using BIO notation is as follows:

Mr./B-NP John/I-NP had/B-VP been/I-VP organizing/B-NP chair/I-NP of/B-PP NLP/B-NP Conference/I-NP

Now each word has a label, which makes the task suitable for neural sequence models.
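As an illustrative sketch (the chunk list below is hand-built from example 2; it is not code from the original notes), the bracketed chunks can be converted to BIO labels with a few lines of Python:

# Each chunk is (chunk type or None for 'outside', list of tokens) -- hand-built input
chunks = [("NP", ["Mr.", "John"]), ("VP", ["had", "been"]),
          ("NP", ["organizing", "chair"]), ("PP", ["of"]),
          ("NP", ["NLP", "Conference"])]

def to_bio(chunks):
    pairs = []
    for label, tokens in chunks:
        for k, tok in enumerate(tokens):
            if label is None:
                pairs.append((tok, "O"))            # token outside any chunk
            else:
                prefix = "B" if k == 0 else "I"     # begin vs inside of the chunk
                pairs.append((tok, prefix + "-" + label))
    return pairs

print(to_bio(chunks))
# [('Mr.', 'B-NP'), ('John', 'I-NP'), ('had', 'B-VP'), ('been', 'I-VP'), ...]

The resulting (token, label) pairs are exactly the training targets a neural sequence labeller or a CRF would be trained on.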

Conditional Random Fields


A Conditional Random Field (CRF) is a type of probabilistic graphical model often used in
Natural Language Processing (NLP) and computer vision tasks. It models the conditional
probability distribution p(y | x), where x and y are vector-valued variables, and it is a
sequence modelling algorithm. Unlike an HMM, it allows features that depend on each other,
and it also takes future observations into account while learning a pattern, combining the
strengths of both HMM and MEMM. In terms of performance, it is considered one of the best
methods for the entity recognition problem. CRFs are used for structured prediction tasks, where
the goal is to predict a structured output based on a set of input features. For example, in
NLP, a common structured prediction task is Part-of-Speech (POS) tagging, where the goal
is to assign a part-of-speech tag to each word in a sentence. CRFs can also be used for Named
Entity Recognition (NER), chunking, and other tasks where the output is a structured
sequence.

CRF uses the labels of nearby words.


For example, In POS tagging, the goal is to label a sentence (a sequence of words or tokens)
with tags like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE.

For example, given the sentence “Bob drank coffee at Starbucks”, the labeling might be “Bob
(NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)”.

So, to build a conditional random field to label sentences with their parts of speech, we have
to decide on a set of feature functions fi.

Feature Functions in a CRF
In a CRF, each feature function is a function that takes in as input:
• a sentence s
• the position i of a word in the sentence
• the label li of the current word
• the label li-1 of the previous word
and outputs a real-valued number (though the numbers are often just either 0 or 1).

(Note: A linear-chain CRF is built by restricting our features to depend on only the current and
previous labels, rather than arbitrary labels throughout the sentence)
For example, one possible feature function could measure with how much probability, the
current word should be labeled as an adjective given that the previous word is “very”.

Defining Feature function.


In order to convert a sentence into a sequence of features that can be used as input to a CRF
model, a feature function that extracts relevant information from each word in the sentence
can be defined. Here’s an example feature function that extracts the following features for
each word in the sentence:
• The word itself.
• Whether the word is in lowercase.
• Whether the word is in uppercase.
• The length of the word.
• Whether the word contains a hyphen.
• Whether the word is the first word in the sentence.
• Whether the word is the last word in the sentence.
• The previous word in the sentence.
• The next word in the sentence.
def word2features(sentence, i):
    # sentence is a list of (word, pos) tuples; i is the index of the current word
    word = sentence[i][0]
    features = {
        'word': word,
        'is_first': i == 0,                              # the word is the first word
        'is_last': i == len(sentence) - 1,               # the word is the last word
        'is_capitalized': word[0].upper() == word[0],    # first character is upper-case
        'is_all_caps': word.upper() == word,             # word is in uppercase
        'is_all_lower': word.lower() == word,            # word is in lowercase
        # prefixes of the word
        'prefix-1': word[0],
        'prefix-2': word[:2],
        'prefix-3': word[:3],
        # suffixes of the word
        'suffix-1': word[-1],
        'suffix-2': word[-2:],
        'suffix-3': word[-3:],
        # extracting the previous word
        'prev_word': '' if i == 0 else sentence[i-1][0],
        # extracting the next word
        'next_word': '' if i == len(sentence) - 1 else sentence[i+1][0],
        'has_hyphen': '-' in word,                       # word contains a hyphen
        'is_numeric': word.isdigit(),                    # word is numeric
        'capitals_inside': word[1:].lower() != word[1:]
    }
    return features
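As a hedged usage sketch (assuming the third-party sklearn-crfsuite package, which these notes do not mention), the feature dictionaries produced by word2features above can be fed to a linear-chain CRF roughly as follows; the tiny training set here is made up purely for illustration:

# pip install sklearn-crfsuite   (assumed third-party package)
import sklearn_crfsuite

# Hypothetical training data: each sentence is a list of (word, pos) pairs,
# and each label sequence gives the tag for every word.
train_sents = [[("Bob", "NNP"), ("drank", "VBD"), ("coffee", "NN")]]
train_labels = [["NOUN", "VERB", "NOUN"]]

X_train = [[word2features(s, i) for i in range(len(s))] for s in train_sents]
y_train = train_labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)          # weights are learned by optimizing the CRF objective
print(crf.predict(X_train))        # predicted label sequences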

How to convert features to Probabilities?


Next, assign each feature function f_j a weight λ_j. The weights will be learned using a gradient
descent algorithm. Given a sentence s, we can now score a labeling l of s by adding up the
weighted features over all words in the sentence:

  score(l | s) = Σ_j Σ_i λ_j f_j(s, i, l_i, l_i-1)

(The outer sum runs over each feature function j, and the inner sum runs over each position i of
the sentence.)

Finally, we can transform these scores into probabilities p(l|s) between 0 and 1 by
exponentiating and normalizing:

  p(l | s) = exp(score(l | s)) / Σ_l' exp(score(l' | s))
To sum up: to build a conditional random field, just define a bunch of feature functions (which
can depend on the entire sentence, a current position, and nearby labels), assign them weights,
and add them all together, transforming at the end to a probability if necessary.
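A minimal, self-contained sketch of this score-then-normalize computation, with two made-up feature functions and hand-picked weights (purely illustrative, not the notes’ example):

import math
from itertools import product

LABELS = ["NOUN", "VERB"]

# Two toy feature functions f_j(sentence, i, label_i, label_prev)
def f1(s, i, li, lprev):   # current word ends in 'ed' and is tagged VERB
    return 1.0 if s[i].endswith("ed") and li == "VERB" else 0.0

def f2(s, i, li, lprev):   # previous label NOUN immediately followed by VERB
    return 1.0 if lprev == "NOUN" and li == "VERB" else 0.0

feature_funcs = [f1, f2]
weights = [1.5, 0.8]       # lambda_j, normally learned by gradient descent

def score(labeling, sentence):
    # sum over feature functions j and positions i of lambda_j * f_j(...)
    total = 0.0
    for lam, f in zip(weights, feature_funcs):
        for i, li in enumerate(labeling):
            lprev = labeling[i-1] if i > 0 else "<s>"
            total += lam * f(sentence, i, li, lprev)
    return total

def prob(labeling, sentence):
    # exponentiate and normalize over every possible labeling l'
    num = math.exp(score(labeling, sentence))
    denom = sum(math.exp(score(l, sentence)) for l in product(LABELS, repeat=len(sentence)))
    return num / denom

sentence = ["Bob", "walked"]
print(prob(("NOUN", "VERB"), sentence))   # highest-probability labeling under these toy weights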

It Looks like Logistic Regression and HMM


However, CRFs can model a much richer set of label distributions as well, for two main
reasons:

- CRFs can define a much larger set of features. Whereas HMMs are necessarily local in
nature (because they’re constrained to binary transition and emission feature functions, which
force each word to depend only on the current label and each label to depend only on the
previous label), CRFs can use more global features. For example, one of the features in our
POS tagger above increased the probability of labelings that tagged the first word of a sentence
as a VERB if the end of the sentence contained a question mark.
- CRFs can have arbitrary weights.

Learning Weights
Gradient descent can be used.
Assume there is a bunch of training examples (sentences and associated part-of-speech labels).
Randomly initialize the weights of the CRF model. To shift these randomly initialized weights
to the correct ones, for each training example…

• Go through each feature function f_j, and calculate the gradient of the log probability
of the training example with respect to λ_j:

  ∂/∂λ_j log p(l | s) = Σ_i f_j(s, i, l_i, l_i-1) − Σ_l' p(l' | s) Σ_i f_j(s, i, l'_i, l'_i-1)

Note that the first term in the gradient is the contribution of feature f_j under the true
label, and the second term in the gradient is the expected contribution of
feature f_j under the current model.
• Move λ_j in the direction of the gradient:

  λ_j = λ_j + α [ Σ_i f_j(s, i, l_i, l_i-1) − Σ_l' p(l' | s) Σ_i f_j(s, i, l'_i, l'_i-1) ]

where α is some learning rate.


• Repeat the previous steps until some stopping condition is reached (e.g., the updates
fall below some threshold).
In other words, every step takes the difference between what we want the model to learn and
the model’s current state, and moves λi in the direction of this difference.
Finding the Optimal Labeling
After training the CRF model, how to label a new sentence?

The naive way is to calculate p(l|s) for every possible labeling l, and then choose the labeling that
maximizes this probability. However, since there are k^m possible labelings for a tag set of size k
and a sentence of length m, this approach would have to check an exponential number of labelings.
A better way is to realize that (linear-chain) CRFs satisfy an optimal substructure property, so
a polynomial-time dynamic programming algorithm, similar to the Viterbi algorithm for
HMMs, can be used to find the optimal labeling.
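A compact sketch of such a Viterbi-style dynamic program for a linear-chain model; local_score is a toy stand-in for the weighted feature sum, with made-up weights (not the notes’ example):

LABELS = ["NOUN", "VERB"]

def local_score(s, i, label, prev_label):
    # Toy stand-in for sum_j lambda_j * f_j(s, i, label, prev_label)
    score = 0.0
    if s[i].endswith("ed") and label == "VERB":
        score += 1.5
    if prev_label == "NOUN" and label == "VERB":
        score += 0.8
    if not s[i].endswith("ed") and label == "NOUN":
        score += 1.0
    return score

def viterbi_decode(sentence, labels, score_fn):
    # best[i][l] = best total score of any labeling of words 0..i that ends with label l
    n = len(sentence)
    best = [{l: score_fn(sentence, 0, l, "<s>") for l in labels}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for l in labels:
            # pick the previous label that maximizes the running score
            prev, val = max(((p, best[i - 1][p] + score_fn(sentence, i, l, p))
                             for p in labels), key=lambda x: x[1])
            best[i][l] = val
            back[i][l] = prev
    # backtrack from the best final label
    last = max(best[-1], key=best[-1].get)
    seq = [last]
    for i in range(n - 1, 0, -1):
        last = back[i][last]
        seq.append(last)
    return list(reversed(seq))

print(viterbi_decode(["Bob", "walked", "home"], LABELS, local_score))
# ['NOUN', 'VERB', 'NOUN']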

Following is not required to be studied for theory exam

There are a lot of libraries which give phrases out of the box, such as spaCy or TextBlob. NLTK
just provides a mechanism using regular expressions to generate chunks.

conda install -c anaconda nltk

import nltk
from nltk import pos_tag,word_tokenize
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
sentence = "the little yellow dog barked at the cat"
grammar = ('''
NP: {<DT>?<JJ>*<NN>} # NP
''')
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunkParser.parse(tagged)
for subtree in tree.subtrees():
    print(subtree)

tree.draw()

#output:
(S
(NP the/DT little/JJ yellow/JJ dog/NN)
barked/VBD
at/IN
(NP the/DT cat/NN))
(NP the/DT little/JJ yellow/JJ dog/NN)
(NP the/DT cat/NN)

Here "tree" variable is an NLTK tree. Each "chunk" and "non chunk" is a "subtree" of the tree.
We can reference these by doing something like tree.subtrees. We can then iterate through
these subtrees like so:

for subtree in tree.subtrees():
    print(subtree)
But, if we are interested in getting just the Noun phrases, ignoring the rest, we can use the filter
parameter in the tree.subtrees() call (as shown below).

for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(subtree)

sentence = "the little yellow dog barked at the cat"

grammar = ('''

NP: {<DT>?<JJ>*<NN>} # NP

''')

Varalakshmi M, SCOPE
chunkParser = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

tree = chunkParser.parse(tagged)

print("output")

for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):

print(subtree)

print("done")

#tree.draw()

#output:
(NP the/DT little/JJ yellow/JJ dog/NN)
(NP the/DT cat/NN)

This is to extract only the verb phrases and print them:

sentence = "the little yellow dog barked at the cat"

# the NP grammar from the previous example is replaced by a rule matching verb forms (VB, VBD, VBZ, ...)
grammar = ('''VB: {<VB.?>}''')

chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunkParser.parse(tagged)

print("output")
for subtree in tree.subtrees(filter=lambda t: t.label() == 'VB'):
    print(subtree)
print("done")

#tree.draw()

#output:
(VB barked/VBD)

Parse Tree Generation
Syntax: It refers to the way words are arranged together in a sentence.
Syntactic constituency: Group of words behaving as single units or constituents
Noun phrase: A sequence of words surrounding at least one noun, eg., "John the servant", "the
lighthouse"
An entire noun phrase can occur before a verb eg., John the servant came out.
Similarly, a prepositional phrase can be placed at the beginning (preposed), at the end (postposed)
or even placed in the middle of the sentence. But the individual words cannot be split up.
Eg., “on January 3rd”
On January 3rd, I would like to go for a trip. (preposed)
I would like to go for a trip on January 3rd. (postposed)
I would like to go on January 3rd for a trip (placed in the middle)
I would like to go on January for a trip 3rd ( this is wrong… a prepositional phrase cannot be split)

Context-free grammar (CFG) or Phrase-Structure Grammar:


- It is used to model constituent structure.
- Consists of a set of rules or productions, each of which expresses the ways that words of the
language can be grouped and ordered together
- Also consists of a lexicon of words and symbols
- Consists of 2 types of symbols – terminals (symbols that correspond to words in the
language eg., the, horse) and non-terminals (symbols that express abstractions over these
terminals)
- Consists of a Start symbol which is also a member of non-terminals
Eg., NP -> DET NOUN | ADJ NOUN
which means a noun phrase derives (consists of) a determiner followed by a noun, or an adjective
followed by a noun.

The sequence of rule expansions is called a derivation of the string of words. A parse tree is generally
used to represent a derivation.
Eg.,

Syntactic parsing: It is the problem of mapping from a string of words to its parse tree. It determines
if the structure of a sentence is according to the grammar of the language.
There are several approaches to constructing a parse tree: top-down and bottom-up.
Ambiguous sentences lead to construction of more than one parse tree.



The CYK or CKY (Cocke–Kasami–Younger) algorithm is a DP (Dynamic Programming) technique used to
efficiently generate all possible parse trees for a given word sequence and a grammar.
Note: For the CYK algorithm to be applied, the grammar should be in CNF (Chomsky Normal form).
A grammar is said to be in CNF, if each rule derives a single non-terminal or two non-terminals.
Eg., S-> NP VP, VP->V NP, PP->P V, DET->a, P->with ---> all these are in CNF
NP->the noun, DET->a the , NP -> N P V ----> all these are not in CNF

How to convert a CFG into CNF grammar?


Introduce a dummy non-terminal wherever necessary.
Eg.,i) NP-> the NOUN
Can be written as
NP->DET N
DET-> the
ii) NP-> DET NOUN PP can be written as
NP->DET NOM
NOM->NOUN PP

Parse tree Construction:


Example-1:
Consider the following CNF and the sentence “the flight includes a meal”.
S->NP VP
NP->DET N
VP->V NP
V->includes
DET->the
DET->a
N->meal
N->flight

The filled CYK chart for "the flight includes a meal" (cell xij covers the words from position i to j):

Row 0 (the):       x01: DET->the | x02: NP->DET N | x03: ---- | x04: ---- | x05: S->NP VP
Row 1 (flight):    x12: N->flight | x13: ---- | x14: ---- | x15: ----
Row 2 (includes):  x23: V->includes | x24: ---- | x25: VP->V NP
Row 3 (a):         x34: DET->a | x35: NP->DET N
Row 4 (meal):      x45: N->meal



I) X02=x01+x12= DET,N =NP (since there is a production in the grammar where NP derives
DET,N)
X13=x12+x23=N,V= null (as there is no production that derives N,V)
X24=x23+x34=V,DET=null
X35=x34+x45=DET,N =NP
II) X03=(x01+x13) or (x02+x23)
=(DET,null) or (NP,V)
=(null) or (null)
X14=(x12+x24) or (x13+x34)
=(N,null) or (null, DET)
=(null) or(null)
X25=(x23+x35) or (x24+x45)
=(V,NP) or (null,N)
=(VP->V,NP) or (null)
III) X04=(x01+x14) or (x02+x24) or (x03+x34)
=(DET,null) or (NP,null) or (null,DET)
=null
X15=(x12+x25) or (x13+x35) or (x14+x45)
=(N,VP) or (null,NP) or (null,N)
=null
IV) X05=(x01+x15) or (x02+x25) or (x03+x35) or (x04+x45)
=(DET,null) or (NP,VP) or (null,NP) or (null,N)
=S->NP,VP

Note: At last, if the start symbol is obtained, it infers that the sentence is correct according to the
given CNF grammar. If the start symbol is not reached, then the sentence is syntactically wrong.
Ie., arrangement of the words is not correct.

Now, to construct the parse tree, start with the cell ‘S->NP, VP’. Look for the cell in the same row
that has rule for NP and look for the cell in the same column that has rule for VP and proceed with
the same way until all the leaf nodes contain the terminal symbols.
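A short CYK recognizer sketch in Python for the grammar of Example 1 (the grammar encoding and function names are my own; the notes only fill the table by hand):

from collections import defaultdict

# CNF grammar from Example 1, written as (left-hand side, right-hand side) pairs.
grammar = [
    ("S", ("NP", "VP")), ("NP", ("DET", "N")), ("VP", ("V", "NP")),
    ("V", ("includes",)), ("DET", ("the",)), ("DET", ("a",)),
    ("N", ("meal",)), ("N", ("flight",)),
]

def cyk(words):
    n = len(words)
    # table[(i, j)] = set of non-terminals that derive words[i:j]
    table = defaultdict(set)
    for i, w in enumerate(words):                        # fill the diagonal from the lexicon
        for lhs, rhs in grammar:
            if rhs == (w,):
                table[(i, i + 1)].add(lhs)
    for span in range(2, n + 1):                         # length of the span
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):                    # split point
                for lhs, rhs in grammar:
                    if len(rhs) == 2 and rhs[0] in table[(i, k)] and rhs[1] in table[(k, j)]:
                        table[(i, j)].add(lhs)
    return "S" in table[(0, n)], table

ok, table = cyk("the flight includes a meal".split())
print(ok)                      # True: the sentence is derivable from S
print(table[(0, 5)])           # {'S'}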

Example-2:
Let’s consider an example grammar and sentence that lead to multiple parse trees.
Note: If there are 2 productions for the same non-terminal, let us label them as rule 1 and rule2.
Eg., in the CNF grammar given below, there are two rules for VP. So, label them as VP1 and VP2



S → NP VP
VP1 → V NP
VP2 → VP PP
PP → P NP
V → eat
NP → NP PP
NP → we
NP → fish
NP → fork
P → with
“We eat fish with fork”

The filled CYK chart for "we eat fish with fork":

Row 0 (we):    x01: NP->we | x02: ---- | x03: S->NP VP1 | x04: ---- | x05: S->NP VP1 and S->NP VP2
Row 1 (eat):   x12: V->eat | x13: VP1->V NP | x14: ---- | x15: VP1->V NP and VP2->VP PP
Row 2 (fish):  x23: NP->fish | x24: ---- | x25: NP->NP PP
Row 3 (with):  x34: P->with | x35: PP->P NP
Row 4 (fork):  x45: NP->fork

I) X02=x01+x12= NP,V =null


X13=x12+x23=V,NP=VP1
X24=x23+x34=NP,P=null
X35=x34+x45=P,NP=PP
II) X03=(x01+x13) or (x02+x23)
=(NP,VP1) or (null,NP)
=(S->NP,VP1) or (null)
X14=(x12+x24) or (x13+x34)
=(V,null) or (VP1,P)
=(null) or (null)
X25=(x23+x35) or (x24+x45)
=(NP,PP) or (null,NP)
=(NP->NP,PP) or (null)
III) X04=(x01+x14) or (x02+x24) or (x03+x34)
=(NP,null) or (null,null) or (S,P)



=null
X15=(x12+x25) or (x13+x35) or (x14+x45)
=(V,NP) or (VP1,PP) or (null,NP)
=(VP1->V,NP) or (VP2->VP,PP)
IV) X05=(x01+x15) or (x02+x25) or (x03+x35) or (x04+x45)
=(NP,VP1 and NP,VP2) or (null,NP) or (S,PP) or (null,NP)
=S->NP,VP1 and S->NP, VP2
Parse tree generation:
i) First construct the tree with the first production of ‘S’, S->NP,VP1.
(Note: trace the NP and VP1 constituents back through the corresponding cells of the CYK table above.)
This results in the interpretation “we eat, fish with fork”, ie., we eat that fish which has a fork.
ii) Now construct the tree with the second production of ‘S’, S->NP,VP2.
(Note: again trace the constituents back through the table.)
This results in the interpretation “we eat fish, with fork”, ie., we eat fish using a fork.



Language Models
Models that assign probabilities to sequence of words are called Language models.
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The
items can be phonemes, syllables, letters, words or base pairs according to the application
which are typically collected from a text or speech corpus. It is useful for spelling and
grammar correction and translation.
A 2-gram is a two-word sequence of words like “turn off” and a 3-gram is a three-word
sequence of words like “take the test”.
n-gram models can be used to estimate the probability of the last word of an n-gram, given
the previous words and also to assign probabilities to entire sequences.
The term ‘n-gram’ can be used to mean either the word sequence itself or the predictive
model that assigns it a probability.
The joint probability of A, B can be written as
P(A,B)=p(A).p(B|A)
This joint probability formula can be extended for multiple variables as follows.

P(x1,x2,x3….xn) =p(x1).p(x2|x1).p(x3|x1x2).p(x4|x1.x2.x3)……..p(xn| x1.x2.x3….. xn-1)

Therefore, the joint probability of words in sentences is as follows.

P(w1,w2,w3….wn) =p(w1).p(w2|w1).p(w3|w1w2).p(w4|w1.w2.w3)……..p(wn| w1.w2.w3….. wn-1)

Eg., the probability of occurrence of the word sequence “school students do their homework”
is
P(school students do their homework)=p(school).p(students | school).p(do | school
students).p(their | school students do).p(homework | school students do their) ……(1)
But the drawback here is that the longer the sequence, the less likely we are to find it in a
training corpus.
This can be resolved using n-gram models.
The intuition of the n-gram model is that instead of computing the probability of a word
given its entire history, we can approximate the history by just the last few words.
The bigram model, for example, approximates the probability of a word given
all the previous words P(wn | w1:n-1) by using only the conditional probability of the
preceding word P(wn | wn-1).
In other words, instead of computing the probability, p(homework | school students do their),
we approximate it with the probability, p(homework | their).
This way of assuming that the probability of a word depends only on the previous word is
called a Markov assumption.
We can generalize the bigram (which looks one word into the past) to the trigram (which
looks two words into the past) and thus to the n-gram (which looks n-1 words into the past).

Thus, the general equation for this N-gram approximation to the conditional probability of the
next word in a sequence is
P(wn|w1:n-1) ≈ P(wn | wn-N+1:n-1)
In a unigram (1-gram) model, no history is used. In a bigram, one word history is used and in
a n-gram, n-1 words history is used.
Based on this, Equation (1) can be written as follows.
P(school students do their homework)=p(school)p(students | school).p(do | students).p(their |
do).p(homework | their) ……………….(2)

Maximum Likelihood Estimation (MLE) can be used to find the probabilities of n-grams that
uses the count of occurrences of the n-grams.
Eg., for the bigram model, the probability of a bigram is calculated by taking the count of the
number of times a given bigram (wn-1 wn) occurs in a corpus and normalizing it by the total
number of bigrams that start with the same word (wn-1) in the corpus:

  P(wn | wn-1) = count(wn-1 wn) / Σw count(wn-1 w)

But the total count of bigrams that start with wn-1 is simply the unigram count of wn-1, so the
formula simplifies to

  P(wn | wn-1) = count(wn-1 wn) / count(wn-1)

ie., p(homework | their) = count("their homework") / count("their")

Likewise, the probability should be calculated for all the other components p(students |
school), p(do | students) and p(their | do) given in equation (2).
Next word estimation using bi-gram model:
Conditional probability is given by

  P(B | A) = p(A, B) / p(A)   ………….. (3)

Given the word sequence “school students do their homework”, if we wish to find the
probability for the next word to be “regularly”, based on eqn (3), the formula can be written
as

P(regularly | school students do their homework)
  = p(school students do their homework regularly) / p(school students do their homework)   ----- from eqn (3)

Expanding both the numerator and the denominator with the bigram approximation of eqn (2):

  = [p(school) · p(students | school) · p(do | students) · p(their | do) · p(homework | their) · p(regularly | homework)]
    / [p(school) · p(students | school) · p(do | students) · p(their | do) · p(homework | their)]

(after cancelling out the common terms)

P(regularly | school students do their homework) = p(regularly | homework)
  = count("homework regularly") / count("homework")

Thus, to find the probability of the next word in the sequence “school students do their
homework” to be “regularly”, it is enough if we find the probability of “regularly” given just
the previous word, “homework”.
Trigram:
The same idea used for the bigram model can be extended to trigram, four-gram and, in general,
n-gram models.

  p(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)

P(school students do their homework) = p(school) · p(students | <s> school) · p(do | school students) · p(their | students do) · p(homework | do their)
Given the word sequence “school students do their homework”, if we wish to find the
probability for the next word to be “regularly”, using trigram model, the formula can be
written as
P(regularly | school students do their homework)
  = p(school students do their homework regularly) / p(school students do their homework)   ----- from eqn (3)

  = [p(school) · p(students | <s> school) · p(do | school students) · p(their | students do) · p(homework | do their) · p(regularly | their homework)]
    / [p(school) · p(students | <s> school) · p(do | school students) · p(their | students do) · p(homework | do their)]

(after cancelling out the common terms)

P(regularly | school students do their homework) = p(regularly | their homework)
  = count("their homework regularly") / count("their homework")

Thus, in a trigram model, to find the probability of the next word in the sequence “school
students do their homework” to be “regularly”, it is enough if we find the probability of
“regularly” given just the previous two words, “their homework”.

In practice, unigram models are not commonly used because the next word will be predicted
as the highest frequency word in the corpus.
Eg., Assume the next word has to be predicted for the sequence “ covishield is the vaccine for
_______”. If the word “and” occurs, the highest number of times in a corpus, then the
unigram model will predict “and” as the next word resulting in “covishield is the vaccine for
and”.
Bigram, trigram and four-gram models are usually preferred. For models with higher values
of ‘n’, large corpus is required.

Examples:
Consider the following tiny corpus with seven sentences:

<s> I am Henry </s>
<s> I like college </s>
<s> Do Henry like college </s>
<s> Henry I am </s>
<s> Do I like Henry </s>
<s> Do I like college </s>
<s> I do like Henry </s>

Word frequencies: <s> 7, </s> 7, I 6, am 2, Henry 5, like 5, college 3, do 4

i) Using bigram model, predict the most probable next word for the sequence,
<s> do _______
Next word Probability of next word
P(</s> | do) 0/4
P(I | do) 2/4 -> the most probable next word is “I”
P(am | do) 0/4
P(Henry | do) ¼
P( like | do) ¼
P(college| do) 0/4
P(do | do) 0/4

ii) Using bigram model, predict the most probable next word for the sequence,
<s> I like Henry _______

Next word Probability of next word


P(</s> | Henry) 3/5 -> the most probable next word is “</s>”
P(I | Henry) 1/5
P(am | Henry) 0
P(Henry | Henry) 0
P( like | Henry) 1/5
P(college| Henry) 0
P(do | Henry) 0

iii) Using trigram model, predict the most probable next word for the sequence,
<s> Do I like _______
Note: p(I like)=3
Next word Probability of next word
P(</s> | I like) 0/3

P(I | I like) 0/3
P(am | I like) 0/3
P(Henry | I like) 1/3
P( like | I like) 0/3
P(college| I like) 2/3 -> college is more probable
P(do | I like) 0/3

iv) Using 4-gram model, predict the most probable next word for the sequence,
<s> Do I like college _______
Note: p(I like college)=2
Next word Probability of next word
P(</s> | I like college) 2/2 -> </s> is more probable
P(I | I like college) 0/2
P(am | I like college) 0/2
P(Henry | I like college) 0/2
P( like | I like college) 0/2
P(college| I like college) 0/2
P(do | I like college) 0/2

v) Using bigram model and the corpus mentioned above, predict the most probable
sentence out of the following two.
a) <s> I like college </s>
b) <s> do I like Henry</s>

a) <s> I like college </s>


= p(I | <s>) * p(like | I) * p(college | like) * p( </s> | college)
= 3/7 * 3/6 * 3/5 * 3/3=9/70 = 0.13
b) <s> do I like Henry</s>
= p(do | <s>) * p(I | do) * p(like | I) * p( Henry | like) * p(</s> | Henry)
=3/7 * 2/4 *3/6*2/5*3/5=9/350 = 0.0257
From the above two values, it is inferred that <s> I like college </s> is more probable to occur.
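A small Python sketch (not part of the original notes; the corpus is re-typed with explicit spaces around the <s> and </s> markers, and everything is lower-cased so that "Do" and "do" are counted together, as in the hand calculation) that reproduces the two probabilities above:

from collections import Counter

corpus = [
    "<s> I am Henry </s>", "<s> I like college </s>", "<s> Do Henry like college </s>",
    "<s> Henry I am </s>", "<s> Do I like Henry </s>", "<s> Do I like college </s>",
    "<s> I do like Henry </s>",
]

tokens = [s.lower().split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

def p_bigram(w, prev):
    # MLE estimate: count(prev w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(sentence):
    words = sentence.lower().split()
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)
    return p

print(round(sentence_prob("<s> I like college </s>"), 4))   # ~0.1286 (9/70)
print(round(sentence_prob("<s> do I like Henry </s>"), 4))  # ~0.0257 (9/350)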
Underflow Problem:
Since probabilities are (by definition) less than or equal to 1, the more probabilities we multiply
together, the smaller the product becomes. Multiplying enough n-grams together would result
in numerical underflow. By using log probabilities instead of raw probabilities, we get numbers
that are not as small. Adding in log space is equivalent to multiplying in linear space, so we
combine log probabilities by adding them. The result of doing all computation and storage in
log space is that we only need to convert back into probabilities if we need to report them at
the end; then we can just take the exp of the logprob:
p1 * p2* p3 * p4 = exp(log p1+log p2+log p3+log p4)

Eg., in the calculation above (b), the result is 0.0257 and if we keep multiplying the
probabilities of few more bigrams, the result will become smaller and smaller leading to
underflow. To avoid that, the same two calculations can be done as follows.
a) <s> I like college </s>
= p(I | <s>) * p(like | I) * p(college | like) * p( </s> | college)
= 3/7 * 3/6 * 3/5 * 3/3
log p = log(3/7) + log(3/6) + log(3/5) + log(3/3) = -2.0513 (using natural logarithms)

b) <s> do I like Henry</s>
= p(do | <s>) * p(I | do) * p(like | I) * p( Henry | like) * p(</s> | Henry)
= 3/7 * 2/4 * 3/6 * 2/5 * 3/5
log p = log(3/7) + log(2/4) + log(3/6) + log(2/5) + log(3/5) = -3.6607

Even with logarithmic calculations, the first sentence is found to be more probable.
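A tiny sketch of the same calculation in log space (the probability list is copied from calculation (b) above):

import math

# Bigram probabilities for "<s> do I like Henry </s>" from the calculation above.
probs = [3/7, 2/4, 3/6, 2/5, 3/5]

print(math.prod(probs))                  # 0.02571... (plain product; risks underflow for long sentences)
log_p = sum(math.log(p) for p in probs)
print(log_p)                             # -3.6607 (sums of logs stay in a comfortable numeric range)
print(math.exp(log_p))                   # convert back to a probability only when it must be reported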
Zero-Probability problem:
Let us assume that we have to calculate the probability of the word sequence
“<s> like college </s>”
= p(like | <s>) * p(college | like) * p( </s> | college)
= 0/7*3/5*3/3 = 0
The probability is evaluated as 0. Although “like college” is present three times in the corpus,
“<s> like” does not occur even once, so the first term becomes 0 and the whole product is 0.
This is termed the zero-probability problem.

This problem can be solved using smoothing algorithms.


I) Laplace or add-one smoothing:
Instead of calculating the probability as p = C/N, where C is the count of the bigram and
N is the count of the preceding word, add 1 to the numerator to make it non-zero and
add the total number of unique word types, V, to the denominator. [Adding V to the
denominator keeps the smoothed probabilities properly normalized, so they still sum to 1.]

  p = (C + 1) / (N + V)
Add-one smoothing for a unigram model (assume a tiny corpus with N = 20 tokens and V = 4 types):

Word     Freq   Unsmoothed p   Count + 1   Smoothed p = (c+1)/(N+V)
Eat      10     10/20          11          11/24 = 0.46
citrus   4      4/20           5           5/24 = 0.21
Fruits   6      6/20           7           7/24 = 0.29
daily    0      0/20           1           1/24 = 0.04

Add-one smoothing for the bigram model (using the seven-sentence corpus above):
Excluding </s>, which never occurs as the first word of a bigram, the number of unique
word types in the aforementioned corpus is 7, so V = 7.

P(<s> like college </s>)
= p(like | <s>) * p(college | like) * p( </s> | college)
= (0+1)/(7+7) * (3+1)/(5+7) * (3+1)/(3+7)
= 1/14 * 4/12 * 4/10 = 0.0095

Without add-one smoothing:
<s> do I like Henry</s>
= p(do | <s>) * p(I | do) * p(like | I) * p( Henry | like) * p(</s> | Henry)
= 3/7 * 2/4 * 3/6 * 2/5 * 3/5 = 9/350 = 0.0257

With add-one smoothing:
<s> do I like Henry</s>
= (3+1)/(7+7) * (2+1)/(4+7) * (3+1)/(6+7) * (2+1)/(5+7) * (3+1)/(5+7)
= 4/14 * 3/11 * 4/13 * 3/12 * 4/12
= 0.0020

Observe the sharp change in the probability after applying add-one smoothing:
it has changed from 0.0257 to 0.0020.

Drawbacks of add-one smoothing:


- Leads to sharp changes in the probability
- Too much probability mass goes to unseen events

II) Good turing algorithm


Uses the count of things we have seen once to help estimate the count of things we have
never seen.
Let Nc denote the count of the things which occurred with frequency ‘c’ (ie., count of
things we have seen ‘c’ times)
N1 -> Number of words which occurred once in the corpus
N3 -> Number of words which occurred thrice in the corpus
Eg., consider the tiny corpus “to be or not to be phrase”:
c(to) = 2, c(be) = 2, c(or) = 1, c(not) = 1, c(phrase) = 1
Therefore N1 = 3 and N2 = 2, because there are 3 words that occur only once and 2 words
occurring twice.

Now assume the following word counts in a corpus:
c(cricket) = 10, c(hockey) = 3, c(chess) = 2, c(badminton) = 1, c(tennis) = 1, c(volleyball) = 1
Total count of words N = 18, so P(tennis) = 1/18 and P(basketball) = 0/18 = 0.
As “basketball” is not present in the corpus, the probability mass of the words which
occurred once will be used to estimate the probability of the word “basketball”.

PGT(unseen words) = N1 / N
PGT(existing words) = c* / N, where c* = (c + 1) Nc+1 / Nc

PMLE(basketball) = 0/18 = 0, but PGT(basketball) = N1/N = 3/18
(N1 = 3, as there are 3 words with frequency 1).

PMLE(tennis) = 1/18, with c(tennis) = 1.
c*(tennis) = (c + 1) Nc+1 / Nc = 2 N2 / N1 = (2 × 1)/3 = 2/3
PGT(tennis) = (2/3)/18 = 2/54 = 1/27

Therefore, the probability of the existing word “tennis” is discounted from 1/18 to 1/27, and
that extra probability mass is used to account for the unseen words.
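A short sketch of this Good-Turing estimate for the sports-word example (it ignores the usual complications for the highest counts, where Nc+1 = 0):

from collections import Counter

counts = {"cricket": 10, "hockey": 3, "chess": 2, "badminton": 1, "tennis": 1, "volleyball": 1}
N = sum(counts.values())                 # 18
Nc = Counter(counts.values())            # Nc[c] = number of word types seen exactly c times

def p_good_turing(word):
    if word not in counts:               # unseen word: use the mass of once-seen words
        return Nc[1] / N
    c = counts[word]
    c_star = (c + 1) * Nc[c + 1] / Nc[c] # discounted count; naive for the largest c, where Nc[c+1] = 0
    return c_star / N

print(p_good_turing("basketball"))       # 3/18
print(p_good_turing("tennis"))           # (2*1/3)/18 = 1/27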

III) Witten Bell algorithm


This algorithm also models the probability of the unseen word by the probability of the word
occurring once.
For a bigram model, the bigram probability is interpolated as

  P(wi | wi-1) = λ2 PMLE(wi | wi-1) + (1 − λ2) P(wi)

where λ2 refers to the λ for the bigram model.

How to choose the λ value?

For n-grams that already exist in the training data, a large λ value is better; for unseen
n-grams, a small λ is better. So, an optimum λ value is chosen in the Witten-Bell method as follows:

  λ(wi-1) = 1 − u(wi-1) / (u(wi-1) + c(wi-1))

where u(wi-1) denotes the number of unique word types that appear after wi-1, and c(wi-1) is the
count of wi-1.

Eg., let c(Delhi) = 3 and u(Delhi) = 2:
  λ(Delhi) = 1 − 2/(2+3) = 3/5 = 0.6
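A one-line helper reproducing the Delhi example (the function name is hypothetical):

def witten_bell_lambda(c, u):
    # c = count of the context word, u = number of unique word types seen after it
    return 1 - u / (u + c)

print(witten_bell_lambda(c=3, u=2))   # 0.6, as in the Delhi example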

Evaluating Language Models

Perplexity (PP) is an intrinsic evaluation measure for language models.
A language model is best if it gives the highest probability to (ie., best predicts) an unseen test set.
Perplexity is the inverse probability of the test data, normalized by the number of words:

  PP(W) = P(w1 w2 … wn)^(−1/n)

The lower the perplexity, the better the model; the higher the perplexity, the more confused the
model is when predicting.

Perplexity calculation for bigram and trigram models based on the sentence,
<s> I like college </s>:
i) Bigram model:
= p(I | <s>) * p(like | I) * p(college | like) * p( </s> | college)
= 3/7 * 3/6 * 3/5 * 3/3 = 9/70 = 0.13
PP(W) = (1/0.13)^(1/4) = 1.67

ii) Trigram model:
= p(like | <s> I) * p(college | I like) * p( </s> | like college)
= 1/3 * 2/3 * 3/3 = 2/9 = 0.22
PP(W) = (1/0.22)^(1/3) = 1.66 (since this value is smaller, the trigram model is better for
this example).
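A small sketch reproducing these perplexity values from the component probabilities:

def perplexity(probs):
    # PP(W) = (p1 * p2 * ... * pn) ** (-1/n), where n is the number of predictions made
    p = 1.0
    for x in probs:
        p *= x
    return p ** (-1 / len(probs))

print(perplexity([3/7, 3/6, 3/5, 3/3]))   # bigram model: ~1.67
print(perplexity([1/3, 2/3, 3/3]))        # trigram model: ~1.65 (the notes round this to 1.66)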

****************************

