Syntax Analysis
Ambiguity in syntactic analysis refers to situations where a sentence or phrase can be interpreted in
more than one way due to its structure. This ambiguity can arise from the way words are grouped and
related to each other in a sentence, leading to multiple possible syntactic trees or parses. Ambiguity in
syntactic analysis is a significant challenge in natural language processing (NLP) and computational
linguistics.
● Top-down parsing starts with the S symbol and tries to rewrite it into the sentence.
● Bottom-up parsing starts with the words and tries to find symbols that generate them.
SHALLOW PARSING
Parsing
Parsing is the process of analyzing sentences syntactically. It breaks down a sentence into its
constituent parts (like nouns, verbs, and phrases).
Without proper parsing, NLP systems would struggle to understand the grammatical structure of
language.
Parsing is usually an intermediate stage
– Used to uncover the structures that are used by later stages of processing
Full parsing is sufficient for many NLP tasks, but it is not a necessary intermediate stage for them:
parsing often provides more information than we actually need.
Shallow Parsing
Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the
constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor
their role in the main sentence.
In shallow parsing, there is at most one level between the root and the leaves, while deep parsing
comprises more than one level. Shallow parsing is also called light parsing or chunking.
Chunking is the process of grouping related words together based on their grammatical nature.
Chunking (aka shallow parsing) analyses a sentence to identify its constituents (noun groups,
verbs, verb groups, etc.). However, it does not specify their internal structure, nor their role in the
main sentence.
"The smaller boxes show the word-level tokenization and part-of-speech tagging, while the large
boxes show higher-level chunking. Each of these larger boxes is called a chunk. Like tokenization,
which omits whitespace, chunking usually selects a subset of the tokens. Also like tokenization, the
pieces produced by a chunker do not overlap in the source text."
Light Parsing
Goal: To assign a partial structure to a sentence.
   ● Simpler solution space
   ● Local context
   ● Non-recursive
   ● Restricted (local) domain
Chunk Parsing
Goal: To divide a sentence into a sequence of chunks
Group together connected items or words (called chunks) so that they can be stored or processed as
single concepts.
    ● Chunks are non-overlapping regions of a text
        saw [a tall man] in [the park]
    ● Chunks are non-recursive
        – A chunk cannot contain other chunks
    ● Chunks are non-exhaustive
      – Not all words are included in the chunks
Examples:
   ● Noun-phrase chunking:
      [I] saw [a tall man] in [the park].
   ● Verb-phrase chunking:
      The man who [was in the park] [saw me].
   ● Prosodic chunking:
      [I saw] [a tall man] [in the park].
Chinking
   ● A chink is a subsequence of the text that is not a chunk.
   ● Define a regular expression that matches the sequences of tags in a chink.
         ○ A simple chink regexp for finding NP chunks:
                                       (<VB.?> | <IN>)+
   ● Chunk anything that is not a matching subsequence:
the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN

[the/DT little/JJ cat/NN]   sat/VBD on/IN   [the/DT mat/NN]
        chunk                   chink              chunk
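The chink rule above can be written directly as an NLTK chunk grammar. The following is a minimal
sketch (assuming NLTK is installed and the sentence has already been POS-tagged with Penn Treebank
tags):

import nltk

tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
          ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# Chunk everything first, then chink (cut out) sequences of verbs and prepositions,
# mirroring the (<VB.?>|<IN>)+ chink pattern above.
grammar = r"""
  NP:
    {<.*>+}
    }<VB.?|IN>+{
"""
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
# (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))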
Def: A CRF (Conditional Random Field) is a probabilistic graphical model used to model the conditional probability of a label
sequence given a sequence of input features. CRFs consider the dependencies between labels in a
sequence, which helps capture the context and relationships more effectively.
What is CRF?
  ● CRF is intended for task-specific prediction: given an input feature vector sequence X, it
        predicts labels y drawn from a predefined set.
  ● CRF is a probabilistic discriminative model that has a wide range of applications in Natural
        Language Processing, Computer Vision and Bioinformatics.
  ● The conditional random field is used for predicting the sequences that use the contextual
        information to add information which will be used by the model to make correct predictions.
Discriminative model makes predictions based on conditional probability and is either used for
classification or regression.
Overview
   ● Generative models
         ○ HMM
   ● Discriminative model
         ○ Conditional models
                 ■ Maximum entropy Markov models
                 ■ Conditional random fields – Predict the seq. (PoS tag)
   ● Experiments
Generative Models
   ● HMMs and stochastic grammars
   ● Assign a joint probability to paired observation and label sequences
   ● Parameters are trained to maximize joint likelihood of training examples
Conditional models
   ● Specify probabilities of label sequences given an observation sequence
   ● Does not expend modelling effort on the observations which are fixed at test time
   ● Conditional probability can be dependent on arbitrary, non-independent features of the
       observation sequence
Example: MEMMs
   ● Maximum entropy Markov models
   ● Each source state has an exponential model that takes the observation feature as input and
      outputs a distribution over possible next states
   ● Weakness: the label bias problem (the transition probabilities for leaving a given state are
       normalized over that state alone).
   ● The big difference between MEMM and CRF is that MEMM is locally renormalized and
      suffers from the label bias problem, while CRFs are globally re-normalized. So, CRFs avoid
      the label bias problem
    ●   Start with fully-connected model and let training procedure figure out a good structure
            ○ Precludes use of prior structure knowledge
    ●   CRFs use a globally normalized approach. The key difference is that CRFs compute all state
        transitions jointly in a single model.
    ●   The weights of different features in different states compete against each other, ensuring a
        more balanced distribution.
    ●   CRFs consider the entire sequence of labels, not just local context, which mitigates label bias.
CRF - Formulation
CRF Model: The conditional probability of the tag sequence y given the word sequence x is given by:
        P(y | x) = (1 / Z(x)) · exp( Σ_t Σ_k λ_k · f_k(y_{t−1}, y_t, x, t) )
where the f_k are feature functions, the λ_k are their weights, and Z(x) is a normalization factor that
sums over all possible tag sequences.
Training:
         Objective: Train the CRF model to maximize the likelihood of the training data. This
         involves optimizing the weights λ_k to best fit the observed data.
Testing:
         Given a sequence of words x, find the most likely sequence of tags y.
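As a concrete illustration, the sketch below trains a tiny CRF tagger with the sklearn-crfsuite package
(an assumed third-party toolkit; the feature set and toy data are made up for illustration, and any CRF
library with per-token feature dictionaries would work the same way):

import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "suffix3": word[-3:],          # orthographic feature, e.g. -ing, -ogy, -ed
        "is_capitalized": word[0].isupper(),
        "prev_word": sent[i - 1].lower() if i > 0 else "<s>",
        "next_word": sent[i + 1].lower() if i < len(sent) - 1 else "</s>",
    }

# Tiny toy training set: X is a list of sentences (lists of feature dicts),
# y the corresponding list of tag sequences.
sents = [["The", "cat", "sat"], ["She", "writes", "letters"]]
tags  = [["DT", "NN", "VBD"],   ["PRP", "VBZ", "NNS"]]
X = [[token_features(s, i) for i in range(len(s))] for s in sents]
y = tags

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)          # maximizes the (regularized) conditional likelihood
print(crf.predict(X))  # most likely tag sequence for each sentence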
Problem Description
Given x (observations), find y (predictions)
For example,
    ●   The relational connection occurs in many applications, NLP, Computer Vision, Signal
        Processing, ….
Discriminative: Directly estimate the posterior probability; Aim at modeling the “discrimination”
Both generative models and discriminative models describe distributions over (y , x), but they work in
different directions.
Generative Vs Discriminative
CRFs vs. MEMM vs. HMM
   ●   Corresponding parameters λ and μ similar to the (logarithms of the) HMM parameters p(y’|y)
       and p(x|y)
Experiment :
Part-of-speech Tagging
   ● Each word to be labeled with one of 45 syntactic tags.
   ● 50%-50% train-test split
   ● out-of-vocabulary (oov) words: not observed in the training set
   ● Second set of experiments: add small set of orthographic features (whether word is
        capitalized, whether word ends in –ing, -ogy, -ed, -s, -ly …)
   ● Overall error rate reduced by 25% and oov error reduced by around 50%
   ● Usually start training with zero parameter vector (corresponds to uniform distribution)
   ● Use optimal MEMM parameter vector as starting point for training corresponding CRF
   ● MEMM+ trained to convergence in around 100 iterations; CRF+ took additional 1,000
        iterations
   ● When starting from uniform distribution, CRF+ had not converged after 2,000 iterations
Further Aspects of CRFs
   ● Automatic feature selection
          ○ Start from feature-generating rules and evaluate the benefit of the generated features
              automatically on data
Conclusions
   ● CRFs do not suffer from the label bias problem!
   ● Parameter estimation guaranteed to find the global optimum
   ● Limitation: slow convergence of the training algorithm
Dependency Parsing
   ●   Dependency parsing is a broader term for analysing a sentence's grammatical structure by
       establishing relationships between words. In a dependency parse, each word in the sentence is
       linked to other words, indicating how they relate to each other in terms of syntax and
       grammar.
   ●   Purpose: To determine the syntactic structure of a sentence by identifying dependencies
       between words.
   ●   Output: A dependency tree or graph where nodes represent words and edges represent
       grammatical relations (e.g., subject, object, modifier).
   ●   Types: There are various methods for dependency parsing, including rule-based, statistical,
       and neural network approaches.
Dependency relations
Each dependency relationship is labelled with a specific tag that describes the nature of the
relationship. These labels are essential for understanding the syntactic structure of sentences
and can be accessed through the token.dep_ attribute in spaCy.
Dependency structure
   ●   Consists of relations between lexical items, normally binary, asymmetric relations (“arrows”)
       called dependencies
   ●   The arrows are commonly typed with the name of grammatical relations (subject,
       prepositional object, apposition, etc)
   ●   The arrow connects a head (governor) and a dependent (modifier)
   ●   Usually, dependencies form a tree (single-head, connected, acyclic)
    ●   Explanation of Dependencies
    ●   The (DT) -> cat (NN): "The" is a determiner (det) for "cat".
    ●   cat (NN) -> sat (VBD): "cat" is the subject (nsubj) of "sat".
    ●   sat (VBD) -> sat (VBD): "sat" is the root of the sentence (ROOT).
    ●   on (IN) -> sat (VBD): "on" is a preposition (prep) modifying "sat".
    ●   the (DT) -> mat (NN): "The" is a determiner (det) for "mat".
    ●   mat (NN) -> on (IN): "mat" is the object of the preposition "on" (pobj).
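The dependencies above can be reproduced with spaCy; a minimal sketch (assuming spaCy and its small
English model en_core_web_sm are installed, e.g. via `python -m spacy download en_core_web_sm`):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")
for token in doc:
    # token.dep_ is the dependency label; token.head is the governing word
    print(f"{token.text:<5} {token.dep_:<6} -> {token.head.text}")
# Expected relations: The->cat (det), cat->sat (nsubj), sat (ROOT),
# on->sat (prep), the->mat (det), mat->on (pobj)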
In dependency parsing, the verb is often the root of a sentence because it serves as the central
element that anchors the sentence's structure. Here’s why the verb typically plays this central role:
    1. Central to Sentence Structure
            ● Core Predicate: The verb functions as the core predicate of the sentence. It describes
                the main action or state that the sentence is about. The other elements in the
                sentence—such as subjects, objects, and complements—are organized around the
                verb, making it the focal point of the sentence's meaning.
    2. Syntactic Dependency
            ● Head of Clauses In traditional syntactic theory, sentences are structured around a
                head, and the verb is usually this head. All other elements in the sentence (like the
                subject or object) depend on the verb. This makes the verb the "root" because it is the
                central element upon which the grammatical structure of the sentence hinges.
    3. Grammatical Relationships
            ● Subject-Verb Agreement: The verb often determines the agreement in number and
                person with the subject. This relationship is crucial for understanding who is
                performing the action and who is affected by it.
            ● Object and Complement Relationships: The verb also governs how objects and
                complements are connected to the main action or state described by the sentence.
    4. Dependency Tree Construction
            ● Root Node: In a dependency tree, the root node is the head of the main clause. For
                simple sentences, this is usually the main verb. For example, in the sentence "She eats
                an apple," "eats" is the verb and the root because it connects the subject ("She") and
                the object ("an apple") through its dependencies.
Consider the sentence "She writes a letter":
   1. Sentence Analysis:
       "She": Subject (nsubj) of the verb.
       "writes": Main verb (ROOT) of the sentence.
       "a letter": Direct object (dobj) of the verb.
   2. Dependency Parsing:
       "writes" is the ROOT because it is the central action or state around which the sentence is
       built.
       "She" (subject) depends on "writes" because it describes who is performing the action.
       "a letter" (object) depends on "writes" because it describes what is being written.
Visual Representation
    ● In a dependency tree, the verb "writes" would be the root node with branches extending to the
        subject
        "She" and the object "a letter":
             writes
             /    \
          She    letter
                   |
                   a
    ● This tree structure illustrates that "writes" is central, and all other elements are related to it.
Special Cases -
   ● Complex Sentences: In complex sentences with multiple clauses, each clause might have its
        own root verb. For instance, in "She writes a letter because she wants to inform him," "writes"
        is the root of the main clause, while "wants" is the root of the subordinate clause.
   ● Imperative Sentences: For imperative sentences (e.g., "Close the door"), the verb is still the
        root even though the subject may be implied (often "you").
   ● In summary, the verb is typically the root in dependency parsing because it is the main action
        or state around which the sentence is organized. It serves as the focal point of the sentence’s
        structure, and all other elements are linked to it through various grammatical dependencies.
Transition-based parsing
   ● Transition-based dependency parsing is a popular approach for dependency parsing that uses
        a sequence of transitions to build a dependency parse tree incrementally. It is particularly
        known for its efficiency and effectiveness in parsing sentences.
   ● Transition-based parsing involves processing a sentence word-by-word and applying a
        sequence of actions (transitions) to construct the dependency tree. The key idea is to
        transform a data structure (usually a stack and a buffer) step-by-step until the entire sentence
        is parsed.
Key Components
   ● Data Structures:
          ○ Stack: Holds partially parsed words or constituents.
          ○ Buffer: Contains the remaining words yet to be processed.
          ○ Parse Tree: The final structure that will be built by applying transitions.
   ● Transitions: Transition-based parsers use a predefined set of actions to modify the stack,
      buffer, and dependency structure. Common transitions include:
          ○ Shift: Move the next word from the buffer to the stack.
           ○ Left Arc (or Left Attachment): Create a dependency from the top word on the stack to
                the word below it (the dependent), then remove that lower word from the stack.
           ○ Right Arc (or Right Attachment): Create a dependency from the word below the top
                word on the stack to the top word, then remove the top word from the stack. (Both
                operations are illustrated in the sketch after this list.)
    ●   Transition Sequence: The goal is to find a sequence of transitions that correctly builds the
        dependency structure for the entire sentence. This is typically formulated as a search problem,
        where the parser explores possible sequences of transitions to achieve the desired structure.
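The following toy sketch steps a stack and buffer through a hand-written (oracle) transition sequence
for "She writes a letter", using the arc-standard conventions described above; in a real parser each
next action would be predicted by a trained classifier:

# Toy arc-standard transition-based parse of "She writes a letter".
def parse(words, actions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC":      # head = top of stack, dependent = word below it
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT_ARC":     # head = word below the top, dependent = top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs

actions = ["SHIFT", "SHIFT", "LEFT_ARC",   # She <- writes (nsubj)
           "SHIFT", "SHIFT", "LEFT_ARC",   # a <- letter (det)
           "RIGHT_ARC",                    # writes -> letter (dobj)
           "RIGHT_ARC"]                    # ROOT -> writes
print(parse(["She", "writes", "a", "letter"], actions))
# [('writes', 'She'), ('letter', 'a'), ('writes', 'letter'), ('ROOT', 'writes')]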
Projectivity
• Arc from head to dependent is projective
         • If there is a path from head to every word between head and dependent
• Dependency tree is projective
         • If all arcs are projective
         • Or equivalently, if it can be drawn with no crossing edges
• Projective trees make computation easier
• But most theoretical frameworks do not assume projectivity
• Need to capture long-distance dependencies, free word order
Advantages
• Efficiency: Transition-based parsers are typically efficient and can handle large datasets due to their
linear time complexity in practice.
• Simplicity: The transition-based approach is conceptually simple and easy to implement.
• Flexibility: Can be adapted to different types of dependency grammars and can be enhanced with
various features (e.g., part-of-speech tags, word embeddings) to improve performance.
Semantics
    ●   Contextual Meaning
    ●   Inferred Relationships
    ●   Causality (causal relations between words)
    ●   Granularity - Breaking down an event into smaller parts
Did you say you were looking for mixed nuts? (context can be food or hardware)
People use context to derive the correct meaning.
Definition
Semantic processing deals with rewriting a parse tree into a “meaning representation”
(e.g., logic, SQL, a knowledge base).
Semantics is poorly understood compared to syntax: applications that need complex semantics, like
database front ends or high-quality Machine Translation, had limited success in the past.
How can we represent the meaning of an English
sentence?
Programming languages: “meaning” is equivalent machine code
a = b +c means
Load a,b,c
add b,c
store a
Meanings of words
    1. Lexical Semantics
The meanings of individual words
    2. Formal Semantics (or Compositional Semantics or Sentential Semantics)
How those meanings combine to make meanings for
individual sentences or utterances (manner of speaking; power of speaking)
    3. Discourse or Pragmatics
How those meanings combine with each other and with other facts about various kinds of context to
make meanings for a text or discourse
Dialog or Conversation is often lumped together with Discourse
Homonymy
• Homonymy:
– Lexemes that share a form
• Phonological, orthographic or both
– But have unrelated, distinct meanings
– Can be homophones, homographs, or both:
• Homophones:
– Write and right
– Piece and peace
Polysemy
• Polysemy is the property of a word having several meanings.
• The bank is constructed from red brick
I withdrew the money from the bank
• Are those the same sense?
• He left the bank five minutes ago.
• He caught a fish at the bank.
• A single lexeme with multiple related meanings
(bank the building, bank the financial institution)
• Most non-rare words have multiple meanings
– The number of meanings is related to its frequency
– Verbs tend more to polysemy
– Distinguishing polysemy from homonymy isn’t always
easy (or necessary)
Metaphor and Metonymy
• Metaphor and metonymy are similar in various aspects but the major difference is that if a metaphor
substitutes a concept with another, a metonymy selects a related term. So, if metaphor is for
substitution, metonymy is for association.
• For example, the sentence ‘he has a heart of stone’ is a metaphor. Here the word heart is used in
substitution for displaying unyielding or unfeeling character of the person.
• The sentence ‘the tiger called his employees to the meeting room’ is a metonymy. Here there is no
substitution; instead the person is associated with a tiger for his nature.
• Metaphor and Metonymy are specific Types of
Polysemy.
• Metaphor: A metaphor is a figure of speech containing an implied comparison. With metaphors,
words or phrases that are ordinarily applied to one thing are applied to something you wouldn't
necessarily pair them with. Here's a metaphor example:
“The curtain of night fell upon us.”
• Her eyes were like diamonds
• Metonymy (Metonymy is a figure of speech that replaces the name of a thing with the name of
something else with which it is closely associated).
A famous example is, "The pen is mightier than the sword," from Edward Bulwer Lytton's play
Richelieu.
This sentence has two metonyms:
"Pen" stands for "the written word."
"Sword" stands for "military aggression."
Synonyms
• Words that have the same meaning in some or all contexts.
– Look / see
– Happy / Joyful
– big / large
– automobile / car
– Beautiful / Pretty/Attractive
– Water / H2O
• Two lexemes are synonyms if they can be successfully substituted
for each other in all situations
– If so they have the same propositional meaning
• But there are few (or no) examples of
perfect synonymy.
– Why should that be?
– Even if many aspects of meaning are identical
– Still may not preserve the acceptability based
on notions of politeness, slang, register,
genre, etc.
• Example:
– Water and H2O
Synonymy
• Principle of substitutability
• How big is this plane?
• Would I be flying on a large or small plane?
• Miss Nelson, for instance, became a kind of big sister to Mrs. Van Tassel’s son, Benjamin.
• ?? Miss Nelson, for instance, became a kind of large sister to Mrs. Van Tassel’s son, Benjamin.
Antonyms
• Senses that are opposites with respect to
one feature of their meaning
• Otherwise, they are very similar!
– dark / light
– short / long
– hot / cold
– up / down
– in / out
• More formally: antonyms can
– define a binary opposition or be at opposite ends of a scale
Hyponymy
• A term that denotes a subcategory of a more general class:
“Chair” and “table” are hyponyms of “furniture.”
• One sense is a hyponym of another if the first sense is
more specific, denoting a subclass of the other
– car is a hyponym of vehicle
– dog is a hyponym of animal
– mango is a hyponym of fruit
• Conversely
– vehicle is a hypernym/superordinate of car
– animal is a hypernym of dog
– fruit is a hypernym of mango
superordinate:  vehicle   fruit   furniture   mammal
hyponym:        car       mango   chair       dog
Semantic ambiguity
• A sentence may have a single syntactic structure, but
multiple semantic structures
• “The park is near the bank.”
• Ambiguous between a financial institution and the side of a river.
• “He heard her voice through the door.”
• Unclear whether “through the door” refers to hearing from inside or outside.
Semantics
Mary untangled a library.
• What’s wrong with this sentence?
• It is syntactically correct, but semantically weird
• Since Untangled stands for free from a tangled
or twisted state.
Pragmatics
• The context disambiguates the
meaning.
• This is a complicated cognitive feat!
Mary was hired as the new library director.
She had a hard job because the library was a mess.
Mary untangled a library.
Discourse
• The prior discourse provides references for
the pronouns.
• In general, temporal sequences require
discourse knowledge to be understood.
Mary was hired as the new library director.
Mary had a hard job because the library was a mess.
She untangled it.
World Knowledge
• World knowledge allows you to understand what this conversation is about.
• It includes knowledge about jobs, office politics, the structure of argument, and more.
Speaker A:
Mary was hired as the new library director.
She had a hard job because the library was a mess.
Mary untangled a library.
Speaker B:
It doesn’t matter.
She’s still fired.
Semantic Analysis
To Understand the meaning of a sentence, there are three types of semantic analysis that we can use
1. We can determine the Semantic Feature of key words in the
sentence
2. We can analyze the Lexical Relation of these words in the
sentence
   3. We can analyze the Semantic Roles (Thematic Roles / Theta
Roles) of key words in the sentence (i.e. the roles they fulfill in
the situation described by the sentence).
Word Sense Disambiguation (WSD) is a crucial task in natural language processing (NLP) that
involves determining which meaning of a word is being used in a particular context. Given that
many words have multiple meanings (polysemy) or can refer to different concepts (homonymy), WSD
aims to resolve this ambiguity to improve understanding and processing of text.
Applications:
WSD is critical for various NLP applications, including:
• Information Retrieval: Improving search engine accuracy by
understanding user queries.
• Machine Translation: Ensuring correct translation of words based
on their intended meanings.
• Text Analysis: Enhancing sentiment analysis and summarization
tasks by accurately interpreting text.
For example, in “I deposited money in the bank” and “He sat on the bank of the river”, the word bank
in the first sentence refers to a commercial (finance) bank, while in the second sentence it refers to
the river bank. The ambiguity that arises from this is tough for a machine to detect and resolve.
Detecting the ambiguity is the first issue; resolving it and displaying the correct output is the second.
Uses:
* Word sense disambiguation
* Information retrieval
* Automatic text classification
* Automatic text summarization
* Machine translation
* Automatic crossword puzzle generation
* Improving search engine results
• Semantic Role Labeling is the process that assigns labels to words or phrases in a sentence that
indicates their semantic role in the sentence, such as that of an agent, goal.
• It is the task of automatically finding the thematic roles for each predicate in a sentence.
• It answers the who did what to whom, when, where, why, how and so on. Finding these relations is
preliminary to question answering and information extraction.
• For each clause, determine the semantic role played by each noun phrase that is an argument to the
verb.
agent, patient, source, destination, instrument:
– John (agent) drove Mary (patient) from Austin (source) to Dallas (destination) in his Toyota Prius (instrument).
– The hammer (instrument) broke the window (patient).
• Also referred to as “case role analysis,” “thematic analysis,” and “shallow semantic parsing”
Semantic Roles
• Origins in the linguistic notion of case (Fillmore,1968)
• A variety of semantic role labels have been
proposed, common ones are:
– Agent: Actor of an action
– Patient: Entity affected by the action
– Instrument: Tool used in performing action.
– Beneficiary: Entity for whom action is performed
– Source: Origin of the affected entity
– Destination: Destination of the affected entity
Selectional Restrictions
• Selectional restrictions are constraints that certain verbs place on the filler of certain semantic roles.
• Sub-categorize the thematic roles:
– Agents should be animate
– Beneficiaries should be animate
– Instruments should be tools
– Patients of “eat” should be edible
– Sources and Destinations of “go” should be places.
– Sources and Destinations of “give” should be animate.
• Taxonomic abstraction hierarchies or ontologies (e.g.
hypernym links in WordNet) can be used to determine if
such constraints are met.
– “John” is a “Human” which is a “Mammal” which is a
“Vertebrate” which is an “Animate”
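One way to test such constraints programmatically is to follow WordNet hypernym links through
NLTK; a minimal sketch (assuming nltk and its WordNet data are installed, and using crude substring
matching on synset names purely for illustration):

from nltk.corpus import wordnet as wn

def has_hypernym(word, target):
    """True if any noun sense of `word` has `target` somewhere on a hypernym path."""
    for synset in wn.synsets(word, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            if any(target in s.name() for s in path):
                return True
    return False

print(has_hypernym("dog", "animal"))    # True: dog -> canine -> ... -> animal
print(has_hypernym("hammer", "tool"))   # True: a hammer is a kind of tool
print(has_hypernym("idea", "animal"))   # False: an idea is not animate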
SRL Datasets
• FrameNet:
– Developed at Univ. of California at Berkeley
– Based on notion of Frames
• PropBank:
– Developed at Univ. of Pennsylvania
– Based on elaborating their Treebank
• Salsa:
– Developed at Universität des Saarlandes
– German version of FrameNet
FrameNet
• Project at UC Berkeley led by Chuck Fillmore for developing a
database of frames, general semantic concepts with an associated set of
roles.
• Roles are specific to frames, which are “invoked” by multiple words,
both verbs and nouns.
– JUDGEMENT frame
• Invoked by: V: blame, praise, admire; N: fault, admiration
• Roles: JUDGE, EVALUEE, and REASON
• Specific frames chosen, and then sentences that employed these frames
selected from the British National Corpus and annotated by linguists
for semantic roles.
• Initial version: 67 frames, 1,462 target words,
49,013 sentences, 99,232 role fillers
• Frames are an artificial intelligence data structure used to divide
knowledge into substructures by representing "stereotyped situations".
FrameNet Results
• Gildea and Jurafsky (2002) performed SRL experiments with initial FrameNet data.
• Assumed correct frames were identified and the task was to fill their roles.
• Automatically produced syntactic analyses using Collins (1997) statistical parser.
• Used simple Bayesian method with smoothing to classify parse nodes.
• Achieved 80.4% correct role assignment.
Increased to 82.1% when frame-specific roles
were collapsed to 16 general thematic categories.
PropBank
• Project at U Penn led by Martha Palmer to add
semantic roles to the Penn treebank.
• Roles (Arg0 to ArgN) specific to each individual
verb to avoid having to agree on a universal set.
– Arg0 basically “agent”
– Arg1 basically “patient”
• Annotated over 1M words of Wall Street Journal
text with existing gold-standard parse trees.
• Statistics:
– 43,594 sentences; 99,265 propositions (verbs + roles)
– 3,324 unique verbs; 262,281 role assignments
CoNLL SRL Shared Task
• CoNLL (the Conference on Computational Natural Language Learning) is the annual meeting of
SIGNLL (the Special Interest Group on Natural Language Learning) of the ACL.
• Each year, CoNLL has a “Shared Task” competition.
• PropBank semantic role labeling was used as the Shared Task for CoNLL-04 and CoNLL-05.
• In CoNLL-05, 19 teams participated.
Issues in SRL
• How to properly integrate syntactic parsing, WSD, and role assignment so they all aid each other.
• How can SRL be used to aid end-use applications:
– Question answering
– Machine Translation
– Text Mining
Word Embedding
   ●   Word embedding is a technique in natural language processing (NLP) that represents words as
       vectors in a continuous vector space. This allows for capturing semantic relationships and
       similarities between words based on their context in large text corpora.
   ●   Some word embedding models are Word2Vec (Google), GloVe (Stanford), and fastText
       (Facebook).
   ●   Word embedding is also called a distributed semantic model, a distributed representation, a
       semantic vector space, or a vector space model.
   ●   The similar words can be grouped together. For example, fruits like apples, mango, and
       banana should be placed close whereas books will be far away from these words.
   ●   In a broader sense, word embedding will create the vector of fruits which will be placed far
       away from the vector representation of books.
Applications
   ●   NLP Tasks: Used in sentiment analysis, machine translation, information retrieval, and more.
   ●   Transfer Learning: Pre-trained models can be fine-tuned for specific tasks, leveraging
       knowledge from vast corpora.
   ●   Word Embedding Limitations
           ○ Out-of-Vocabulary Words: Word2Vec does not handle words that were not present in
               the training data well.
           ○ Lack of Contextual Awareness: It generates a single vector for each word regardless
               of its context (e.g., "bank" in "river bank" vs. "financial bank").
Word2Vec
Word2Vec is a popular technique for creating word embeddings, developed by a team at Google led
by Tomas Mikolov. It represents words in a continuous vector space, allowing machines to understand
their meanings based on context. Here are the main components of Word2Vec:
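Before looking at the architectures, here is a minimal training sketch using the gensim library (an
assumption; the toy corpus is made up, and sg=0 selects the CBOW architecture while sg=1 selects
skip-gram):

from gensim.models import Word2Vec

sentences = [
    ["i", "like", "apples", "and", "mango"],
    ["banana", "and", "mango", "are", "fruits"],
    ["she", "reads", "books", "in", "the", "library"],
]

# sg=0 -> CBOW (predict the target word from its context); sg=1 -> skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

print(model.wv["mango"][:5])                   # first 5 dimensions of the word vector
print(model.wv.most_similar("mango", topn=3))  # nearest words in the embedding space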
Word2Vec - Architectures
CBOW - Example
Summary:
    ●   In CBOW, context words are chosen based on proximity to the target word, which is
        determined by the window size.
    ●   Including additional words like "queen" depends on expanding the context window.
    ●   By adjusting the context, you can influence the prediction, capturing more semantic
        relationships.
Skip-gram model - Ex
Threshold: You can set a threshold to consider only those context words whose probabilities exceed a
certain value.
Top-N Selection: Alternatively, you can select the top-N context words with the highest probabilities.
This is common when you expect multiple context words.
Stages in NLP
Objective: Understand meaning based on situational context, speaker intentions, and social factors.
Focus on implicature (implied meaning) and speech acts (e.g., requests, promises).
Analyze how language is used in different contexts.
Why It Comes After: Pragmatic analysis requires a higher level of understanding because it often
depends on resolving the structure and coherence of the discourse first. Once the text is analyzed for
structural and relational meaning (discourse), pragmatics can then interpret speaker intent and context
to understand deeper, implied meanings.
Interprets speaker intent and context to uncover deeper meanings.
Example: "Can you close the door?" interpreted as a request, not a literal question about ability.
Semantic Analysis
Key Concepts:
   Word Meaning (Lexical Semantics): The study of individual word meanings and how they relate to
one another (e.g., synonyms, antonyms).
   Sentence Meaning (Compositional Semantics): How the meaning of individual words combines to
form the meaning of larger units, such as phrases and sentences.
   Truth Conditions: Whether a sentence can be judged as true or false based on its semantic content.
   Ambiguity: How a word or sentence can have multiple meanings depending on its linguistic
structure (e.g., "bank" can mean a financial institution or the side of a river).
Pragmatic Analysis
• Pragmatic analysis examines how context influences the
interpretation of meaning in language use. It deals with the
relationship between language and its users.
• Context: A conversation between two people at a restaurant.
        Speaker A: "Can you pass the salt?"
        Speaker B: Passes the salt.
        Literal meaning: Speaker A is asking if Speaker B is physically capable of passing
        the salt (a yes/no question).
Discourse Analysis
• Discourse analysis looks at language use across longer stretches of text (written or spoken) to
examine how meaning is constructed in context. It emphasizes the structure and patterns of
communication in social and cultural contexts.
Context: A political speech advocating for education reform.
        Speaker: "Education is the backbone of our nation’s future.
Without strong schools, we risk everything. We must invest in teachers
and give every child a chance to succeed."
Discourse analysis
• Discourse analysis in NLP involves examining how sentences and utterances form coherent and
meaningful texts. This process is crucial for understanding and generating natural language, as it helps
models grasp the context and relationships between different parts of a text.
• Coherence: Ensuring that the text makes sense as a whole. Coherence involves
logical connections between sentences and ideas, making the text flow naturally.
Summary of Differences:
• Pragmatic Analysis
Example: Focuses on the implied meaning of a single sentence within a
conversation, emphasizing how context shapes interpretation (e.g., a polite
request in conversation).
• Discourse Analysis
Example: Looks at a larger text, examining how various elements (cohesion,
power, social context) work together to create an argument and convey
meaning over the course of a speech or extended discourse.
Anaphora resolution
• Anaphora resolution is a process in linguistics and natural language
processing (NLP) that involves identifying and linking anaphoric
expressions (like pronouns or noun phrases) to the correct
antecedents (the words or phrases they refer to) within a text. This is
essential for understanding the meaning of sentences and
maintaining coherence in both written and spoken discourse.
• Example: In the sentence "John told Mark that he should leave," the
antecedent for "he" (John or Mark) might depend on the
conversational context or the speaker’s intentions. Pragmatic analysis
would help by considering social or contextual cues outside the
immediate text.
• Pragmatic Analysis: Focuses on how context beyond the text helps resolve
references, especially when there is ambiguity or the meaning relies on
shared knowledge or inference.
Types of Anaphora:
• Pronominal Anaphora: The most common type, where pronouns like
he, she, it, they are used.
• Example: Sarah found her keys. She was relieved.
         • Anaphora: "She"
         • Antecedent: "Sarah"
• Nominal Anaphora: When a noun phrase is repeated or replaced with
another noun phrase.
         • Example: The dog was barking loudly. The animal seemed upset.
         • Anaphora: "The animal"
         • Antecedent: "The dog"
• Zero Anaphora: This occurs when a referent is implied but not
explicitly stated, common in languages like Japanese.
• Example: Atsui desu ne? (It's hot, isn't it?) Here, the subject ("the
weather") is not explicitly mentioned but understood through
Context.
• Pronouns:
        • Another form of definite reference is pronominalization, used for entities that
        are extremely salient in the discourse
                 • Emma smiled and chatted as cheerfully as she could
        • Pronouns can also participate in cataphora, in which they are mentioned
        before their referents are
        • Even before she saw it, Dorothy had been thinking about the Emerald City
        • Pronouns also appear in quantified contexts in which they are considered to
        be bound, as in
                 • Every dancer brought her left arm forward
                 • Under the relevant reading, her does not refer to some woman in context, but instead
                 behaves like a variable bound to the quantified expression every dancer
                                                                                                    154
Linguistic Properties of the Coreference Relation
• Number Agreement
         • Referring expressions and their referents must generally agree in number; English
         she/her/he/him/his/it are singular, we/us/they/them are plural, and you is unspecified for
         number.
         • So a plural antecedent like the chefs cannot generally corefer with a singular
         anaphor like she
         • semantically plural entities can be referred to by either it or they:
                  • IBM announced a new machine translation product yesterday. They have been
                  working on it for 20 years.
• Person Agreement:
         • English distinguishes between first, second, and third person, and a pronoun’s
         antecedent must agree with the pronoun in person
         • Thus a third person pronoun should have a third person antecedent
         • Quotations can cause exceptions
                  • “I voted for Nader because he was most aligned with my values,” she said.
• Gender or Noun Class Agreement:
         • Maryam has a theorem. She is exciting. (she=Maryam, not the theorem)
         • Maryam has a theorem. It is exciting. (it=the theorem, not Maryam)
• Binding Theory Constraints:
         • The binding theory is a name for syntactic constraints on the relations
         between a mention and an antecedent in the same sentence
         • Oversimplifying a bit, reflexive pronouns like himself and herself corefer with
         the subject of the most immediate clause that contains them, whereas
         nonreflexives cannot corefer with this subject
                  • Janet bought herself a bottle of fish sauce. [herself=Janet]
                  • Janet bought her a bottle of fish sauce. [her ≠ Janet]
• Grammatical Role:
         • Entities mentioned in subject position are more salient than those in object Position
                  • Billy Bones went to the bar with Jim Hawkins. He called for a glass of rum. [ he =
                  Billy ]
         • Jim Hawkins went to the bar with Billy Bones. He called for a glass of rum. [ he = Jim ]
• Verb Semantics:
         • Some verbs semantically emphasize one of their arguments, biasing the
         interpretation of subsequent pronouns
                  • John telephoned Bill. He lost the laptop.
                  • John criticized Bill. He lost the laptop.
• Selectional Restrictions:
         • the selectional restrictions that a verb places on its arguments can help
         eliminate referents
                  • I ate the soup in my new bowl after cooking it for hours
• Rule-Based Approaches: These methods rely on predefined linguistic rules, such as the proximity
between the anaphor and its antecedent, syntactic constraints, or grammatical agreements (e.g., gender
and number).
• Statistical/Probabilistic Models: These use large datasets to calculate the probability of various
antecedents being linked to an anaphor based on patterns found in the data.
• Machine Learning Approaches: Modern approaches use neural networks and deep learning to
automatically learn the patterns for anaphora resolution from large corpora, often outperforming
traditional rule-based or statistical methods.
• We start at the target pronoun in the tree, and crawl up the parse tree to the root node(S).
• It performs a breadth first left to right search of the node’s children to the left of the target for
each noun phrase or ‘S’ node it finds.
• The algorithm begins with the parse tree of sentence 2 and works its way up to the root node
S2.
• After that, it performs a breadth-first search to locate the noun phrase (NP). The algorithm
discovers its first noun phrase here, for the word ‘Jill’.
• Because Jill is the subject of ‘likes’, whose object is the pronoun, the pronoun obviously cannot be Jill.
• Binding theory states that a reflexive can refer to the subject of the most immediate clause in which
it appears, whereas a nonreflexive cannot corefer with this subject. Words such as himself, herself,
themselves, etc. are known as reflexives.
• So, in our scenario, ‘him’ will not refer to Jill because of the binding theory limitation
• Also, Jill will not be the accepted referent for the pronoun ‘he’ due to the gender agreement
requirement.
• As a result, the algorithm now begins its search in the previous sentence’s syntax tree.
• Hobbs distance property states that entities in a subject position are more likely the possible
substitute for the pronoun than in the object position.
• Following the Hobbs distance property, we perform a breadth first left to right search of
the node’s children for each noun phrase it discovers.
• As a result, the sentence’s subject Jack, who is an engineer, is explored before the object engineer,
and Jack is finally the resolved referent for the pronoun him.
For each prior mention (Victoria Chen, Megabucks Banking, her, etc.), the binary classifier computes
a probability: whether or not the mention is the antecedent of she.
• We want this probability to be high for actual antecedents (Victoria Chen, her, the 38-year-old) and
low for non-antecedents (Megabucks Banking, her pay).
• Early classifiers used hand-built features; more recent classifiers use neural representation learning
• For training, we need a heuristic for selecting training samples; since most pairs of mentions in a
document are not coreferent, selecting every pair would lead to a massive overabundance of negative
samples.
• The most common heuristic, is to choose the closest antecedent as a positive example, and all pairs
in between as the negative examples.
• More formally, for each anaphor mention mi we create
         • one positive instance (mi ;mj ) where mj is the closest antecedent to mi, and
• negative instance (mi ;mk ) for each mk between mj and mi
• Thus for the anaphor she, we would choose (she, her) as the positive example and no negative
examples.
• Similarly, for the anaphor the company we would choose
        • (the company, Megabucks) as the positive example and
        • (the company, she) (the company, the 38-year-old) (the company, her pay)
        and (the company, her) as negative examples.
• Advantage
        • Simplicity
• Disadvantages
        • The classifier doesn’t directly compare candidate antecedents to each other,
        so it’s not trained to decide, between two likely antecedents, which one is in
        fact better.
        • It ignores the discourse model, looking only at mentions, not entities. Each
        classifier decision is made completely locally to the pair, without being able to
        take into account other mentions of the same entity
Testing time:
      • O(T^2) spans of text in a document
      • O(T^4) possible pairs of spans — too expensive!
      • Use sm(i) for aggressive pruning
Vector Semantics & Embeddings
Ludwig Wittgenstein
PI #43:
"The meaning of a word is its use in the language"
Defining meaning as a point in space based on distribution:
        ◦ Each word = a vector (not just "good" or "w45")
        ◦ Similar words are "nearby in semantic space"
        ◦ We build this space automatically by seeing which words are nearby in text
◦ With embeddings:
        ◦ Feature is a word vector
        ◦ 'The previous word was vector [35,22,17…]
        ◦ Now in the test set we might see a similar vector [34,21,14]
        ◦ We can generalize to similar but unseen words!!!
battle is "the kind of word that occurs in Julius Caesar and Henry V" fool is "the kind of word that
occurs in comedies, especially Twelfth Night"
Computing word similarity: Dot product and cosine
The dot product tends to be high when the two vectors have large values in the same dimensions
Dot product can thus be a useful similarity metric between vectors
But since raw frequency values are non-negative, the cosine for term-term matrix vectors ranges from
0–1
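As a small worked example, the dot product and cosine can be computed with NumPy over two
made-up count vectors:

import numpy as np

v = np.array([35, 22, 17, 0, 2])
w = np.array([34, 21, 14, 1, 0])

dot = np.dot(v, w)
cosine = dot / (np.linalg.norm(v) * np.linalg.norm(w))
print(dot, round(cosine, 3))   # cosine stays in [0, 1] because the counts are non-negative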
fastText
• fastText introduces a pivotal shift by considering words as composed of character n-grams, enabling
it to build representations for words based on these subword units
• This approach allows the model to understand and generate embeddings for words not seen in the
training data, offering a substantial advantage in handling morphologically rich languages and
rare words.
Representation of Words
• Word2Vec: Word2Vec generates word embeddings based solely on the words without considering
internal structure or morphological information.
• fastText: fastText captures subword information, allowing it to understand word meanings based on
their constituent character n-grams. This enables fastText to represent words by considering their
morphological makeup, providing a richer representation, especially for morphologically rich
languages or domains with specialised jargon.
Training Efficiency
• Word2Vec: The training process in Word2Vec is relatively faster than older methods but might be
slower than fastText due to its word-level approach.
• fastText: fastText is known for its exceptional speed and scalability, especially when dealing with
large datasets, as it operates efficiently at the subword level.
Use Cases
• Word2Vec: Word2Vec’s word-level embeddings are well-suited for
tasks like finding similar words, understanding relationships between
words, and capturing semantic similarities.
• fastText: fastText’s subword embeddings make it more adaptable in
scenarios involving out-of-vocabulary words, sentiment analysis,
language identification, and tasks requiring a deeper understanding of
morphology.
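A minimal sketch of this difference using gensim (an assumed library): fastText can assemble a vector
for an unseen word from its character n-grams, while Word2Vec simply has no entry for it.

from gensim.models import Word2Vec, FastText

sentences = [["reading", "books"], ["reader", "reads", "a", "book"]]

w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=20)
ft  = FastText(sentences, vector_size=50, min_count=1, epochs=20)

print("readable" in w2v.wv.key_to_index)  # False: unseen word has no Word2Vec vector
print(ft.wv["readable"][:5])              # fastText composes a vector from subword n-grams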
Language models
A language model is the core component of modern Natural Language Processing (NLP). It’s a
statistical tool that analyzes the pattern of human language for the prediction of words.
Language models analyze bodies of text data to provide a basis for their word predictions. They are
used in natural language processing (NLP) applications
Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the
probability of a given sequence of words occurring in a sentence.
A language model is a way of representing text in a form that is understandable from the machine's
point of view.
The goal of a language model is to compute the probability of a sequence of tokens (e.g. a sentence or
a sequence of words).
A Language Model (LM) acts, in effect, like a grammar of a language, as it gives the probability of the
word that will follow.
Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences in a
language.
For NLP, a probabilistic model of a language that gives a probability that a string is a member of a
language is more useful.
To specify a correct probability distribution, the probability of all sentences in a language must sum to
1.
Chain Rule
For a word sequence w1, w2, ..., wn:
        P(w1, w2, ..., wn) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) · ... · P(wn | w1, ..., wn−1)
Bigram approximation
        P(wn | w1, ..., wn−1) ≈ P(wn | wn−1)
N-gram approximation
        P(wn | w1, ..., wn−1) ≈ P(wn | wn−N+1, ..., wn−1)
Estimating Probabilities
N-gram conditional probabilities can be estimated from raw text based on the relative frequency of
word sequences.
Bigram:  P(wn | wn−1) = C(wn−1 wn) / C(wn−1)
N-gram:  P(wn | wn−N+1, ..., wn−1) = C(wn−N+1 ... wn) / C(wn−N+1 ... wn−1)
To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every
sentence and treat these as additional words.
Word Prediction
Guess the next word...
... I notice three guys standing on the ???
There are many sources of knowledge that could be helpful, including world knowledge.
But we can do pretty well by simply looking at the preceding words and keeping track of simple
counts
Formalize this task using N-gram models. N-grams are token sequences of length N.
Given N-grams counts, we can guess likely next words in a sequence.
Counting
He stepped out into the hall, was delighted to encounter a water brother.
13 tokens, 15 if we include “,” and “.” as separate tokens.
Assuming we include the comma and period, how many bigrams are there? 14
12 bigrams without punctuation and non-alphanumeric characters:
he stepped
stepped out
out into
into the
the hall
hall was
was delighted
delighted to
to encounter
encounter a
a water
water brother
They picnicked by the pool, then lay back on the grass and looked at the stars.
18 tokens (again counting punctuation)
only 16 types (again counting punctuation)
only 14 types (excluding punctuation)
In going forward, we’ll sometimes count types and sometimes tokens.
Word Types - A unique word form regardless of how many times it appears in a text
Counting: Wordforms
Should “cats” and “cat” count as the same?
How about “geese” and “goose”?
   ● Lemma: cats & cat and geese & goose have the same lemmas
   ● Wordform: fully inflected surface form
Again, we’ll have occasion to count both lemmas and wordforms
Counting: Corpora
    ●   Numbers
    ●   Misspellings
    ●   Names
    ●   Acronyms
    ●   etc
Language Modelling
To Perform word prediction:
Assess the conditional probability of a word given the previous words in the sequence
P(wn|w1,w2…wn-1).
We’ll call a statistical model that can assess this a Language Model.
One way to calculate them is to use the definition of conditional probabilities, and get the counts from
a large corpus.
Counts are used to estimate the probabilities.
Unfortunately, for most sequences and for most text collections we won’t get good estimates from this
method.
We’ll use the chain rule of probability
         and an independence assumption.
Independence Assumption
This kind of independence assumption is called a Markov assumption after the Russian mathematician
Andrei Markov.
For each component in the product replace with the approximation (assuming a prefix of N)
Bigram Version
        P(w1, w2, ..., wn) ≈ ∏i P(wi | wi−1)
Example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Training corpus:
<s> I am from Vellore </s>
<s> I am a teacher </s>
<s> students are good and are from various cities</s>
<s> students from Vellore do engineering</s>
Test data:
<s> students are from Vellore </s>
Method 1
P(<s> students are from Vellore </s>)
                   = P(students | <s>) * P(are | students) * P(from | are)
                     * P(Vellore | from) * P(</s> | Vellore)
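Method 1 can be sketched in a few lines of Python: estimate each bigram probability by relative
frequency from the training corpus above and multiply the factors for the test sentence (a plain
maximum-likelihood model, no smoothing):

from collections import Counter

train = [
    "<s> I am from Vellore </s>",
    "<s> I am a teacher </s>",
    "<s> students are good and are from various cities </s>",
    "<s> students from Vellore do engineering </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in train:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(w_prev, w):
    return bigrams[(w_prev, w)] / unigrams[w_prev]

test = "<s> students are from Vellore </s>".split()
prob = 1.0
for w_prev, w in zip(test, test[1:]):
    prob *= p_bigram(w_prev, w)
print(prob)   # 0.5 * 0.5 * 0.5 * (2/3) * 0.5 ≈ 0.0417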
Method 2
Bigram Probabilities
Estimate each bigram probability by relative frequency from a large corpus; for example, a bigram
count of 827 divided by a unigram count of 2,533 gives 827/2533 = 0.326 ≈ 0.33.
Kinds of Knowledge
As crude as they are, N-gram probabilities capture a range of interesting facts about language.
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (i | <s>) = .25
The meaning of a sentence often depends upon the meaning of the sentence just before it; in addition,
it also contributes to the meaning of the immediately succeeding sentence.
Discourse Analysis
Discourse analysis in NLP involves examining how sentences and utterances form coherent and
meaningful texts. This process is crucial for understanding and generating natural language, as it helps
models grasp the context and relationships between different parts of a text.
Here are some key aspects of discourse analysis in NLP:
Coherence: Ensuring that the text makes sense as a whole. Coherence involves logical connections
between sentences and ideas, making the text flow naturally.
Coreference Resolution: Identifying when different expressions refer to the same entity. For example,
recognizing that “John” and “he” in a text refer to the same person.
Discourse Structure: Analyzing the organization of a text, including how it is segmented into
paragraphs or sections and how these segments relate to each other.
Topic Structure: Understanding the main topics discussed in a text and how they shift throughout the
discourse.
Conversation Structure: In dialogues, this involves understanding the turn-taking and the flow of
conversation between participants.
Let’s turn the model around and use it to generate random sentences that are like the sentences from
which the model was derived.
This illustrates:
        the dependence of N-gram models on the specific training corpus, and
        that the greater the N, the better the model.
The technique is generally attributed to Claude Shannon; a small sampling sketch follows the bigram chain below.
<s> I
     I want
       want to
            to eat
               eat Chinese
                   Chinese food
                           food </s>
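A small sketch of this generation idea (a hypothetical bigram sampler, not code from the notes; the three-sentence corpus is illustrative only): starting from <s>, repeatedly sample the next word in proportion to its bigram count until </s> is produced.

import random
from collections import Counter, defaultdict

# Tiny illustrative corpus; any corpus of <s> ... </s> sentences works
corpus = [
    "<s> I want to eat Chinese food </s>",
    "<s> I want to eat </s>",
    "<s> I want Chinese food </s>",
]

# For every word, collect counts of the words that follow it
following = defaultdict(Counter)
for sent in corpus:
    tokens = sent.split()
    for prev, w in zip(tokens, tokens[1:]):
        following[prev][w] += 1

def generate():
    """Sample a sentence word by word from the bigram distribution."""
    word, out = "<s>", []
    while word != "</s>":
        candidates = list(following[word].keys())
        weights = list(following[word].values())
        word = random.choices(candidates, weights=weights)[0]
        if word != "</s>":
            out.append(word)
    return " ".join(out)

print(generate())   # e.g. "I want to eat Chinese food"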
Shakespeare as a Corpus
N=884,647 tokens, V=29,066 (word types)
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams...
        So, 99.96% of the possible bigrams were never seen (have zero entries in the table)
        This is the biggest problem in language modeling; we’ll come back to it.
Quadrigrams (4-grams) are worse: what comes out looks like Shakespeare because it is Shakespeare.
Evaluation
Perplexity
The best language model is one that best predicts an unseen test set, i.e., the one that gives the highest P(sentence).
Perplexity is the inverse probability of the test set (as assigned by the language model), normalized by the number of words.
Expanding with the chain rule, and then applying the bigram approximation, gives the formulas below.
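The missing formulas can be written out in standard notation (a reconstruction, not the original slides):

PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}        % chain rule

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}                  % bigram approximation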
Smoothing in NLP
        That is, we’ll have many cases of putative “zero-probability n-grams” that should really have
        some non-zero probability.
Zero Counts
Back to Shakespeare
        Recall that Shakespeare produced 300,000 bigram types out of V² = 844 million possible
        bigrams...
        So, 99.96% of the possible bigrams were never seen (have zero entries in the table)
        Does that mean that any sentence that contains one of those bigrams should have a
        probability of 0?
Smoothing
Smoothing in natural language processing (NLP) refers to a set of techniques used to adjust
probability estimates in models to handle the problem of zero probabilities for unseen events. This is
especially important in the context of language modeling, where we often deal with n-grams.
Add-One Smoothing
When constructing a probabilistic language model, particularly with n-grams, you often encounter
situations where certain word sequences (n-grams) have not been observed in the training data.
Without any adjustments, the probability of these unseen sequences would be zero, which can
severely impact the performance of the model.
Add-One Smoothing modifies the probability estimation to ensure that no sequence has a zero
probability. It does this by adding one to the count of each n-gram before normalizing:
For unigrams:
        Add 1 to every word (type) count.
        Normalize by N (tokens) + V (types).
        The smoothed count (adjusted so that the counts still sum to N) is given below.
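In standard notation (a reconstruction rather than the original slide), add-one (Laplace) smoothing for unigrams and bigrams is:

P_{add\text{-}1}(w_i) = \frac{c_i + 1}{N + V}
\qquad
c_i^{*} = (c_i + 1)\,\frac{N}{N + V}            % reconstituted (smoothed) count

P_{add\text{-}1}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}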
Example sentences from the Berkeley Restaurant Project corpus:
        Can you give me a listing of the kinds of food that are available
        i’m looking for a good place to eat breakfast
        when is caffe venezia open during the day
Note that add-one smoothing has made a very big change to the counts: C(want to) changed from 608 to a reconstituted count of 238!
We can see this in probability space as well: P(to|want) decreases from .66 in the unsmoothed case to
.26 in the smoothed case.
                                 Chunking or shallow parsing
What is Chunking?
Chunking is a process of extracting phrases from unstructured text. It divides the text into
syntactically correlated groups of words. Instead of working only with simple tokens, which may not
represent the actual meaning of the text, it is advisable to treat phrases such as “South Africa” as a single unit.
Chunking works on top of POS tagging: it uses POS tags as input and provides chunks as output.
Similar to POS tags, there is a standard set of chunk tags such as Noun Phrase (NP), Verb Phrase
(VP), etc. Chunking is very important when you want to extract information from text, for example named entities.
Unlike full parsing or deep parsing, which analyses the complete grammatical structure of a
sentence, shallow parsing identifies only the major phrase-level constituents.
For example, in the sentence “The black cat sat on the mat,” the noun phrase “the black cat”
can be identified and extracted using noun phrase chunking. Another common type of shallow
parsing is verb phrase chunking, which involves identifying and extracting all the verb phrases
in a sentence.
For many applications (for example, processing closely related languages), chunking or shallow parsing is sufficient and full parsing is not required.
Let’s consider noun phrase chunking, where we search for chunks corresponding to individual
noun phrases. To create NP chunks, we define a chunk grammar over POS tags with a single rule.
The rule states that whenever the chunker finds an optional determiner (DT) followed by any
number of adjectives (JJ) and then a noun (NN), a Noun Phrase (NP) chunk should be formed.
grammar for NP: {<DT>?<JJ>*<NN>}
Neural Network based chunking:
Examples:
1) [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]
2) [NP Mr. John] [VP had been] [NP organizing chair] [PP of] [NP NLP Conference]
The chunking task shown in the examples above can be transformed into a sequence labelling task:
each token is tagged with the chunk it belongs to (e.g. B-NP for the first token of an NP and I-NP for
the rest), and tokens that do not belong to any chunk are given the label O (to denote ‘outside’).
Now each word is provided a label which makes it suitable for neural networks to be used.
For example, given the sentence “Bob drank coffee at Starbucks”, the labeling might be “Bob
(NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)”.
So, to build a conditional random field to label sentences with their parts of speech, we have
to decide on a set of feature functions fi.
Feature Functions in a CRF
In a CRF, each feature function is a function that takes in as input:
       • a sentence s
       • the position i of a word in the sentence
       • the label li of the current word
       • the label li-1 of the previous word
and outputs a real-valued number (though the numbers are often just either 0 or 1).
(Note: A linear-chain CRF is built by restricting our features to depend on only the current and
previous labels, rather than arbitrary labels throughout the sentence)
For example, one possible feature function could measure how strongly we believe the current word
should be labelled as an adjective, given that the previous word is “very”.
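A sketch of one such binary feature function (illustrative code, not from the notes; the function and label names are hypothetical):

def f_adj_after_very(sentence, i, label_i, label_prev):
    """Fires (returns 1) when the previous word is 'very' and the current
    word is labelled ADJECTIVE; otherwise returns 0."""
    if i > 0 and sentence[i - 1].lower() == "very" and label_i == "ADJECTIVE":
        return 1
    return 0

# Example: in "a very tall man", labelling position 2 ("tall") as ADJECTIVE fires the feature
print(f_adj_after_very(["a", "very", "tall", "man"], 2, "ADJECTIVE", "ADVERB"))   # 1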
# Fragment of a per-word feature dictionary (the enclosing function is not shown in the source);
# such dictionaries are commonly passed to CRF toolkits:
features = {
    'is_numeric': word.isdigit(),                       # True if the word is numeric
    'capitals_inside': word[1:].lower() != word[1:],    # True if capitals appear after the first character
}
Each feature function fj is given a weight λj, and the score of a labeling l for a sentence s adds up the
weighted feature values over every feature function and every word position. (The outer sum runs over
each feature function j, and the inner sum runs over each position i of the sentence.)
Finally, we can transform these scores into probabilities p(l|s) between 0 and 1 by exponentiating and
normalizing, as shown below:
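Written out (standard linear-chain CRF notation; reconstructed, since the original figures are not reproduced):

score(l \mid s) = \sum_{j} \sum_{i=1}^{m} \lambda_j \, f_j(s, i, l_i, l_{i-1})

p(l \mid s) = \frac{\exp\big(score(l \mid s)\big)}{\sum_{l'} \exp\big(score(l' \mid s)\big)}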
To sum up: to build a conditional random field, just define a bunch of feature functions (which
can depend on the entire sentence, a current position, and nearby labels), assign them weights,
and add them all together, transforming at the end to a probability if necessary.
Advantages of CRFs over HMMs:
- CRFs can define a much larger set of features. Whereas HMMs are necessarily local in
nature (because they’re constrained to binary transition and emission feature functions, which
force each word to depend only on the current label and each label to depend only on the
previous label), CRFs can use more global features. For example, one of the features in our
POS tagger above increased the probability of labelings that tagged the first word of a sentence
as a VERB if the end of the sentence contained a question mark.
- CRFs can have arbitrary weights.
Learning Weights
Gradient descent can be used.
Assume there is a bunch of training examples (sentences and associated part-of-speech labels).
Randomly initialize the weights of the CRF model. To shift these randomly initialized weights
to the correct ones, for each training example…
       •   Go through each feature function fi, and calculate the gradient of the log probability
           of the training example with respect to its weight λi.
           Note that the first term in the gradient is the contribution of feature fi under the true
           label, and the second term in the gradient is the expected contribution of
           feature fi under the current model.
       •   Move λi in the direction of the gradient:
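In symbols (a reconstruction of the standard CRF gradient-ascent step, writing j for the feature index as in the score formula above; the learning rate α is my notation):

\frac{\partial}{\partial \lambda_j} \log p(l \mid s)
  = \sum_{i} f_j(s, i, l_i, l_{i-1}) \;-\; \sum_{l'} p(l' \mid s) \sum_{i} f_j(s, i, l'_i, l'_{i-1})

\lambda_j \leftarrow \lambda_j + \alpha \,\frac{\partial}{\partial \lambda_j} \log p(l \mid s)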
Finding the best labeling: the naive way is to calculate p(l|s) for every possible labeling l and then
choose the labeling that maximizes this probability. However, since there are k^m possible labelings
for a tag set of size k and a sentence of length m, this approach would have to check an exponential
number of labelings.
A better way is to realize that (linear-chain) CRFs satisfy an optimal substructure property, so a
(polynomial-time) dynamic programming algorithm, similar to the Viterbi algorithm for HMMs, can
be used to find the optimal labeling.
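A compact sketch of that dynamic programme (illustrative only; it assumes a precomputed local score function, obtained by summing the weighted feature values, rather than real feature functions):

def viterbi_decode(n_positions, labels, local_score):
    """Find the label sequence maximizing sum_i local_score(i, label_i, label_{i-1}).
    local_score(i, label, prev_label) returns the summed weighted feature values
    for position i; prev_label is None at position 0."""
    # best[i][lab] = best total score of any labeling of positions 0..i ending in lab
    best = [{lab: local_score(0, lab, None) for lab in labels}]
    back = [{}]
    for i in range(1, n_positions):
        best.append({})
        back.append({})
        for lab in labels:
            # pick the best previous label for this (position, label) pair
            prev = max(labels, key=lambda p: best[i - 1][p] + local_score(i, lab, p))
            best[i][lab] = best[i - 1][prev] + local_score(i, lab, prev)
            back[i][lab] = prev
    # backtrack from the best final label
    last = max(labels, key=lambda lab: best[-1][lab])
    path = [last]
    for i in range(n_positions - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy usage: two labels and a score that rewards alternating labels
labels = ["A", "B"]
score = lambda i, lab, prev: 1.0 if (prev is None or lab != prev) else -1.0
print(viterbi_decode(4, labels, score))   # ['B', 'A', 'B', 'A'] (an alternating labeling)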
There are a lot of libraries, such as spaCy or TextBlob, which give phrases out of the box; NLTK is used in the example below.
# Install NLTK first, e.g. with: conda install -c anaconda nltk
import nltk
from nltk import pos_tag, word_tokenize

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

sentence = "the little yellow dog barked at the cat"

# One chunk rule: an optional determiner, any number of adjectives, then a noun
grammar = ('''
    NP: {<DT>?<JJ>*<NN>} # NP
    ''')

chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunkParser.parse(tagged)
for subtree in tree.subtrees():
    print(subtree)
tree.draw()
 #output:
 (S
   (NP the/DT little/JJ yellow/JJ dog/NN)
   barked/VBD
   at/IN
   (NP the/DT cat/NN))
 (NP the/DT little/JJ yellow/JJ dog/NN)
 (NP the/DT cat/NN)
Here "tree" variable is an NLTK tree. Each "chunk" and "non chunk" is a "subtree" of the tree.
We can reference these by doing something like tree.subtrees. We can then iterate through
these subtrees like so:
grammar = ('''
    NP: {<DT>?<JJ>*<NN>} # NP
    ''')
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunkParser.parse(tagged)
print("output")
for subtree in tree.subtrees():
    if subtree.label() == 'NP':   # print only the NP chunks
        print(subtree)
print("done")
#tree.draw()
 #output:
 (NP the/DT little/JJ yellow/JJ dog/NN)
 (NP the/DT cat/NN)
"""grammar = ('''
NP: {<DT>?<JJ>*<NN>} # NP
''')
"""
grammar=('''VB: {<VB.?>}''')
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunkParser.parse(tagged)
print("output")
print(subtree)
print("done")
 #tree.draw()
 #output:
 (VB barked/VBD)
                                          Parse Tree Generation
Syntax: It refers to the way words are arranged together in a sentence.
Syntactic constituency: Group of words behaving as single units or constituents
Noun phrase: a sequence of words built around at least one noun, e.g., “John the servant”, “the
lighthouse”.
An entire noun phrase can occur before a verb eg., John the servant came out.
Similarly, a prepositional phrase can be placed at the beginning (preposed), at the end (postposed)
or even placed in the middle of the sentence. But the individual words cannot be split up.
Eg., “on January 3rd”
On January 3rd, I would like to go for a trip. (preposed)
I would like to go for a trip on January 3rd. (postposed)
I would like to go on January 3rd for a trip. (placed in the middle)
I would like to go on January for a trip 3rd. (this is wrong: a prepositional phrase cannot be split up)
The sequence of rule expansions is called a derivation of the string of words. A parse tree is generally
used to represent a derivation.
E.g., (parse-tree figure not reproduced here)
Syntactic parsing: It is the problem of mapping from a string of words to its parse tree. It determines
if the structure of a sentence is according to the grammar of the language.
There are several approaches to constructing a parse tree: top-down and bottom-up.
Ambiguous sentences lead to construction of more than one parse tree.
Partial CYK chart (only the filled cells survive in the extracted text; cell (i, j) spans the words from position i+1 through position j):

    cell (1,2): N -> flight
    cell (2,3): V -> includes
    cell (2,5): VP -> V NP
    cell (3,4): DET -> a
    cell (3,5): NP -> DET N
    cell (4,5): N -> meals

(The first row of the chart, including the cell that eventually holds S -> NP VP, is not reproduced.)
Note: at the end, if the start symbol is obtained, the sentence is correct according to the given CNF
grammar. If the start symbol is not reached, then the sentence is syntactically wrong, i.e., the
arrangement of the words is not correct.
Now, to construct the parse tree, start with the cell ‘S->NP, VP’. Look for the cell in the same row
that has rule for NP and look for the cell in the same column that has rule for VP and proceed with
the same way until all the leaf nodes contain the terminal symbols.
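A small sketch of the CKY procedure just described, as a recognizer (illustrative; the CNF grammar below is a hypothetical fragment for a sentence like “the flight includes a meal”):

from collections import defaultdict

# Hypothetical CNF grammar: binary rules (A -> B C) and lexical rules (A -> word)
binary_rules = {
    "S":  [("NP", "VP")],
    "NP": [("DET", "N")],
    "VP": [("V", "NP")],
}
lexical_rules = {
    "the": ["DET"], "a": ["DET"],
    "flight": ["N"], "meal": ["N"],
    "includes": ["V"],
}

def cky_recognize(words):
    n = len(words)
    table = defaultdict(set)          # table[(i, j)] = non-terminals deriving words[i:j]
    for i, w in enumerate(words):
        table[(i, i + 1)].update(lexical_rules.get(w, []))
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):                       # split point
                for a, rhs_list in binary_rules.items():
                    for (b, c) in rhs_list:
                        if b in table[(i, k)] and c in table[(k, j)]:
                            table[(i, j)].add(a)
    return "S" in table[(0, n)]       # start symbol in the top cell => grammatical

print(cky_recognize("the flight includes a meal".split()))   # True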
Example-2:
Let’s consider an example grammar and sentence that lead to multiple parse trees.
Note: If there are 2 productions for the same non-terminal, let us label them as rule 1 and rule 2.
E.g., in the CNF grammar given below, there are two rules for VP. So, label them as VP1 and VP2.
Partial CYK chart for “we eat fish with fork” (only the surviving cells; cell (i, j) again spans words i+1 through j):

    cell (3,4): P -> with
    cell (3,5): PP -> P NP
    cell (4,5): NP -> fork

(The remaining cells, including those containing S, are not reproduced.)
This results in the interpretation “we eat, fish with fork”, i.e., we eat that fish which has a fork.
    ii)      Now construct the tree with the second production of ‘S’, S -> NP VP2.
    (Note: follow the lines from the rules encircled in the table given below; that figure is not reproduced here.)
This results in the interpretation “we eat fish, with fork”, i.e., we eat fish using a fork.
Eg., the probability of occurrence of the word sequence “school students do their homework”
is
P(school students do their homework) = p(school) · p(students | school) · p(do | school students) · p(their | school students do) · p(homework | school students do their)        ……(1)
But the drawback here is that the longer the sequence, the less likely we are to find it in a
training corpus.
This can be resolved using n-gram models.
The intuition of the n-gram model is that instead of computing the probability of a word
given its entire history, we can approximate the history by just the last few words.
The bigram model, for example, approximates the probability of a word given
all the previous words P(wn | w1:n-1) by using only the conditional probability of the
preceding word P(wn | wn-1).
In other words, instead of computing the probability, p(homework | school students do their),
we approximate it with the probability, p(homework | their).
This way of assuming that the probability of a word depends only on the previous word is
called a Markov assumption.
We can generalize the bigram (which looks one word into the past) to the trigram (which
looks two words into the past) and thus to the n-gram (which looks n-1 words into the past).
Thus, the general equation for this N-gram approximation to the conditional probability of the
next word in a sequence is
P(wn|w1:n-1) ≈ P(wn | wn-N+1:n-1)
In a unigram (1-gram) model, no history is used. In a bigram, one word history is used and in
a n-gram, n-1 words history is used.
Based on this, Equation (1) can be written as follows.
P(school students do their homework) = p(school) · p(students | school) · p(do | students) · p(their | do) · p(homework | their)        ……(2)
Maximum Likelihood Estimation (MLE) can be used to find the probabilities of n-grams that
uses the count of occurrences of the n-grams.
E.g., for the bigram model, the probability P(wn | wn-1) is calculated by taking the count of the
bigram (wn-1 wn) in the corpus and normalizing it by the total count of all bigrams that start with the
same word wn-1:
P(wn | wn-1) = count(wn-1 wn) / Σw count(wn-1 w)
But the sum of the counts of all bigrams that start with wn-1 is simply the unigram count of wn-1, so
the formula simplifies to
P(wn | wn-1) = count(wn-1 wn) / count(wn-1)
i.e., p(homework | their) = count(“their homework”) / count(“their”)
Likewise, the probability should be calculated for all the other components p(students |
school), p(do | students) and p(their | do) given in equation (2).
Next word estimation using bi-gram model:
Conditional probability is given by
P(B | A) = p(A, B) / p(A)                                                                 ………….. (3)
Given the word sequence “school students do their homework”, if we wish to find the
probability for the next word to be “regularly”, based on eqn (3), the formula can be written
as
P(regularly | school students do their homework)
= p(school students do their homework regularly) / p(school students do their homework)        (from eqn (3))
Under the bigram approximation,
P(regularly | school students do their homework) ≈ p(regularly | homework) = count(“homework regularly”) / count(“homework”)
 Thus, to find the probability of the next word in the sequence “school students do their
homework” to be “regularly”, it is enough if we find the probability of “regularly” given just
the previous word, “homework”.
Trigram:
The same idea used for the bigram model can be extended to trigram, four-gram and, in general, n-gram models. For a trigram,
            p(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
Applying the trigram approximation to our example, the common factors cancel:
P(regularly | school students do their homework)
   = [p(school) · p(students | <s>, school) · p(do | school, students) · p(their | students, do) · p(homework | do, their) · p(regularly | their, homework)]
     / [p(school) · p(students | <s>, school) · p(do | school, students) · p(their | students, do) · p(homework | do, their)]
   = p(regularly | their, homework)
 Thus, in a trigram model, to find the probability of the next word in the sequence “school
students do their homework” to be “regularly”, it is enough if we find the probability of
“regularly” given just the previous two words, “their homework”.
In practice, unigram models are not commonly used because the next word will be predicted
as the highest frequency word in the corpus.
Eg., Assume the next word has to be predicted for the sequence “ covishield is the vaccine for
_______”. If the word “and” occurs, the highest number of times in a corpus, then the
unigram model will predict “and” as the next word resulting in “covishield is the vaccine for
and”.
Bigram, trigram and four-gram models are usually preferred. For models with higher values of ‘n’, a larger corpus is required.
Examples:
Consider the following tiny corpus with seven sentences.
 <s> I am Henry </s>
 <s> I like college </s>
 <s> Do Henry like college </s>
 <s> Henry I am </s>
 <s> Do I like Henry </s>
 <s> Do I like college </s>
 <s> I do like Henry </s>

Word frequencies (counting “Do” and “do” together):
 Word        frequency
 <s>         7
 </s>        7
 I           6
 am          2
 Henry       5
 like        5
 college     3
 do          4
   i)       Using bigram model, predict the most probable next word for the sequence,
             <s> do _______
 Next word           Probability of next word
 P(</s> | do)        0/4
 P(I | do)           2/4 -> the most probable next word is “I”
 P(am | do)          0/4
 P(Henry | do)       1/4
 P(like | do)        1/4
 P(college | do)     0/4
 P(do | do)          0/4
   ii)      Using bigram model, predict the most probable next word for the sequence,
             <s> I like Henry _______
            (In a bigram model only the previous word matters: P(</s> | Henry) = 3/5 is the highest, so “</s>” is the most probable next word.)
   iii)     Using trigram model, predict the most probable next word for the sequence,
             <s> Do I like _______
            Note: count(I like) = 3
 Next word                                Probability of next word
 P(</s> | I like)                         0/3
 P(I | I like)                               0/3
 P(am | I like)                              0/3
 P(Henry | I like)                           1/3
 P( like | I like)                           0/3
 P(college| I like)                          2/3 -> college is more probable
 P(do | I like)                              0/3
   iv)      Using 4-gram model, predict the most probable next word for the sequence,
             <s> Do I like college _______
            Note: count(I like college) = 2
 Next word                                   Probability of next word
 P(</s> | I like college)                    2/2 -> </s> is more probable
 P(I | I like college)                       0/2
 P(am | I like college)                      0/2
 P(Henry | I like college)                   0/2
 P( like | I like college)                   0/2
 P(college| I like college)                  0/2
 P(do | I like college)                      0/2
   v)       Using bigram model and the corpus mentioned above, predict the most probable
            sentence out of the following two.
            a) <s> I like college </s>
            b) <s> do I like Henry</s>
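A short sketch (my own, not from the notes) that reproduces exercise (v) by computing the bigram probability of each sentence from the seven-sentence corpus; it gives about 0.129 for sentence (a) and 0.0257 for sentence (b), so (a) is more probable. The value 0.0257 is the one referred to just below.

from collections import Counter

corpus = [
    "<s> I am Henry </s>", "<s> I like college </s>",
    "<s> Do Henry like college </s>", "<s> Henry I am </s>",
    "<s> Do I like Henry </s>", "<s> Do I like college </s>",
    "<s> I do like Henry </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.lower().split()        # fold "Do"/"do" together, as the frequency table does
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def sentence_prob(sentence):
    # Product of MLE bigram probabilities over the sentence
    toks = sentence.lower().split()
    prob = 1.0
    for prev, w in zip(toks, toks[1:]):
        prob *= bigrams[(prev, w)] / unigrams[prev]
    return prob

print(sentence_prob("<s> I like college </s>"))      # ≈ 0.1286
print(sentence_prob("<s> do I like Henry </s>"))     # ≈ 0.0257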
E.g., in calculation (b) above the result is 0.0257, and if we keep multiplying the probabilities of a
few more bigrams the result becomes smaller and smaller, leading to underflow. To avoid that, the
same two calculations can be done in log space:
   a) <s> I like college </s>
       P = p(I | <s>) * p(like | I) * p(college | like) * p(</s> | college)
         = 3/7 * 3/6 * 3/5 * 3/3 ≈ 0.1286
       log P = log(3/7) + log(3/6) + log(3/5) + log(3/3) = -2.0513   (natural log)
   b) <s> do I like Henry </s>
       log P = log(0.0257) ≈ -3.66
Even with logarithmic calculations, the first sentence is found to be more probable (-2.05 > -3.66).
Zero-Probability problem:
Let us assume that we have to calculate the probability of the word sequence
“<s> like college </s>”
        = p(like | <s>) * p(college | like) * p( </s> | college)
        = 0/7*3/5*3/3 = 0
Probability is evaluated as 0. Although “like college” occurs three times in the corpus, the bigram
“<s> like” never occurs, so the first term is 0 and the whole product becomes 0.
This is termed the zero-probability problem.
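One way to see how smoothing removes the zero (my own worked example, assuming add-one smoothing and a vocabulary of V = 8 types that includes <s> and </s>):

P_{add\text{-}1}(\text{like} \mid \text{<s>}) = \frac{0+1}{7+8} = \frac{1}{15},\qquad
P_{add\text{-}1}(\text{college} \mid \text{like}) = \frac{3+1}{5+8} = \frac{4}{13},\qquad
P_{add\text{-}1}(\text{</s>} \mid \text{college}) = \frac{3+1}{3+8} = \frac{4}{11}

P_{add\text{-}1}(\text{<s> like college </s>}) \approx 0.0667 \times 0.3077 \times 0.3636 \approx 0.0075,
\text{ which is small but no longer zero.}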
(Only one row of the smoothing example table survives in the extracted text: the word “daily”, raw count 0, unsmoothed probability 0/20, adjusted count 1, adjusted probability 1/24 ≈ 0.04.)
        P_GT(tennis) = (2/3) / 18 = 2/54 = 1/27
   Therefore, the probability of the existing word “tennis” is discounted from 1/18 to 1/27, and
   that extra probability mass is used to account for the unseen words.
Perplexity (PP) is an intrinsic evaluation measure for language models.
A language model is best if it best predicts an unseen test set.
Perplexity is the inverse probability of the test data, normalized by the number of words:
PP(W) = P(w1 w2 … wn)^(-1/n)
The lower the perplexity, the better the model; the higher the perplexity, the more confused the model is when predicting.
Perplexity calculation for bigram and trigram models based on the sentence,
<s> I like college </s>:
    i)      Bigram model:
        P(<s> I like college </s>) = p(I | <s>) * p(like | I) * p(college | like) * p(</s> | college)
                                   = 3/7 * 3/6 * 3/5 * 3/3 = 9/70 ≈ 0.13
        PP(W) = (1/0.13)^(1/4) ≈ 1.67
    ii)     Trigram model:
        P = p(like | <s> I) * p(college | I like) * p(</s> | like college)
          = 1/3 * 2/3 * 3/3 = 2/9 ≈ 0.22
        PP(W) = (1/0.22)^(1/3) ≈ 1.66 (since this value is smaller, the trigram model is better for this example).
****************************