Natural Language Processing
Lecture 7: Parsing with Context Free Grammars II.
CKY for PCFGs. Earley Parser.
11/13/2020
COMS W4705
Yassine Benajiba
     Recall: Syntactic Ambiguity

 S   → NP VP      NP → she
VP   → V NP       NP → glasses
VP   → VP PP       D → the
PP   → P NP        N → cat
NP   → D N         N → glasses
NP   → NP PP       V → saw
                   P → with

[Two parse trees for "she saw the cat with glasses": one attaches the PP
"with glasses" to the VP (via VP → VP PP), the other attaches it to the
object NP (via NP → NP PP).]

                          Which parse tree is “better”? More probable?
Probabilities for Parse Trees
•   Let T_G be the set of all parse trees generated by
    grammar G.
•   We want a model that assigns a probability P(t) to each parse
    tree t, such that Σ_{t ∈ T_G} P(t) = 1.
•   We can use this model to select the most probable parse
    tree compatible with an input sentence.
    •   This is another example of a generative model!
Selecting Parse Trees
•   Let T_G(s) be the set of trees generated by grammar G whose
    yield (sequence of leaves) is string s.
•   The most likely parse tree produced by G for string s is
        t* = argmax_{t ∈ T_G(s)} P(t)
    •   How do we define P(t)?
    •   How do we learn such a model from training data (annotated or
        unannotated)?
    •   How do we find the highest probability tree for a given
        sentence? (parsing/decoding)
        Probabilistic Context Free
           Grammars (PCFG)
•   A PCFG consists of a Context Free Grammar
    G=(N, Σ, R, S) and a probability P(A → β) for each
    production A → β ∈ R.
    •   The probabilities for all rules with the same left-hand
        side sum up to 1:  Σ_{β : A → β ∈ R} P(A → β) = 1  for each A ∈ N.
    •   Think of this as the conditional probability for A → β,
        given the left-hand-side nonterminal A.
     PCFG Example
 S   → NP VP   [1.0]   NP   → she     [0.05]
VP   → V NP    [0.6]   NP   → glasses [0.05]
VP   → VP PP   [0.4]    D   → the     [1.0]
PP   → P NP    [1.0]    N   → cat     [0.3]
NP   → D N     [0.7]    N   → glasses [0.7]
NP   → NP PP   [0.2]    V   → saw     [1.0]
                        P   → with    [1.0]
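As a quick sanity check (my own sketch, not from the slides), the following Python snippet stores the example grammar above and verifies that the rule probabilities for each left-hand side sum to 1; the dictionary layout is just one possible representation.

    from collections import defaultdict

    # the example PCFG above, as {(lhs, rhs): probability}
    rules = {
        ("S", ("NP", "VP")): 1.0,
        ("VP", ("V", "NP")): 0.6,   ("VP", ("VP", "PP")): 0.4,
        ("PP", ("P", "NP")): 1.0,
        ("NP", ("D", "N")): 0.7,    ("NP", ("NP", "PP")): 0.2,
        ("NP", ("she",)): 0.05,     ("NP", ("glasses",)): 0.05,
        ("D", ("the",)): 1.0,       ("V", ("saw",)): 1.0,   ("P", ("with",)): 1.0,
        ("N", ("cat",)): 0.3,       ("N", ("glasses",)): 0.7,
    }

    totals = defaultdict(float)
    for (lhs, _), p in rules.items():
        totals[lhs] += p

    # every left-hand side should sum to 1 (up to floating-point error)
    assert all(abs(total - 1.0) < 1e-9 for total in totals.values())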
     Parse Tree Probability
•   Given a parse tree t containing rules A_1 → β_1, …, A_n → β_n,
    the probability of t is  P(t) = ∏_{i=1…n} P(A_i → β_i)

    [First parse tree: the PP "with glasses" attached to the VP.]
    Rules used: S → NP VP (1.0), NP → she (.05), VP → VP PP (.4),
    VP → V NP (.6), V → saw (1.0), NP → D N (.7), D → the (1.0),
    N → cat (.3), PP → P NP (1.0), P → with (1.0), NP → glasses (.05)

             1 x .05 x .4 x .6 x 1 x .7 x 1 x .3 x 1 x 1 x .05 = .000126
     Parse Tree Probability
•   Given a parse tree t containing rules A_1 → β_1, …, A_n → β_n,
    the probability of t is  P(t) = ∏_{i=1…n} P(A_i → β_i)

    [Second parse tree: the PP "with glasses" attached to the object NP.]
    Rules used: S → NP VP (1.0), NP → she (.05), VP → V NP (.6),
    V → saw (1.0), NP → NP PP (.2), NP → D N (.7), D → the (1.0),
    N → cat (.3), PP → P NP (1.0), P → with (1.0), NP → glasses (.05)

             1 x .05 x .6 x 1 x .2 x .7 x 1 x .3 x 1 x 1 x .05 = 0.000063 < 0.000126
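A minimal Python sketch (my own illustration, not part of the slides) of the computation above: the probability of a tree is simply the product of the probabilities of the rules it uses. The rule list below corresponds to the first (VP-attachment) tree.

    from math import prod

    # rules used in the VP-attachment tree, with their probabilities
    rules_used = [
        ("S → NP VP", 1.0), ("NP → she", 0.05), ("VP → VP PP", 0.4),
        ("VP → V NP", 0.6), ("V → saw", 1.0), ("NP → D N", 0.7),
        ("D → the", 1.0), ("N → cat", 0.3), ("PP → P NP", 1.0),
        ("P → with", 1.0), ("NP → glasses", 0.05),
    ]

    print(prod(p for _, p in rules_used))   # ≈ 0.000126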
                Estimating PCFG
                  probabilities
•   Supervised training: We can estimate PCFG probabilities from a
    treebank, a corpus manually annotated with constituency
    structure, using maximum likelihood estimates:
        P(A → β) = count(A → β) / count(A)
•   Unsupervised training:
    •   What if we have a grammar and a corpus, but no annotated
        parses?
    •   We can use the inside-outside algorithm to estimate the
        probabilities with EM (not discussed in this course).
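A minimal sketch of the maximum likelihood estimate, assuming the treebank has already been flattened into a list of (lhs, rhs) rule occurrences, one per node in each annotated tree; this representation is hypothetical, not a real treebank reader.

    from collections import Counter

    def estimate_pcfg(rule_occurrences):
        """rule_occurrences: list of (lhs, rhs) pairs, one per tree node."""
        rule_counts = Counter(rule_occurrences)                     # count(A → β)
        lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)    # count(A)
        return {(lhs, rhs): c / lhs_counts[lhs]
                for (lhs, rhs), c in rule_counts.items()}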
        The Penn Treebank
•   Syntactically annotated corpus of newspaper text (1989
    Wall Street Journal Articles).
•   The source text is naturally occurring but the treebank is
    not:
    •   Assumes a specific linguistic theory (although a simple
        one).
    •   Very flat structure (NPs, Ss, VPs).
             PTB Example
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken))
       (, ,)
       (ADJP (NML (CD 61) (NNS years))
       (JJ old))
       (, ,))
     (VP (MD will)
   (VP (VB join)
       (NP (DT the) (NN board))
       (PP-CLR (IN as)
         (NP (DT a) (JJ nonexecutive) (NN director)))
       (NP-TMP (NNP Nov.) (CD 29))))
     (. .)))
         Parsing with PCFG
•   We want to use a PCFG to answer the following questions:
    •   What is the total probability of the sentence under the
        PCFG?
    •   What is the most probable parse tree for a sentence
        under the PCFG? (decoding/parsing)
•   We can modify the CKY algorithm.
    Basic idea: Compute these probabilities bottom-up using
    dynamic programming.
Computing Probabilities Bottom-Up

    [Bottom-up computation for the NP-attachment tree of "she saw the cat
    with glasses"; each node's probability is the rule probability times
    the probabilities of its children:]

    leaves:      NP → she (.05), V → saw (1.0), D → the (1.0),
                 N → cat (.3), P → with (1.0), NP → glasses (.05)
    NP → D N:    .7 x 1 x .3 = .21
    PP → P NP:   1 x 1 x .05 = .05
    NP → NP PP:  .2 x .21 x .05 = .0021
    VP → V NP:   .6 x 1 x .0021 = .00126
    S → NP VP:   1 x .05 x .00126 = .000063
    CKY for PCFG Parsing
•   Let T_G(s, A) be the set of trees generated by grammar G
    starting at nonterminal A, whose yield is string s.
•   Use a chart π so that π[i,j,A] contains the probability of the highest-
    scoring parse tree for the substring s[i,j] starting in nonterminal A.
•   We want to find π[0,length(s),S] -- the probability of the highest-
    scoring parse tree for s rooted in the start symbol S.
    CKY for PCFG Parsing
•   To compute π[0,length(s),S] we can use the following recursive
    definition:
        π[i,j,A] = max over rules A → B C ∈ R and split points i < k < j of
                   P(A → B C) × π[i,k,B] × π[k,j,C]
Base case:  π[i,i+1,A] = P(A → s[i]) if A → s[i] ∈ R, else 0
•   Then fill the chart using dynamic programming.
     CKY for PCFG Parsing
 •   Input: PCFG G=(N, Σ, R, S), input string s of length n.
 •   for i=0…n-1:                                                   initialization
         for A ∈ N:
             π[i,i+1,A] = P(A → s[i]) if A → s[i] ∈ R, else 0
 •   for length=2…n:                                                main loop
       for i=0…(n-length):
            j = i+length
            for k=i+1…j-1:
                for A ∈ N:
                    for each rule A → B C ∈ R:
                        π[i,j,A] = max(π[i,j,A], P(A → B C) × π[i,k,B] × π[k,j,C])
Use backpointers to retrieve the highest-scoring parse tree (see previous lecture).
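The following Python sketch (my own illustration, not the lecture's code) fills the π chart exactly as in the pseudocode above. It assumes the grammar is in CNF and is stored in two hypothetical dictionaries: binary maps a pair of child nonterminals (B, C) to a list of (A, P(A → B C)), and lexical maps a word to a list of (A, P(A → word)).

    from collections import defaultdict

    def pcky(tokens, binary, lexical):
        n = len(tokens)
        pi = defaultdict(float)      # pi[(i, j, A)] = best probability for s[i,j]
        back = {}                    # backpointers for recovering the best tree

        # initialization: spans of length 1
        for i in range(n):
            for A, p in lexical.get(tokens[i], []):
                pi[(i, i + 1, A)] = p
                back[(i, i + 1, A)] = tokens[i]

        # main loop: build longer spans from shorter ones
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                j = i + length
                for k in range(i + 1, j):                      # split point
                    for (B, C), rules in binary.items():
                        pB, pC = pi[(i, k, B)], pi[(k, j, C)]
                        if pB == 0.0 or pC == 0.0:
                            continue
                        for A, p in rules:
                            prob = p * pB * pC
                            if prob > pi[(i, j, A)]:
                                pi[(i, j, A)] = prob
                                back[(i, j, A)] = (k, B, C)
        return pi, back

π[(0, n, 'S')] then holds the probability of the best parse for the whole sentence, and the tree itself can be read off the backpointers as in the previous lecture.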
    Probability of a Sentence
•   What if we are interested in the probability of a sentence,
    not of a single parse tree (for example, because we want
    to use the PCFG as a language model)?
•   Problem: Spurious ambiguity. Need to sum the
    probabilities of all parse trees for the sentence.
•   How do we have to change CKY to compute this?
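One possible answer, sketched as a variant of the pcky function above (same hypothetical grammar representation): summing over split points and rules instead of maximizing yields the total ("inside") probability of the sentence.

    from collections import defaultdict

    def sentence_probability(tokens, binary, lexical, start="S"):
        n = len(tokens)
        pi = defaultdict(float)            # pi[(i, j, A)] = total probability
        for i in range(n):
            for A, p in lexical.get(tokens[i], []):
                pi[(i, i + 1, A)] += p
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                j = i + length
                for k in range(i + 1, j):
                    for (B, C), rules in binary.items():
                        pB, pC = pi[(i, k, B)], pi[(k, j, C)]
                        if pB and pC:
                            for A, p in rules:
                                pi[(i, j, A)] += p * pB * pC   # sum, not max
        return pi[(0, n, start)]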
                 Earley Parser
•   The CKY parser starts with the words and builds parse trees bottom-
    up; it requires the grammar to be in CNF.
•   The Earley parser instead starts at the start symbol and tries
    to “guess” derivations top-down.
    •   It discards derivations that are incompatible with the
        sentence.
    •   The Earley parser sweeps through the sentence left-to-right
        only once. It keeps partial derivations in a table (“chart”).
    •   Allows arbitrary CFGs, no limitation to CNF.
                 Parser States
•   Earley parser keeps track of partial derivations using parser
    states / items.
•   States represent hypotheses about constituent structure based
    on the grammar, taking into account the input.
•   Parser states are represented as dotted rules with spans.
    •   The constituents to the left of the · have already been seen
        in the input string s (corresponding to the span).
 S → · NP VP [0,0]     “According to the grammar, there may be an NP
                        starting in position 0.”
NP → D A · N [0,2]     “There is a determiner followed by an adjective in s[0,2].”
NP → NP PP · [3,8]     “There is a complete NP in s[3,8], consisting of an NP and a PP.”
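A minimal sketch (my own, not from the lecture) of one possible way to represent such parser states in Python; the field names are assumptions.

    from collections import namedtuple

    # lhs: left-hand side, rhs: tuple of right-hand-side symbols,
    # dot: index of the · within rhs, start/end: the span covered so far
    State = namedtuple("State", ["lhs", "rhs", "dot", "start", "end"])

    s1 = State("S", ("NP", "VP"), 0, 0, 0)      # S → · NP VP [0,0]
    s2 = State("NP", ("D", "A", "N"), 2, 0, 2)  # NP → D A · N [0,2]
    s3 = State("NP", ("NP", "PP"), 2, 3, 8)     # NP → NP PP · [3,8]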
        Earley Parser (sketch)

 S   → NP VP      V → saw
VP   → V NP       P → with
VP   → VP PP      D → the
PP   → P NP       N → cat
NP   → D N        N → tail
NP   → NP PP      N → student

Input:    the student saw the cat with the tail
          (word boundary positions numbered 0 through 8)

Three parser operations:
1. Predict new subtrees top-down.
2. Scan input terminals.
3. Complete with passive states.

The operations applied, and the states they add to the chart:

Predict:   S → · NP VP [0,0] (initialization),  NP → · NP PP [0,0],
           NP → · D N [0,0],  D → · the [0,0]
Scan:      D → the · [0,1]       (a passive state: the dot is at the end)
Complete:  NP → D · N [0,1]
Predict:   N → · cat [1,1],  N → · tail [1,1],  N → · student [1,1]
Scan:      N → student · [1,2]
Complete:  NP → D N · [0,2]
Complete:  S → NP · VP [0,2],  NP → NP · PP [0,2]
                  Earley Algorithm
•   Keep track of parser states in a table (“chart”). Chart[k]
    contains the set of all parser states that end in position k.
•   Input: Grammar G=(N, Σ, R, S), input string s of length n.
•   Initialization: For each production S→α ∈R
                   add a state S →·α[0,0] to Chart[0].
•   for i = 0 to n:
    •   for each state in Chart[i]:
        •   if state is of form A →α ·s[i] β [k,i]:
                scan(state)
        •   elif state is of form A →α ·B β [k,i]:
               predict(state)
        •   elif state is of form A →α · [k,i]
               complete(state)
        (If none of the three cases applies, the state has the form
        A → α · a β [k,i] where a is a terminal that is not s[i]; such a state
        is simply left alone and no operation is applied to it.)
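A minimal sketch of the main loop, using the State namedtuple from above and a hypothetical grammar dict {nonterminal: list of right-hand-side tuples}; it relies on the scan, predict and complete helpers sketched below, after the slides that define each operation.

    def earley(tokens, grammar, start="S"):
        n = len(tokens)
        chart = [set() for _ in range(n + 1)]       # Chart[k]: states ending at k
        for rhs in grammar[start]:                  # initialization: S → · α [0,0]
            chart[0].add(State(start, rhs, 0, 0, 0))
        nonterminals = set(grammar)
        for i in range(n + 1):
            agenda = list(chart[i])
            while agenda:
                state = agenda.pop()
                before = set(chart[i])
                if state.dot < len(state.rhs) and state.rhs[state.dot] not in nonterminals:
                    scan(state, tokens, chart)      # dot in front of a terminal
                elif state.dot < len(state.rhs):
                    predict(state, grammar, chart)  # dot in front of a nonterminal
                else:
                    complete(state, chart)          # passive state
                agenda.extend(chart[i] - before)    # process newly added states

        return chart

The input is accepted if Chart[n] contains a passive state S → α · [0,n].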
    Earley Algorithm - Scan
•   The scan operation can only be applied to a state if the dot is
    in front of a terminal symbol that matches the next input
    terminal.
•   function scan(state):      // state is of form A →α ·s[i] β [k,i]
    •   Add a new state A →α s[i]·β [k,i+1]
        to Chart[i+1]
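A possible implementation of scan under the same assumptions as the sketch above; it silently does nothing when the terminal after the dot does not match the next input token.

    def scan(state, tokens, chart):
        i = state.end
        # the dot must be in front of a terminal that matches the next input token
        if i < len(tokens) and state.rhs[state.dot] == tokens[i]:
            chart[i + 1].add(State(state.lhs, state.rhs, state.dot + 1,
                                   state.start, i + 1))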
        Earley Algorithm - Predict
•       The predict operation can only be applied to a state if the dot is
        in front of a non-terminal symbol.
    •    function predict(state):   // state is of form A →α ·B β [k,i]
         •   For each production B → γ ∈ R, add a new state B →· γ [i,i]
             to Chart[i]
•       Note that this modifies Chart[i] while the algorithm is looping
        through it.
•       No duplicate states are added (Chart[i] is a set)
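A possible implementation of predict under the same assumptions; because Chart[i] is a set, adding an already-present state is a no-op.

    def predict(state, grammar, chart):
        i = state.end
        b = state.rhs[state.dot]                   # the nonterminal after the dot
        for rhs in grammar.get(b, []):             # every production B → γ
            chart[i].add(State(b, rhs, 0, i, i))   # add B → · γ [i,i]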
    Earley Algorithm - Complete
•   The complete operation may only be applied to a passive item.
•   function complete(state):       // state is of form A →α · [k,j]
    •   for each state B → β ·A γ [i,k] add a new state
        B → β A · γ[i,j] to Chart[j]
•   Note that this modifies Chart[j], which is the chart entry the algorithm
    is currently looping through (j = i in the main loop).
•   Note that it is important to make a copy of the old state
    before moving the dot.
•   This operation is similar to the combination operation in CKY!
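A possible implementation of complete under the same assumptions; since State is an immutable tuple, "moving the dot" always creates a fresh copy of the old state, and the list() call guards against modifying a set while iterating over it.

    def complete(state, chart):                         # state is A → α · [k,j]
        j = state.end
        for st in list(chart[state.start]):             # states B → β · A γ [i,k]
            if st.dot < len(st.rhs) and st.rhs[st.dot] == state.lhs:
                chart[j].add(State(st.lhs, st.rhs, st.dot + 1, st.start, j))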
    Earley Algorithm - Runtime
•   The runtime depends on the number of items in the chart
    (each item is “visited” exactly once).
•   We proceed through the input exactly once, which takes
    O(N).
    •   For each position in the chart, there are O(N) possible split
        points (start positions) for an item.
    •   Each complete operation can produce O(N) possible new
        items (with different starting points).
•   Total: O(N3)
                 Earley Algorithm -
                Some Observations
•   How do we recover parse trees?
    •   What happens in case of ambiguity?
        •   Multiple ways to Complete the same state.
    •   Keep back-pointers in the parser state objects.
    •   Or use a separate data structure (CKY-style table or
        hashed states)
•   How do we make the algorithm work with PCFG?
    •   It is easy to compute probabilities during Complete: follow the
        backpointer with the maximum probability.