CS 432
Fall 2018
Mike Lam, Professor
            Finite Automata Conversions
                     and Lexing
Finite Automata
●
    Key result: all of the following have the same expressive
    power (i.e., they all describe regular languages):
    –   Regular expressions (REs)
    –   Non-deterministic finite automata (NFAs)
    –   Deterministic finite automata (DFAs)
●
    Proof by construction
    –   An algorithm exists to convert any RE to an NFA
    –   An algorithm exists to convert any NFA to a DFA
    –   An algorithm exists to convert any DFA to an RE
    –   For every regular language, there exists a minimal DFA
        ●
            Has the fewest number of states of all DFAs equivalent to RE
Finite Automata
●
     Finite automata transitions:
                                Kleene's construction
                                             Hopcroft's
                                             algorithm
                                               (minimize)
             Regex                      NFA                             DFA                Lexer
                      Thompson's                   Subset                       Lexer
                      construction               construction                 generators
                                           Brzozowski's algorithm
                                              (direct to minimal DFA)
    (dashed lines indicate transitions to a minimized DFA)
Finite Automata Conversions
●
    RE to NFA: Thompson's construction
    –   Core insight: inductively build up NFA using “templates”
    –   Core concept: use null transitions to build NFA quickly
●
    NFA to DFA: Subset construction
    –   Core insight: DFA nodes represent subsets of NFA nodes
    –   Core concept: use null closure to calculate subsets
●
    DFA minimization: Hopcroft’s algorithm
    –   Core insight: create partitions, then keep splitting
●
    DFA to RE: Kleene's construction
    –   Core insight: repeatedly eliminate states by combining regexes
Thompson's Construction
●
    Basic idea: create NFA inductively, bottom-up
    –   Base case:
        ●
            Start with individual alphabet symbols (see below)
    –   Inductive case:
        ●
            Combine by adding new states and null/epsilon transitions
        ●
            Templates for the three basic operations
    –   Invariant:
        ●
            The NFA always has exactly one start state and one accepting state
                                        a
Thompson's: Concatenation
      A              B
Thompson's: Concatenation
             AB
Thompson's: Union
   B
Thompson's: Union
  A|B
Thompson's: Closure
              A
Thompson's: Closure
              A*
                   ε
Thompson's Construction
   Base case
                       Concatenation
               Union
                                Closure
                            ε
Subset construction
●
    Basic idea: create DFA incrementally
    –   Each DFA state represents a subset of NFA states
    –   Use null closure operation to “collapse” null/epsilon transitions
    –   Null closure: all states reachable via epsilon transitions
        ●
            Essentially: where can we go “for free?”
        ●
            Formally: ε-closure(s) = {s} ∪ { t ∈ S | (s,ε→t) ∈ δ }
    –   Simulates running all possible paths through the NFA
                                                     Null closure of A = { A }
                                                     Null closure of B = { B, D }
                                                     Null closure of C =
                                                     Null closure of D =
Subset construction
●
    Basic idea: create DFA incrementally
    –   Each DFA state represents a subset of NFA states
    –   Use null closure operation to “collapse” null/epsilon transitions
    –   Null closure: all states reachable via epsilon transitions
        ●
            Essentially: where can we go “for free?”
        ●
            Formally: ε-closure(s) = {s} ∪ { t ∈ S | (s,ε→t) ∈ δ }
    –   Simulates running all possible paths through the NFA
                                                     Null closure of A = { A }
                                                     Null closure of B = { B, D }
                                                     Null closure of C = { C, D }
                                                     Null closure of D = { D }
Subset construction
●
    Basic idea: create DFA incrementally
    –   Each DFA state represents a subset of NFA states
    –   Use null closure operation to “collapse” null/epsilon transitions
    –   Null closure: all states reachable via epsilon transitions
        ●
            Essentially: where can we go “for free?”
        ●
            Formally: ε-closure(s) = {s} ∪ { t ∈ S | (s,ε→t) ∈ δ }
    –   Simulates running all possible paths through the NFA
                                                     Null closure of A = { A }
                                                     Null closure of B = { B, D }
                                                     Null closure of C = { C, D }
                                                     Null closure of D = { D }
Formal Algorithm
SubsetConstruction(S, Σ, s0, SA, δ):
    t0 := ε-closure(s0)
    S' := { t0 }     S'A := ∅     W := { t0 }
    while W ≠ ∅:
       choose u in W and remove it from W
       for each c in Σ:
         t := ε-closure(δ(u,c))
         δ'(u,c) = t
         if t is not in S' then
            add t to S’ and W
            add t to S'A if any state in t is also in SA
    return (S', Σ, t0, S'A, δ')
Subset Example
Subset Example
Subset Example
                {B,D}
            a
      {A}
            b
                {C,D}
Subset Example
         SubsetConstruction(S, Σ, s0, SA, δ):
             t0 := ε-closure(s0)
             S' := { t0 }     S'A := ∅      W := { t0 }
             while W ≠ ∅:
                choose u in W and remove it from W
                for each c in Σ:
                   t := ε-closure(δ(u,c))
                   δ'(u,c) = t
                   if t is not in S' then
                      add t to S’ and W
                      add t to S'A if there exists a state v in t that is also in SA
             return (S', Σ, t0, S'A, δ')
Subset Example
                             {B,D,E}           a
                         a               a
                 {A,E}
                              b              {E}
                                     b
                         b
                             {C,D}
Algorithms
●
    Subset construction is a fixed-point algorithm
    –   Textbook: “Iterated application of a monotone function”
    –   Basically: A loop that is mathematically guaranteed to
        terminate at some point
    –   When it terminates, some desirable property holds
        ●
            In the case of subset construction: the NFA has been
            converted to a DFA!
    Hopcroft’s DFA Minimization
●
    Split into two partitions (final & non-final)
●
    Keep splitting a partition while there are states with differing behaviors
    –   Two states transition to differing partitions on the same symbol
    –   Or one state transitions on a symbol and another doesn’t
●
    When done, each partition becomes a single state
                             {B,D}
                       a
                 {A}                      Same behavior; collapse!
                       b
                              {C,D}
                                                               a,b    {B,C,D}
                                                         {A}
    Hopcroft’s DFA Minimization
●
    Split into two partitions (final & non-final)
●
    Keep splitting a partition while there are states with differing behaviors
    –   Two states transition to differing partitions on the same symbol
    –   Or one state transitions on a symbol and another doesn’t
●
    When done, each partition becomes a single state
                             {B,D}
                       a
                                                                                     a
                                           Differing behavior on
                 {A}                       ‘a’; split partition!             {B,D}
                                                                         a
                       b
                              {C,D}                                {A}
                                                                         b
                                                                             {C,D}
    Kleene's Construction
●
    Replace edge labels with REs
     –   "a" → "a" and "a,b" → "a|b"
●
    Eliminate states by combining REs
     –   See pattern below; apply pairwise around each state to be eliminated
     –   Repeat until only one or two states remain
●
    Build final RE
     –   One state with "A" self-loop → "A*"
     –   Two states: see pattern below
                                  B                                       A       C
         Eliminating          A        C              Combining final         B
         states:                                      two states:
                                  D                                           D
                               AB*C|D                                    A*B(C|DA*B)*
Brzozowski’s Algorithm
●
    Direct NFA → minimal DFA conversion
●
    Sub-procedures:
    –   Reverse(n): invert all transitions in NFA n, adding a new start
        state connected to all old final states
    –   Subset(n): apply subset construction to NFA n
    –   Reach(n): remove any part of NFA n unreachable from start state
●
    Apply them all in order three times to get minimal DFA
    –   First time eliminates duplicate suffixes
    –   Second time eliminates duplicate prefixes
    –   MinDFA(n) = Reach(Subset(Reverse(Reach(Subset(Reverse(n))))))
    –   Potentially easier to code than Hopcroft’s algorithm
Brzozowski’s Algorithm
●
    MinDFA(n) = Reach(Subset(Reverse(Reach(Subset(Reverse(n))))))
    Example from
    EAC (p.76)
NFA/DFA complexity
●
    What are the time and space requirements to...
    –   Build an NFA?
    –   Run an NFA?
    –   Build a DFA?
    –   Run a DFA?
                        aa*|b                      {B,D}
                                               a
                                         {A}
                            ε
                                               b
                                                   {C,D}
    NFA/DFA complexity
●
    Thompson's construction
    –   At most two new states and four transitions per regex character
    –   Thus, a linear space increase with respect to the # of regex characters
    –   Constant # of operations per increase means linear time as well
●
    NFA execution
    –   Proportional to both NFA size and input string size
    –   Must track multiple simultaneous “current” states
●
    Subset construction
    –   Potential exponential state space explosion
    –   A n-state NFA could require up to 2n DFA states
    –   However, this rarely happens in practice
●
    DFAs execution
    –   Proportional to input string size only (only track a single “current” state)
NFA/DFA complexity
●
    NFAs build quicker (linear) but run slower
    –   Better if you will only run the FA a few times
    –   Or if you need features that are difficult to implement with DFAs
●
    DFAs build slower but run faster (linear)
    –   Better if you will run the FA many times
                                      NFA                    DFA
            Build time                O(m)                   O(2m)
            Run time                 O(m×n)                  O(n)
                          m = length of regular expression
                          n = length of input string
Lexers
●
    Auto-generated
    –   Table-driven: generic scanner, auto-generated tables
    –   Direct-coded: hard-code transitions using jumps
    –   Common tools: lex/flex and similar
●
    Hand-coded
    –   Better I/O performance (i.e., buffering)
    –   More efficient interfacing w/ other phases
    –   This is what we’ll do for P2
Handling Keywords
●
    Issue: keywords are valid identifiers
●
    Option 1: Embed into NFA/DFA
    –   Separate regex for keywords
    –   Easier/faster for generated scanners
●
    Option 2: Use lookup table
    –   Scan as identifier then check for a keyword
    –   Easier for hand-coded scanners
    –   (Thus, this is probably easier for P2)
Handling Whitespace
●
    Issue: whitespace is usually ignored
    –   Write a regex and remove it before each new token
●
    Side effect: some results are counterintuitive
    –   Is this a valid token? “3abc”
    –   For now, it’s actually two!
    –   We’ll reject them later, in the parsing phase