CS 432
Fall 2018
Mike Lam, Professor
Finite Automata Conversions
and Lexing
Finite Automata
●
Key result: all of the following have the same expressive
power (i.e., they all describe regular languages):
– Regular expressions (REs)
– Non-deterministic finite automata (NFAs)
– Deterministic finite automata (DFAs)
●
Proof by construction
– An algorithm exists to convert any RE to an NFA
– An algorithm exists to convert any NFA to a DFA
– An algorithm exists to convert any DFA to an RE
– For every regular language, there exists a minimal DFA
●
Has the fewest number of states of all DFAs equivalent to RE
Finite Automata
●
Finite automata transitions:
Kleene's construction
Hopcroft's
algorithm
(minimize)
Regex NFA DFA Lexer
Thompson's Subset Lexer
construction construction generators
Brzozowski's algorithm
(direct to minimal DFA)
(dashed lines indicate transitions to a minimized DFA)
Finite Automata Conversions
●
RE to NFA: Thompson's construction
– Core insight: inductively build up NFA using “templates”
– Core concept: use null transitions to build NFA quickly
●
NFA to DFA: Subset construction
– Core insight: DFA nodes represent subsets of NFA nodes
– Core concept: use null closure to calculate subsets
●
DFA minimization: Hopcroft’s algorithm
– Core insight: create partitions, then keep splitting
●
DFA to RE: Kleene's construction
– Core insight: repeatedly eliminate states by combining regexes
Thompson's Construction
●
Basic idea: create NFA inductively, bottom-up
– Base case:
●
Start with individual alphabet symbols (see below)
– Inductive case:
●
Combine by adding new states and null/epsilon transitions
●
Templates for the three basic operations
– Invariant:
●
The NFA always has exactly one start state and one accepting state
a
Thompson's: Concatenation
A B
Thompson's: Concatenation
AB
Thompson's: Union
B
Thompson's: Union
A|B
Thompson's: Closure
A
Thompson's: Closure
A*
ε
Thompson's Construction
Base case
Concatenation
Union
Closure
ε
Subset construction
●
Basic idea: create DFA incrementally
– Each DFA state represents a subset of NFA states
– Use null closure operation to “collapse” null/epsilon transitions
– Null closure: all states reachable via epsilon transitions
●
Essentially: where can we go “for free?”
●
Formally: ε-closure(s) = {s} ∪ { t ∈ S | (s,ε→t) ∈ δ }
– Simulates running all possible paths through the NFA
Null closure of A = { A }
Null closure of B = { B, D }
Null closure of C =
Null closure of D =
Subset construction
●
Basic idea: create DFA incrementally
– Each DFA state represents a subset of NFA states
– Use null closure operation to “collapse” null/epsilon transitions
– Null closure: all states reachable via epsilon transitions
●
Essentially: where can we go “for free?”
●
Formally: ε-closure(s) = {s} ∪ { t ∈ S | (s,ε→t) ∈ δ }
– Simulates running all possible paths through the NFA
Null closure of A = { A }
Null closure of B = { B, D }
Null closure of C = { C, D }
Null closure of D = { D }
Subset construction
●
Basic idea: create DFA incrementally
– Each DFA state represents a subset of NFA states
– Use null closure operation to “collapse” null/epsilon transitions
– Null closure: all states reachable via epsilon transitions
●
Essentially: where can we go “for free?”
●
Formally: ε-closure(s) = {s} ∪ { t ∈ S | (s,ε→t) ∈ δ }
– Simulates running all possible paths through the NFA
Null closure of A = { A }
Null closure of B = { B, D }
Null closure of C = { C, D }
Null closure of D = { D }
Formal Algorithm
SubsetConstruction(S, Σ, s0, SA, δ):
t0 := ε-closure(s0)
S' := { t0 } S'A := ∅ W := { t0 }
while W ≠ ∅:
choose u in W and remove it from W
for each c in Σ:
t := ε-closure(δ(u,c))
δ'(u,c) = t
if t is not in S' then
add t to S’ and W
add t to S'A if any state in t is also in SA
return (S', Σ, t0, S'A, δ')
Subset Example
Subset Example
Subset Example
{B,D}
a
{A}
b
{C,D}
Subset Example
SubsetConstruction(S, Σ, s0, SA, δ):
t0 := ε-closure(s0)
S' := { t0 } S'A := ∅ W := { t0 }
while W ≠ ∅:
choose u in W and remove it from W
for each c in Σ:
t := ε-closure(δ(u,c))
δ'(u,c) = t
if t is not in S' then
add t to S’ and W
add t to S'A if there exists a state v in t that is also in SA
return (S', Σ, t0, S'A, δ')
Subset Example
{B,D,E} a
a a
{A,E}
b {E}
b
b
{C,D}
Algorithms
●
Subset construction is a fixed-point algorithm
– Textbook: “Iterated application of a monotone function”
– Basically: A loop that is mathematically guaranteed to
terminate at some point
– When it terminates, some desirable property holds
●
In the case of subset construction: the NFA has been
converted to a DFA!
Hopcroft’s DFA Minimization
●
Split into two partitions (final & non-final)
●
Keep splitting a partition while there are states with differing behaviors
– Two states transition to differing partitions on the same symbol
– Or one state transitions on a symbol and another doesn’t
●
When done, each partition becomes a single state
{B,D}
a
{A} Same behavior; collapse!
b
{C,D}
a,b {B,C,D}
{A}
Hopcroft’s DFA Minimization
●
Split into two partitions (final & non-final)
●
Keep splitting a partition while there are states with differing behaviors
– Two states transition to differing partitions on the same symbol
– Or one state transitions on a symbol and another doesn’t
●
When done, each partition becomes a single state
{B,D}
a
a
Differing behavior on
{A} ‘a’; split partition! {B,D}
a
b
{C,D} {A}
b
{C,D}
Kleene's Construction
●
Replace edge labels with REs
– "a" → "a" and "a,b" → "a|b"
●
Eliminate states by combining REs
– See pattern below; apply pairwise around each state to be eliminated
– Repeat until only one or two states remain
●
Build final RE
– One state with "A" self-loop → "A*"
– Two states: see pattern below
B A C
Eliminating A C Combining final B
states: two states:
D D
AB*C|D A*B(C|DA*B)*
Brzozowski’s Algorithm
●
Direct NFA → minimal DFA conversion
●
Sub-procedures:
– Reverse(n): invert all transitions in NFA n, adding a new start
state connected to all old final states
– Subset(n): apply subset construction to NFA n
– Reach(n): remove any part of NFA n unreachable from start state
●
Apply them all in order three times to get minimal DFA
– First time eliminates duplicate suffixes
– Second time eliminates duplicate prefixes
– MinDFA(n) = Reach(Subset(Reverse(Reach(Subset(Reverse(n))))))
– Potentially easier to code than Hopcroft’s algorithm
Brzozowski’s Algorithm
●
MinDFA(n) = Reach(Subset(Reverse(Reach(Subset(Reverse(n))))))
Example from
EAC (p.76)
NFA/DFA complexity
●
What are the time and space requirements to...
– Build an NFA?
– Run an NFA?
– Build a DFA?
– Run a DFA?
aa*|b {B,D}
a
{A}
ε
b
{C,D}
NFA/DFA complexity
●
Thompson's construction
– At most two new states and four transitions per regex character
– Thus, a linear space increase with respect to the # of regex characters
– Constant # of operations per increase means linear time as well
●
NFA execution
– Proportional to both NFA size and input string size
– Must track multiple simultaneous “current” states
●
Subset construction
– Potential exponential state space explosion
– A n-state NFA could require up to 2n DFA states
– However, this rarely happens in practice
●
DFAs execution
– Proportional to input string size only (only track a single “current” state)
NFA/DFA complexity
●
NFAs build quicker (linear) but run slower
– Better if you will only run the FA a few times
– Or if you need features that are difficult to implement with DFAs
●
DFAs build slower but run faster (linear)
– Better if you will run the FA many times
NFA DFA
Build time O(m) O(2m)
Run time O(m×n) O(n)
m = length of regular expression
n = length of input string
Lexers
●
Auto-generated
– Table-driven: generic scanner, auto-generated tables
– Direct-coded: hard-code transitions using jumps
– Common tools: lex/flex and similar
●
Hand-coded
– Better I/O performance (i.e., buffering)
– More efficient interfacing w/ other phases
– This is what we’ll do for P2
Handling Keywords
●
Issue: keywords are valid identifiers
●
Option 1: Embed into NFA/DFA
– Separate regex for keywords
– Easier/faster for generated scanners
●
Option 2: Use lookup table
– Scan as identifier then check for a keyword
– Easier for hand-coded scanners
– (Thus, this is probably easier for P2)
Handling Whitespace
●
Issue: whitespace is usually ignored
– Write a regex and remove it before each new token
●
Side effect: some results are counterintuitive
– Is this a valid token? “3abc”
– For now, it’s actually two!
– We’ll reject them later, in the parsing phase