2.
Scanning (Lexical Analysis)
Downloaded by Mamdouh Farghaly (mamfarouk3@gmail.com)
Contents
Finite Automata
From Regular Expressions to DFAs
From an NFA to a DFA
Lookahead, Backtracking, and
Nondeterministic Automata
Downloaded by Mamdouh Farghaly (mamfarouk3@gmail.com)
A Typical Action of DFA Algorithm
• Making a transition: move the character from the input
string to a string that accumulates the characters
belonging to a single token (the token string value or
lexeme of the token)
• Reaching an accepting state: return the token just
recognized, along with any associated attributes.
• Reaching an error state: either back up in the input
(backtracking) or to generate an error token.
letter
letter [other]
start in_id finish return ID
digit
Finite automation for an
identifier with delimiter and
return value
• The error state represents the fact that either
an identifier is not to be recognized (if came
from the start state) or a delimiter has been seen
and we should now accept and generate an
identifier-token.
• [other]: indicate that the delimiting character
should be considered look-ahead, it should be
returned to the input string and not consumed.
letter
letter [other]
start in_id finish return ID
digit
Finite automation for an
identifier with delimiter and
return value
• This diagram also expresses the principle of
longest sub-string described in Section 2.2.4:
the DFA continues to match letters and digits (in
state in_id) until a delimiter is found.
• By contrast the old diagram allowed the DFA to accept at any point
while reading an identifier string.
letter
letter
letter
letter [other]
star In-id
start in_id finish return ID
t
digit
digit
How to arrive at the start state in
the first place
(combine all the tokens into one DFA)
Each of these tokens begins with a
different character
• Consider the tokens given by
: =
return ASSIGN
the strings : =, <=, and = < =
return LE
• Each of these is a fixed =
string, and DFAs for them return EQ
can be written as right
=
return ASSIGN
:
• Uniting all of their start < =
return LE
states into a single start state =
to get the DFA return EQ
Several tokens beginning with
the same character
=
• They cannot be <
return LE
simply written as the < >
return NE
right diagram, since <
it is not a DFA return LT
• The diagram can be
return LE
< >
rearranged into a return NE
DFA [other] return LT
Expand the Definition of a Finite
Automaton
• One solution for the problem is to expand
the definition of a finite automaton
• More than one transition from a state
may exist for a particular character
(NFA: non-deterministic finite automaton,)
• Developing an algorithm for systematically
turning these NFA into DFAs
ε-transition
• A transition that may occur without consulting the
input string (and without consuming any characters)
It may be viewed as a "match" of the empty string.
( This should not be confused with a match of the
characterεin the input)
ε-Transitions Used in Two Ways.
• First: to express a choice of : =
alternatives in a way without
combining states < =
– Advantage: keeping the
=
original automata intact
and only adding a new
start state to connect them
• Second: to explicitly
describe a match of the
empty string.
Definition of NFA
• An NFA (non-deterministic finite automaton) M consists
of
– an alphabet , a set of states S,
– a transition function T: S x ( U{ε})℘(S),
– a start state s0 from S, and a set of accepting states A from S
• The language accepted by M, written L(M),
– is defined to be the set of strings of characters c1c2…. cn with
– each ci from U{ε}such that
– there exist states s1 in T(s0 ,c1), s2 in (s1, c2),..., sn in T(sn-1 , cn)
with sn an element of A.
• Any of the cI in c1c2……cn may beε,and
the string that is actually accepted is the string c,c2. . .cn with theε's
removed (since the concatenation of s withε is s itself).
Thus, the string c,c2.. .cn may actually have fewer than n
characters in it
• The sequence of states s1,..., sn are chosen from the sets of
states T(sQ , c1),..., T(sn-1, cn), and this choice will not
always be uniquely determined.
The sequence of transitions that accepts a particular string is
not determined at each step by the state and the next input
character.
Indeed, arbitrary numbers ofε's can be introduced into the string at
any point, corresponding to any number ofε-transitions in the NFA.
• An NFA does not represent an algorithm.
However, it can be simulated by an algorithm
that backtracks through every non-deterministic
choice.
Example 2.10
• The string abb can be accepted by either 2
of the following sequences of transitions: a b
a b ε b a
1 3 4
→1→2→4→2→4
aε ε bεb
→1→3→4→2→4→2→4
• This NFA accepts the languages as
a
follows:
regular expression: (a|ε)b*
ab+|ab*|b* b b
• Left DFA accepts the same language.
b
Example 2.11
• It accepts the string acab by making the
following transitions:
– (1)(2)(3)a(4)(7)(2)(5)(6)c(7)(2)(3)a(4)(7)(8)(9)b(10)
• It accepts the same language as that generated by
the regular expression : (a | c) *b
a
3 4
b
1 2 7 8 9 10
c
5 6
2.4 From Regular Expression To
DFAs
Main Purpose
• Study an algorithm:
– Translating a regular expression into a DFA via
NFA.
Regular Program
NFA DFA
Expression
2.4.1 From a Regular Expression
to an NFA
The Idea of Thompson’s
Construction
• Use ε-transitions
– to “glue together” the machine of each piece of a regular
expression
– to form a machine that corresponds to the whole expression
• Basic regular expression
– The NFAs for basic regular expression of the form a, ε,or φ
a
The Idea of Thompson’s
Construction
• Concatenation: to construct an NFA equal to rs
– To connect the accepting state of the machine of r to
the start state of the machine of s by anε-transition.
– The start state of the machine of r as its start state and
the accepting state of the machine of s as its accepting
state.
– This machine accepts L(rs) = L(r)L(s) and so
corresponds to the regular expression rs.
r s
… …
The Idea of Thompson’s
Construction
• Choice among alternatives: To construct an NFA
equal to r | s
– To add a new start state and a new accepting state and
connected them as shown usingε-transitions.
– Clearly, this machine accepts the language L(r|s)
=L(r )UL ( s), and so corresponds to the regular
expression r|s.
r
…
s
…
The Idea of Thompson’s
Construction
• Repetition: Given a machine that corresponds to r,
Construct a machine that corresponds to r*
– To add two new states, a start state and an accepting state.
– The repetition is afforded by the newε-transition from the
accepting state of the machine of r to its start state.
– To draw an ε-transition from the new start state to the new
accepting state.
– This construction is not unique, simplifications are possible in the
many cases.
r
…
Examples of NFAs Construction
Example 1.12: Translate regular expression ab|a into NFA
a
a b
a b
a
Examples of NFAs Construction
Example 1.13: Translate regular expression letter(letter|digit)* into NFA
letter
letter
digit
letter
letter
letter
letter
letter
letter
2.4.2 From an NFA to a DFA
Goal and Methods
• Goal
– Given an arbitrary NFA, construct an equivalent DFA. (i.e., one
that accepts precisely the same strings)
• Some methods
– (1) Eliminating -transitions
• -closure: the set of all states reachable by -transitions from a state
or states
– (2) Eliminating multiple transitions from a state on a single input
character.
• Keeping track of the set of states that are reachable by matching a
single character
– Both these processes lead us to consider sets of states instead of
single states. Thus, it is not surprising that the DFA we construct
has sets of states of the original NFA as its states.
The Algorithm Called Subset
Construction.
• The -closure of a Set of states:
– The -closure of a single state s is the set of states
reachable by a series of zero or more -transitions,
and we write this set as . s
• Example 2.14: regular a*
a
1 2 3 4
The algorithm called subset
construction.
a
1 2 3 4
1 = { 1,2,4}, 2 ={2}, 3 ={2,3,4}, and 4 ={4}.
The -closure of a set of states : the union of the -closures of each individual state.
S= ∪s
sin S
{1,3} = 1 3 = {1,2,3}{2,3,4}={1,2,3,4}
The Subset Construction Algorithm
(1) Compute the -closure of the start state of M; to obtain new state M .
(2) For this set, and for each subsequent set, compute transitions on
characters a as follows.
Given a set S of states and a character a in the alphabet,
Compute the set
Sa = { t | for some s in S there is a transition from s to t on a }.
Then, compute Sa ' , the -closure of Sa.
This defines a new state in the subset construction, together with
a new transition S Sa ' .
(3) Continue with this process until no new states or transitions are created.
(4) Mark as accepting those states constructed in this manner that contain
an accepting state of M.
Examples of Subset Construction
a
1 2 3 4
M -closure of M ( S ) Sa
1 1,2,4 3
3 2,3,4 3
a
a
{1,2,4} {2,3,4}
Examples of Subset Construction
a b
2 3 4 5
1
a
6 7
M -closure of M (S) Sa Sb
1 1,2,6 3,7
3,7 3,4,7,8 5
5 5,8
a b
{1,2,6} {3,4,7,8} {5,8}
Examples of Subset Construction
letter
5 6
letter
1 2 3 4 9
letter
7 8
M -closure of M (S) Sletter Sdigit
1 1 2
2 2,3,4,5,7,10 6 8
6 4,5,6,7,9,10 6 8
lett er
8 4,5,7,8,9,10 6 8
letter {4,5,6,7,9,10}
letter
{1} {2,3,4,5,7,10} digit letter
digit {4,5,7,8,9,10}
2.4.3 Simulating an NFA
using the Subset
Construction
One Way of Simulating an NFA
• NFAs can be implemented in similar ways to
DFAs, except that NFAs are nondeterministic
– Many different sequences of transitions that
must be tried.
– Store up transitions that have not yet been tried
and backtrack to them on failure.
An Other Way of Simulating an NFA
• Use the subset construction
– Instead of constructing all the states of the associated
DFA
– Construct only the state at each point that is indicated
by the next input character
• The advantage: Not need to construct the entire DFA
– Example: input single character a, construct the start
state {1,2,6}and then the second state {3,4,7,8} to
move and match the a.
– Since no following b, accept without generating the
state {5,8}
a b
{1,2,6} {3,4,7,8} {5,8}
An Other Way of Simulating an NFA
• The disadvantage: A state may be constructed many times, if the path
contains loops
– Example: given the input string r2d3, the sequence of states as showing
below letter
letter {4,5,6,7,9,10}
letter
{1} {2,3,4,5,7,10} digit letter
digit {4,5,7,8,9,10}
digit
• If these states are constructed as the transitions occur, then the states
of the DFA have been constructed and the state {4,5,7,8,9,10}has even
been constructed twice
– Less efficient than constructing the entire DFA
End of Chapter Two
THANKS