2.
Scanning (Lexical Analysis)
        Downloaded by Mamdouh Farghaly (mamfarouk3@gmail.com)
              Contents
Finite Automata
From Regular Expressions to DFAs
From an NFA to a DFA
Lookahead, Backtracking, and
Nondeterministic Automata
         Downloaded by Mamdouh Farghaly (mamfarouk3@gmail.com)
A Typical Action of DFA Algorithm
    •   Making a transition: move the character from the input
        string to a string that accumulates the characters
        belonging to a single token (the token string value or
        lexeme of the token)
    •   Reaching an accepting state: return the token just
        recognized, along with any associated attributes.
    •   Reaching an error state: either back up in the input
        (backtracking) or to generate an error token.
                                         letter
                        letter                   [other]
                start            in_id                     finish   return ID
                                         digit
    Finite automation for an
     identifier with delimiter and
     return value
•    The error state represents the fact that either
     an identifier is not to be recognized (if came
     from the start state) or a delimiter has been seen
     and we should now accept and generate an
     identifier-token.
•    [other]: indicate that the delimiting character
     should be considered look-ahead, it should be
     returned to the input string and not consumed.
                                       letter
                      letter                   [other]
              start            in_id                     finish   return ID
                                       digit
        Finite automation for an
         identifier with delimiter and
         return value
•        This diagram also expresses the principle of
         longest sub-string described in Section 2.2.4:
         the DFA continues to match letters and digits (in
         state in_id) until a delimiter is found.
•        By contrast the old diagram allowed the DFA to accept at any point
         while reading an identifier string.
                         letter
                                                                                           letter
                                                                       letter
        letter                   [other]
                                                                star            In-id
start            in_id                     finish   return ID
                                                                t
                                                                                        digit
                         digit
How to arrive at the start state in
        the first place
  (combine all the tokens into one DFA)
     Each of these tokens begins with a
             different character
• Consider the tokens given by
                                           :          =
                                                                 return ASSIGN
 the strings : =, <=, and =                <          =
                                                                 return LE
• Each of these is a fixed                 =
 string, and DFAs for them                           return EQ
 can be written as right
                                                 =
                                                                 return ASSIGN
                                   :
• Uniting all of their start           <         =
                                                                 return LE
states into a single start state       =
to get the DFA                                 return EQ
  Several tokens beginning with
           the same character
                                    =
• They cannot be          <
                                                  return LE
  simply written as the   <        >
                                                  return NE
  right diagram, since    <
  it is not a DFA                 return LT
• The diagram can be
                                                     return LE
                              <               >
  rearranged into a                                 return NE
  DFA                               [other]          return LT
 Expand the Definition of a Finite
          Automaton
• One solution for the problem is to expand
  the definition of a finite automaton
• More than one transition from a state
  may exist for a particular character
  (NFA: non-deterministic finite automaton,)
• Developing an algorithm for systematically
  turning these NFA into DFAs
                ε-transition
• A transition that may occur without consulting the
  input string (and without consuming any characters)
                     
  It may be viewed as a "match" of the empty string.
  ( This should not be confused with a match of the
  characterεin the input)
ε-Transitions Used in Two Ways.
 • First: to express a choice of                :   =
   alternatives in a way without    
   combining states                            <   =
    – Advantage: keeping the            
                                                =
      original automata intact
      and only adding a new
      start state to connect them
 • Second: to explicitly                    
   describe a match of the
   empty string.
                Definition of NFA
• An NFA (non-deterministic finite automaton) M consists
  of
   – an alphabet , a set of states S,
   – a transition function T: S x ( U{ε})℘(S),
   – a start state s0 from S, and a set of accepting states A from S
• The language accepted by M, written L(M),
   – is defined to be the set of strings of characters c1c2…. cn with
   – each ci from  U{ε}such that
   – there exist states s1 in T(s0 ,c1), s2 in (s1, c2),..., sn in T(sn-1 , cn)
     with sn an element of A.
• Any of the cI in c1c2……cn may beε,and
      the string that is actually accepted is the string c,c2. . .cn with theε's
      removed (since the concatenation of s withε is s itself).
      Thus, the string c,c2.. .cn may actually have fewer than n
      characters in it
• The sequence of states s1,..., sn are chosen from the sets of
  states T(sQ , c1),..., T(sn-1, cn), and this choice will not
  always be uniquely determined.
      The sequence of transitions that accepts a particular string is
      not determined at each step by the state and the next input
      character.
      Indeed, arbitrary numbers ofε's can be introduced into the string at
      any point, corresponding to any number ofε-transitions in the NFA.
•   An NFA does not represent an algorithm.
    However, it can be simulated by an algorithm
    that backtracks through every non-deterministic
    choice.
Example 2.10
• The string abb can be accepted by either                      2
   of the following sequences of transitions:       a                           b
          a b ε b                                       a               
                                                1               3                   4
        →1→2→4→2→4
                                                            
           aε ε bεb
        →1→3→4→2→4→2→4
•   This NFA accepts the languages as
                                                                    a
    follows:
     regular expression: (a|ε)b*
                        ab+|ab*|b*                          b               b
•   Left DFA accepts the same language.
                                                                                b
Example 2.11
• It accepts the string acab by making the
  following transitions:
   – (1)(2)(3)a(4)(7)(2)(5)(6)c(7)(2)(3)a(4)(7)(8)(9)b(10)
• It accepts the same language as that generated by
  the regular expression : (a | c) *b
                                     a
                                3       4   
                                                                  b
                 1       2                       7       8       9       10
                                     c
                                5       6
                                             
                                     
2.4 From Regular Expression To
            DFAs
              Main Purpose
• Study an algorithm:
  – Translating a regular expression into a DFA via
    NFA.
  Regular                                 Program
                NFA          DFA
 Expression
2.4.1 From a Regular Expression
           to an NFA
         The Idea of Thompson’s
              Construction
• Use ε-transitions
   – to “glue together” the machine of each piece of a regular
     expression
   – to form a machine that corresponds to the whole expression
• Basic regular expression
   – The NFAs for basic regular expression of the form a, ε,or φ
                  a                              
         The Idea of Thompson’s
              Construction
• Concatenation: to construct an NFA equal to rs
   – To connect the accepting state of the machine of r to
     the start state of the machine of s by anε-transition.
   – The start state of the machine of r as its start state and
     the accepting state of the machine of s as its accepting
     state.
   – This machine accepts L(rs) = L(r)L(s) and so
     corresponds to the regular expression rs.
                             
                   r                  s
                   …                  …
       The Idea of Thompson’s
            Construction
• Choice among alternatives: To construct an NFA
  equal to r | s
  – To add a new start state and a new accepting state and
    connected them as shown usingε-transitions.
  – Clearly, this machine accepts the language L(r|s)
    =L(r )UL ( s), and so corresponds to the regular
    expression r|s.
                              r
                              …
                                      
                                      
                             s
                             …
          The Idea of Thompson’s
               Construction
• Repetition: Given a machine that corresponds to r,
  Construct a machine that corresponds to r*
   – To add two new states, a start state and an accepting state.
   – The repetition is afforded by the newε-transition from the
     accepting state of the machine of r to its start state.
   – To draw an ε-transition from the new start state to the new
     accepting state.
   – This construction is not unique, simplifications are possible in the
     many cases.
                                      
                                                
                                      r
                                      …
                                      
 Examples of NFAs Construction
Example 1.12: Translate regular expression ab|a into NFA
                    a
                a                  b
                        a              b
                                               
                                           
                                a
 Examples of NFAs Construction
Example 1.13: Translate regular expression      letter(letter|digit)* into NFA
                                                                 letter
                 letter                                                               
                 digit
                                                                                  
                                                                 letter
                            
                          letter
                                       
                                           
                                   
                          letter                                            
                                                                          letter
                                                                                             
                                   letter                                                        
                                                                                          
                                                                          letter
                                                                            
2.4.2 From an NFA to a DFA
               Goal and Methods
• Goal
  – Given an arbitrary NFA, construct an equivalent DFA. (i.e., one
    that accepts precisely the same strings)
• Some methods
  – (1) Eliminating -transitions
      • -closure: the set of all states reachable by -transitions from a state
        or states
  – (2) Eliminating multiple transitions from a state on a single input
    character.
      • Keeping track of the set of states that are reachable by matching a
        single character
  – Both these processes lead us to consider sets of states instead of
    single states. Thus, it is not surprising that the DFA we construct
    has sets of states of the original NFA as its states.
        The Algorithm Called Subset
               Construction.
•   The -closure of a Set of states:
    –   The -closure of a single state s is the set of states
        reachable by a series of zero or more -transitions,
        and we write this set as . s
•   Example 2.14: regular a*
                                
                               a       
                    1       2       3       4
                                
  The algorithm called subset
         construction.
                                   
                                  a       
          1            2               3        4
1 = { 1,2,4}, 2 ={2}, 3 ={2,3,4}, and 4 ={4}.
 The -closure of a set of states : the union of the -closures of each individual state.
                    S=     ∪s
                           sin S
{1,3} = 1 3 = {1,2,3}{2,3,4}={1,2,3,4}
The Subset Construction Algorithm
(1) Compute the -closure of the start state of M; to obtain new state M .
(2) For this set, and for each subsequent set, compute transitions on
    characters a as follows.
         Given a set S of states and a character a in the alphabet,
         Compute the set
               Sa = { t | for some s in S there is a transition from s to t on a }.
         Then, compute Sa ' , the -closure of Sa.
         This defines a new state in the subset construction, together with
              a new transition S Sa ' .
(3) Continue with this process until no new states or transitions are created.
(4) Mark as accepting those states constructed in this manner that contain
    an accepting state of M.
Examples of Subset Construction
                         
                        a             
   1             2             3               4
                                                   M   -closure of M ( S ) Sa
                                                   1   1,2,4                3
                                                   3   2,3,4                3
                                           a
                     a
       {1,2,4}               {2,3,4}
Examples of Subset Construction
                            a                                  b
                     2                 3               4                 5   
         1
                                                                        
                                                  a
                                    6                       7
   M                       -closure of M (S)                   Sa           Sb
   1                       1,2,6                                3,7
   3,7                     3,4,7,8                                            5
   5                       5,8
                                a                       b
                 {1,2,6}                    {3,4,7,8}                 {5,8}
    Examples of Subset Construction
                                                                                               
                                                                                            letter
                                                                                      5               6            
                                  letter                                                                                        
                          1                2               3           4                                                  9
                                                                                                               
                                                                                            letter
                                                                                       7               8
M    -closure of M (S)       Sletter               Sdigit                                   
1    1                        2
2    2,3,4,5,7,10             6                      8
6    4,5,6,7,9,10             6                      8
                                                                                                                                       lett er
8    4,5,7,8,9,10             6                      8
                                                                                           letter            {4,5,6,7,9,10}
                                                         letter
                                               {1}                    {2,3,4,5,7,10}                 digit                    letter
                                                                                           digit             {4,5,7,8,9,10}
2.4.3 Simulating an NFA
    using the Subset
    Construction
 One Way of Simulating an NFA
• NFAs can be implemented in similar ways to
  DFAs, except that NFAs are nondeterministic
  – Many different sequences of transitions that
    must be tried.
  – Store up transitions that have not yet been tried
    and backtrack to them on failure.
An Other Way of Simulating an NFA
• Use the subset construction
   – Instead of constructing all the states of the associated
     DFA
   – Construct only the state at each point that is indicated
     by the next input character
• The advantage: Not need to construct the entire DFA
   – Example: input single character a, construct the start
     state {1,2,6}and then the second state {3,4,7,8} to
     move and match the a.
   – Since no following b, accept without generating the
     state {5,8}
                           a               b
                 {1,2,6}       {3,4,7,8}       {5,8}
An Other Way of Simulating an NFA
•   The disadvantage: A state may be constructed many times, if the path
    contains loops
     – Example: given the input string r2d3, the sequence of states as showing
       below                                                  letter
                                              letter           {4,5,6,7,9,10}
                    letter
              {1}            {2,3,4,5,7,10}            digit                    letter
                                              digit            {4,5,7,8,9,10}
                                                                                         digit
•    If these states are constructed as the transitions occur, then the states
    of the DFA have been constructed and the state {4,5,7,8,9,10}has even
    been constructed twice
     – Less efficient than constructing the entire DFA
End of Chapter Two
     THANKS