Programming Languages Translators
Lexical Analysis
      Tasks of a Scanner
 recognizes the keywords of the language
    these are the reserved words that have a special meaning
     in the language, such as the word class in Java
 recognizes special characters, such as ( and ), or
  groups of special characters, such as := and ==
 recognizes identifiers, integers, reals, decimals,
  strings, etc
 ignores whitespaces (tabs, blanks, etc) and
  comments
 recognizes and processes special directives (such as
  the #include "file" directive in C) and macros
 Lexical Analysis
 A scanner groups input characters into tokens
input: x = x * (acc+123)
              token value
  identifier x
  equal =
  identifier x
  star*
           left-paren     (
  identifier acc
  plus +
  integer 123
  right-paren )
 Tokens are typically represented by numbers
Lexical Analysis
 Lexical analyzer splits it into tokens
     Token = sequence of characters (symbolic
      name) representing a single terminal symbol
   Identifiers: myVariable 
   Literals: 123 5.67 true 
   Keywords: char sizeof 
   Operators: + - * / 
   Punctuation: ; , } { 
   Discards whitespace and comments
 Examples of Tokens in C
    Tokens                    Lexemes
    identifier        Age, grade,Temp, zone, q1
     number         3.1416, -498127,987.76412097
      string        A cat sat on a mat., 90183654
open parentheses                     (
close parentheses                    )
   Semicolon                         ;
 reserved word if              IF, if, If, iF
Examples of Tokens in C
 Lexical analyzer usually represents each token
  by a unique integer code
     +    {   return(PLUS); }    //   PLUS = 401
     -    {   return(MINUS); }   //   MINUS = 402
     *    {   return(MULT); }    //   MULT = 403
     /    {   return(DIV); }     //   DIV = 404
 Some tokens require regular expressions
     [a-zA-Z_][a-zA-Z0-9_]* { return (ID); } // identifier
     [1-9][0-9]*           { return(DECIMALINT); }
     0[0-7]*              { return(OCTALINT); }
     (0x|0X)[0-9a-fA-F]+     { return(HEXINT); }
                                                         slide 6
Example
      Token      Informal description                  Sample lexemes
       if      Characters i, f                         if
      else     Characters e, l, s, e                   else
  comparison < or > or <= or >= or == or !=            <=, !=
        id      Letter followed by letter and digits    pi, score, D2
     number          Any numeric constant               3.14159, 0, 6.02e23
     literal      Anything but  sorrounded by         core dumped
             printf(total = %d\n, score);
Redefining Identifiers can be
         dangerous
      program confusing;
       const true = false;
              begin
        if (a<b) = true then
                  f(a)
              else 
Whitespace
Whitespace is any space, tab, end-of-line
  character (or characters), or character
  sequence inside a comment
No token may contain embedded whitespace
  (unless it is a character or string literal)
Example:
  >= one token
  > = two tokens
Reserved Keywords in C
 auto, break, case, char, const, continue,
  default, do, double, else, enum, extern, float,
  for, goto, if, int, long, register, return, short,
  signed, sizeof, static, struct, switch, typedef,
  union, unsigned, void, volatile, wchar_t, while
 C++ added a bunch: bool, catch, class,
  dynamic_cast, inline, private, protected,
  public, static_cast, template, this, virtual and
  others
 Each keyword is mapped to its own token
                                                 slide 10
    Lexical Analysis
  The process of converting a character stream into a
   corresponding sequence of meaningful symbols
   (called tokens or lexemes) is called tokenizing, lexing
   or lexical analysis. A program that performs this
   process is called a tokenizer, lexer, or scanner.
 In Scheme, we tokenize (set! x (+ x 1))            as
(    set!      x      (     +      x     1     )      )
 Similarly, in Java, we tokenize
   System.out.println("Hello World!"); as
System . out . println ( "Hello
   World!" ) ;
 Parsing Process
 Call the scanner to get tokens
 Build a parse tree from the stream of tokens
   A parse tree shows the syntactic structure of the
    source program.
 Add information about identifiers in the symbol
 table
 Report error, when found, and recover from the
 error
                                                        12
  Parsing
 Parsing is a process that constructs a syntactic
  structure (i.e. parse tree) from the stream of
  tokens.
 We already learn how to describe the syntactic
  structure of a language using (context-free)
  grammar.
 So, a parser only need to do this?
     Stream of tokens
                          Parser      Parse tree
  Context-free grammar
 Sentinels
                            E = M eof * C * * 2 eof        eof
Switch (*forward++) {
   case eof:
   if (forward is at end of first buffer) {
   reload second buffer;
   forward = beginning of second buffer;
   }
   else if {forward is at end of second buffer) {
   reload first buffer;\
   forward = beginning of first buffer;
   }
   else /* eof within a buffer marks the end of input */
   terminate lexical analysis;
   break;
   cases for the other characters;
}
Transition diagrams
 Transition diagram for relop
Transition diagrams (cont.)
 Transition diagram for reserved words and
  identifiers
Transition diagrams (cont.)
 Transition diagram for unsigned numbers
Recognition
       state = 0;
       while ( (c = next_char() ) != EOF ) {
          switch (state) {
                case 0: if ( c == a ) state = 1;
                     break;
                case 1: if ( c == b ) state = 2;
                     break;
                case 2: if ( c == c ) state = 3;
                     break;
                case 3: if ( c == a ) state = 1;
                       else { ungetchar(); return (TRUE); }
                     break;
                default:
                     error();
          }
       }
       if ( state == 3 ) return (TRUE) else return (FALSE);
Finite Automata for the Lexical Tokens
                                                a- z           a- z
         i          f                                                                             0-9
                                                       2                                0-9
    1          2         3              1                                                     2
                                                                                    1
                                                                0-9
             IF                                 ID                                       NUM
              0-9       0-9
        0-9
                    .               1       -      2       -       3
                                                                       \n
                                                                              4
                                                                  a- z
    1          2         3                                                          1                2
                                                                                        any but \n
    .                                   blank, etc.
                                                           5          blank, etc.
         4 0-9 5              0-9
             REAL                           White space                       error
                                            (and comment starting with - -)
                                                                                              (Appel, pp. 21)
LEXICAL ANALYSIS
Lexical Errors
   Deleting an extraneous character
   Inserting a missing character
   Replacing an incorrect character by a correct
    character
   Transposing two adjacent characters(such as ,
    fi=>if)
   Pre-scanning
Tokens / Patterns / Regular Expressions
    Lexical Analysis - searches for matches of lexeme to pattern
    Lexical Analyzer returns:<actual lexeme, symbolic identifier of token>
      For Example:                Token         Symbolic ID
                                    if               1
                                   then              2
                                   else              3
                                  >,>=,<,           4
     Set of all regular              :=              5
     expressions plus
                                     id              6
     symbolic ids plus
     analyzer define required       int              7
     functionality.                 real             8
                 algs      algs
         REs --- NFA --- DFA (program for simulation)