Higher Technological Institute
(HTI)
Computer Science Department
CSC 415 Compiler Design
CHAPTER 1
Lexical Analysis
Dr. Hany M. Zamel
1
Course Syllabus
Scanning theory and practice: Regular expressions, finite automata,
and scanners, scanner generators, practical considerations, translating
regular expressions to finite automata. Grammar and parsing: Context
frees grammars, parsers and recognizers, grammar analysis
algorithms. Semantic processing: Syntax directed translation. semantic
processing techniques.
Symbol tables: Basic techniques, block structured and extensions,
implicit declarations. Run time storage organization: Static allocation,
sack allocation, heap allocation, and program layout in memory Dain
structures: analysis declaration processing fundamentals action
routines. Procedures and functions: If statements, loops, case
statement, exception handling, passing parameters to subprograms.
Code generation and optimization: Register and temporary
management interpretive code generation, generating code from
subprogram calls, loop optimization.
2
Textbook
A. Ullman, “Compilers Principles, Techniques, and Tools”, Pearson
Education Limited, 2014
Course Site
https://hanyzamel.wixsite.com/lec-hti
3
Grading Scheme
4
Classroom Policy
5
Outline
• Informal sketch of lexical analysis
– Identifies tokens in input string
• Issues in lexical analysis
– Lookahead
– Ambiguities
• Specifying lexers (aka. scanners)
– By regular expressions (aka. regex)
– Examples of regular expressions
6
Lexical Analysis
• What do we want to do? Example:
if (i ==j)
Z=0;
else
Z=1;
• The input is just a string of characters:
\tif (i ==j)\n\t\tz =0;\n\telse\n\t\tz =1;
• Goal: Partition input string into substrings
– Where the substrings are called tokens
7
What’s a Token?
• A syntactic category
– In English:
noun, verb, adjective, …
– In a programming language:
Identifier, Integer, Keyword, Whitespace, …
8
Tokens
• A token class corresponds to a set of strings
Infinite set
• Examples var1
i
ports
– Identifier: strings of letters or foo Person
digits, starting with a letter …
– Integer: a non-empty string of digits
– Keyword: “else” or “if” or “begin” or …
– Whitespace: a non-empty sequence of blanks,
newlines, and tabs
9
What are Tokens For?
• Classify program substrings according to role
• Lexical analysis produces a stream of tokens
• … which is input to the parser
• Parser relies on token distinctions
– An identifier is treated differently than a keyword
10
Designing a Lexical Analyzer: Step 1
• Define a finite set of tokens
– Tokens describe all items of interest
• Identifiers, integers, keywords
– Choice of tokens depends on
• language
• design of parser
11
Example
• Recall
\tif (i ==j)\n\t\tz =0;\n\telse\n\t\tz =1;
• Useful tokens for this expression:
Integer, Keyword, Relation, Identifier, Whitespace, (, ),
=, ;
• N.B., (, ), =, ; above are tokens, not characters
12
Designing a Lexical Analyzer: Step 2
• Describe which strings belong to each token
• Recall:
– Identifier: strings of letters or digits, starting with a letter
– Integer: a non-empty string of digits
– Keyword: “else” or “if” or “begin” or …
– Whitespace: a non-empty sequence of blanks,
newlines, and tabs
13
Lexical Analyzer: Implementation
• An implementation must do two things:
1. Classify each substring as a token
2. Return the value or lexeme (value) of the token
– The lexeme is the actual substring
– From the set of substrings that make up the token
• The lexer thus returns token-lexeme pairs
– And potentially also line numbers, file names, etc. to
improve later error messages
14
Example
• Recall:
\tif (i ==j)\n\t\tz =0;\n\telse\n\t\tz =1;
15
Lexical Analyzer: Implementation
• The lexer usually discards “uninteresting” tokens
that don’t contribute to parsing.
• Examples: Whitespace, Comments
16
True Crimes of Lexical Analysis
• Is it as easy as it sounds?
• Sort o f … if you do not make it hard!
• Look at some history
17
Lexical Analysis in FORTRAN
• FORTRAN rule: Whitespace is insignificant
• E.g., VAR1 is the same as VA R1
• A terrible design!
• Historical footnote: FORTRAN Whitespace rule
motivated by inaccuracy of punch card operators
18
FORTRAN Example
• Consider
– DO 5 I=1,25
– DO 5 I=1.25
19
Lexical Analysis in FORTRAN (Cont.)
• Two important points:
1. The goal is to partition the string. This is implemented
by reading left-to-right, recognizing one token at a time
2. “Lookahead” may be required to decide where one
token ends and the next token begins
20
Lookahead
• Even our simple example has lookahead issues
– i vs. if
– =vs. ==
21
Lexical Analysis in C++
• Unfortunately, the problems continue today
• C++ template syntax:
Foo<Bar>
• C++ stream syntax:
cin >>var;
• But there is a conflict with nested templates:
Foo<Bar<Bazz>>
Closing templates, not stream 22
20
Review
• The goal of lexical analysis is to
– Partition the input string into lexemes
– Identify the token of each lexeme
• Left-to-right scan => lookahead sometimes
required
23
Next
• We still need
– A way to describe the lexemes of each token
– A way to resolve ambiguities
• Is if two variables i and f?
• Is ==two equal signs = =?
24
Regular Languages
• There are several formalisms for specifying tokens
• Regular languages are the most popular
– Simple and useful theory
– Easy to understand
– Efficient implementations
25
Languages
Def. Let alphabet Σ be a set of characters.
A language over Σ is a set of strings of
characters drawn from Σ.
26
Examples of Languages
• Alphabet = English • Alphabet = ASCII
characters
• Language = English • Language = C programs
sentences
• Not every string of English • Note: ASCII character set
characters is an English is different from English
sentence character set
27
Notation
• Languages are sets of strings.
• Need some notation for specifying which sets we
want
• The standard notation for regular languages is
regular expressions.
28
Atomic Regular Expressions
• Single character
'c ' = "c"
• Epsilon
= ""
Not the empty set, but set with
a single, empty, string.
29
Compound Regular Expressions
• Union
A+ B = s | s A or s B
• Concatenation
AB = ab | a A and b B
• Iteration
A* = ∪i≥0 Ai where Ai = AA . . . A
i times
30
Regular Expressions
• Def. The regular expressions over Σ are the
smallest set of expressions including
'c ' where c
A+ B where A, B are rexp over
AB " " "
A* where A is a rexp over
31
Syntax vs. Semantics
• Notation so far was imprecise
AB = ab | a A and b B
B as a piece of syntax B as a set
(the semantics of the syntax)
32
Syntax vs. Semantics
Semantics (content) L('a'*)
L('a' + 'b')
b aa aaa
a a
Box 'a' + 'b' 'a'*
Syntax (label)
33
Syntax vs. Semantics
• To be careful, we distinguish syntax and semantics.
L() = ""
L('c ') = {"c "}
L( A + B) = L( A) L(B)
L( AB) = {ab | a L( A) and b L(B)}
L( A* ) = L( Aii )
L(A*) = !∪ii≥0
0
L(A )
34
Example: Keyword
Keyword: “else” or “if” or “begin” or …
‘else’ + ‘if’ + ‘begin’ + . . .
Abbreviation: ‘else’ = ‘e’ ‘l’ ‘s’ ‘e’
35
Example: Integers
Integer: a non-empty string of digits
digit = '0 '+ '1'+ '2 '+ '3'+ ' 4 '+ '5 '+ '6 '+ '7 '+ '8 '+ '9 '
integer = digit digit*
Abbreviation: A + = AA*
Abbreviation: [0-2] = '0' + '1' + '2'
36
Example: Identifier
Identifier: strings of letters or digits, starting with a
letter
letter = ‘A’ + . . . + ‘Z’ + ‘a’ + . . . + ‘z’
identifier = letter (letter + digit)*
Is (letter* + digit*) the same as (letter + digit)*?
37
Example: Whitespace
Whitespace: a non-empty sequence of blanks,
newlines, and tabs
(' ' + '\n' + '\t')
+
38
Example: Phone Numbers
• Regular expressions are all around you!
• Consider (650)-723-3232
= digits -,(,)
exchange = digit3
phone = digit4
area = digit3
phone_number = '(' area ')-' exchange '-' phone
39
Example: Email Addresses
• Consider anyone@cs.stanford.edu
= letters .,@
name = letter+
address = name '@' name '.' name '.' name
40