Compiler Construction
CS-4207
Instructor Name: Atif Ishaq
Lecture 4-5
Today’s Lecture
Lexical Analysis
Attributes of Tokens
Specification of Tokens
Recognition of Tokens
Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source
code from the language preprocessor, which is written in the form of sentences.
The lexical analysis phase is also referred to as the scanning phase.
The lexical analyzer breaks this source code into a series of tokens, removing
any whitespace and comments in the source code along the way.
Source: https://www.tutorialspoint.com/compiler_design/compiler_design_lexical_analysis.htm
https://www.guru99.com/compiler-design-lexical-analysis.html
Role of Lexical Analyzer
The lexical analyzer reads the input characters of the source program, groups
them into lexemes, and produces as output a sequence of tokens, one for each
lexeme in the source program.
The lexical analyzer also interacts with the symbol table during this process.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs to
enter that lexeme into the symbol table.
The lexical analyzer strips out comments and whitespace (blanks, newlines, tabs,
and other characters that are used to separate tokens in the input).
The lexical analyzer correlates error messages generated by the compiler with the
source program – it associates a line number with each error message.
Role of Lexical Analyzer
Lexical analyzers are sometimes divided into a cascade of two processes: scanning and
lexical analysis proper.
Scanning consists of the simple processes that do not require tokenization of the
input, such as deletion of comments and compaction of consecutive whitespace
characters into one.
Lexical analysis proper is the more complex portion, which produces
the sequence of tokens as output. A small sketch of the scanning pre-pass follows.
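A minimal sketch of such a scanning pre-pass in Python (C-style comment syntax is chosen purely for illustration; function names are not from the lecture):

```python
import re

def scan(source: str) -> str:
    """Pre-pass: delete comments, compact consecutive whitespace."""
    # Remove /* ... */ and // ... comments.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", " ", source)
    # Compact runs of blanks, tabs, and newlines into one blank.
    return re.sub(r"\s+", " ", source).strip()

print(scan("x = 1;   /* init */\n\ty = 2; // second"))
# -> "x = 1; y = 2;"
```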
Tokens, Patterns, and Lexemes
A token is a pair consisting of a token name and an optional attribute value.
The token name is an abstract symbol representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input characters denoting an identifier.
Tokens, Patterns, and Lexemes
A pattern is a description of the form that the lexemes of a token may take.
A pattern explains what can be a token, and these patterns are defined by means of
regular expressions.
In the case of a keyword as a token, the pattern is just the sequence of characters that
form the keyword.
A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that
token.
Tokens, Patterns, and Lexemes
TOKEN        INFORMAL DESCRIPTION                        SAMPLE LEXEMES
if           characters i, f                             if
else         characters e, l, s, e                       else
comparison   < or > or <= or >= or == or !=              <=, !=
id           letter followed by letters and digits       pi, score, D2
number       any numeric constant                        3.14, 0, 6.02e23
literal      anything but ", surrounded by "s            "some thing"
Example: printf("Total = %d\n", score);
printf and score are lexemes matching the pattern for token id
"Total = %d\n" is a lexeme matching the pattern for token literal
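As a rough illustration (simplified patterns, not the lecture's exact definitions), the table's patterns can be written as Python regular expressions and run against the example:

```python
import re

patterns = {
    "id":      r"[A-Za-z][A-Za-z0-9]*",              # letter then letters/digits
    "number":  r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?",    # any numeric constant
    "literal": r'"[^"]*"',                           # anything but ", in quotes
}

line = 'printf("Total = %d\\n", score);'
print(re.findall(patterns["literal"], line))   # ['"Total = %d\\n"']
print(re.findall(patterns["id"], line))        # ['printf', 'Total', 'd', 'n', 'score']
```

Note that findall also reports ids inside the string literal; a real lexer scans left to right and tries each pattern only at the current position, so those would never be produced.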
Tokens – Punctuation : Separators
Tokens exist for each punctuation symbol, such as left and right parentheses, comma,
and semicolon.
Typically individual special characters such as ( { } : .. (two dots).
Sometimes double characters: the lexical scanner looks for the longest token.
(* and /* are comment openers in various languages.
Returned just as the identity (kind) of the token,
and perhaps a location for error messages and debugging purposes.
Tokens – Operators
Tokens for operators, either individually or in classes such as the token comparison.
They are like punctuation:
no real difference for the lexical analyzer.
Typically single or double special characters:
Operators: + - == <=
Operators: := =>
Returned as the kind of token,
and perhaps a location. A longest-match sketch follows.
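A small sketch of the longest-token (maximal munch) rule, assuming the operator set shown above:

```python
# Try longer operators before shorter ones: maximal munch.
OPERATORS = ["<=", ">=", "==", "!=", ":=", "=>", "<", ">", "=", "+", "-"]

def match_operator(text: str, pos: int):
    """Return (operator, new_pos) or None, preferring the longest match."""
    for op in OPERATORS:  # list is ordered longest-first
        if text.startswith(op, pos):
            return op, pos + len(op)
    return None

print(match_operator("a <= b", 2))  # ('<=', 4), not ('<', 3)
```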
Tokens – Keywords
One token for each keyword. The pattern for a keyword is the same as the keyword
itself.
They are reserved identifiers.
Examples are if, do, while, try, catch.
May be distinguished from identifiers.
Returned as the kind of token,
with possible location information.
Tokens – Identifiers
One token representing all identifiers.
Rules differ between languages:
length, allowed characters, separators.
Need to build a names table (identifier table):
a single entry for all occurrences of Var1.
The language may be case insensitive: same entry for VAR1, vAR1, Var1.
Typical structure: a hash table.
The lexical analyzer returns the token kind,
and a key (index) to the table entry.
The table entry includes location information.
Organization of Names Table
The most common structure for the names table is a hash table:
with a fixed number of headers,
chained according to hash code,
and serial search on one chain.
The hash code is computed from the characters (e.g., sum mod table size).
No hash code is perfect: expect collisions.
Avoid any arbitrary limits on table or chain size – see the sketch below.
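A minimal names-table sketch along these lines (the "sum of character codes mod table size" hash is deliberately simple; the class and field names are illustrative):

```python
class NamesTable:
    def __init__(self, headers: int = 101):       # fixed number of headers
        self.buckets = [[] for _ in range(headers)]

    def _hash(self, name: str) -> int:
        return sum(map(ord, name)) % len(self.buckets)  # sum mod table size

    def lookup_or_insert(self, name: str):
        chain = self.buckets[self._hash(name)]    # chain according to hash code
        for entry in chain:                       # serial search on one chain
            if entry["name"] == name:
                return entry
        entry = {"name": name, "locations": []}   # single entry per identifier
        chain.append(entry)                       # chains grow: no arbitrary limit
        return entry

table = NamesTable()
e1 = table.lookup_or_insert("Var1")
e2 = table.lookup_or_insert("Var1")
print(e1 is e2)   # True: all occurrences of Var1 share one entry
```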
Tokens – String Literals
One or more tokens representing constants, such as numbers, literals, and strings.
The text must be stored:
the actual characters are important.
Unlike identifiers, casing must be preserved.
Character set issues: use a uniform internal representation.
A table is needed.
The lexical analyzer returns a key into the table (in addition to the kind).
May or may not be worth hashing to avoid duplicates.
Tokens – Character Literals
One or more tokens representing constants, such as numbers, literals, and strings.
Similar issues to string literals.
The lexical analyzer returns
the token kind
and the identity of the characters.
Cannot assume the character set of the host machine; the target's may be different.
Tokens – Numeric Literals
One or more tokens representing constants, such as numbers, literals, and strings.
Use a table to store numeric literals.
For example 123 = 0123 = 01_23 (Ada).
But cannot use predefined types for the values,
because they may have different bounds.
Floating point representation is much more complex:
denormals, correct rounding.
Very delicate to compute precise values.
Host/target issues.
Handling Comments
Comments have no effect on the program.
They can be eliminated by the scanner,
but may need to be retrieved by tools.
Error detection issues:
unclosed comments.
The scanner skips over comments and returns the next meaningful token.
Attributes of Tokens
When more than one lexeme matches a pattern,
additional information must be provided to the next phase about the particular lexeme.
For example, the pattern for token number matches both 0 and 1, and it is
important for the code generator to know which lexeme was found in the
source code.
Information about an identifier – for example its lexeme, its type, and the
location where it is first found – is kept in the symbol table.
The lexical analyzer returns to the parser a token name and an attribute value:
the token name influences parsing decisions;
the attribute value influences translation after parsing – for an identifier it is a pointer to the
symbol table entry for that identifier.
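A rough sketch of the pair handed to the parser (the class and the tiny symbol table are illustrative, not from the lecture):

```python
from typing import NamedTuple, Optional

class Token(NamedTuple):
    name: str                      # token name: drives parsing decisions
    attribute: Optional[object]    # attribute value: used during translation

symbol_table = ["pi", "score"]     # illustrative symbol table

# For number, the attribute records WHICH constant was found (0 vs 1 matters).
tok_num = Token("number", 3.14)
# For id, the attribute points (here: indexes) to the symbol table entry.
tok_id = Token("id", symbol_table.index("score"))
print(tok_num, tok_id)
```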
Lexical Errors
A lexical error is one where a character sequence cannot be scanned into any
valid token.
Facts about lexical errors:
Lexical errors are not very common, but they should be managed by the scanner.
Misspellings of identifiers, operators, or keywords are considered lexical errors.
Lexical errors are generally caused by the appearance of some illegal
character, mostly at the beginning of a token.
A lexical analyzer, without the help of other components, cannot determine
that an error exists in the source code.
Error Recovery in Lexical Analyzer
If the string fi is encountered for the first time in a C program in the context:
fi ( a == fx()) ….
It is difficult for the lexical analyzer to determine whether fi is a misspelling of the
keyword if or an undeclared function identifier.
fi is a valid lexeme for the token id – this id is returned to the parser – some later
phase handles the error due to the transposition of the letters.
In a situation when the lexical analyzer is unable to proceed because none of the
patterns for tokens matches any prefix of the remaining input – the simplest
recovery is "panic mode" recovery:
delete successive characters from the remaining input until the analyzer can find a well-formed token at
the beginning of what input is left – this may confuse the parser.
Error Recovery in Lexical Analyzer
Other possible recovery actions are (a sketch of action 2 follows the list):
1. Delete one character from the remaining input.
2. In panic mode, successive characters are ignored until we
reach a well-formed token.
3. Insert a missing character into the remaining input.
4. Replace a character with another character.
5. Transpose two adjacent characters.
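A minimal sketch of panic-mode deletion (the token patterns here are placeholder choices, not the lecture's):

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*|\d+|[+\-*/=;()]")

def next_token(text: str, pos: int):
    """Return (lexeme, new_pos), deleting characters in panic mode on error."""
    while pos < len(text):
        if text[pos].isspace():           # skip whitespace
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m:
            return m.group(), m.end()
        print(f"lexical error: deleting {text[pos]!r}")
        pos += 1                          # panic mode: drop one character
    return None, pos

print(next_token("@# x = 1", 0))          # deletes '@', '#', then returns ('x', 4)
```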
Input Buffering
Compiling a large program requires processing a large number of characters, and the
time spent per character matters.
Specialized buffering techniques have been developed to reduce the amount of
overhead required to process a single input character.
Two buffers are used – reloaded alternately.
Each buffer is of size N, where N is the size of a disk block (e.g., 4096 bytes).
One system read command can read N characters into a buffer rather than one
character at a time.
A sentinel eof marks the end of the source file.
Input Buffering
The string of characters between the two pointers is the current lexeme.
Two pointers into the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose
extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found (see the sketch below).
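A much simplified sketch of the two-pointer idea (a single Python string stands in for the buffer pair, and "token" is reduced to "whitespace-delimited word"):

```python
def lexemes(buffer: str):
    """Yield whitespace-delimited lexemes using the two-pointer scheme."""
    lexeme_begin = 0
    while lexeme_begin < len(buffer):
        if buffer[lexeme_begin].isspace():
            lexeme_begin += 1                     # no lexeme starts here
            continue
        forward = lexeme_begin                    # forward scans ahead
        while forward < len(buffer) and not buffer[forward].isspace():
            forward += 1
        yield buffer[lexeme_begin:forward]        # characters between the pointers
        lexeme_begin = forward                    # start of the next lexeme

print(list(lexemes("if x then y")))               # ['if', 'x', 'then', 'y']
```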
Specification of Tokens
Regular languages are the most popular way of specifying tokens because
they are based on a simple and useful theory,
they are easy to understand,
and efficient implementations exist for generating lexical analyzers based on
such languages.
Specification of Tokens : Strings and Languages
An alphabet is any finite set of symbols:
letters, digits, punctuation.
The set { 0, 1 } is the binary alphabet.
A string over an alphabet is a finite sequence of symbols drawn from that
alphabet.
The length of string s is denoted |s|.
The empty string is denoted by ε.
A language is any countable set of strings over some fixed alphabet.
Abstract languages like ∅, the empty set, or { ε }, the set containing only the
empty string, are languages under this definition.
Specification of Tokens : Strings and Languages
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols
from the end of s. For example, ban, banana, and ε are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols
from the beginning of s. For example, nana, banana, and ε are suffixes of
banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For
instance, banana, nan, and ε are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes,
suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not
necessarily consecutive positions of s. For example, baan is a subsequence of
banana.
Specification of Tokens : Operations on Languages
In lexical analysis the most important operations on languages are union,
concatenation, and closure.
Union:
L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation:
LM = { xy | x ∈ L and y ∈ M }
Exponentiation:
L^0 = { ε };  L^i = L^(i-1) L
Kleene closure:
L* = ⋃_{i=0}^{∞} L^i
Positive closure:
L+ = ⋃_{i=1}^{∞} L^i
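These operations can be checked on small finite languages; a short Python sketch (names illustrative; the infinite closures are truncated at exponent 2, since L* cannot be materialized):

```python
L = {"a", "b"}
M = {"0", "1"}

union  = L | M                              # L ∪ M
concat = {x + y for x in L for y in M}      # LM

def power(lang, i):
    """L^0 = {ε}; L^i = L^(i-1) L."""
    result = {""}                           # "" plays the role of ε
    for _ in range(i):
        result = {x + y for x in result for y in lang}
    return result

kleene_prefix = set().union(*(power(L, i) for i in range(3)))  # L* up to L^2
print(sorted(union), sorted(concat))
print(sorted(kleene_prefix))  # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```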
Specification of Tokens : Regular Expression
Regular expressions are built recursively out of smaller regular
expressions using a set of rules.
Each regular expression r denotes a language L(r), which is also defined
recursively from the languages denoted by r's subexpressions:
ε is a regular expression denoting the language { ε }.
If a is a symbol of the alphabet, then a is a regular expression denoting { a }.
If r and s are regular expressions denoting languages L(r) and M(s)
respectively, then
r | s is a regular expression denoting L(r) ∪ M(s),
rs is a regular expression denoting L(r)M(s),
r* is a regular expression denoting (L(r))*,
(r) is a regular expression denoting L(r).
Specification of Tokens : Regular Expression
A language defined by a regular expression is called a regular set.
The parentheses may be dropped if we follow these conventions:
1. The unary operator * has highest precedence and is left associative.
2. Concatenation has second highest precedence and is left associative.
3. | has lowest precedence and is left associative.
Under these conventions the regular expression
(a) | ( (b) * (c) )
may be replaced by
a|b*c
Specification of Tokens : Regular Expression
If two regular expressions r and s denote the same regular set, then we say they
are equivalent and we write r = s – for example, (a|b) = (b|a).
Specification of Tokens : Regular Definition
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form
d1 → r1
d2 → r2
…
dn → rn
where:
each di is a new symbol, not in Σ and not the same as any other of the d's, and
each ri is a regular expression over the alphabet Σ ∪ { d1, d2, …, di−1 }.
Specification of Tokens : Regular Definition
Example:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
A regular definition cannot be recursive:
digit → digit | digit digit (wrong)
The following shorthands are often used:
r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z
[abc] = a | b | c
Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
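As an illustrative check, the num definition maps almost symbol for symbol onto a Python regular expression:

```python
import re

digit = r"[0-9]"
num = re.compile(rf"{digit}+(\.{digit}+)?(E[+-]?{digit}+)?")

for s in ["3", "3.14", "6.02E23", "1E-4", ".5"]:
    print(s, bool(num.fullmatch(s)))
# 3, 3.14, 6.02E23, and 1E-4 match; .5 does not (digits are required before the dot)
```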
Recognition of Tokens
The following grammar describes a simple form of branching statements and
conditional expressions:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
The terminals of the grammar are if, then, else, relop, id, and num.
For relop we use the comparison operators of the language, where = is "equals" and <>
is "not equals".
Recognition of Tokens
The patterns for the tokens of the grammar are described using regular definitions:
digit → [0-9]
digits → digit+
number → digits (. digits)? ( E (+|-)? digits )?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | <> | =
The token ws for recognizing white space is defined by
ws → ( blank | tab | newline )+ ; this token is not returned to the parser.
Recognition of Tokens
Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character
that follows the whitespace. It is the following token that gets returned to the
parser.
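Assembling these definitions, a compact lexer sketch in Python (the re module stands in for the generated automaton; helper names are illustrative):

```python
import re

TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),                            # ( blank | tab | newline )+
    ("if",     r"if\b"),
    ("then",   r"then\b"),
    ("else",   r"else\b"),
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("relop",  r"<=|>=|<>|<|>|="),                      # longer alternatives first
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != "ws":      # ws restarts scanning; it is never returned
            yield m.lastgroup, m.group()

print(list(tokenize("if x <= 42 then y else z")))
# [('if', 'if'), ('id', 'x'), ('relop', '<='), ('number', '42'),
#  ('then', 'then'), ('id', 'y'), ('else', 'else'), ('id', 'z')]
```

Listing the keyword patterns before id makes "if" match as a keyword, while the \b word boundary keeps an identifier like iffy from being split.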
Lecture Outcome
The lexical analyzer and its role
A comprehensive discussion of tokens
The relationship between the specification of tokens and regular
expressions