Compiler Design
6th Semester B.Tech. (CSE)
Course Code: 18CS1T08
Dr. Jasaswi Prasad Mohanty
School of Computer Engineering
KIIT Deemed to be University
Bhubaneswar, India
MODULE II
Lexical Analysis
Sl. No. Topics
1. Role of Lexical Analyzer
2. Input Buffering
3. Specification of Tokens
4. Recognition of Tokens
5. Finite Automata
6. From Regular Expressions to Finite Automata
7. Implementing Scanners
The Role of Lexical Analyzer
The main task of the lexical analyzer is to read the input characters of the source program,
group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in
the source program.
The stream of tokens is sent to the parser for syntax analysis.
It is common for the lexical analyzer to interact with the symbol table as well.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table.
The Role of Lexical Analyzer – contd . . .
Commonly, the interaction is implemented by having the parser call the lexical analyzer.
The call, suggested by the getNextToken command, causes the lexical analyzer to read
characters from its input until it can identify the next lexeme and produce for it the next token,
which it returns to the parser.
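In code form, this interaction is a simple pull loop. The following is a minimal C++ sketch; the Token layout, the token codes, and the canned token sequence are illustrative assumptions, not a fixed API:

    // A toy lexical analyzer that hands out a fixed token sequence; a real one
    // would read input characters and recognize lexemes as described above.
    #include <cstdio>

    struct Token { int name; int attribute; };   // token = <name, attribute value>
    enum { TOK_EOF, TOK_ID, TOK_NUM };           // token codes (assumed)

    Token getNextToken() {
        static Token demo[] = {{TOK_ID, 1}, {TOK_NUM, 31}, {TOK_EOF, 0}};
        static int i = 0;
        return demo[i++];
    }

    int main() {                                 // the "parser" pulls tokens
        for (Token t = getNextToken(); t.name != TOK_EOF; t = getNextToken())
            std::printf("<%d, %d>\n", t.name, t.attribute);
    }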
The Role of Lexical Analyzer
The lexical analyzer performs some of the following other tasks besides identification
of lexemes:
• Strip out comments and whitespace (blank, newline, tab, and perhaps other
characters that are used to separate tokens in the input).
• Correlate error messages generated by the compiler with the source
program. For instance, the lexical analyzer may keep track of the number of
newline characters seen, so it can associate a line number with each error
message.
• Make a copy of the source program with the error messages inserted at
the appropriate positions.
• May perform the expansion of macros if the source program uses a macro-
preprocessor.
The Role of Lexical Analyzer
Lexical analyzers are divided into a cascade of two processes:
• Scanning: consists of the simple processes that do not require tokenization of
the input, such as deletion of comments and compaction of consecutive
whitespace characters into one.
• (Proper) Lexical analysis: the more complex portion, where the scanner produces the
sequence of tokens as output.
Why Separate Lexical Analysis and Parsing?
There are a number of reasons why the analysis portion of a compiler is normally separated into
lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design
• The separation of lexical and syntactic analysis often allows us to simplify at least one of
these tasks.
If the Parser had to deal with comments and whitespace: The parser would need to handle
comments and whitespace as part of its syntactic analysis. This would add complexity to
the parser because it would have to distinguish between meaningful language constructs
and non-essential elements like comments and whitespace.
2. Improving compiler efficiency
• A separate lexical analyzer allows us to apply specialized techniques that serve only the
lexical task, not the job of parsing.
• Specialized buffering techniques for reading input characters can speed up the compiler
significantly.
3. Enhancing compiler portability
• Input-device-specific peculiarities can be restricted to the lexical analyzer, so the
rest of the compiler need not change across input platforms.
Responsibilities of Lexical Analysis
1. Tokenization:
• Recognition of Lexical Elements: Identify and recognize lexical elements in the
source code, such as keywords, identifiers, literals (constants), operators, and
punctuation symbols.
• Grouping into Tokens: Group the recognized lexical elements into tokens. A token
is a meaningful unit that represents a particular syntactic element in the programming
language.
2. Handling Whitespace and Comments:
• Whitespace Removal: Recognize and discard whitespace characters (spaces, tabs,
line breaks) from the source code. Whitespace is generally irrelevant to the syntactic
structure of the program but contributes to readability for humans.
• Comment Removal: Identify and eliminate comments from the source code.
Comments are annotations meant for human readers and do not affect the program's
execution.
Tokenization: Example
Code: Responsibility: Recognize and categorize
int main() lexical elements such as int, main, (, ), {,
return, 0, ;, and } into tokens.
{
return 0;
}
Handling Whitespace and Comments: Example
Code:
    /* This is a comment */
    int main()
    {
        return 0;
    }
Responsibility: Recognize and discard comments (/* This is a comment */) and handle whitespace.
Responsibilities of Lexical Analysis
3. Error Handling:
• Error Detection: Detect and report lexical errors, such as the use of undefined
symbols or invalid characters. Lexical errors indicate deviations from the language's
lexical rules.
• Error Recovery: Implement strategies for recovering from errors to continue
processing the remaining code. Error recovery mechanisms aim to provide
informative error messages without prematurely terminating the compilation process.
4. Symbol Table Management:
• Building Symbol Table Entries: Create entries in the symbol table for identifiers
encountered during tokenization. The symbol table is a data structure that associates
each identifier with information about its type, scope, and other properties.
• Handling Reserved Words: Identify reserved words in the language and ensure
they are appropriately classified as keywords.
Error Handling: Example
Code:
    int mai%n()
    {
        return 0;
    }
Responsibility: Detect and report lexical errors, such as the use of % in the identifier.
Error Handling: Lexical error: Invalid character '%' in identifier.
Symbol Table Management: Example
Code:
    int main()
    {
        int x;
        return x;
    }
Symbol Table: x . . .
Responsibility: Create entries in the symbol table for identifiers (x) encountered during tokenization.
Responsibilities of Lexical Analysis
5. Generating Output:
• Output Tokens: Generate a stream of tokens, where each token represents a
recognized and categorized syntactic element. This token stream becomes the input
for the subsequent phases of the compiler.
6. Optimizations and Preprocessing:
• Optimization Opportunities: Identify simple optimizations that can be performed at
the lexical analysis phase. For example, recognizing constant literals and replacing
them with their computed values.
• Preprocessing Directives: Handle preprocessor directives if applicable, such as
macro expansions or conditional compilation directives.
7. Interface with the Parser:
• Providing Input to the Parser: Present the generated token stream to the parser for
syntactic analysis. The parser relies on the lexical analyzer to provide a well-defined
and organized stream of tokens.
Optimizations and Pre-processing: Example
Code:
    #define MAX 100
    int main()
    {
        return MAX;
    }
Responsibility: Handle preprocessing directives, such as macro expansions.
Generating Output and Interface with the Parser: Example
Code:
    int main()
    {
        return 0;
    }
Responsibility: Generate the tokens and provide the generated token stream to the parser for syntactic analysis.
Lexical Analysis Operations
This phase of the compiler does the following operations:
• Recognize tokens and ignore whitespace and comments
• Generate the token stream
• Report errors
• Model tokens using regular expressions
• Recognize tokens using finite state automata
Terms related to Lexical Analysis
Lexeme
• A lexeme is a sequence of characters in the source code that matches the pattern for a token.
• It is the actual occurrence of a token in the source code.
• Example: In the C programming language, the statement int x = 10; contains the following lexemes: int, x, =, 10 and ;
Token
• A token is a pair consisting of a token name and an optional attribute value.
• The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.
• The optional attribute value provides additional information associated with the token.
• Example: The statement int x = 10; can be broken down into the following tokens: <int> <id, 1> <=> <number, 10> <;>
Pattern
• A pattern is a description of the form that the lexemes of a token may take.
• A pattern is a rule or template that defines the possible structure of a token.
• It describes the set of valid sequences of characters that can form a particular token.
• Example: For an identifier and an integer, the patterns might be defined as follows:
Ex-1: A pattern for an identifier: starts with a letter, followed by zero or more letters, digits, or underscores.
Ex-2: A pattern for integers: an integer is a sequence of digits.
Examples of Tokens
In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword
itself.
2. Tokens for the operators, either individually or in classes such as the token comparison.
3. One token representing all identifiers (names).
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and
semicolon.
Attribute Value for Tokens
Attributes for tokens are additional pieces The token names and associated attribute
of information associated with each token values for the Fortran statement
in a programming language. E = M * C ** 2 are:
These attributes help convey more details <id, pointer to symbol-table entry for E>
about the tokens beyond their basic
categorization. <assign-op >
The inclusion of attributes is particularly <id, pointer to symbol-table entry for M>
valuable for certain types of tokens where
<mult-op>
additional information is needed.
An attribute of a token is a value that the <id, pointer to symbol-table entry for C>
scanner extracts from the corresponding <exp-op>
lexeme and supplies to the syntax
analyzer <number , integer value 2 >
• What can be important attributes? Note: In certain pairs, especially operators,
punctuation, and keywords, there is no need
• Where is this information stored?
for an attribute value.
Token/Pattern/Lexeme Example-1
Consider the statement int x = 10; in the C programming language. Count the number of tokens.

Lexeme: int
Token: <keyword>
Pattern: int

Lexeme: x
Token: <id, 1>
Pattern: letter ( letter | digit )*

Lexeme: =
Token: <assign-op>
Pattern: =

Lexeme: 10
Token: <number, 10>
Pattern: digit+

Lexeme: ;
Token: <symbol>
Pattern: ;
Token/Pattern/Lexeme Example-2
Consider the following code in the C programming language. Find the number of tokens:
    int strange (int x)          → 6 tokens
    {                            → 1 token
    if(x<=0) return 0;           → 9 tokens
    if((x%2)!=0) return x-1;     → 15 tokens
    return 1+strange(x-1);       → 10 tokens
    }                            → 1 token
Total: 42 tokens
Token/Pattern/Lexeme Example-3
Consider the following C program. Find the number of tokens:
    main()                       → 3 tokens
    {                            → 1 token
    char ch='A';                 → 5 tokens
    int x, y;                    → 5 tokens
    x=y=20;                      → 6 tokens
    x++;                         → 3 tokens
    printf("%d%d",x,y);          → 9 tokens
    }                            → 1 token
Total: 33 tokens
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that
there is a source-code error.
• Example: if the string fi is encountered for the first time in a C program in the
following context a lexical analyzer cannot tell whether fi is a misspelling of the
keyword if or an undeclared function identifier.
fi (a == f(x))
return 5;
However it may be able to recognize errors like: d = 2r
• Such errors are recognized when no pattern for tokens matches a character
sequence.
• The simplest recovery strategy is "panic mode" recovery. We delete
successive characters from the remaining input, until the lexical analyzer can
find a well-formed token at the beginning of what input is left.
Lexical Errors
When the token pattern does not match the prefix of the remaining input, the lexical analyzer gets
stuck and has to recover from this state to analyze the remaining input.
In simple words, a lexical error occurs when a sequence of characters does not match the
pattern of any token. It is detected during the lexical-analysis phase of compilation.
Types of Lexical Error:
1. Exceeding length of identifier or numeric constants.
Example:
    #include <iostream>
    using namespace std;
    int main()
    {
        int a = 21474836477844;
        return 0;
    }
This is a lexical error since a signed integer literal must lie between -2,147,483,648 and 2,147,483,647 for a 32-bit signed integer.
Lexical Errors – contd…
2. Appearance of illegal characters:
    #include <iostream>
    using namespace std;
    int main()
    {
        printf("Geeksforgeeks") $;
        return 0;
    }
This is a lexical error due to the presence of the $ sign after the printf statement. The $ sign is not a valid character in the C++ language. The compiler recovers by skipping the invalid character ('$') and continuing to analyze the rest of the code.
3. Unmatched comment:
    #include <iostream>
    using namespace std;
    int main() {
        /* comment
        cout<<"GFG!";
        return 0;
    }
This is a lexical error since the beginning of the comment is present but the ending "*/" is not. The compiler recovers by ignoring the rest of the line and continuing to analyze subsequent lines.
Lexical Errors
4. Spelling Error (Misspelling of identifier):
    #include <iostream>
    using namespace std;
    int main()
    {
        int 3num = 1234;
        return 0;
    }
Spelling error, as an identifier cannot start with a number.
Exception:
    void main()
    {
        int i, n;
        ofr( i=0; i< n; i++)
        printf("great india\n");
    }
• The lexical analyzer will not declare this an error, since whether "ofr" is a misspelling of the keyword 'for' or an undeclared function identifier is not known to it.
• Since 'ofr' is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle the error due to the transposition of the letters.
Lexical Error Recovery Actions
The LA cannot catch errors other than simple ones such as illegal symbols.
In such cases, the LA skips characters in the input until a well-formed token is found.
• This is called "panic mode" recovery.
Panic mode is the simplest recovery strategy: we delete successive characters from the
remaining input until the lexical analyzer can find a well-formed token at the beginning
of what input is left.
Some of the lexical error recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Input buffering
The LA scans the characters of the source program one at a time to discover
tokens.
We often have to look one or more characters beyond the next lexeme before we can be
sure we have the right lexeme; that is, the lexical analyzer may need to look ahead at
some symbols before deciding which token to return.
• In C language, single-character operators like -, =, or < could also be the
beginning of a two-character operator like ->, ==, or <=.
• In Fortran: DO 5 I = 1.25
Input buffering – contd…
The primary role of input buffering is to maintain a buffer of characters retrieved from the
source code.
It allows the lexical analyzer to look ahead in the input stream, examining upcoming
characters to make decisions about token boundaries.
A two-buffer scheme handles large lookaheads safely.
Buffering techniques:
1. One Buffer Scheme
2. Two Buffer Scheme using Buffer Pairs
3. Two Buffer Scheme using Sentinels
Two pointers into the input are maintained in these buffering techniques:
1. Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent we are
attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
One Buffer Scheme
In this one buffer scheme, only one buffer is used to store the input string.
The problem is that if a lexeme is very long it may cross the buffer boundary; to scan the rest
of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
Consider the statement:
int i=i+1; j=j+1;
Buffer :
i n t i = i + 1
lexemeBegin forward
The buffer gets refilled when forward reaches the end of the buffer, i.e., the '1' in int i=i+1:
; j = j + 1 ;
lexemeBegin
Buffer Pairs
Because a large amount of time can be consumed scanning characters, specialized buffering
techniques have been developed to reduce the amount of overhead required to process an input
character.
This scheme involves two buffers that are alternately reloaded.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
Using one system read command we can read N characters into a buffer, rather than using one
system call per character.
• If fewer than N characters remain in the input file, then a special character, represented by
eof, marks the end of the source file and is different from any possible character of source
program.
Buffer Pairs – contd…
Once the next lexeme is determined, forward is set to the character at its right
end.
Then, after the lexeme is recorded as an attribute value of a token returned to the
parser, lexemeBegin is set to the character immediately after the lexeme just
found.
Advancing forward requires that we first test whether we have reached the end of
one of the buffers, and if so, we must reload the other buffer from the input, and
move forward to the beginning of the newly loaded buffer.
Buffer pairs (Code to advance forward pointer)
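In the standard buffer-pair scheme, advancing the forward pointer works as sketched below (each advance pays one test for the end of a buffer in addition to examining the character itself):

    if forward is at end of first buffer then begin
        reload second buffer;
        forward = beginning of second buffer;
    end
    else if forward is at end of second buffer then begin
        reload first buffer;
        forward = beginning of first buffer;
    end
    else forward = forward + 1;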
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-1
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-2
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-3
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-4
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-5
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
int → keyword
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-6
forward
Buffer 1:
i n t i = i + 1
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-7
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-8
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
i → identifier
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-9
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-10
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-11
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
= → operator
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-12
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-13
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-14
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
i → identifier
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-15
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-16
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-17
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
+ → operator
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-18
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-19
Buffer 1:
i n t i = i + 1
lexemeBegin
Reload 2nd half
forward 1 → number
Buffer 2:
; j = j + 1 ; eof
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-20
Buffer 1:
i n t i = i + 1
forward
Buffer 2:
; j = j + 1 ; eof
lexemeBegin ; → delimiter
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Continuing till
end of buffer
Buffer 1:
i n t i = i + 1
forward
Buffer 2:
; j = j + 1 ; eof
lexemeBegin
Sentinels
In the Buffer-Pair technique, we must check, each time we advance forward, that we have not
moved off one of the buffers; if we do, then we must also reload the other buffer.
For each character read, we make two tests: one for the end of the buffer, and one to determine
what character is read (the latter may be a multiway branch).
We can combine the buffer-end test with the test for the current character if we extend each buffer
to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural
choice is the character eof.
The eof retains its use as a marker for the end of the entire input.
Any eof that appears other than at the end of a buffer means that the input is at an end.
Sentinels
The following algorithm shows the method for advancing forward pointer.
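The standard sentinel-based code for advancing forward is sketched below; note that the common case needs only the one switch on the character just read:

    switch (*forward++) {
        case eof:
            if (forward is at end of first buffer) {
                reload second buffer;
                forward = beginning of second buffer;
            }
            else if (forward is at end of second buffer) {
                reload first buffer;
                forward = beginning of first buffer;
            }
            else  /* eof within a buffer marks the end of input */
                terminate lexical analysis;
            break;
        /* cases for the other characters */
    }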
Example: Sentinels
E.g.: Consider int i=i+1; j=j+1;
Sentinel is added
Buffer 1:
i n t i = i + 1 eof
With sentinels, if the forward pointer is not at an eof,
the scanner simply scans the next character (forward++).
Hence each advance takes only one comparison.
forward End of source file
Buffer 2:
; j = j + 1 ; eof eof
lexemeBegin Sentinel is added
Tokens in Programming Languages
Keywords, operators, identifiers (names), constants, literal strings, punctuation
symbols (parentheses, brackets, commas, semicolons, and colons)
Attributes for tokens (apart from the integer representing the token)
• identifier: the lexeme of the token, or a pointer into the symbol table where
the lexeme is stored by the LA.
• intnum: the value of the integer (similarly for floatnum, etc.)
• string: the string itself
The exact set of attributes is dependent on the compiler designer.
Role of a Lexical Analyzer
Identify tokens and corresponding lexemes
Construct constants: for example, convert a number to token num and pass the
value as its attribute
• 31 becomes <num, 31>
Recognize keyword and identifiers
• counter = counter + increment becomes id = id + id
• Note that id here is not a keyword
Discard whatever does not contribute to parsing
• White spaces (blanks, tabs, newlines) and comments
How to specify Tokens?
How to describe tokens
2.e0 20.e-01 2.000
How to break text into token
if (x==0) a = x << 1;
if (x==0) a = x < 1;
How to break input into tokens efficiently
• Tokens may have similar prefixes
• Each character should be looked at only once
Specifying and Recognizing Patterns and Tokens
Patterns are denoted with regular expressions, and recognized with finite state
automata
Regular definitions, a mechanism based on regular expressions, are popular for
specification of tokens
Transition diagrams, a variant of finite state automata, are used to implement
regular definitions and to recognize tokens
• Usually used to model LA before translating them to executable programs
Regular languages
• Are easy to understand
• There is a well understood and useful theory
• They have efficient implementation
Specification of Tokens
Programming language tokens can be described by regular languages.
Regular expressions are means for specifying regular languages.
Regular expressions are an important notation for specifying lexeme patterns.
Although they cannot express all possible patterns, they are very effective in
specifying those types of patterns that we actually need for tokens.
We will discuss:
• Formal notation for regular expressions
• How these regular expressions are used in a lexical-analyzer generator.
Strings and Languages
Alphabet:
• An alphabet is any finite set of symbols.
• Examples: Letters, digits, punctuation, binary alphabets, ASCII, Unicode (includes
approximately 100,000 characters from alphabets around the world).
String:
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
• In language theory, the terms "sentence" and "word" are often used as synonyms for
"string".
• Example: "KIIT" is a string of length 4.
• ε denotes the empty string, the string of length zero.
Language:
• A language is any countable set of strings over some fixed alphabet.
• Example: ∅ (the empty set), {ε} (the set containing only the empty string), the set of all
syntactically well-formed C programs, the set of all grammatically correct English sentences.
Terms for Parts of Strings
1. A prefix of string s is any string obtained by removing zero or more symbols from the
end of s.
Example: ban, banana, and ε are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the
beginning of s.
Example: nana, banana, and ε are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
Example: banana, nan, and ε are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes,
suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily
consecutive positions of s.
Example: baan is a subsequence of banana.
Operations on Strings
1. If x and y are strings, then the concatenation of x and y, denoted as xy, is the
string formed by appending y to x.
Example: if x = my and y = home, then xy = myhome.
2. The empty string is the identity under concatenation; that is, for any string s,
εs = sε = s.
3. If we think of concatenation as a product, we can define the "exponentiation" of
strings as follows:
• Define s^0 to be ε, and for all i > 0, define s^i to be s^(i-1)s.
• Since εs = s, it follows that s^1 = s. Then s^2 = ss, s^3 = sss, and so on.
Operation on Languages
The important operations on languages for lexical analysis are union (L ∪ M), concatenation
(LM), the Kleene closure (L*, zero or more concatenations of L), and the positive closure
(L+, one or more concatenations of L).
Example:
• Let L = the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D = the set of digits
{0, 1, . . . , 9} be two languages containing strings of length one.
Operation on Languages: Example
Let L = the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D = the set of digits {0, 1, . . . , 9} be
two languages containing strings of length one.
1. L ∪ D is the set of letters and digits; strictly speaking, the language with 62
strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter
followed by one digit.
3. L^4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
Regular Expression
Examples of Regular Expressions
More Examples of Regular Expressions
Precedence Rules of Regular Expressions
We can reduce the use of parentheses by introducing precedence
and associativity rules
• The closure, concatenation, and alternation operators are left
associative
Precedence rule is:
• parentheses > closure > concatenation > alternation
Algebraic Laws for REs
Regular Set
A language that can be defined by a regular expression is called a regular set.
If two regular expressions r and s denote the same regular set, we say they are
equivalent and write r = s.
• Example: (a | b) = (b | a)
Regular Definitions
For notational convenience, we may wish to give names to certain regular
expressions and use those names in subsequent expressions, as if the names
were themselves symbols.
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where
1. Each di is a new symbol, not in Σ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, . . . , di-1}.
Examples of Regular Definitions
Example 1: Regular Definition for the language of C identifiers.
letter_ → A | B | . . . | Z | a | b | . . . | z | _
digit → 0 | 1 | . . . | 9
id → letter_ ( letter_ | digit )*
Example 2: Regular definition for the language of C unsigned numbers (integer
or floating point), which are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4.
digit → 0 | 1 | . . . | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
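As an aside, the two regular definitions above can be transcribed almost directly into library regular expressions. The following C++ sketch is our illustration, not part of the slides; the std::regex patterns are a hand translation of id and number:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        // id     -> letter_ ( letter_ | digit )*
        std::regex id("[A-Za-z_][A-Za-z_0-9]*");
        // number -> digits optionalFraction optionalExponent
        std::regex number("[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?");

        for (std::string s : {"count_1", "5280", "0.01234", "6.336E4", "1.89E-4"}) {
            std::cout << s << ": "
                      << (std::regex_match(s, id) ? "id"
                          : std::regex_match(s, number) ? "number" : "no match")
                      << "\n";
        }
    }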
Extensions of Regular Expressions
Kleene introduced regular expressions with the basic operators for union, concatenation, and Kleene
closure in the 1950s
The following are few notational extensions of Regular Expressions used to enhance their ability:
• One or more instances: The unary, postfix operator + represents the positive closure of a
regular expression and its language.
If r is a regular expression, then (r)+ denotes the language (L(r))+.
The operator + has the same precedence and associativity as the operator *.
Two useful algebraic laws: r* = r+ | ε and r+ = r r* = r* r.
• Zero or one instance: The unary postfix operator ? means "zero or one occurrence."
That is, r? is equivalent to r | ε; or, L(r?) = L(r) ∪ {ε}.
The ? operator has the same precedence and associativity as * and +.
• Character classes: A regular expression a1 | a2 | . . . | an, where the ai's are each symbols of the
alphabet, can be replaced by the shorthand [a1a2 . . . an].
When a1, a2, . . . , an form a logical sequence, we can replace them by a1-an.
Extensions of Regular Expressions
Extensions of Regular Expressions : Examples
Example 1: Regular definition for the language of C identifiers.
letter_ → A | B | . . . | Z | a | b | . . . | z | _   becomes   letter_ → [A-Za-z_]
digit → 0 | 1 | . . . | 9   becomes   digit → [0-9]
id → letter_ ( letter_ | digit )*
Example 2: Regular definition for the language of C unsigned numbers (integer
or floating point), which are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4.
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
(The separate optionalFraction and optionalExponent definitions are no longer needed.)
Complemented Character Class
A complemented character class represents any character except the ones listed in the character
class.
We denote a complemented class by using ^ as the first character.
The symbol ^(caret) is not itself part of the class being complemented, unless it is listed within the
class itself.
Example:
• [^A-Za-z]: matches any character that is not an uppercase or lowercase letter
• [^ \^]: represents any character but the caret
Examples of Regular Expression
My fax number: 91-(943)-716-2867
Σ = digit ∪ { -, (, ) }
country → digit+
area → '(' digit+ ')'
exchange → digit+
phone → digit+
number → country '-' area '-' exchange '-' phone
Examples of Regular Expression
My email address: jasaswi.mohantyfcs@kiit.ac.in
Σ = letter ∪ { @, . }
letter → a | b | . . . | z | A | B | . . . | Z
name → letter+
address → name '.' name '@' name '.' name '.' name
Regular expressions in specifications
Regular expressions describe many useful languages.
Regular expressions are only specifications; implementation is still required.
Given a string s and a regular expression R, does s ∈ L(R)?
Solution to this problem is the basis of the lexical analyzers.
However, just the yes/no answer is not sufficient.
Goal: Partition the input into tokens
Steps for Recognizing Tokens
1. Write a regular expression for the lexemes of each token
• number → digit+
• identifier → letter ( letter | digit )*
2. Construct R matching all lexemes of all tokens
• R = R1 | R2 | R3 | ...
3. Let the input be x1x2...xn
• For 1 ≤ i ≤ n, check x1x2...xi ∈ L(R)
• If x1x2...xi ∈ L(R), then find x1...xi ∈ L(Rj) for some j
– The smallest such j is the token class of x1...xi
– Remove x1...xi from the input and go to Step 3.
Recognition of Tokens: Example
The following grammar fragment describes a simple form of branching statements and conditional
expressions of the language Pascal:
Recognition of Tokens
For the language discussed in the previous slide, the lexical analyzer will
recognize the keywords if, then, and else, as well as lexemes that match the
patterns for relop, id, and number.
We make the common assumption that keywords are also reserved words: that
is, they are not identifiers, even though their lexemes match the pattern for
identifiers.
We assign the lexical analyzer the job of stripping out whitespace, by
recognizing the "token" ws defined by:
ws → ( blank | tab | newline )+
Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character
that follows the whitespace.
Recognition of Tokens
The following table shows, for each lexeme or family of lexemes, which token
name is returned to the parser and what attribute value is returned.
Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert patterns
into stylized flowcharts, called "transition diagrams".
Regular expressions are declarative specifications, whereas a transition diagram is an
implementation.
A transition diagram consists of
• A collection of nodes or circles, called states. Each state represents a condition that
could occur during the process of scanning the input looking for a lexeme that
matches one of several patterns.
• Edges are directed from one state of the transition diagram to another. Each edge is
labeled by a symbol or set of symbols.
• Certain states are said to be accepting, or final, states. These states indicate that a
lexeme has been found. We always indicate an accepting state by a double circle,
and if there is an action to be taken (typically returning a token and an attribute
value to the parser) we shall attach that action to the accepting state.
Transition Diagrams
• If it is necessary to retract the forward pointer one position (i.e., the lexeme
does not include the symbol that got us to the accepting state), then we shall
additionally place a * near that accepting state. We can attach any number of
*'s to the accepting state depending on the number of positions we need to
retract.
• One state is designated the start state, or initial state which is indicated by an
edge, labeled "start" entering from nowhere. The transition diagram always
begins in the start state before any input symbols have been read.
Examples of Transition Diagrams
Identifiers and reserved words
letter → [a-zA-Z]
digit → [0-9]
identifier → letter ( letter | digit )*
A * indicates a retraction state; the accepting action is return(getToken(), installID()).
Initially the reserved words are installed in the symbol table.
When we find an identifier, a call to installID() places it in the symbol table if it is not already there
and returns a pointer to the symbol-table entry for the lexeme found.
Any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its
token is id.
The function getToken examines the symbol table entry for the lexeme found, and returns
whatever token name the symbol table says this lexeme represents - either id or one of the
keyword tokens that was initially installed in the table.
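A minimal sketch of how installID() and getToken() might be realized over a hash-based symbol table follows; the token codes and the table layout are assumptions made for illustration:

    #include <iostream>
    #include <string>
    #include <unordered_map>

    enum Token { ID, IF, THEN, ELSE };                 // token names (assumed)

    // Reserved words are installed before scanning starts.
    std::unordered_map<std::string, Token> symtab =
        {{"if", IF}, {"then", THEN}, {"else", ELSE}};

    // Place the lexeme in the table if it is not already there and return an
    // iterator (our stand-in for a symbol-table pointer) to its entry.
    auto installID(const std::string& lexeme) {
        return symtab.emplace(lexeme, ID).first;       // no effect if present
    }

    // Return whatever token the symbol-table entry records: a keyword token
    // for reserved words, ID for everything else.
    Token getToken(const std::string& lexeme) {
        return installID(lexeme)->second;
    }

    int main() {
        std::cout << getToken("if") << " " << getToken("count") << "\n"; // 1 0
    }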
Transition diagram for the token relop
The following figure shows a transition diagram that recognizes the lexemes
matching the token relop.
Transition Diagrams for IDs and Keywords
Accepting action: return(getToken(), installID())
Transition Diagram for Unsigned Numbers
Finite Automata
Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
Finite automata come in two flavors:
• Nondeterministic finite automata (NFA):
Have no restrictions on the labels of their edges.
A symbol can label several edges out of the same state, and , the empty string, is a
possible label.
• Deterministic finite automata (DFA):
For each state, and for each symbol of its input alphabet exactly one edge with that symbol
leaving that state.
NOTE:
• Both deterministic and nondeterministic finite automata are capable of recognizing the same
languages.
• In fact these languages are exactly the same languages, called the regular languages, that
regular expressions can describe.
Nondeterministic Finite Automata
A nondeterministic finite automaton (NFA) consists of:
1. A finite set of states S.
2. A set of input symbols Σ, the input alphabet. We assume that ε, which
stands for the empty string, is never a member of Σ.
3. A transition function that gives, for each state, and for each symbol in
Σ ∪ {ε}, a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state).
5. A set of states F, a subset of S, that is distinguished as the accepting states
(or final states).
Transition graph
We can represent either an NFA or DFA by a Transition Graph, where the nodes
are states and the labeled edges represent the transition function.
There is an edge labeled a from state s to state t if and only if t is one of the next
states for state s and input a.
The same symbol can label edges from one state to several different states.
An edge may be labeled by ε, the empty string, instead of, or in addition to,
symbols from the input alphabet.
Example: NFA recognizing the language of regular expression (a|b)*abb
Transition Table
An NFA can also be represented by a transition table, whose rows correspond to states
and whose columns correspond to the input symbols and ε.
The entry for a given state and input is the value of the transition function applied to
those arguments.
If the transition function has no information about that state-input pair, we put ∅ in the
table for the pair.
Advantage:
• We can easily find the transitions on a given state and input.
Disadvantage:
• It takes a lot of space when the input alphabet is large, yet most states do not have
any moves on most of the input symbols.
Example: The transition table of the NFA recognizing the language of the regular
expression (a|b)*abb (states 0 through 3, with state 3 accepting):
    STATE    a        b        ε
    0        {0, 1}   {0}      ∅
    1        ∅        {2}      ∅
    2        ∅        {3}      ∅
    3        ∅        ∅        ∅
Acceptance of Input Strings by Automata
An NFA accepts input string x if and only if there is some path in the transition graph from the
start state to one of the accepting states, such that the symbols along the path spell out x.
Note that labels along the path are effectively ignored, since the empty string does not
contribute to the string constructed along the path.
Example: The string aabb is accepted by the NFA.
• There are two paths labeled by the same string aabb; one of them leads to the
accepting state 3, so the NFA accepts aabb.
NOTE:
• An NFA accepts a string as long as some path labeled by that string leads from the start
state to an accepting state. The existence of other paths leading to a non-accepting state is
irrelevant.
Language of an NFA
The language defined (or accepted) by an NFA is the set of strings labeling some
path from the start to an accepting state.
The following NFA defines the same language as does the regular expression
(a|b)*abb, that is, all strings from the alphabet {a, b} that end in abb.
We use L(A) to stand for the language accepted by automaton A.
Language of an NFA: Example
The following NFA accepts the language of the regular expression
aa*|bb*.
String aaa is accepted because of the following path:
Deterministic Finite Automata
A deterministic finite automaton (DFA) is a special case of an NFA where:
1. There are no moves on input ε, and
2. For each state s and input symbol a, there is exactly one edge out of s labeled a.
In the transition table of a DFA, each entry is a single state.
The NFA is an abstract representation of an algorithm to recognize the strings of a
certain language, whereas the DFA is a simple, concrete algorithm for recognizing
strings.
NOTE:
• Every regular expression and every NFA can be converted to a DFA accepting the
same language
• It is the DFA that we really implement or simulate when building lexical analyzers.
Simulating a DFA
INPUT:
• An input string x terminated by an end-of-file
character eof.
• A DFA D with start state s0, accepting states F, and
transition function move.
OUTPUT:
• Answer ''yes" if D accepts x; "no" otherwise.
METHOD:
• Apply the algorithm shown here to the input string x.
• The function move(s, c) gives the state to which there
is an edge from state s on input c.
• The function nextChar returns the next character of
the input string x.
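The following C++ sketch implements the standard simulation loop; the move table hard-codes the DFA for (a|b)*abb shown on the next slide, and the array layout is our assumption:

    #include <iostream>
    #include <string>

    int main() {
        // move[s][c]: DFA for (a|b)*abb, states 0..3,
        // column 0 = input 'a', column 1 = input 'b'.
        int move[4][2] = {{1, 0}, {1, 2}, {1, 3}, {1, 0}};
        bool accepting[4] = {false, false, false, true};

        std::string x = "ababb";      // the input string
        int s = 0;                    // s = s0
        for (char c : x)              // c = nextChar(); s = move(s, c)
            s = move[s][c - 'a'];
        std::cout << (accepting[s] ? "yes" : "no") << "\n";   // prints "yes"
    }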
Simulating a DFA: Example
The transition graph of a DFA accepting the language (a | b)*abb, the same as
that accepted by the NFA.
NFA DFA
Given the input string ababb, this DFA enters the sequence of states 0,1,2,1,2,3
and returns "yes."
Finite State Automaton (FSA) for recognizing “new ”
FSA for Unsigned Integers
Equivalence of RE and FSA
Let r be an RE. Then there exists an NFA with ε-transitions that accepts L(r).
If L is accepted by a DFA, then L is generated by an RE.
DFA to Minimal DFA: Hopcroft’s Algorithm
A DFA from the subset construction can have a large number of states.
• This does not increase the time needed to scan a string.
• It does increase the space requirement of the scanner in memory:
• The speed of accesses to main memory may turn out to be the bottleneck.
• A smaller scanner has better chances of fitting in the processor cache.
Hopcroft's Algorithm: The Idea
• We need to identify and merge any equivalent states in the DFA: states that produce identical
behavior across all inputs.
• It takes the set of all states and repeatedly partitions them into smaller and smaller subsets,
by identifying ways in which some states in a subset behave differently than others in the
subset.
• First: Split the set of all states into accepting and non-accepting states.
• Repeat: Pick an input character and check whether each state in a partition takes you to a
common partition (e.g., on input x, do all states in partition P transition to states in partition Q?).
Hopcroft’s Algorithm
Algorithm
    Current = { AcceptStates, NonAcceptStates }
    P = { }
    repeat until P and Current are the same:
        P = Current
        Current = { }
        for each set s in P:
            Current = Current ∪ Split(s)
    (Split tries to break s into two sets of "distinguishable" states.)
Split(s):
    for each input character c:
        for each state in s, determine which partition T[state][c] leads to
        if some states of s lead to one partition and the other states do not,
        use that to divide s into two sets s1 and s2 and return { s1, s2 }
DFA Minimization: Example
First divide the states into accepting and non-accepting states:
{c, d, e}, {a, b, f}
In the first pass of the loop, split {a, b, f} on input 1 into
{a, b}, {f}
(on input 1, a and b move into the accepting block {c, d, e}, while f does not).
Finally we have {(c, d, e), (a, b), (f)}.
    q    δ(q,0)    δ(q,1)
    a    b         d
    b    a         c
    c    e         f
    d    e         f
    e    e         f
    f    f         f
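The refinement loop can be coded compactly. The C++ sketch below is a plain partition-refinement variant of the idea, refining on one input character at a time; the example's transition table and accepting set are hard-coded as assumptions:

    #include <cstdio>
    #include <map>
    #include <utility>
    #include <vector>

    int main() {
        const int N = 6, SIGMA = 2;              // states a..f -> 0..5
        int trans[N][SIGMA] = {{1, 3}, {0, 2}, {4, 5}, {4, 5}, {4, 5}, {5, 5}};
        bool accept[N] = {false, false, true, true, true, false};

        std::vector<int> block(N);               // block[s] = partition index of s
        for (int s = 0; s < N; ++s) block[s] = accept[s] ? 0 : 1;
        int nBlocks = 2;

        bool changed = true;
        while (changed) {                        // repeat until no block splits
            changed = false;
            for (int c = 0; c < SIGMA; ++c) {
                // Signature of s: (its block, the block its c-successor is in).
                std::map<std::pair<int, int>, int> sig2block;
                std::vector<int> newBlock(N);
                for (int s = 0; s < N; ++s) {
                    std::pair<int, int> key = {block[s], block[trans[s][c]]};
                    auto it = sig2block.find(key);
                    if (it == sig2block.end())
                        it = sig2block.emplace(key, (int)sig2block.size()).first;
                    newBlock[s] = it->second;
                }
                if ((int)sig2block.size() != nBlocks) changed = true;
                nBlocks = (int)sig2block.size();
                block = newBlock;
            }
        }
        for (int s = 0; s < N; ++s)              // prints the final partition
            std::printf("state %c -> block %d\n", 'a' + s, block[s]);
    }

Run on the table above, this converges to the three blocks {c, d, e}, {a, b}, {f}.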
From Regular Expressions to Finite Automata
NFAs allow multiple transitions from a single state on the same input symbol, and they
may have epsilon (ε) transitions, making them non-deterministic.
Simulating an NFA can be computationally expensive, especially when dealing with
epsilon transitions and multiple possible paths for a given input sequence.
DFAs, on the other hand, are deterministic and have a unique transition for each input
symbol from every state. Deterministic behavior simplifies the implementation and
understanding of the automaton.
A DFA is more efficient to implement and execute. It avoids the need to explore multiple
paths simultaneously, leading to better performance during lexical analysis.
The table-driven approach allows for efficient lookups during lexical analysis, making it
easier to implement a scanner for recognizing tokens in the source code.
The DFA obtained after the conversion recognizes the same language as the original
NFA. This ensures that the lexical analyzer correctly recognizes the tokens in the source
code.
From Regular Expressions to Finite Automata
High Level Sketch
Lexical Specification → Regular Expression → NFA → DFA → Table-Driven Implementation of DFA
Now we will discuss:
1. The subset construction: How to convert NFA's to DFA's.
2. Algorithm for simulating NFA's directly, in situations (other than lexical analysis)
where the NFA-to-DFA conversion takes more time than the direct simulation.
3. How to convert regular expressions to NFA's, from which a DFA can be
constructed if desired.
From Regular Expressions to Finite Automata
For a given language the NFA can be simpler than DFA.
It is possible that the number of DFA states is exponential in the number of NFA
states, which could lead to difficulties when we try to implement this DFA.
However, part of the power of the automaton-based approach to lexical analysis
is that for real languages, the NFA and DFA have approximately the same
number of states, and the exponential behavior is not seen.
Conversion of an NFA to a DFA (Subset Construction)
Input: An NFA N
Output: A DFA D accepting the same language as N.
Method:
• The algorithm constructs a transition table Dtran for D.
• Each state of D is a set of NFA states. To construct Dtran, D will simulate "in parallel" all possible moves
N can make on a given input string.
• First we need to deal with the ε-transitions of N properly.
• The following functions describe the basic computations on the states of N that are needed in
the algorithm (note that s is a single state of N, while T is a set of states of N):
    ε-closure(s): the set of NFA states reachable from state s on ε-transitions alone.
    ε-closure(T): the set of NFA states reachable from some state s in T on ε-transitions alone.
    move(T, a): the set of NFA states to which there is a transition on input symbol a from some state s in T.
Subset Construction of a DFA from NFA
Basis: Before reading the first input symbol, N can be in any of the states of
ε-closure(s0), where s0 is its start state.
Induction:
• Suppose that N can be in the set of states T after reading input string x. If it next
reads input a, then N can immediately go to any of the states in move(T, a).
• However, after reading a, it may also make several ε-transitions; thus N could
be in any state of ε-closure(move(T, a)) after reading input xa.
The start state of D is ε-closure(s0), and the accepting states of D are all those
sets of N's states that include at least one accepting state of N.
Computing ε-closure(T)
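The standard worklist formulation of this computation is:

    push all states of T onto stack;
    initialize ε-closure(T) to T;
    while (stack is not empty) {
        pop t, the top element, off stack;
        for (each state u with an edge from t to u labeled ε)
            if (u is not in ε-closure(T)) {
                add u to ε-closure(T);
                push u onto stack;
            }
    }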
The subset construction: Main Algorithm
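The standard form of the main subset-construction loop is:

    initially, ε-closure(s0) is the only state in Dstates, and it is unmarked;
    while (there is an unmarked state T in Dstates) {
        mark T;
        for (each input symbol a) {
            U = ε-closure(move(T, a));
            if (U is not in Dstates)
                add U as an unmarked state to Dstates;
            Dtran[T, a] = U;
        }
    }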
Subset Construction: Example
NFA N for (a|b)*abb. Transition table Dtran for DFA D.
DFA D: Result of applying the subset construction on N.
Simulation of an NFA
INPUT: An input string x terminated by an end-of-file character eof. An NFA N
with start state s0, accepting states F, and transition function move.
OUTPUT: Answer "yes" if N accepts x; "no" otherwise.
Running Time: O(k(n+m)), where k is the input length and the NFA has n states and m transitions.
Efficiency of NFA Simulation
If carefully implemented, the Algorithm to simulate an NFA can be quite efficient.
For the implementation we need the following data structures:
• Two stacks, each of which holds a set of NFA states. One of these stacks,
oldStates, holds the "current" set of states. The second stack, newStates,
holds the "next" set of states. Just before a transition, newStates is transferred
to oldStates.
• A boolean array alreadyOn, indexed by the NFA states, to indicate which
states are in newStates. While the array and stack hold the same information,
it is much faster to interrogate alreadyOn[s] than to search for state s on the
stack newStates. It is for this efficiency that we maintain both representations.
• A two-dimensional array move[s, a] holding the transition table of the NFA.
The entries in this table, which are sets of states, are represented by linked
lists.
NFA Simulation
To implement line (1) of the algorithm, we need to set each entry in the array alreadyOn to
FALSE, then for each state s in ε-closure(s0), push s onto oldStates and set
alreadyOn[s] to TRUE.
This operation on state s, and the implementation of line (4) as well, are facilitated by a
function we shall call addState(s).
This function pushes state s onto newStates, sets alreadyOn[s] to TRUE, and calls itself
recursively on the states in move[s, ε] in order to further the computation of
ε-closure(s).
The addState(s) function is shown here:
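In its standard form (with alreadyOn, newStates, and move as described above):

    addState(s) {
        push s onto newStates;
        alreadyOn[s] = TRUE;
        for (t on move[s, ε])
            if (!alreadyOn[t])
                addState(t);
    }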
NFA Simulation
To implement line (4) we need to look at each state s on oldStates.
We first find the set of states move[s, c], where c is the next input, and for each of those
states that is not already on newStates, we apply addState to it.
Note that addState has the effect of computing the ε-closure and adding all those states
to newStates as well, if they were not already on.
This sequence of steps is summarized below:
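In the standard form:

    for (s on oldStates) {
        for (t on move[s, c])
            if (!alreadyOn[t])
                addState(t);
        pop s off oldStates;
    }
    for (s on newStates) {
        pop s off newStates;
        push s onto oldStates;
        alreadyOn[s] = FALSE;
    }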
Construction of an NFA from a Regular Expression
Algorithm: The McNaughton-Yamada-Thompson algorithm to convert a regular
expression to an NFA.
INPUT: A regular expression r over alphabet Σ.
OUTPUT: An NFA N accepting L(r).
METHOD: The rules for constructing an NFA consist of basis rules for handling
subexpressions with no operators, and inductive rules for constructing larger NFA's from
the NFA's for the immediate subexpressions of a given expression.
• BASIS:
For expression ε, construct the NFA as shown in Fig. 1.
For any subexpression a in Σ, construct the NFA as shown in Fig. 2.
Fig. 1 Fig. 2
Construction of an NFA from a Regular Expression – contd…
• INDUCTION:
Suppose N(s) and N(t) are NFA's for regular expressions s and t, respectively.
Let r = s | t. Then N(r), the NFA for r, is constructed as shown in Fig. 3.
Let r = s t. Then N(r), the NFA for r, is constructed as shown in Fig. 4.
Let r = s*. Then N(r), the NFA for r, is constructed as shown in Fig. 5
Let r = (s). Then L(r) = L(s), and we can use the NFA N(s) as N(r).
Fig. 3 Fig. 4 Fig. 5
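The construction is mechanical enough to code directly. The following C++ sketch is our illustration, not the slides' figures: states are integers, ε-edges use a reserved label, and, as a simplification, concatenation splices the two NFAs with an ε-edge instead of merging the accept state of s with the start state of t (the resulting language is the same):

    #include <cstdio>
    #include <map>
    #include <vector>

    struct NFA { int start, accept; };                 // one accepting state each

    const char EPS = 0;                                // reserved label for ε-edges
    std::vector<std::multimap<char, int>> edges;       // edges[state]: label -> target

    int newState() { edges.emplace_back(); return (int)edges.size() - 1; }
    void addEdge(int from, char c, int to) { edges[from].insert({c, to}); }

    NFA symbol(char a) {                               // basis: a single symbol a
        NFA n{newState(), newState()};
        addEdge(n.start, a, n.accept);
        return n;
    }
    NFA alt(NFA s, NFA t) {                            // r = s | t  (Fig. 3)
        NFA n{newState(), newState()};
        addEdge(n.start, EPS, s.start);  addEdge(n.start, EPS, t.start);
        addEdge(s.accept, EPS, n.accept); addEdge(t.accept, EPS, n.accept);
        return n;
    }
    NFA cat(NFA s, NFA t) {                            // r = s t  (Fig. 4)
        addEdge(s.accept, EPS, t.start);               // splice instead of merge
        return NFA{s.start, t.accept};
    }
    NFA star(NFA s) {                                  // r = s*  (Fig. 5)
        NFA n{newState(), newState()};
        addEdge(n.start, EPS, s.start);  addEdge(n.start, EPS, n.accept);
        addEdge(s.accept, EPS, s.start); addEdge(s.accept, EPS, n.accept);
        return n;
    }

    int main() {                                       // build (a|b)*abb
        NFA r = cat(cat(cat(star(alt(symbol('a'), symbol('b'))),
                            symbol('a')), symbol('b')), symbol('b'));
        std::printf("%zu states; start = %d, accept = %d\n",
                    edges.size(), r.start, r.accept);
    }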
Construction of an NFA from a Regular Expression:
Example
Construct an NFA for r = (a | b)*abb.
(The slide decomposes r into subexpressions r1 through r11 and builds N(r) bottom-up,
combining the NFAs for the subexpressions with the rules above.)
Implementing Scanners
1. Specify REs for each syntactic category
2. Construct an NFA for each RE
3. Join the NFAs with 𝜖-transitions
4. Create the equivalent DFA
5. Minimize the DFA
6. Generate code to implement the DFA
Implementation Considerations
Speed is paramount for scanning
Processes every character from an input source program
Repeatedly read the input character and simulate the corresponding DFA
• Table-driven scanners
• Direct-coded scanners
• Hand-coded scanners
High-Level Idea in Implementing Scanners
1. Read input character one by one
2. Look up the transition based on the current state and the input character
3. Switch to the new state
4. Check for termination conditions, i.e., accept and error
5. Repeat
Table-Driven Scanner
In a table-driven approach, the scanner uses a finite automaton or a set of rules stored in
tables to recognize and classify tokens.
The tables typically consist of states and transitions, defining the behavior of the scanner for
each possible input symbol in a given state.
The tables are often generated by a tool (like Lex or Flex) based on a high-level description of
the language's lexical structure.
Changes to the scanner's behavior can be made by modifying the tables, making the scanner
more modular and easier to update and maintain.
    char = getNextChar()
    state = s0
    while (char ≠ EOF)
        state = δ(state, char)
        char = getNextChar()
    if (state ∈ SF)
        accept
    else
        error
Direct-Coded Scanner
In a direct-coded approach, the scanner is implemented directly as procedural code.
The code consists of explicit instructions and conditionals that directly recognize and process
the tokens in the source code.
The implementation tends to be more straightforward and may be easier to understand,
especially for small languages or simple grammars.
Modifications to the scanner's behavior often involve directly changing the source code, which
may be less modular than the table-driven approach.
    s0: char = getNextChar()
        if (char == 'r')
            goto s1
        else
            goto se
    s1: char = getNextChar()
        if ('0' ≤ char ≤ '9')
            goto s2
        else
            goto se
    s2: char = getNextChar()
        if ('0' ≤ char ≤ '9')
            goto s2
        else if (char == EOF)
            accept
        else
            goto se
    se: error
Challenges in Lexical Analysis
Certain languages like PL/I do not have any reserved words
• while, do, if, and else are reserved in C but not in PL/I
• Makes it difficult for the scanner to distinguish between keywords and user defined identifiers
if then then then = else else else = then
if if then then = then + 1
PL/I declarations
• DECLARE(arg1,arg2,arg3,…,argn)
• Cannot tell whether DECLARE is a keyword with variable definitions or is a procedure with
arguments until after “)”
Requires arbitrary lookahead and very large buffers
• Worse, the buffers may have to be reloaded in case of wrong inferences
Is fi a typo or a function call?
• Note that fi is a valid lexeme for IDENTIFIER
Challenges in Lexical Analysis
In fixed-format Fortran, some keywords are context-dependent
In the statement, DO 10 I = 10.86, DO10I is an identifier, and DO is not a
keyword
But in the statement, DO 10 I = 10, 86, DO is a keyword
Blanks are not significant in Fortran and can appear in the midst of identifiers, but
not so in C
• Variable “counter” is same as “count er”
In Fortran, blanks are important only in literal strings
Reading from left to right, one cannot distinguish between the two until the “,” or
“.” is reached
• Requires look ahead for resolution