Compiler Design
6th Semester B.Tech. (CSE)
Course Code: 18CS1T08
Dr. Jasaswi Prasad Mohanty
School of Computer Engineering
KIIT Deemed to be University
Bhubaneswar, India
MODULE II
Lexical Analysis
Sl. No. Topics
1. Role of Lexical Analyzer
2. Input Buffering
3. Specification of Tokens
4. Recognition of Tokens
5. Finite Automata
6. From Regular Expressions to Finite Automata
7. Implementing Scanners
The Role of Lexical Analyzer
The main task of the lexical analyzer is to read the input characters of the source program,
group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in
the source program.
The stream of tokens is sent to the parser for syntax analysis.
It is common for the lexical analyzer to interact with the symbol table as well.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table.
The Role of Lexical Analyzer – contd . . .
Commonly, the interaction is implemented by having the parser call the lexical analyzer.
The call, suggested by the getNextToken command, causes the lexical analyzer to read
characters from its input until it can identify the next lexeme and produce for it the next token,
which it returns to the parser.
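In code form, this interaction is a simple pull loop. The following is a minimal C++ sketch; the Token layout, the token codes, and the canned token sequence are illustrative assumptions, not a fixed API:

    // A toy lexical analyzer that hands out a fixed token sequence; a real one
    // would read input characters and recognize lexemes as described above.
    #include <cstdio>

    struct Token { int name; int attribute; };   // token = <name, attribute value>
    enum { TOK_EOF, TOK_ID, TOK_NUM };           // token codes (assumed)

    Token getNextToken() {
        static Token demo[] = {{TOK_ID, 1}, {TOK_NUM, 31}, {TOK_EOF, 0}};
        static int i = 0;
        return demo[i++];
    }

    int main() {                                 // the "parser" pulls tokens
        for (Token t = getNextToken(); t.name != TOK_EOF; t = getNextToken())
            std::printf("<%d, %d>\n", t.name, t.attribute);
    }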
The Role of Lexical Analyzer
The lexical analyzer performs some of the following other tasks besides identification
of lexemes:
• Strip out comments and whitespace (blank, newline, tab, and perhaps other
characters that are used to separate tokens in the input).
• Correlate error messages generated by the compiler with the source
program. For instance, the lexical analyzer may keep track of the number of
newline characters seen, so it can associate a line number with each error
message.
• Make a copy of the source program with the error messages inserted at
the appropriate positions.
• May perform the expansion of macros if the source program uses a macro-
preprocessor.
The Role of Lexical Analyzer
Lexical analyzers are divided into a cascade of two processes:
• Scanning: consists of the simple processes that do not require tokenization of
the input, such as deletion of comments and compaction of consecutive
whitespace characters into one.
• (Proper) Lexical analysis: the more complex portion, where the scanner produces the
sequence of tokens as output.
Why Separate Lexical Analysis and Parsing?
There are a number of reasons why the analysis portion of a compiler is normally separated into
lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design
• The separation of lexical and syntactic analysis often allows us to simplify at least one of
these tasks.
If the Parser had to deal with comments and whitespace: The parser would need to handle
comments and whitespace as part of its syntactic analysis. This would add complexity to
the parser because it would have to distinguish between meaningful language constructs
and non-essential elements like comments and whitespace.
2. Improving compiler efficiency
• A separate lexical analyzer allows us to apply specialized techniques that serve only the
lexical task, not the job of parsing.
• Specialized buffering techniques for reading input characters can speed up the compiler
significantly.
3. Enhancing compiler portability
• Input-device-specific peculiarities can be restricted to the lexical analyzer, so the
rest of the compiler need not change across input platforms.
Responsibilities of Lexical Analysis
1. Tokenization:
• Recognition of Lexical Elements: Identify and recognize lexical elements in the
source code, such as keywords, identifiers, literals (constants), operators, and
punctuation symbols.
• Grouping into Tokens: Group the recognized lexical elements into tokens. A token
is a meaningful unit that represents a particular syntactic element in the programming
language.
2. Handling Whitespace and Comments:
• Whitespace Removal: Recognize and discard whitespace characters (spaces, tabs,
line breaks) from the source code. Whitespace is generally irrelevant to the syntactic
structure of the program but contributes to readability for humans.
• Comment Removal: Identify and eliminate comments from the source code.
Comments are annotations meant for human readers and do not affect the program's
execution.
Tokenization: Example
Code: Responsibility: Recognize and categorize
int main() lexical elements such as int, main, (, ), {,
return, 0, ;, and } into tokens.
{
return 0;
}
Handling Whitespace and Comments: Example
Code:
    /* This is a comment */
    int main()
    {
        return 0;
    }
Responsibility: Recognize and discard comments (/* This is a comment */) and handle whitespace.
Responsibilities of Lexical Analysis
3. Error Handling:
• Error Detection: Detect and report lexical errors, such as the use of undefined
symbols or invalid characters. Lexical errors indicate deviations from the language's
lexical rules.
• Error Recovery: Implement strategies for recovering from errors to continue
processing the remaining code. Error recovery mechanisms aim to provide
informative error messages without prematurely terminating the compilation process.
4. Symbol Table Management:
• Building Symbol Table Entries: Create entries in the symbol table for identifiers
encountered during tokenization. The symbol table is a data structure that associates
each identifier with information about its type, scope, and other properties.
• Handling Reserved Words: Identify reserved words in the language and ensure
they are appropriately classified as keywords.
Error Handling: Example
Code:
    int mai%n()
    {
        return 0;
    }
Responsibility: Detect and report lexical errors, such as the use of % in the identifier.
Error Handling: Lexical error: Invalid character '%' in identifier.
Symbol Table Management: Example
Code:
    int main()
    {
        int x;
        return x;
    }
Symbol Table: x . . .
Responsibility: Create entries in the symbol table for identifiers (x) encountered during tokenization.
Responsibilities of Lexical Analysis
5. Generating Output:
• Output Tokens: Generate a stream of tokens, where each token represents a
recognized and categorized syntactic element. This token stream becomes the input
for the subsequent phases of the compiler.
6. Optimizations and Preprocessing:
• Optimization Opportunities: Identify simple optimizations that can be performed at
the lexical analysis phase. For example, recognizing constant literals and replacing
them with their computed values.
• Preprocessing Directives: Handle preprocessor directives if applicable, such as
macro expansions or conditional compilation directives.
7. Interface with the Parser:
• Providing Input to the Parser: Present the generated token stream to the parser for
syntactic analysis. The parser relies on the lexical analyzer to provide a well-defined
and organized stream of tokens.
Optimizations and Pre-processing: Example
Code:
    #define MAX 100
    int main()
    {
        return MAX;
    }
Responsibility: Handle preprocessing directives, such as macro expansions.
Generating Output and Interface with the Parser: Example
Code:
    int main()
    {
        return 0;
    }
Responsibility: Generate the tokens and provide the generated token stream to the parser for syntactic analysis.
Lexical Analysis Operations
This phase of the compiler does the following operations:
• Recognize tokens and ignore whitespace and comments
• Generate the token stream
• Report errors
• Model tokens using regular expressions
• Recognize tokens using finite state automata
Terms related to Lexical Analysis
Lexeme
• A lexeme is a sequence of characters in the source code that matches the pattern for a token.
• It is the actual occurrence of a token in the source code.
• Example: In the C programming language, the statement int x = 10; contains the following lexemes: int, x, =, 10 and ;
Token
• A token is a pair consisting of a token name and an optional attribute value.
• The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.
• The optional attribute value provides additional information associated with the token.
• Example: The statement int x = 10; can be broken down into the following tokens: <int> <id, 1> <=> <number, 10> <;>
Pattern
• A pattern is a description of the form that the lexemes of a token may take.
• A pattern is a rule or template that defines the possible structure of a token.
• It describes the set of valid sequences of characters that can form a particular token.
• Example: For an identifier and an integer, the patterns might be defined as follows:
Ex-1: A pattern for an identifier: starts with a letter, followed by zero or more letters, digits, or underscores.
Ex-2: A pattern for integers: an integer is a sequence of digits.
Examples of Tokens
In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword
itself.
2. Tokens for the operators, either individually or in classes such as the token comparison.
3. One token representing all identifiers (names).
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and
semicolon.
Attribute Value for Tokens
Attributes for tokens are additional pieces The token names and associated attribute
of information associated with each token values for the Fortran statement
in a programming language. E = M * C ** 2 are:
These attributes help convey more details <id, pointer to symbol-table entry for E>
about the tokens beyond their basic
categorization. <assign-op >
The inclusion of attributes is particularly <id, pointer to symbol-table entry for M>
valuable for certain types of tokens where
<mult-op>
additional information is needed.
An attribute of a token is a value that the <id, pointer to symbol-table entry for C>
scanner extracts from the corresponding <exp-op>
lexeme and supplies to the syntax
analyzer <number , integer value 2 >
• What can be important attributes? Note: In certain pairs, especially operators,
punctuation, and keywords, there is no need
• Where is this information stored?
for an attribute value.
Token/Pattern/Lexeme Example-1
Consider the statement int x = 10; in the C programming language. Count the number of tokens.

Lexeme: int
Token: <keyword>
Pattern: int

Lexeme: x
Token: <id, 1>
Pattern: letter ( letter | digit )*

Lexeme: =
Token: <assign-op>
Pattern: =

Lexeme: 10
Token: <number, 10>
Pattern: digit+

Lexeme: ;
Token: <symbol>
Pattern: ;
Token/Pattern/Lexeme Example-2
Consider the following code in the C programming language. Find the number of tokens:
    int strange (int x)          → 6 tokens
    {                            → 1 token
    if(x<=0) return 0;           → 9 tokens
    if((x%2)!=0) return x-1;     → 15 tokens
    return 1+strange(x-1);       → 10 tokens
    }                            → 1 token
Total: 42 tokens
Token/Pattern/Lexeme Example-3
Consider the following C program. Find the number of tokens:
    main()                       → 3 tokens
    {                            → 1 token
    char ch='A';                 → 5 tokens
    int x, y;                    → 5 tokens
    x=y=20;                      → 6 tokens
    x++;                         → 3 tokens
    printf("%d%d",x,y);          → 9 tokens
    }                            → 1 token
Total: 33 tokens
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that
there is a source-code error.
• Example: if the string fi is encountered for the first time in a C program in the
following context a lexical analyzer cannot tell whether fi is a misspelling of the
keyword if or an undeclared function identifier.
fi (a == f(x))
return 5;
However it may be able to recognize errors like: d = 2r
• Such errors are recognized when no pattern for tokens matches a character
sequence.
• The simplest recovery strategy is "panic mode" recovery. We delete
successive characters from the remaining input, until the lexical analyzer can
find a well-formed token at the beginning of what input is left.
Lexical Errors
When the token pattern does not match the prefix of the remaining input, the lexical analyzer gets
stuck and has to recover from this state to analyze the remaining input.
In simple words, a lexical error occurs when a sequence of characters does not match the
pattern of any token. It is detected during the lexical-analysis phase of compilation.
Types of Lexical Error:
1. Exceeding length of identifier or numeric constants.
Example:
    #include <iostream>
    using namespace std;
    int main()
    {
        int a = 21474836477844;
        return 0;
    }
This is a lexical error since a signed integer literal must lie between -2,147,483,648 and 2,147,483,647 for a 32-bit signed integer.
Lexical Errors – contd…
2. Appearance of illegal characters:
    #include <iostream>
    using namespace std;
    int main()
    {
        printf("Geeksforgeeks") $;
        return 0;
    }
This is a lexical error due to the presence of the $ sign after the printf statement. The $ sign is not a valid character in the C++ language. The compiler recovers by skipping the invalid character ('$') and continuing to analyze the rest of the code.
3. Unmatched comment:
    #include <iostream>
    using namespace std;
    int main() {
        /* comment
        cout<<"GFG!";
        return 0;
    }
This is a lexical error since the beginning of the comment is present but the ending "*/" is not. The compiler recovers by ignoring the rest of the line and continuing to analyze subsequent lines.
Lexical Errors
4. Spelling Error (Misspelling of identifier):
    #include <iostream>
    using namespace std;
    int main()
    {
        int 3num = 1234;
        return 0;
    }
Spelling error, as an identifier cannot start with a number.
Exception:
    void main()
    {
        int i, n;
        ofr( i=0; i< n; i++)
        printf("great india\n");
    }
• The lexical analyzer will not declare this an error, since whether "ofr" is a misspelling of the keyword 'for' or an undeclared function identifier is not known to it.
• Since 'ofr' is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle the error due to the transposition of the letters.
Lexical Error Recovery Actions
The LA cannot catch errors other than simple ones such as illegal symbols.
In such cases, the LA skips characters in the input until a well-formed token is found.
• This is called "panic mode" recovery.
Panic mode is the simplest recovery strategy: we delete successive characters from the
remaining input until the lexical analyzer can find a well-formed token at the beginning
of what input is left.
Some of the lexical error recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Input buffering
The LA scans the characters of the source program one at a time to discover
tokens.
We often have to look one or more characters beyond the next lexeme before we can be
sure we have the right lexeme; that is, the lexical analyzer may need to look ahead at
some symbols before deciding which token to return.
• In C language, single-character operators like -, =, or < could also be the
beginning of a two-character operator like ->, ==, or <=.
• In Fortran: DO 5 I = 1.25
Input buffering – contd…
The primary role of input buffering is to maintain a buffer of characters retrieved from the
source code.
It allows the lexical analyzer to look ahead in the input stream, examining upcoming
characters to make decisions about token boundaries.
A two-buffer scheme handles large lookaheads safely.
Buffering techniques:
1. One Buffer Scheme
2. Two Buffer Scheme using Buffer Pairs
3. Two Buffer Scheme using Sentinels
Two pointers into the input are maintained in these buffering techniques:
1. Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent we are
attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
One Buffer Scheme
In this one buffer scheme, only one buffer is used to store the input string.
The problem is that if a lexeme is very long it may cross the buffer boundary; to scan the rest
of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
Consider the statement:
int i=i+1; j=j+1;
Buffer :
i n t i = i + 1
lexemeBegin forward
The buffer gets refilled when forward reaches the end of the buffer, i.e., the '1' in int i=i+1:
; j = j + 1 ;
lexemeBegin
Buffer Pairs
Because a large amount of time can be consumed scanning characters, specialized buffering
techniques have been developed to reduce the amount of overhead required to process an input
character.
This scheme involves two buffers that are alternately reloaded.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
Using one system read command we can read N characters into a buffer, rather than using one
system call per character.
• If fewer than N characters remain in the input file, then a special character, represented by
eof, marks the end of the source file and is different from any possible character of source
program.
Buffer Pairs – contd…
Once the next lexeme is determined, forward is set to the character at its right
end.
Then, after the lexeme is recorded as an attribute value of a token returned to the
parser, lexemeBegin is set to the character immediately after the lexeme just
found.
Advancing forward requires that we first test whether we have reached the end of
one of the buffers, and if so, we must reload the other buffer from the input, and
move forward to the beginning of the newly loaded buffer.
Buffer pairs (Code to advance forward pointer)
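In the standard buffer-pair scheme, advancing the forward pointer works as sketched below (each advance pays one test for the end of a buffer in addition to examining the character itself):

    if forward is at end of first buffer then begin
        reload second buffer;
        forward = beginning of second buffer;
    end
    else if forward is at end of second buffer then begin
        reload first buffer;
        forward = beginning of first buffer;
    end
    else forward = forward + 1;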
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-1
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-2
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-3
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-4
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-5
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
int → keyword
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-6
forward
Buffer 1:
i n t i = i + 1
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-7
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-8
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
i → identifier
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-9
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-10
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-11
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
= → operator
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-12
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-13
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-14
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
i → identifier
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-15
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-16
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-17
forward
Buffer 1:
i n t i = i + 1
lexemeBegin
+ → operator
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-18
Buffer 1:
i n t i = i + 1
lexemeBegin
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-19
Buffer 1:
i n t i = i + 1
lexemeBegin
Reload 2nd half
forward 1 → number
Buffer 2:
; j = j + 1 ; eof
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Time-20
Buffer 1:
i n t i = i + 1
forward
Buffer 2:
; j = j + 1 ; eof
lexemeBegin ; → delimiter
Example: Buffer pairs
E.g.: Consider int i=i+1; j=j+1;
Continuing till
end of buffer
Buffer 1:
i n t i = i + 1
forward
Buffer 2:
; j = j + 1 ; eof
lexemeBegin
Sentinels
In the Buffer-Pair technique, we must check, each time we advance forward, that we have not
moved off one of the buffers; if we do, then we must also reload the other buffer.
For each character read, we make two tests: one for the end of the buffer, and one to determine
what character is read (the latter may be a multiway branch).
We can combine the buffer-end test with the test for the current character if we extend each buffer
to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural
choice is the character eof.
The eof retains its use as a marker for the end of the entire input.
Any eof that appears other than at the end of a buffer means that the input is at an end.
Sentinels
The following algorithm shows the method for advancing forward pointer.
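The standard sentinel-based code for advancing forward is sketched below; note that the common case needs only the one switch on the character just read:

    switch (*forward++) {
        case eof:
            if (forward is at end of first buffer) {
                reload second buffer;
                forward = beginning of second buffer;
            }
            else if (forward is at end of second buffer) {
                reload first buffer;
                forward = beginning of first buffer;
            }
            else  /* eof within a buffer marks the end of input */
                terminate lexical analysis;
            break;
        /* cases for the other characters */
    }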
Example: Sentinels
E.g.: Consider int i=i+1; j=j+1;
Sentinel is added
Buffer 1:
i n t i = i + 1 eof
With sentinels, if the forward pointer is not at an eof,
the scanner simply scans the next character (forward++).
Hence each advance takes only one comparison.
forward End of source file
Buffer 2:
; j = j + 1 ; eof eof
lexemeBegin Sentinel is added
Tokens in Programming Languages
Keywords, operators, identifiers (names), constants, literal strings, punctuation
symbols (parentheses, brackets, commas, semicolons, and colons)
Attributes for tokens (apart from the integer representing the token)
• identifier: the lexeme of the token, or a pointer into the symbol table where
the lexeme is stored by the LA.
• intnum: the value of the integer (similarly for floatnum, etc.)
• string: the string itself
The exact set of attributes is dependent on the compiler designer.
Role of a Lexical Analyzer
Identify tokens and corresponding lexemes
Construct constants: for example, convert a number to token num and pass the
value as its attribute
• 31 becomes <num, 31>
Recognize keyword and identifiers
• counter = counter + increment becomes id = id + id
• Note that id here is not a keyword
Discard whatever does not contribute to parsing
• White spaces (blanks, tabs, newlines) and comments
How to specify Tokens?
How to describe tokens
2.e0 20.e-01 2.000
How to break text into token
if (x==0) a = x << 1;
if (x==0) a = x < 1;
How to break input into tokens efficiently
• Tokens may have similar prefixes
• Each character should be looked at only once
Specifying and Recognizing Patterns and Tokens
Patterns are denoted with regular expressions, and recognized with finite state
automata
Regular definitions, a mechanism based on regular expressions, are popular for
specification of tokens
Transition diagrams, a variant of finite state automata, are used to implement
regular definitions and to recognize tokens
• Usually used to model LA before translating them to executable programs
Regular languages
• Are easy to understand
• There is a well understood and useful theory
• They have efficient implementation
Specification of Tokens
Programming language tokens can be described by regular languages.
Regular expressions are means for specifying regular languages.
Regular expressions are an important notation for specifying lexeme patterns.
Although they cannot express all possible patterns, they are very effective in
specifying those types of patterns that we actually need for tokens.
We will discuss:
• Formal notation for regular expressions
• How these regular expressions are used in a lexical-analyzer generator.
Strings and Languages
Alphabet:
• An alphabet is any finite set of symbols.
• Examples: Letters, digits, punctuation, binary alphabets, ASCII, Unicode (includes
approximately 100,000 characters from alphabets around the world).
String:
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
• In language theory, the terms "sentence" and "word" are often used as synonyms for
"string".
• Example: "KIIT" is a string of length 4.
• ε denotes the empty string, the string of length zero.
Language:
• A language is any countable set of strings over some fixed alphabet.
• Example: ∅ (the empty set), {ε} (the set containing only the empty string), the set of all
syntactically well-formed C programs, the set of all grammatically correct English sentences.
Terms for Parts of Strings
1. A prefix of string s is any string obtained by removing zero or more symbols from the
end of s.
Example: ban, banana, and ε are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the
beginning of s.
Example: nana, banana, and ε are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
Example: banana, nan, and ε are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes,
suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily
consecutive positions of s.
Example: baan is a subsequence of banana.
Operations on Strings
1. If x and y are strings, then the concatenation of x and y, denoted as xy, is the
string formed by appending y to x.
Example: if x = my and y = home, then xy = myhome.
2. The empty string is the identity under concatenation; that is, for any string s,
εs = sε = s.
3. If we think of concatenation as a product, we can define the "exponentiation" of
strings as follows:
• Define s^0 to be ε, and for all i > 0, define s^i to be s^(i-1)s.
• Since εs = s, it follows that s^1 = s. Then s^2 = ss, s^3 = sss, and so on.
Operation on Languages
The important operations on languages for lexical analysis are union (L ∪ M), concatenation
(LM), the Kleene closure (L*, zero or more concatenations of L), and the positive closure
(L+, one or more concatenations of L).
Example:
• Let L = the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D = the set of digits
{0, 1, . . . , 9} be two languages containing strings of length one.
Operation on Languages: Example
Let L = the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D = the set of digits {0, 1, . . . , 9} be
two languages containing strings of length one.
1. L ∪ D is the set of letters and digits; strictly speaking, the language with 62
strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter
followed by one digit.
3. L^4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
Regular Expression
Examples of Regular Expressions
More Examples of Regular Expressions
Precedence Rules of Regular Expressions
We can reduce the use of parentheses by introducing precedence
and associativity rules
• The closure, concatenation, and alternation operators are left
associative
Precedence rule is:
• parentheses > closure > concatenation > alternation
Algebraic Laws for REs
Regular Set
A language that can be defined by a regular expression is called a regular set.
If two regular expressions r and s denote the same regular set, we say they are
equivalent and write r = s.
• Example: (a | b) = (b | a)
Regular Definitions
For notational convenience, we may wish to give names to certain regular
expressions and use those names in subsequent expressions, as if the names
were themselves symbols.
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where
1. Each di is a new symbol, not in Σ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, . . . , di-1}.
Examples of Regular Definitions
Example 1: Regular Definition for the language of C identifiers.
letter_ → A | B | . . . | Z | a | b | . . . | z | _
digit → 0 | 1 | . . . | 9
id → letter_ ( letter_ | digit )*
Example 2: Regular definition for the language of C unsigned numbers (integer
or floating point), which are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4.
digit → 0 | 1 | . . . | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
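As an aside, the two regular definitions above can be transcribed almost directly into library regular expressions. The following C++ sketch is our illustration, not part of the slides; the std::regex patterns are a hand translation of id and number:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        // id     -> letter_ ( letter_ | digit )*
        std::regex id("[A-Za-z_][A-Za-z_0-9]*");
        // number -> digits optionalFraction optionalExponent
        std::regex number("[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?");

        for (std::string s : {"count_1", "5280", "0.01234", "6.336E4", "1.89E-4"}) {
            std::cout << s << ": "
                      << (std::regex_match(s, id) ? "id"
                          : std::regex_match(s, number) ? "number" : "no match")
                      << "\n";
        }
    }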
Extensions of Regular Expressions
Kleene introduced regular expressions with the basic operators for union, concatenation, and Kleene
closure in the 1950s
The following are few notational extensions of Regular Expressions used to enhance their ability:
• One or more instances: The unary, postfix operator + represents the positive closure of a
regular expression and its language.
If r is a regular expression, then (r)+ denotes the language (L(r))+.
The operator + has the same precedence and associativity as the operator *.
Two useful algebraic laws: r* = r+ | ε and r+ = r r* = r* r.
• Zero or one instance: The unary postfix operator ? means "zero or one occurrence."
That is, r? is equivalent to r | ε; or, L(r?) = L(r) ∪ {ε}.
The ? operator has the same precedence and associativity as * and +.
• Character classes: A regular expression a1 | a2 | . . . | an, where the ai's are each symbols of the
alphabet, can be replaced by the shorthand [a1a2 . . . an].
When a1, a2, . . . , an form a logical sequence, we can replace them by a1-an.
Extensions of Regular Expressions
Extensions of Regular Expressions : Examples
Example 1: Regular definition for the language of C identifiers.
letter_ → A | B | . . . | Z | a | b | . . . | z | _   becomes   letter_ → [A-Za-z_]
digit → 0 | 1 | . . . | 9   becomes   digit → [0-9]
id → letter_ ( letter_ | digit )*
Example 2: Regular definition for the language of C unsigned numbers (integer
or floating point), which are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4.
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
(The separate optionalFraction and optionalExponent definitions are no longer needed.)
Complemented Character Class
A complemented character class represents any character except the ones listed in the character
class.
We denote a complemented class by using ^ as the first character.
The symbol ^(caret) is not itself part of the class being complemented, unless it is listed within the
class itself.
Example:
• [^A-Za-z]: matches any character that is not an uppercase or lowercase letter
• [^ \^]: represents any character but the caret
Examples of Regular Expression
My fax number: 91-(943)-716-2867
Σ = digit ∪ { -, (, ) }
country → digit+
area → '(' digit+ ')'
exchange → digit+
phone → digit+
number → country '-' area '-' exchange '-' phone
Examples of Regular Expression
My email address: jasaswi.mohantyfcs@kiit.ac.in
Σ = letter ∪ { @, . }
letter → a | b | . . . | z | A | B | . . . | Z
name → letter+
address → name '.' name '@' name '.' name '.' name
Regular expressions in specifications
Regular expressions describe many useful languages.
Regular expressions are only specifications; implementation is still required.
Given a string s and a regular expression R, does s ∈ L(R)?
Solution to this problem is the basis of the lexical analyzers.
However, just the yes/no answer is not sufficient.
Goal: Partition the input into tokens
Steps for Recognizing Tokens
1. Write a regular expression for the lexemes of each token
• number → digit+
• identifier → letter ( letter | digit )*
2. Construct R matching all lexemes of all tokens
• R = R1 | R2 | R3 | ...
3. Let the input be x1x2...xn
• For 1 ≤ i ≤ n, check x1x2...xi ∈ L(R)
• If x1x2...xi ∈ L(R), then find x1...xi ∈ L(Rj) for some j
– The smallest such j is the token class of x1...xi
– Remove x1...xi from the input and go to Step 3.
Recognition of Tokens: Example
The following grammar fragment describes a simple form of branching statements and conditional
expressions of the language Pascal:
Recognition of Tokens
For the language discussed in the previous slide, the lexical analyzer will
recognize the keywords if, then, and else, as well as lexemes that match the
patterns for relop, id, and number.
We make the common assumption that keywords are also reserved words: that
is, they are not identifiers, even though their lexemes match the pattern for
identifiers.
We assign the lexical analyzer the job of stripping out whitespace, by
recognizing the "token" ws defined by:
ws → ( blank | tab | newline )+
Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character
that follows the whitespace.
Recognition of Tokens
The following table shows, for each lexeme or family of lexemes, which token
name is returned to the parser and what attribute value is returned.
Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert patterns
into stylized flowcharts, called "transition diagrams".
Regular expressions are declarative specifications, whereas a transition diagram is an
implementation.
A transition diagram consists of
• A collection of nodes or circles, called states. Each state represents a condition that
could occur during the process of scanning the input looking for a lexeme that
matches one of several patterns.
• Edges are directed from one state of the transition diagram to another. Each edge is
labeled by a symbol or set of symbols.
• Certain states are said to be accepting, or final, states. These states indicate that a
lexeme has been found. We always indicate an accepting state by a double circle,
and if there is an action to be taken (typically returning a token and an attribute
value to the parser) we shall attach that action to the accepting state.
Transition Diagrams
• If it is necessary to retract the forward pointer one position (i.e., the lexeme
does not include the symbol that got us to the accepting state), then we shall
additionally place a * near that accepting state. We can attach any number of
*'s to the accepting state depending on the number of positions we need to
retract.
• One state is designated the start state, or initial state which is indicated by an
edge, labeled "start" entering from nowhere. The transition diagram always
begins in the start state before any input symbols have been read.
Examples of Transition Diagrams
Identifiers and reserved words
letter → [a-zA-Z]
digit → [0-9]
identifier → letter ( letter | digit )*
A * indicates a retraction state; the accepting action is return(getToken(), installID()).
Initially the reserved words are installed in the symbol table.
When we find an identifier, a call to installID() places it in the symbol table if it is not already there
and returns a pointer to the symbol-table entry for the lexeme found.
Any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its
token is id.
The function getToken examines the symbol table entry for the lexeme found, and returns
whatever token name the symbol table says this lexeme represents - either id or one of the
keyword tokens that was initially installed in the table.
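A minimal sketch of how installID() and getToken() might be realized over a hash-based symbol table follows; the token codes and the table layout are assumptions made for illustration:

    #include <iostream>
    #include <string>
    #include <unordered_map>

    enum Token { ID, IF, THEN, ELSE };                 // token names (assumed)

    // Reserved words are installed before scanning starts.
    std::unordered_map<std::string, Token> symtab =
        {{"if", IF}, {"then", THEN}, {"else", ELSE}};

    // Place the lexeme in the table if it is not already there and return an
    // iterator (our stand-in for a symbol-table pointer) to its entry.
    auto installID(const std::string& lexeme) {
        return symtab.emplace(lexeme, ID).first;       // no effect if present
    }

    // Return whatever token the symbol-table entry records: a keyword token
    // for reserved words, ID for everything else.
    Token getToken(const std::string& lexeme) {
        return installID(lexeme)->second;
    }

    int main() {
        std::cout << getToken("if") << " " << getToken("count") << "\n"; // 1 0
    }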
Transition diagram for the token relop
The following figure shows a transition diagram that recognizes the lexemes
matching the token relop.
Transition Diagrams for IDs and Keywords
Accepting action: return(getToken(), installID())
Transition Diagram for Unsigned Numbers
Finite Automata
Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
Finite automata come in two flavors:
• Nondeterministic finite automata (NFA):
Have no restrictions on the labels of their edges.
A symbol can label several edges out of the same state, and , the empty string, is a
possible label.
• Deterministic finite automata (DFA):
For each state, and for each symbol of its input alphabet exactly one edge with that symbol
leaving that state.
NOTE:
• Both deterministic and nondeterministic finite automata are capable of recognizing the same
languages.
• In fact these languages are exactly the same languages, called the regular languages, that
regular expressions can describe.
Nondeterministic Finite Automata
A nondeterministic finite automaton (NFA) consists of:
1. A finite set of states S.
2. A set of input symbols Σ, the input alphabet. We assume that ε, which
stands for the empty string, is never a member of Σ.
3. A transition function that gives, for each state, and for each symbol in
Σ ∪ {ε}, a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state).
5. A set of states F, a subset of S, that is distinguished as the accepting states
(or final states).
Transition graph
We can represent either an NFA or DFA by a Transition Graph, where the nodes
are states and the labeled edges represent the transition function.
There is an edge labeled a from state s to state t if and only if t is one of the next
states for state s and input a.
The same symbol can label edges from one state to several different states.
An edge may be labeled by ε, the empty string, instead of, or in addition to,
symbols from the input alphabet.
Example: NFA recognizing the language of regular expression (a|b)*abb
Transition Table
An NFA can also be represented by a transition table, whose rows correspond to states
and whose columns correspond to the input symbols and ε.
The entry for a given state and input is the value of the transition function applied to
those arguments.
If the transition function has no information about that state-input pair, we put ∅ in the
table for the pair.
Advantage:
• We can easily find the transitions on a given state and input.
Disadvantage:
• It takes a lot of space when the input alphabet is large, yet most states do not have
any moves on most of the input symbols.
Example: The transition table of the NFA recognizing the language of the regular
expression (a|b)*abb (states 0 through 3, with state 3 accepting):
    STATE    a        b        ε
    0        {0, 1}   {0}      ∅
    1        ∅        {2}      ∅
    2        ∅        {3}      ∅
    3        ∅        ∅        ∅
Acceptance of Input Strings by Automata
An NFA accepts input string x if and only if there is some path in the transition graph from the
start state to one of the accepting states, such that the symbols along the path spell out x.
Note that labels along the path are effectively ignored, since the empty string does not
contribute to the string constructed along the path.
Example: The string aabb is accepted by the NFA.
• There are two paths labeled by the same string aabb; one of them leads to the
accepting state 3, so the NFA accepts aabb.
NOTE:
• An NFA accepts a string as long as some path labeled by that string leads from the start
state to an accepting state. The existence of other paths leading to a non-accepting state is
irrelevant.
Language of an NFA
The language defined (or accepted) by an NFA is the set of strings labeling some
path from the start to an accepting state.
The following NFA defines the same language as does the regular expression
(a|b)*abb, that is, all strings from the alphabet {a, b} that end in abb.
We use L(A) to stand for the language accepted by automaton A.
Language of an NFA: Example
The following NFA accepts the language of the regular expression
aa*|bb*.
String aaa is accepted because of the following path:
Deterministic Finite Automata
A deterministic finite automaton (DFA) is a special case of an NFA where:
1. There are no moves on input ε, and
2. For each state s and input symbol a, there is exactly one edge out of s labeled a.
In the transition table of a DFA, each entry is a single state.
The NFA is an abstract representation of an algorithm to recognize the strings of a
certain language, whereas the DFA is a simple, concrete algorithm for recognizing
strings.
NOTE:
• Every regular expression and every NFA can be converted to a DFA accepting the
same language
• It is the DFA that we really implement or simulate when building lexical analyzers.
Simulating a DFA
INPUT:
• An input string x terminated by an end-of-file
character eof.
• A DFA D with start state s0, accepting states F, and
transition function move.
OUTPUT:
• Answer ''yes" if D accepts x; "no" otherwise.
METHOD:
• Apply the algorithm shown here to the input string x.
• The function move(s, c) gives the state to which there
is an edge from state s on input c.
• The function nextChar returns the next character of
the input string x.
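The following C++ sketch implements the standard simulation loop; the move table hard-codes the DFA for (a|b)*abb shown on the next slide, and the array layout is our assumption:

    #include <iostream>
    #include <string>

    int main() {
        // move[s][c]: DFA for (a|b)*abb, states 0..3,
        // column 0 = input 'a', column 1 = input 'b'.
        int move[4][2] = {{1, 0}, {1, 2}, {1, 3}, {1, 0}};
        bool accepting[4] = {false, false, false, true};

        std::string x = "ababb";      // the input string
        int s = 0;                    // s = s0
        for (char c : x)              // c = nextChar(); s = move(s, c)
            s = move[s][c - 'a'];
        std::cout << (accepting[s] ? "yes" : "no") << "\n";   // prints "yes"
    }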
Simulating a DFA: Example
The transition graph of a DFA accepting the language (a | b)*abb, the same as
that accepted by the NFA.
NFA DFA
Given the input string ababb, this DFA enters the sequence of states 0,1,2,1,2,3
and returns "yes."
Finite State Automaton (FSA) for recognizing “new ”
FSA for Unsigned Integers
Equivalence of RE and FSA
Let r be an RE. Then there exists an NFA with ε-transitions that accepts L(r).
If L is accepted by a DFA, then L is generated by an RE.
DFA to Minimal DFA: Hopcroft’s Algorithm
A DFA from the subset construction can have a large number of states.
• This does not increase the time needed to scan a string.
• It does increase the space requirement of the scanner in memory:
• The speed of accesses to main memory may turn out to be the bottleneck.
• A smaller scanner has better chances of fitting in the processor cache.
Hopcroft's Algorithm: The Idea
• We need to identify and merge any equivalent states in the DFA: states that produce identical
behavior across all inputs.
• It takes the set of all states and repeatedly partitions them into smaller and smaller subsets,
by identifying ways in which some states in a subset behave differently than others in the
subset.
• First: Split the set of all states into accepting and non-accepting states.
• Repeat: Pick an input character and check whether each state in a partition takes you to a
common partition (e.g., on input x, do all states in partition P transition to states in partition Q?).
Hopcroft’s Algorithm
Algorithm
    Current = { AcceptStates, NonAcceptStates }
    P = { }
    repeat until P and Current are the same:
        P = Current
        Current = { }
        for each set s in P:
            Current = Current ∪ Split(s)
    (Split tries to break s into two sets of "distinguishable" states.)
Split(s):
    for each input character c:
        for each state in s, determine which partition T[state][c] leads to
        if some states of s lead to one partition and the other states do not,
        use that to divide s into two sets s1 and s2 and return { s1, s2 }
DFA Minimization: Example
First divide the states into accepting and non-accepting states:
{c, d, e}, {a, b, f}
In the first pass of the loop, split {a, b, f} on input 1 into
{a, b}, {f}
(on input 1, a and b move into the accepting block {c, d, e}, while f does not).
Finally we have {(c, d, e), (a, b), (f)}.
    q    δ(q,0)    δ(q,1)
    a    b         d
    b    a         c
    c    e         f
    d    e         f
    e    e         f
    f    f         f
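The refinement loop can be coded compactly. The C++ sketch below is a plain partition-refinement variant of the idea, refining on one input character at a time; the example's transition table and accepting set are hard-coded as assumptions:

    #include <cstdio>
    #include <map>
    #include <utility>
    #include <vector>

    int main() {
        const int N = 6, SIGMA = 2;              // states a..f -> 0..5
        int trans[N][SIGMA] = {{1, 3}, {0, 2}, {4, 5}, {4, 5}, {4, 5}, {5, 5}};
        bool accept[N] = {false, false, true, true, true, false};

        std::vector<int> block(N);               // block[s] = partition index of s
        for (int s = 0; s < N; ++s) block[s] = accept[s] ? 0 : 1;
        int nBlocks = 2;

        bool changed = true;
        while (changed) {                        // repeat until no block splits
            changed = false;
            for (int c = 0; c < SIGMA; ++c) {
                // Signature of s: (its block, the block its c-successor is in).
                std::map<std::pair<int, int>, int> sig2block;
                std::vector<int> newBlock(N);
                for (int s = 0; s < N; ++s) {
                    std::pair<int, int> key = {block[s], block[trans[s][c]]};
                    auto it = sig2block.find(key);
                    if (it == sig2block.end())
                        it = sig2block.emplace(key, (int)sig2block.size()).first;
                    newBlock[s] = it->second;
                }
                if ((int)sig2block.size() != nBlocks) changed = true;
                nBlocks = (int)sig2block.size();
                block = newBlock;
            }
        }
        for (int s = 0; s < N; ++s)              // prints the final partition
            std::printf("state %c -> block %d\n", 'a' + s, block[s]);
    }

Run on the table above, this converges to the three blocks {c, d, e}, {a, b}, {f}.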
From Regular Expressions to Finite Automata
NFAs allow multiple transitions from a single state on the same input symbol, and they
may have epsilon (ε) transitions, making them non-deterministic.
Simulating an NFA can be computationally expensive, especially when dealing with
epsilon transitions and multiple possible paths for a given input sequence.
DFAs, on the other hand, are deterministic and have a unique transition for each input
symbol from every state. Deterministic behavior simplifies the implementation and
understanding of the automaton.
A DFA is more efficient to implement and execute. It avoids the need to explore multiple
paths simultaneously, leading to better performance during lexical analysis.
The table-driven approach allows for efficient lookups during lexical analysis, making it
easier to implement a scanner for recognizing tokens in the source code.
The DFA obtained after the conversion recognizes the same language as the original
NFA. This ensures that the lexical analyzer correctly recognizes the tokens in the source
code.
From Regular Expressions to Finite Automata
High Level Sketch
Lexical Specification → Regular Expression → NFA → DFA → Table-Driven Implementation of DFA
Now we will discuss:
1. The subset construction: How to convert NFA's to DFA's.
2. Algorithm for simulating NFA's directly, in situations (other than lexical analysis)
where the NFA-to-DFA conversion takes more time than the direct simulation.
3. How to convert regular expressions to NFA's, from which a DFA can be
constructed if desired.
From Regular Expressions to Finite Automata
For a given language the NFA can be simpler than DFA.
It is possible that the number of DFA states is exponential in the number of NFA
states, which could lead to difficulties when we try to implement this DFA.
However, part of the power of the automaton-based approach to lexical analysis
is that for real languages, the NFA and DFA have approximately the same
number of states, and the exponential behavior is not seen.
Conversion of an NFA to a DFA (Subset Construction)
Input: An NFA N
Output: A DFA D accepting the same language as N.
Method:
• The algorithm constructs a transition table Dtran for D.
• Each state of D is a set of NFA states. To construct Dtran, D will simulate "in parallel" all possible moves
N can make on a given input string.
• First we need to deal with the ε-transitions of N properly.
• The following functions describe the basic computations on the states of N that are needed in
the algorithm (note that s is a single state of N, while T is a set of states of N):
    ε-closure(s): the set of NFA states reachable from state s on ε-transitions alone.
    ε-closure(T): the set of NFA states reachable from some state s in T on ε-transitions alone.
    move(T, a): the set of NFA states to which there is a transition on input symbol a from some state s in T.
Subset Construction of a DFA from NFA
Basis: Before reading the first input symbol, N can be in any of the states of
ε-closure(s0), where s0 is its start state.
Induction:
• Suppose that N can be in the set of states T after reading input string x. If it next
reads input a, then N can immediately go to any of the states in move(T, a).
• However, after reading a, it may also make several ε-transitions; thus N could
be in any state of ε-closure(move(T, a)) after reading input xa.
The start state of D is ε-closure(s0), and the accepting states of D are all those
sets of N's states that include at least one accepting state of N.
Computing ε-closure(T)
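The standard worklist formulation of this computation is:

    push all states of T onto stack;
    initialize ε-closure(T) to T;
    while (stack is not empty) {
        pop t, the top element, off stack;
        for (each state u with an edge from t to u labeled ε)
            if (u is not in ε-closure(T)) {
                add u to ε-closure(T);
                push u onto stack;
            }
    }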
The subset construction: Main Algorithm
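The standard form of the main subset-construction loop is:

    initially, ε-closure(s0) is the only state in Dstates, and it is unmarked;
    while (there is an unmarked state T in Dstates) {
        mark T;
        for (each input symbol a) {
            U = ε-closure(move(T, a));
            if (U is not in Dstates)
                add U as an unmarked state to Dstates;
            Dtran[T, a] = U;
        }
    }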
Subset Construction: Example
NFA N for (a|b)*abb. Transition table Dtran for DFA D.
DFA D: Result of applying the subset construction on N.
Simulation of an NFA
INPUT: An input string x terminated by an end-of-file character eof. An NFA N
with start state s0, accepting states F, and transition function move.
OUTPUT: Answer "yes" if N accepts x; "no" otherwise.
Running Time: O(k(n+m)), where k is the input length and the NFA has n states and m transitions.
Efficiency of NFA Simulation
If carefully implemented, the Algorithm to simulate an NFA can be quite efficient.
For the implementation we need the following data structures:
• Two stacks, each of which holds a set of NFA states. One of these stacks,
oldStates, holds the "current" set of states. The second stack, newStates,
holds the "next" set of states. Just before a transition, newStates is transferred
to oldStates.
• A boolean array alreadyOn, indexed by the NFA states, to indicate which
states are in newStates. While the array and stack hold the same information,
it is much faster to interrogate alreadyOn[s] than to search for state s on the
stack newStates. It is for this efficiency that we maintain both representations.
• A two-dimensional array move[s, a] holding the transition table of the NFA.
The entries in this table, which are sets of states, are represented by linked
lists.
NFA Simulation
To implement line (1) of the algorithm, we need to set each entry in the array alreadyOn to
FALSE, then for each state s in ε-closure(s0), push s onto oldStates and set
alreadyOn[s] to TRUE.
This operation on state s, and the implementation of line (4) as well, are facilitated by a
function we shall call addState(s).
This function pushes state s onto newStates, sets alreadyOn[s] to TRUE, and calls itself
recursively on the states in move[s, ε] in order to further the computation of
ε-closure(s).
The addState(s) function is shown here:
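In its standard form (with alreadyOn, newStates, and move as described above):

    addState(s) {
        push s onto newStates;
        alreadyOn[s] = TRUE;
        for (t on move[s, ε])
            if (!alreadyOn[t])
                addState(t);
    }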
NFA Simulation
To implement line (4) we need to look at each state s on oldStates.
We first find the set of states move[s, c], where c is the next input, and for each of those
states that is not already on newStates, we apply addState to it.
Note that addState has the effect of computing the ε-closure and adding all those states
to newStates as well, if they were not already on.
This sequence of steps is summarized below:
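In the standard form:

    for (s on oldStates) {
        for (t on move[s, c])
            if (!alreadyOn[t])
                addState(t);
        pop s off oldStates;
    }
    for (s on newStates) {
        pop s off newStates;
        push s onto oldStates;
        alreadyOn[s] = FALSE;
    }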
Construction of an NFA from a Regular Expression
Algorithm: The McNaughton-Yamada-Thompson algorithm to convert a regular
expression to an NFA.
INPUT: A regular expression r over alphabet Σ.
OUTPUT: An NFA N accepting L(r).
METHOD: The rules for constructing an NFA consist of basis rules for handling
subexpressions with no operators, and inductive rules for constructing larger NFA's from
the NFA's for the immediate subexpressions of a given expression.
• BASIS:
For expression ε, construct the NFA as shown in Fig. 1.
For any subexpression a in Σ, construct the NFA as shown in Fig. 2.
Fig. 1 Fig. 2
Construction of an NFA from a Regular Expression – contd…
• INDUCTION:
Suppose N(s) and N(t) are NFA's for regular expressions s and t, respectively.
Let r = s | t. Then N(r), the NFA for r, is constructed as shown in Fig. 3.
Let r = s t. Then N(r), the NFA for r, is constructed as shown in Fig. 4.
Let r = s*. Then N(r), the NFA for r, is constructed as shown in Fig. 5
Let r = (s). Then L(r) = L(s), and we can use the NFA N(s) as N(r).
Fig. 3 Fig. 4 Fig. 5
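The construction is mechanical enough to code directly. The following C++ sketch is our illustration, not the slides' figures: states are integers, ε-edges use a reserved label, and, as a simplification, concatenation splices the two NFAs with an ε-edge instead of merging the accept state of s with the start state of t (the resulting language is the same):

    #include <cstdio>
    #include <map>
    #include <vector>

    struct NFA { int start, accept; };                 // one accepting state each

    const char EPS = 0;                                // reserved label for ε-edges
    std::vector<std::multimap<char, int>> edges;       // edges[state]: label -> target

    int newState() { edges.emplace_back(); return (int)edges.size() - 1; }
    void addEdge(int from, char c, int to) { edges[from].insert({c, to}); }

    NFA symbol(char a) {                               // basis: a single symbol a
        NFA n{newState(), newState()};
        addEdge(n.start, a, n.accept);
        return n;
    }
    NFA alt(NFA s, NFA t) {                            // r = s | t  (Fig. 3)
        NFA n{newState(), newState()};
        addEdge(n.start, EPS, s.start);  addEdge(n.start, EPS, t.start);
        addEdge(s.accept, EPS, n.accept); addEdge(t.accept, EPS, n.accept);
        return n;
    }
    NFA cat(NFA s, NFA t) {                            // r = s t  (Fig. 4)
        addEdge(s.accept, EPS, t.start);               // splice instead of merge
        return NFA{s.start, t.accept};
    }
    NFA star(NFA s) {                                  // r = s*  (Fig. 5)
        NFA n{newState(), newState()};
        addEdge(n.start, EPS, s.start);  addEdge(n.start, EPS, n.accept);
        addEdge(s.accept, EPS, s.start); addEdge(s.accept, EPS, n.accept);
        return n;
    }

    int main() {                                       // build (a|b)*abb
        NFA r = cat(cat(cat(star(alt(symbol('a'), symbol('b'))),
                            symbol('a')), symbol('b')), symbol('b'));
        std::printf("%zu states; start = %d, accept = %d\n",
                    edges.size(), r.start, r.accept);
    }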
Construction of an NFA from a Regular Expression:
Example
Construct an NFA for r = (a | b)*abb.
(The slide decomposes r into subexpressions r1 through r11 and builds N(r) bottom-up,
combining the NFAs for the subexpressions with the rules above.)
Implementing Scanners
1. Specify REs for each syntactic category
2. Construct an NFA for each RE
3. Join the NFAs with 𝜖-transitions
4. Create the equivalent DFA
5. Minimize the DFA
6. Generate code to implement the DFA
Implementation Considerations
Speed is paramount for scanning
Processes every character from an input source program
Repeatedly read the input character and simulate the corresponding DFA
• Table-driven scanners
• Direct-coded scanners
• Hand-coded scanners
High-Level Idea in Implementing Scanners
1. Read input character one by one
2. Look up the transition based on the current state and the input character
3. Switch to the new state
4. Check for termination conditions, i.e., accept and error
5. Repeat
Table-Driven Scanner
In a table-driven approach, the scanner uses a finite automaton or a set of rules stored in
tables to recognize and classify tokens.
The tables typically consist of states and transitions, defining the behavior of the scanner for
each possible input symbol in a given state.
The tables are often generated by a tool (like Lex or Flex) based on a high-level description of
the language's lexical structure.
Changes to the scanner's behavior can be made by modifying the tables, making the scanner
more modular and easier to update and maintain.
    char = getNextChar()
    state = s0
    while (char ≠ EOF)
        state = δ(state, char)
        char = getNextChar()
    if (state ∈ SF)
        accept
    else
        error
Direct-Coded Scanner
In a direct-coded approach, the scanner is implemented directly as procedural code.
The code consists of explicit instructions and conditionals that directly recognize and process
the tokens in the source code.
The implementation tends to be more straightforward and may be easier to understand,
especially for small languages or simple grammars.
Modifications to the scanner's behavior often involve directly changing the source code, which
may be less modular than the table-driven approach.
    s0: char = getNextChar()
        if (char == 'r')
            goto s1
        else
            goto se
    s1: char = getNextChar()
        if ('0' ≤ char ≤ '9')
            goto s2
        else
            goto se
    s2: char = getNextChar()
        if ('0' ≤ char ≤ '9')
            goto s2
        else if (char == EOF)
            accept
        else
            goto se
    se: error
Challenges in Lexical Analysis
Certain languages like PL/I do not have any reserved words
• while, do, if, and else are reserved in C but not in PL/I
• Makes it difficult for the scanner to distinguish between keywords and user defined identifiers
if then then then = else else else = then
if if then then = then + 1
PL/I declarations
• DECLARE(arg1,arg2,arg3,…,argn)
• Cannot tell whether DECLARE is a keyword with variable definitions or is a procedure with
arguments until after “)”
Requires arbitrary lookahead and very large buffers
• Worse, the buffers may have to be reloaded in case of wrong inferences
Is fi a typo or a function call?
• Note that fi is a valid lexeme for IDENTIFIER
Challenges in Lexical Analysis
In fixed-format Fortran, some keywords are context-dependent
In the statement, DO 10 I = 10.86, DO10I is an identifier, and DO is not a
keyword
But in the statement, DO 10 I = 10, 86, DO is a keyword
Blanks are not significant in Fortran and can appear in the midst of identifiers, but
not so in C
• Variable “counter” is same as “count er”
In Fortran, blanks are important only in literal strings
Reading from left to right, one cannot distinguish between the two until the “,” or
“.” is reached
• Requires look ahead for resolution