CSC 415 Compiler Design: Lexical Analysis

The document outlines the syllabus for the CSC 415 Compiler Design course, focusing on lexical analysis, grammar, parsing, semantic processing, and code generation. It covers key concepts such as tokens, regular expressions, finite automata, and the design of lexical analyzers. The course uses A. Ullman's textbook on compiler principles and includes various examples and issues related to lexical analysis in programming languages.

Uploaded by

George Youssef

Higher Technological Institute (HTI)
Computer Science Department

CSC 415 Compiler Design

CHAPTER 1
Lexical Analysis

Dr. Hany M. Zamel

Course Syllabus
Scanning theory and practice: regular expressions, finite automata and
scanners, scanner generators, practical considerations, translating
regular expressions to finite automata.
Grammar and parsing: context-free grammars, parsers and recognizers,
grammar analysis algorithms.
Semantic processing: syntax-directed translation, semantic processing
techniques.
Symbol tables: basic techniques, block-structured symbol tables and
extensions, implicit declarations.
Run-time storage organization: static allocation, stack allocation,
heap allocation, and program layout in memory.
Data structures: analysis, declaration processing fundamentals, action
routines.
Procedures and functions: if statements, loops, case statements,
exception handling, passing parameters to subprograms.
Code generation and optimization: register and temporary management,
interpretive code generation, generating code from subprogram calls,
loop optimization.

Textbook
A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, “Compilers: Principles,
Techniques, and Tools”, Pearson Education Limited, 2014

Course Site
https://hanyzamel.wixsite.com/lec-hti

Grading Scheme

Classroom Policy

Outline

• Informal sketch of lexical analysis

– Identifies tokens in input string

• Issues in lexical analysis

– Lookahead
– Ambiguities

• Specifying lexers (a.k.a. scanners)

– By regular expressions (a.k.a. regex)
– Examples of regular expressions

Lexical Analysis

• What do we want to do? Example:

    if (i == j)
        z = 0;
    else
        z = 1;

• The input is just a string of characters:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Goal: Partition input string into substrings

– Where the substrings are called tokens

What’s a Token?

• A syntactic category
– In English:
noun, verb, adjective, …

– In a programming language:
Identifier, Integer, Keyword, Whitespace, …

Tokens

• A token class corresponds to a set of strings

• Examples

– Identifier: strings of letters or digits, starting with a letter
  (an infinite set: i, var1, ports, foo, Person, …)
– Integer: a non-empty string of digits
– Keyword: “else” or “if” or “begin” or …
– Whitespace: a non-empty sequence of blanks, newlines, and tabs

What are Tokens For?

• Classify program substrings according to role

• Lexical analysis produces a stream of tokens

• … which is input to the parser

• Parser relies on token distinctions

– An identifier is treated differently than a keyword

Designing a Lexical Analyzer: Step 1

• Define a finite set of tokens

– Tokens describe all items of interest

• Identifiers, integers, keywords

– Choice of tokens depends on

• language
• design of parser

Example

• Recall
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Useful tokens for this expression:

Integer, Keyword, Relation, Identifier, Whitespace, (, ),
=, ;

• N.B., (, ), =, ; above are tokens, not characters

Designing a Lexical Analyzer: Step 2

• Describe which strings belong to each token

• Recall:
– Identifier: strings of letters or digits, starting with a letter
– Integer: a non-empty string of digits
– Keyword: “else” or “if” or “begin” or …
– Whitespace: a non-empty sequence of blanks,
newlines, and tabs

Lexical Analyzer: Implementation

• An implementation must do two things:

1. Classify each substring as a token

2. Return the value, or lexeme, of the token

– The lexeme is the actual substring that was matched
– It is one of the strings in the set that makes up the token

• The lexer thus returns token-lexeme pairs

– And potentially also line numbers, file names, etc. to
improve later error messages

Example

• Recall:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
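
The token-lexeme pairs for this input can be sketched in Python. This is a minimal illustration, not the course's implementation: the class names RELATION, ASSIGN, LPAREN, RPAREN and SEMI, the keyword set, and the regex chosen for each class are assumptions made for the example.

```python
import re

# Token classes from the slides; the regex for each class and the names
# RELATION / ASSIGN / LPAREN / RPAREN / SEMI are assumptions of this sketch.
KEYWORDS = {"if", "else", "begin"}
TOKEN_SPEC = [
    ("WHITESPACE", r"[ \n\t]+"),
    ("INTEGER",    r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("RELATION",   r"=="),   # listed before "=" so "==" is tried first
    ("ASSIGN",     r"="),
    ("LPAREN",     r"\("),
    ("RPAREN",     r"\)"),
    ("SEMI",       r";"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs, scanning left to right."""
    pairs = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        token, lexeme = m.lastgroup, m.group()
        # A name that appears in the keyword table is a keyword.
        if token == "IDENTIFIER" and lexeme in KEYWORDS:
            token = "KEYWORD"
        pairs.append((token, lexeme))
        pos = m.end()
    return pairs

source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
for token, lexeme in tokenize(source):
    print(token, repr(lexeme))
```

Concatenating the lexemes in order gives back the input string, which is one way to check that the scan really is a partition.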

Lexical Analyzer: Implementation

• The lexer usually discards “uninteresting” tokens that don’t contribute to parsing.

• Examples: Whitespace, Comments

True Crimes of Lexical Analysis

• Is it as easy as it sounds?

• Sort of … if you do not make it hard!

• Look at some history

Lexical Analysis in FORTRAN

• FORTRAN rule: Whitespace is insignificant

• E.g., VAR1 is the same as VA R1

• A terrible design!

• Historical footnote: FORTRAN whitespace rule motivated by inaccuracy of punch card operators

FORTRAN Example

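• Consider
– DO 5 I = 1,25
– DO 5 I = 1.25

• Because blanks are ignored, the first line is a loop header (label 5, index I from 1 to 25), while the second assigns 1.25 to a variable named DO5I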
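
The difference between the two lines only shows up at the comma, well after the characters "DO". A rough Python sketch of that decision follows; `classify` and its regex are hypothetical helpers written for this illustration and cover only these two statement shapes, not real FORTRAN.

```python
import re

# Hypothetical helper: decides how a statement that starts with "DO"
# should be read once blanks are ignored. Not a real FORTRAN scanner.
DO_LOOP = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)")

def classify(statement):
    squeezed = statement.replace(" ", "")   # whitespace is insignificant
    m = DO_LOOP.fullmatch(squeezed)
    if m:
        label, var, start, stop = m.groups()
        return ("DO loop", label, var, start, stop)
    # No comma after "=": the whole left-hand side is one variable name.
    name, _, value = squeezed.partition("=")
    return ("assignment", name, value)

print(classify("DO 5 I = 1,25"))
print(classify("DO 5 I = 1.25"))
```

The scanner cannot tell whether "DO" is a keyword until it has looked ahead past the "=" for a comma.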

Lexical Analysis in FORTRAN (Cont.)

• Two important points:

1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time

2. “Lookahead” may be required to decide where one token ends and the next token begins

Lookahead

• Even our simple example has lookahead issues

– i vs. if
– = vs. ==
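
Both cases can be handled by peeking at the next character before committing to a token. The sketch below is one possible hand-written scanner step in Python; the function name `scan_one`, the token names, and the keyword set are assumptions for illustration only.

```python
KEYWORDS = {"if", "else"}  # assumed keyword set for this sketch

def scan_one(text, pos):
    """Return (token, lexeme, next_pos) for the token starting at pos."""
    ch = text[pos]
    if ch == "=":
        # One character of lookahead separates "=" from "==".
        if pos + 1 < len(text) and text[pos + 1] == "=":
            return ("RELATION", "==", pos + 2)
        return ("ASSIGN", "=", pos + 1)
    if ch.isalpha():
        # Keep reading while the next character can extend the name,
        # so "if" is not cut short after "i".
        end = pos + 1
        while end < len(text) and text[end].isalnum():
            end += 1
        lexeme = text[pos:end]
        token = "KEYWORD" if lexeme in KEYWORDS else "IDENTIFIER"
        return (token, lexeme, end)
    raise ValueError(f"scan_one does not handle {ch!r}")

print(scan_one("i==j", 0))
print(scan_one("i==j", 1))
print(scan_one("if (i", 0))
```

Here the scanner always takes the longest lexeme it can, then decides between keyword and identifier by table lookup. That is a common convention, chosen for this sketch rather than taken from the slides.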

Lexical Analysis in C++

• Unfortunately, the problems continue today

• C++ template syntax:

Foo<Bar>

• C++ stream syntax:

cin >> var;

• But there is a conflict with nested templates:

Foo<Bar<Bazz>>

– Here >> closes two templates; it is not the stream operator

Review

• The goal of lexical analysis is to

– Partition the input string into lexemes
– Identify the token of each lexeme

• Left-to-right scan => lookahead sometimes required

Next

• We still need
– A way to describe the lexemes of each token

– A way to resolve ambiguities

• Is if two variables i and f?
• Is == two equal signs = =?

Regular Languages

• There are several formalisms for specifying tokens

• Regular languages are the most popular

– Simple and useful theory
– Easy to understand
– Efficient implementations

Languages

Def. Let alphabet Σ be a set of characters.

A language over Σ is a set of strings of
characters drawn from Σ.
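
As a small illustration of the definition, the Python sketch below enumerates the strings over a two-letter alphabet up to a length bound and picks out one language from them. The helper name `strings_up_to` and the example language are assumptions, and the length bound exists only to keep the sets finite.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """All strings over `alphabet` with length 0..max_len."""
    return {"".join(chars)
            for n in range(max_len + 1)
            for chars in product(sorted(alphabet), repeat=n)}

sigma = {"a", "b"}                      # the alphabet
universe = strings_up_to(sigma, 2)      # strings over sigma, length <= 2

# One language over sigma (restricted to length <= 2):
# the strings that start with "a".
starts_with_a = {s for s in universe if s.startswith("a")}

print(sorted(universe))
print(sorted(starts_with_a))
```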

Examples of Languages

• Alphabet = English characters
• Language = English sentences
• Not every string of English characters is an English sentence

• Alphabet = ASCII
• Language = C programs
• Note: ASCII character set is different from English character set

Notation

• Languages are sets of strings.

• Need some notation for specifying which sets we want

• The standard notation for regular languages is regular expressions.

Atomic Regular Expressions

• Single character

'c' = {"c"}

• Epsilon

ε = {""}

– Not the empty set, but a set with a single, empty string

Compound Regular Expressions

• Union
A + B = {s | s ∈ A or s ∈ B}

• Concatenation
AB = {ab | a ∈ A and b ∈ B}

• Iteration
A* = ∪_{i≥0} Aⁱ, where Aⁱ = AA…A (i times)

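The three operations can be tried out on small finite sets of strings. The Python sketch below is illustrative only: the function names are assumptions, and `star` takes a bound `max_i` because the real A* is an infinite union.

```python
def union(a, b):
    """A + B: strings that are in A or in B."""
    return a | b

def concat(a, b):
    """AB: every string of A followed by every string of B."""
    return {x + y for x in a for y in b}

def power(a, i):
    """A^i: concatenation of i copies of A; A^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result

def star(a, max_i):
    """Finite approximation of A*: the union of A^0 .. A^max_i."""
    result = set()
    for i in range(max_i + 1):
        result |= power(a, i)
    return result

print(sorted(union({"a"}, {"b"})))
print(sorted(concat({"a", "b"}, {"c"})))
print(sorted(star({"a"}, 3)))
```
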
Regular Expressions

• Def. The regular expressions over Σ are the smallest set of expressions including

ε
'c' where c ∈ Σ
A + B where A, B are rexp over Σ
AB where A, B are rexp over Σ
A* where A is a rexp over Σ

Syntax vs. Semantics

• Notation so far was imprecise

AB = {ab | a ∈ A and b ∈ B}

– On the left-hand side, B is a piece of syntax
– On the right-hand side, B is a set (the semantics of the syntax)

Syntax vs. Semantics

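[Figure: syntax vs. semantics drawn as labeled boxes. The label (syntax) 'a' + 'b' is on a box whose content (semantics) is L('a' + 'b') = {"a", "b"}; the label 'a'* is on a box whose content is L('a'*) = {"", "a", "aa", "aaa", …}.]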

Syntax vs. Semantics

• To be careful, we distinguish syntax and semantics.

L(ε) = {""}
L('c') = {"c"}
L(A + B) = L(A) ∪ L(B)
L(AB) = {ab | a ∈ L(A) and b ∈ L(B)}
L(A*) = ∪_{i≥0} L(Aⁱ)
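
One way to make the split concrete is to represent the syntax as a small tree of Python objects and the semantics as a function from trees to sets of strings. This is a sketch under assumptions: the class names and the `max_len` bound (needed because L(A*) is usually infinite) are not from the slides.

```python
from dataclasses import dataclass

# Syntax: one class per form of regular expression.
@dataclass(frozen=True)
class Epsilon:
    pass

@dataclass(frozen=True)
class Char:
    c: str

@dataclass(frozen=True)
class Union:
    a: object
    b: object

@dataclass(frozen=True)
class Concat:
    a: object
    b: object

@dataclass(frozen=True)
class Star:
    a: object

# Semantics: the strings of L(e) with length <= max_len.
def L(e, max_len):
    if isinstance(e, Epsilon):
        return {""}
    if isinstance(e, Char):
        return {e.c} if max_len >= 1 else set()
    if isinstance(e, Union):
        return L(e.a, max_len) | L(e.b, max_len)
    if isinstance(e, Concat):
        return {x + y for x in L(e.a, max_len) for y in L(e.b, max_len)
                if len(x + y) <= max_len}
    if isinstance(e, Star):
        base = L(e.a, max_len)
        result, frontier = {""}, {""}
        while frontier:  # stop once no new string fits the bound
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {e!r}")

print(sorted(L(Union(Char("a"), Char("b")), 3)))
print(sorted(L(Star(Char("a")), 3)))
```

The expression `Star(Char("a"))` is a piece of syntax; `L(Star(Char("a")), 3)` is (a finite slice of) the set it denotes.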

Example: Keyword

Keyword: “else” or “if” or “begin” or …

‘else’ + ‘if’ + ‘begin’ + . . .

Abbreviation: ‘else’ = ‘e’ ‘l’ ‘s’ ‘e’

Example: Integers

Integer: a non-empty string of digits

digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
integer = digit digit*

Abbreviation: A⁺ = AA*
Abbreviation: [0-2] = '0' + '1' + '2'
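
The same definitions can be written with Python's `re` module, where the union written "+" on the slides becomes "|" or a character class, and "+" means "one or more". The names below are chosen for this sketch.

```python
import re

# Slide notation -> Python regex notation.
DIGIT = r"[0-9]"                     # '0' + '1' + ... + '9'
INTEGER_LONG = f"{DIGIT}{DIGIT}*"    # digit digit*
INTEGER_SHORT = f"{DIGIT}+"          # digit+, using the abbreviation

def is_integer(s, pattern=INTEGER_SHORT):
    """True if the whole string s is a non-empty string of digits."""
    return re.fullmatch(pattern, s) is not None

for s in ["0", "2024", "", "12a"]:
    print(repr(s), is_integer(s), is_integer(s, INTEGER_LONG))
```

Both patterns accept exactly the same strings, which is what the abbreviation claims.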

Example: Identifier

Identifier: strings of letters or digits, starting with a letter

letter = ‘A’ + . . . + ‘Z’ + ‘a’ + . . . + ‘z’

identifier = letter (letter + digit)*

Is (letter* + digit*) the same as (letter + digit)*?
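
A quick experiment with Python's `re` module compares (letter + digit)* with letter* + digit* on a string that mixes letters and digits. The pattern names are assumptions for this sketch; the union "+" of the slides is written "|" in Python.

```python
import re

LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"
IDENTIFIER = f"{LETTER}(?:{LETTER}|{DIGIT})*"   # letter (letter + digit)*

MIXED = f"(?:{LETTER}|{DIGIT})*"       # (letter + digit)*
SEPARATE = f"(?:{LETTER}*|{DIGIT}*)"   # letter* + digit*

def matches(pattern, s):
    """True if the whole string s belongs to the pattern's language."""
    return re.fullmatch(pattern, s) is not None

print(matches(IDENTIFIER, "var1"), matches(IDENTIFIER, "1var"))
print(matches(MIXED, "a1"), matches(SEPARATE, "a1"))
```

The string "a1" is in the first language but not the second, so the two expressions are not the same: letter* + digit* only allows all letters or all digits.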

Example: Whitespace

Whitespace: a non-empty sequence of blanks, newlines, and tabs

(' ' + '\n' + '\t')⁺

Example: Phone Numbers

• Regular expressions are all around you!

• Consider (650)-723-3232

Σ = digits ∪ {-, (, )}
exchange = digit³
phone = digit⁴
area = digit³
phone_number = '(' area ')-' exchange '-' phone

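The phone number definition translates almost line by line into a Python regex, with digit³ written as `[0-9]{3}`. This is only a sketch of the slide's toy format, not a general phone number validator.

```python
import re

DIGIT = r"[0-9]"
AREA = f"{DIGIT}{{3}}"        # area = digit^3
EXCHANGE = f"{DIGIT}{{3}}"    # exchange = digit^3
PHONE = f"{DIGIT}{{4}}"       # phone = digit^4

# phone_number = '(' area ')-' exchange '-' phone
PHONE_NUMBER = re.compile(rf"\({AREA}\)-{EXCHANGE}-{PHONE}")

print(bool(PHONE_NUMBER.fullmatch("(650)-723-3232")))
print(bool(PHONE_NUMBER.fullmatch("650-723-3232")))
```
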
Example: Email Addresses

• Consider anyone@cs.stanford.edu

Σ = letters ∪ {., @}
name = letter⁺
address = name '@' name '.' name '.' name
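
The same definition in Python's `re` syntax is sketched below. It follows the slide's simplified grammar exactly (letters only, exactly three names after the "@"), so it is far stricter than real email syntax and is not meant as a validator.

```python
import re

NAME = r"[A-Za-z]+"    # name = letter+
# address = name '@' name '.' name '.' name
ADDRESS = re.compile(rf"{NAME}@{NAME}\.{NAME}\.{NAME}")

print(bool(ADDRESS.fullmatch("anyone@cs.stanford.edu")))
print(bool(ADDRESS.fullmatch("anyone@stanford.edu")))
```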
