CS-411: Compiler Construction
1. Introduction
Why Compilers?
• Compiler
– A program that translates from one language to another
– It must preserve semantics of the source
– It should create an efficient version of the program
in the target language
• In the beginning, there was machine language
– Ugly – writing code, debugging
– Then came textual assembly
– High-level languages – Fortran, Pascal, C, C++
– Machine structures became too complex and software
management too difficult to continue with low-level
languages
Compiler learning
• Isn’t it an old discipline?
– Yes, it is a well-established discipline
– Algorithms, methods and techniques were researched and
developed in the early stages of computer science
– There are many compilers around and many tools to
generate them automatically
• So, why do we need to learn it?
– You may never write a full compiler
– But the techniques we learn are useful in many tasks,
such as writing an interpreter for a scripting language,
validation checking for forms, and so on
Why Study Compilers? (1)
• Become a better programmer(!)
– Insight into interaction between languages,
compilers, and hardware
– Understanding of implementation techniques
– Better intuition about what your code does
Why Study Compilers? (2)
• Compiler techniques are everywhere
– Parsing (little languages, interpreters, XML)
– Database engines, query languages
– AI: domain-specific languages
– Text processing
• TeX/LaTeX -> DVI -> PostScript -> PDF
– Hardware: VHDL; model-checking tools
– Mathematics (Mathematica, Matlab)
Why Study Compilers? (3)
• Fascinating blend of theory and engineering
– Direct applications of theory to practice
• Parsing, scanning, static analysis
– Some very difficult problems (NP-hard or
worse)
• Resource allocation, “optimization”, etc.
• Need to come up with good-enough
approximations/heuristics
Why Study Compilers? (4)
• Ideas from many parts of CS
– AI: Greedy algorithms, heuristic search
– Algorithms: graph algorithms, dynamic programming,
approximation algorithms
– Theory: Grammars, DFAs and PDAs, pattern matching,
fixed-point algorithms
– Systems: Allocation & naming, synchronization,
locality
– Architecture: pipelines, instruction set use, memory
hierarchy management
Why Study Compilers? (5)
• You might even write a compiler some day!
– You’ll almost certainly write parsers and
interpreters in some context if you haven’t
already
Course scope
• Aim:
– To learn techniques of a modern compiler
• Books:
– Des Watson, A Practical Approach to Compiler
Construction, Springer, 2017.
– Compilers – Principles, Techniques and Tools, Second
Edition by Alfred V. Aho, Monica S. Lam, Ravi Sethi,
Jeffrey D. Ullman
Prerequisites
• Problem Solving & Programming (C++,
Python)
• Data Structures
• Algorithm Design
• Theory of Automata
• Computer Architecture
• Assembly Programming
• Operating Systems
Topics
• High Level Languages
• Lexical analysis (Scanning)
• Syntax Analysis (Parsing)
• Syntax Directed Translation
• Intermediate Code Generation
• Run-time environments
• Code Generation
• Machine Independent Optimization
Grading policy
• Midterm (25 Marks)
• Final exam (50 Marks)
• Assignments (20 Marks)
• Quizzes (5 Marks)
Terminology
• Compiler:
– a program that translates an executable program in one
language into an executable program in another
language
– we expect the program produced by the compiler to be
better, in some way, than the original
• Interpreter:
– a program that reads an executable program and
produces the results of running that program
– usually, this involves executing the source program in
some fashion
• Our course is mainly about compilers but many of
the same issues arise in interpreters
Common Issues
• Compilers and interpreters both must read
the input – a stream of characters – and
“understand” it; analysis
w h i l e ( k < l e n g t h ) { <nl> <tab> i f ( a [ k ] > 0
) <nl> <tab> <tab>{ n P o s + + ; } <nl> <tab> }
Interpreter
• Interpreter
– Execution engine
– Program execution interleaved with analysis
running = true;
while (running) {
analyze next statement;
execute that statement;
}
– Usually need repeated analysis of statements
(particularly in loops, functions)
– But: immediate execution, good debugging &
interaction
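The interpreter loop above can be made concrete. Here is a minimal sketch in Python; the statement forms (`set`/`add`/`print` tuples) are an illustrative assumption, not a fixed format from the slides:

```python
# A minimal sketch of the interpreter loop: each statement is
# analyzed and then executed immediately, one at a time.
def interpret(program):
    env = {}                          # variable store
    for stmt in program:              # "analyze next statement"
        op = stmt[0]
        if op == "set":               # ("set", name, value)
            env[stmt[1]] = stmt[2]
        elif op == "add":             # ("add", name, amount)
            env[stmt[1]] += stmt[2]
        elif op == "print":           # ("print", name)
            print(env[stmt[1]])       # "execute that statement"
    return env

interpret([("set", "x", 1), ("add", "x", 2), ("print", "x")])  # prints 3
```

If the program contained a loop, each statement in its body would be re-analyzed on every iteration, which is exactly the repeated-analysis cost noted above.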
Compiler
• Read and analyze entire program
• Translate to semantically equivalent program in
another language
– Easier to execute or more efficient
– Should “improve” the program in some fashion
Typical Implementations
• Compilers
– FORTRAN, C, C++, Java, COBOL, etc. etc.
– Strong need for optimization in many cases
• Interpreters
– PERL, Python, Ruby, awk, sed, shells,
Scheme/Lisp/ML, postscript/pdf, Java VM
– Particularly effective if interpreter overhead is
low relative to execution cost of individual
statements
Hybrid approaches
• Well-known example: Java
– Compile Java source to byte codes – Java Virtual
Machine language (.class files)
– Execution
• Interpret byte codes directly, or
• Compile some or all byte codes to native code
– Just-In-Time compiler (JIT) – detect hot spots & compile on the
fly to native code – standard these days
• Variation: .NET
– Compilers generate MSIL
– All IL compiled to native code before execution
High-Level Languages
• In the 1940s it was clear that there was a
need for software tools to support the
programming process.
• Programming was done in machine code
– required considerable skill and was hard work
• Assembly language
– relieving the programmer from having to deal
with much of the low-level detail
– but require an assembler
High-Level Languages (Cont..)
• Development of high-level languages
gathered speed in the 1950s.
• Need for compilers and other tools for the
implementation of these languages.
• The importance of formal language
specifications was recognized
• The correspondence between particular
grammar types and implementation was
understood
High-Level Languages (Cont..)
• The extensive use of high-level languages
prompted the rapid development of a wide
range of new languages.
– COBOL for business applications
– FORTRAN for numerical computation
– PL/I general-purpose
Advantages of High-Level Languages
• Problem solving is significantly faster
• High-level language programs
– Easier to read
– Understand
– Maintain
• High-level languages are easier to learn.
• Programs are structured more easily
• Data structuring features
Advantages of High-Level Languages (Cont..)
• Object orientation
• Support for asynchronous processes and
parallelism
• Software portability
– Machine independence
– Java
Advantages of High-Level Languages (Cont..)
• Compile-time checking can remove many
bugs at an early stage
– variable declarations
– type checking
– variable initialization
– compatibility in function arguments
• The compiler can insert runtime code such
as array bounds checking
Disadvantages of High-Level Languages
• The program may need to perform some
low-level, hardware-specific operations
which do not correspond to a high-level
language feature.
• In most high-level languages there is no
way to express direct machine addressing
• Less efficiency in terms of execution speed
Abstract view
Source code -> Compiler -> Machine code
(errors)
• Recognizes legal (and illegal) programs
• Generate correct code
• Manage storage of all variables and code
• Agreement on format for object (or
assembly) code
Structure of a Compiler
• First approximation
– Front end: analysis
• Read source program and understand its structure
and meaning
– Back end: synthesis
• Generate equivalent target language program
Source -> Front End -> Back End -> Target
Implications
• Must recognize legal programs (& complain about
illegal ones)
• Must generate correct code
• Must manage storage of all variables/data
• Must agree with OS & linker on target format
More Implications
• Need some sort of Intermediate Representation(s)
(IR)
• Front end maps source into IR
• Back end maps IR to target machine code
• Often multiple IRs – higher level at first, lower
level in later phases
Front End
• Split into two parts
– Scanner: Responsible for converting character stream to
token stream
• Also strips out white space, comments
– Parser: Reads token stream; generates IR
• Both of these can be generated automatically
– Source language specified by a formal grammar
– Tools read the grammar and generate scanner & parser
(either table-driven or hard-coded)
source -> Scanner -> tokens -> Parser -> IR
Front end
• Scanner:
– Maps characters into tokens – the basic unit of syntax
• x = x + y becomes <id, x> = <id, x> + <id, y>
– Typical tokens: number, id, +, -, *, /, do, end
– Eliminate white space (tabs, blanks, comments)
• A key issue is speed, so instead of using a tool like
LEX it is sometimes necessary to write your own
scanner
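As a sketch of what a hand-written scanner does, here is a small Python tokenizer for the token kinds mentioned above; the token names and the use of the `re` module are illustrative choices, not the course's required implementation:

```python
import re

# One regular expression per token kind; order matters
# (e.g. number must be tried before id would matter for digits).
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("op",     r"[+\-*/=]"),
    ("skip",   r"[ \t\n]+"),      # white space, to be eliminated
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "skip":  # drop white space
            tokens.append((m.lastgroup, m.group()))
    return tokens

# x = x + y becomes <id,x> = <id,x> + <id,y>
print(scan("x = x + y"))
```

A production scanner would also track line numbers for error messages and handle comments, but the character-to-token mapping is the essential job.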
Front end
• Parser:
– Recognize context-free syntax
– Guide context-sensitive analysis
– Construct IR
– Produce meaningful error messages
– Attempt error correction
• There are parser generators like YACC which
automate much of the work
Front end
• Context-free grammars are used to represent
programming language syntax:
<expr> ::= <expr> <op> <term> | <term>
<term> ::= <number> | <id>
<op> ::= + | -
Front end
• A parser tries to map a
program to the syntactic
elements defined in the
grammar
• A parse can be
represented by a tree
called a parse or syntax
tree
• x+2-y
Front end
• A parse tree can be
represented more
compactly referred to as
Abstract Syntax Tree
(AST)
• AST is often used as IR
between front end and
back end
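As an illustration, the expression x+2-y could be held as an AST like this; the node classes are one possible encoding, not a fixed IR format:

```python
# A minimal AST for x+2-y, of the kind used as IR between the
# front end and the back end.
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# x+2-y parses as (x+2)-y under left associativity:
ast = BinOp("-", BinOp("+", Var("x"), Num(2)), Var("y"))
```

Unlike a full parse tree, the AST drops the intermediate grammar symbols (<expr>, <term>, <factor>) and keeps only the operators and operands.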
Back end
IR -> Instruction Selection -> Register Allocation -> Machine code
(errors)
• Translate IR into target machine code
• Choose instructions for each IR operation
• Decide what to keep in registers at each
point
• Ensure conformance with system interfaces
Back end
• Produce compact fast code
• Use available addressing modes
Back end
• Have a value in a register when used
• Limited resources
• Optimal allocation is difficult
Traditional three pass compiler
Source code -> Front end -> IR -> Middle end -> IR -> Back end -> Machine code
(errors)
• Code improvement analyzes and changes the IR
• The goal is to reduce runtime
Middle end (optimizer)
• Modern optimizers are usually built as a set
of passes
• Typical passes
– Constant propagation
– Common sub-expression elimination
– Redundant store elimination
– Dead code elimination
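A constant-propagation pass, the first in the list above, can be sketched over a tiny three-address IR; the tuple instruction format and opcode names are illustrative assumptions:

```python
# Constant propagation (with folding) over a tiny three-address IR.
# Instructions: ("const", dst, value) and ("add", dst, src1, src2).
def propagate_constants(code):
    known = {}                  # variables whose values are known constants
    out = []
    for instr in code:
        if instr[0] == "const":
            _, dst, value = instr
            known[dst] = value
            out.append(instr)
        elif instr[0] == "add":
            _, dst, a, b = instr
            if a in known and b in known:    # fold to a constant
                known[dst] = known[a] + known[b]
                out.append(("const", dst, known[dst]))
            else:
                known.pop(dst, None)         # dst is no longer known
                out.append(instr)
    return out

code = [("const", "a", 2), ("const", "b", 3), ("add", "c", "a", "b")]
print(propagate_constants(code))  # the add becomes ("const", "c", 5)
```

A real pass would handle branches and loops (where "known" facts must be merged at join points), which is why optimizers are built as analyses over control-flow graphs rather than straight-line scans like this one.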
Defining a Programming Language
• Important to users of the programming
language as well as to the compiler writer.
• It tells programmers how to write programs in the
language
• Syntax
– Defines the sequences of characters that could
form a valid program
• Semantics
– Defines the meaning of these programs
Defining a Programming Language
• The language definition should be
– Clear
– Precise
– Complete
– Unambiguous
– Understandable
• Programming languages have been defined
using natural language, which leads to ambiguities
• Defining the syntax is usually done using a more
formal approach
Backus–Naur Form (BNF)
• A metalanguage
• It became widely known through its use in the
definition of the syntax of ALGOL 60
• It is a very simple yet powerful metalanguage
• It has been used extensively to support the formal
definitions of a huge range of languages
Backus–Naur Form (BNF)
• A BNF specification consists of a set of rules
• Each rule defines a symbol
– non-terminal symbols
• written in angle brackets in BNF
– terminal symbols
• cannot be expanded further
• tokens
– a start symbol
• a designated non-terminal
A trivial language
• <sentence> ::= <subject> <verb> <object>
• <subject> ::= <article> <noun>
• <object> ::= <article> <noun> | <article>
<adjective> <noun>
• <verb> ::= watches | hears | likes
• <article> ::= a | the
• <noun> ::= man | woman | bicycle | book
• <adjective> ::= red | big | beautiful
Example
• <sentence>
• <subject><verb><object>
• <article><noun><verb><object>
• the <noun><verb><object>
• the woman <verb><object>
• the woman watches <object>
• the woman watches <article><adjective><noun>
• the woman watches a <adjective><noun>
• the woman watches a beautiful <noun>
• the woman watches a beautiful bicycle
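Because this trivial grammar has no recursion, the language it generates is finite and can be enumerated directly. A sketch in Python, with the rules transcribed from the slide (the list encoding is an illustrative choice):

```python
from itertools import product

# The trivial language's terminal alternatives, transcribed directly.
articles   = ["a", "the"]
nouns      = ["man", "woman", "bicycle", "book"]
adjectives = ["red", "big", "beautiful"]
verbs      = ["watches", "hears", "likes"]

def sentences():
    # <subject> ::= <article> <noun>
    subjects = [f"{a} {n}" for a, n in product(articles, nouns)]
    # <object> ::= <article> <noun> | <article> <adjective> <noun>
    objects = subjects + [f"{a} {j} {n}"
                          for a, j, n in product(articles, adjectives, nouns)]
    # <sentence> ::= <subject> <verb> <object>
    return [f"{s} {v} {o}" for s, v, o in product(subjects, verbs, objects)]

all_sents = sentences()
print(len(all_sents))  # 8 subjects * 3 verbs * 32 objects = 768 sentences
print("the woman watches a beautiful bicycle" in all_sents)  # True
```

This finiteness is exactly what recursion in the expression grammar (next slide) removes: once a rule refers to itself, the language becomes infinite.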
BNF for Expression
• Slightly more complicated example
– <expr> ::= <term> | <expr> + <term> |
<expr> - <term>
– <term> ::= <factor> | <term> * <factor> |
<term> / <factor>
– <factor> ::= <integer> | (<expr>)
– <integer> ::= <digit> | <integer> <digit>
– <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
– These rules generate expressions such as 2+3+4/5 and 2+3*(44-567)
BNF
• There is no upper limit to the length of
expressions that can be generated by these
particular rules.
• This is because these rules make use of
recursion
• Consider the generation or derivation of the
expression 1+2*3
BNF
• <expr>
• <expr> + <term>
• <term> + <term>
• <factor> + <term>
• 1 + <term>
• 1 + <term> * <factor>
• 1 + <factor> * <factor>
• 1 + 2 * <factor>
• 1+2*3
BNF
• Note particularly that in the expansion
– 1+2*3 expands via 1+ <term>
– so 2*3 is grouped as a single entity called a <term>
• BNF rules can be used to support the idea of
operator precedence
• The precedence of the * operator is higher than the
precedence of the + operator
• Parsing according to the BNF rules therefore allows
the precedence of operators to be specified.
BNF
• BNF rules can be used to express the
associativity of operators.
• Consider the generation of 1+2+3.
• Here, the 1+2 part of the expression is
grouped as a single entity called a <term>.
• Therefore, the expression 1+2+3 is
interpreted as (1+2)+3.
• The + operator is left-associative.
BNF
• If different associativity or precedence rules
are required, then the BNF could be
modified to express these different rules.
• It is perhaps surprising that such a simple
metalanguage can do all this.
Extended Backus–Naur Form (EBNF)
• There are several different variants of the
low-level syntax of BNF-like
metalanguages, but one variant became
popular after its use in the ISO Pascal
Standard.
• This variant was called Extended Backus–
Naur Form (EBNF)
• It retains the basic principles of BNF but the
syntactic detail is a little different.
Extended Backus–Naur Form (EBNF)
• expr = term | expr "+" term | expr "-" term.
• term = factor | term "*" factor | term "/"
factor.
• factor = integer | "(" expr ")".
• integer = digit | integer digit.
• digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" |
"7" | "8" | "9".
Extended Backus–Naur Form (EBNF)
• Some other key features.
– Parentheses can be used to indicate grouping in
the rule.
– There is a specific feature to indicate
optionality in a rule: [X] specifies zero or one
instance of X, in other words specifying that X
is optional.
– Repetition (not by recursion) is supported too:
{X} implies zero or more instances of X.
Extended Backus–Naur Form (EBNF)
• We can therefore write an essentially
equivalent set of rules:
• expr = term | expr ("+" | "-") term.
• term = factor | term ("*" | "/") factor.
• factor = integer | "(" expr ")".
• integer = digit {digit}.
• digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" |
"7" | "8" | "9".
Syntax Diagram
• Graphical representation of Syntax
• Complete syntax definition is made up of a
set of syntax diagrams
Grammars
• The term “grammar” has a wide range of
definitions
• We need a tight and formal definition for it
when used in the context of programming
languages.
• The idea of a set of BNF rules to represent
the grammar of a language has already been
discussed, but formally a little more is
required.
Grammars
• The grammar (G) of a language is defined
as a 4-tuple G = (N, T, S, P) where:
– N is the finite set of non-terminal symbols.
– T is the finite set of terminal symbols (N and T
are disjoint.)
– S is the starting symbol, S ∈ N. The starting
symbol is the unique non-terminal symbol that
is used as the starting point for the generation
of all the strings of the language.
Grammars
• P is the finite set of production rules
– A production rule defines a string transformation and it has the general
form
• α → β.
– This rule specifies that any occurrence of the string α in the string to
be transformed is replaced by the string β.
• There are constraints on the constitution of the strings α
and β.
– If U is defined by U = N ∪ T
• U is the set of all non-terminal and terminal symbols
• then α has to be a member of the set of all non-empty strings that
can be formed by the concatenation of members of U, and it has to
contain at least one member of N.
• β has to be a member of the set of all strings that can be formed by
the concatenation of members of U, including the empty string
– β ∈ U∗
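The 4-tuple definition can be transcribed directly into code. As an illustration, here is the earlier expression grammar encoded as G = (N, T, S, P); the set/list encoding is a sketch, not a standard representation:

```python
# The expression grammar as the 4-tuple G = (N, T, S, P).
N = {"expr", "term", "factor", "integer", "digit"}       # non-terminals
T = {"+", "-", "*", "/", "(", ")"} | set("0123456789")   # terminals
S = "expr"                                               # start symbol
P = {                                                    # productions
    "expr":    [["term"], ["expr", "+", "term"], ["expr", "-", "term"]],
    "term":    [["factor"], ["term", "*", "factor"], ["term", "/", "factor"]],
    "factor":  [["integer"], ["(", "expr", ")"]],
    "integer": [["digit"], ["integer", "digit"]],
    "digit":   [[d] for d in "0123456789"],
}

# The definition's side conditions: N and T are disjoint, and S is in N.
assert N.isdisjoint(T) and S in N
```

Because every left-hand side here is a single non-terminal, this grammar is Chomsky type 2 (context-free), which is why it could be written in BNF in the first place.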
Chomsky Hierarchy
Type 3 ⊂ Type 2 ⊂ Type 1 ⊂ Type 0
(each grammar type is a restriction of the one containing it)
Chomsky Hierarchy
• Chomsky type 0
– a free grammar or an unrestricted grammar
– α → β
• Chomsky type 1
– a context-sensitive grammar
– αAβ → αγβ
Chomsky Hierarchy
• Chomsky type 2
– a context-free grammar
– A→γ
– where A is a single non-terminal symbol.
– These productions correspond directly to BNF
rules.
• Chomsky type 3
– a regular grammar or a finite-state grammar
– A → a or A → aB