Why Intermediate Code
Generation?
Intermediate Code Generation
Intermediate code - interface between front end & back end in a
compiler
details of source language are confined to the front end and the details
of target machines to the back end
Parser
Static
Checker
Front end
Intermediate
Code Generator
Code
Generator
Back end
Logical structure of a compiler
A compiler might use a sequence of intermediate representations
Variants of syntax trees
Nodes in a syntax tree represent constructs in the source program;
the children of a node represent the meaningful components of a
construct.
It is sometimes beneficial to create a DAG instead of tree for
Expressions.
We can easily show the common sub-expressions and then use that
knowledge during code generation
Example: a+a*(b-c)+(b-c)*d
+
+
*
*
d
-
a
b
SDD for creating DAGs
Production
1)
2)
3)
4)
5)
6)
E -> E1+T
E -> E1-T
E -> T
T -> (E)
T -> id
T -> num
Example
1)p1=Leaf(id, entry-a)
2)P2=Leaf(id, entry-a)=p1
3)p3=Leaf(id, entry-b)
4)p4=Leaf(id, entry-c)
5)p5=Node(-,p3,p4)
6)p6=Node(*,p1,p5)
7)p7=Node(+,p1,p6)
Semantic Rules
E.node= new Node(+, E1.node,T.node)
E.node= new Node(-, E1.node,T.node)
E.node = T.node
T.node = E.node
T.node = new Leaf(id, id.entry)
T.node = new Leaf(num, num.val)
8)
9)
10)
11)
12)
13)
p8=Leaf(id,entry-b)=p3
p9=Leaf(id,entry-c)=p4
p10=Node(-,p3,p4)=p5
p11=Leaf(id,entry-d)
p12=Node(*,p5,p11)
p13=Node(+,p7,p12)
+
+
*
*
d
-
a
b
a+a*(b-c)+(b-c)*d
Value-number method for constructing DAGs
1
2
3
+
i
10
id
num
+
=
To entry for i
10
1
1
2
3
Algorithm (Input : Label op, node l, and node r.
Search the array for node M with label op, left child l & right child r
If there is such a node, return the value number of M
If not, create in the array a new node N with label op, left child l, and
right child r and return its value number
We may use a hash table
Three address code
In a three address code there is at most one operator at the
right side of an instruction
Example:
+
+
*
*
d
-
a
b
t1 = b c
t2 = a * t1
t3 = a + t2
t4 = t1 * d
t5 = t3 + t4
Addresses and
Instructions
Three-address code is built from two concepts: addresses and
instructions.
An address can be one of the following:
A name - we allow source-program names to appear as
addresses in three-address code. In an implementation, a
source name is replaced by a pointer to its symbol-table
entry, where all information about the name is kept.
A constant - compiler must deal with many different types of
constants and variables.
A compiler-generated temporary - It is useful, especially
in optimizing compilers, to create a distinct name each time a
temporary is needed.
List of Common 3-address
instruction forms
1. Assignment instructions of the form x = y op z, where op is a binary
arithmetic or logical operation, and x, y, and z are addresses.
2. Assignments of the form x = op y, where op is a unary operation.
(unary minus, logical negation, shift operators, and conversion
operators )
3. Copy instructions of the form x = y, where x is assigned the value of
y.
4. An unconditional jump goto L. The three-address instruction with
label L is the next to be executed.
5. Conditional jumps of the form if x goto L and if False x goto L.
These instructions execute the instruction with label L next if x is true
and false, respectively. Otherwise, the following three-address
instruction in sequence is executed next, as usual.
6. Conditional jumps such as if x relop y goto L
7. Procedure calls and returns are implemented using the following
instructions: param x for parameters; call p,n and y = call p,n for
procedure and function calls, respectively; and return y, where y,
representing a returned value, is optional. Their typical use is as the
sequence of three address instructions
param x1
param x2
param xn
call p,n
generated as part of a call of the procedure p(x1,x2, ,xn).
6. Indexed copy instructions of the form x=y[il and xlil=y.
7. Address and pointer assignments of the form x=&y, x =* y, and * x
= y.
Forms of three address instructions
x = y op z
x = op y
x = y
goto L
if x goto L and if False x goto L
if x relop y goto L
Procedure calls using:
param x
call p,n
y = call p,n
x = y[i] and x[i] = y
x = &y and x = *y and *x =y
Example
do i = i+1; while (a[i] < v);
L:
t1 = i + 1
i = t1
t2 = i * 8
t3 = a[t2]
if t3 < v goto L
Symbolic labels
100:
101:
102:
103:
104:
t1 = i + 1
i = t1
t2 = i * 8
t3 = a[t2]
if t3 < v goto 100
Position numbers
Data structures for three address
codes
Quadruples
Has four fields: op, arg1, arg2 and result
Triples
Temporaries are not used and instead references
to instructions are made
Indirect triples
In addition to triples we use a list of pointers to
triples
Quadruples
A quadruple (quad) has four fields, which we call op,
arg1, arg2, and result. The op field contains an
internal code for the operator.
For instance, the three-address instruction x = y + z
is represented by placing + in op, y in arg1, z in arg2,
and x in result.
The following are some exceptions to this rule:
Instructions with unary operators like x = minus y or x =
y do not use arg2. Note that for a copy statement
like x = y, op is =, while for most other operations,
the assignment operator is implied.
Operators like param use neither arg2 nor result.
Conditional and unconditional jumps put the target label
in result.
Triples
A triple has only three fields, which we call op, arg1, and
arg2
Using triples, we refer to the result of an operation x op y
by its position, rather than by an explicit temporary name.
Instead of the temporary ti, a triple representation would
refer to position (0).
Parenthesized numbers represent pointers into the triple
structure itself.
positions
numbers.
or
pointers
to
positions
were
called
value
Three address code
Example
b * minus c + b * minus c
Quadruples
op arg1 arg2 result
minus c
t1
*
b
t1 t2
minus c
t3
b
t3
t4
*
+
t2 t4 t5
=
t5
a
Triples
0
1
2
3
4
5
op arg1 arg2
minus c
*
b
(0)
minus c
b
(2)
*
+
(1) (3)
=
a
(4)
t1 = minus c
t2 = b * t1
t3 = minus c
t4 = b * t3
t5 = t2 + t4
a = t5
Indirect Triples
op
35 (0)
36 (1)
37 (2)
38 (3)
39 (4)
40 (5)
0
1
2
3
4
5
op arg1 arg2
minus c
*
b
(0)
minus c
b
(2)
*
+
(1) (3)
=
a
(4)
Static Single-Assignment Form
Static single-assignment form (SSA) is an intermediate
representation that facilitates certain code optimizations.
Two distinctive aspects distinguish SSA from threeaddress code.
All assignments in SSA are to variables with distinct
names; hence the term static single-assigment.
x
y
a
a
b
=
=
=
=
=
1
2
x + y
a + 3
x + y
(a) Normal form
x1
y1
a1
a2
b1
= 1
= 2
= x1 + y1
= a1 + 3
= x1 + y1
(b) SSA form
Types and Declarations
The applications of types can be grouped as follows
Type checking
uses logical rules to reason about the behavior of a program at run time.
ensures the types of the operands match the type expected by an operator.
e.g., && operator in Java expects its 2 operands and results to be booleans.
Translation Applications
From the type of a name, a compiler can determine the storage that will be
needed for that name at run time.
Type information is also needed to
calculate the address denoted by an array reference,
to insert explicit type conversions,
to choose the right version of an arithmetic operator, among other things.
Type Expressions
Types have structure, which we shall represent using
type expressions: a type expression is either a basic
type or is formed by applying an operator called a type
constructor to a type expression.
The sets of basic types and constructors depend on the
language to be checked.
The array type int [2][3] can be read as "array of 2
arrays of 3 integers each" - array(2, array(3, integer)).
Type Expressions
basic type is a type expression
type name is a type expression
type expression can be formed by applying array type constructor to a
number & a type expression.
record is a data structure with named field. A type expression can be
formed by applying record type constructor to field names & their
types.
type expression can be formed using type constructor for function
types
If s and t are type expressions, then their Cartesian product s*t is a
type expression
Type expressions may contain variables whose values are type
expressions
Type Equivalence
When type expressions are represented by graphs(DAG),
two types are structurally equivalent if and only if one of the
following conditions is true:
They are the same basic type.
They
are
formed
by
applying
the
same
structurally equivalent types.
One is a type name that denotes the other.
constructor
to
Declarations
Storage Layout for Local Names
Storage comes in blocks of contiguous bytes, where a byte is the
smallest unit of addressable memory.
byte is eight bits, and some number of bytes form a machine word.
Multibyte objects are stored in consecutive bytes & given address of
the first byte.
width of a type - number of storage units needed for objects of that
type.
basic type, such as a character, integer, or float, requires an integral
number of bytes.
For easy access, storage for aggregates such as arrays and classes
is allocated in one contiguous block of bytes.
Storage Layout for Local Names
Computing types and their widths
Storage Layout for Local Names
Syntax-directed translation of array types
Sequences of Declarations
Languages such as C and Java allow all the declarations in a
single procedure to be processed as a group.
The declarations may be distributed within a Java procedure,
but they can still be processed when the procedure is analyzed.
Therefore, we can use a variable, say offset, to keep track of
the next available relative address.
Before the first declaration is considered, offset is set to 0.
As each new name x is seen, x is entered into the symbol table
with its relative address set to the current value of offset, which
is then incremented by the width of the type of x.
Sequences of Declarations
Fields in Records and Classes
Record types can be added to the grammar by adding the
following production
The fields in this record type are specified by the sequence of
declarations generated by D.
The field names within a record must be distinct; that is, a name
may appear at most once in the declarations generated by D.
The offset or relative address for a field name is relative to the
data area for that record.
Fields in Records and Classes
The use of a name x for a field within a record does not conflict with
other uses of the name outside the record.
Thus, the three uses of x in the following declarations are distinct
and do not conflict with each other
A subsequent assignment x = p. x + q. x; sets variable x to the sum
of the fields named x in the records p and q.
The relative address of x in p differs from the relative address of x in
q.
Fields in Records and Classes
The embedded action before D saves the existing symbol
table, denoted by top and sets top to a fresh symbol table.
It also saves the current offset, and sets offset to 0.
The declarations generated by D will result in types and
relative addresses being put in the fresh symbol table.
The action after D creates a record type using top, before
restoring the saved symbol table and offset.
Translation of Expressions and
Statements
Operations Within Expressions
Incremental Translation
Addressing Array Elements
Translation of Array References
Operations Within Expressions
The synthesized attribute S.code represents the three-address
code for the assignment.
Expressions have two synthesized attributes:
1. E.addr, holds the address for the value of E;
2. E.code, the three-address code for evaluating E.
Temporary names are generated for intermediate computations.
The function newtemp() generates distinct temporary names t1,
t2, . . .
The function gen() generates three-address code such that:
1. Everything quoted is taken literally;
2. The rest is evaluated.
Three-address code for expressions
Incremental Translation
Code attributes can be long strings, so they are usually
generated incrementally.
In the incremental approach, gen not only constructs a
three-address instruction, it appends the instruction to the
sequence of instructions generated so far.
The sequence may either be retained in memory for further
processing, or it may be output incrementally.
Incremental Translation
Addressing Array Elements
Array elements can be accessed quickly if they are stored in a
block of consecutive locations.
In C and Java, array elements are numbered 0 , 1 , . . . , n 1,
for an array with n elements.
If the width of each array element is w, then the i th element of
array A begins in location
base + i * w
where base is the relative address of the storage allocated for
the array.
location of A[i]
any number)
= base+(i-low)*width (indices can start from
Addressing Array Elements
Layouts for a two-dimensional array:
Translation of Array References
Nonterminal L has 3 synthesized attributes
L.addr temporary that is used while computing the offset for
the array reference by summing the terms ij x wj
L.array is a pointer to the symbol table entry for the array
name.
L.array.base is used to determine the actual l-value of an array reference after
all the index expressions are analyzed.
L.type - type of the subarray generated by L .
For an array type t , t.elem gives the element type
Translation of Array References
Control Flow
The translation of statements such as if-else-statements and
while-statements is tied to the translation of boolean expressions.
In programming languages, boolean expressions are often used to
Alter the flow of control.
Boolean expressions are used as conditional expressions in
statements that alter the flow of control.
The value of such boolean expressions is implicit in a position
reached in a program.
For example, in if (E) S, the expression E must be true if
statement S is reached.
Compute logical values.
A boolean expression can represent true or false as values.
Such boolean expressions can be evaluated in analogy to
arithmetic expressions using three-address instructions with
logical operators.
Boolean Expressions
Boolean expressions are composed of the boolean operators
(which we denote &&, ||, and !, for the operators AND, OR, and
NOT, respectively) applied to elements that are boolean variables
or relational expressions.
Relational expressions are of the form E1 rel E2, where E1 and E2
are arithmetic expressions.
B -> B || B | B && B | ! B | ( B ) \ E rel E | true | false
We use the attribute rel. op to indicate which of the six comparison
operators <, < = , =, ! =, >, or >= is represented by rel.
|| and && are left-associative, and that || has lowest precedence,
then &&, then !.
Short- Circuit Code
In short-circuit (or jumping) code, the boolean
operators &&, || , and ! Translate into jumps.
The operators themselves do not appear in
the code; instead, the value of a boolean
expression is represented by a position in the
code sequence.
Short-Circuit Code
Flow-of-Control Statements
Syntax-directed definition
Control-Flow Translation of Boolean
Expressions
Avoiding Redundant Gotos
The translation of if ( x < 5 || x > 10 && x == y ) x = 3 ;
if x < 5 goto L2
goto L3
L3: if x > 10 goto L4
goto L1
L4: if x == y goto L2
goto L1
L2: x = 3
There are three extra gotos.
One is a goto the next statement.
Two others could be eliminated by using ifFalse.
if x > 200 goto L4
goto L1
L4 :
ifFalse x > 200 goto L1
L4 :
This ifFalse instruction takes advantage of the natural flow from
one instruction to the next in sequence, so control simply "falls
through" to label L4 if x > 200 is false, thereby avoiding a jump.
Semantic rules for B E1 rel E2
Semantic rules for B B1 || B2
Using the above rules if ( x < 100 || x > 200 && x != y ) x = 0; translates to
if x < 100 goto L2
ifFalse x > 200 goto L2
ifFalse x != y goto L1
L2 : x = 0
L1 :
Boolean Values and Jumping
Code
A boolean expression may also be evaluated for its value, as
in assignment statements such as x = t r u e ; or x = a<b;.
A clean way of handling both roles of boolean expressions is
to first build a syntax tree for expressions, using either of
the following approaches:
1. Use two passes: Construct a complete syntax tree for the
input, and then walk the tree in depth-first order, computing the
translations specified by the semantic rules.
2. Use one pass for statements, but two passes for
expressions: translate E in while (E) S1 before S1 is examined.
S id = E ; | if ( E ) S | while (E)S | S S
E E || E | E && E | E rel E | E + E | (E) | id | true | false
E.n represents syntax tree node for E.
Two methods jump and rvalue are used in evaluating expressions.
jump to generate jumping code at an expression node
When E appears in Swhile (E) S1 , call E.n.jump(t, f), where t is a
new label for the first instruction of S1.code and f is the label
S.next.
rvalue to generate code to compute value of node into a temporary.
When E appears in Sid = E; ,method rvalue is called at node E.n.
1. E has form of E E1 + E2 , call E.n.rvalue()
2. E has form of E E1 && E2, generate jumping code for E and then assign
true or false to a new temporary t at the true and false exits,
respectively,
from the jumping code.
Backpatching
A key problem when generating code for boolean expressions
and flow-of-control statements is that of matching a jump
instruction with the target of the jump.
For example, the translation of the boolean expression B in if ( B
) S contains a jump, for when B is false, to the instruction
following the code for S.
In a one-pass translation, B must be translated before S is
examined.
But a separate pass is then needed to bind labels to addresses.
Backpatching - lists of jumps are passed as synthesized
attributes.
Specifically, when a jump is generated, the target of the jump is
temporarily left unspecified.
Each such jump is put on a list of jumps whose labels are to be filled
in when the proper label can be determined.
All of the jumps on a list have the same target label.
One-Pass Code Generation Using
Backpatching
Backpatching can be used to generate code for boolean expressions
and flow of control statements in one pass
Synthesized attributes truelist and falselist of nonterminal B are
used to manage labels in jumping code for boolean expressions.
B. truelist will be a list of jump or conditional jump instructions into
which we must insert the label to which control goes if B is true.
B.falselist likewise is the list of instructions that eventually get the
label to which control goes when B is false.
As code is generated for B, jumps to the true and false exits are left
incomplete, with the label field unfilled. These incomplete jumps are
placed on lists pointed to by B.truelist and B.falselist, as appropriate
A statement S has a synthesized attribute S.nextlist,
denoting a list of jumps to the instruction immediately
following the code for S.
Instructions are generated into an instruction array, and
labels will be indices into this array.
To mAnipulate list of jumps use following functions:
makelist(i): create a new list containing only i, an index into the
array of instructions
Merge(p1,p2): concatenates the lists pointed by p1 and p2 and
returns a pointer to the concatenated list
Backpatch(p,i): inserts i as the target label for each of the
instruction on the list pointed to by p
Backpatching for Boolean
Expressions
A marker non-terminal
M in the grammar
causes a semantic
action to pick up, at
appropriate times, the
index of the next
instruction
to
be
generated.
Backpatching for Boolean
Expressions
Annotated parse tree for x < 100 || x > 200
&& x ! = y
Flow-of-Control
Statements
Translation of a switchstatement
Readings
Chapter 6 of the book