CS 346:
346 Syntax
S t Analyzer
A l
Resource: Textbook
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles,Techniques, and Tools”,
Addison-Wesley 1986
Addison-Wesley, 1986.
Syntax Analyzer
Syntax Analyzer: creates the syntactic structure of the given source program
Parser
Syntactic structure: parse tree
Syntax
S t off a programming:
i described
d ib d by
b a context-free
f grammar (CFG)
Steps
Parser checks whether a given source program satisfies the rules implied
b a CFG or not
by
If it satisfies, the parser creates the parse tree of that program
Otherwise the parser gives the error messages
Syntax Analyzer
CFG
gives a precise syntactic specification of a programming language
the design of the grammar is an initial phase of the design of a
compiler
p
a grammar can be directly converted into a parser by some tools
Parser
• Parser works on a stream of tokens
• Smallest
S ll t item:
it ttoken
k
source Lexical token pparse tree
program Parser
Analyzer get next token
Parsers (cont.)
Well-known categories of parsers:
1. Top-Down Parser
the
th parse tree
t createdt d top
t tot bottom,
b tt starting
t ti from
f th roott
the
2. Bottom-Up Parser
the parse created bottom to top; starting from the leaves
Both top
top-down
down and bottom
bottom-upup parsers scan the input from left to right (one
symbol at a time)
Efficient top-down and bottom-up parsers can be implemented only for sub-
classes of CFG
LL for top-down parsing
LR for f bottom-up
b tt parsing
i
Context-Free Grammars (CFG)
Inherently recursive structures of a programming language are defined by a
CFG
In a CFG, we have:
A finite set of terminals ((in our case,, this will be the set of tokens))
A finite set of non-terminals (syntactic-variables)
A finite set of productions rules in the following form
A where A is a non-terminal and
is a string of terminals and non-terminals (including the
empt string);
empty string) |A| <= ||
A start symbol: one of the non-terminal symbols
Example:
E E+E | E–E | E*E | E/E | -E
E (E)
E id
Derivations
E E+E
E+E derives from E
we ca
can replace
ep ace E by E+E
E E+E id + E id + id
A sequence of replacements of non-terminal symbols is called a derivation of id + id from E
In ggeneral a derivation step
p is
A if there is a production rule A in our grammar
where and are arbitrary strings of terminal and non-terminal symbols
1 2 ... n (n derives
d ffrom 1 or 1 derives
d n )
: derives in one step
: derives in zero or more steps
: derives in one or more steps
*
+
CFG - Terminology
L(G) is the language of G (the language generated by G) which is a set of
sentences
A sentence of L(G) is a string of terminal symbols of G
If S iis the
h start symbol
b l off G then
h
+
is a sentence of L(G) iff S where is a string of terminals of G
If G is a context-free grammar, L(G) is a context-free language
Two grammars are equivalent if they produce the same language
*
S - If contains non-terminals, it is called as a sentential form of G
- If does not contain non-terminals, it is called as a sentence of G
Derivation: Example
E -E -(E) -(E+E) -(id+E) -(id+id)
OR
E -E -(E) -(E+E) -(E+id) -(id+id)
At each derivation step, we can choose any of the non-terminals in the
sentential form of G for the replacement
left-most derivation: always
y chooses the left-most non-terminal in each
derivation step
right-most derivation: always chooses the right-most non-terminal in
each derivation step
Left Most and Right-Most
Left-Most Right Most Derivations
Left-Most Derivation
E -E -(E) -(E+E) -(id+E) -(id+id)
lm lm lm lm lm
Right-Most Derivation
E -E
E -(E)
(E) -(E+E)
(E+E) -(E+id)
(E+id) -(id+id)
(id+id)
rm rm rm rm rm
top-down parsers: finds the left-most derivation of the given source
program
bottom-up parsers: finds the right-most derivation of the given source
program in the reverse order
Parse Tree
• Intermediate nodes: Inner nodes of a parse tree
• Leaves:
eaves: Terminal
e a sy
symbols
os
• A parse tree can be seen as a graphical representation of a derivation
E -E E
-(E) E
-(E+E) E
- E - E - E
( E ) ( E )
E E E + E
- E - E
-(id+E) -(id+id)
( E ) ( E )
E + E E + E
id id id
Ambiguity
g y
• A grammar that produces more than one parse tree for a sentence is
called as an ambiguous grammar
E
E E+E id+E id+E*E E + E
id+id*E id+id*id
id E * E
id id
E
E E*E E+E*E id+E*E
id+id*E id+id*id E * E
E + E id
id id
Ambiguity (cont.)
For the most parsers, the grammar must be unambiguous
Unambiguous grammar
unique selection
l off the
h parse tree for
f a sentence
• Disambiguation
--Necessary to eliminate the ambiguity in the grammar during the
design phase of the compiler
Design unambiguous grammar
Choose one of the parse trees of a sentence to restrict to this choice
Ambiguity
g y ((cont.))
stmt if expr then stmt |
if expr
e pr then stmt else stmt | otherstmts
if E1 then
th if E2 then
th S1 else
l S2
IInterpretation-1:
1 S2 being
b i executedd when
h E1 is
i false
f l (thus
( h attaching
hi the
h else
l to the
h
first if)
if E1 then (if E2 then S1) else S2
IInterpretation-I1:
i I1 E1 is
i true
t andd E2 is
i false
f l (thus
(th attaching
tt hi theth else
l tot the
th secondd if)
if E1 then (if E2 then S1 else S2)
Ambiguity (cont.)
(cont )
stmt if expr then stmt |
if expr then stmt else stmt | otherstmts
if E1 then if E2 then S1 else S2
stmt stmt
if expr then stmt else stmt if expr then stmt
E1 if expr
p then stmt S2 E1 if expr
p then stmt else stmt
E2 S1 E2 S1 S2
1 2
Ambiguity (cont.)
• We prefer the second parse tree (else matches with closest if)
So, we have to disambiguate our grammar to reflect this choice
• Unambiguous grammar:
stmt matchedstmt | unmatchedstmt
matchedstmt if expr then matchedstmt else matchedstmt |
otherstmts
unmatchedstmt if expr
p then stmt |
if expr then matchedstmt else unmatchedstmt
Ambiguity – Operator Precedence
Ambiguous grammars (because of ambiguous operators) can be
disambiguated according to the precedence and associativity rules
E E+E | E*E | E^E | id | (E)
disambiguate the grammar
precedence: ^ (right to left)
* (left to right)
+ (left to right)
g
E E+T | T
T T*F | F
F G^F | G
G id | (E)
Left Recursion
A grammar is left recursive if it has a non
non-terminal
terminal A such that
there is a derivation
+
A A for some string
Top-down
p p
parsing
g techniques
q cannot handle left-recursive
grammars
Conversion of left-recursive grammar into an equivalent non-
recursive grammar is essentiall
Possible ways of left-recursion
may appear in
i a single
i l step off the
h derivation
d i i (immediate
(i di left-recursion)
lf i ) or
may appear in more than one step of the derivation
Immediate Left-Recursion
A A | where does not start with A
eliminate immediate left recursion
A A’
A’ A’ | an equivalent grammar
In general,
A A 1 | ... | A m | 1 | ... | n h 1 ... n do
where d not start with
i hA
eliminate immediate left recursion
A 1 A’ | ... | n A’
A’ 1 A’ | ... | m A’ | an equivalent grammar
Immediate Left
Left-Recursion
Recursion -- Example
E E+T | T
T T*F | F
F id | (E)
eliminate immediate left recursion
E T E’
E’ +T E’ |
T F T’
T’ *F T’ |
F id | (E)
Left-Recursion -- Problem
• A grammar cannot be immediately left-recursive, but it still can be left-recursive
• Just elimination of the immediate left-recursion does not guarantee a grammar
which is not left-recursive
S Aa | b
A Sc | d This grammar is not immediately left-recursive,
but it is still left-recursive
S Aa Sca or
A Sc
S Aac
A causes to
t a lleft-recursion
ft i
f
• Solution: eliminate all left-recursions ffrom the ggrammar
Eliminate Left-Recursion -- Algorithm
- Arrange non-terminals in some order: A1 ... An
- for
f i from f 1 to
t n dod {
- for j from 1 to i-1 do {
replace
l eachh production
d i
Ai Aj
by
Ai 1 | ... | k
h Aj 1 | ... | k
where
}
- eliminate
l immediate
d lleft-recursions
f among Ai productions
d
}
Eliminate Left-Recursion -- Example
S Aa | b
A Ac | Sd | f
- Order of non-terminals: S, A
for S:
p
- we do not enter the inner loop.
- there is no immediate left recursion in S.
for A:
- Replace A Sd with A Aad | bd
So, we will have A Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A bdA’ | fA’
A’ cA’ | adA’ |
So,, the resultingg equivalent
q grammar
g which is not left-recursive is:
S Aa | b
A bdA’ | fA’
A’ cA’ | adA’ |
Eliminate Left-Recursion – Example2
S Aa | b
A Ac | Sd | f
- Order of non-terminals: A, S
for A:
- we do not enter the inner loop
- Eliminate the immediate left-recursion in A
A SdA’ | fA’
A’ cA’ |
for S:
- Replace S Aa with S SdA’a | fA’a
So, we will have S SdA’a | fA’a | b
- Eliminate the immediate left
left-recursion
recursion in S
S fA’aS’ | bS ’
S’ dA’aS’ |
So, the resulting equivalent grammar which is not left-recursive
So left recursive is:
S fA’aS’ | bS ’
S’ dA’aS’ |
A SdA’ | fA’
A’ cA’ |
Left-Factoring
Top-down parser without backtracking (predictive parser) insists that the
grammar must be left
left-factored
factored
grammar a new equivalent grammar suitable for predictive parsing
stmt if expr then stmt else stmt |
if expr then stmt
After seeing if,
if we cannot decide which production rule to choose
to re-write stmt in the derivation
Left-Factoring (cont.)
In general,
A 1 | 2 where is non-empty and the first symbols
of 1 and 2 (if they have one) are different
Choice involved when processing
A to
1 or
A to 2
Re-write the grammar as follows:
A A’
A’ 1 | 2 so, we can immediately expand A to A’
Left-Factoring -- Algorithm
For each non-terminal A with two or more alternatives (production rules)
with a common non-empty
non empty prefix,
prefix let say
A 1 | ... | n | 1 | ... | m
convert it into
A A’ | 1 | ... | m
A’ 1 | ... | n
Left-Factoring – Example1
A abB | aB | cdg | cdeB | cdfB
A aA’ | cdg | cdeB | cdfB
A’ bB | B
A aAA’ | cdA
dA’’
A’ bB | B
A’’ g | eB | fB
Left-Factoring
Left Factoring – Example2
A ad | a | ab | abc | b
A aA’ | b
A’ d | | b | bc
A aA’
A’ | b
A’ d | | bA’’
A’’ | c
Non-Context Free Language
g g Constructs
Some language constructions in the programming languages are not
context free
context-free
1 L1 = { c | is in (a|b)*}
Example-1:
Example
declaring an identifier and checking whether it is declared or not
later We cannot do this with a context-free
later. context free language.
language We need
semantic analyzer (which is not context-free)
Example-2: L2 = {anbmcndm | n1 and m1 }
declaring two functions (one with n parameters, the other one with
m parameters), and then calling them with actual parameters