Lex and Yacc
A Quick Tour
HW8Use Lex/Yacc to
Turn this: Into this:
<P> Here's a list:
Here's a list: * This is item one of a list
<UL> * This is item two. Lists should be
<LI> This is item one of a list indented four spaces, with each item
<LI>This is item two. Lists should be marked by a "*" two spaces left of
indented four spaces, with each item four-space margin. Lists may contain
marked by a "*" two spaces left of four- nested lists, like this:
space margin. Lists may contain * Hi, I'm item one of an inner list.
nested lists, like this:<UL><LI> Hi, I'm * Me two.
item one of an inner list. <LI>Me two. * Item 3, inner.
<LI> Item 3, inner. </UL><LI> Item 3, * Item 3, outer list.
outer list.</UL> This is outside both lists; should be back
This is outside both lists; should be to no indent.
back to no indent.
<P><P> Final suggestions:
Final suggestions
2
if myVar == 6.02e23**2 then f( .. !
char stream
LEX token stream
if myVar == 6.02e23**2 then f( !
tokenstream
YACC
parse tree
if-stmt
== fun call
var ** Arg 1 Arg 2
float-lit int-lit ... !
3
Lex / Yacc History
Origin early 1970s at Bell Labs
Many versions & many similar tools
Lex, flex, jflex, posix,
Yacc, bison, byacc, CUP, posix,
Targets C, C++, C#, Python, Ruby, ML,
Well use jflex & byacc/j, targeting java
(but for simplicity, I usually just say lex/yacc)
4
Uses
Front end of many real compilers
E.g., gcc
Little languages:
Many special purpose utilities evolve some
clumsy, ad hoc, syntax
Often easier, simpler, cleaner and more
flexible to use lex/yacc or similar tools from
the start
5
Lex:
A Lexical Analyzer Generator
Input:
Regular exprs defining "tokens"
my.flex
Fragments of declarations & code
Output:
A java program yylex.java
jflex
Use: yylex.java
Compile & link with your main()
Calls to yylex() read chars & return
successive tokens.
7
yacc:
A Parser Generator
Input:
my.y
A context-free grammar
Fragments of declarations & code
Output: byaccj
A java program & some header files
Use: ParserVal.java
Compile & link it with your main()
Call yyparse() to parse the entire input Parser.java
yyparse() calls yylex() to get successive tokens
9
Lex Input: "mylexer.flex"
// java stuff
%: Lex %%
section %byaccj Declarations & code: most
delims
%{ copied verbatim to java pgm
public foo()
%}
%% Token code
Rules/ [a-zA-Z]+ {foo(); return(42); }
regexps [ \t\n] {; /* skip whitespace */}
+
{Actions}
No action
11
Lex Regular Expressions
Letters & numbers match themselves
Ditto \n, \t, \r
Punctuation often has special meaning
But can be escaped: \* matches *
Union, Concatenation and Star
r|s, rs, r*; also r+, r?; parens for grouping
Character groups
[ab*c] == [*cab], [a-z2648AEIOU], [^abc]
^ for not only in char groups, not complementation
12
SE
E E+n | E-n | n
Yacc Input: expr.y
%{
Java decls import java.io.*; Parser.java
%}
Yacc decls %token NUM VAR Parser.java
%%
stmt: exp { printf(%d\n,$1);}
;
Rules exp : exp + NUM { $$ = $1 + $3; }
and | exp - NUM { $$ = $1 - $3; }
{Actions}
| NUM { $$ = $1; }
; C code; java ex later
%%
Java code public static void main( Parser.java
14
Expression lexer: expr.l
%{ y.tab.h:
#include "y.tab.h" #define NUM 258
#define VAR 259
#define YYSTYPE int
%} extern YYSTYPE yylval;
%%
[0-9]+ { yylval = atoi(yytext); return NUM;}
[ \t] { /* ignore whitespace */ }
\n { return 0; /* logical EOF */ }
. { return yytext[0]; /* +-*, etc. */ }
%%
yyerror(char *msg){printf("%s,%s\n",msg,yytext);}
int yywrap(){return 1;}
15
Lex/Yacc Interface:
Compile Time
my.y my.flex more.java
byaccj jflex
Parser.java ParserVal.java Yylex.java
javac
Parser.class
17
Lex/Yacc Interface:
Run Time
main()
yyparse()
Token code
yylex() yylval
Myaction:
... Token value
yylval = ...
...
return(code)
18
Parser Value class
public class ParserVal //then do !
{
yylval = new ParserVal(3.14);
public int ival;
public double dval; yylval = new ParserVal(42);
public String sval; // ...or something like...
public Object obj; yylval = new ParserVal(new
public ParserVal(int val) myTypeOfObject());
{ ival=val; }
public ParserVal(double val)
{ dval=val; }
public ParserVal(String val) // in yacc actions, e.g.:!
{ sval=val;} $$.ival = $1.ival + $2.ival;
public ParserVal(Object val) $$.dval = $1.dval - $2.dval;!
{ obj=val; }
}//end class!
20
More Yacc Declarations
Token %token BHTML BHEAD BTITLE BBODY P BR LI
names & %token EHTML EHEAD ETITLE EBODY
types %token <sval> TEXT
Type of yylval (if any)
Nonterm
%type <obj> page head title
names &
types %type <obj> words list item items
Start sym %start page
22
Calculator example On this &
next 3 slides,
From http://byaccj.sourceforge.net/ some details
may be
%{! missing or
import java.lang.Math;! wrong, but
import java.io.*;! the big
import java.util.StringTokenizer;! picture is OK
%}!
/* YACC Declarations; mainly op prec & assoc */!
%token NUM!
%left '-' '+!
%left '*' '/!
%left NEG /* negation--unary minus */!
%right '^' /* exponentiation */!
/* Grammar follows */!
%%!
... !
25
... !
/* Grammar follows */!
%%!
input: /* empty string */! input is one expression per line;
| input line! output is its value
;!
line: \n!
| exp \n { System.out.println(" + $1.dval + " "); }!
;!
exp: NUM !{ $$ = $1; } !
| exp '+' exp !{ $$ = new ParserVal($1.dval + $3.dval); }!
| exp '-' exp !{ $$ = new ParserVal($1.dval - $3.dval); }!
| exp '*' exp !{ $$ = new ParserVal($1.dval * $3.dval); }!
| exp '/' exp !{ $$ = new ParserVal($1.dval / $3.dval); }!
| '-' exp %prec NEG!{ $$ = new ParserVal(-$2.dval); }!
| exp '^' exp !{ $$=new ParserVal(Math.pow($1.dval, $3.dval));}!
| '(' exp ')' !{ $$ = $2; }!
;!
%%! Ambiguous grammar; prec/assoc decls are a (smart) hack to fix that.
...!
26
%%!
String ins;!
StringTokenizer st;!
void yyerror(String s){!
System.out.println("par:"+s);!
}!
boolean newline;! NOT using lex; barehanded
int yylex(){! lexer with same interface
String s; int tok; Double d; !
if (!st.hasMoreTokens()) !
if (!newline) { !
token code newline=true; !
via return return \n'; //As in classic YACC example !
} else return 0; !
s = st.nextToken(); ! value via yylval
try { !
d = Double.valueOf(s); /*this may fail*/ !
yylval = new ParserVal(d.doubleValue()); !
tok = NUM; } !
catch (Exception e) { ! See slide 20
tok = s.charAt(0);/*if not float, return char*/ !
}
return tok;!
}!
27
void dotest(){!
BufferedReader in = new BufferedReader(new InputStreamReader(System.in)); !
System.out.println("BYACC/J Calculator Demo");!
System.out.println("Note: Since this example uses the StringTokenizer"); !
System.out.println("for simplicity, you will need to separate the items"); !
System.out.println("with spaces, i.e.: '( 3 + 5 ) * 2'"); !
while (true) {
System.out.print("expression:"); !
try { !
ins = in.readLine(); !
} !
catch (Exception e) { } !
st = new StringTokenizer(ins); !
newline=false; !
yyparse(); !
}!
}!
public static void main(String args[]){ !
Parser par = new Parser(false); !
par.dotest();!
}!
28
Lex and Yacc
More Details
# set following 3 lines to the relevant paths on your system
JFLEX = ~ruzzo/src/jflex-1.4.3/jflex-1.4.3/bin/jflex
BYACCJ = ~ruzzo/src/byaccj/yacc.macosx
JAVAC = javac
LEXDEBUG = 0 # set to 1 for token dump Makefile:
# targets: Not required, but convenient
run: Parser.class
java Parser $(LEXDEBUG) test.ratml
Parser.class: Yylex.java Parser.java Makefile test.ratml
$(JAVAC) Parser.java
General form
Yylex.java: jratml.flex
$(JFLEX) jratml.flex A: B C!
(tab) D!
Parser.java: jratml.y
$(BYACCJ) -J jratml.y Means A depends on B &
C and is built by running D
clean:
rm -f *~ *.class *.java
31
Parser states
Not exactly elements of PDAs Q, but similar
A yacc "state" is a set of "dotted rules" rules in G with
a "dot (or _) somewhere in the right hand side.
In a state, "A _" means this rule, up to and
including is consistent with input seen so far; next
terminal in the input must derive from the left end of
some such . E.g., before reading any input, "S _ "
is consistent, for every rule S " (S = start symbol)
Yacc deduces legal shift/goto actions from terminals/
nonterminals following dot; reduce actions from rules
with dot at rightmost end. See examples below
32
state 0!
State Diagram state 2!
$acc : . S $end! $acc : S . $end!
S : . 'a' 'b' C 'd'! S
(partial) S : . 'a' 'e' F 'g'! $end accept!
'a' shift 1!
S goto 2!
0 $accept : S $end!
1 S : 'a' 'b' C 'd'! a $end
2 | 'a' 'e' F 'g'! state 1!
3 C : 'h C! S : 'a' . 'b' C 'd (1)!
4 | 'h'! S : 'a' . 'e' F 'g (2)!
5 F : 'h' F! accept!
6 | 'h'! 'b' shift 3!
'e' shift 4!
h b e
state 5! state 3! state 4!
C : 'h' . C! S : 'a' 'b' . C 'd' (1)! S : 'a' 'e' . F 'g' (2)!
C : 'h' . ! h
'h' shift 5! 'h' shift 5! 'h' shift 7!
'd' reduce 4! C goto 6! F goto 8!
C goto 9!
C C
state 9! state 6! state 10!
C : 'h' C .! S : 'a' 'b' C . 'd' (1)! d S : 'a' 'b' C 'd' . (1)!
. reduce 3! 'd' shift 10! . reduce 1!
33
Yacc Output: state 3! state 7!
S : 'a' 'b' . C 'd' (1)! F : 'h' . F (5)!
Same Example# F : 'h' . (6)!
0 $accept : S $end! 'h' shift 5!
1 S : 'a' 'b' C 'd'! . error! 'h' shift 7!
2 | 'a' 'e' F 'g'! 'g' reduce 6!
3 C : 'h' C! C goto 6!
4 | 'h'! F goto 11!
5 F : 'h' F! state 4! state 8!
6 | 'h'! S : 'a' 'e' . F 'g' (2)! S : 'a' 'e' F . 'g' (2)!
state 0! 'h' shift 7! 'g' shift 12!
$accept : . S $end (0)! . error! . error!
state 9!
'a' shift 1! F goto 8! C : 'h' C . (3)!
. error!
state 5! . reduce 3!
S goto 2!
C : 'h' . C (3)! state 10!
C : 'h' . (4)! S : 'a' 'b' C 'd' . (1)!
state 1!
S : 'a' . 'b' C 'd' (1)! 'h' shift 5! . reduce 1!
S : 'a' . 'e' F 'g' (2)! 'd' reduce 4! state 11!
F : 'h' F . (5)!
'b' shift 3! C goto 9!
'e' shift 4! . reduce 5!
. error! state 6! state 12!
state 2! S : 'a' 'b' C . 'd' (1)! S : 'a' 'e' F 'g' . (2)!
$accept : S . $end (0)!
'd' shift 10! . reduce 2!
$end accept! . error!
34
Yacc In Action PDA stack: alternates
between "states" and
initially, push state 0 symbols from (V ).
while not done {
let S be the state on top of the stack;
let i in be the next input symbol;
look at the action defined in S for i:
if "accept", halt and accept;
if "error", halt and signal a syntax error;
if "shift to state T", push i then T onto the stack;
if "reduce via rule r (A )", then:
pop exactly 2*|| symbols
(the 1st, 3rd, ... will be states, and
the 2nd, 4th, ... will be the letters of );
let T = the state now exposed on top of the stack;
T's action for A is "goto state U" for some U;
push A, then U onto the stack.
}
Implementation note: given the tables, its deterministic,
and fast -- just table lookups, push/pop. 35
expr: expr '+' term | term ;
Yacc "Parser Table" term: term '*' fact | fact ;
fact: '(' expr ')' | 'A' ;
State Dotted Rules Shift Actions Goto Actions
A + * ( ) $end expr term fact (default)
0 $accept : _expr $end 5 4 1 2 3 error
$accept : expr_$end
1 6 accept error
expr : expr_+ term
expr : term_ (2)
2 7 reduce 2
term : term_* fact
3 term : fact_ (4) reduce 4
4 fact : (_expr ) 5 4 8 2 3 error
5 fact : A_ (6) reduce 6
6 expr : expr +_term 5 4 9 3 error
7 term : term *_fact 5 4 10 error
expr : expr_+ term
8 6 11 error
fact : ( expr_)
expr : expr + term_ (1)
9 7 reduce 1
term : term_* fact
10 term : term * fact_ (3) reduce 3
11 fact : ( expr )_ (5) reduce 5
36
shift/goto # # is a state #
reduce # # is a rule #
A : _ (#) # is this rule #
Yacc Output . default action
state 0 state 1
$accept : _expr $end $accept : expr_$end
expr : expr_+ term
( shift 4
$end accept
A shift 5
+ shift 6
. error
. error
state 2
expr goto 1 expr : term_ (2)
term goto 2 term : term_* fact
fact goto 3
* shift 7
. reduce 2
... 37
Implicit Dotted Rules
state 0 $accept: _ expr $end
$accept : _expr $end expr: _ expr '+ term
expr: _ term
term: _ term '*' fact
( shift 4
term: _ fact
A shift 5
. error fact: _ '(' expr ')'
fact: _ 'A'
expr goto 1
term goto 2
fact goto 3
38
Goto & Lookahead
state 0 $accept: _ expr $end
$accept : _expr $end expr: _ expr '+ term
expr: _ term
term: _ term '*' fact
( shift 4
term: _ fact
A shift 5
. error fact: _ '(' expr ')'
fact: _ 'A'
expr goto 1
term goto 2
fact goto 3
39
using the unambiguous expr: expr '+' term | term ;
expression grammar here term: term '*' fact | fact ;
& parse table on slide 36 fact: '(' expr ')' | 'A' ;
Example: input "A + A $end"
Action: Stack: Input:
0 A + A $end
shift 5
0 A 5 + A $end
reduce fact A, go 3
state 5 says reduce rule 6 on +; state 0
(exposed on pop) says goto 3 on fact 0 fact 3 + A $end
reduce fact term, go 2
0 term 2 + A $end
reduce expr term, go 1
0 expr 1 + A $end
shift 6
40
Action: Stack: Input:
shift 6
0 expr 1 + 6 A $end
shift 5
0 expr 1 + 6 A 5 $end
reduce fact A, go 3
0 expr 1 + 6 fact 3 $end
reduce term fact, go 9
0 expr 1 + 6 term 9 $end
reduce expr expr + term, go 1
0 expr 1 $end
accept
41
An Error Case: "A ) $end":
Action: Stack: Input:
0 A ) $end
shift 5
0 A 5 ) $end
reduce fact A, go 3
0 fact 3 ) $end
reduce fact term, go 2
0 term 2 ) $end
reduce expr term, go 1
0 expr 1 ) $end
error
42
Q: Do you have any advice for
up-and-coming programmers?
A: ... One more piece of advice
take a theoretician to lunch...
From the end of a 2008 interview with
Steve Johnson, creator of yacc
http://www.techworld.com.au/article/252319/a-z_programming_languages_yacc
44