0% found this document useful (0 votes)

106 views15 pages

Lexnyacc

The document discusses the format and usage of lex and yacc input files. Lex is used to generate scanners or lexical analyzers while yacc is used to generate parsers. The document covers topics like definitions, rules, and actions for lex files as well as grammar, ambiguity, and attributes for yacc files.

Uploaded by

Samuel Giftson

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

106 views15 pages

Lexnyacc

Uploaded by

Samuel Giftson

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

Notes about lex and yacc

Pablo Nogueira Iglesias

December 26, 1999

Contents
1 Format of lex and yacc input files 1

2 LEX 2
2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.3 Operation of yylex() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.4 Avoiding the ECHO default action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.5 Inter-operation with yacc or otherwise . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.6 Variables and functions used in lex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.7 Changing the input buffer size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.8 Taking input from strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.9 Taking input from pipes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.10 Error control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3 YACC 6
3.1 Grammar ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 What yacc cannot parse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Left recursion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.4 Value stack type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.5 Computing attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.6 Variables and functions used in yacc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.7 Invocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.8 Error control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.9 Placement of error tokens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Writing compilers with lex and yacc 13

Bibliographic references 14

1 Format of lex and yacc input files

Both lex and yacc input files must have the following structure:
definitions
%%
rules
%%
user supplied routines

1
2 LEX
In sections definitions and rules, lines that start with whitespace are copied verbatim to the generated
lex.yy.c filethis might change from one particular implementation to another. Useful for placing
comments in both sections.
%{ %} are allowed in both definitions and rules. If placed in definitions, lex puts the code enclosed
between %{ %} at the very beginning of the lex.yy.c file, before the definition of yylex(). Useful to
declare prototypes of functions used in the user routines section, for header file inclusion, for global
variable declarations, etc.

2.1 Definitions

In this section we can give names to parts of regular expressions that will be used in the rules section for
the purpose of reusability and readability. These definitions are literally substituted in the rules section.

2.2 Rules

Lines at the beginning of this section are placed at the beginning of yylex(). They must contain local
variable declarations or initialisation codebut better use definitions for that purpose.
Rules follow the syntax: regexp whitespace { statements }. Braces are optional, they are required
when the statements take more than one line.

Context: / is the trailing context. For example, the rule abc/de means abc only when followed by de;
the latter is not included as part of yytext. It is equivalent to abcde yyless(3);. If input is abdeabcde
at the point in which abc is recognised, the next character in the input to be recognised will be d. Also,
and $ are used for context, they are not part of part of yytext. $ is equivalent to /\n.

2.3 Operation of yylex()

When lex compiles the input specification, it generates the C file lex.yy.c that contains the routine
int yylex(void). This routine reads the input trying to match it with any of the token patterns spec-
ified in the rules section. On a match, the associated action is executed. If there is more than one match,
the action associated with the pattern that matches more text (included context) is executed. If still there are
two or more patterns that match the same amount of text, the action associated with the pattern listed first in
the specification file is executed. If no match is found, the default action ECHO is executed (see Section 2.4).
The input text (lexeme) associated with the recognised token is placed in the global variable yytext.
Example: given these rules
";" /* ignore separator */
int ECHO;
[a-z]+ printf("ID");
if the input is int;. . . then int is echoed, but if the input is integer;. . . then ID is printed.
When specifying a lexer, take this mechanism into account. Place patterns from the more restrictive to the
more general. In this example
{ID} return ID_TOKEN;
"IF" return IF_KEYWORD;
the lexer will always return ID TOKEN for any IF string encountered. The rules should be placed form
specific to general:

2
"IF" return IF_KEYWORD;
{ID} return ID_TOKEN;
Another issue to have in mind when designing a lexer with lex is that the speed of a lex scanner is inde-
pendent of the number and complexity of the patterns specifiedsee Levine et al. (1995), p. 94. Take for
example a case-insensitive programming language with reserved keywords. We may declare a token for
each keyword and another for identifiers in the following fashion
%%
"IF" return IF;
"if" return IF;
"THEN" return THEN;
"then" return THEN;
"ELSE" return ELSE;
"else" return ELSE;
...

{ID} { yylval.strval = strdup(yytext); return NAME; }

%%
This solution will work slightly faster than a solution based on just one pattern for all names and a keyword
symbol table lookup to determine the kind of token, due to the extra lookup processing.

2.4 Avoiding the ECHO default action

When yylex() cannot find a pattern that matches the input, it just writes yytext to yyout. Such default
action is called ECHO and is equivalent to fprintf(yyout,"%s", yytext);.
In many cases this can be undesirable, so it might be convenient to write a very last rule to do some action
with all the unexpected input. A simple example can be
. return yytext[0];
Now the parser program will have to take care of lexical errors concerning unexpected input.

2.5 Inter-operation with yacc or otherwise

When the lexer analyses the input, for each recognised token it carries out its associated action. Such action
can be any statement, so lex can be used for many purposes, not only lexical analysers.
The only way to return from yylex() is to place a return statement in an action. The value returned must
be an integer (either a literal or a symbolic constant), otherwise a type error or a type coercion will occur.
If no return statement is written, execution continues inside yylex(), recognising input and carrying out
the actions until the end of input is reached (this is useful for example to skip comments, etc.) Every time
yylex() is called, it picks up where it left off.
The integer returned by yylex() must be a token code-number. When interoperating with yacc, such
codes must be known to both yylex() and yyparse(). yacc can create the y.tab.h file (see Section 3.7)
containing one #define for each symbol specified as %token in the yacc file. Such symbols cannot be C
reserved keywords.
This is a typical (but system-dependent) invocation:
$ yacc -d file.y
$ lex file.c
$ gcc -o parser y.tab.c lex.yy.c -ly -ll
Note that yacc is invoked first.

3
2.6 Variables and functions used in lex

The routine yylex() provides information by way of global variables and the value it returns, but also by
way of macros, functions, and options. Figure 1 shows the basic variables and functions.

yyin, yyout These are FILE*s to the input and output file respectively. Both are accessible and
can be assigned to a value other than stdin and stdout (see Section 2.9).
input() Read next character from yyin. This is the function invoked by yylex() to read
the input.
output() Write yytext to yyout. This is the function invoked by yylex() to do an ECHO.
This function is not supported when generating C++ scanners (see man pages).
yyless(int n) Keeps the nth first characters of yytext and puts the rest back to yyin. For exam-
ple: abc/de is equivalent to abcde yyless(3);.
yymore() Keep this tokens lexeme in yytext when another pattern is matched.
yywrap() Used for multiple input files. Specifies what to do when the EOF token is recog-
nised. flexs option %option noyywrap instructs the lexer to scan only one file
(generates default yywrap()), otherwise the funcion must be specified by user.
See also yyrestart() in documentation.
yytext Array of or pointer to (see man) char where lex places the current tokens lexeme.
The string is automatically null-terminated. It can be modified but not lengthened.
yyleng Integer that holds strlen(yytext).
yylineno Integer that keeps count of lines read. It is maintained automatically by yylex()
for each \n encountered. In flex, it is an optional feature that must be specified
using %option yylineno, for there is a loss of performance.

Figure 1: lexs global variables and functions

2.7 Changing the input buffer size

lex: #undef YYLMAX
#define YYLMAX (size)

flex: setbuf(int size) {

yy_current_buffer = yy_create_buffer(yyin,size);
}

2.8 Taking input from strings

Normally lex reads from FILE* yyin, but sometimes we might want to read input from a string or array of
char in memory. All versions of lex allow this. Here we discuss flexs: redefine the macro YY INPUT used
to read blocks of data. Its syntax is

YY INPUT(buffer, result, maxsize)

where buffer is the character array, result is a variable in which to store the number of characters
actually read (e.g. yytext), and maxsize is the size of the buffer. To read from a string, have your version
of YY INPUT copy data from your string buffer. Here is an example:
%{
#undef YYINPUT
#define YYINPUT (b,r,ms) (r = my_yyinput(b,ms))

4
%}
...
extern char myinput[]; /* char array buffer */
extern char *myinputptr; /* current position in myinput */
extern int myinputlim; /* end of data */

int my_yyintput(char *buf, int maxsize) {

int n = min(maxsize, myinputlim - myinputptr);

if (n>0) {
memcpy(buf, myinputptr, n);
myinputptr += n;
}
return n;
}

2.9 Taking input from pipes

Imagine we want a scanner for C source files but we want to make use of the C preprocessor, which
can remove comments and perform file inclusion, generating line number stamps of the form # lineno
filename. To avoid renaming of files by using temporaries, an elegant solution that makes use of pipes is
presented in Schreiner et al (1995) and summarised here.
#undef CPP
#define CPP "/lib/cpp"

sprintf(buf, "%s %s", CPP, args);

if (yyin = popen(buf, "r"))
/* Able to open pipe */
else ...
Since line numbering is altered by preprocessing, a way to correctly update yylineno is to include a pattern
in the scanner file that updates yylineno according to line stamps, such as
^"#"[ \t]*[0-9]+([ \t]+.*)?\n sscanf(yytext, "# %d", &yylineno);

2.10 Error control

In general, error control consists of three related activities: detection, reporting, and recoverysee p. 165-
169 of Aho et al. (1990).

1. Detection. In lex, there is no implicit error detection. The default action is ECHO. Error detenc-
tion must be done explicitly, writing patterns for illegal input or as mentioned in Section 2.4. The
following is an example of accepting illegal input to detect errors:

\"[^\"\n]*\" { yylval.string = strdup(yytext); return DQS; }

\"[^\"\n]*$ { warning("unterminated string"); }

2. Reporting. The detected errors must be reported in a precise and detailed fashion. Precise means
that the report should include a reference to the exact location where the error occurred. Detailed
means that the report message gives enough information about the type of the error.
In lex, reporting can be done in the actions associated to the patterns for illegal input. The following
is a code that may be added to any lexer. Its purpose is to report informative error messages (Taken
from Levine et al. (1995) p. 246).

5
%{
#include <string.h>
char linebuf[...]; /* Put a size of line */
int yylineno=0, tokenpos;
%}
%%

\n.* { strcpy(linebuf, yytext-1); /* Save next line */

yylineno++;
tokenpos = 0;
yyless(1); /* Flush back all but \n to input */
}
%%

void yyerror(char *s)

{
printf("%d: %s:\n%s\n", yylineno, s, linebuf);
printf("%*s\n", 1+tokenpos, "^"); /* * pads w/ whitespace */
}

The pattern \n.* will take precedence over all others, for it is listed first and it will match the largest
amount of input text possible. Every line will be scanned at least twice: one to save the entire line
into the variable linebuf and again to scan each token in that line.
Variable tokenpos is used to keep the position in the line of the mismatched token. Its value must be
updated properly and reset to 0 only in the action for \n.*. A possible wat to update its value could
be to place the statement tokenpos += yyleng; as the last statement of each patterns action.
Here is a simpler definition of yyerror() without tokenpos is

printf("%d: %s %s in %s\n", yylineno, s, yytext, linebuf);

in which the offending token text is printed along with the message and the line number.
3. Recovery. It is undesirable to stop processing the input when an error is detected. It should be
possible to do as much processing as possible to shorten the edit-compile-test cycle. Error recovery
is enhanced with language design (e.g. the use of terminators such as ; that serve as synchronisation
points).
Since there is no error detection in lex, it does not have an error recovery mechanism either. Re-
covery must be done explicitly in the actions associated to illegal input patterns. A typical recovery
mechanism would be to discard yytext and to wait for the next matched pattern.

3 YACC
yacc is a parser generator that takes an input file with an attribute-enriched BNF grammar specification,
and generates the output C file y.tab.c containing the function int yyparse(void) that implements its
table-driven LALR(1) parser. This function automatically invokes yylex() every time it needs a token to
continue parsing. Such function can be provided by the user or by a lex-generated scanner.
Section definitions contains token declarations, the value-stack type, the type declaration of attributes
associated to non-terminals and tokens, and the precedences and associativity of tokens.
%{ %} are allowed only in definitions. Useful to declare prototypes of routines used in the user
routines section, for header file inclusion, for declaring of global variables, etc.

6
The rules section contains the BNF grammar specification together with the semantic actions for each
production rule. It is actually a syntax-directed translation scheme. It is customary to write tokens in
uppercase and non-terminals in lower case.
The grammar is automatically checked by yacc to assure that:

1. There are rules (productions) for all non-terminals.

2. All non-terminals must be reachable from the axiom.

3. The grammar is not ambiguous
4. The grammar is LALR(1) parsable.

The yacc program alerts us if any of these conditions do not hold. The first two are trivial to arrange.

3.1 Grammar ambiguity

If the grammar is ambiguous, it can be disambiguated using disambiguating rules. yacc uses two disam-
biguating rules by defaulthence, there are ambiguous grammars for which yacc is able to generate a
parser, and also allows the user to disambiguate the grammar by declaring the precedence and associa-
tivity of tokens. Production rules take the precedence of the rightmost token on their right hand side. If
a production has no tokens then it has no precedence of its own. If the rightmost token does not provide
the right precedence, it can be overriden with %prec T where T is the token whose precedence we want
assigned to the rule (Aho et al. (1990) p . 271).
Example: dangling ELSE
stmts stmt
| stmts stmt
stmt IF expr THEN stmts (1)
| IF expr THEN stmts ELSE stmts (2)
| other.stmt
There is a shift/reduce conflict in the grammar: when the stack contains the parsing state [IF expr THEN
stmts] and the lookahead token is ELSE, yacc can either reduce by rule (1) or shift the ELSE and attempt
to reduce later by rule (2). Notice that ELSE FOLLOW(stmts). If instead than using the default disam-
biguating rule we want to disambiguate the grammar explicitly, we must declare the following precedences:
%nonassoc THEN
%nonassoc ELSE
which gives higher precedence to the token ELSE. Thus, rule (1) has lower precedence than rule (2). The
precedence of the token to shift must be higher than the precedence of the rule to reduce. (See Aho et al.
(1990) p. 257-259.)
Precedence and associativity allows the user to keep a readable grammar and indeed helps yacc to build
fast parsers that perform lesser reductions. See Aho et al. (1990) p. 254-256 for a typical example. On the
other hand Levine et al. (1995) recomends the use of precedence and associativity only in expressions. It
is usually convenient to rewrite the grammar. If yacc cannot understand the grammar, usually neither can
people.
By default, a shift/reduce conflict is resolved in favour of shift (try to match as much input as possible),
and a reduce/reduce conflict is resolved in favour of the rule listed first in the yacc specification. This may
cause troubles if the specification text is rearranged and rules are moved!
Aho et al. (1990) p. 270, Johnson (1978) Sections 5 and 6 (especially p. 16), and Levine et al. (1995) Chp.
7, provide a detailed discussion.

7
3.2 What yacc cannot parse

There are unambiguous grammars that yacc cannot parse. This is due to the limitations of the LALR(1)
technique. For example, the following unambiguous grammar is not LALR(1)-parsable:
S Aab
| Bac
A d|e
B d| f
A LALR(1) parser fails to parse the sentence dab, for two lookahead symbols are needed to select the right
production for reduction. Indeed, after shifting d, the lookahead token is a. The parser cannot tell whether
to reduce d by A d or by B d since a belongs to both FOLLOW(A) and FOLLOW(B), which amounts
to a reduce/reduce conflict.

3.3 Left recursion

The grammars productions must be left recursive instead than right recursive. The reasons are explained
in Johnson (1978) p. 19. Intuitively, we can see that right-recursive lists make their two productions appear
on the same item:
1: list TOKEN
2: | TOKEN list
Suppose that list is the axiom; the grammar is enlarged with production 0:
0: list 0 list
The initial state is the closure of {[0, 0]}, that is, {[0, 0][1, 0][2, 0]}. Computing the list-successor and the
TOKEN-successor of this state yields two new states:
goto( {[0, 0][1, 0][2, 0]}, list) = closure({[0, 1]}) = {[0, 1]}
goto( {[0, 0][1, 0][2, 0]}, TOKEN) = closure({[1, 1][2, 1]}) = {[1, 1][2, 1][1, 0][2, 0]}
When the table is built, Action[ {[1, 1][2, 1][1, 0][2, 0]}, TOKEN] will be a shift to the same state
goto( {[1, 1][2, 1][1, 0][2, 0]}, TOKEN) = closure({[1, 1][2, 1]}) = {[1, 1][2, 1][1, 0][2, 0]}
pushing TOKEN into the stack as well. There exists the danger of the parsers stack being filled with TOKENs
if the list is lengthy.
With left-recursive lists, this danger no longer exists, for a reduction is always performed when the first
TOKEN of the list is encountered:
0: list 0 list
1: list TOKEN
2: | list TOKEN

goto( {[0, 0][1, 0][2, 0]}, list) = closure({[0, 1][2, 1]}) = {[0, 1][2, 1]}
goto( {[0, 0][1, 0][2, 0]}, TOKEN) = closure({[1, 1]}) = {[1, 1]}
When the table is built, Action[ {[1, 1]}, TOKEN] will be a reductionthe state contains a completed item,
which will pop the TOKEN from the stack and push the list non-terminal and the new state.

3.4 Value stack type

Along with the parsing state stack, yacc maintains a parallel stack of attribute values associated with each
token and non-terminal in the parsing stack. The type of an attribute is defined as YYSTYPE, which is int
by default, but can be redefined within the specification in two ways:

8
1. Redefining YYSTYPE to a single type name as in: #define YYSTYPE double. Being a preprocessor
definition, for composite type names we must use typedef:

typedef struct att_type {

char *sval;
int ival;
};
#define YYSTYPE att_type

2. Declaring the value stack type with a %union declaration within the yacc specification file:

%union {
char *sval;
int ival;
}

letting yacc create the appropriate #defines in the file y.tab.h (see Section 3.7). It will also create
the global variables yylval and yyval of type YYSTYPE (see Figure 2).

3.5 Computing attributes

In semantic actions, the values of attributes associated to the non-terminal on the left hand side of a produc-
tion can be computed in terms of the values of attributes associated with the grammatical symbols on the
right hand side. The symbol $$ denotes the former and $i denotes the attribute value of the ith grammatical
symbol on the right hand side. The default action is {$$=$1;}. Semantic actions are usually placed at
the end of a production, for yacc generates an ascent or bottom-up parser, and semantic actions should im-
plement a syntax directed definition with synthesised attributes only. The actions are thus executed before
reduction, and the attribute values of the grammatical symbols on the right hand side will be popped from
the value stack and replaced with the attribute value of the left hand side non-terminal.
Nevertheless, evaluation of inherited attributes is possible as long as the syntax directed definition is an
l-attributed definition, by means of accessing atribute values deep in the stack, as described in Aho et al.
(1990), Section 5.6 pp. 318325. The same technique allows the use of interleaved actions in productions,
as in A a {. . .} B {. . .} C. In fact, yacc creates anonymous markers replacing the former production with
A a M1 B M2 C
M1 {. . .}
M2 {. . .}
Interleaved actions can have attributes associated to them (associated to the non-terminal marker). Both
for these and for inherited attributes deep in the stack, yacc has no easy way of knowing their type, so it
must be specified explicitly. The notation $<type>$ explicitly declares the type of the attribute value of the
interleaved action,1 and $<type>n declares the type of the attribute value located on tos n (n : 0, 1, . . .),
where tos is the top of the value stack. The following example illustrates the use of inherited attributes:
C-like declarations have the form
decl type decl list ;
dec list ID
| decl list , ID
type INT | VOID | . . .
The type of each identifier on the list is specified first, so decl list should have an associated inherited
attribute t (for type), whose value is taken from the value of type and that is available for each identifier
1 Do not confuse $<..>$ with $$. The type of the left hand side non-terminal attribute is knownit has been declared with a

%type, so only $$ stands for that attribute.

9
on the list when creating its symbol table entry:
decl type
{dec list.t := type.t} /* Inherited attr. */
decl list ;
decl list ID
{ st insert(ID.lexeme, decl list.t) }
decl list decl list , ID
{ st insert(ID.lexeme, decl list.t) }
Function st insert(id,t) creates a symbol table entry for token ID with lexeme id and type t. To
implement this l-attributed definition using yacc we need only to see that the attribute value of type will be
on top of the value stack before the first identifier on the list is read, so that after shifting it, the attribute of
type will be available as an offset from the top of the value stack:
%token ID
%token INT
%token VOID
%union {
enum Types tval;
char *sval;
}
%type <tval> type decl_list
%type <sval> ID
%%
type : INT { $$ = INT_TYPE;}
| VOID { $$ = VOID_TYPE;}
;
decl : type decl_list ;
;
decl_list : ID
{ st_insert($1, $<tval>0); $$ = $0; }
| decl_list , ID
{ st_insert($3, $1); }
Note how the type tval has been explicitly specified for the inherited attribute in the action.
Implementing inherited attributes this way is more elegant and more readable than using global variables
to pass attribute information. Nothing prevents us from having a global variable t that holds the type of the
current declaration and whose value is set in the actions for type and read in the actions for decl list, but
appart from being an ad-hack solution, parse errors and error recovery can create havoc when discarding
input and partially parsed phrases.
Interleaved actions are somewhat similar, they are usually used for housekeeping before parsing certain
phrases or for creating new attribute values depending on the partially parsed phrase. For example, logical
operators in C must be lazily evaluated (in short-circuit), so that code generation to assembly shoud create
appropriate labels and produce branching instructionsthe following example assumes a stack discipline
of expression evaluation:
%type <tval> expr
expr : expr AND
{ char *l = new_label();
emit("pop");
emit("jump z"); emit(l);
$<sval>$ = l;
}

10
expr
{
emit("pop");
emit($<sval>3); emit(":");
emit("push");
$$ = check_type($1, $4);
}
In the second action, $<sval>3 refers to the attribute value of the first action (or its marker), which is the
3rd element in the right hand side of the production. Note that its type has been explicitly specified.
Interleaved actions require special care: one should give value to $$ only in a rightmost action. Recall
that $$ is pushed on the value stack after reduction. If placed in an interleaved action, attribute values of
elements to the right of such action will be pushed on top of it, reduction will have trouble when popping
from the value stack as many attributes as right hand side elements, and it is undefined what value will be
pushed for the non-terminal.

3.6 Variables and functions used in yacc

Figure 2 shows the basic functions and variables.

3.7 Invocation

The -d option generates the file y.tab.h that contains for each token constant TOKENi a line alike
the following
#define TOKEN1 447
This way, the token codes returned by lex are those expected by yacc. The header file imust be
included in the lex specification at the %{%}s.

For execution of the parser in tracing mode, YYDEBUG must be defined at compile time. A possible
solution is
gcc -DYYDEBUG y.tab.c lex.yy.c main.c error.c -ll
If we have a .y file that contains its own main(), we can call yacc with the -t (trace) flag which
automatically defines YYDEBUG in the output C file. Defining YYDEBUG is not enough; for yacc to
print trace information (states pushed, rules used in reductions,. . . ) variable yydebug must be set to
a non-zero value before calling yyparse().

3.8 Error control

As mentioned in Section 2.10, error control consists of three related activities:

1. Detection. In yacc, detection is accomplished when given the current parsing state in top of the stack
and the next input token, no shift or reduce action is possible. Empty entries are taken as error entries.
2. Reporting. In yacc, reporting is done via calls to yyerror(). It will print the message parse error
unless the constant YYERROR VERBOSE is defined in the specification file. In that case it will print the
next expected terminal or non-terminal for the current state.

3. Recovery. A few error recovery techniques for parsers are discussed in Aho et al. (1990) p. 168-169.
Here we mention two that are related to yaccs error recovery mechanism.

11
(a) Error productions: just like we did in the lex specification, we can accept illegal input with
yacc by writing productions for erroneous syntactic constructions. This way detection, reporting
and recovery is straightforward. For a given legal production
A : {. . .}
we might add more rules that will detect illegal constructions, such as
A : {. . .}
| {recover(...);}
This technique, however, is not advisable for several reasons: (1) it is nearly impossible to
anticipate all incorrect input constructions, (2) it might be difficult to write productions for
them in a natural way, and (3) it might impair the readability of the grammar and, worst, it
might render the grammar ambiguous with many s/r or r/r conflicts.
(b) Panic mode synchronisation: consists of discarding input tokens until the parser finds one or
more from which parsing can continue in an error-free state. It is the simplest method but has
the disadvantage of generating spurious and cascading errors.2 The discarded tokens should be
as few as possible. To avoid the cascading of errors, a good technique is to keep a count of the
number of errors and stop parsing when more than 20, say, have been encountered.
In yacc, these two techniques are combined in a new elegant way. Recovery is performed by means
of the reserved token error, that might be added to any production in the grammar. yacc will treat
productions with error as any other production when generating the parsing table.

. . . [it] is used to find a synchronisation point in the grammar from which it is likely that
processing can continue. Note that we said likely. Sometimes our attempts at recovery
will not remove enough of the erroneous state to continue, and the error messages will
cascade. Either the parser will reach a point from which processig can continue or the
entire parser will abort.
After reporting a [parser error], a yacc parser discards any partially parsed rules [if any]
until it finds one in which it can shift an error token. It then reads and discards input tokens
until it finds one which can follow the error token in the grammar. . . Normally, the parser
refrains from generating any more error messages until it has successfully shifted three
tokens without an intervening syntax error, at which point decides it has resynchronized
and returns to its normal state.3

More precisely, given the current state on top of the stack and the next lookahead token, if there is no related
shift or reduce action, yyparse() invokes yyerror() to issue an error message and increments the global
variable yynerrs. It then inserts the error token before the one that caused the error as if error where the
next input token. If the state on top of stack contains a shift action for error then the action is carried out,
but if the associated action is error, then yacc pops its stack looking for a state on top of stack that contains a
shift for error (note that a state contains a shift action for error when it contains the item A . error).
If the stack is emptied then yyparse() returns with a value of 1.
If error is shifted, recovery takes place by way of discarding input. By default, input is discarded until a
token is found that may follow error in the grammara token in FOLLOW(A). Then, the semantic action
associated to the production is executed and reduction takes place replacing error for A in the stack. To
avoid cascading, no more error messages will be issued until three tokens are successfully shifted after this
reduction. If succesfully recovered, it will continue parsing to the end of input and return with a value of 0.
We know that an error occurred because yynerrs> 0.
Both discarding and point of recovery can be configured:
2 For example in C, if an error occurs inside a declaration and input is skipped to the semicolon that ends the declaration, all

identifiers from the point of error will be ignored and an undeclared id (semantic) error will be issued for each occurrence in the
program.
3 Levine et al. (1995) p. 248-249.

12
Discarding. When the item that produced the shift action for error is A . error , there are
three possible operations:
1. If = , after shifting error the semantic action is executed and reduction takes place. Input
is discarded until a token in FOLLOW(A) is found. This is the default operation.
2. If consists of a sequence of tokens (synchronisation terminals), then input is discarded until a
string that matches the sequence is found. Then, the semantic action is executed and . error
is reduced for A.
3. If is a sentential form, then yyparse() looks for a substring that can be reduced to . Then,
the semantic action is executed and . error is reduced for A.
Error messages. By default, no more error messages will be issued after the point of error until three
tokens are shifted. This operation can be overriden by inserting yyerrok; in semantic actions, which
instructs yyparse() to continue in normal mode as if resynchronised.

3.9 Placement of error tokens

Good recommendations for placement of error tokens are given in Chapter 4 of Schreiner et al (1995).
The following excerpt is taken from Levine et al. (1995) p. 251-252..

You want to be as sure as possible that resynchronization will succed, so you want error tokens
in the highest level rules in the grammar, maybe even the start rule, so there will always be a
rule to which the parser can recover. On the other hand, you want to discard as little input as
possible before recovering, so you want the error tokens in the lowest level rules to minimize
the number of partially matched rules the parser has to discard. . .

We must be cautious in the use of error and yyerrok;. There exists, for example, the possibility of infinite
loops. A production like
A error { ... yyerrok; ... } (1)
will make yyparse() fall into an infinite loop. The reason is that on an error, error is taken as the next
input token, inserting it before the lookahead that caused error, which remains as unread input. After error
is shifted, yyerrok; instructs yyparse() to continue as if recovered and then reduction by production (1)
takes place. No input has been discarded and the erroneous token is the next lookahead token again!
The macro yyclearin; will clear the lookahead buffer which contains the erroneous token. It can be
useful when input is discarded in semantic actions instead than leaving it to the automatic operation of
yyparse(). In the following example, recover() calls yylex() to skip input and the parser expects
tokens in FIRST().
A error {yyclearin; recover(); yyerrok; }
{...}
Semantic actions in error productions usually do some housekeeping like updating symbol table structures
accordingly with the discarded input, reclaim syntax tree storage, etc.

4 Writing compilers with lex and yacc

(1) Write scanner for lex and test it. (2) Write parser for yacc with just the grammar. Check conflicts and
do some trials. (3) Add error productions. This may introduce conflics and force some rewriting. (4)
Join together scanner and parser and do some trace trials. Cannot declare type of value stack with %union,
assume default int and make scanner return just token codes, otherwise default action $$=$1 will force
type declaration of some non-terminals and type conflicts may arise. (5) Declare attributes and their types.
(6) Add semantic actions: it is a syntax directed translation where the grammar serves as a control structure

13
guiding the implementation process. (6.1) Add semantic actions regarding names and constants which are
controled by the use of a symbol table. Using the grammar as a guide, check for each ocurrence of a name
or a constant, decide what needs to be checked and call the appropriate routines. (6.2) Add semantic actions
for type checking declaring type attribute for appropriate nom-terminals. (6.2) Add semantic actions for
calculating offsets of variablesmemory allocation. (6.3) Add semantic actions for code generation.

References
A. Aho, R. Sethi, and J. Ullman (1990). Compiladores: Principios, Tecnicas y Herramientas. Addison-
Wesley.
Stephen C. Johnson (July 31, 1978). YACC: Yet Another Compiler-Compiler. Bell Laboratories. Murray
Hill, New Jersey 07974.
J. Levine, T. Mason and D. Brown (1995). Lex & Yacc. OReilly and Associates.

Axel T. Schreiner and H. George Friedman, Jr (1995). Introduction to Compiler Construction with UNIX.
Prentice-Hall.

14
yyerror() This function must be supplied by the user. It is called from yyparse() on a parse
error. Used for error reporting.
yyparse() The parser function. It returns an integer value: zero if there is success or non-zero
if unable to parse the token sentence provided by yylex().
yylval Global variable of type YYSTYPE that contains the value pushed on a shifti.e., a
token attribute value, so it must be set by yylex(), usually to a copy of yytext
(yytext is modified each time a token pattern matches the input).
yyval Global variable of type YYSTYPE that contains the value pushed on a gotoi.e.,
the value $$.
yynerrs Global int variable that holds the number of errors found. It is automatically
incremented by yyparse() on each call to yyerror().
yyerrok; It is an action that instructs the parser to continue in normal mode, as if recovered
from an error.
YYRECOVERING() If the user calls its own error-reporting routine whose name is other than
yyerror(), such routine should use the yacc macro YYRECOVERING() to test if
the parser is trying to resynchronise, in which case no error messages should be
printed:
warning(char *msg)
{
if (YYRECOVERING()) return;
...
/* print error */
}

yydebug Global int variable that should be set to 1 if YYDEBUG has been defined and we
want to output trace information (see Section 3.7). It should be declared as extern
in main() and set to the proper value there.
main()
{
#ifdef YYDEBUG
extern int yydebug;
#endif
...
yydebug = 1;
...
}

Figure 2: yaccs variables and functions

Notes About Lex and Yacc: Pablo Nogueira Iglesias December 26, 1999
No ratings yet
Notes About Lex and Yacc: Pablo Nogueira Iglesias December 26, 1999
15 pages
Lex and Yacc: A Brisk Tutorial
No ratings yet
Lex and Yacc: A Brisk Tutorial
25 pages
Lex & Yacc Guide for Developers
No ratings yet
Lex & Yacc Guide for Developers
17 pages
Lex-Yacc For Exam
100% (1)
Lex-Yacc For Exam
17 pages
System Programming (BTHU-301A) : Bachelor of Technology
No ratings yet
System Programming (BTHU-301A) : Bachelor of Technology
22 pages
Yaac and Lex
No ratings yet
Yaac and Lex
13 pages
SS Lab Manual
No ratings yet
SS Lab Manual
38 pages
LEX and YACC
No ratings yet
LEX and YACC
31 pages
Lex Yacc
No ratings yet
Lex Yacc
22 pages
Yacc Guide for Unix Programmers
No ratings yet
Yacc Guide for Unix Programmers
15 pages
Lex Yaac
No ratings yet
Lex Yaac
24 pages
CD File
No ratings yet
CD File
7 pages
SS & OS Final Lab Manual
No ratings yet
SS & OS Final Lab Manual
46 pages
SS Manual GEC 18CSL66
No ratings yet
SS Manual GEC 18CSL66
49 pages
Lex and Yacc
No ratings yet
Lex and Yacc
8 pages
Lex & Yacc Tutorial for CSE Students
No ratings yet
Lex & Yacc Tutorial for CSE Students
31 pages
SS Lab Manual
No ratings yet
SS Lab Manual
66 pages
Lex and Yacc Tool
No ratings yet
Lex and Yacc Tool
39 pages
LEX and YACC
No ratings yet
LEX and YACC
3 pages
Overview of Yacc Program
No ratings yet
Overview of Yacc Program
14 pages
LexYacc Final
No ratings yet
LexYacc Final
44 pages
Lex & Yacc Tutorial for Programmers
No ratings yet
Lex & Yacc Tutorial for Programmers
38 pages
Language Processing: Introduction To Compiler Construction: Andy D. Pimentel Computer Systems Architecture Group
No ratings yet
Language Processing: Introduction To Compiler Construction: Andy D. Pimentel Computer Systems Architecture Group
91 pages
More On LEX Programming
50% (2)
More On LEX Programming
42 pages
Lex and YACC primer/HOWTO
No ratings yet
Lex and YACC primer/HOWTO
20 pages
Lex Tool and Program Examples
No ratings yet
Lex Tool and Program Examples
39 pages
Compiler File
No ratings yet
Compiler File
47 pages
A4 68 Prac2
No ratings yet
A4 68 Prac2
13 pages
Lex & Yacc
No ratings yet
Lex & Yacc
46 pages
Lab Session
No ratings yet
Lab Session
27 pages
1 Introduction To LEX: Input - File.l
No ratings yet
1 Introduction To LEX: Input - File.l
19 pages
CompilerDesignLabManual PDF
No ratings yet
CompilerDesignLabManual PDF
11 pages
Lex Material 1
No ratings yet
Lex Material 1
37 pages
093 CD
No ratings yet
093 CD
10 pages
A3 47 Practical2
No ratings yet
A3 47 Practical2
14 pages
Compiler Design Practical List
No ratings yet
Compiler Design Practical List
5 pages
Lab Programs
No ratings yet
Lab Programs
18 pages
Compiler Tools: Lex, YACC, and PLY
No ratings yet
Compiler Tools: Lex, YACC, and PLY
6 pages
Lex Tool
No ratings yet
Lex Tool
7 pages
Lex & Yacc for Compiler Design
No ratings yet
Lex & Yacc for Compiler Design
4 pages
Lex
No ratings yet
Lex
41 pages
Compiler Design File Part 1
No ratings yet
Compiler Design File Part 1
9 pages
Lex & Yacc for Compiler Developers
No ratings yet
Lex & Yacc for Compiler Developers
42 pages
Lex Yacc
No ratings yet
Lex Yacc
9 pages
CD Practical Manual 21-22
No ratings yet
CD Practical Manual 21-22
45 pages
Lex and Yacc Roll No 23
No ratings yet
Lex and Yacc Roll No 23
7 pages
Yacc
No ratings yet
Yacc
12 pages
SPCC Exp7
No ratings yet
SPCC Exp7
8 pages
Compiler Construction
No ratings yet
Compiler Construction
7 pages
Compiler Design Lecture Notes
No ratings yet
Compiler Design Lecture Notes
98 pages
PCD Notes - Unit - 1
No ratings yet
PCD Notes - Unit - 1
15 pages
Compiler Construction Course Guide
No ratings yet
Compiler Construction Course Guide
3 pages
Compiler Design Lab Manual
91% (22)
Compiler Design Lab Manual
29 pages
Lex and Yacc Calculator Example
No ratings yet
Lex and Yacc Calculator Example
13 pages
Module-Iv (9 Lectures) Intermediate Code Generation
No ratings yet
Module-Iv (9 Lectures) Intermediate Code Generation
12 pages
Course Title: Compiler Design Course Code: CS 451 Topic: Introduction Branch: CSE Semester: VII TH
No ratings yet
Course Title: Compiler Design Course Code: CS 451 Topic: Introduction Branch: CSE Semester: VII TH
23 pages
RBT Levels Marks L1 10marks: Answer Any Three Full Questions
No ratings yet
RBT Levels Marks L1 10marks: Answer Any Three Full Questions
1 page
Compiler Design Question Bank-UNIT 1
No ratings yet
Compiler Design Question Bank-UNIT 1
12 pages
Compiler Basics for Developers
No ratings yet
Compiler Basics for Developers
5 pages
Compiler Design Questions Fixed
No ratings yet
Compiler Design Questions Fixed
23 pages
CD - Cat-2-Qp - Key
No ratings yet
CD - Cat-2-Qp - Key
6 pages
Compiler Design Course Guide
No ratings yet
Compiler Design Course Guide
183 pages
Compiler HandMade Notes
No ratings yet
Compiler HandMade Notes
7 pages
Compiler Design Theory 1
No ratings yet
Compiler Design Theory 1
23 pages
Lex & Yacc Programming Guide
No ratings yet
Lex & Yacc Programming Guide
5 pages
Compiler Construction: PDF Generated At: Fri, 04 May 2012 09:19:18 UTC
100% (1)
Compiler Construction: PDF Generated At: Fri, 04 May 2012 09:19:18 UTC
543 pages
Compiler Design Lab r23
No ratings yet
Compiler Design Lab r23
2 pages
Implementation of Predictive Parser.: CD Lab Programs
No ratings yet
Implementation of Predictive Parser.: CD Lab Programs
15 pages
Designing A Compiler - CodeProject
No ratings yet
Designing A Compiler - CodeProject
11 pages
MID 1 FLCD (Part 1)
No ratings yet
MID 1 FLCD (Part 1)
6 pages
BKS UNIT I Lecture 2 Structure of Compiler
No ratings yet
BKS UNIT I Lecture 2 Structure of Compiler
37 pages
Writing A C Compiler Early Access Nora Sandler Download
100% (3)
Writing A C Compiler Early Access Nora Sandler Download
47 pages
Compiler Design Quantum PDF
100% (1)
Compiler Design Quantum PDF
211 pages
Role of Parser
No ratings yet
Role of Parser
10 pages
Compiler Design Viva Prep
100% (1)
Compiler Design Viva Prep
22 pages
Compiler Set2
No ratings yet
Compiler Set2
2 pages
Lexical Analysis for CS Students
No ratings yet
Lexical Analysis for CS Students
27 pages
Compiler Design Exam Paper
No ratings yet
Compiler Design Exam Paper
2 pages
Compiler For Flat Tiny C
No ratings yet
Compiler For Flat Tiny C
24 pages

Lexnyacc

Uploaded by

Lexnyacc

Uploaded by

Notes about lex and yacc

Pablo Nogueira Iglesias

4 Writing compilers with lex and yacc 13

1 Format of lex and yacc input files

2.3 Operation of yylex()

{ID} { yylval.strval = strdup(yytext); return NAME; }

2.4 Avoiding the ECHO default action

2.5 Inter-operation with yacc or otherwise

Figure 1: lexs global variables and functions

2.7 Changing the input buffer size

flex: setbuf(int size) {

2.8 Taking input from strings

YY INPUT(buffer, result, maxsize)

int my_yyintput(char *buf, int maxsize) {

2.9 Taking input from pipes

sprintf(buf, "%s %s", CPP, args);

2.10 Error control

\"[^\"\n]*\" { yylval.string = strdup(yytext); return DQS; }

\n.* { strcpy(linebuf, yytext-1); /* Save next line */

void yyerror(char *s)

printf("%d: %s %s in %s\n", yylineno, s, yytext, linebuf);

1. There are rules (productions) for all non-terminals.

2. All non-terminals must be reachable from the axiom.

3.1 Grammar ambiguity

3.3 Left recursion

3.4 Value stack type

typedef struct att_type {

3.5 Computing attributes

%type, so only $$ stands for that attribute.

3.6 Variables and functions used in yacc

Figure 2 shows the basic functions and variables.

3.8 Error control

As mentioned in Section 2.10, error control consists of three related activities:

3.9 Placement of error tokens

4 Writing compilers with lex and yacc

Figure 2: yaccs variables and functions

You might also like