
UNIT II

Compiler Structure: Compilers and Translators, Various Phases of Compiler, Pass Structure of Compiler, Bootstrapping of Compiler. Lexical Analysis: The Role of the Lexical Analyzer, The Lexical-Analyzer Generator Lex. Syntax Analysis: CFG, Lexical Versus Syntactic Analysis, Eliminating Ambiguity, Elimination of Left Recursion, Left Factoring, Top-Down Parsers: Recursive-Descent Parsing, FIRST and FOLLOW, LL(1) Grammars, Nonrecursive Predictive Parsing, Error Recovery in Predictive Parsing.

A compiler has two phases, namely the analysis phase and the synthesis phase.
The analysis phase creates an intermediate representation from the given source code.
The synthesis phase creates an equivalent target program from that intermediate representation.

Symbol Table – It is a data structure used and maintained by the compiler, containing the names of all identifiers along with their types. It helps the compiler function smoothly by letting it find identifiers quickly.

The analysis of a source program is divided mainly into three phases. They are:
1. Linear Analysis –
This involves the scanning phase, where the stream of characters is read from left to right and grouped into tokens having a collective meaning.
2. Hierarchical Analysis –
In this phase, based on their collective meaning, the tokens are grouped hierarchically into nested structures.

3. Semantic Analysis-
This phase is used to check whether the components of the source program are
meaningful or not.
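As a brief illustration (a standard textbook-style example, not taken from these notes), consider the assignment

position = initial + rate * 60

Linear analysis groups the characters into the tokens id1, =, id2, +, id3, * and 60. Hierarchical analysis nests these tokens into an expression structure in which * is applied before +. Semantic analysis then checks the operands, for example requiring that the integer 60 be converted to a real number if rate is declared real.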

The compiler has two modules, namely the front end and the back end. The front end comprises the lexical analyzer, syntax analyzer, semantic analyzer, and intermediate code generator. The remaining phases together form the back end.
1. Lexical Analyzer –
It is also called a scanner. It takes as input the output of the preprocessor (which performs file inclusion and macro expansion), which is in a pure high-level language. It reads the characters of the source program and groups them into lexemes (sequences of characters that “go together”). Each lexeme corresponds to a token. Tokens are defined by regular expressions, which the lexical analyzer understands. It also removes lexical errors (e.g., erroneous characters), comments, and white space.
2. Syntax Analyzer – It is sometimes called a parser. It constructs the parse tree. It
takes all the tokens one by one and uses Context-Free Grammar to construct the
parse tree.
Why Grammar?
The syntactic rules of a programming language can be represented entirely in a few productions. Using these productions we can specify what a valid program actually looks like, and the input can be checked against this desired format.
The parse tree is also called the derivation tree. Parse trees are generally constructed to check for ambiguity in the given grammar. There are certain rules associated with the derivation tree:
● Any identifier is an expression
● Any number can be called an expression
● Performing any operations in the given expression will always result in an
expression. For example, the sum of two expressions is also an expression.

Syntax errors can be detected at this level if the input is not in accordance with the grammar.
● Semantic Analyzer – It verifies whether the parse tree is meaningful or not and produces a verified parse tree. It also performs type checking, label checking, and flow-control checking.

● Intermediate Code Generator – It generates intermediate code, a form that can be readily translated into target machine code. There are many popular intermediate representations; three-address code is one example. Intermediate code is converted to machine language by the last two phases, which are platform dependent.
Up to the intermediate code, compilation is essentially the same for every compiler; after that, it depends on the platform. To build a compiler for a new platform, we don't need to build it from scratch: we can take the intermediate code from an already existing compiler and rebuild only the last two phases.
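For instance, the statement a = b + c * d could be translated into the following three-address code (one common form; the temporary names t1 and t2 are illustrative):

t1 = c * d
t2 = b + t1
a = t2

Each instruction has at most one operator on its right-hand side, which is what makes this form easy to translate further into machine instructions.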
● Code Optimizer – It transforms the code so that it consumes fewer resources and runs faster. The meaning of the code being transformed is not altered.
● Optimization can be categorized into two types: machine-dependent and machine-independent.
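As a small machine-independent example, constant folding evaluates expressions whose operands are known at compile time, rewriting

t1 = 3 * 5
x = t1 + y

into the single instruction x = 15 + y, without changing the meaning of the program.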

● Target Code Generator – The main purpose of the target code generator is to write code that the machine can understand; it also performs register allocation, instruction selection, and so on.

● The output depends on the type of assembler. This is the final stage of compilation.

● The optimized code is converted into relocatable machine code, which then forms the input to the linker and loader.
All these six phases are associated with the symbol-table manager and the error handler.
The lexical analyzer scans the entire source code of the program and identifies each token one by one. Scanners are usually implemented to produce tokens only when requested by the parser. Here is how recognition of tokens works:
Lexical Analyzer Architecture

1. “Get next token” is a command sent from the parser to the lexical analyzer.
2. On receiving this command, the lexical analyzer scans the input until it finds the next token.
3. It returns the token to the parser.

The lexical analyzer skips whitespace and comments while creating these tokens. If any error is present, the lexical analyzer will correlate that error with the source file and line number.
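This hand-shake can be sketched in C as follows. This is a toy stand-in, not code from these notes: get_next_token plays the role of a Lex-generated yylex(), and the token codes are assumptions.

#include <stdio.h>

/* Hypothetical token codes, for illustration only */
enum { TOK_ID, TOK_PLUS, TOK_EOF };

static const char *input = "a+b";   /* source text being scanned */

/* Stand-in scanner: produces the next token only when asked */
static int get_next_token(void) {
    char c = *input;
    if (c == '\0') return TOK_EOF;
    input++;
    return (c == '+') ? TOK_PLUS : TOK_ID;
}

int main(void) {
    /* The parser repeatedly issues the "get next token" command */
    for (int tok = get_next_token(); tok != TOK_EOF; tok = get_next_token())
        printf("parser received token %d\n", tok);
    return 0;
}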

Roles of the Lexical Analyzer

The lexical analyzer performs the tasks given below:

● Helps to identify tokens and enter them into the symbol table
● Removes white spaces and comments from the source program
● Correlates error messages with the source program
● Expands macros if they are found in the source program
● Reads input characters from the source program

Example of Lexical Analysis, Tokens, Non-Tokens


Consider the following code that is fed to the lexical analyzer:
#include <stdio.h>
int maximum(int x, int y) {
// This will compare 2 numbers
if (x > y)
return x;
else {
return y;
}
}
Examples of Tokens created

Lexeme     Token
int        Keyword
maximum    Identifier
(          Operator
int        Keyword
x          Identifier
,          Operator
int        Keyword
y          Identifier
)          Operator
{          Operator
if         Keyword
Examples of Non-tokens

Type                       Examples
Comment                    // This will compare 2 numbers
Pre-processor directive    #include <stdio.h>
Pre-processor directive    #define NUMS 8,9
Macro                      NUMS
Whitespace                 \n \b \t

Lexical Errors
A character sequence which cannot be scanned into any valid token is a lexical error.
Important facts about lexical errors:

● Lexical errors are not very common, but they should be managed by the scanner
● Misspellings of identifiers, operators, or keywords are considered lexical errors
● Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token

Error Recovery in Lexical Analyzer


Here are a few of the most common error-recovery techniques:

● Remove one character from the remaining input
● In panic mode, ignore successive characters until a well-formed token is reached
● Insert a missing character into the remaining input
● Replace a character with another character
● Transpose two adjacent characters

LEX
o Lex is a program that generates lexical analyzers. It is used with the YACC parser generator.
o A lexical analyzer is a program that transforms an input stream into a sequence of tokens.
o Lex reads a specification and produces, as output, C source code implementing the lexical analyzer.

The function of Lex is as follows:

o First, a specification of the lexical analyzer is written in the Lex language in a file lex.l. The Lex compiler then runs on the lex.l program and produces a C program lex.yy.c.
o Next, the C compiler compiles the lex.yy.c program and produces an object program a.out.
o a.out is the lexical analyzer, which transforms an input stream into a sequence of tokens.

Lex file format


A Lex program is separated into three sections by %% delimiters. The format of Lex source is as
follows:

{ definitions }
%%
{ rules }
%%
{ user subroutines }

Definitions include declarations of constants, variables, and regular definitions.

Rules are statements of the form p1 {action1} p2 {action2} .... pn {actionn},

where each pi is a regular expression and each actioni describes what the lexical analyzer should do when pattern pi matches a lexeme.

User subroutines are auxiliary procedures needed by the actions. These subroutines can be compiled separately and loaded with the lexical analyzer.

For example, a regular definition, a translation rule, and an auxiliary function might look as follows:

number [0-9]+
%%
if {return(IF);}
%%
int numberSum()

An example implementation of a Lexical Analyser using Lex

The following Lex Program counts the number of words-

Code
%{
#include<stdio.h>
#include<string.h>
int count = 0;
%}
/* Rules Section*/
%%
([a-zA-Z0-9])+ {count++;} /* Rule for counting words; + (not *) keeps the pattern from matching the empty string */
"\n" {printf("Total Number of Words : %d\n", count); count = 0;}
%%
int yywrap(void){ return 1; } /* returning 1 signals that there is no more input */
int main()
{
// The function that starts the analysis
yylex();
return 0;
}

How to execute
Type lex lexfile.l
Type gcc lex.yy.c
Type ./a.out (the executable may be named a.exe on Windows)

Output
(The screenshot of the run is omitted; for each input line the program prints Total Number of Words followed by the count.)

Explanation

In the definition section of the program, we have included the standard library for input-output
operations and string operations and a variable count to keep count of the number of words.

In the rules section, we have specified the regular expression ([a-zA-Z0-9])+, which matches any nonempty string of letters and digits, such as “AYUSHI28”, “BOND007”, etc.

There is a rule for a newline too. When a newline character is encountered, the current count of
words is printed, and the counter is set to zero.
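For example, on the input line

hello world 123

the scanner matches three words, and on reaching the newline it prints Total Number of Words : 3 before resetting the counter (unmatched characters such as the spaces are simply echoed by Lex's default action).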
Removal of Ambiguity (Converting an Ambiguous Grammar into an Unambiguous Grammar)

What is Ambiguity?

Ambiguity in a grammar is the situation in which a single string or expression can be interpreted in more than one way, owing to different parse trees.

This happens when, for the same input, the grammar gives rise to more than one valid derivation or more than one parse tree, displaying different meanings or structures. In the context of programming languages, ambiguity is undesirable because it may cause a compiler or interpreter to give indeterminate, and thus inconsistent, results, since it is unclear which of several possible interpretations was intended.

For example, take the expression 3 + 4 * 5. Because the grammar itself does not define precedence, this can be understood either as (3 + 4) * 5 or as 3 + (4 * 5), and the two versions yield different results.

Removing ambiguity therefore requires designing grammars that map every valid input string to a unique parse tree, and hence to predictable behavior, so that no confusion arises during language processing.

Below, we discuss the removal of ambiguity from a grammar using suitable examples.


Ambiguous vs Unambiguous Grammar

A grammar that has more than one derivation tree (parse tree) for some string is ambiguous. Such a grammar cannot be handled by deterministic parsers.

Removal of Ambiguity
We can remove ambiguity solely on the basis of the following two properties:
1. Precedence
If different operators are used, we will consider the precedence of the operators. The important characteristics are:
● The level at which a production is present denotes the priority of the operator used in it.
● Productions at higher levels will have operators with lower priority. In the parse tree, the nodes which are at top levels, close to the root node, will contain the lower-priority operators.
● Productions at lower levels will have operators with higher priority. In the parse tree, the nodes which are at lower levels, close to the leaf nodes, will contain the higher-priority operators.
2. Associativity
If operators with the same precedence appear in the productions, then we have to consider associativity.
● If the associativity is left to right, then we introduce left recursion in the production. The parse tree will also be left recursive and grow on the left side. +, -, *, / are left-associative operators.
● If the associativity is right to left, then we introduce right recursion in the productions. The parse tree will also be right recursive and grow on the right side. ^ is a right-associative operator.
Example 1 – Consider the ambiguous grammar
E -> E - E | id
The language of the grammar will contain { id, id-id, id-id-id, … }.
Say we want to derive the string id-id-id. Let's consider a single value id = 3 to get more insight. The result should be:
3-3-3 = (3-3)-3 = -3
Since the operators have the same priority, we need to consider associativity, which is left to right.
Parse Tree
The parse tree which grows on the left side of the root will be the correct parse tree, in order to make the grammar unambiguous.

So, to make the above grammar unambiguous, simply make the grammar left recursive by replacing the rightmost non-terminal E on the right side of the production with another variable, say P.
The grammar becomes:
E -> E - P | P
P -> id
The above grammar is now unambiguous and will contain only one parse tree for the above expression, with the structure shown by the derivation below.
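For instance, the only leftmost derivation of id-id-id in the new grammar (added here for illustration; it follows directly from the two productions above) is:

E => E - P => E - P - P => P - P - P => id - P - P => id - id - P => id - id - id

The sentential form E - P - P can only group as (E - P) - P, so subtraction is evaluated from the left, as required.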

Similarly, the unambiguous grammar for an expression such as 2^3^2 will be:
E -> P ^ E | P // right recursive, as ^ is right associative
P -> id
Example 2 – Consider the grammar shown below, which has two different operators:
E -> E + E | E * E | id
Clearly, the above grammar is ambiguous, as we can draw two parse trees for the string “id+id*id”. Consider the expression:
3+2*5 // “*” has higher priority than “+”
The correct answer is (3+(2*5)) = 13.
The “+”, having the least priority, has to be at the upper level and has to wait for the result produced by the “*” operator, which is at the lower level. So, the parse tree in which “*” appears below “+” is the correct one and gives the expected result.
The unambiguous grammar will contain the productions having the highest-priority operator (“*” in the example) at the lower level, and vice versa. The associativity of both operators is left to right, so the unambiguous grammar has to be left recursive. The grammar will be:
E -> E + P // + is at the higher level and left associative
E -> P
P -> P * Q // * is at the lower level and left associative
P -> Q
Q -> id
(or)
E -> E + P | P
P -> P * Q | Q
Q -> id
E is used for performing addition operations and P is used for performing multiplication operations. They are independent and maintain the precedence order in the parse tree, as the derivation below for the string “id+id*id+id” shows.
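A leftmost derivation of id+id*id+id (compressing the unit steps P => Q and Q => id where obvious) is:

E => E + P => E + P + P => P + P + P => id + P + P => id + P * Q + P => id + id * id + P => id + id * id + id

The * is introduced only inside a P, below the two +'s, so multiplication binds tighter; and the left-recursive production E -> E + P makes the additions associate from the left.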

Note: It is very important to note that while converting an ambiguous grammar to an unambiguous grammar, we shouldn't change the original language generated by the ambiguous grammar.
So, the non-terminals in the ambiguous grammar have to be replaced with other variables in such a way that we get the same language as was derived before, while simultaneously maintaining the precedence and associativity rules.
This is the reason we wrote the productions E -> P, P -> Q and Q -> id after the replacements in the above example: the language contains the strings { id, id+id } as well.
Similarly, the unambiguous grammar for an expression having the operators -, *, ^ is:
E -> E - P | P // minus is at the highest level due to its lowest priority; left associative
P -> P * Q | Q // multiplication has higher priority than - and lower priority than ^; left associative
Q -> R ^ Q | R // exponentiation is at the lowest level due to its highest priority; right associative
R -> id


Left Recursion
A context-free grammar is said to be left recursive if it contains a production rule where the non-terminal on the left-hand side of the rule also appears as the first symbol on the right-hand side.
In other words, the grammar is trying to define a non-terminal in terms of itself, creating a recursive loop.
This can be represented formally as −
A→Aα|β
Where −
 A is a non-terminal symbol.
 α represents a sequence of terminals and/or non-terminals.
 β represents another sequence of terminals and/or non-terminals.
The most important part here is the presence of A on both sides of the production rule, with it appearing first on the right-hand side.

To visualize this, consider a parse tree generated by a left-recursive grammar (the figure is omitted here). As the grammar recursively expands the non-terminal 'A' on the left, the tree grows indefinitely downwards on the left side. This continuous expansion makes the grammar unsuitable for top-down parsing, as the parser could get trapped in an infinite loop, trying to expand 'A' repeatedly.

Problem of Left Recursion for Top-Down Parsing

As we have seen, top-down parsing works by starting with the start symbol of the grammar and attempting to derive the input string by applying production rules.
When the parser encounters a left-recursive rule, it keeps expanding the same non-terminal, leading to an infinite loop. This inability to handle left recursion directly is a significant drawback of top-down parsing methods.
Eliminating Left Recursion
To solve this, we can eliminate immediate left recursion from a grammar without altering the language it generates.
The general approach involves introducing a new non-terminal and rewriting the recursive rules.
Let's illustrate this with our previous example −
A→Aα|β
We can eliminate the left recursion by introducing a new non-terminal Aʹ and rewriting the rule as follows −
A→βAʹ
Aʹ→αAʹ|ε
In this transformed grammar −
 A now derives the non-recursive part β followed by the new non-terminal Aʹ.
 Aʹ handles the recursion, deriving either αAʹ (continuing the recursion) or ε (the empty string), which terminates the recursion.
Let us see this through an example.
Consider a simplified arithmetic expression grammar −
E→E+T|T
T→T∗F|F
F→(E)|id
This grammar contains left recursion in the rules for both E and T. Let's eliminate it −
 Eliminating left recursion in E:
E→TEʹ
Eʹ→+TEʹ|ε
 Eliminating left recursion in T:
T→FTʹ
Tʹ→∗FTʹ|ε
The final transformed grammar, free from left recursion,
becomes −
E→TEʹ
Eʹ→+TEʹ|ε
T→FTʹ
Tʹ→∗FTʹ|ε
F→(E)|id
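The transformed grammar can be parsed top-down directly. Below is a minimal recursive-descent sketch in C; it is not part of the original notes, helper names such as peek and match are assumptions, and id is restricted to a single lowercase letter for brevity. Each non-terminal becomes one function, and Eʹ and Tʹ simply return when they see no + or *, which corresponds to choosing the ε-production.

#include <stdio.h>
#include <stdlib.h>

static const char *in;                 /* remaining input */

static char peek(void) { return *in; }

static void match(char c) {            /* consume one expected character */
    if (peek() == c) in++;
    else { printf("syntax error at '%c'\n", peek()); exit(1); }
}

static void E(void); static void Ep(void);
static void T(void); static void Tp(void);
static void F(void);

static void E(void)  { T(); Ep(); }                                      /* E  -> T E'        */
static void Ep(void) { if (peek() == '+') { match('+'); T(); Ep(); } }   /* E' -> + T E' | ε  */
static void T(void)  { F(); Tp(); }                                      /* T  -> F T'        */
static void Tp(void) { if (peek() == '*') { match('*'); F(); Tp(); } }   /* T' -> * F T' | ε  */
static void F(void) {                                                    /* F  -> ( E ) | id  */
    if (peek() == '(') { match('('); E(); match(')'); }
    else if (peek() >= 'a' && peek() <= 'z') match(peek());
    else { printf("syntax error at '%c'\n", peek()); exit(1); }
}

int main(void) {
    in = "a+b*c";                      /* id + id * id */
    E();
    puts(*in == '\0' ? "accepted" : "rejected");
    return 0;
}

Note that no function calls itself before consuming input, so the parser cannot loop forever; this is exactly what eliminating left recursion buys us.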

Left Factoring
While left recursion presents a parsing challenge, a left-factored grammar is desirable for top-down parsing. Left factoring involves restructuring the grammar to eliminate common prefixes in production rules.
Importance of Left Factoring
When a grammar has multiple production rules for a non-terminal with a common prefix, the parser faces a choice during parsing: it has to pick between these rules without knowing which one will ultimately lead to a successful parse.
Left factoring helps resolve this by postponing the decision point, making the parsing process more efficient.
Performing Left Factoring
The process of left factoring involves identifying the longest
common prefix (α) in the alternatives for a non-terminal and
rewriting the rules to factor out this common prefix.
Consider a grammar with the following rules −
A→αβ1|αβ2|…|αβn|γ
Here, 'α' is the longest common prefix in the first 'n'
alternatives for non-terminal 'A'.
We can left factor this as follows −
A→αAʹ|γ
A’→β1|β2|…|βn
In this factored grammar −
 A now derives either the common prefix α followed by the new non-terminal Aʹ, or the alternative γ.
 Aʹ encapsulates the different options that were originally prefixed by α. This defers the decision of which alternative to choose until more of the input has been processed.
Let us understand this through an example. Consider a simple grammar −
S→if E then S else S | if E then S | A
Here, the first two rules for S have a common prefix, "if E then". We can apply left factoring to obtain −
S→if E then S Sʹ | A
Sʹ→else S | ε
This factored grammar is more suitable for top-down parsing as
it avoids the initial choice between the first two rules.
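As a fragment in the style of the recursive-descent sketch above (not a complete program; S and the hypothetical helpers next_token_is_else and match_else are assumed to exist, with the keywords treated as single tokens), the parser decides between the two alternatives of Sʹ with a single token of lookahead:

/* S' -> else S | epsilon */
void Sprime(void) {
    if (next_token_is_else()) {   /* one token of lookahead decides */
        match_else();             /* consume the else keyword       */
        S();                      /* then parse the statement       */
    }
    /* otherwise: S' -> epsilon, consume no input */
}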
Conclusion
Left recursion can be problematic for top-down parsing and needs to be eliminated. Left factoring, on the other hand, improves the suitability of a grammar for top-down parsing by deferring the choice between alternatives that share a common prefix.
