CD Unit 2
Phases of a Compiler

We basically have two major phases of compilation, namely the Analysis phase and the Synthesis phase, which in turn have many parts. Each phase takes its input from the output of the previous one, and all of them work in a coordinated way.

Analysis Phase – An intermediate representation is created from the given source code. It consists of:
Lexical Analyzer
Syntax Analyzer
Semantic Analyzer
Intermediate Code Generator

The lexical analyzer divides the program into "tokens", the syntax analyzer recognizes "sentences" in the program using the syntax of the language, and the semantic analyzer checks the static semantics of each construct. The intermediate code generator then generates "abstract" code.

Synthesis Phase – An equivalent target program is created from the intermediate representation. It has two parts:
Code Optimizer
Code Generator

The code optimizer optimizes the abstract code, and the final code generator translates the abstract intermediate code into specific machine instructions.
● A compiler is a translator that converts a high-level language into machine language.
● A high-level language is written by a developer, while machine language can be understood by the processor.
● A compiler is also used to show errors to the programmer.
● The main purpose of a compiler is to translate code written in one language into another without changing the meaning of the program.
● When you execute a program written in a high-level language, execution happens in two parts: the source program is first compiled into a target program, and the target program is then run on the input data.

The compiler has two modules, namely the front end and the back end. The front end consists of the lexical analyzer, syntax analyzer, semantic analyzer and intermediate code generator. The remaining phases together form the back end.

Symbol Table – It is a data structure used and maintained by the compiler that holds all identifier names along with their types. It helps the compiler function smoothly by finding identifiers quickly.

1. Lexical Analyzer – It reads the program and converts it into tokens, i.e. it converts a stream of lexemes into a stream of tokens. Tokens are defined by regular expressions, which are understood by the lexical analyzer. It also removes white spaces and comments.
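The lexical analyzer described above can be sketched as a small regex-driven tokenizer. This is a minimal illustration, not a real compiler front end; the token categories and keyword list are assumptions chosen for the example.

```python
import re

# Token specification: each token type is defined by a regular expression,
# as the text describes. White space and comments are matched but dropped.
TOKEN_SPEC = [
    ("WHITESPACE", r"\s+"),
    ("COMMENT",    r"//[^\n]*"),
    ("KEYWORD",    r"\b(?:int|return|if|while|for)\b"),
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("SEPARATOR",  r"[,;(){}]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Convert a stream of characters into a stream of (type, lexeme) tokens,
    removing white space and comments along the way."""
    tokens = []
    for m in MASTER.finditer(source):
        kind = m.lastgroup
        if kind not in ("WHITESPACE", "COMMENT"):
            tokens.append((kind, m.group()))
    return tokens

print(tokenize("int a, b; // 2 variables"))
```

Note that the order of entries in `TOKEN_SPEC` matters: the keyword pattern must be tried before the identifier pattern, otherwise `int` would be classified as an identifier.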
2. Syntax Analyzer – It is sometimes called the parser. It constructs the parse tree: it takes the tokens one by one and uses a context-free grammar to build the tree.

Why grammar?
The rules of a programming language can be entirely represented in a few productions. Using these productions we can represent what the program actually is. The input has to be checked to see whether it is in the desired format or not. A syntax error can be detected at this level if the input is not in accordance with the grammar.

3. Semantic Analyzer – It verifies the parse tree, checking whether it is meaningful or not, and produces a verified parse tree. It also performs type checking, label checking and flow-control checking.

The symbol table stores information about various entities such as variable and function names, classes, objects, etc.
● It is built in the lexical and syntax analysis phases.
● The information is collected by the analysis phases of the compiler and is used by the synthesis phases to generate code.
● It is used by the compiler to achieve compile-time efficiency.
● It is used by the various phases of the compiler as follows:
1. Lexical Analysis: creates new entries in the table, for example entries about tokens.
2. Syntax Analysis: adds information regarding attribute type, scope, dimension, line of reference, use, etc. to the table.
3. Semantic Analysis: uses the available information in the table to check semantics, i.e. to verify that expressions and assignments are semantically correct (type checking), and updates the table accordingly.
4. Intermediate Code Generation: refers to the symbol table to know how much run-time storage is allocated and of what type; the table also helps in adding temporary-variable information.
5. Code Optimization: uses information present in the symbol table for machine-dependent optimization.
6. Target Code Generation: generates code by using the address information of identifiers present in the table.
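The per-phase use of the symbol table described above can be illustrated with a toy table. This is a hedged sketch under simplifying assumptions: real compilers use hashed, scope-aware tables, and the attribute names here are invented for the example.

```python
class SymbolTable:
    """Toy symbol table: maps identifier names to a dictionary of attributes.
    Different compiler phases add or consult different attributes."""

    def __init__(self):
        self.entries = {}

    def insert(self, name):
        # Lexical/syntax analysis: create a new entry when a name is first seen.
        self.entries.setdefault(name, {})

    def add_attribute(self, name, key, value):
        # Syntax/semantic analysis: record type, scope, dimension, etc.
        self.entries[name][key] = value

    def lookup(self, name):
        # Later phases: quickly fetch type, offset or address information.
        return self.entries.get(name)

table = SymbolTable()
table.insert("a")                        # created during analysis
table.add_attribute("a", "type", "int")  # added by semantic analysis
table.add_attribute("a", "offset", 0)    # used by storage allocation / codegen
print(table.lookup("a"))
```

The dictionary lookup stands in for the "finding the identifiers quickly" role the notes assign to the symbol table.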
4. Intermediate Code Generator – It generates intermediate code, a form that can be readily executed by a machine. There are many popular intermediate codes, for example three-address code. The intermediate code is converted to machine language by the last two phases, which are platform dependent. Up to the intermediate code, the process is the same for every compiler; after that, it depends on the platform. To build a new compiler we therefore do not need to start from scratch: we can take the intermediate code from an already existing compiler and build only the last two parts.

5. Code Optimizer – It transforms the code so that it consumes fewer resources and runs faster, without altering the meaning of the code. Optimization can be categorized into two types: machine dependent and machine independent.

6. Target Code Generator – Its main purpose is to produce code that the machine can understand; it also performs register allocation, instruction selection, etc. The output depends on the type of assembler. This is the final stage of compilation.

Symbol Table in Compiler
Prerequisite – Phases of a Compiler

Symbol Table entries – Each entry in the symbol table is associated with attributes that support the compiler in its different phases.

Items stored in the symbol table:
● Variable names and constants
● Procedure and function names
● Literal constants and strings
● Compiler-generated temporaries
● Labels in source languages

Information used by the compiler from the symbol table:
● Data type and name
● Declaring procedures
● Offset in storage
● If a structure or record, then a pointer to the structure table
● For parameters, whether passing is by value or by reference
● Number and type of arguments passed to a function
● Base address
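The three-address code mentioned under the intermediate code generator can be produced by a simple post-order walk of an expression tree. The sketch below is illustrative only: the tuple-based AST and the temporary names t1, t2 are assumptions made for this example, not a standard representation.

```python
from itertools import count

def gen_tac(node, code, temps):
    """Generate three-address code for a tuple AST such as
    ('+', 'a', ('*', 'b', 'c')). Returns the name holding the node's
    value and appends instructions to `code`."""
    if isinstance(node, str):      # identifier or constant: already a name
        return node
    op, left, right = node
    l = gen_tac(left, code, temps)
    r = gen_tac(right, code, temps)
    t = f"t{next(temps)}"          # compiler-generated temporary
    code.append(f"{t} = {l} {op} {r}")
    return t

# Translate d = a + b * c into three-address code.
code = []
result = gen_tac(("+", "a", ("*", "b", "c")), code, count(1))
code.append(f"d = {result}")
print("\n".join(code))
# t1 = b * c
# t2 = a + t1
# d = t2
```

Each generated instruction has at most one operator on the right-hand side, which is exactly the three-address property.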
Differences between compiler and interpreter

What is a Translator? Different types of translators
(by Dinesh Thakur, category: Compiler Design)

A program written in a high-level language is called source code. To convert the source code into machine code, translators are needed. A translator takes a program written in a source language as input and converts it into a program in a target language as output.

Compiler: performs the translation of a program as a whole.
Interpreter: performs statement-by-statement translation.

Interpreter
An interpreter is a translator which is used to convert programs in a high-level language to a low-level language. It translates line by line and reports an error as soon as it is encountered during the translation process. It directly executes the operations specified in the source program when the input is given by the user. It gives better error diagnostics than a compiler.

Assembler
An assembler is a translator which is used to translate assembly language code into machine language code.
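The statement-by-statement behavior contrasted above can be sketched with a toy interpreter that executes one line at a time and stops at the first error. This is a minimal illustration; the assignment-only mini-language and the use of `eval` are assumptions made purely for the example.

```python
def interpret(program):
    """Execute assignment statements of the form 'name = expression'
    one line at a time, reporting an error as soon as it is encountered."""
    env = {}
    for lineno, line in enumerate(program.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            name, expr = line.split("=", 1)
            # Translate and run only this one statement, like an interpreter.
            env[name.strip()] = eval(expr, {}, env)
        except Exception as exc:
            print(f"error on line {lineno}: {exc}")
            break
    return env

print(interpret("a = 2\nb = a * 10"))
```

Unlike a compiler, nothing after the failing line is ever examined; a compiler would have reported errors for the whole program before running any of it.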
Introduction of Lexical Analysis

Lexical analysis is the first phase of the compiler, also known as scanning. It converts the high-level input program into a sequence of tokens.
● Lexical analysis can be implemented with a deterministic finite automaton.
● The output is a sequence of tokens that is sent to the parser for syntax analysis.

How the Lexical Analyzer functions
1. Tokenization, i.e. dividing the program into valid tokens.
2. Removing white-space characters.
3. Removing comments.
4. Helping to generate error messages by providing the row number and column number of the error.

The lexical analyzer identifies errors with the help of the automaton and the grammar of the language it is based on (such as C or C++), and gives the row number and column number of the error.

Examples of tokens:
● Type tokens (id, number, real, …)
● Punctuation tokens (IF, void, return, …)
● Alphabetic tokens (keywords)

Keywords; examples – for, while, if, etc.
Identifiers; examples – variable names, function names, etc.
Operators; examples – '+', '++', '-', etc.
Separators; examples – ',', ';', etc.

Examples of non-tokens:
● Comments, preprocessor directives, macros, blanks, tabs, newlines, etc.

Suppose we pass a statement through the lexical analyzer –

int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}

All the valid tokens are:

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'

You can observe that we have omitted the comments.
Lexical Analysis – Compiler Design

Lexical analysis is the process of converting a sequence of characters from the source program into a sequence of tokens. A program which performs lexical analysis is termed a lexical analyzer (lexer), tokenizer or scanner.

Lexical analysis consists of two stages of processing, which are as follows:
• Scanning
• Tokenization

The sequence of tokens produced by the lexical analyzer helps the parser in analyzing the syntax of the programming language.

Example lexeme-to-token mapping: names map to the identifier token, '=' to the assignment symbol, '+' to the addition symbol, '*' to the multiplication symbol, and numeric literals to the number token.
Token, Pattern and Lexeme

Token
A token is a valid sequence of characters which is specified by a lexeme.

Role of Lexical Analyzer
The tasks of the lexical analyzer can be divided into two processes:
Scanning: performs the reading of input characters and the removal of white spaces and comments.
Tokenization: produces tokens from the scanned characters.

Specification of Tokens

Strings
A finite sequence of symbols is called a string. The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.

Regular expressions
Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If 'a' is a symbol in Σ, then 'a' is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with 'a' in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.

Regular set
A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s. There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms. For instance, r|s = s|r (| is commutative) and r|(s|t) = (r|s)|t (| is associative).

Non-regular set
A language which cannot be described by any regular expression is a non-regular set. Example: the set of all strings of balanced parentheses and the set of repeating strings cannot be described by a regular expression. Such sets can be specified by a context-free grammar.

Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di−1}.
Example:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9

Shorthands
1. One or more instances (+):
- If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.
- Thus the regular expression a+ denotes the set of all strings of one or more a's.
- The operator + has the same precedence and associativity as the operator *.
2. Zero or one instance (?):
- The unary postfix operator ? means "zero or one instance of".
- The notation r? is shorthand for r | ε.
- If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}.
3. Character classes:
- The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
- A character class such as [a–z] denotes the regular expression a | b | c | … | z.
- We can describe identifiers as the strings generated by the regular expression [A–Za–z][A–Za–z0–9]*.
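The regular definitions and the +, ?, and character-class shorthands can be checked directly with Python's `re` module, whose notation closely follows the one above. This is an illustrative sketch; the sample strings are invented for the example.

```python
import re

# Regular definitions from the text, written as Python regular expressions.
letter = "[A-Za-z]"
digit = "[0-9]"
# identifier: letter (letter | digit)*  — i.e. [A-Za-z][A-Za-z0-9]*
identifier = re.compile(f"{letter}(?:{letter}|{digit})*")

print(bool(identifier.fullmatch("count1")))   # a valid identifier
print(bool(identifier.fullmatch("1count")))   # may not start with a digit

# One or more instances: a+ denotes strings of one or more a's.
print(bool(re.fullmatch("a+", "aaa")))
print(bool(re.fullmatch("a+", "")))           # empty string is not in L(a+)

# Zero or one instance: r? is shorthand for r | ε.
print(bool(re.fullmatch("ab?", "a")))
print(bool(re.fullmatch("ab?", "ab")))
```

Note that `re.fullmatch` checks whether the whole string belongs to the denoted language, which is the question the specification rules above are answering.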