
Unit 2

A compiler is software which converts a program written in a high-level language (the source language) into a low-level language (object/target/machine language).

● A compiler is a translator that converts a high-level language into machine language.
● The high-level language is written by a developer, and the machine language can be understood by the processor.
● The compiler is also used to report errors to the programmer.
● The main purpose of a compiler is to translate the code written in one language into another without changing the meaning of the program.
● When you execute a program written in a high-level language, it executes in two parts.
● In the first part, the source program is compiled and translated into the object program (low-level language).
● In the second part, the object program is translated into the target program through the assembler.

Phases of a Compiler

There are two major phases of compilation, the Analysis phase and the Synthesis phase, which in turn have many parts. Each part takes its input from the output of the previous one, and all of them work in a coordinated way. The Analysis phase creates an intermediate representation from the given source code; the Synthesis phase creates an equivalent target program from that intermediate representation.

Analysis Phase – An intermediate representation is created from the given source code. It has four parts:

Lexical Analyzer
Syntax Analyzer
Semantic Analyzer
Intermediate Code Generator

The lexical analyzer divides the program into “tokens”, the syntax analyzer recognizes “sentences” in the program using the syntax of the language, and the semantic analyzer checks the static semantics of each construct. The intermediate code generator then generates “abstract” code.

Synthesis Phase – An equivalent target program is created from the intermediate representation. It has two parts:

Code Optimizer
Code Generator

The code optimizer optimizes the abstract code, and the final code generator translates the abstract intermediate code into specific machine instructions.

 
Symbol Table – It is a data structure used and maintained by the compiler that contains every identifier’s name along with its type. It helps the compiler function smoothly by letting it find identifiers quickly.

The compiler has two modules, namely the front end and the back end. The front end consists of the lexical analyzer, syntax analyzer, semantic analyzer and intermediate code generator; the remaining phases together form the back end.

1. Lexical Analyzer – It reads the program and converts it into tokens, i.e. it converts a stream of lexemes into a stream of tokens. Tokens are defined by regular expressions which are understood by the lexical analyzer. It also removes white-space and comments.
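As a rough sketch of what this phase does, the following C program (illustrative only; the input string and the printed token format are made up for this example) splits a statement like a = b + c; into identifier and operator tokens:

#include <ctype.h>
#include <stdio.h>

int main(void) {
    const char *src = "a = b + c;";  /* hypothetical input statement */
    const char *p = src;

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {         /* skip white-space */
            p++;
        } else if (isalpha((unsigned char)*p)) {  /* identifier: letter (letter | digit)* */
            printf("id(");
            while (isalnum((unsigned char)*p))
                putchar(*p++);
            printf(") ");
        } else {                                  /* single-character operator or punctuation */
            printf("'%c' ", *p++);
        }
    }
    putchar('\n');  /* prints: id(a) '=' id(b) '+' id(c) ';' */
    return 0;
}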
2. Syntax Analyzer – It is sometimes called the parser. It constructs the parse tree: it takes the tokens one by one and uses a Context Free Grammar to build the tree.

Why grammar? The rules of a programming language can be entirely represented by a few productions, and using these productions we can represent what the program actually is. The input has to be checked against them to see whether it is in the desired format: a syntax error can be detected at this level if the input is not in accordance with the grammar.
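For example, a small Context Free Grammar such as the following standard textbook grammar for arithmetic expressions (shown here only as an illustration) captures operator precedence in three productions:

E → E + T | T
T → T * F | F
F → ( E ) | id

Parsing an expression like b + c * d against these productions yields a parse tree with the tokens at its leaves.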
3. Semantic Analyzer – It verifies the parse tree, checking whether it is meaningful or not, and produces a verified parse tree. It also does type checking, label checking and flow-control checking.
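For example, in a C-like language the semantic analyzer would accept the first declaration below but flag the second, even though both are syntactically well-formed (an illustrative snippet):

int a = 10;      /* accepted: the types match              */
int b = "text";  /* type error: a string assigned to an int */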
4. Intermediate Code Generator – It generates intermediate code, a form which can be readily translated into target machine code. There are many popular intermediate codes, three address code being one example. The intermediate code is converted to machine language by the last two phases, which are platform dependent. Up to the intermediate code, compilation is the same for every compiler; after that, it depends on the platform. To build a new compiler we therefore don’t need to start from scratch: we can take the intermediate code from an already existing compiler and build the last two parts.
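For example, three address code for the statement a = b + c * d could look as follows, where t1 and t2 are compiler-generated temporaries and each instruction has at most one operator on its right-hand side:

t1 = c * d
t2 = b + t1
a = t2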
5. Code Optimizer – It transforms the code so that it consumes fewer resources and runs faster, without altering the meaning of the code being transformed. Optimisation can be categorized into two types: machine dependent and machine independent.
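As a machine-independent illustration, constant folding together with the removal of unnecessary temporaries could transform the intermediate code for a = 2 * 3 + b like this (a hypothetical example):

Before optimization:
t1 = 2 * 3
t2 = t1 + b
a = t2

After optimization:
a = 6 + b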
6. Target Code Generator – Its main purpose is to write code that the machine can understand; it also performs register allocation, instruction selection, etc. The output depends on the type of assembler. This is the final stage of compilation.
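For instance, the intermediate instruction a = b + c might be lowered to register-based code of the following shape (illustrative pseudo-assembly; the actual mnemonics and addressing modes depend on the target machine and assembler):

LD R1, b    ; load b into register R1
ADD R1, c   ; R1 = R1 + c
ST a, R1    ; store R1 into a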
Symbol Table in Compiler

Prerequisite – Phases of a Compiler

Symbol Table is an important data structure created and maintained by the compiler in order to keep track of the semantics of variables, i.e. it stores information about the scope and binding of names, and about instances of various entities such as variable and function names, classes, objects, etc.

● It is built in the lexical and syntax analysis phases.
● The information is collected by the analysis phases of the compiler and used by the synthesis phases to generate code.
● It is used by the compiler to achieve compile-time efficiency.
● It is used by the various phases of the compiler as follows:

1. Lexical Analysis: Creates new table entries, for example entries for tokens.
2. Syntax Analysis: Adds information regarding attribute type, scope, dimension, line of reference, use, etc. to the table.
3. Semantic Analysis: Uses the available information in the table to check semantics, i.e. to verify that expressions and assignments are semantically correct (type checking), and updates it accordingly.
4. Intermediate Code generation: Refers to the symbol table to know how much run-time storage is allocated and of what type; the table also helps in adding information about temporary variables.
5. Code Optimization: Uses information present in the symbol table for machine-dependent optimization.
6. Target Code generation: Generates code using the address information of the identifiers present in the table.

Symbol Table entries – Each entry in the symbol table is associated with attributes that support the compiler in different phases.

Items stored in the Symbol table:

● Variable names and constants
● Procedure and function names
● Literal constants and strings
● Compiler-generated temporaries
● Labels in source languages

Information used by the compiler from the Symbol table:

● Data type and name
● Declaring procedure
● Offset in storage
● If a structure or record, then a pointer to the structure table
● For parameters, whether the parameter is passed by value or by reference
● Number and type of arguments passed to a function
● Base address

Operations of Symbol table – The basic operations defined on a symbol table include insert and lookup.
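A minimal C sketch of such a table and its two basic operations, assuming a simple linked-list implementation (the entry fields and function names are illustrative, not from a real compiler):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One symbol-table entry: an identifier's name plus a few attributes. */
struct entry {
    char name[32];       /* identifier name       */
    char type[16];       /* data type, e.g. "int" */
    int offset;          /* offset in storage     */
    struct entry *next;
};

static struct entry *table = NULL;  /* head of the linked list */

/* insert: record a newly declared identifier and its attributes. */
void insert(const char *name, const char *type, int offset) {
    struct entry *e = malloc(sizeof *e);
    strncpy(e->name, name, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';
    strncpy(e->type, type, sizeof e->type - 1);
    e->type[sizeof e->type - 1] = '\0';
    e->offset = offset;
    e->next = table;    /* link in at the front */
    table = e;
}

/* lookup: find an identifier's entry, or NULL if it was never declared. */
struct entry *lookup(const char *name) {
    for (struct entry *e = table; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

int main(void) {
    insert("a", "int", 0);
    insert("b", "int", 4);
    struct entry *e = lookup("b");
    if (e != NULL)
        printf("%s: type=%s, offset=%d\n", e->name, e->type, e->offset);
    return 0;
}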

 
What is a Translator? Different types of translators

A program written in a high-level language is called source code. To convert the source code into machine code, translators are needed.

A translator takes a program written in a source language as input and converts it into a program in a target language as output. It also detects and reports errors during translation.

The roles of a translator are:

• Translating the high-level language program input into an equivalent machine language program.
• Providing diagnostic messages wherever the programmer violates the specification of the high-level language program.

The different types of translator are as follows:

Compiler

A compiler is a translator which is used to convert programs in high-level language to low-level language. It translates the entire program and also reports the errors in the source program encountered during the translation.

Interpreter

An interpreter is a translator which is used to convert programs in high-level language to low-level language. It translates line by line and reports an error as soon as it is encountered during the translation process. It directly executes the operations specified in the source program when the input is given by the user. It gives better error diagnostics than a compiler.

Differences between compiler and interpreter

1. A compiler performs the translation of a program as a whole; an interpreter performs statement-by-statement translation.
2. With a compiler, execution is faster; with an interpreter, execution is slower.
3. A compiler requires more memory, as linking is needed for the generated intermediate object code; an interpreter's memory usage is efficient, as no intermediate object code is generated.
4. With a compiler, debugging is hard, as the error messages are generated only after the entire program has been scanned; an interpreter stops translation when the first error is met, hence debugging is easy.
5. Programming languages like C and C++ use compilers; programming languages like Python, BASIC, and Ruby use interpreters.

Assembler

An assembler is a translator which is used to translate assembly language code into machine language code.
 
Introduction of Lexical Analysis

Lexical analysis is the first phase of the compiler; it is also known as scanning. It converts the high-level input program into a sequence of Tokens.

● Lexical analysis can be implemented with a Deterministic Finite Automaton.
● The output is a sequence of tokens that is sent to the parser for syntax analysis.

How the Lexical Analyzer functions:

1. Tokenization, i.e. dividing the program into valid tokens.
2. Removing white-space characters.
3. Removing comments.
4. Helping to generate error messages by providing the row number and column number of each error.

The lexical analyzer identifies errors with the help of the automaton and the grammar of the given language (such as C or C++) on which it is based, and reports the row number and column number of the error.

Suppose we pass this statement through the lexical analyzer:

a = b + c;

It will generate a token sequence like this:

id = id + id;

where each id refers to its variable in the symbol table, which holds all of its details.
What is a token?

A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language.

Examples of tokens:

● Type tokens (id, number, real, . . . )
● Punctuation tokens (IF, void, return, . . . )
● Alphabetic tokens (keywords)

Keywords; examples: for, while, if, etc.
Identifiers; examples: variable names, function names, etc.
Operators; examples: '+', '++', '-', etc.
Separators; examples: ',', ';', etc.

Examples of non-tokens:

● Comments, preprocessor directives, macros, blanks, tabs, newlines, etc.

Lexeme: The sequence of characters matched by a pattern to form the corresponding token, i.e. a sequence of input characters that comprises a single token, is called a lexeme. E.g. “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;”.

For example, consider the program

int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}

All the valid tokens are:

'int' 'main' '(' ')' '{' 'int' 'a' ','
'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'

You can observe that we have omitted the comments.

As another example, consider a printf statement such as printf("sum");. It contains 5 valid tokens: printf, (, "sum", ), and ;.

Exercise 1:

Count the number of tokens:

int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}

Answer: Total number of tokens: 27.
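One way to arrive at 27 is to enumerate the tokens line by line (the string literal "sum is :%d" counts as a single token):

int  main  (  )  {                          5 tokens
int  a  =  10  ,  b  =  20  ;               9 tokens
printf  (  "sum is :%d"  ,  a  +  b  )  ;   9 tokens
return  0  ;                                3 tokens
}                                           1 token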
Exercise 2:

Count the number of tokens:

int max(int i);

● The lexical analyzer first reads int, finds it to be valid, and accepts it as a token.
● max is then read and found to be a valid function name after reading (.
● int is also a token, then i is another token, and finally ;.

Answer: Total number of tokens is 7:

int, max, (, int, i, ), ;
Lexical Analysis – Compiler Design

Lexical analysis is the process of converting a sequence of characters from a source program into a sequence of tokens.

A program which performs lexical analysis is termed a lexical analyzer (lexer), tokenizer or scanner.

Lexical analysis consists of two stages of processing, which are as follows:

• Scanning
• Tokenization

Token, Pattern and Lexeme

Token

A token is a valid sequence of characters which is given by a lexeme. In a programming language,

• keywords,
• constants,
• identifiers,
• numbers,
• operators and
• punctuation symbols

are possible tokens to be identified.

Pattern

A pattern describes a rule that must be matched by a sequence of characters (lexemes) to form a token. It can be defined by regular expressions or grammar rules.

Lexeme

A lexeme is a sequence of characters that matches the pattern for a token, i.e. an instance of a token.

(e.g.) c=a+b*5;

Lexemes and tokens:

Lexeme    Token
c         identifier
=         assignment symbol
a         identifier
+         + (addition symbol)
b         identifier
*         * (multiplication symbol)
5         5 (number)

The sequence of tokens produced by the lexical analyzer helps the parser in analyzing the syntax of programming languages.

Role of Lexical Analyzer
 
Lexical analyzer performs the following tasks:

• Reads the source program, scans the input characters, groups them into lexemes and produces tokens as output.
• Enters the identified tokens into the symbol table.
• Strips out white space and comments from the source program.
• Correlates error messages with the source program, i.e. displays each error message with its occurrence by specifying the line number.
• Expands macros if they are found in the source program.

The tasks of the lexical analyzer can be divided into two processes:

Scanning: Performs the reading of input characters and the removal of white space and comments.
Lexical Analysis: Produces tokens as the output.

Need of Lexical Analyzer

Simplicity of design of compiler: The removal of white space and comments lets the syntax analyzer work on clean syntactic constructs.
Compiler efficiency is improved: Specialized buffering techniques for reading characters speed up the compilation process.
Compiler portability is enhanced.
Lexical Errors

• A character sequence that cannot be scanned into any valid token is a lexical error.
• Lexical errors are uncommon, but they still must be handled by a scanner.
• Misspellings of identifiers, keywords, or operators are considered lexical errors.

Usually, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.

SPECIFICATION OF TOKENS

There are 3 specifications of tokens:

1) Strings
2) Language
3) Regular expression

Strings and Languages

• An alphabet or character class is a finite set of symbols.
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
• A language is any countable set of strings over some fixed alphabet.

In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.

Operations on strings

The following string-related terms are commonly used:

1. A prefix of string s is any string obtained by removing zero or more symbols from the end of string s. For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example, nan is a substring of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.

Operations on languages

The following are the operations that can be applied to languages:

1. Union
2. Concatenation
3. Kleene closure
4. Positive closure

The following example shows the operations on languages: let L = {0, 1} and S = {a, b, c}.
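Working these out directly from the definitions gives, for instance:

L ∪ S = {0, 1, a, b, c}                      (union)
LS = {0a, 0b, 0c, 1a, 1b, 1c}                (concatenation)
L* = {ε, 0, 1, 00, 01, 10, 11, 000, ….}      (Kleene closure: all strings over {0, 1}, including ε)
L+ = L* − {ε}                                (positive closure: all non-empty strings over {0, 1})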
Regular Expressions

· Each regular expression r denotes a language L(r).
· Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote:

1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
   a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
   b) (r)(s) is a regular expression denoting the language L(r)L(s).
   c) (r)* is a regular expression denoting (L(r))*.
   d) (r) is a regular expression denoting L(r).
4. The unary operator * has the highest precedence and is left associative.
5. Concatenation has the second highest precedence and is left associative.
6. | has the lowest precedence and is left associative.
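A few small examples over the alphabet Σ = {a, b} illustrate these rules:

a | b denotes the language {a, b}
(a | b)(a | b) denotes {aa, ab, ba, bb}
a* denotes {ε, a, aa, aaa, ….}
(a | b)* denotes the set of all strings of a's and b's, including ε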
Regular set

A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s.

There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms. For instance, r|s = s|r (| is commutative), and r|(s|t) = (r|s)|t (| is associative).
 
Regular Definitions

Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

d1 → r1
d2 → r2
………
dn → rn

1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ U {d1, d2, . . . , di-1}.

Example: Identifiers form the set of strings of letters and digits beginning with a letter. A regular definition for this set:

letter → A | B | …. | Z | a | b | …. | z
digit → 0 | 1 | …. | 9
id → letter ( letter | digit )*
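This definition translates directly into C; the small helper below (hypothetical, written only for this example) checks whether a string matches letter ( letter | digit )*:

#include <ctype.h>
#include <stdio.h>

/* Returns 1 if s matches the regular definition id → letter ( letter | digit )*. */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0]))  /* must start with a letter */
        return 0;
    for (int i = 1; s[i] != '\0'; i++)  /* the rest: letters or digits */
        if (!isalnum((unsigned char)s[i]))
            return 0;
    return 1;
}

int main(void) {
    printf("%d\n", is_identifier("max1"));  /* 1: valid identifier    */
    printf("%d\n", is_identifier("1max"));  /* 0: starts with a digit */
    return 0;
}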

Shorthands

Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.

1. One or more instances (+):
- The unary postfix operator + means “one or more instances of”.
- If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.
- Thus the regular expression a+ denotes the set of all strings of one or more a’s.
- The operator + has the same precedence and associativity as the operator *.

2. Zero or one instance (?):
- The unary postfix operator ? means “zero or one instance of”.
- The notation r? is a shorthand for r | ε.
- If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) U {ε}.

3. Character Classes:
- The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
- A character class such as [a–z] denotes the regular expression a | b | c | d | …. | z.
- We can describe identifiers as strings generated by the regular expression [A–Za–z][A–Za–z0–9]*.

Non-regular Set

A language which cannot be described by any regular expression is a non-regular set. Example: the set of all strings of balanced parentheses and repeating strings cannot be described by a regular expression. This set can be specified by a context-free grammar.