
DEBREMARKOS UNIVERSITY

BURIE CAMPUS
DEPARTMENT OF COMPUTER SCIENCE
COMPILER DESIGN
Group members and ID
1. Bereket Demsew 1405717
2. Henok Getnet 1403709
3. Temesgen Adine 1404301
4. Seid Yimam 1404187
5. Nahom Fkadu 1404056
6. Eleni Geta 1405947
Submitted to: (MSc)
Introduction

Lexical analysis is the foundational phase of compiler construction, transforming raw
source code into a sequence of tokens that serve as building blocks for subsequent
compilation stages. This process involves reading the input code, breaking it into
meaningful units such as keywords, identifiers, operators, and literals, and discarding
non-essential characters like whitespace and comments. By leveraging deterministic
finite automata and regular expressions, lexical analyzers ensure efficient token
recognition and classification. These tokens are then passed to the parser, which
constructs the syntactic structure of the program.

Our assignment focuses on the role of lexical analysis in the broader context of
compiler construction. It examines the methods and tools used in tokenization, error
handling during lexical analysis, and the relationship between lexical and syntax
analysis. Examples of lexical errors, such as unrecognized characters, illegal
identifiers, and invalid numeric literals, illustrate the challenges encountered during
this phase. Furthermore, the assignment explores the importance of tools like Lex,
which automates the generation of lexical analyzers, enabling faster and more reliable
development processes. Understanding lexical analysis is pivotal for comprehending
how compilers bridge the gap between high-level programming languages and
executable machine code.
Introduction to Lexical Analysis
Lexical analysis, often referred to as scanning or tokenization, is the initial and crucial
phase of the compilation process. It acts as the bridge between the raw source code,
written in a high-level programming language, and the subsequent stages of
compilation.

Lexical analysis is the first phase of a compiler; it takes as input source code written in a high-level language. Its purpose is to read that input and break it down into meaningful elements called tokens, which become the building blocks for the other phases of compilation.

Put another way, lexical analysis breaks the program's source code into smaller parts, called tokens, that the computer can easily process. Tokens correspond to the individual words and symbols of the program, such as keywords, variable names, numbers, and punctuation. The component that performs this work is also known as a scanner, and it can be implemented with a deterministic finite automaton. The output of lexical analysis is a sequence of tokens that is sent to the parser for syntax analysis.

How a Lexical Analyzer Works


The primary objective of lexical analysis is to break down the input program into a sequence of meaningful units called tokens. It typically proceeds through the following steps.

1. Input preprocessing: This stage involves cleaning up the input text and preparing it for lexical analysis. This may include removing comments, whitespace, and other non-essential characters from the input text.
2. Tokenization: This is the process of breaking the input text into a sequence of tokens. This is usually done by matching the characters in the input text against a set of patterns or regular expressions that define the different types of tokens.
3. Token classification: In this stage, the lexer determines the type of each token. For example, in a programming language, the lexer might classify keywords, identifiers, operators, and punctuation symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each token is valid according to the rules of the programming language. For example, it might check that a variable name is a valid identifier, or that an operator has the correct syntax.
5. Output generation: In this final stage, the lexer generates the output of the lexical analysis process, which is typically a list of tokens. This list of tokens can then be passed to the next stage of compilation or interpretation.

Example:

Consider the following line of C code:


int x = 5 + 3;

The lexical analyzer would break this down into the following tokens:

int (keyword)
x (identifier)
= (assignment operator)
5 (integer literal)
+ (addition operator)
3 (integer literal)
; (semicolon)

1.2 Lexical Errors


When the token pattern does not match the prefix of the remaining input, the lexical
analyzer gets stuck and has to recover from this state to analyze the remaining input.
In simple words, a lexical error occurs when a sequence of characters does not match the pattern of any token; it is detected during the scanning phase of compilation. During the tokenization process, the lexical analyzer may encounter various errors:

 Unrecognized Characters: Characters that are not part of the defined character set of the language. For instance, encountering the character @ in a C program would be an error.
 Illegal Identifiers: Identifiers that violate the naming rules of the language, such as starting with a number or containing invalid characters.
 Unterminated Literals: Missing closing quotes for string literals. For example, "Hello, world without the closing quote.
 Invalid Numeric Literals: Using incorrect numeric formats, such as octal or hexadecimal numbers with incorrect prefixes.

Examples of Lexical Errors in C++ Code
 Unrecognized Characters
Description: Characters that are not part of the defined character set of the programming language.
Example:
cpp
#include <iostream>
using namespace std;

int main() {
    cout << "hello compiler"; $  // Error: illegal character '$' at the end
    return 0;
}
Explanation: The $ character is not recognized as valid C++ input here, so the scanner reports a lexical error.

 Illegal Identifiers
Description: Identifiers that violate naming rules, such as starting with a number or containing invalid characters.
Example:
cpp
#include <iostream>
using namespace std;

int main() {
    int 3num = 1234;  // Error: an identifier cannot start with a digit
    return 0;
}
Explanation: The identifier 3num is illegal because it starts with a digit, which the language's identifier rules do not allow.

 Unterminated Literals
Description: Missing closing quotes for string literals or unclosed comment blocks.
Example:
cpp
#include <iostream>
using namespace std;

int main() {
    /* comment
    cout << "GFG!";  // Error: comment block not closed
    return 0;
}
Explanation: The comment block is never terminated with */, so the scanner consumes the rest of the file, resulting in a lexical error.

 Invalid Numeric Literals
Description: Using incorrect numeric formats, such as invalid characters within a number.
Example:
cpp
#include <iostream>
using namespace std;

int main() {
    int x = 12$34;  // Error: '$' is not a valid character in a numeric literal
    return 0;
}
Explanation: The presence of $ within the numeric literal 12$34 is invalid, leading to a lexical error.

 Missing Characters
Description: Omitting required characters that are necessary for valid syntax.
Example:
cpp
#include <iostram>  // Error: misspelled header, should be <iostream>

using namespace std;

int main() {
    cout << "GFG!";
    return 0;
}
Explanation: If the file-inclusion line is misspelled (here <iostram> instead of <iostream>), the name cannot be resolved and the mistake is reported at this early stage of compilation.

 Transposition of Characters
Description: Swapping two adjacent characters can turn a keyword or expected name into something else entirely.
Example:
cpp
#include <iostream>
using namespace std;

int mian() {  // Error: 'mian' written instead of 'main'
    cout << "GFG!";
    return 0;
}
Explanation: The misspelling of main as mian means the program no longer defines the standard entry point of a C++ program, and the mistake is reported as an error.
Lexical Error Handling
Error Reporting: When an error is detected, the lexical analyzer should provide an
informative error message to the programmer, including the nature of the error and its
location (line and column number).

Error Recovery:
Panic Mode: A straightforward approach where the lexical analyzer discards input
characters until it encounters a recognizable token, potentially leading to incorrect
parsing.
Phrase-Level Recovery: More sophisticated techniques that attempt to correct the
input by inserting, deleting, or substituting characters based on language-specific
rules.

1.4 Components of Lexical Analysis


Regular Expressions
Regular expressions are a concise notation for describing patterns in text, playing a
crucial role in defining a language's lexical structure.

Basic Constructs:

 Literal Characters: Represent themselves directly (e.g., a, b, 0, 1).


 Character Classes: Represent a set of characters (e.g., [a-z], [0-9], [A-Z]).
Operators:

 Concatenation: Combines two or more expressions (e.g., 'a' 'b' represents the
string "ab").
 Alternation: Represents a choice between two or more expressions
(e.g., 'a' | 'b' represents either 'a' or 'b').
 Kleene Closure: Represents zero or more occurrences of an expression (e.g., 'a'*
represents any number of 'a's, including none).
 Positive Closure: Represents one or more occurrences of an expression (e.g., 'a'+
represents one or more 'a's).
 Optional: Represents zero or one occurrence of an expression (e.g., 'a'? represents
zero or one 'a').

Examples:

Identifier: [a-zA-Z_][a-zA-Z0-9]*
Integer Literal: [0-9]+
Floating-Point Literal: [0-9]+\.[0-9]+
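As a rough sketch of how such patterns classify lexemes (using C++ std::regex purely for illustration; production scanners use generated automata rather than a regex library, and the token names here are invented):

cpp
#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

int main() {
    // The patterns listed above, tried in order; the first full match wins.
    std::vector<std::pair<std::string, std::regex>> patterns = {
        {"FLOAT_LITERAL", std::regex("[0-9]+\\.[0-9]+")},
        {"INT_LITERAL",   std::regex("[0-9]+")},
        {"IDENTIFIER",    std::regex("[a-zA-Z_][a-zA-Z0-9]*")}
    };

    for (const std::string& lexeme : {"count", "42", "3.14", "_tmp"}) {
        for (const auto& p : patterns) {
            if (std::regex_match(lexeme, p.second)) {   // whole lexeme must match
                std::cout << lexeme << " -> " << p.first << '\n';
                break;
            }
        }
    }
    return 0;
}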

Finite Automata
Finite automata are abstract mathematical models that recognize languages defined by
regular expressions. They are typically represented as directed graphs, with nodes
representing states and edges representing transitions based on input symbols.

Types of Finite Automata:

 Deterministic Finite Automata (DFA): For each state and input symbol, there is
exactly one transition.
 Non-deterministic Finite Automata (NFA): For a given state and input symbol,
there may be zero, one, or multiple transitions.

1.5 Role in Lexical Analyzer Design:


Finite automata provide an efficient mechanism for recognizing and classifying
tokens in the input stream.
They can be implemented in software using state tables or other techniques.
NFAs can be more concise to represent but may be less efficient to execute; DFAs
can be efficiently implemented using lookup tables or state transition functions.
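To make the idea concrete, here is a small sketch (not taken from any particular compiler) of a DFA with an explicit transition function for the identifier pattern [a-zA-Z_][a-zA-Z0-9]*; a table-driven scanner would encode the same transitions in a lookup table indexed by state and character class:

cpp
#include <cctype>
#include <iostream>
#include <string>

// States: 0 = start, 1 = accepting (identifier seen so far), 2 = dead state.
int nextState(int state, char c) {
    bool letter = std::isalpha(static_cast<unsigned char>(c)) != 0;
    bool digit  = std::isdigit(static_cast<unsigned char>(c)) != 0;
    switch (state) {
        case 0:  return (letter || c == '_') ? 1 : 2;  // first char: letter or '_'
        case 1:  return (letter || digit)    ? 1 : 2;  // later chars: letter or digit
        default: return 2;                             // dead state absorbs everything
    }
}

bool isIdentifier(const std::string& s) {
    int state = 0;
    for (char c : s) state = nextState(state, c);
    return !s.empty() && state == 1;                   // accept only in state 1
}

int main() {
    std::cout << std::boolalpha
              << isIdentifier("x1") << ' '             // true
              << isIdentifier("3num") << '\n';         // false
    return 0;
}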

Lexical Analyzer Generators


Tools like lex (or its GNU counterpart flex) automate the generation of lexical
analyzers from regular expressions. Users provide a set of rules, each specifying a
regular expression and an action to perform when the pattern is matched.

Example: Lexical Analyzer for a Simple Language


Consider a simple language with the following tokens:
 Identifiers: Starting with a letter followed by any number of letters or digits.
 Integer Literals: A sequence of digits.
 Operators: +, -, *, /, =.

 Keywords: if, else, while.
We can define the following regular expressions:
Identifier: [a-zA-Z][a-zA-Z0-9]*
Integer Literal: [0-9]+
Operators: [+-]|[*]|[/]|[=]
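To show what a generated scanner does in effect, below is a hand-written C++ sketch for exactly this token set; the Token type and enum names are invented for the example, and a tool such as Lex would produce the equivalent recognizer automatically from the rules above.

cpp
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <vector>

enum class TokenKind { Keyword, Identifier, IntLiteral, Operator };

struct Token {
    TokenKind kind;
    std::string lexeme;
};

std::vector<Token> scan(const std::string& src) {
    static const std::vector<std::string> keywords = {"if", "else", "while"};
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = src[i];
        if (std::isspace(c)) { ++i; continue; }                       // skip whitespace
        if (std::isalpha(c)) {                                        // identifier or keyword
            std::size_t start = i;
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            std::string word = src.substr(start, i - start);
            bool kw = std::find(keywords.begin(), keywords.end(), word) != keywords.end();
            tokens.push_back({kw ? TokenKind::Keyword : TokenKind::Identifier, word});
        } else if (std::isdigit(c)) {                                 // integer literal
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            tokens.push_back({TokenKind::IntLiteral, src.substr(start, i - start)});
        } else if (std::string("+-*/=").find(static_cast<char>(c)) != std::string::npos) {
            tokens.push_back({TokenKind::Operator, std::string(1, static_cast<char>(c))});
            ++i;
        } else {                                                      // lexical error
            std::cerr << "Lexical error: unrecognized character '" << c << "'\n";
            ++i;                                                      // panic-mode skip
        }
    }
    return tokens;
}

int main() {
    for (const Token& t : scan("while x1 = y + 42"))
        std::cout << t.lexeme << '\n';
    return 0;
}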

Syntax Analysis
Syntax analysis, or parsing, is the phase of the compilation process that follows
lexical analysis. It involves recognizing the syntactic structure of the input program
according to the defined grammar of the programming language. This phase
transforms the sequence of tokens generated by the lexical analyzer into a structured
representation, often in the form of a parse tree or abstract syntax tree (AST).

Implementing the Parser

The primary objective of syntax analysis is to check whether the sequence of tokens
forms a valid statement or program according to the grammar rules of the language.
The parser examines the tokens and constructs a parse tree, which represents the
hierarchical structure of the input.

Steps in Syntax Analysis:

Input Tokens: The parser receives a sequence of tokens from the lexical analyzer.

Grammar Definition: The parser uses a defined grammar to check the syntactic
structure of the input.

Parse Tree Construction: The parser constructs a parse tree that organizes the tokens
according to their grammatical relationships.

Constructing Parse Trees

A parse tree is a tree structure that represents the syntactic structure of a sequence of tokens based on the grammar of the language. Each internal node of the tree represents a grammar rule, and the leaf nodes represent the tokens.

Example:

For the expression int x = 5 + 3;, a simplified parse tree might look like this:

Root Node (Assignment): Represents the entire assignment operation.

Type Node: Represents the type of the variable being declared (int).

Identifier Node: Represents the variable name (x).

Expression Node: Represents the expression on the right side of the assignment.

Plus Node (+): Indicates the addition operation.

Leaf Nodes (5 and 3): Represent the operands of the addition.

This structure clearly shows how the components of the statement are hierarchically
organized.
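Sketched as indented text, that hierarchy might look like this (the layout is purely illustrative):

plaintext
Assignment
├── Type: int
├── Identifier: x
└── Expression
    └── +
        ├── 5
        └── 3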

Different Parsing Techniques

Top-Down Parsing: The parser starts from the root and works its way down to the
leaves, trying to find a leftmost derivation of the input. Example: Recursive descent
parsing is a common top-down parsing method.

Bottom-Up Parsing: The parser starts from the leaves and works its way up to the
root, attempting to reduce the input to the start symbol of the grammar.

Example: LR parsing is a common bottom-up parsing technique.

Parsing Techniques

LL(1) Parsing:

A top-down parsing technique that uses a single lookahead token to make parsing
decisions. The "LL" stands for "Left-to-right scanning of the input and Leftmost
derivation."

Advantages: Simple to implement and efficient for a subset of grammars.

LR(1) Parsing:

Description: A bottom-up parsing technique that also uses a single lookahead token. The "LR" stands for "Left-to-right scanning of the input and Rightmost derivation" (constructed in reverse).

Advantages: More powerful than LL(1) and capable of handling a larger class of grammars.

Recursive Descent Parsing:

Description: A top-down parsing approach that uses a set of recursive procedures to process the input. Each non-terminal in the grammar has a corresponding function in the parser.

Advantages: Easy to implement and understand for simple grammars.
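A minimal recursive descent sketch for a tiny expression grammar is shown below; for brevity it reads single-character tokens directly from a string and evaluates while parsing instead of building a tree, which is an assumption of this example rather than how a full compiler would structure it.

cpp
#include <cctype>
#include <iostream>
#include <stdexcept>
#include <string>

// Grammar (one procedure per non-terminal):
//   expr   -> term   (('+' | '-') term)*
//   term   -> factor (('*' | '/') factor)*
//   factor -> DIGIT
struct Parser {
    std::string input;
    std::size_t pos;

    char peek() const { return pos < input.size() ? input[pos] : '\0'; }
    char advance()    { return input[pos++]; }

    int factor() {
        if (!std::isdigit(static_cast<unsigned char>(peek())))
            throw std::runtime_error("expected a digit");
        return advance() - '0';
    }
    int term() {
        int value = factor();
        while (peek() == '*' || peek() == '/')
            value = (advance() == '*') ? value * factor() : value / factor();
        return value;
    }
    int expr() {
        int value = term();
        while (peek() == '+' || peek() == '-')
            value = (advance() == '+') ? value + term() : value - term();
        return value;
    }
};

int main() {
    Parser p{"5+3*2", 0};
    std::cout << p.expr() << '\n';   // prints 11: '*' binds tighter than '+'
    return 0;
}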

2.3 Ambiguous Grammar

Ambiguity in Grammar

A grammar is said to be ambiguous if there exists a string that can be generated by the
grammar in more than one way, leading to different parse trees. Ambiguity can cause
confusion in the parsing process as the parser cannot determine the intended structure.

Example of Ambiguous Grammar:

E → E + E | E * E | id

For the expression id + id * id, there are two valid parse trees, leading to different
interpretations of operator precedence.

Handling Ambiguous Grammar

Refactor the Grammar: Rewrite the grammar rules to eliminate ambiguity (see the rewritten grammar after this list).

Use precedence rules to define the order of operations clearly.

Use Disambiguation Techniques: Implement rules that specify how to resolve ambiguity when it occurs.

Parse Tree Prioritization: Adopt a specific strategy for generating parse trees that
adhere to the desired grammar, ensuring consistent interpretations.

Error Handling: Implement error recovery techniques within the parser to gracefully
handle ambiguities when encountered.
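To make the refactoring option concrete, the standard textbook rewrite of the expression grammar above encodes precedence directly in the rules:

E → E + T | T
T → T * F | F
F → id

With these rules, id + id * id has exactly one parse tree, in which * binds more tightly than +.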

3.0 Semantic Analysis and Context-Sensitive Analysis


Semantic analysis, also known as context-sensitive analysis, is a crucial phase in the
compilation process that follows syntax analysis. Its primary purpose is to ensure that
the program adheres to the semantic rules of the programming language, thereby
validating the meaning of the code. This phase identifies logical errors that cannot be
detected during syntax analysis, ensuring that each construct behaves as intended.

Semantic analysis focuses on verifying that the syntax tree generated during the
syntax analysis phase is meaningful. This involves checking various aspects of the
program, such as type compatibility, scope resolution, and variable declarations. The
goal is to ensure that the code not only follows the syntactic rules but also adheres to
the semantic rules of the language.

Key Objectives of Semantic Analysis

Type Checking: Ensuring operations are performed on compatible data types.

Scope Resolution: Verifying that variables and functions are declared and accessible
in their respective scopes.

Error Detection: Identifying and reporting semantic errors before the code
generation phase, reducing the potential for runtime errors.

3.1 Type Checking


Type checking is a fundamental component of semantic analysis that verifies the
compatibility of data types used in expressions, assignments, and function calls. This
process is essential for ensuring that the operations performed in the program are
logically valid.

Type Expressions

Definition: A type expression defines the data type of variables, constants, and
expressions in a program, such as integers, floats, or strings.

Purpose: Type expressions allow the compiler to understand the kind of data being
manipulated, enabling it to enforce rules about how different types can interact.

Types of Variables

Declaration: Variables must be declared with a specific type before use, specifying
the type of data they can hold and the operations allowed.

Examples of Variable Types:

Primitive Types: Basic data types like:

int: Represents integer values.

float: Represents floating-point numbers.

char: Represents single characters.

Composite Types: More complex types, such as:

Arrays: Collections of elements of the same type.

Structures: Grouping of different types under a single name.

User-Defined Types: Types defined by the user, like classes in object-oriented languages.

Mismatch Errors

Type mismatch errors occur when operations involve incompatible types. The
semantic analyzer checks for these errors during compilation, ensuring that variables
and expressions are used in a way that adheres to their defined types.

Common Examples of Mismatch Errors:

Arithmetic Operations:

Errors occur when attempting to perform arithmetic on incompatible types.

cpp

int a = 5;

string b = "10";

int result = a + b; // Error: Type mismatch, cannot add int and string

Function Calls:

Passing arguments with incorrect types to functions can lead to errors.

cpp

void func(int x) { /* ... */ }

func("Hello"); // Error: Type mismatch, expected int but got string

Assignment Statements:

Assigning a value of one type to a variable of another incompatible type can trigger
an error.

cpp

float pi = 3.14;

int integerValue = pi; // Error: Type mismatch, cannot assign float to int

Importance of Type Checking

Early Error Detection: By catching type errors during compilation, semantic analysis
prevents runtime errors that could lead to program crashes or incorrect behavior.

Improved Code Quality: Enforcing type rules leads to clearer and more predictable
code since developers must adhere to defined types.

Facilitates Optimization: Knowing the types of variables allows for better optimization strategies during subsequent compilation phases.

Example of Type Checking Process

Consider a simple program snippet:

cpp

int a = 5;

float b = 2.5;

int result = a + b; // Error: Type mismatch, cannot assign float to int

Type Checking Steps:

The compiler sees a as an integer and b as a float.

During the expression evaluation for result, the compiler attempts to add an integer
and a float.

The semantic analyzer identifies that the operation is invalid due to type
compatibility rules.

An error is reported, indicating the mismatch.

4.0 Intermediate Code Generation and Code Optimization
Intermediate code generation and code optimization are essential phases in the
compilation process that transform high-level source code into an efficient
representation suitable for execution. These phases bridge the gap between high-level
language constructs and the low-level machine code, allowing for further analysis and
optimization.

4.1 Purpose of Intermediate Code


Intermediate code serves several purposes in the compilation process:

Abstraction: It provides a level of abstraction that is independent of the source and target languages, making it easier to optimize and transform.

Portability: The same intermediate representation can be used to generate machine code for different architectures, enhancing the compiler's portability.

Simplification: Complex high-level constructs can be broken down into simpler operations, making it easier to analyze and optimize.

Types of Intermediate Representations

Three-Address Code (TAC):

Each instruction consists of at most three operands, allowing for clear representation
of operations.

Example:

Plaintext

t1 = a + b

t2 = t1 * c

result = t2 + d

Abstract Syntax Trees (AST):

A tree representation that captures the hierarchical structure of the source code,
where nodes represent language constructs.

Intermediate Code Generation Process

Steps in Intermediate Code Generation

1. Parsing the source code: The parser generates a syntax tree from the input source code.
2. Traversing the syntax tree: The compiler traverses the syntax tree to generate intermediate code; each node in the syntax tree corresponds to an operation or a construct in the intermediate representation.
3. Generating intermediate instructions: Intermediate instructions are created based on the operations defined in the syntax tree. The instructions are typically in the form of TAC or another intermediate representation.

Example of Intermediate Code Generation

Consider the following high-level code:

cpp

int a = 5;

int b = 10;

int result = a + b * 2;

The intermediate code generation process might produce the following three-address code:

plaintext

t1 = b * 2 // Multiply b by 2

t2 = a + t1 // Add a and the result of t1

result = t2 // Store the result

4.2 Code Optimization

Once the intermediate representation is generated, various optimization techniques can be applied to improve the efficiency of the generated code.

Purpose of Code Optimization

Performance Improvement: Optimized code runs faster and consumes fewer resources.

Memory Efficiency: Optimizations can reduce memory usage and improve cache
performance.

Reduced Code Size: Code optimization can lead to smaller binaries, which is
beneficial for resource-constrained environments.

Types of Code Optimization

Local Optimizations:

These optimizations improve code within a basic block, which is a sequence of consecutive statements with no jumps or branches.

Examples:

Constant Folding: Pre-computing constant expressions at compile time.

plaintext
t1 = 3 + 4   // optimized at compile time to t1 = 7

Global Optimizations:

These optimizations consider multiple basic blocks and optimize across function calls
and control flow.

Examples:

Inlined Function Calls: Replacing a function call with the actual code of the function
to avoid the overhead of a call.

Loop Unrolling: Expanding the loop body to reduce the number of iterations and
increase the instruction-level parallelism.
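A small hand-written illustration of loop unrolling is given below; real compilers apply this on the intermediate representation rather than the source, so this is only a sketch of the effect.

cpp
#include <iostream>

int main() {
    int a[4] = {1, 2, 3, 4};

    // Original loop: four iterations, four loop-condition checks and branches.
    int sum1 = 0;
    for (int i = 0; i < 4; ++i)
        sum1 += a[i];

    // Fully unrolled version: no loop overhead, and the independent
    // additions expose more instruction-level parallelism.
    int sum2 = a[0] + a[1] + a[2] + a[3];

    std::cout << sum1 << ' ' << sum2 << '\n';   // both print 10
    return 0;
}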

Abstract Syntax Trees in Compiler Construction

An Abstract Syntax Tree (AST) is a crucial data structure in compiler construction, representing the hierarchical syntactic structure of source code. Unlike concrete
syntax trees (or parse trees), which retain all syntactic details, ASTs abstract away
certain elements to focus on the essential structure of the code, making them more
suitable for further processing in the compilation pipeline.

An AST is a tree where each node represents a construct occurring in the source code.
The nodes can represent various elements such as:
-Operators (e.g., arithmetic operations)
-Operands (e.g., variables, constants)
-Control structures (e.g., loops, conditionals)
For example, the expression 4 + 2 × 10 + 3 × (5 + 1) would be represented in an AST by structuring the operations according to their precedence, without retaining unnecessary syntactic details like parentheses.

An abstract syntax tree (or AST) is a tree-shaped representation of source code that is convenient for a compiler to operate on. A compiler might represent an expression like 4 + 2 * 10 + 3 * (5 + 1) using a tree structure like this:

[Figure: abstract syntax tree for the expression 4 + 2 * 10 + 3 * (5 + 1)]

Differences Between AST and Parse Tree


The key differences between an AST and a parse tree include:
 Detail Level: Parse trees include all syntactic details, such as parentheses and
every rule from the grammar, while ASTs omit these to focus on the logical
structure.
 Node Representation: In an AST, operators are internal nodes, and operands
are leaf nodes. This contrasts with parse trees where every part of the syntax is
represented explicitly.
 Efficiency: ASTs are more efficient for later stages of compilation because
they reduce complexity and focus on the relationships between constructs
rather than their syntactic form.

Construction of AST

ASTs are typically generated during the syntax analysis phase of compilation. After lexical analysis breaks down source code into tokens, these tokens are parsed into a syntax tree, which is then transformed into an AST. This transformation often involves removing unnecessary nodes that do not contribute to the semantic meaning of the program.
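The following is a minimal sketch of how such a tree could be represented in C++ (node layout and helper names are invented for illustration); building 4 + 2 * 10 shows how precedence is captured by the tree shape rather than by parentheses.

cpp
#include <iostream>
#include <memory>

// Illustrative AST node: either a numeric literal or a binary operator
// with two children. Parentheses from the source are not stored; the
// tree shape alone encodes grouping and precedence.
struct Node {
    char op = 0;                       // '+', '*', ... ; 0 means literal
    int value = 0;                     // used only when op == 0
    std::unique_ptr<Node> left, right;
};

std::unique_ptr<Node> lit(int v) {
    auto n = std::make_unique<Node>();
    n->value = v;
    return n;
}

std::unique_ptr<Node> bin(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->op = op;
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

int eval(const Node& n) {              // a post-order walk over the tree
    if (n.op == 0) return n.value;
    int l = eval(*n.left), r = eval(*n.right);
    return n.op == '+' ? l + r : l * r;
}

int main() {
    // AST for 4 + 2 * 10: the '*' node sits below the '+' node.
    auto tree = bin('+', lit(4), bin('*', lit(2), lit(10)));
    std::cout << eval(*tree) << '\n';  // prints 24
    return 0;
}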
Applications of Abstract Syntax Trees
ASTs serve multiple purposes in compiler design:
1. Intermediate Representation: They act as an intermediate representation of source code that can be used for various compiler phases, including optimization and code generation.
2. Code Analysis: Tools for static code analysis utilize ASTs to detect errors and enforce coding standards without executing the program.
3. Code Generation: During the final stages of compilation, ASTs are traversed to generate target machine code or intermediate representations like three-address code.

Symbol Tables in Compiler Construction

Symbol tables are a fundamental data structure used in compiler construction, serving
as a repository for information about identifiers in a program. This includes details
about variables, functions, objects, and other entities that are crucial for the
compilation process.

Purpose of Symbol Tables


Symbol tables play several critical roles in the compilation process:

 Storage of Identifiers: They maintain records of all identifiers used in the source
code, including their names, types, scopes, and memory locations.
 Error Checking: Symbol tables facilitate semantic checks, ensuring that
identifiers are used correctly according to the programming language's rules. For
instance, they verify that variables are declared before use and that assignments
are type-compatible.
 Scope Management: They help manage variable scopes, allowing compilers to
differentiate between identifiers with the same name in different contexts (e.g.,
local vs. global variables).
 Code Optimization: By providing quick access to identifier information, symbol
tables assist in optimizing code for better runtime efficiency.
Structure of Symbol Tables
Symbol tables can be implemented using various data structures, including:
 Linear Lists: Simple but inefficient for large datasets.
 Hash Tables: Provide faster access times for lookups and are commonly used due to their efficiency.
 Trees: Useful for managing hierarchical scopes and complex relationships among identifiers.
Each entry in a symbol table typically contains:
 Symbol Name: The identifier's name.
 Type: The data type of the identifier (e.g., integer, float).
 Scope: The scope in which the identifier is valid (e.g., local or global).
 Memory Address: The location in memory where the identifier is stored.
 Additional Attributes: Information such as whether a variable is initialized or whether a function is public or private.
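A rough sketch of such an entry in C++, keyed by name in a hash table (field names are illustrative and not taken from any specific compiler):

cpp
#include <iostream>
#include <string>
#include <unordered_map>

// One symbol-table entry, mirroring the fields listed above.
struct SymbolEntry {
    std::string name;        // identifier name
    std::string type;        // e.g. "int", "float", "func"
    std::string scope;       // e.g. "global", "local"
    unsigned address;        // assigned memory location / offset
    bool initialized;        // example of an additional attribute
};

int main() {
    // Hash table keyed by identifier name for fast average-case lookup.
    std::unordered_map<std::string, SymbolEntry> table;
    table["pi"]     = {"pi",     "float", "global", 0x1004, true};
    table["radius"] = {"radius", "float", "local",  0x2000, false};

    auto it = table.find("pi");
    if (it != table.end())
        std::cout << it->second.name << " : " << it->second.type
                  << " @ 0x" << std::hex << it->second.address << '\n';
    return 0;
}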

Interaction with Compiler Phases


Symbol tables are utilized across various phases of the compilation process:
1. Lexical Analysis: During this phase, new entries are created for tokens identified
in the source code.
2. Syntax Analysis: Additional information about types and scopes is added to
existing entries.
3. Semantic Analysis: The compiler checks for semantic errors using the
information stored in the symbol table.
4. Intermediate Code Generation: Symbol tables provide necessary data for
generating intermediate representations of the code.
5. Code Generation: The final machine code generation phase uses symbol tables to allocate memory for variables and functions.

Example of a Symbol Table Entry
Consider a simple program with various identifiers:

Name            Type    Scope    Address    Attributes
distance        float   Global   0x1000     Uninitialized
pi              float   Global   0x1004     Constant, read-only
calculateArea   func    Global   0x1008     Returns float
radius          float   Local    0x2000     Parameter of calculateArea

This table illustrates how each identifier's attributes are stored to facilitate efficient compilation and error checking.

Three Address Code

Three address code is an intermediate code used by optimizing compilers whereby a given expression is broken down into several separate instructions which can be easily translated into assembly language.

A statement involving no more than three references (two for operands and one for the result) is known as a three address statement. A sequence of three address statements is known as three address code. A three address statement is of the form x = y op z, where x, y, z have addresses (memory locations). Sometimes a statement might contain fewer than three references, but it is still called a three address statement.

For example, a source language expression a + b * c is translated into a sequence of three-address instructions as follows:

t1 = b * c

t2 = a + t1

here t1 and t2 are compiler generated temporary names.

Three address code represents multi-operator arithmetic expressions and nested flow-of-control statements explicitly, which makes it useful for generating and optimizing target code.

We can also view three-address code as a linearized representation of a syntax tree or a directed acyclic graph (DAG), in which temporary names correspond to the interior nodes of the DAG, as shown below.

[Figure: DAG for the expression a + a * (b - c) + (b - c) * d]

Example: the three address code for the expression a + b * c + d is

T1 = b * c
T2 = a + T1
T3 = T2 + d

where T1, T2, T3 are temporary variables.

General Form
In general, Three Address instructions are represented as-

a = b op c
Here,

 a, b and c are the operands.


 Operands may be constants, names, or compiler generated temporaries.
 op represents the operator.

Example-

 a=b+c
 c=a*b

Addresses and instructions

 Three-address code involves addresses and instructions.


 In object-oriented terms addresses will correspond to classes while instructions
correspond to the appropriate subclass.
 An address can be a name, a constant, or a compiler-generated temporary. In a sequence of instructions, symbolic labels represent the index of a three-address instruction.
 These symbolic labels are used by instructions which alter the flow of control.
Labels can be used in place of actual indexes by making a separate pass or
by back patching.

Applications of Three-Address Code

1. Optimization
TAC is extensively used during the optimization phases of compilation. By breaking
down complex expressions, compilers can analyze and improve performance by
eliminating redundant calculations and optimizing resource usage.
2. Code Generation
During the code generation phase, TAC serves as an intermediary that allows
compilers to produce machine-specific code while ensuring correctness and
efficiency. This step is vital for adapting high-level language constructs to low-level
machine instructions.
3. Debugging
Three-address code aids in debugging by providing a more readable representation
than final machine code. Developers can trace program execution through TAC to
identify errors or issues more easily.
4. Language Translation
TAC facilitates the translation of code between different programming languages by
providing a common intermediate representation that simplifies the conversion
process.
Types of Three-Address Code Representations
1. Quadruples
Each instruction is represented as a tuple containing four fields: operator, first
operand, second operand, and result.
2. Triples
In this representation, each instruction is stored as three fields without explicit results;
instead, results are referenced by their position in the list of instructions.
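For instance, the expression a + b * c could be stored in the two forms roughly as follows (the field layout shown is the usual textbook one):

plaintext
Quadruples:                      Triples:
(0)  ( *, b, c,  t1 )            (0)  ( *, b, c   )
(1)  ( +, a, t1, t2 )            (1)  ( +, a, (0) )   // result of (0) referenced by position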

Stack Machines in Compiler Construction

Stack machines are a fundamental execution model used in compiler construction, particularly valued for their simplicity and efficiency in executing programs. They operate on
the principle of a last-in, first-out (LIFO) stack, where operands are pushed onto the
stack and results are popped off. This model simplifies the design of compilers and
the execution of code.

Key Features of Stack Machines
Uniform Operand Handling:
In stack machines, all operations take their operands from the top of the stack. This
implicit addressing means that there is no need to specify operand locations explicitly,
which leads to smaller instruction encodings and more compact programs.
Simplified Code Generation:
The architecture allows for straightforward compilation strategies. For example, a
compiler can generate code that directly corresponds to stack operations without needing
complex register allocation. This uniformity reduces the complexity of the compiler
design.
Compact Instruction Set:
Stack machines typically have a denser instruction set compared to register-based
architectures. Instructions can often fit within fewer bits, allowing for better cache
efficiency and reduced memory costs.
Optimizations:
While stack machines require more instructions to perform operations (e.g., every
variable load requires a separate Load instruction), optimizations can be applied at both
the compiler and backend levels to improve performance. Techniques such as keeping
frequently accessed stack elements in registers (accumulators) can reduce memory access
time.
Architecture of Stack Machines
A typical stack machine consists of several components:
 Main Memory: Stores both code and data.
 Program Counter (PC): Points to the next instruction to execute.
 Stack Pointer (SP): Indicates the top of the stack.
 Instruction Register (IR): Holds the currently executing instruction.
 Temporary Register: Used for intermediate calculations.

Example Instructions
Common instructions in stack machines include:
 PushImm n: Pushes an immediate value n onto the stack.
 PushAbs x: Pushes the value of variable x onto the stack.
 Pop x: Pops the top value from the stack into variable x.
 Add: Pops the top two values off the stack, adds them, and pushes the result back
on.
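Using only these instructions, the assignment x = a + 5 might be compiled to the following sketch:

plaintext
PushAbs a     // push the current value of a
PushImm 5     // push the constant 5
Add           // pop both, push a + 5
Pop x         // store the result into x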

Advantages of Stack Machines

1. Simplicity: The architecture is easier to understand and implement compared to
more complex register-based systems.
2. Ease of Compilation: Compiling high-level languages into stack machine code
is generally simpler due to fewer concerns about register management.
3. Cost Efficiency: The design allows for simpler and cheaper CPUs, making them
suitable for various applications such as Java Virtual Machine (JVM) and
WebAssembly.
Challenges with Stack Machines
Despite their advantages, stack machines can face challenges such as:
 Higher instruction counts due to separate load instructions for each variable.
 Potential errors in program execution if not managed properly, particularly
regarding operand arity.

Basic Software Tools Used in Compiler Construction

Lexical Analyzer Generators in Compiler Construction

Lexical analyzer generators are essential tools in compiler construction that automate
the creation of lexical analyzers, also known as scanners. These tools take high-level
specifications of tokens defined by regular expressions and generate code that can
recognize and process these tokens in source code.
Purpose and Functionality
1. Tokenization:
The primary function of a lexical analyzer is to read the input source code and convert
it into a sequence of tokens. Tokens are meaningful sequences of characters that
represent various elements of the programming language, such as keywords,
identifiers, literals, and operators.
2. Pattern Recognition:
Lexical analyzer generators allow developers to define patterns for tokens using
regular expressions. The generator then constructs a finite automaton (FA) that can
recognize these patterns efficiently during the scanning process.
3. Ease of Use:
By using a generator, developers can create complex lexical analyzers without having
to manually implement the underlying algorithms for token recognition. This
significantly reduces development time and complexity.

Common Lexical Analyzer Generators
Several widely used lexical analyzer generators include:
 Lex: One of the most popular tools, Lex allows users to define tokens through
regular expressions and automatically generates C code for the scanner.
 Flex: An enhanced version of Lex, Flex provides additional features and better
performance while maintaining compatibility with Lex syntax.
 ANTLR: Although primarily a parser generator, ANTLR also includes
capabilities for generating lexical analyzers, allowing for seamless integration
between parsing and lexical analysis.
Structure of a Lexical Analyzer Generator
Typically, a lexical analyzer generator operates based on a structured input file
divided into three sections:
Declarations:
This section defines the tokens and their corresponding regular expressions.
Translation Rules:
Here, actions associated with each token are specified, such as what to do when a
token is recognized (e.g., returning it to the parser).
Auxiliary Functions:
This optional section can include additional helper functions that support the main
tokenization process.
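A sketch of that three-section layout for a Lex/Flex input file is shown below; the patterns are placeholders, and the token codes INT_LITERAL and IDENTIFIER are assumed to be defined elsewhere, not part of any real specification.

lex
%{
/* declarations: C headers and token-code definitions */
%}
DIGIT    [0-9]
LETTER   [a-zA-Z]
%%
{DIGIT}+                      { return INT_LITERAL; }
{LETTER}({LETTER}|{DIGIT})*   { return IDENTIFIER; }
%%
/* auxiliary helper functions used by the actions */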
Advantages of Using Lexical Analyzer Generators
 Efficiency in Development: Automating the generation of lexical analyzers
saves significant time compared to manual coding.
 Consistency and Reliability: Generated analyzers are typically more consistent
and less prone to errors than hand-written code.
 Adaptability: Changes in language specifications can be easily accommodated
by modifying the input definitions rather than rewriting substantial portions of
code.
Challenges
While lexical analyzer generators provide numerous benefits, they also come with
challenges:
 Learning Curve: Understanding how to effectively use these tools requires
familiarity with regular expressions and the specific syntax of the generator.
 Performance Considerations: Depending on how they are implemented,
generated analyzers may not always be as optimized as hand-crafted solutions
tailored for specific applications.

Parser Generators in Compiler Construction
Parser generators are essential tools in compiler construction, specifically designed to
automate the creation of parsers, which are components responsible for syntax
analysis. They take a formal grammar as input and produce source code for a parser
that can process strings based on that grammar. This process is crucial in transforming
high-level programming languages into a format that can be understood and executed
by machines.

Functionality of Parser Generators


 Input Grammar: Parser generators require a formal description of the language's
syntax, typically provided in the form of context-free grammars (CFG). This
grammar defines the rules for constructing valid strings in the language.
 Output Parser: The generator produces a parser, often implemented in a
programming language like C, Java, or Python. This parser can then analyze
source code written in the specified language, building parse trees or abstract
syntax trees (ASTs) as it processes tokens.
 Error Handling: Many parser generators include built-in mechanisms for error
detection and reporting, allowing developers to identify and address syntax errors
during the parsing phase.
Types of Parser Generators
 Top-Down Parsers: These parsers begin with the highest-level rule and attempt
to rewrite it into the input string. Examples include recursive descent parsers and
LL parsers.
 Bottom-Up Parsers: These parsers start with the input string and attempt to
reduce it to the start symbol of the grammar. Common examples are LR parsers
and SLR parsers.
Advantages of Using Parser Generators
 Efficiency: They automate tedious aspects of parser creation, reducing
development time and minimizing human error.
 Flexibility: Developers can easily modify the grammar and regenerate the parser
without manually rewriting parsing logic.
 Standardization: Many parser generators follow established conventions,
making it easier for developers to understand and maintain code.
Popular Parser Generators
1. Yacc (Yet Another Compiler Compiler): A widely used tool that generates
LALR parsers from a given context-free grammar.
2. Bison: An extension of Yacc that provides additional features and improved
functionality.

3. ANTLR (Another Tool for Language Recognition): A powerful parser
generator that supports multiple languages and offers advanced features like tree
construction and visitor patterns.

CONCLUSION

In conclusion, lexical analysis is a cornerstone of compiler construction, playing a pivotal role in converting source code into a structured and analyzable format. This
phase not only ensures that the raw input adheres to the basic syntax rules of the
programming language but also prepares the groundwork for more complex analyses
in subsequent stages. Through tokenization and error handling, lexical analysis
enables the compiler to detect and report foundational errors early in the compilation
process.

Our assignment underscores the importance of lexical analysis in ensuring the overall
efficiency and accuracy of compilers. By studying tools like Lex, we observe how
automation enhances the reliability and speed of developing lexical analyzers. The
ability to process and classify tokens systematically allows compilers to handle
diverse programming constructs effectively. Ultimately, lexical analysis contributes
significantly to the seamless translation of high-level code into executable machine
instructions, highlighting its indispensable role in modern software development.
