Compiler Group 4
BURIE CAMPUS
DEPARTMENT OF COMPUTER SCIENCE
COMPILER DESIGN
Group members and IDs
1 Bereket Demsew 1405717
2 Henok Getnet 1403709
3 Temesgen Adine 1404301
4 Seid Yimam 1404187
5 Nahom Fkadu 1404056
6 Eleni Geta 1405947
Submitted to: (MSc)
Introduction
Our assignment focuses on the role of lexical analysis in the broader context of
compiler construction. It examines the methods and tools used in tokenization, error
handling during lexical analysis, and the relationship between lexical and syntax
analysis. Examples of lexical errors, such as unrecognized characters, illegal
identifiers, and invalid numeric literals, illustrate the challenges encountered during
this phase. Furthermore, the assignment explores the importance of tools like Lex,
which automates the generation of lexical analyzers, enabling faster and more reliable
development processes. Understanding lexical analysis is pivotal for comprehending
how compilers bridge the gap between high-level programming languages and
executable machine code.
Introduction of Lexical Analysis
Lexical analysis, often referred to as scanning or tokenization, is the initial and crucial
phase of the compilation process. It acts as the bridge between the raw source code,
written in a high-level programming language, and the subsequent stages of
compilation.
Lexical analysis is the first phase of a compiler; it takes as input source code written in a
high-level language. Its purpose is to read the input code and break it down into
meaningful elements called tokens, which become the building blocks for the later phases
of compilation.
Lexical analysis is the process of breaking down the source code of the program into
smaller parts, called tokens, that a computer can easily process. These tokens can be
individual words or symbols, such as keywords, variable names, numbers, and
punctuation. The component that performs this task is also known as a scanner. Lexical
analysis can be implemented with deterministic finite automata. The output of lexical
analysis is a sequence of tokens that is sent to the parser for syntax analysis. The process
typically involves the following stages:
1. Input preprocessing: This stage involves cleaning up the input text and preparing it
for lexical analysis. This may include removing comments, whitespace, and other
non-essential characters from the input text.
2. Tokenization: This is the process of breaking the input text into a sequence of
tokens. This is usually done by matching the characters in the input text against a
set of patterns or regular expressions that define the different types of tokens.
3. Token classification: In this stage, the lexer determines the type of each token.
For example, in a programming language, the lexer might classify keywords,
identifiers, operators, and punctuation symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each token is valid
according to the rules of the programming language. For example, it might check
that a variable name is a valid identifier, or that an operator has the correct
syntax.
5. Output generation: In this final stage, the lexer generates the output of the
lexical analysis process, which is typically a list of tokens. This list of tokens can
then be passed to the next stage of compilation or interpretation.
Example:
For the statement int x = 5 + 3;, the lexical analyzer would break the input down into the following tokens:
int (keyword)
x (identifier)
= (assignment operator)
5 (integer literal)
+ (addition operator)
3 (integer literal)
; (semicolon)
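To make the stages above concrete, the following is a minimal hand-written scanner sketch for this statement. The Token structure and the tokenize function are illustrative names, not part of any standard API; a real lexer would classify operators individually and handle many more cases.
cpp
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// A token pairs a classification with the matched lexeme.
struct Token {
    std::string type;
    std::string lexeme;
};

std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }            // skip whitespace
        if (std::isalpha(c) || c == '_') {                 // keyword or identifier
            std::string word;
            while (i < src.size() && (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                word += src[i++];
            tokens.push_back({word == "int" ? "keyword" : "identifier", word});
        } else if (std::isdigit(c)) {                      // integer literal
            std::string num;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                num += src[i++];
            tokens.push_back({"integer literal", num});
        } else {                                           // single-character symbol
            tokens.push_back({"symbol", std::string(1, src[i])});
            ++i;
        }
    }
    return tokens;
}

int main() {
    for (const Token& t : tokenize("int x = 5 + 3;"))
        std::cout << t.lexeme << " (" << t.type << ")\n";
    return 0;
}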
Lexical errors occur when the scanner encounters input that cannot form a valid token. Common examples include:
Unrecognized Characters: Characters that are not part of the defined character
set of the language. For instance, encountering the character @ in a C program
would be an error.
Unterminated Literals: Missing closing quotes for string literals. For example,
"Hello, world without the closing quote.
Examples of Lexical Errors in C++ Code
Unrecognized Characters
Description: Characters that are not part of the defined character set of the
programming language.
Example:
cpp
#include <iostream>
using namespace std;

int main() {
    cout << "hello compiler";$  // Error: illegal character '$' after the statement.
    return 0;
}
Explanation: The $ character is not recognized as valid syntax in C++, leading to a
lexical error.
Illegal Identifiers
#include <iostream>
using namespace std;

int main() {
    int 3num = 1234;  // Error: an identifier cannot start with a number.
    return 0;
}
Explanation: The identifier 3num is illegal because it starts with a digit, which violates the
language's rules for forming identifiers.
Unterminated Comments
#include <iostream>
using namespace std;

int main() {
    /* comment
    cout << "GFG!";  // Error: comment block not closed.
    return 0;
}
Explanation: The comment block is not properly terminated with */, resulting in a
lexical error.
Invalid Numeric Literals
#include <iostream>
using namespace std;

int main() {
    int x = 12$34;  // Error: '$' is not a valid character in a numeric literal.
    return 0;
}
Explanation: The presence of $ within the numeric literal 12$34 is invalid, leading to a
lexical error.
Missing Characters
Description: Omitting characters that are required for valid syntax.
Example: If the file-inclusion line is misspelled (e.g., #include <iostream> written with a
letter missing), the compiler cannot process the directive and reports an error.
Transposition of Characters
#include <iostream>
using namespace std;

int mian() {  // 'main' mistyped as 'mian'.
    cout << "GFG!";
    return 0;
}
Explanation: Transposing the characters of main produces mian, which is still a lexically
valid identifier; the mistake is only detected later, when no main function is found.
Error Recovery:
Panic Mode: A straightforward approach where the lexical analyzer discards input
characters until it encounters a recognizable token, potentially leading to incorrect
parsing.
Phrase-Level Recovery: More sophisticated techniques that attempt to correct the
input by inserting, deleting, or substituting characters based on language-specific
rules.
Regular Expressions
Regular expressions are used to specify the patterns of tokens. Their basic constructs are:
Concatenation: Combines two or more expressions (e.g., 'a' 'b' represents the
string "ab").
Alternation: Represents a choice between two or more expressions
(e.g., 'a' | 'b' represents either 'a' or 'b').
Kleene Closure: Represents zero or more occurrences of an expression (e.g., 'a'*
represents any number of 'a's, including none).
Positive Closure: Represents one or more occurrences of an expression (e.g., 'a'+
represents one or more 'a's).
Optional: Represents zero or one occurrence of an expression (e.g., 'a'? represents
zero or one 'a').
Examples:
Identifier: [a-zA-Z_][a-zA-Z0-9_]*
Integer Literal: [0-9]+
Floating-Point Literal: [0-9]+\.[0-9]+
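As a small illustration, these three patterns can be tested directly with C++'s <regex> library; the sample lexemes below are invented for the example.
cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Token patterns taken from the regular expressions above.
    const std::regex identifier("[a-zA-Z_][a-zA-Z0-9_]*");
    const std::regex integer_lit("[0-9]+");
    const std::regex float_lit("[0-9]+\\.[0-9]+");

    for (const std::string lexeme : {"count", "42", "3.14", "9lives"}) {
        if (std::regex_match(lexeme, identifier))
            std::cout << lexeme << " -> identifier\n";
        else if (std::regex_match(lexeme, integer_lit))
            std::cout << lexeme << " -> integer literal\n";
        else if (std::regex_match(lexeme, float_lit))
            std::cout << lexeme << " -> floating-point literal\n";
        else
            std::cout << lexeme << " -> no match (lexical error)\n";
    }
    return 0;
}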
Finite Automata
Finite automata are abstract mathematical models that recognize languages defined by
regular expressions. They are typically represented as directed graphs, with nodes
representing states and edges representing transitions based on input symbols.
Deterministic Finite Automata (DFA): For each state and input symbol, there is
exactly one transition.
Non-deterministic Finite Automata (NFA): For a given state and input symbol,
there may be zero, one, or multiple transitions.
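The sketch below shows how a DFA for an identifier pattern such as [a-zA-Z][a-zA-Z0-9]* can be hard-coded, with state 0 as the start state and state 1 as the single accepting state; the function name and state numbering are illustrative.
cpp
#include <cctype>
#include <iostream>
#include <string>

// Two-state DFA for [a-zA-Z][a-zA-Z0-9]*.
bool accepts_identifier(const std::string& input) {
    int state = 0;
    for (char c : input) {
        unsigned char uc = static_cast<unsigned char>(c);
        switch (state) {
            case 0:
                if (std::isalpha(uc)) state = 1;  // a letter starts an identifier
                else return false;                // anything else: reject immediately
                break;
            case 1:
                if (std::isalnum(uc)) state = 1;  // letters/digits keep us in state 1
                else return false;
                break;
        }
    }
    return state == 1;  // accept only if we ended in the accepting state
}

int main() {
    std::cout << std::boolalpha
              << accepts_identifier("x1") << '\n'     // true
              << accepts_identifier("3num") << '\n';  // false: starts with a digit
    return 0;
}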
Example: Consider a simple language whose keywords are if, else, and while. We can define
the following regular expressions for its tokens:
Identifier: [a-zA-Z][a-zA-Z0-9]*
Integer Literal: [0-9]+
Operators: [+-]|[*]|[/]|[=]
Syntax Analysis
Syntax analysis, or parsing, is the phase of the compilation process that follows
lexical analysis. It involves recognizing the syntactic structure of the input program
according to the defined grammar of the programming language. This phase
transforms the sequence of tokens generated by the lexical analyzer into a structured
representation, often in the form of a parse tree or abstract syntax tree (AST).
The primary objective of syntax analysis is to check whether the sequence of tokens
forms a valid statement or program according to the grammar rules of the language.
The parser examines the tokens and constructs a parse tree, which represents the
hierarchical structure of the input.
Input Tokens: The parser receives a sequence of tokens from the lexical analyzer.
Grammar Definition: The parser uses a defined grammar to check the syntactic
structure of the input.
Parse Tree Construction: The parser constructs a parse tree that organizes the tokens
according to their grammatical relationships.
Parse Trees
A parse tree is a tree structure that represents the syntactic structure of a sequence of
tokens based on the grammar of the language. Each internal node of the tree represents a
grammar rule, and the leaf nodes represent the tokens.
Example:
For the expression int x = 5 + 3;, a simplified parse tree might look like this:
Type Node: Represents the type of the variable being declared (int).
Identifier Node: Represents the name of the variable being declared (x).
Expression Node: Represents the expression on the right side of the assignment.
This structure clearly shows how the components of the statement are hierarchically
organized.
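A plain-text sketch of this tree; the node names are illustrative and not taken from a particular grammar.
Plaintext
Declaration
├── Type: int
├── Identifier: x
└── Expression
    ├── IntegerLiteral: 5
    ├── Operator: +
    └── IntegerLiteral: 3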
Top-Down Parsing: The parser starts from the root and works its way down to the
leaves, trying to find a leftmost derivation of the input. Example: Recursive descent
parsing is a common top-down parsing method.
Bottom-Up Parsing: The parser starts from the leaves and works its way up to the
root, attempting to reduce the input to the start symbol of the grammar.
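A minimal sketch of the recursive descent method named above, for simple arithmetic expressions; the grammar, the Parser structure, and the single-digit restriction are assumptions made to keep the example short.
cpp
#include <cctype>
#include <iostream>
#include <stdexcept>
#include <string>

// Grammar (one member function per nonterminal):
//   expr   -> term ('+' term)*
//   term   -> factor ('*' factor)*
//   factor -> a single digit
struct Parser {
    std::string src;
    std::size_t pos = 0;

    char peek() const { return pos < src.size() ? src[pos] : '\0'; }

    void factor() {
        if (!std::isdigit(static_cast<unsigned char>(peek())))
            throw std::runtime_error("syntax error at position " + std::to_string(pos));
        ++pos;  // consume the digit
    }
    void term() {
        factor();
        while (peek() == '*') { ++pos; factor(); }
    }
    void expr() {
        term();
        while (peek() == '+') { ++pos; term(); }
    }
};

int main() {
    Parser p{"1+2*3"};
    try {
        p.expr();
        std::cout << (p.pos == p.src.size() ? "accepted\n" : "rejected\n");
    } catch (const std::exception& e) {
        std::cout << e.what() << '\n';
    }
    return 0;
}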
Parsing Techniques
LL(1) Parsing:
A top-down parsing technique that uses a single lookahead token to make parsing
decisions. The "LL" stands for "Left-to-right scanning of the input and Leftmost
derivation."
LR(1) Parsing:
A bottom-up parsing technique in which the input is scanned Left-to-right and a Rightmost
derivation is constructed in reverse, using a single lookahead token.
Advantages: More powerful than LL(1) and capable of handling a larger class of
grammars.
Ambiguity in Grammar
A grammar is said to be ambiguous if there exists a string that can be generated by the
grammar in more than one way, leading to different parse trees. Ambiguity can cause
confusion in the parsing process as the parser cannot determine the intended structure.
E → E + E | E * E | id
For the expression id + id * id, there are two valid parse trees, leading to different
interpretations of operator precedence.
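One standard remedy, not stated in the original text, is to rewrite the grammar so that precedence and associativity are built in; with this version every input string has exactly one parse tree.
Plaintext
E → E + T | T
T → T * F | F
F → id
Here * binds tighter than + because it is introduced at a lower level of the grammar.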
Other strategies for dealing with ambiguity include:
Parse Tree Prioritization: Adopt a specific strategy for generating parse trees that
adhere to the desired grammar, ensuring consistent interpretations.
Error Handling: Implement error-recovery techniques within the parser to gracefully
handle ambiguities when they are encountered.
Semantic Analysis
Semantic analysis focuses on verifying that the syntax tree generated during the
syntax analysis phase is meaningful. This involves checking various aspects of the
program, such as type compatibility, scope resolution, and variable declarations. The
goal is to ensure that the code not only follows the syntactic rules but also adheres to
the semantic rules of the language.
Scope Resolution: Verifying that variables and functions are declared and accessible
in their respective scopes.
Error Detection: Identifying and reporting semantic errors before the code
generation phase, reducing the potential for runtime errors.
Type Expressions
Definition: A type expression defines the data type of variables, constants, and
expressions in a program, such as integers, floats, or strings.
Purpose: Type expressions allow the compiler to understand the kind of data being
manipulated, enabling it to enforce rules about how different types can interact.
Types of Variables
Declaration: Variables must be declared with a specific type before use, specifying
the type of data they can hold and the operations allowed.
Type Mismatch Errors
Type mismatch errors occur when operations involve incompatible types. The
semantic analyzer checks for these errors during compilation, ensuring that variables
and expressions are used in a way that adheres to their defined types.
Arithmetic Operations:
cpp
int a = 5;
string b = "10";
int result = a + b; // Error: Type mismatch, cannot add int and string
Function Calls:
Passing an argument whose type does not match the corresponding parameter type also
causes a mismatch error.
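A hypothetical illustration; the function square is invented for the example.
cpp
#include <string>

int square(int n) { return n * n; }   // expects an int argument

int main() {
    std::string s = "5";
    int r = square(s);   // Error: no conversion from std::string to int
    return 0;
}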
Assignment Statements:
Assigning a value of one type to a variable of another incompatible type can trigger
an error.
cpp
float pi = 3.14;
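A self-contained hypothetical example of such an incompatible assignment; the pointer variable is invented for illustration.
cpp
int main() {
    float pi = 3.14f;
    char* text = pi;   // Error: cannot convert a float to a pointer type
    return 0;
}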
Benefits of Type Checking
Early Error Detection: By catching type errors during compilation, semantic analysis
prevents runtime errors that could lead to program crashes or incorrect behavior.
Improved Code Quality: Enforcing type rules leads to clearer and more predictable
code since developers must adhere to defined types.
Example of Type Checking Process
cpp
int a = 5;
float b = 2.5;
int result = a + b;
During the evaluation of the expression for result, the compiler finds an addition of an
integer and a float. The semantic analyzer applies the language's type-compatibility rules:
in C++ the int operand is implicitly converted to float, and the assignment back to int is a
narrowing conversion that the compiler may flag; in a more strictly typed language the
mixed-type operation itself would be reported as a type error.
Types of Intermediate Representations
Three-Address Code (TAC): Each instruction consists of at most three operands, allowing
for a clear representation of operations.
Example:
Plaintext
t1 = a + b
t2 = t1 * c
result = t2 + d
Syntax Trees: A tree representation that captures the hierarchical structure of the source
code, where nodes represent language constructs.
Intermediate instructions are created based on the operations defined in the syntax tree.
For example, consider the following code:
cpp
int a = 5;
int b = 10;
int result = a + b * 2;
The intermediate code generation process might produce the following three-address code:
plaintext
t1 = b * 2        // multiply b by 2
t2 = a + t1       // add a to the product
result = t2       // store the sum in result
Code Optimization
Code optimization transforms the intermediate code so that the resulting program runs
faster or uses fewer resources. Its benefits include:
Memory Efficiency: Optimizations can reduce memory usage and improve cache
performance.
Reduced Code Size: Code optimization can lead to smaller binaries, which is
beneficial for resource-constrained environments.
Local Optimizations:
These optimizations are applied within a single basic block.
Examples:
Plaintext
t1 = 3 + 4 // constant folding: optimized to t1 = 7
Global Optimizations:
These optimizations consider multiple basic blocks and optimize across function calls
and control flow.
Examples:
Inlined Function Calls: Replacing a function call with the actual code of the function
to avoid the overhead of a call.
Loop Unrolling: Expanding the loop body to reduce the number of iterations and
increase the instruction-level parallelism.
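A small C++ illustration of loop unrolling; the array, its size, and the unrolling factor are arbitrary choices made for the example.
cpp
#include <iostream>

int main() {
    int a[4] = {1, 2, 3, 4};

    // Original loop: four iterations, each with a bound check and a branch.
    int sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += a[i];

    // Fully unrolled version: the loop overhead is gone and the four
    // additions are independent, exposing instruction-level parallelism.
    int sum_unrolled = a[0] + a[1] + a[2] + a[3];

    std::cout << sum << " " << sum_unrolled << "\n";  // both print 10
    return 0;
}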
Abstract Syntax Trees (AST)
An AST is a tree where each node represents a construct occurring in the source code.
The nodes can represent various elements such as:
- Operators (e.g., arithmetic operations)
- Operands (e.g., variables, constants)
- Control structures (e.g., loops, conditionals)
For example, the expression 4 + 2 × 10 + 3 × (5 + 1) would be represented in an AST by
structuring the operations according to their precedence, without retaining unnecessary
syntactic details such as parentheses.
An abstract syntax tree (or AST) is a tree-shaped representation of source code that is
convenient for a compiler to operate on. A compiler might represent an expression
like 4 + 2 * 10 + 3 * (5 + 1) using a tree structure like this:
Figure: Abstract syntax tree example
Construction of AST
ASTs are typically generated during the syntax analysis phase of compilation. After
lexical analysis breaks the source code into tokens, these tokens are parsed into a
syntax tree, which is then transformed into an AST. This transformation often
involves removing nodes that do not contribute to the semantic meaning
of the program.
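A minimal sketch of how an AST node might be represented in C++; the AstNode layout and the make_node helper are illustrative names, not a standard compiler API.
cpp
#include <memory>
#include <string>
#include <vector>

// Each node records the construct it represents and owns its children.
struct AstNode {
    std::string kind;                                // e.g. "+", "*", "number"
    std::string value;                               // lexeme for leaves, empty otherwise
    std::vector<std::unique_ptr<AstNode>> children;  // sub-expressions
};

std::unique_ptr<AstNode> make_node(std::string kind, std::string value = "") {
    auto node = std::make_unique<AstNode>();
    node->kind = std::move(kind);
    node->value = std::move(value);
    return node;
}

int main() {
    // AST for the subexpression 2 * 10 from the example above:
    // the '*' node owns two "number" leaves.
    auto mul = make_node("*");
    mul->children.push_back(make_node("number", "2"));
    mul->children.push_back(make_node("number", "10"));
    return 0;
}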
Applications of Abstract Syntax Trees
ASTs serve multiple purposes in compiler design:
1. Intermediate Representation: They act as an intermediate representation of source
code that can be used for various compiler phases, including optimization and code
generation.
2. Code Analysis: Tools for static code analysis utilize ASTs to detect errors and
enforce coding standards without executing the program.
3. Code Generation: During the final stages of compilation, ASTs are traversed to
generate target machine code or intermediate representations like three-address code.
Symbol Tables
Symbol tables are a fundamental data structure used in compiler construction, serving
as a repository for information about the identifiers in a program. This includes details
about variables, functions, objects, and other entities that are crucial to the
compilation process.
Storage of Identifiers: They maintain records of all identifiers used in the source
code, including their names, types, scopes, and memory locations.
Error Checking: Symbol tables facilitate semantic checks, ensuring that
identifiers are used correctly according to the programming language's rules. For
instance, they verify that variables are declared before use and that assignments
are type-compatible.
Scope Management: They help manage variable scopes, allowing compilers to
differentiate between identifiers with the same name in different contexts (e.g.,
local vs. global variables).
Code Optimization: By providing quick access to identifier information, symbol
tables assist in optimizing code for better runtime efficiency.
Structure of Symbol Tables
Symbol tables can be implemented using various data structures, including:
Linear Lists: Simple but inefficient for large datasets.
Hash Tables: Provide faster access times for lookups and are commonly used
due to their efficiency.
Trees: Useful for managing hierarchical scopes and complex relationships among
identifiers.
Each entry in a symbol table typically contains:
Symbol Name: The identifier's name.
Type: The data type of the identifier (e.g., integer, float).
Scope: The scope in which the identifier is valid (e.g., local or global).
Memory Address: The location in memory where the identifier is stored.
Additional Attributes: Information such as whether a variable is initialized or if
a function is public or private.
Symbol Name    Type    Scope     Memory Address    Attributes
distance       float   Global    0x1000            Uninitialized
This table illustrates how each identifier's attributes are stored to facilitate efficient
compilation and error checking.
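A sketch of a hash-table-based symbol table holding the entry shown above; SymbolInfo and its fields are illustrative names chosen to mirror the attributes listed earlier.
cpp
#include <iostream>
#include <string>
#include <unordered_map>

// One entry per identifier.
struct SymbolInfo {
    std::string type;      // e.g. "float"
    std::string scope;     // e.g. "Global"
    unsigned address;      // memory location assigned by the compiler
    bool initialized;      // extra attribute
};

int main() {
    // Hash-table implementation: identifier name -> its attributes.
    std::unordered_map<std::string, SymbolInfo> symbols;
    symbols["distance"] = {"float", "Global", 0x1000, false};

    // A later compiler phase looks the identifier up before using it.
    auto it = symbols.find("distance");
    if (it != symbols.end())
        std::cout << it->first << " : " << it->second.type
                  << " (" << it->second.scope << ")\n";
    else
        std::cout << "error: 'distance' used before declaration\n";
    return 0;
}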
Three-Address Code
A statement involving no more than three references (two for operands and one for the
result) is known as a three-address statement. A sequence of three-address statements is
known as three-address code. A three-address statement is of the form x = y op z, where
x, y, and z have addresses (memory locations). Sometimes a statement contains fewer than
three references, but it is still called a three-address statement.
For example, the expression a + b * c can be translated as:
t1 = b * c
t2 = a + t1
Three-address code can represent multi-operator arithmetic expressions and nested flow-of-
control statements, which makes it useful for generating and optimizing target code.
An example DAG for the expression a + a * (b - c) + (b - c) * d shares the common
subexpression (b - c). The corresponding three-address code is:
t1 = b - c
t2 = a * t1
t3 = a + t2
t4 = t1 * d
t5 = t3 + t4
General Form
In general, three-address instructions are represented as:
a = b op c
Here, a, b, and c are operands (programmer-defined names, constants, or compiler-generated
temporaries) and op is the operator applied to b and c.
Example:
a = b + c
c = a * b
Applications of Three-Address Code
1. Optimization
TAC is extensively used during the optimization phases of compilation. By breaking
down complex expressions, compilers can analyze and improve performance by
eliminating redundant calculations and optimizing resource usage.
2. Code Generation
During the code generation phase, TAC serves as an intermediary that allows
compilers to produce machine-specific code while ensuring correctness and
efficiency. This step is vital for adapting high-level language constructs to low-level
machine instructions.
3. Debugging
Three-address code aids in debugging by providing a more readable representation
than the final machine code. Developers can trace program execution through TAC to
identify errors or issues more easily.
4. Language Translation
TAC facilitates the translation of code between different programming languages by
providing a common intermediate representation that simplifies the conversion
process.
Types of Three-Address Code Representations
1. Quadruples
Each instruction is represented as a tuple containing four fields: operator, first
operand, second operand, and result.
2. Triples
In this representation, each instruction is stored as three fields without explicit results;
instead, results are referenced by their position in the list of instructions.
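For the statement a = b * c + d, the same three-address code could be stored in the two forms as follows; the field layout shown is one common textbook convention, not the only one.
Plaintext
Quadruples (op, arg1, arg2, result):
(0)  ( * , b,  c,  t1 )
(1)  ( + , t1, d,  t2 )
(2)  ( = , t2,  ,  a  )

Triples (op, arg1, arg2), with results referenced by position:
(0)  ( * , b,   c   )
(1)  ( + , (0), d   )
(2)  ( = , a,   (1) )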
Stack Machines in Compiler Construction
Key Features of Stack Machines
Uniform Operand Handling:
In stack machines, all operations take their operands from the top of the stack. This
implicit addressing means that there is no need to specify operand locations explicitly,
which leads to smaller instruction encodings and more compact programs.
Simplified Code Generation:
The architecture allows for straightforward compilation strategies. For example, a
compiler can generate code that directly corresponds to stack operations without needing
complex register allocation. This uniformity reduces the complexity of the compiler
design.
Compact Instruction Set:
Stack machines typically have a denser instruction set compared to register-based
architectures. Instructions can often fit within fewer bits, allowing for better cache
efficiency and reduced memory costs.
Optimizations:
While stack machines require more instructions to perform operations (e.g., every
variable load requires a separate Load instruction), optimizations can be applied at both
the compiler and backend levels to improve performance. Techniques such as keeping
frequently accessed stack elements in registers (accumulators) can reduce memory access
time.
Architecture of Stack Machines
A typical stack machine consists of several components:
Main Memory: Stores both code and data.
Program Counter (PC): Points to the next instruction to execute.
Stack Pointer (SP): Indicates the top of the stack.
Instruction Register (IR): Holds the currently executing instruction.
Temporary Register: Used for intermediate calculations.
Example Instructions
Common instructions in stack machines include:
PushImm n: Pushes the immediate value n onto the stack.
PushAbs x: Pushes the value of variable x onto the stack.
Pop x: Pops the top value from the stack into variable x.
Add: Pops the top two values off the stack, adds them, and pushes the result back
on.
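Using these instructions, a statement such as x = a + b * 2 could be translated into the following sequence; Mul is assumed here by analogy with Add and is not part of the list above.
Plaintext
PushAbs a    // push the value of a
PushAbs b    // push the value of b
PushImm 2    // push the constant 2
Mul          // pop b and 2, push b * 2
Add          // pop a and b * 2, push the sum
Pop x        // store the result into variable x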
Advantages of Stack Machines
1. Simplicity: The architecture is easier to understand and implement compared to
more complex register-based systems.
2. Ease of Compilation: Compiling high-level languages into stack machine code
is generally simpler because there are fewer concerns about register management.
3. Cost Efficiency: The design allows for simpler and cheaper CPUs, and the model
is used by virtual machines such as the Java Virtual Machine (JVM) and
WebAssembly.
Challenges with Stack Machines
Despite their advantages, stack machines can face challenges such as:
Higher instruction counts due to separate load instructions for each variable.
Potential errors in program execution if not managed properly, particularly
regarding operand arity.
Lexical Analyzer Generators in Compiler Construction
Lexical analyzer generators are essential tools in compiler construction that automate
the creation of lexical analyzers, also known as scanners. These tools take high-level
specifications of tokens defined by regular expressions and generate code that can
recognize and process these tokens in source code.
Purpose and Functionality
1. Tokenization:
The primary function of a lexical analyzer is to read the input source code and convert
it into a sequence of tokens. Tokens are meaningful sequences of characters that
represent various elements of the programming language, such as keywords,
identifiers, literals, and operators.
2. Pattern Recognition:
Lexical analyzer generators allow developers to define patterns for tokens using
regular expressions. The generator then constructs a finite automaton (FA) that can
recognize these patterns efficiently during the scanning process.
3. Ease of Use:
By using a generator, developers can create complex lexical analyzers without having
to manually implement the underlying algorithms for token recognition. This
significantly reduces development time and complexity.
Common Lexical Analyzer Generators
Several widely used lexical analyzer generators include:
Lex: One of the most popular tools, Lex allows users to define tokens through
regular expressions and automatically generates C code for the scanner.
Flex: An enhanced version of Lex, Flex provides additional features and better
performance while maintaining compatibility with Lex syntax.
ANTLR: Although primarily a parser generator, ANTLR also includes
capabilities for generating lexical analyzers, allowing for seamless integration
between parsing and lexical analysis.
Structure of a Lexical Analyzer Generator
Typically, a lexical analyzer generator operates based on a structured input file
divided into three sections:
Declarations:
This section defines the tokens and their corresponding regular expressions.
Translation Rules:
Here, actions associated with each token are specified, such as what to do when a
token is recognized (e.g., returning it to the parser).
Auxiliary Functions:
This optional section can include additional helper functions that support the main
tokenization process.
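A minimal sketch of such an input file in Lex/Flex notation, showing the three sections described above; the token patterns and actions are invented for the example.
%{
/* Declarations: C code copied into the generated scanner. */
#include <stdio.h>
%}

%%
[0-9]+                  { printf("INTEGER: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*  { printf("IDENTIFIER: %s\n", yytext); }
[ \t\n]+                { /* skip whitespace */ }
.                       { printf("LEXICAL ERROR: %s\n", yytext); }
%%

/* Auxiliary functions: a driver for the generated yylex(). */
int yywrap(void) { return 1; }
int main(void)   { return yylex(); }
Running the tool on this specification produces a C source file containing the yylex() scanning function, which is then compiled together with the rest of the compiler.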
Advantages of Using Lexical Analyzer Generators
Efficiency in Development: Automating the generation of lexical analyzers
saves significant time compared to manual coding.
Consistency and Reliability: Generated analyzers are typically more consistent
and less prone to errors than hand-written code.
Adaptability: Changes in language specifications can be easily accommodated
by modifying the input definitions rather than rewriting substantial portions of
code.
Challenges
While lexical analyzer generators provide numerous benefits, they also come with
challenges:
Learning Curve: Understanding how to effectively use these tools requires
familiarity with regular expressions and the specific syntax of the generator.
Performance Considerations: Depending on how they are implemented,
generated analyzers may not always be as optimized as hand-crafted solutions
tailored for specific applications.
Parser Generators in Compiler Construction
Parser generators are essential tools in compiler construction, specifically designed to
automate the creation of parsers, which are components responsible for syntax
analysis. They take a formal grammar as input and produce source code for a parser
that can process strings based on that grammar. This process is crucial in transforming
high-level programming languages into a format that can be understood and executed
by machines.
3. ANTLR (Another Tool for Language Recognition): A powerful parser
generator that supports multiple languages and offers advanced features like tree
construction and visitor patterns.
CONCLUSION
Our assignment underscores the importance of lexical analysis in ensuring the overall
efficiency and accuracy of compilers. By studying tools like Lex, we observe how
automation enhances the reliability and speed of developing lexical analyzers. The
ability to process and classify tokens systematically allows compilers to handle
diverse programming constructs effectively. Ultimately, lexical analysis contributes
significantly to the seamless translation of high-level code into executable machine
instructions, highlighting its indispensable role in modern software development.