A compiler is a software program that converts the high-level source code written in a programming language into low-level machine code that can be executed by the computer hardware. The process of converting the source code into machine code involves several phases or stages, which are collectively known as the phases of a compiler. The typical phases of a compiler are:
1. Lexical Analysis: The first phase of a compiler is lexical analysis, also known as scanning. This
phase reads the source code and breaks it into a stream of tokens, which are the basic units
of the programming language. The tokens are then passed on to the next phase for further
processing.
2. Syntax Analysis: The second phase of a compiler is syntax analysis, also known as parsing.
This phase takes the stream of tokens generated by the lexical analysis phase and checks
whether they conform to the grammar of the programming language. The output of this
phase is usually an Abstract Syntax Tree (AST).
3. Semantic Analysis: The third phase of a compiler is semantic analysis. This phase checks whether the code is semantically correct, i.e., whether it conforms to the language's type system and other semantic rules. In this stage, the compiler checks the meaning of the source code to ensure that it makes sense. The compiler performs type checking, which ensures that variables are used correctly and that operations are performed on compatible data types. The compiler also checks for other semantic errors, such as undeclared variables and incorrect function calls.
4. Intermediate Code Generation: The fourth phase of a compiler is intermediate code generation. This phase generates an intermediate representation of the source code that can be easily translated into machine code.
5. Optimization: The fifth phase of a compiler is optimization. This phase applies various optimization techniques to the intermediate code to improve the performance of the generated machine code.
6. Code Generation: The final phase of a compiler is code generation. This phase takes the optimized intermediate code and generates the actual machine code that can be executed by the target hardware.
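As a concrete illustration, here is a hypothetical trace (the token names and the three-address-code format are invented for this sketch) of how one assignment statement might pass through the front-end phases:

```python
# Hypothetical trace of the statement "x = a + b * 2" through the phases.
source = "x = a + b * 2"

# 1. Lexical analysis: the scanner emits a token stream.
tokens = [("ID", "x"), ("ASSIGN", "="), ("ID", "a"),
          ("PLUS", "+"), ("ID", "b"), ("MUL", "*"), ("NUM", "2")]

# 2. Syntax analysis: the parser builds an AST (nested tuples here).
# 3. Semantic analysis would then verify that x, a, and b have compatible types.
syntax_tree = ("assign", "x", ("+", "a", ("*", "b", 2)))

# 4. Intermediate code generation: three-address code.
ir = ["t1 = b * 2",
      "t2 = a + t1",
      "x = t2"]
for line in ir:
    print(line)
```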
Lexical analysis, also known as scanning or tokenization, is the first phase of the compilation process in computer science. Its primary task is to break the input program or code into meaningful units called tokens. These tokens are the smallest meaningful units in a programming language and serve as the building blocks for further processing by the compiler or interpreter.
Input Stream: The input to the lexical analyzer is the source code or program text, which is typically a sequence of characters.
Tokenization: The lexical analyzer scans the input stream character by character and groups characters into tokens based on predefined rules or patterns specified by the programming language's lexical structure. These tokens represent keywords, identifiers, literals (like numbers or strings), operators, punctuation symbols, and other language-specific constructs.
Token Classification: Each token is classified into a specific category or type, such as keywords, identifiers, operators, etc. This classification helps in subsequent phases of the compilation process, such as parsing and semantic analysis.
Token Output: As the lexical analyzer identifies tokens, it generates a stream of tokens as output, which is passed on to the next phase of the compiler or interpreter for further processing.
Error Handling: The lexical analyzer also detects and reports lexical errors, such as invalid characters or tokens that do not conform to the language's lexical rules.
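To make these steps concrete, here is a minimal tokenizer sketch (the token categories and the toy language are assumptions made for this example, not part of any real compiler):

```python
import re

# Token categories for a toy language: order matters, earlier patterns win.
TOKEN_SPEC = [
    ("NUM",   r"\d+"),           # integer literals
    ("ID",    r"[A-Za-z_]\w*"),  # identifiers (and keywords)
    ("OP",    r"[+\-*/=]"),      # operators
    ("SKIP",  r"\s+"),           # whitespace, discarded
    ("ERROR", r"."),             # anything else is a lexical error
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    """Scan the input left to right and yield (category, lexeme) tokens."""
    for m in MASTER_RE.finditer(code):
        kind, value = m.lastgroup, m.group()
        if kind == "SKIP":
            continue                      # whitespace produces no token
        if kind == "ERROR":
            raise SyntaxError(f"invalid character {value!r}")  # error handling
        yield (kind, value)

print(list(tokenize("x = y + 42")))
# [('ID', 'x'), ('OP', '='), ('ID', 'y'), ('OP', '+'), ('NUM', '42')]
```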
Top-Down Parsing vs Bottom-Up Parsing:
Top-Down Parsing: This parsing technique uses Left Most Derivation.
Bottom-Up Parsing: This parsing technique uses Right Most Derivation.
Top-Down Parsing: The main decision is to select which production rule to use in order to construct the string.
Bottom-Up Parsing: The main decision is to select when to use a production rule to reduce the string to get back to the starting symbol.
Top-down parsing is a strategy used in syntax analysis and parsing in the field of compiler design, where the parsing process starts from the highest-level construct (the start symbol) and works its way down to the individual tokens (the leaves of the parse tree). This approach attempts to construct a parse tree for an input string by applying grammar productions, starting from the start symbol of the grammar and making decisions at each step about which rule to apply next to derive the input string.
What is Bottom-Up Parsing?
Bottom-up parsing is a method used in compiler design for syntax analysis, where the parsing process starts with the leaf nodes of the parse tree (the input tokens) and works its way up to the root (the start symbol of the grammar). This approach constructs the parse tree from the bottom up by recognizing and reducing sequences of tokens into grammatical constructs, according to the rules of a given grammar, until the entire sequence is reduced to the start symbol. It effectively builds the parse tree by starting with the details (tokens) and combining them into higher-level constructs until it constructs the whole program.
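The following sketch shows top-down parsing in its simplest form, a recursive-descent parser for a toy expression grammar (the grammar and the single-character token handling are simplifications for illustration):

```python
# Recursive-descent (top-down) parser for the toy grammar
#   E -> T { + T }     T -> F { * F }     F -> digit | ( E )
# Input is a list of single-character tokens, e.g. list("2+3*4").
class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def parse_E(self):                     # one procedure per non-terminal
        node = self.parse_T()
        while self.peek() == "+":
            self.eat("+")
            node = ("+", node, self.parse_T())
        return node

    def parse_T(self):
        node = self.parse_F()
        while self.peek() == "*":
            self.eat("*")
            node = ("*", node, self.parse_F())
        return node

    def parse_F(self):
        tok = self.peek()
        if tok == "(":
            self.eat("(")
            node = self.parse_E()
            self.eat(")")
            return node
        if tok is not None and tok.isdigit():
            self.pos += 1
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

print(Parser(list("2+3*4")).parse_E())     # ('+', 2, ('*', 3, 4))
```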
LL Parser vs LR Parser:
LL Parser: It is a technique which may or may not require backtracking.
LR Parser: It is a technique that does not require any kind of backtracking.
LL Parser: It uses procedures for every non-terminal to parse strings.
LR Parser: It finds the productions to use by reducing (replacing) the input string.
LL Parser: The first L of LL is for left-to-right scanning and the second L is for leftmost derivation.
LR Parser: The L of LR is for left-to-right scanning and the R is for rightmost derivation.
LL Parser: It follows the leftmost derivation.
LR Parser: It follows the reverse of the rightmost derivation.
LL Parser: The parse tree is constructed in a top-down manner.
LR Parser: The parse tree is constructed in a bottom-up manner.
LL Parser: Ends when the stack becomes empty.
LR Parser: Starts with an empty stack.
LL Parser: Corresponds to a pre-order traversal of the parse tree.
LR Parser: Corresponds to a post-order traversal of the parse tree.
LL Parser: A terminal is read after being popped off the stack.
LR Parser: A terminal is read before being pushed onto the stack.
What is FIRST?
1. Definition: The FIRST set of a nonterminal symbol is the set of terminal symbols that can appear as the first symbol in a string derived from that nonterminal. In other words, it is the set of all possible starting symbols for a string derived from that nonterminal.
2. Calculation: The FIRST set for each nonterminal symbol is calculated by examining the productions for that symbol and determining which terminal symbols can appear as the first symbol in a string derived from that production.
3. Recursive Descent Parsing: The FIRST set is often used in recursive descent parsing, which is a top-down parsing technique that uses the FIRST set to determine which production to use at each step of the parsing process.
4. Ambiguity Resolution: The FIRST set can help resolve ambiguities in the grammar by providing a way to determine which production to use based on the next input symbol.
5. Follow Set: The FOLLOW set is another concept used in syntax analysis that represents the set of symbols that can appear immediately after a nonterminal symbol in a derivation. The FOLLOW set is often used in conjunction with the FIRST set to resolve parsing conflicts and ensure that the parser can correctly identify the structure of the input code.
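A fixed-point computation of FIRST sets can be sketched as follows (the grammar encoding, with productions as lists of symbols and "" standing for ε, is an assumption of this example):

```python
# Toy grammar: E -> T E',  E' -> + T E' | epsilon,  T -> id | ( E )
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [""]],   # "" denotes epsilon
    "T":  [["id"], ["(", "E", ")"]],
}

def compute_first(grammar):
    nonterminals = set(grammar)
    first = {nt: set() for nt in grammar}

    def first_of_seq(seq):
        """FIRST of a sequence of grammar symbols."""
        result = set()
        for sym in seq:
            if sym not in nonterminals:   # terminal or epsilon
                result.add(sym)
                return result
            result |= first[sym] - {""}
            if "" not in first[sym]:      # sym cannot vanish: stop here
                return result
        result.add("")                    # every symbol can vanish
        return result

    changed = True
    while changed:                        # iterate until nothing new is added
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                new = first_of_seq(prod)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

print(compute_first(GRAMMAR))
# {'E': {'id', '('}, "E'": {'+', ''}, 'T': {'id', '('}}
```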
What is FOLLOW?
The FOLLOW set in compiler design is used to identify the terminal symbols that can appear immediately after a non-terminal in a given language. Like the FIRST set, the FOLLOW set is used to avoid backtracking. The difference is that the FOLLOW set handles non-terminals that can vanish (derive ε) on the right-hand side, so that decision-making gets easier for the compiler while parsing.
Follow(X) is defined as the set of terminals that can appear immediately to the right of the non-terminal X in some sentential form.
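Building on FIRST, a FOLLOW-set computation can be sketched like this (the FIRST sets are hard-coded for the same toy grammar, and "$" is assumed to mark the end of input):

```python
# Toy grammar: E -> T E',  E' -> + T E' | epsilon,  T -> id | ( E )
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [""]],   # "" denotes epsilon
    "T":  [["id"], ["(", "E", ")"]],
}
FIRST = {"E": {"id", "("}, "E'": {"+", ""}, "T": {"id", "("}}
NONTERMINALS = set(GRAMMAR)

def compute_follow(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                 # end marker follows the start symbol
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                for i, sym in enumerate(prod):
                    if sym not in NONTERMINALS:
                        continue
                    trailer, nullable = set(), True
                    for nxt in prod[i + 1:]:       # FIRST of what follows sym
                        if nxt in NONTERMINALS:
                            trailer |= FIRST[nxt] - {""}
                            nullable = "" in FIRST[nxt]
                        else:
                            trailer.add(nxt)
                            nullable = False
                        if not nullable:
                            break
                    if nullable:                   # the rest can vanish, so
                        trailer |= follow[nt]      # FOLLOW(nt) leaks through
                    if not trailer <= follow[sym]:
                        follow[sym] |= trailer
                        changed = True
    return follow

print(compute_follow(GRAMMAR, "E"))
# {'E': {'$', ')'}, "E'": {'$', ')'}, 'T': {'+', '$', ')'}}
```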
Benefits of Code Optimization:
1. Improving Performance: Optimized code runs faster and consumes fewer resources, leading to improved program performance. This is especially important in resource-constrained environments such as embedded systems or mobile devices.
2. Reducing Execution Time: Optimization techniques such as loop unrolling, function inlining, and constant folding can reduce the number of instructions executed, thus decreasing program execution time.
3. Minimizing Memory Usage: Optimization techniques like register allocation and memory optimization help reduce memory usage, making the program more memory-efficient. This is crucial in scenarios where memory resources are limited.
4. Reducing Power Consumption: Optimized code typically requires fewer instructions to execute, leading to reduced power consumption. This is important for battery-powered devices and energy-efficient computing, where power consumption is a critical factor.
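As a small illustration of one of these techniques, the sketch below performs constant folding on Python source using the standard ast module (the set of folded operators is deliberately limited):

```python
import ast
import operator

# Map AST operator nodes to the corresponding Python functions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

class ConstantFolder(ast.NodeTransformer):
    """Replace binary operations on constants with their computed value."""
    def visit_BinOp(self, node):
        self.generic_visit(node)                  # fold the children first
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and type(node.op) in OPS):
            value = OPS[type(node.op)](node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value), node)
        return node

tree = ast.parse("seconds_per_day = 24 * 60 * 60")
folded = ast.fix_missing_locations(ConstantFolder().visit(tree))
print(ast.unparse(folded))                        # seconds_per_day = 86400
```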
Storage Allocation Strategies:- There are mainly three types of Storage Allocation Strategies:
1. Static Allocation:- Static allocation lays out or assigns the storage for all the data objects at compile time. In static allocation, names are bound to storage at compile time. The address of these identifiers remains the same throughout execution. The memory is allocated in a fixed location once it is laid out at compile time. C and C++ use static allocation.
Advantages of Static Allocation: 1. It is easy to understand. 2. The memory is allocated only once, at compile time, and remains the same throughout program execution. 3. Memory allocation is done before the program starts, since storage is laid out entirely at compile time.
Disadvantages of Static Allocation: 1. Not highly scalable. 2. Static storage allocation is not very efficient. 3. The size of the data must be known at compile time.
2. Heap Allocation:- Heap allocation is used where stack allocation falls short: if we want to retain the values of local variables after the activation record ends, we cannot do so with stack allocation, because its LIFO scheme does not allow an activation record to outlive the call. The heap is the most flexible storage allocation strategy: we can dynamically allocate and de-allocate variables whenever the user wants, according to the user's needs at run time. The variables in heap allocation can be changed according to the user's requirements. C, C++, Python, and Java all support heap allocation.
Advantages of Heap Allocation: 1. Heap allocation is useful when we have data whose size is not fixed and can change during run time. 2. We can retain the values of variables even after the activation record ends. 3. Heap allocation is the most flexible allocation scheme.
Disadvantages of Heap Allocation: 1. Heap allocation is slower as compared to stack allocation. 2. There is a chance of memory leaks.
3. Stack Allocation:- Stack allocation is commonly known as dynamic allocation, meaning the allocation of memory at run time. The stack is a data structure that follows the LIFO principle, so whenever multiple activation records are created, they are pushed onto or popped off the stack as activations begin and end. Local variables are bound to new storage each time an activation record begins, because the storage is allocated at run time every time a procedure or function call is made. When the activation record gets popped, the local variable values are erased because the storage allocated for the activation record is removed. C and C++ both support stack allocation.
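The sketch below simulates this push/pop discipline for activation records, showing why stack-allocated locals cannot outlive their call (the record layout is invented for illustration):

```python
# Toy model of stack allocation of activation records.
class ActivationRecord:
    def __init__(self, name):
        self.name = name
        self.locals = {}       # storage for the call's local variables

control_stack = []

def call(name):
    """Entering a procedure pushes a fresh activation record (allocation)."""
    record = ActivationRecord(name)
    control_stack.append(record)
    return record

def ret():
    """Returning pops the record, erasing its locals (de-allocation)."""
    record = control_stack.pop()
    print(f"returning from {record.name}; locals {record.locals} are erased")

main = call("main")
main.locals["x"] = 10
f = call("f")                  # main calls f: a new record on top (LIFO)
f.locals["y"] = 20
ret()                          # f ends: y's storage is gone
ret()                          # main ends: x's storage is gone
```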
A top-down parser builds the parse tree from the top down, starting with the start non-terminal. There are two types of top-down parsers: recursive descent parsers (which may use backtracking) and predictive parsers such as LL(1).
LL(1) Parsing: Here the first L represents that the scanning of the input is done in a left-to-right manner, and the second L shows that this parsing technique uses the leftmost derivation. Finally, the 1 represents the number of lookahead symbols, i.e., how many input symbols the parser examines when making a decision.
The grammar must be free from left recursion. The grammar should not be ambiguous. The grammar has to be left factored so that the grammar is deterministic. These conditions are necessary but not sufficient for a grammar to be LL(1).
Algorithm to construct the LL(1) Parsing Table:
Step 1: First check all the essential conditions mentioned above and go to step 2.
Step 2: Calculate First() and Follow() for all non-terminals. To construct the parsing table, we use these two functions:
First(): If there is a variable, and from that variable we try to derive all possible strings, then the set of beginning terminal symbols is called the First set.
Follow(): The set of terminal symbols that follow a variable in the process of derivation.
Step 3: For each production A –> α, find First(α) and for each terminal in First(α), make the entry A –> α in the table.
If First(α) contains ε (epsilon) as a terminal, then find Follow(A) and for each terminal in Follow(A), make the entry A –> ε in the table.
If First(α) contains ε and Follow(A) contains $ as a terminal, then make the entry A –> ε in the table for $.
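Once the table is filled, parsing itself is a simple stack-driven loop. The sketch below hard-codes the LL(1) table for the toy grammar E –> T E', E' –> + T E' | ε, T –> id | ( E ) (the table entries, derived from FIRST and FOLLOW, are assumptions of this example):

```python
# LL(1) parsing table: (non-terminal, lookahead) -> production right-hand side.
TABLE = {
    ("E",  "id"): ["T", "E'"],  ("E",  "("): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],
    ("E'", ")"):  [],           ("E'", "$"): [],   # E' -> epsilon
    ("T",  "id"): ["id"],       ("T",  "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T"}

def parse(tokens):
    tokens = tokens + ["$"]            # end-of-input marker
    stack = ["$", "E"]                 # start symbol on top of the stack
    i = 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            prod = TABLE.get((top, tokens[i]))
            if prod is None:
                raise SyntaxError(f"no table entry for ({top}, {tokens[i]})")
            stack.extend(reversed(prod))   # push RHS, leftmost symbol on top
        elif top == tokens[i]:
            i += 1                         # terminal (or $) matches the input
        else:
            raise SyntaxError(f"expected {top!r}, got {tokens[i]!r}")
    print("input accepted")

parse(["id", "+", "(", "id", "+", "id", ")"])   # input accepted
```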
Advantages of Construction of LL(1) Parsing Table:
1. Deterministic Parsing: LL(1) parsing tables give a deterministic parsing process, meaning that for a given input program and grammar, the next parsing action is uniquely determined by the current non-terminal symbol and the lookahead token.
2. Efficiency: LL(1) parsing tables allow efficient parsing of programming languages. Once the parsing table is built, the parser can determine the next parsing action by directly indexing the table, resulting in a constant-time lookup.
3. Predictive Parsing: LL(1) parsing tables enable predictive parsing, where the parsing action is determined solely by the current non-terminal symbol and the lookahead token, without the need for backtracking or guessing.
4. Error Discovery: The construction of an LL(1) parsing table enables the parser to detect errors efficiently. By analyzing the entries in the parsing table, the parser can identify conflicts, such as multiple entries for the same non-terminal and lookahead combination.
5. Non-Left Recursion: LL(1) parsing tables require the elimination of left recursion in the grammar. While left recursion is a common issue in grammars, the process of eliminating it results in a more structured and unambiguous grammar.
Type checking is the process of verifying and enforcing constraints of types in values. A compiler must check that the source program follows both the syntactic and semantic conventions of the source language, and it should also check the type rules of the language. Type checking allows the programmer to limit what types may be used in certain circumstances and assigns types to values. The type checker determines whether these values are used appropriately or not.
It checks the type of objects and reports a type error in the case of a violation; where the language allows it, incorrect types are corrected (coerced). Whatever compiler we use, while it is compiling the program, it has to follow the type rules of the language. Every language has its own set of type rules. We know that the information about data types is maintained and computed by the compiler.
There are two kinds of type checking: Static Type Checking and Dynamic Type Checking.
Static Type Checking:- Static type checking is defined as type checking performed at compile time. It checks the types of variables at compile time, which means the type of each variable is known at compile time. It generally examines the program text during the translation of the program. Using the type rules of a system, a compiler can infer from the source text that a function (fun) will be applied to an operand (a) of the right type each time the expression fun(a) is evaluated.
Benefits of static type checking include: runtime error protection; catching syntactic errors like spurious words or extra punctuation; catching wrong names, such as misspellings of predefined names; and detecting incorrect argument types.
Dynamic Type Checking: Dynamic type checking is defined as type checking performed at run time. In dynamic type checking, types are associated with values, not variables. In implementations of dynamically type-checked languages, runtime objects generally carry a type tag, which is a reference to a type containing the object's type information. Dynamic typing is more flexible; a static type system always restricts what can be conveniently expressed. Dynamic typing also results in more compact programs, since it is more flexible and does not require types to be spelled out. Programming with a static type system often requires more design and implementation effort.
A type system is specified by two things: the expressions of the language, and the rules for assigning types to constructs (semantic rules).
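A static type checker in miniature can be sketched as follows (the toy AST encoding, the two types, and the typing rule for '+' are assumptions of this example):

```python
# Declared variable types, as a semantic analyzer's symbol table might hold them.
ENV = {"x": "int", "name": "str"}

def type_of(expr):
    """Compute the type of a toy expression, raising on type errors."""
    if isinstance(expr, int):
        return "int"
    if isinstance(expr, str) and expr.startswith('"'):
        return "str"                      # string literals carry quotes here
    if isinstance(expr, str):             # otherwise: a variable reference
        if expr not in ENV:
            raise TypeError(f"undeclared variable {expr}")
        return ENV[expr]
    op, left, right = expr                # binary operation as a tuple
    lt, rt = type_of(left), type_of(right)
    if op == "+" and lt == rt:            # '+' requires matching operand types
        return lt
    raise TypeError(f"cannot apply {op} to {lt} and {rt}")

print(type_of(("+", "x", 1)))             # int
try:
    type_of(("+", "x", '"hi"'))
except TypeError as err:
    print("type error:", err)             # cannot apply + to int and str
```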
Stack Allocation: The allocation happens on contiguous blocks of memory. We call it stack memory allocation because the allocation happens in the function call stack. The size of memory to be allocated is known to the compiler, and whenever a function is called, its variables get memory allocated on the stack. Whenever the function call is over, the memory for the variables is de-allocated. This all happens using predefined routines in the compiler. A programmer does not have to worry about memory allocation and de-allocation of stack variables. This kind of memory allocation is also known as temporary memory allocation, because as soon as the method finishes its execution, all the data belonging to that method is flushed from the stack automatically. This means any value stored in the stack memory scheme is accessible only as long as the method hasn't completed its execution and is currently in a running state.
Heap Allocation: The memory is allocated during the execution of instructions written by programmers. Note that the name heap has nothing to do with the heap data structure; it is called a heap because it is a pile of memory space available to programmers to allocate and de-allocate. Every time an object is created, it is allocated in heap space, and the referencing information for these objects is stored in stack memory. Heap memory allocation isn't as safe as stack memory allocation, because the data stored in this space is accessible or visible to all threads. If a programmer does not handle this memory well, a memory leak can happen in the program.
Static Allocation vs Heap Allocation:
1. Static allocation allocates memory on the basis of the size of data objects. Heap allocation makes use of the heap for managing the allocation of memory at run time.
2. In static allocation, there is no possibility of the creation of dynamic data structures and objects. In heap allocation, dynamic data structures and objects are created.
3. In static allocation, the names of the data objects are fixed with storage for addressing. Heap allocation allocates a contiguous block of memory to data objects.
4. Static allocation is a simple, but not efficient, memory management technique. Heap allocation does memory management in a very efficient manner.
5. The static allocation strategy is faster in accessing data as compared to heap allocation. Heap allocation is slower in accessing, as there is a chance of the creation of holes in reusing the free space.
6. Static allocation is typically used for global/static variables. Heap allocation is used for dynamically allocated data.
Input Buffering
The lexical analyzer scans the input from left to right one character at a time. It uses two pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input scanned.
Input buffering is an important concept in compiler design that refers to the way in which the compiler reads input from the source code. In many cases, the compiler reads input one character at a time, which can be a slow and inefficient process. Input buffering is a technique that allows the compiler to read input in larger chunks, which can improve performance and reduce overhead.
The basic idea behind input buffering is to read a block of input from the source code into a buffer, and then process that buffer before reading the next block. The size of the buffer can vary depending on the specific needs of the compiler and the characteristics of the source code being compiled. For example, a compiler for a high-level programming language may use a larger buffer than a compiler for a low-level language, since high-level languages tend to have more complex lexical structure.
The input characters are thus read from secondary storage, but reading in this way from secondary storage is costly; hence, a buffering technique is used. A block of data is first read into a buffer and then scanned by the lexical analyzer. There are two methods used in this context: the One Buffer Scheme and the Two Buffer Scheme. These are explained below.
1. One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, in this method two buffers are used to store the input string. The first buffer and second buffer are scanned alternately; when the end of the current buffer is reached, the other buffer is filled. The only problem with this method is that if the length of a lexeme is longer than the length of a buffer, then the input cannot be scanned completely. Initially, both bp and fp point to the first character of the first buffer. An end-of-buffer (eof) sentinel character is placed at the end of each buffer: when fp encounters the first eof, the end of the first buffer is recognized and filling of the second buffer begins; in the same way, when the second eof is reached, it indicates the end of the second buffer.
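The sentinel-based two-buffer scheme can be simulated as follows (the buffer size and the use of "\0" as the eof sentinel are assumptions of this sketch):

```python
EOF = "\0"        # sentinel character placed at the end of each buffer half
BUF_SIZE = 8

def scan(source):
    """Yield input characters, alternating between two sentinel-ended buffers."""
    pos = 0

    def refill():
        nonlocal pos
        chunk = source[pos:pos + BUF_SIZE]
        pos += len(chunk)
        return list(chunk) + [EOF]        # append the sentinel

    buffers = [refill(), refill()]        # fill both halves up front
    current, fp = 0, 0                    # fp is the forward pointer
    while True:
        ch = buffers[current][fp]
        if ch == EOF:
            if fp < BUF_SIZE:             # sentinel before the boundary:
                return                    # this is the real end of input
            buffers[current] = refill()   # refill the exhausted half...
            current, fp = 1 - current, 0  # ...and switch to the other one
            continue
        yield ch
        fp += 1

print("".join(scan("int count = 42 ;")))  # int count = 42 ;
```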
Tokenization is the process of breaking down a stream of text into smaller units called tokens. These tokens could be words, phrases, symbols, or other meaningful elements. Tokenization is a fundamental step in natural language processing (NLP) and text mining tasks.
In the context of NLP, tokenization typically involves splitting text into words or subwords. This allows machines to better understand and analyze the text. For example, consider the sentence "Tokenization helps in NLP tasks." After tokenization, it might be represented as ["Tokenization", "helps", "in", "NLP", "tasks"].
Tokenization can be straightforward for languages like English where words are separated by spaces, but it can be more complex for languages like Chinese or Japanese where words are not separated by spaces. Tokenization can also involve handling punctuation, special characters, and other linguistic nuances.
Different tokenization techniques exist depending on the specific requirements of the task at hand. For instance, word-level tokenization, subword-level tokenization (like Byte Pair Encoding or BPE), and character-level tokenization are some common approaches. Each has its own advantages and use cases in different NLP applications.
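For instance, a rough word-level tokenizer for English can be written with a single regular expression (this particular pattern is a crude approximation that handles plain English text only):

```python
import re

def word_tokenize(text):
    """Split text into words, numbers, and punctuation tokens."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]", text)

print(word_tokenize("Tokenization helps in NLP tasks."))
# ['Tokenization', 'helps', 'in', 'NLP', 'tasks', '.']
```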
DFA (Deterministic Finite Automaton): 1. A DFA is a finite state machine that accepts or rejects strings of symbols and produces a unique computation or path for each input string. 2. In a DFA, for each state and input symbol, there is exactly one transition to a next state. This deterministic nature makes DFAs easier to understand and implement. 3. DFAs are suitable for recognizing regular languages, which are a subset of formal languages with simple patterns that can be defined using regular expressions.
NFA (Nondeterministic Finite Automaton): 1. An NFA is similar to a DFA but allows multiple transitions from a state on the same input symbol and can have ε-transitions (transitions without consuming input). 2. Unlike a DFA, an NFA does not have a unique computation for each input string; it can have multiple possible paths or computations. 3. An NFA is often more concise and easier to design for certain problems due to its nondeterministic nature, which allows greater flexibility in representing patterns.
DFA vs NFA:
DFA: Cannot use empty string (ε) transitions. NFA: Can use empty string (ε) transitions.
DFA: Rejects the string in case it terminates in a state that is different from the accepting state. NFA: Rejects the string only in the event of all branches dying or refusing the string.
DFA: Time needed for executing an input string is less. NFA: Time needed for executing an input string is more.
DFA: Requires more space. NFA: Requires less space than a DFA.
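The difference in simulation cost is visible in code: a DFA tracks one state, while an NFA simulation must track a set of states. Below is a sketch for the language of strings over {a, b} ending in "ab" (the state names and the example language are invented for illustration):

```python
# DFA: exactly one successor state per (state, symbol) pair.
DFA = {("q0", "a"): "q1", ("q0", "b"): "q0",
       ("q1", "a"): "q1", ("q1", "b"): "q2",
       ("q2", "a"): "q1", ("q2", "b"): "q0"}
DFA_ACCEPT = {"q2"}

def dfa_accepts(s):
    state = "q0"
    for ch in s:
        state = DFA[(state, ch)]          # a unique path through the machine
    return state in DFA_ACCEPT

# NFA: from s0 it may loop on any symbol or "guess" that the final "ab" starts.
NFA = {("s0", "a"): {"s0", "s1"}, ("s0", "b"): {"s0"},
       ("s1", "b"): {"s2"}}
NFA_ACCEPT = {"s2"}

def nfa_accepts(s):
    states = {"s0"}                       # track every reachable state at once
    for ch in s:
        states = set().union(*(NFA.get((st, ch), set()) for st in states))
    return bool(states & NFA_ACCEPT)

for w in ["ab", "aab", "abb"]:
    print(w, dfa_accepts(w), nfa_accepts(w))
# ab True True / aab True True / abb False False
```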