Automata & Compiler Design Course
CLO 1. Introduce the fundamental concepts of Automata Theory, Formal Languages, and compiler
design
CLO 2. Demonstrate the application of the principles of Automata Theory and Formal Languages in the field of
compiler design
CLO 3. Develop an understanding of computation through Push Down Automata and Turing Machines
CLO 4. Introduce the activities carried out in the different phases of a compiler
CLO 5. Identify undecidable problems.
These are sample Strategies, which teachers can use to accelerate the attainment of the various course
outcomes.
1. The lecture method (L) need not be only the traditional lecture method; alternative effective
teaching methods may be adopted to attain the outcomes.
2. Use of Video/Animation to explain functioning of various concepts.
3. Encourage collaborative (Group Learning) Learning in the class.
4. Ask at least three HOT (Higher Order Thinking) questions in the class, which promotes critical
thinking.
5. Adopt Problem Based Learning (PBL), which fosters students' analytical skills and develops design
thinking skills such as the ability to design, evaluate, generalize, and analyze information
rather than simply recall it.
6. Introduce Topics in manifold representations.
7. Show the different ways to solve the same problem with different approaches and encourage
the students to come up with their own creative ways to solve them.
8. Discuss how every concept can be applied to the real world; where that is possible, it helps
improve the students' understanding.
Module-1
Introduction to Automata Theory: Central Concepts of Automata Theory, Deterministic Finite
Automata (DFA), Non-Deterministic Finite Automata (NFA), Epsilon-NFA, NFA to DFA Conversion,
Minimization of DFA
Regular Expressions and Languages: Regular Expressions, Finite Automata and Regular
Expressions, Proving Languages Not to Be Regular
Lexical Analysis Phase of Compiler Design: Role of Lexical Analyzer, Input Buffering, Specification of
Tokens, Recognition of Tokens.
Syntax Analysis Phase of Compilers: Part-2: Bottom-up Parsing, Introduction to LR Parsing: SLR,
More Powerful LR parsers
Undecidability: A Language That Is Not Recursively Enumerable, An Undecidable Problem That Is RE.
Course Outcomes
At the end of the course the student will be able to:
CO 1. Acquire a fundamental understanding of the core concepts in automata theory and the Theory of
Computation
CO 2. Design and develop lexical analyzers, parsers, and code generators
CO 3. Design grammars and automata (recognizers) for different language classes and become
knowledgeable about restricted models of computation (Regular, Context-Free) and their
relative powers.
CO 4. Acquire a fundamental understanding of the structure of a compiler and apply concepts of
automata theory and the Theory of Computation to design compilers
CO 5. Design computational models for problems in automata theory and adapt such models
in the field of compilers
The sum of three tests, two assignments, and quiz/seminar/group discussion will be out of 100
marks and will be scaled down to 50 marks
(To have a less stressful CIE, the portion of the syllabus should not be common/repeated for any
of the methods of the CIE. Each method of CIE should cover a different portion of the syllabus of the
course.)
CIE methods /question paper has to be designed to attain the different levels of Bloom’s
taxonomy as per the outcome defined for the course.
Semester End Examination:
Theory SEE will be conducted by University as per the scheduled timetable, with common
question papers for the subject (duration 03 hours)
1. The question paper will have ten questions. Each question is set for 20 marks, and
the marks scored will be proportionally reduced to 50 marks.
2. There will be 2 questions from each module. Each of the two questions under a module
(with a maximum of 3 sub-questions), should have a mix of topics under that module.
3. The students have to answer 5 full questions, selecting one full question from each
module
Module 1
Textbooks
1. John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman, "Introduction to Automata
Theory, Languages and Computation", Third Edition, Pearson.
2. Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, "Compilers: Principles,
Techniques and Tools", Second Edition, Pearson.
Reference:
1. Elaine Rich, "Automata, Computability and Complexity", 1st Edition, Pearson
Education, 2018.
2. K. L. P. Mishra, N. Chandrasekaran, "Theory of Computer Science", 3rd Edition, PHI, 2012.
3. Peter Linz, "An Introduction to Formal Languages and Automata", 3rd Edition,
Narosa Publishers, 1998.
4. K. Muneeswaran, "Compiler Design", Oxford University Press, 2013.
Weblinks and Video Lectures (e-Resources):
1. https://nptel.ac.in/courses/106/106/106106049/#
2. https://nptel.ac.in/courses/106/104/106104123/
3. https://www.jflap.org/
Activity Based Learning (Suggested Activities in Class)/ Practical Based learning
The term "Automata" is derived from the Greek word "αὐτόματα", which means "self-acting".
An automaton (plural: automata) is an abstract self-propelled computing device which
follows a predetermined sequence of operations automatically.
Automata theory provides methods to describe and analyse the dynamic
behaviour of discrete systems.
An automaton consists of states and transitions. States are represented by circles,
and transitions by arrows.
An automaton is a kind of machine which takes some string as input; this input drives the
machine through a finite number of states, and the machine may end in a final state.
The following basic terminologies are important and frequently used in
automata theory:
Alphabet
String
Length of a String
o If S = 'cabcad', |S| = 6
o If |S| = 0, it is called an empty string (denoted by λ or ε)
Kleene Star and Positive Closure
• Definition − The Kleene star ∑* is the set of all possible strings of all possible lengths
over ∑, including λ. The set ∑+ (the positive closure) is the infinite set of all possible
strings of all possible lengths over ∑ excluding λ.
• Representation − ∑+ = ∑1 ∪ ∑2 ∪ ∑3 ∪ …
∑+ = ∑* − { ε }
• Example − If ∑ = { a, b }, then ∑+ = { a, b, aa, ab, ba, bb, … }
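A finite slice of ∑+ can be enumerated mechanically. A minimal Python sketch (the function name strings_up_to is ours, for illustration only):

```python
from itertools import product

def strings_up_to(sigma, n):
    """All strings over alphabet sigma of lengths 1..n: a finite slice of sigma-plus."""
    result = []
    for length in range(1, n + 1):                   # length 0 (the empty string) is excluded
        for combo in product(sigma, repeat=length):  # every arrangement of symbols of this length
            result.append("".join(combo))
    return result

print(strings_up_to(["a", "b"], 2))  # ['a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Starting the range at 0 instead would enumerate a finite slice of ∑*, since it adds the empty string.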
Language
An automaton with a finite number of states is called a Finite Automaton (FA) or Finite
State Machine (FSM).
In a DFA, for each input symbol, one can determine the state to which the machine will move.
Hence it is called a deterministic automaton. As it has a finite number of states, the machine
is called a Deterministic Finite Machine or Deterministic Finite Automaton.
o On a transition, the automaton can either move to the next state or
stay in the same state.
o A finite automaton has two outcomes, accept or reject. When the input
string is processed successfully and the automaton ends in a final state,
the string is accepted.
Transition Diagram:
A transition diagram or state transition diagram is a directed graph which can be constructed as
follows:
➢ There is a node for each state in Q, represented by a circle.
➢ There is a directed edge from node q to node p labeled a if δ(q, a) = p.
➢ The start state is marked by an incoming arrow with no source.
➢ Accepting states (final states) are indicated by a double circle.
[Diagram: node q with an edge labeled a to node p]
Transition Table
The transition table is basically a tabular representation of the transition function. It
takes two arguments (a state and a symbol) and returns a state (the "next state").
Example
• Q = {a, b, c},
• ∑ = {0, 1},
• q0 = a,
• F = {c}

Present State | Next State for Input 0 | Next State for Input 1
a | a | b
b | c | a
c | b | c
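The transition table above can be run directly. A small simulation sketch in Python (the dict encoding and the helper name accepts are ours, not part of the notes):

```python
# Transition table from the example above, written as a Python dict:
# (state, input symbol) -> next state.
delta = {
    ("a", "0"): "a", ("a", "1"): "b",
    ("b", "0"): "c", ("b", "1"): "a",
    ("c", "0"): "b", ("c", "1"): "c",
}
start, finals = "a", {"c"}

def accepts(w):
    """Run the DFA on input string w and report acceptance."""
    state = start
    for symbol in w:
        state = delta[(state, symbol)]
    return state in finals

print(accepts("10"))  # True: a -1-> b -0-> c, and c is final
print(accepts("0"))   # False: a -0-> a, and a is not final
```

Because the machine is deterministic, each (state, symbol) pair maps to exactly one next state, so a plain dict lookup suffices.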
Example 1:
Solution:
State | Input 0 | Input 1
→q0 | q1 | q2
q1 | q0 | q2
*q2 | q2 | q2
Explanation:
o In the above table, the first column indicates all the current states. Under
column 0 and 1, the next states are shown.
o The first row of the transition table can be read as, when the current state is
q0, on input 0 the next state will be q1 and on input 1 the next state will be
q2.
o In the second row, when the current state is q1, on input 0, the next state will
be q0, and on 1 input the next state will be q2.
o In the third row, when the current state is q2 on input 0, the next state will be
q2, and on 1 input the next state will be q2.
o The arrow marked to q0 indicates that it is a start state and circle marked to
q2 indicates that it is a final state.
Example 2:
Solution:
State | Input 0 | Input 1
→q0 | q0 | q1
q1 | q1, q2 | q2
q2 | q1 | q3
*q3 | q2 | q2
Explanation:
o The first row of the transition table can be read as, when the current state is
q0, on input 0 the next state will be q0 and on input 1 the next state will be
q1.
o In the second row, when the current state is q1, on input 0 the next state will
be either q1 or q2, and on 1 input the next state will be q2.
o In the third row, when the current state is q2 on input 0, the next state will be
q1, and on 1 input the next state will be q3.
o In the fourth row, when the current state is q3 on input 0, the next state will be
q2, and on 1 input the next state will be q2.
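Note that the table of Example 2 allows two next states (q1 and q2) from q1 on input 0, so it describes a nondeterministic automaton. Such a machine can be simulated by tracking the *set* of states it could be in. A sketch (the dict encoding and the name nfa_accepts are ours):

```python
# NFA transition table from Example 2: (state, symbol) -> set of next states.
delta = {
    ("q0", "0"): {"q0"},        ("q0", "1"): {"q1"},
    ("q1", "0"): {"q1", "q2"},  ("q1", "1"): {"q2"},
    ("q2", "0"): {"q1"},        ("q2", "1"): {"q3"},
    ("q3", "0"): {"q2"},        ("q3", "1"): {"q2"},
}
start, finals = "q0", {"q3"}

def nfa_accepts(w):
    """Accept w if some sequence of choices leads to a final state."""
    current = {start}
    for symbol in w:
        # Union of all states reachable from any current state on this symbol.
        current = set().union(*(delta.get((s, symbol), set()) for s in current))
    return bool(current & finals)

print(nfa_accepts("1011"))  # True: q0 -1-> {q1} -0-> {q1,q2} -1-> {q2,q3} -1-> {q2,q3}
```

This set-of-states idea is exactly what the subset construction uses when converting an NFA to a DFA.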
Note: The term deterministic refers to the fact that on each input there is one
and only one state to which the automaton can transition from its
current state.
Example: Draw a DFA to accept strings over ∑ = {a} having at least one a.
Solution:
i) Minimum string: a
ii) Alphabet: ∑ = {a}
iii) Transition for the minimum string: δ(q0, a) = q1
[Diagram: start → q0 −a→ q1 (accept)]
iv) Identify the other transitions: δ(q1, a) = q1
v) Construct the DFA using the transitions in steps iii and iv.

State | a
→q0 | q1
*q1 | q1

Note: "at least" corresponds to a minimum; "at most" corresponds to a maximum.
Therefore the language accepted by the DFA can be written as
L = {a, aa, aaa, aaaa, …}
or
L = {a^n : n ≥ 1}
or
L = {w : n_a(w) ≥ 1, w ∈ {a}*}
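The two-state DFA above can be cross-checked against the set definition by brute force. A sketch (the dict encoding and helper name are ours):

```python
# The two-state DFA above: q0 -a-> q1, q1 -a-> q1, with q1 final.
delta = {("q0", "a"): "q1", ("q1", "a"): "q1"}

def accepts(w):
    state = "q0"
    for c in w:
        state = delta[(state, c)]
    return state == "q1"

# Cross-check against the set definition L = {a^n : n >= 1} for small n.
for n in range(6):
    assert accepts("a" * n) == (n >= 1)
print("DFA matches L = {a^n : n >= 1} for all n < 6")
```

The empty string (n = 0) leaves the machine in the non-final start state q0, which is why λ is excluded from L.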
The same language over the larger alphabet ∑ = {a, b}:
[Diagram: start → q0 −a→ q1 (accept); q0 loops on b, q1 loops on a and b]

State | a | b
→q0 | q1 | q0
*q1 | q1 | q1

L = {w : n_a(w) ≥ 1, w ∈ {a, b}*}
3. Draw a DFA to accept strings over ∑ = {a, b} having exactly one a.
Sol:
i) Transition for the minimum string: δ(q0, a) = q1
ii) Remaining transitions: δ(q0, b) = q0, δ(q1, b) = q1
iii) Note: In the DFA so far, there is no transition from state q1 for the input symbol a.
We can include a transition from state q1 on input a to a non-final reject state called a
trap state; therefore q2 is the trap state, and it loops to itself on both a and b.
[Diagram: start → q0 −a→ q1 (accept) −a→ q2 (trap); q0 loops on b, q1 loops on b, q2 loops on a and b]

State | a | b
→q0 | q1 | q0
*q1 | q2 | q1
q2 | q2 | q2

Sample strings: abbbb and bbbba are accepted (exactly one a); baba is rejected (more than one a).
L = {w : n_a(w) = 1, w ∈ {a, b}*}
4. Draw a DFA to accept strings of 0's and 1's having 3 consecutive 0's.
Sol:
i) Minimum string: 3 consecutive 0's (000)
ii) Alphabet: ∑ = {0, 1}
iii) Transition table:

State | 0 | 1
→q0 | q1 | q0
q1 | q2 | q0
q2 | q3 | q0
*q3 | q3 | q3

[Diagram: q0 −0→ q1 −0→ q2 −0→ q3; q0 loops on 1; q1 and q2 return to q0 on 1; q3 loops on 0 and 1]
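This DFA should accept exactly the strings containing 000 as a substring, which can be verified exhaustively for short strings. A sketch (the dict encoding and helper name are ours):

```python
from itertools import product

# DFA for strings over {0, 1} containing three consecutive 0's
# (reconstructed transition table from the example above).
delta = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q2", ("q1", "1"): "q0",
    ("q2", "0"): "q3", ("q2", "1"): "q0",
    ("q3", "0"): "q3", ("q3", "1"): "q3",
}

def accepts(w):
    state = "q0"
    for c in w:
        state = delta[(state, c)]
    return state == "q3"

# Exhaustively compare with the substring test for all strings up to length 6.
for n in range(7):
    for t in product("01", repeat=n):
        w = "".join(t)
        assert accepts(w) == ("000" in w)
print("DFA agrees with the substring test on all strings of length <= 6")
```

Each state records how many consecutive 0's have just been seen (q0 = none, q1 = one, q2 = two), and q3 is a permanent accepting state once 000 has occurred.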
Further conversions should be studied from the class notes.
Introduction to Compiling:
1.1 INTRODUCTION TO THE LANGUAGE PROCESSING SYSTEM
A preprocessor produces input to compilers. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands
for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessing: These preprocessors augment older languages with more modern
flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by
means of built-in macros.
COMPILER
A compiler is a translator program that takes a program written in a high-level language (HLL),
the source program, and translates it into an equivalent program in machine-level language (MLL),
the target program. An important part of a compiler's job is reporting errors to the programmer.
ShreedeviPramod,Dept of CSE,BrCE
Executing a program written in an HLL basically involves two steps: the
source program must first be compiled, i.e. translated into an object program; then the resulting
object program is loaded into memory and executed.
ASSEMBLER
Programmers found it difficult to write or read programs in machine language. They began to use
a mnemonic (symbol) for each machine instruction, which they would subsequently translate
into machine language. Such a mnemonic machine language is now called an assembly
language. Programs known as assemblers were written to automate the translation of assembly
language into machine language. The input to an assembler is called the source program;
the output is a machine-language translation (object program).
INTERPRETER
An interpreter is a program that appears to execute a source program as if it were machine language.
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. Java also uses an
interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Advantages:
Modification of the user program can easily be made and implemented as execution
proceeds. The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simplified task for a program used with
interpretation. The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is higher.
Once the assembler produces an object program, that program must be placed into memory and
executed. The assembler could place the object program directly in memory and transfer control
to it, thereby causing the machine-language program to be executed. However, this would waste core
by leaving the assembler in memory while the user's program was being executed. Also, the programmer
would have to retranslate the program with each execution, thus wasting translation time. To
overcome these problems of wasted translation time and memory, system programmers
developed another component called the loader.
"A loader is a program that places programs into memory and prepares them for execution." It
would be more efficient if subroutines could be translated into object form which the loader
could "relocate" directly behind the user's program. The task of adjusting programs so that they may
be placed in arbitrary core locations is called relocation. Relocating loaders perform four
functions.
1.2 TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as
output a program in another language. Besides program translation, the translator performs
another very important role: error detection. Any violation of the HLL specification would be
detected and reported to the programmer. The important roles of a translator are:
1 Translating the HLL program input into an equivalent machine-language (ML) program.
2 Providing diagnostic messages wherever the programmer violates the specification of the HLL.
Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this phase, expressions,
statements, declarations, etc. are identified by using the results of lexical analysis. Syntax
analysis is aided by using techniques based on the formal grammar of the programming language.
Code Optimization :-
This is an optional phase intended to improve the intermediate code so that the output runs faster
and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to reduce the length
of machine language program are carried out during this phase. The output of the code
generator is the machine language program of the specified computer.
Code Generator produces the object code by deciding on the memory locations for data,
selecting code to access each datum and selecting the registers in which each computation is to
be done. Many computers have only a few high speed registers in which computations can be
performed quickly. A good code generator would attempt to utilize registers as efficiently as
possible.
Table Management (or) Book-keeping:-
A compiler needs to collect information about all the data objects that appear in the source
program. The information about data objects is collected by the early phases of the compiler
(the lexical and syntactic analyzers). The data structure used to record this information is called
the Symbol Table.
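The idea of a symbol table can be sketched as a single mapping from names to the attributes collected for them. This is an illustration only (class and method names are ours); real compilers use scoped, hashed tables:

```python
# A minimal symbol-table sketch: one dict mapping names to collected attributes.
class SymbolTable:
    def __init__(self):
        self.entries = {}

    def insert(self, name, **attrs):
        """Record a data object, merging in any newly discovered attributes."""
        self.entries.setdefault(name, {}).update(attrs)

    def lookup(self, name):
        """Return the attributes recorded for name, or None if absent."""
        return self.entries.get(name)

table = SymbolTable()
table.insert("count", kind="identifier")      # added by the lexical analyzer
table.insert("count", type="int", offset=0)   # enriched by later phases
print(table.lookup("count"))  # {'kind': 'identifier', 'type': 'int', 'offset': 0}
```

Note how different phases can keep adding attributes to the same entry, which mirrors how the early and later phases of the compiler share the table.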
Error Handling :-
One of the most important functions of a compiler is the detection and reporting of errors in the
source program. The error messages should allow the programmer to determine exactly where the
errors have occurred. Errors may occur in any of the phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the error to the error handler,
which issues an appropriate diagnostic message. Both the table-management and error-handling
routines interact with all phases of the compiler.
Example:
Fig 1.6: Compilation Process of a source code through phases
Lexical Analyzer
THE ROLE OF THE LEXICAL ANALYZER
[Figure: interactions between the lexical analyzer, the parser, and the symbol table]
Lexical analysis is the first phase of the compiler, where the lexical analyzer operates as an
interface between the source code and the rest of the phases of the compiler. It reads the input
characters of the source program, groups them into lexemes, and produces a token for
each lexeme. The tokens are sent to the parser for syntax analysis.
If the lexical analyzer is implemented as a separate pass in the compiler, it may need an intermediate
file to hold its output, from which the parser would then take its input. To eliminate the
need for the intermediate file, the lexical analyzer and the syntactic analyzer (parser) are often
grouped into the same pass, where the lexical analyzer operates either under the control of the
parser or as a subroutine called by the parser.
The lexical analyzer also interacts with the symbol table while passing tokens to the parser.
Whenever a token is discovered, the lexical analyzer returns a representation of that token to
the parser. If the token is a simple construct such as a parenthesis, comma, or colon, it
returns an integer code. If the token is a more complex item such as an identifier or
another token with a value, the value is also passed to the parser.
The lexical analyzer separates the characters of the source language into groups that logically
belong together, called tokens. A token includes a token name, which is an abstract symbol that
defines a type of lexical unit, and an optional attribute value. Tokens can be
identifiers, keywords, constants, operators, and punctuation symbols such as commas and
parentheses. A rule that describes the set of input strings for which the same token is produced
as output is called a pattern.
The lexical analyzer also handles tasks such as stripping out comments and whitespace
(tab, newline, blank, and other characters that are used to separate tokens in the input), and
correlating error messages that are generated by the compiler with the
source program.
For example, it can keep track of all newline characters so that it can associate a
line number with each error message. If a macro preprocessor is used, the lexical analyzer
may also implement the expansion of macros.
If the lexical analyzer had to access secondary memory each time to identify tokens, it would be
time-consuming and costly. So the input string is stored in a buffer and then scanned by the
lexical analyzer.
The lexical analyzer scans the input string from left to right one character at a time to identify
tokens. It uses two pointers to scan tokens −
• Both pointers start at the beginning of the string, which is stored in the buffer.
• The character ("blank space") beyond the token ("int") has to be examined before the
token ("int") can be determined.
• After processing the token ("int"), both pointers are set to the start of the next token ('a'), and this
process is repeated for the whole program.
A buffer can be divided into two halves. If the lookahead pointer moves past the halfway point of the
first half, the second half is filled with new characters to be read. If the lookahead pointer
moves towards the right end of the second half, the first half is filled with
new characters, and so on.
Sentinels − Sentinels are used to reduce the checking: each time the forward pointer is
advanced, a check must be made to ensure that one half of the buffer has not been moved off; if
it has, the other half must be reloaded. Placing a sentinel character at the end of each half lets
this end-of-buffer check be combined with the test for the current character.
Buffer Pairs − A specialized buffering technique can decrease the amount of overhead
needed to process an input character. It uses two
buffers, each of N-character size, which are reloaded alternately.
Two pointers, lexemeBegin and forward, are maintained. lexemeBegin
points to the start of the current lexeme, which is yet to be discovered. forward scans ahead until a
match for a pattern is discovered. Once the next lexeme is determined, forward is set to the
character at its right end; after the lexeme is recorded, lexemeBegin is set to the
character immediately after the lexeme just found.
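The two-pointer scan above can be sketched in a few lines. This is an illustration only (names and the single flat buffer are ours; a real scanner uses two N-character halves reloaded alternately):

```python
# Sketch of the two-pointer scan: lexeme_begin marks the start of the current
# lexeme, forward advances while the pattern can still be extended.
# A sentinel character marks the end of the buffered input, so the loop
# terminates without a separate end-of-buffer test.
SENTINEL = "\0"

def next_word(buffer, lexeme_begin):
    """Return the next run of letters starting at lexeme_begin, and the new position."""
    forward = lexeme_begin
    while buffer[forward].isalpha():   # extend while the pattern still matches
        forward += 1                   # the sentinel is not a letter, so this always stops
    lexeme = buffer[lexeme_begin:forward]
    return lexeme, forward

buffer = "int a" + SENTINEL
word, pos = next_word(buffer, 0)
print(word, pos)  # int 3
```

After a lexeme is returned, the caller would set lexeme_begin to the character just past it and scan again, exactly as the two pointers are described above.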
Preliminary Scanning − Certain processes are best performed as characters are moved from
the source file to the buffer. For example, comments can be deleted. Languages
like FORTRAN, which ignore blanks, can have them deleted from the character stream, and
strings of several blanks can be collapsed into one blank. Pre-processing the character stream
before it is subjected to lexical analysis saves the trouble of moving the lookahead pointer back and
forth over a string of blanks.
Token:
A token is a pair consisting of a token name and an optional attribute value. The token name
is an abstract symbol representing a kind of lexical unit, for example, a particular keyword or a
sequence of input characters denoting an identifier. The token names are the input symbols that
the parser processes.
Typical tokens are:
1) Identifiers
2) Keywords
3) Operators
4) Special symbols
5) Constants
Pattern
A pattern is a description of the form that the lexemes of a token may take. In the case
of a keyword as a token, the pattern is just the sequence of characters that form
the keyword. For identifiers and some other tokens, the pattern is a more complex
structure that is matched by many strings.
Lexeme
A lexeme is a sequence of characters in the source program that matches the pattern for a token
and is identified by the lexical analyzer as an instance of that token.
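The three notions fit together naturally in a small tokenizer: each token name is paired with the regular-expression pattern its lexemes match. A sketch using Python's re module (the token set and patterns are illustrative, not a full language specification):

```python
import re

# Each token name paired with the *pattern* its lexemes must match.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|if|else|while)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("SKIP",       r"\s+"),               # whitespace is stripped out
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (token name, lexeme) pairs for the source string."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("int count = 10")))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'count'), ('OPERATOR', '='), ('NUMBER', '10')]
```

Here "int" is a lexeme, KEYWORD is its token name, and `\b(?:int|if|else|while)\b` is the pattern; KEYWORD is listed before IDENTIFIER so that keywords are not mistaken for identifiers.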