• Language Processors: Overview of a language processing system – preprocessors – compilers – assemblers – linkers & loaders; difference between compiler and interpreter; structure of a compiler – phases of a compiler.
• Lexical Analysis: Role of Lexical Analysis – Lexical Analysis versus Parsing – Tokens, Patterns, and Lexemes – Attributes for Tokens – Lexical Errors – Input Buffering: Buffer Pairs – Sentinels
• Specification of Tokens: Strings and Languages – Operations
on Languages – Regular Expressions – Regular Definitions
• Recognition of Tokens: Transition Diagrams – Recognition of
Reserved Words and Identifiers - Completion of the Running
Example – Architecture of a Transition–Diagram-Based Lexical
Analyzer
• The Lexical Analyzer Generator (LEX): Use of Lex – Structure
of Lex Programs
Code optimization
• The code optimization phase attempts to improve the intermediate code, so that faster-running machine code will result.
• The goal is faster, shorter, or less power-consuming target code.
• A compiler may spend a significant amount of time on this phase.
• The optimized three-address code produced by the code optimization phase for the example statement is shown below.
Code Generation
• It takes intermediate representation of the source
program as input and maps it into the target
language.
• If the target language is machine code, registers or
memory locations are selected for each of the
variables used by the program.
• Intermediate instructions are translated into sequences of machine instructions.
• A crucial part is the assignment of registers to hold the values of variables (see the annotated code after this list).
• In the instructions below, the first operand of each instruction specifies a source and the second operand specifies the destination.
• The suffix F on each instruction indicates that it deals with floating-point numbers.
• The # signifies that 60.0 is to be treated as an immediate constant.
• MOVF id3, R2
• MULF #60.0, R2
• MOVF id2, R1
• ADDF R2, R1
• MOVF R1, id1
Error handler
• Each phase can encounter errors.
• After detecting an error, a phase must somehow deal
with that error, so that compilation can proceed,
allowing further errors in the source program to be
detected.
• The lexical analysis phase can detect errors in which the characters in the input do not form any token of the language.
• The syntax analysis phase can detect errors in which the token stream violates the structure (syntax) rules of the language.
• The semantic analysis phase can detect constructs that are syntactically well formed but have no meaning for the operation involved (illustrative fragments follow).
Phase vs. Pass
Phase:
• The process of compilation is carried out in various steps; each step is called a phase.
• The different phases include: lexical analysis (LA), syntax analysis (SA), semantic analysis (SeA), intermediate code generation (ICG), code optimization (CO), and code generation (CG).
Pass:
• One complete scan of the source language is called a pass.
• A pass includes reading an input file and writing to an output file.
• Many phases can be grouped into one pass.
• The task of compilation may be carried out in a single pass or in multiple passes.
Role of Lexical Analysis
• The lexical analyzer is the first phase of a compiler.
• Its main task is to read the input characters and
produce as output a sequence of tokens that the
parser uses for syntax analysis.
• Another task of the lexical analyzer is stripping out comments and whitespace (blank, tab, and newline characters) from the source program.
• It also helps in correlating error messages from the compiler with the source program.
• The lexical analyzer may keep track of the number of newline characters seen, so that a line number can be associated with an error message.
• In some compilers, the lexical analyzer is in charge of making
a copy of the source program with the error messages
marked in it.
• If the lexical analyzer finds a token invalid, it generates an
error.
• The lexical analyzer works closely with the syntax analyzer.
• It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when the syntax analyzer demands it.
• The lexical analyzer collects information about tokens into
their associated attributes.
• After identifying the tokens, the lexemes are entered into a database called the symbol table.
• It works in two phases (a minimal C sketch follows):
1. Scanning – simple processing such as removing whitespace and comments
2. Separation of the input into tokens
Lexical analysis Vs Parsing
• All compilers separate the task of analyzing
syntax into two different parts.
• Lexical and syntax
• Lexical-> small scale language constructs
– Names and literals
• Syntax-> large scale language constructs
– expressions, statements and program units
Why is lexical analysis separated from syntax analysis?
1. Simplicity
– Lexical analysis deals with the simpler constructs, so its design is straightforward.
– The syntax analyser can be smaller and cleaner because the low-level details of lexical analysis are removed from it.
2. Efficiency
– Lexical analysis accounts for a significant portion of total compile time, so it is worth optimizing.
– It is far less critical to optimize the syntax analyser.
3. Portability
– The lexical analyser may not be fully portable, because input-device-specific peculiarities can be confined to the scanner.
– The syntax analyser can then remain portable.
Token
• A token is a sequence of characters that can be treated as a single logical entity – a sequence of characters having a collective meaning in the source program.
• Typical tokens are identifiers, keywords, operators, special symbols, and constants.
• Pattern: the set of rules (the description) that the lexemes of a token must match.
• Lexeme: a sequence of characters in the source program that matches the pattern for a token.
• Examples:
– Token: if (keyword); Pattern: the characters i, f; Lexeme: if
– Token: relational operator; Pattern: < or <= or >; Lexemes: <, >, <=
Lexical errors
• It is hard for the lexical analyzer to tell, without the aid of other components, that there is a source-code error.
• Some errors are beyond the power of the lexical analyzer to recognize:
– fi (a == f(x)) …
• The lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier, since fi is a valid lexeme (it matches the pattern for an identifier).
• A lexical error proper is recognized only when no pattern for any token matches a prefix of the remaining input.
• Errors like the fi example are left to some other phase of the compiler, probably the parser, to handle.
• If the lexical analyser is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input,
• the simplest recovery strategy is panic-mode recovery.
Error recovery
• Panic mode: successive characters are ignored
until we reach to a well formed token
– Delete one character from the remaining input
– Insert a missing character into the remaining input
– Replace a character by another character
– Transpose two adjacent characters
Input Buffering:
Buffer Pairs – Sentinels
Buffer pairs
• Because of the amount of time taken to process characters, and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed.
• We introduce a two-buffer scheme to handle large lookaheads safely.
• Each buffer is of the same size N.
• N is usually the size of a disk block.
• We can read N characters into a buffer with one read operation.
• If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file.
• Two pointers to the input are maintained (see the layout sketch below):
• 1. Pointer lexemeBegin: marks the beginning of the current lexeme.
• 2. Pointer forward: scans ahead until a pattern match is found.
• Once the next lexeme is determined, forward is set to the character at its right end.
• Then lexemeBegin is set to the character immediately after the lexeme just found.
Sentinels
• For each character read we make two tests:
– one for the end of the buffer
– One to determine what character is read
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program; a natural choice is the character eof.
• eof also retains its use as a marker for the end of the entire input.
• Any eof that appears other than at the end of a buffer means that the input is at an end (a C-style sketch follows).
Specification of Tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens.
• Regular expressions are a means for specifying regular languages (a small example follows this list).
– Strings and Languages
– Operations on Languages
– Regular Expressions
– Regular Definitions
Strings and Languages
Recognition of Tokens
• Transition Diagrams
• Recognition of Reserved Words and Identifiers
• Completion of the Running Example
• Architecture of a Transition-Diagram-Based Lexical Analyzer (a C sketch is given below)
The Lexical Analyzer Generator (Lex)
• An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be generated.
• The Lex compiler transforms lex.l into a C program, in a file that is always named lex.yy.c.
• That file is compiled by the C compiler into a file called a.out, as always.
• The C compiler's output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens (typical commands are shown below).
• The attribute value, whether it be another numeric
code, a pointer to the symbol table, or nothing, is
placed in a global variable yylval, which is shared
between the lexical analyzer and parser, thereby
making it simple to return both the name and an
attribute value of a token.
• The declarations section includes declarations of
variables, manifest constants (identifiers declared to
stand for a constant, e.g., the name of a token), and
regular definitions.
• The translation rules each have the form:
Pattern { Action }
• Each pattern is a regular expression, which may use the
regular definitions of the declaration section.
• The actions are fragments of code, typically written in C,
although many variants of Lex using other languages
have been created.
• The third section holds whatever additional functions are
used in the actions.
• Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
• A skeleton Lex program illustrating the three sections is sketched below.
• When called by the parser, the lexical analyzer
begins reading its remaining input, one character
at a time, until it finds the longest prefix of the
input that matches one of the patterns P.
• It then executes the associated action A.
• Typically, A will return control to the parser, but if it does not (e.g., because P describes whitespace or comments), then the lexical analyzer proceeds to find additional lexemes, until one of the corresponding actions causes a return to the parser.
• The lexical analyzer returns a single value, the
token name, to the parser, but uses the shared,
integer variable yylval to pass additional
information about the lexeme found, if needed.