
Compiler Construction

Lecture 2 - Lexical Analysis

© 2003 Robert M. Siegfried. All rights reserved.

Lexical Analysis
• Lexical analysis (or scanning) is the process by which the stream of characters is grouped into strings representing the words of a language (called lexemes), which correspond to specific grammatical elements of that language (called tokens).
• Tokens are the fundamental building blocks of a program’s grammatical structure, representing such basic elements as identifiers, numeric literals, and specific keywords and operators of the language.
Lexical Analysis (continued)

• Lexemes are the character strings assembled from the character stream of a program, and the token represents what component of the program’s grammar they constitute.

The scanner’s role in the compiler’s front end

[Diagram: the parser requests tokens from the scanner and semantic actions from the semantic analyzer; the scanner returns tokens to the parser, and all parts of the front end share the symbol table.]
The scanner’s role in the compiler’s
front end (continued)
• The parser is the driving force for much of the compiler’s
front end.
• The parser requests a token from the scanner, which
returns the token corresponding to the next lexeme.
• The parser requests a particular semantic action, which
depends on what component of the grammar is being
parsed.
• All parts of the front end add information into the symbol
table; the scanner adds lexemes to the symbol table (when
necessary) and the symbol table returns to the scanner the
token corresponding to the lexeme.
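
As an illustration of this request/return loop, here is a small, self-contained sketch; the names (Token, nexttoken, and the token classes) are illustrative only and are not the classes used later in these slides.

#include <cctype>
#include <iostream>
#include <string>

enum Token { tokidentifier, toknumber, tokother, tokeof };

// A toy "scanner": return the token for the next lexeme in src, starting at pos.
Token nexttoken(const std::string &src, size_t &pos, std::string &lexeme)
{
    while (pos < src.size() && isspace((unsigned char)src[pos]))
        pos++;                                          // skip blanks
    if (pos >= src.size())
        return tokeof;
    lexeme.clear();
    if (isalpha((unsigned char)src[pos])) {             // identifier / word
        while (pos < src.size() && isalnum((unsigned char)src[pos]))
            lexeme += src[pos++];
        return tokidentifier;
    }
    if (isdigit((unsigned char)src[pos])) {             // numeric literal
        while (pos < src.size() && isdigit((unsigned char)src[pos]))
            lexeme += src[pos++];
        return toknumber;
    }
    lexeme += src[pos++];                               // operator, etc.
    return tokother;
}

// A toy "parser" driving the scanner: it repeatedly requests tokens.
int main()
{
    std::string src = "sum = sum + 42 ;", lexeme;
    size_t pos = 0;
    for (Token t = nexttoken(src, pos, lexeme); t != tokeof;
         t = nexttoken(src, pos, lexeme))
        std::cout << lexeme << " -> token " << t << "\n";
}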

Lexical analysis - an example

Consider the statement:
for (i = 0; i < amount; i++) sum += x[i];

The character stream (with the characters' ASCII values in hex) is:

f  o  r     (  i     =     0  ;     i     <     a  m  o  u  n  t  ...
66 6F 72 20 28 69 20 3D 20 30 3B 20 69 20 3C 20 61 6D 6F 75 6E 74 ...

The scanner assembles the following lexemes:

for ( i = 0 ; i < …

and finds in the symbol table the corresponding tokens:

for openparen identifier Assign NumLiteral semicolon identifier lessthan …
Tokens and Lexemes
• In many instances, there is a one-to-one
correspondence between the lexemes and tokens
for reserved words and operators.
• User-defined identifiers are usually assigned the
token of Identifier.
• Numbers are usually assigned the token of
NumericLiteral or something more specific like
IntegerLiteral.
• Characters and strings are assigned the token of
CharacterLiteral.
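
A small sketch of this mapping, assuming an illustrative token-class enumeration (these names are not the ones used in the scanner code later in the lecture): reserved words map one-to-one onto their own tokens, while every other word shares the Identifier token and every number shares the NumericLiteral token.

#include <cctype>
#include <iostream>
#include <map>
#include <string>

enum TokenClass { tokfor, tokwhile, tokidentifier, toknumericliteral };

TokenClass classify(const std::string &lexeme)
{
    // Reserved words each have their own token (one-to-one correspondence).
    static const std::map<std::string, TokenClass> reserved = {
        { "for", tokfor }, { "while", tokwhile }
    };
    auto it = reserved.find(lexeme);
    if (it != reserved.end())
        return it->second;
    // Everything beginning with a digit shares the NumericLiteral token;
    // any other word shares the Identifier token.
    return isdigit((unsigned char)lexeme[0]) ? toknumericliteral : tokidentifier;
}

int main()
{
    for (const std::string &lex : { "for", "amount", "42", "while" })
        std::cout << lex << " -> token class " << classify(lex) << "\n";
}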

A Brief Introduction to Formal Language Theory

String - A sequence of symbols.
e.g., ABabbCaC
      typedef int * intptr ;

Alphabet - A finite set of symbols.
e.g., A, B, C
      1, 2, 3
      ARRAY, SET, ;, OF, +
A Brief Introduction to Formal Language Theory (continued)

Language - Any set of strings over an alphabet.

Graphs - A finite set of vertices and arcs.
[Diagram: a graph with vertices A, B, C, D connected by arcs]

A Brief Introduction to Formal Language Theory (continued)

Trees - a directed graph without circuits.
[Diagram: a tree whose nodes include B, C, D, E, F, and G]
A Brief Introduction to Formal Language Theory (continued)

Terminals - any symbol in a given language’s alphabet. In formal language theory, they are represented by lower-case letters (e.g., a, b).
e.g., int, while, class are terminals in C++.

Nonterminals - symbols that stand for combinations of terminals. A combination of terminals can be derived from a nonterminal, according to the productions (rules) of the grammar of the language. Usually represented by capital letters (e.g., A, B).

A Brief Introduction to Formal Language Theory (continued)

Examples of nonterminals and the productions in which they appear:
Expression ::= Term ({+|-} Term)*
Term ::= Factor ({*|/} Factor)*
Factor ::= Identifier
Factor ::= Constant

Variables - Any terminal or nonterminal, usually represented by a Greek letter (e.g., α, β).
Chomsky Hierarchy

Type   Language                    Automaton
0      Recursively enumerable      Turing Machine
       (or unrestricted)
1      Context-sensitive           Linear-bounded Turing Machine
2      Context-free                Pushdown Automaton
3      Regular                     Finite Automaton

Type N automata are computer implementations designed to process type N languages.

Languages & Grammars

Type 0 - Recursive enumerable α ::= ß,


where α and ß can be any string or variable.
Type 1 - Context-sensitive α ::= ß,
where |α | < |ß| and at least one character in α is a nonterminal
Type 2 - Context-free A ::= ß,
where there is one and only nonterminal and NO terminals on the left.
Type 3 - Regular A ::= a or A ::= aB

Most real programming languages are almost context-free


(with a few context-sensitive traits)
Automata

Turing Machine - Given an input stream, it performs as many computations as necessary, finally deciding whether to accept (if the string is within the language).

[Diagram: a tape of symbols (a b a c d e f c a b b a d c a) with a read/write head positioned over one cell]

Linear-bounded Automaton - works like a Turing Machine, but limits its workspace to the length of the input string.
Automata (continued)

Pushdown Automaton - Uses a stack as its auxiliary memory; it can write to and read from only the top of the stack.

Finite Automaton - The machine cannot write anything. After reading the string, it either accepts or rejects the string.

In theory, computers can be as powerful as a Turing Machine.
Deterministic Finite Automata

Definition – A deterministic finite automaton is a 5-tuple (Q, Σ, δ, q0, F) where:
Q is a finite set of states
Σ is a finite alphabet
q0 ∈ Q is the initial state
F ⊆ Q is a set of final states
δ: Q × Σ → Q is the transition function

Deterministic Finite Automata (continued)

[Diagram: a four-state DFA over Σ = {0, 1} with states q0, q1, q2, q3]

Σ = { 0, 1 }
Q = { q0, q1, q2, q3 }
F = { q0 }

δ(q0, 0) = q2; δ(q0, 1) = q1; δ(q1, 0) = q0; δ(q1, 1) = q3
δ(q2, 0) = q3; δ(q2, 1) = q0; δ(q3, 0) = q1; δ(q3, 1) = q2

δ maps Q × Σ into Q
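
A DFA like this one is easy to simulate with a transition table. The sketch below encodes the δ listed above, with states 0-3 standing for q0-q3; the function name acceptsDFA is only illustrative.

#include <iostream>
#include <string>

bool acceptsDFA(const std::string &input)
{
    //                   on '0'  on '1'
    int delta[4][2] = { { 2,      1 },    // q0
                        { 0,      3 },    // q1
                        { 3,      0 },    // q2
                        { 1,      2 } };  // q3
    int state = 0;                        // start in q0
    for (char c : input) {
        if (c != '0' && c != '1')
            return false;                 // symbol not in Σ
        state = delta[state][c - '0'];
    }
    return state == 0;                    // F = {q0}
}

int main()
{
    std::cout << acceptsDFA("0110") << " " << acceptsDFA("010") << "\n";
}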
Deterministic Finite Automata (continued)

[Diagram: a four-state DFA with states q0, q1, q2, q3 and transitions on a, b, c]

δ(q0, a) = q1; δ(q0, b) = q3; δ(q0, c) = q3
δ(q1, a) = q0; δ(q1, b) = q1; δ(q1, c) = q2
δ(q2, a) = q3; δ(q2, b) = q3; δ(q2, c) = q3
δ(q3, a) = q3; δ(q3, b) = q3; δ(q3, c) = q3

What language will this DFA accept?

[Diagram: a DFA with states q0 through q5 and transitions on letters (A-Z), digits (0-9), and blanks]

Σ = {A-Z, 0-9, blank}


Regular Expressions

Regular expressions can describe indefinite repetition of symbols in the languages they denote.
(01)* = {ε, 01, 0101, 010101, 01010101, 0101010101, ...}
1*00*1* = {0000, 100, 10001, 1000, 1001, 100001, 10000, ...}
(0+10)* = {ε, 0, 10, 100, 1010, 010, 0100, ...}
(1+ε)(0+10)* = {1, 010, 100, 1100, 1010, ...}
(0+1)*101 = {101, 00101, 0101, 10101, 1011101, 000101, ...}
0+1+2+ = 00*11*22* = {001112, 012, 0112, 00111222, ...}
b*(aa)*c* = {?}
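
Membership in one of these languages can be checked mechanically. A quick sketch using C++'s <regex> for (0+1)*101, with the slide's + for alternation written as |:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::regex r("(0|1)*101");            // (0+1)*101 in the slide's notation
    for (const std::string &s : { "101", "00101", "1011101", "100" })
        std::cout << s << (std::regex_match(s, r) ? " is" : " is not")
                  << " in the language\n";
}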

Deterministic vs. Nondeterministic Automata

• A deterministic finite automaton makes one and only one move for a given state-symbol combination.
• A nondeterministic finite automaton can make 0 or more moves for such a combination.

[Diagram: a five-state NFA with states q0 through q4 and transitions on 0 and 1]

M = (Q, Σ, δ, q0, F) as before, but δ(Q, Σ) may be a set of states or undefined.

Is this more powerful than a DFA?
Nondeterministic Finite Automata

Every nondeterministic finite automaton has an equivalent deterministic automaton.

The NFA M is defined as M = (Q, Σ, δ, q0, F), where
Q = { q0, q1, q2 }
Σ = { 0, 1 }
F = { q1 }

δ       q0          q1     q2
0       {q0, q2}    q2     ∅
1       q1          q1     ∅

[Diagram: the transition graph of this NFA]
Nondeterministic Finite Automata (continued)

[Diagram: an NFA with states q0 through q4 and transitions on 0 and 1, together with the equivalent Deterministic Finite Automaton]


The equivalence of DFAs and NFAs

A DFA and an NFA are equivalent if their 5-tuples are equivalent. They are also equivalent if they accept the same language.
E.g.,

[Diagram: an NFA with states q0 through q4 and transitions on 0 and 1]

What language will this NFA accept?

The equivalence of DFAs and NFAs (continued)

[Diagram: a DFA with states q0 through q3 and transitions on 0 and 1]

What language will this DFA accept?
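
The equivalent DFA can be built mechanically by the subset construction, in which each DFA state stands for a set of NFA states. A sketch, using a small illustrative NFA (not the machine in the diagrams above):

#include <iostream>
#include <map>
#include <set>
#include <vector>

using StateSet = std::set<int>;

int main()
{
    // NFA over {0, 1}: delta[state][symbol] is the set of next states.
    std::vector<std::vector<StateSet>> delta = {
        { { 0, 1 }, { 0 } },   // q0
        { { },      { 2 } },   // q1
        { { },      { }   }    // q2 (accepting)
    };
    std::set<int> nfaFinal = { 2 };

    std::map<StateSet, int> dfaId;                 // DFA state number per subset
    std::vector<StateSet> worklist = { { 0 } };    // start with {q0}
    dfaId[{ 0 }] = 0;

    for (size_t i = 0; i < worklist.size(); ++i) {
        StateSet cur = worklist[i];
        for (int sym = 0; sym < 2; ++sym) {
            StateSet next;
            for (int s : cur)
                next.insert(delta[s][sym].begin(), delta[s][sym].end());
            if (!dfaId.count(next)) {              // a new DFA state was discovered
                dfaId[next] = (int)worklist.size();
                worklist.push_back(next);
            }
            std::cout << "D" << dfaId[cur] << " --" << sym
                      << "--> D" << dfaId[next] << "\n";
        }
    }
    // A DFA state is final if its subset contains an NFA final state.
    for (const auto &p : dfaId)
        for (int s : p.first)
            if (nfaFinal.count(s)) {
                std::cout << "D" << p.second << " is final\n";
                break;
            }
}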


NFA with epsilon-moves

NFAs can contain ε-moves, where ε is the empty string. An ε-move takes the machine to another state without having to read another character.

[Diagram: an NFA with states q0, q1, q2 that loop on 0, 1, and 2 respectively, with ε-moves from q0 to q1 and from q1 to q2]

NFAs with epsilon moves (continued)

To convert this to the equivalent NFA without ε-moves, we need to find ε-closure(q), the set of all states p which can be reached from q by ε-moves.

[Diagram: the equivalent NFA without ε-moves over the inputs 0, 1, and 2, with states q0, q1, q2]
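
ε-closure(q) can be computed with a simple search that follows only the ε-moves. A sketch, with the ε-moves of the three-state example above (q0 to q1, q1 to q2) as its input:

#include <iostream>
#include <set>
#include <vector>

std::set<int> epsilonClosure(int q, const std::vector<std::set<int>> &epsMoves)
{
    std::set<int> closure = { q };
    std::vector<int> stack = { q };
    while (!stack.empty()) {
        int s = stack.back();
        stack.pop_back();
        for (int t : epsMoves[s])             // follow every ε-move out of s
            if (closure.insert(t).second)     // t not seen before
                stack.push_back(t);
    }
    return closure;
}

int main()
{
    // ε-moves from q0, q1, q2 respectively
    std::vector<std::set<int>> epsMoves = { { 1 }, { 2 }, { } };
    for (int s : epsilonClosure(0, epsMoves))
        std::cout << "q" << s << " ";         // prints q0 q1 q2
    std::cout << "\n";
}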

Why are DFAs, NFAs and NFAs with epsilon moves important?

• We can automate the construction of NFAs with epsilon moves from regular expressions.
• From there, we can build NFAs without epsilon moves, and in turn, a DFA.
• A DFA is easy to implement in a computer program.
A few basic NFAs

[Diagram: two-state NFAs for r = 0 and r = 1, and a four-state NFA formed by joining two such fragments with an ε-move]

A few basic NFAs (continued)

[Diagram: an NFA for r = 0 + 1: ε-moves lead from q0 to q1 and q3; q1 goes to q2 on 0 and q3 goes to q4 on 1; ε-moves from q2 and q4 lead to the accepting state q5]
A few basic NFAs (continued)

[Diagram: an NFA for r = 1*, built from states q0, q1, q2, q5 with ε-moves around a single transition on 1]

Combining the basic NFAs

r = 0 + 0*1

[Diagram: the NFA for r = 0 + 0*1, formed by joining the basic NFAs for 0, 0*, and 1 with ε-moves into a single machine with one start state and one accepting state]
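
A sketch of how the basic NFAs can be combined programmatically, in the style of Thompson's construction: each operation builds a fragment with one start state and one accepting state, and fragments are glued together with ε-moves. The main() below builds r = 0 + 0*1 and prints the resulting transitions.

#include <iostream>
#include <vector>

struct Edge { int from, to; char label; };       // label 'e' stands for ε
struct Frag { int start, accept; };              // one fragment of the NFA

std::vector<Edge> edges;
int nstates = 0;
int newstate() { return nstates++; }

Frag symbol(char c)                              // basic NFA for one symbol
{
    Frag f{ newstate(), newstate() };
    edges.push_back({ f.start, f.accept, c });
    return f;
}
Frag star(Frag a)                                // a*
{
    Frag f{ newstate(), newstate() };
    edges.push_back({ f.start, a.start, 'e' });
    edges.push_back({ f.start, f.accept, 'e' });
    edges.push_back({ a.accept, a.start, 'e' });
    edges.push_back({ a.accept, f.accept, 'e' });
    return f;
}
Frag cat(Frag a, Frag b)                         // ab
{
    edges.push_back({ a.accept, b.start, 'e' });
    return { a.start, b.accept };
}
Frag alt(Frag a, Frag b)                         // a + b
{
    Frag f{ newstate(), newstate() };
    edges.push_back({ f.start, a.start, 'e' });
    edges.push_back({ f.start, b.start, 'e' });
    edges.push_back({ a.accept, f.accept, 'e' });
    edges.push_back({ b.accept, f.accept, 'e' });
    return f;
}

int main()
{
    Frag r = alt(symbol('0'), cat(star(symbol('0')), symbol('1')));   // 0 + 0*1
    std::cout << "start q" << r.start << ", accept q" << r.accept << "\n";
    for (const Edge &e : edges)
        std::cout << "q" << e.from << " --" << e.label << "--> q" << e.to << "\n";
}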
Transition Diagrams

• Transition diagrams are a special form of


finite automaton, incorporating features that
belong in a compiler’s scanner:
– Actions associated with final states.
– Backup from a state, allowing for a lookahead
character being returned to the input stream.
– Transitions can be labeled as belonging to
“other”, indicating any class of character not
explicitly accounted for.

Transition Diagrams (continued)

In drawing transition diagrams, it is helpful to use


an alternate approach to describing regular
expressions:
a|b denotes a or b.
ab denotes a followed by b
(a|b)* denotes a or b, repeated zero or more times
(a|b)c denotes a or b followed by c
Transition Diagrams (continued)
The different lexical categories or classes can be
described in this fashion:
letter : (a | b | c | d | e .... A | B | C | D | E ..| X | Y | Z)
digit: ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )
other: ( ! | @ | # | $ | % | ^ | & | * | ( | ) | _ | + | = | - |
`|~|{|}| \ |”|’|:|;)
identifier : letter (letter | digit)*
integer : digit digit*
real: (digit digit* . digit digit*) |
(digit digit* . digit digit* (E | e) (+|- | ) digit digit*)
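
These categories can also be written directly as C++ <regex> patterns. A sketch, where [A-Za-z] plays the role of letter and [0-9] the role of digit:

#include <iostream>
#include <regex>
#include <string>

const char *classify(const std::string &lexeme)
{
    static const std::regex identifier("[A-Za-z][A-Za-z0-9]*");
    static const std::regex integer("[0-9]+");
    // digit digit* . digit digit*, with an optional (E|e)(+|-|) digit digit* exponent
    static const std::regex real("[0-9]+\\.[0-9]+([Ee][+-]?[0-9]+)?");

    if (std::regex_match(lexeme, identifier)) return "identifier";
    if (std::regex_match(lexeme, integer))    return "integer";
    if (std::regex_match(lexeme, real))       return "real";
    return "other";
}

int main()
{
    for (const std::string &s : { "amount", "42", "3.14", "2.5e10", "+" })
        std::cout << s << " : " << classify(s) << "\n";
}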

Transition Diagrams (continued)

The transition diagram for our language shown before becomes:

[Diagram: from state 0, a letter leads to state 1, which loops on letters and digits; any other character leads to state 2, a final state marked * (the lookahead character is backed up), returning {Identifier}. From state 0, a digit leads to state 3, which loops on digits; any other character leads to state 4, a final state marked *, returning {Number}.]
Practical Issues in Lexical Analysis

There are several important practical issues


that arise in the design of a scanner:
• Lookahead
• Case sensitivity
• Skipping of lead blanks and comments
• Use of first characters

Lookahead characters

Since you cannot determine if you have read beyond the end of a lexeme until you have done so, you must be prepared to handle the “lookahead” character. There are two approaches available:
• Start with a lookahead character and fetch a new one every time the lookahead character is “consumed” by the lexeme.
• Use two functions to manipulate the input stream, one to “get” the next character and one to “unget” the next character, returning it temporarily to the input stream.
Lookahead characters (continued)

// gettc() - Fetches a character from a


// file. It uses get and adjusts
// the line number count when
// necessary.
char scanner::gettc(void)
{
char c;

// If we're at the end of file, return a null


// byte, which serves to mark the end of file.
if (infile.eof())
c = '\0';

Lookahead characters (continued)

// If the next character is a newline,


// increment the line count
else if ((c = infile.get()) == '\n')
linenum++;
// Return the character converted to lower case
return(tolower(c));
}
Lookahead characters (continued)
// ungettc() - Returns a character to the
// file. Uses ungetc and will
// adjust line number count.
void scanner::ungettc(char c)
{
// If it's a newline, decrement the line
// count; we haven't gone to the next line
// yet.
if (c == '\n')
--linenum;
// Put it back into the input stream.
infile.putback(c);
}

Case sensitivity
• Although “a” and “A” are regarded as the
same character in the English language,
they are represented by different ASCII
codes. For a compiler to be case insensitive,
we need to consider these both as the same
letter.
• The easiest way to do this is to convert all
letters to the same case.
• Not all languages are case-insensitive; C, for example, is case-sensitive.
Skipping lead blanks and comments

• Before reading the first significant character


in a lexeme, it is necessary to skip past both
lead blanks as well as comments.
• One must assume that the scanner can
encounter either or both repeatedly and
interchangeably before reading the first
significant character.

Skipping lead blanks and comments (continued)

// firstchar() - Skips past both white space


// and comments until it finds
// the first non-white space
// character outside a comment.
char scanner::firstchar(void)
{
char c;
bool goodchar = false;

// If we're at the end of the file,


// return the EOF marker so that we'll
// return the EOF token
if (infile.eof())
return(EndOfFile);
Skipping lead blanks and comments (continued)

// We're looking for a non-white space


// character that is outside a comment.
// Keep scanning until we find one or
// reach the end of the file.
while (!goodchar) {
// Skip the white space in the
// program
while (!infile.eof()
&& isspace(c = gettc()))
;

// Is it a comment or a real
// first character?
if (c != '{')
goodchar = true;

Skipping lead blanks and comments (continued)

else
// Skip the comment
while (!infile.eof()
&& (c = gettc()) != '}')
;
}

// If we're at the end of file, return


// the EOF marker. Otherwise, return
// the character.
if (infile.eof())
return(EndOfFile);
else
return(c);
}
Use of first character

• In most programming languages, the first


character of a lexeme indicates the nature of
the lexeme and token associated with it.
• In most instances, identifiers and reserved
words begin with a letter (followed by zero
or more letters and digits), numbers begin
with a digit and operators begin with other
characters.

Use of first character (continued)

// gettoken() - Scan out the token strings of


// the language and return the
// corresponding token class to the
// parser.
tokentype scanner::gettoken(int &tabindex)
{
char c;

// If this is the end of the file, send the


// token that indicates this

if ((c = lookahead) == EndOfFile)


return(tokeof);
Use of first character (continued)

// If it begins with a letter, it is a word.


// If it begins with a digit, it is a number.
// Otherwise, it is an error.
lookahead = gettc();
if (isalpha(c))
return(scanword(c, tabindex));
else if (isdigit(c))
return(scannum(c, tabindex));
else
return(scanop(c, tabindex));
}

Scanning for reserved words and identifiers

• Once the scanner determines that the first


character is a letter, it continues to read
characters and concatenate them to the
lexeme until it encounters a character other
than a letter or digit.
• If the resultant lexeme is not in the symbol
table, it must be a new identifier.
Scanning for reserved words and identifiers (continued)

// scanword() - Scan until you encounter


// something other than a letter.
tokentype scanner::scanword(char c,
int &tabindex)
{
char lexeme[LexemeLen];
int i = 0;

// Build the string one character at a time.


// It keeps scanning until either the end of
// file or until it encounters a non-letter
lexeme[i++] = c;
while ((c = lookahead) != EndOfFile
&& (isalpha(c) || isdigit(c))) {
lexeme[i++] = c;
lookahead = gettc();
}

// Add a null byte to terminate the


// string and get the lookahead that
// begins the next lexeme.
lexeme[i] = '\0';
ungettc(lookahead);
lookahead = firstchar();

// If the lexeme is already in the symbol


// table, return its tokenclass. If it
// isn't, it must be an identifier whose
// type we do not know yet.
if (st.installname(lexeme, tabindex))
return(st.gettok_class(tabindex));
else {
st.setattrib(tabindex, stunknown,
tokidentifier);
return(tokidentifier);
}
}
Scanning for numeric literals
• After determining that the lexeme begins with a
digit, the scanner reads characters, concatenating
them to the lexeme until it encounters a non-digit.
• If it is a period, it will concatenate this to the
lexeme and resume reading characters until it
encounters another non-digit.
• If it is an “E”, it must then read the exponent.
• The token associated with the lexeme is either
number or the number’s type.

Scanning for numeric literals (continued)

// scannum() - Scan for a number.


tokentype scanner::scannum(char c,int &tabindex)
{
int ival, i = 0;
bool isitreal = false;
float rval;
char lexeme[LexemeLen];

// Scan until you encounter something that


// cannot be part of a number or the end of
// file
lexeme[i++] = c;
while ((c = lookahead) != EndOfFile
&& isdigit(c)) {
lexeme[i++] = c;
lookahead = gettc();
}
Scanning for numeric literals (continued)

// Is there a fractional part?


if (c == '.') {
isitreal = true;
lexeme[i++] = c;
while ((c = lookahead) != EndOfFile
&& isdigit(c)) {
lexeme[i++] = c;
lookahead = gettc();
}
}

// Add a null byte to terminate the


// string and get the lookahead that
// begins the next lexeme.
ungettc(lookahead);
lexeme[i] = '\0';
lookahead = firstchar();

Scanning for numeric literals (continued)

// If there is no fractional part, it is an


// integer literal constant. Otherwise, it
// is a real literal constant. Firstly, is
// it already in the symbol table?
if (st.installname(lexeme, tabindex))
return(st.gettok_class(tabindex));
// If not, is it real?
else if (isitreal) {
st.setattrib(tabindex, stunknown,
tokconstant);
st.installdatatype(tabindex,
stliteral, dtreal);
rval = atof(lexeme);
st.setvalue(tabindex, rval);
return(st.gettok_class(tabindex));
}
Scanning for numeric literals (continued)

// Must be an integer literal


else {
st.setattrib(tabindex, stunknown,
tokconstant);
st.installdatatype(tabindex,
stliteral, dtinteger);
ival = atoi(lexeme);
st.setvalue(tabindex, ival);
//ungettc(lookahead);
return(st.gettok_class(tabindex));
}
ungettc(lookahead);
return(st.gettok_class(tabindex));
}

Scanning for operators and character literals


• If the first character is neither a letter nor a digit, the
lexeme must be one of the following:
– an operator
– a character literal
– a string literal
• In scanning an operator:
– we should be cognizant of how many characters it may
contain.
– we may wish to hand-code the token that will be
returned by the symbol table.
• In scanning a literal, we read characters until encountering
the appropriate closing quotation mark.
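
The scanop() routine that gettoken() calls is not shown in these slides. The following is only a hypothetical sketch, written in the style of scanword() and scannum() and assuming the same scanner and symbol-table interface; the actual routine would handle more operators and report errors.

// scanop() - Hypothetical sketch: scan an operator or punctuation
//            lexeme, checking for a two-character operator.
tokentype scanner::scanop(char c, int &tabindex)
{
    char lexeme[LexemeLen];
    int i = 0;

    // Build the operator lexeme; check whether the lookahead character
    // extends it to a two-character operator such as := or <=.
    lexeme[i++] = c;
    if ((c == ':' || c == '<' || c == '>') && lookahead == '=') {
        lexeme[i++] = lookahead;
        lookahead = gettc();
    }
    lexeme[i] = '\0';
    ungettc(lookahead);
    lookahead = firstchar();

    // Operators are assumed to be pre-installed in the symbol table, so the
    // lookup normally succeeds; a fuller version would report an error here.
    st.installname(lexeme, tabindex);
    return(st.gettok_class(tabindex));
}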
Special problems in lexical
analysis
There are a few other problems faced in
lexical analysis:
• Token overloading
• Backtracking
• Buffering
• When keywords are not reserved words

Token overloading
• On occasion, there are difficulties presented by a lexeme serving more than one role in a programming language, e.g., = as both the test of equality and the assignment operator.
• This can be handled by using different lexemes
– E.g., C uses == and =, Pascal uses = and :=, FORTRAN uses .EQ. and =.
• If several lexemes are grouped into one token, it may become necessary to separate one or more of the lexemes out to become a distinctly different token.
Backtracking
• In rare instances, it may become necessary to backtrack and re-scan the text of the program.
E.g., the DO statement in FORTRAN
DO 101 I = 1, 50
is initially read as
DO101I = 1
until the , is encountered.

Text buffering

• Reading file input is a time-consuming


process. This makes the buffering of input
text crucial to the efficiency of a compiler.
• In most instances, file input is buffered on
modern operating systems, rendering the
issue less important than a decade ago.
Text buffering (continued)
#define NUMBYTES 512
#define NUMBUFFERS 2
#define MAXSTACK 2
#define gettch()(top > 0 ? buffer[--top]: fetchchar())

int bytesread, fd, c, top, linenum = 1;


char buf[NUMBUFFERS][NUMBYTES],
buffer[MAXSTACK], inputstring[MAXLINE];

/*
* ungettch() - This function, together with the
* macro gettch(), allows the
* program to push and pop
* characters to and from the
* lookahead buffer.
*/
void ungettch(char c)
{
if (top >= MAXSTACK) {
printf("\nToo many characters \"ungotten\""
"\n");
exit(1);
}
buffer[top++] = c;
}
Text buffering (continued)
/*
* openfile() - This opens an input file as "read
* only" using the unbuffered I/O
* library for greater efficiency.
*/
void openfile(char infilename[])
{
if ((fd = open(infilename, O_RDONLY)) < 0) {
printf("Cannot open %s\n",infilename);
exit(1);
}
}

Text buffering (continued)

/*
* closefile() - This closes the file. It is a
* separate function to allow for
* easier modification.
*/
void closefile(void)
{
close(fd);
}
Text buffering (continued)
/*
* fetchchar() - This function uses two buffers
* of 512 bytes (one block in
* MS-DOS), and unbuffered I/O
* to fetch a single character at
* a time. When it reaches the
* end of the buffer, it gets
* another 512 bytes until end of
* file.
*/

Text buffering (continued)


int fetchchar(void)
{
static int nextchar = NUMBYTES,
thisbuf = NUMBUFFERS-1;
if (nextchar >= bytesread)
/* Buffer is full
Fill the next buffer */
if ((bytesread = read(fd,
buf[thisbuf
= (thisbuf == NUMBUFFERS-1)?0:thisbuf+1],
NUMBYTES-1)) <= 0)
/* Reached end of file. */
return(EOF);
Text buffering (continued)
else
/* Reset buffer pointer */
nextchar = 0;
return(buf[thisbuf][nextchar++]);
}

When keywords are not reserved words


• The keywords of a programming language are
usually reserved, i. e., they cannot be used by a
programmer as an identifier, a user-defined
variable, data type, etc.
• There are programming languages where this is not
the case, making programs difficult to understand
and making it difficult to return the proper token.
E. g.,
• IF THEN THEN THEN = ELSE;
ELSE ELSE = THEN;
Scanner generators

• Scanner generators automatically generate a scanner from the lexical specifications and software routines supplied by the user.
• Scanner generators take advantage of the
fact that a scanner is essentially an
implementation of a finite automaton and
can thus be created in an automated fashion.
• LEX is an example of such a software tool.
