2 Lexing
2 Lexing
Rupesh Nasre.
Machine-Independent
Machine-Independent
Lexical
LexicalAnalyzer
Analyzer Code
CodeOptimizer
Optimizer
Intermediate representation
Backend
Token stream
Frontend
Syntax
SyntaxAnalyzer
Analyzer Code
CodeGenerator
Generator
Machine-Dependent
Machine-Dependent
Semantic
SemanticAnalyzer
Analyzer Code
CodeOptimizer
Optimizer
Intermediate
Intermediate
Code Symbol
CodeGenerator
Generator Table
2
Intermediate representation
Lexing Summary
Character stream
●
Basic lex Lexical
Machine-Indep.
Machine-Indep.
LexicalAnalyzer
Analyzer Code
CodeOptimizer
Optimizer
●
Input Buffering Token stream
Intermediate
representation
●
KMP String Matching Syntax
SyntaxAnalyzer
Analyzer Code
CodeGenerator
Generator
3
Role
●
Read input characters
●
Group into words (lexemes)
●
Return sequence of tokens
●
Sometimes
– Eat-up whitespace
– Remove comments
– Maintain line number information
4
Token, Pattern, Lexeme
Token Pattern Sample lexeme
if Characters i, f if
comparison <= or >= or < or > or == or != <=, !=
identifier letter (letter + digit)* pi, score, D2
number Any numeric constant 3.14159, 0, 6.02e23
literal Anything but “, surrounded by “” “core dumped”
6
Classwork: Regex Recap
●
If L is a set of letters (A-Z, a-z) and D is a set
of digits (0-9),
– Find the size of the language LD.
– Find the size of the language L U D.
– Find the size of the language L4.
●
Write regex for real numbers
– Without eE, without +- in mantissa (1.89)
– Without eE, with +- in mantissa (-1.89)
– With eE, with -+ in exponent (-1.89E-4)
7
Classwork
●
Write regex for strings over alphabet {a, b} that
start and end with a.
●
Strings with third last letter as a.
●
Strings with exactly three bs.
●
Strings with even length.
●
Homework
– Exercises 3.3.6 from ALSU.
8
Example Lex
/*/*variables
variables*/*/
Patterns [a-z]
[a-z] {{
yylval
yylval==*yytext
*yytext--'a';
'a';
return
returnVARIABLE;
VARIABLE; Tokens
}}
/*/*integers
integers*/*/
[0-9]+
[0-9]+ {{
yylval
yylval==atoi(yytext);
atoi(yytext); Lexemes
return
returnINTEGER;
INTEGER;
}}
/*/*operators
operators*/*/
[-+()=/*\n]
[-+()=/*\n]{{return
return*yytext;
*yytext;}}
/*/*skip
skipwhitespace
whitespace*/*/
[[\t]
\t] ;;
/*/*anything
anythingelse
elseis
isan
anerror
error*/*/
.. yyerror("invalid
yyerror("invalidcharacter");
character");
9
a1.l a1.y
lex
lex yacc
yacc
gcc
gcc
12
Lexing and Context
●
Language design should ensure that lexing
can be done without context.
●
Your assignments and most languages need
context-insensitive lexing.
DO
DO55 I I==1.25
1.25 DO
DO55 I I==1,25
1,25
●
“DO 5 I” is an identifier in Fortran, as spaces are allowed in identifiers.
●
Thus, first is an assignment, while second is a loop.
●
Lexer doesn't know whether to consider the input “DO 5 I” as an identifier
or as a part of the loop, until parser informs it based on dot or comma.
●
Alternatively, lexer may employ a lookahead.
13
Lexical Errors
●
It is often difficult to report errors for a lexer.
– fi (a == f(x)) ...
– A lexer doesn't know the context of fi. Hence it
cannot “see” the structure of the sentence –
structure is known only to the parser.
– fi = 2; OR fi(a == f(x));
●
But some errors a lexer can catch.
– 23 = @a;
– if $x friendof anil ...
16
Input Buffering
●
“We cannot know we were executing a finite
loop until we come out of the loop.”
●
In C, without reading the next character we
cannot determine a binary minus symbol (a-b).
->, -=, --, -e, ...
Sometimes we may have to look several
characters in future, called lookahead.
In the fortran example (DO 5 I), the lookahead
could be upto dot or comma.
●
Reading character-by-character from disk is
inefficient. Hence buffering is required. 17
Input Buffering
●
A block of characters is read from disk into a buffer.
●
Lexer maintains two pointers:
– lexemeBegin
– forward E = M * C * * 2 \f
forward
lexemeBegin
What
Whatis
isthe
theproblem
problemwith
withsuch
suchaascheme?
scheme?
18
Input Buffering
●
The issue arises when the lookahead is
beyond the buffer.
●
When you load the buffer, the previous content
is overwritten!
Input read Input to be
read
E = M * C * * 2 \f
forward
lexemeBegin
How
Howdo
dowe
wesolve
solvethis
thisproblem?
problem? 19
Double Buffering
●
Uses two (half) buffers.
●
Assumes that the lookahead would not be
more than one buffer size.
Buf1 Buf2
E = M * C * * 2 \f
forward
lexemeBegin
20
Transition Diagrams
●
Step to be taken on each character can be
specified as a state transition diagram.
– Sometimes, action may be associated with a state.
< =
0 1 2 return(comp, LE);
other yyless(1); return(comp, LT);
= 3
= return(comp, EQ);
4 5
>
other yyless(1); return(assign, ASSIGN);
6
= 8 return(comp, GE);
7
other 9 yyless(1); return(comp, GT);
21
...
Keywords vs. Identifiers
●
Keywords may match identifier pattern
– Keywords: int, const, break, ...
– Identifiers: (alpha | _) (alpha | num | _)*
●
If unaddressed, may lead to strange errors.
– Install keywords a priori in the symbol table.
– Prioritize keywords
●
In lex, the rule for a keyword must precede
that of the identifier.
22
25
Lookahead
●
/ signifies the lookahead symbol. The input is
read and matched, but is left unconsumed in
the current rule.
m
s
28
Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa
i=0
abababaababbbabbababb
ababaa
29
Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa
i=0
abababaababbbabbababb
ababaa
30
Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa
i=1
abababaababbbabbababb
ababaa
31
Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa
i=2
abababaababbbabbababb
ababaa Match found
32
Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa
35
Source: wikipedia
KMP String Matching
●
First linear time algorithm for string matching.
●
Whenver there is a mismatch, do not restart;
rather fail intelligently.
●
We define a failure function for each position,
taking into account the suffix and the prefix.
●
Note that the matched part of the large string T is
essentially the pattern string s. Thus, failure
function can be computed simply using pattern s.
abababaababbbabbababb
ababaa
36
Failure is not final.
i 1 2 3 4 5 6
f(i) 0 0 1 2 3 1
seen a ab aba abab ababa ababaa
prefix ϵ ϵ a ab aba a
37
String matching with failure function
Text = a1a2...am; pattern = b1b2...bn (both indexed from 1)
s=0
for (i = 1; i <= m; ++i) { Go over Text
if (s > 0 && ai != bs+1) s = f(s) Handle failure
if (ai == bs+1) ++s Character match
38
String matching with failure function
Text = a1a2...am; pattern = b1b2...bn (both indexed from 1)
s=0
for (i = 1; i <= m; ++i) { Go over Text
while (s > 0 && ai != bs+1) s = f(s) Handle failure
if (ai == bs+1) ++s Character match
i 1 2 3 4 5 6
39
f(i) 0 0 1 2 3 1
Classwork
●
Find failure function for pattern ababba.
●
Test it on string abababbaa.
●
Fibonacci strings are defined as
– s1 = b, s2 = a, sk = sk-1sk-2 for k > 2
– e.g., s3 = ab, s4 = aba, s5 = abaab
● Find the failure function for s6.
40
Fibonacci Strings
– s1 = b, s2 = a, sk = sk-1sk-2 for k > 2
– e.g., s3 = ab, s4 = aba, s5 = abaab
●
Do not contain bb or aaa.
●
The words end in ba and ab alternatively.
●
Suppressing last two letters creates a palindrome.
●
...
Source: Wikipedia 41
KMP Generalization
●
KMP can be used for keyword matching.
●
Aho and Corasick generalized KMP to
recognize any of a set of keywords in a text.
h e r s
0 1 2 8 9
i
s s
6 7
h e
3 4 5
i 1 2 3 4 5 6 7 8 9
f(i) 0 0 0 1 2 0 3 0 3 42
KMP Generalization
●
When in state i, the failure function f(i) notes
the state corresponding to the longest proper
suffix that is also a prefix of some keyword.
h e r s
0 1 2 8 9
i
s s
6 7
h e
3 4 5
44
Regex NFA DFA
Draw an NFA for *cpp
Ʃ
c p p
0 1 2 3
p p
c
c p p
0 1 2 3
c c
46
Regex NFA DFA
●
For the sake of convenience, let's convert *cpp
into *abb and restrict to alphabet {a, b}.
●
Thus, the regex is (a|b)*abb.
●
How do we create an NFA for (a|b)*abb?
ϵ
a
ϵ 2 3 ϵ
0 ϵ 1 6 ϵ 7
a
8
b
9 b 10
ϵ 4 b ϵ
5
47
Regex NFA DFA
NFA state DFA state a b
{0, 1, 2, 4, 7} A B C State
{1, 2, 3, 4, 6, 7, 8} B B D Transition
Table
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D B E
{1, 2, 4, 5, 6, 7, 10} E B C
ϵ
a
ϵ 2 3 ϵ
0 ϵ 1 6 ϵ 7
a
8
b
9 b 10
ϵ 4 b ϵ
5
48
Regex NFA DFA
NFA state DFA state a b
{0, 1, 2, 4, 7} A B C State
{1, 2, 3, 4, 6, 7, 8} B B D Transition
Table
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D B E
{1, 2, 4, 5, 6, 7, 10} E B C
b
C
b a b
a b
A B D b E DFA
a a
a
49
Regex NFA DFA
Ʃ
a b b NFA
0 1 2 3
b b
a
a b b DFA
0 1 2 3
a a
b
C
b a b
a b
A B D b E DFA
a a non-minimal
a
50
Regex NFA DFA
(a|b)*abb Regex
ϵ
a
ϵ 2 3 ϵ
0 ϵ 1 6 ϵ 7 a 8 b 9 b 10 NFA
ϵ b ϵ
4 5
ϵ
b
C
b a b
a b
A B D b E DFA
a a non-minimal
a
51
Regex DFA
1. Construct a syntax tree for regex#.
2. Compute nullable, firstpos, lastpos, followpos.
3. Construct DFA using transition function.
4. Mark firstpos(root) as start state.
5. Mark states that contain position of # as
accepting states.
52
Regex DFA
●
Regex is (a|b)*abb#.
●
Construct a syntax tree for the regex.
.
. #
. 6
b
. b 5
* 4
a
3
| ●
Leaves correspond to operands.
●
Interior nodes correspond to operators.
●
Operands constitute strings.
1 a b 2
53
Functions from Syntax Tree
●
For a syntax tree node n
– nullable(n): true if n represents ϵ.
– firstpos(n): set of positions that correspond to the
first symbol of strings in n's subtree.
– lastpos(n): set of positions that correspond to the
last symbol of strings in n's subtree.
– followpos(n): set of next possible positions from n
for valid strings.
ϵ
a
ϵ 2 3 ϵ
0 ϵ 1 6 ϵ 7 a 8
b
9
b
10
ϵ 4 b 5 ϵ
54
ϵ
nullable
●
nullable(n): true if n represents ϵ.
●
Regex is (a|b)*abb#.
F .
F . #
F . F
b
F . b F
T * F
a
F
F |
F a b F
55
nullable
●
nullable(n): true if n represents ϵ.
Node n nullable(n)
leaf labeled ϵ true
leaf with position i false
or-node n = c1 | c2 nullable(c1) or nullable(c2)
cat-node n = c1c2 nullable(c1) and nullable(c2)
star-node n = c* true
57
firstpos
●
firstpos(n): set of positions that correspond
to the first symbol of strings in n's subtree.
Node n firstpos(n)
leaf labeled ϵ {}
leaf with position i {i}
or-node n = c1 | c2 firstpos(c1) U firstpos(c2)
cat-node n = c1c2 if (nullable(c1)) firstpos(c1) U firstpos(c2)
else firstpos(c1)
star-node n = c* firstpos(c)
58
lastpos
●
lastpos(n): set of positions that correspond
to the last symbol of strings in n's subtree.
Node n lastpos(n)
leaf labeled ϵ {}
leaf with position i {i}
or-node n = c1 | c2 lastpos(c1) U lastpos(c2)
cat-node n = c1c2 if (nullable(c2)) lastpos(c1) U lastpos(c2)
else lastpos(c2)
star-node n = c* lastpos(c)
59
firstpos lastpos
{1,2,3} {6} .
{1,2,3} {5} .
#
{1,2,3} {4} . 6
b {6} {6}
{1,2,3} {3} . b 5
{5} {5}
{1,2} {1,2} * 4
a {4} {4}
3
{3} {3}
{1,2} {1,2} |
1 a b 2
{1} {1} {2} {2}
60
followpos
●
followpos(n): set of next possible positions
from n for valid strings.
– If n is a cat-node with child nodes c1 and c2, then
for each position in lastpos(c1), all positions in
firstpos(c2) follow.
– If n is a star-node, then for each position in
lastpos(n), all positions in firstpos(n) follow.
61
followpos
If n is a cat-node with child nodes c1 and c2, then for each position in
lastpos(c1), all positions in firstpos(c2) follow.
{1,2,3} {6} .
{1,2,3} {5} .
#
{1,2,3} {4} . 6
b {6} {6}
{1,2,3} {3} . b 5
{5} {5}
{1,2} {1,2} * 4
a {4} {4}
3 n followpos(n)
{3} {3}
{1,2} {1,2} | 1 {3}
2 {3}
1 a b 2
{1} {1} {2} {2}
62
followpos
If n is a cat-node with child nodes c1 and c2, then for each position in
lastpos(c1), all positions in firstpos(c2) follow.
{1,2,3} {6} .
{1,2,3} {5} .
#
{1,2,3} {4} . 6
b {6} {6}
{1,2,3} {3} . b 5
{5} {5}
{1,2} {1,2} * 4
a {4} {4}
3 n followpos(n)
{3} {3}
{1,2} {1,2} | 1 {3}
2 {3}
3 {4}
1 a b 2 4 {5}
{1} {1} {2} {2} 5 {6}
6 {} 63
followpos
If n is a star-node, then for each position in lastpos(n), all positions in
firstpos(n) follow.
{1,2,3} {6} .
{1,2,3} {5} .
#
{1,2,3} {4} . 6
b {6} {6}
{1,2,3} {3} . b 5
{5} {5}
{1,2} {1,2} * 4
a {4} {4}
3 n followpos(n)
{3} {3}
{1,2} {1,2} | 1 {3}
2 {3}
3 {4}
1 a b 2 4 {5}
{1} {1} {2} {2} 5 {6}
6 {} 64
followpos
If n is a star-node, then for each position in lastpos(n), all positions in
firstpos(n) follow.
{1,2,3} {6} .
{1,2,3} {5} .
#
{1,2,3} {4} . 6
b {6} {6}
{1,2,3} {3} . b 5
{5} {5}
{1,2} {1,2} * 4
a {4} {4}
3 n followpos(n)
{3} {3}
{1,2} {1,2} | 1 {3, 1, 2}
2 {3, 1, 2}
3 {4}
1 a b 2 4 {5}
{1} {1} {2} {2} 5 {6}
6 {} 65
Regex DFA
1.Construct a syntax tree for regex#.
2.Compute nullable, firstpos, lastpos, followpos.
3.Construct DFA using transition function (next slide).
4.Mark firstpos(root) as start state.
5.Mark states that contain position of # as
accepting states.
66
DFA Transitions
create unmarked state firstpos(root). {1,2,3} {6} .
while there exists unmarked state s {
mark s a b a
1 2 3
for each input symbol x {
uf = U followpos(p) where p is in s labeled x
transition[s, x] = uf n followpos(n)
1 {3, 1, 2}
if uf is newly created
2 {3, 1, 2}
unmark uf b 3 {4}
a
} 123 1234 4 {5}
5 {6} 67
} 6 {}
Final DFA
b
b a
a b b DFA
123 1234 1235 1236
a
a
Ʃ
a b b NFA
0 1 2 3
b b
a
a b b DFA
0 1 2 3
a a
68
Regex DFA
1.Construct a syntax tree for regex#.
2.Compute nullable, firstpos, lastpos, followpos.
3.Construct DFA using transition function.
4.Mark firstpos(root) as start state.
5.Mark states that contain position of # as
accepting states.
70
Lexing Summary
Character stream
●
Basic lex Lexical
Machine-Indep.
Machine-Indep.
LexicalAnalyzer
Analyzer Code
CodeOptimizer
Optimizer
●
Input Buffering Token stream Intermediate
representation
●
KMP String Matching Syntax
SyntaxAnalyzer
Analyzer Code
CodeGenerator
Generator
71