Unit Ii
Unit Ii
Regular Expression – FA and Regular Expressions – Proving languages not to be regular – Closure
properties of regular languages – Equivalence and minimization of Automata.
We construct REs from primitive constituents (basic elements) by repeatedly applying certain recursive rules
as given below. (In the definition)
Definition : Let S be an alphabet. The regular expressions are defined recursively as follows.
Basis :
i) is a RE
ii) is a RE
iii) , a is RE.
Recursive Step :
i)
ii)
iii)
iv)
Closure : r is RE over only if it can be obtained from the basis elements (Primitive REs) by a finite no of
applications of the recursive step (given in 2).
Example : Let = { 0,1,2 }. Then (0+21)*(1+ F ) is a RE, because we can construct this expression by
applying the above rules as given in the following step.
Steps RE Constructed Rule Used
1 1 Rule 1(iii)
2 Rule 1(i)
3 1+ Rule 2(i) & Results of Step 1, 2
37
CS2303 THEORY OF COMPUTATION
Notation : If r is a RE over some alphabet then L(r) is the language associate with r . We can define the
language L(r) associated with (or described by) a REs as follows.
= L(0)*L(0) ∪ L(1)
={ , 0,00,000,........} {0,1}
Precedence Rule
38
CS2303 THEORY OF COMPUTATION
Consider the RE ab + c. The language described by the RE can be thought of either L(a)L(b+c) or
L(ab) L(c) as provided by the rules (of languages described by REs) given already. But these two
represents two different languages lending to ambiguity. To remove this ambiguity we can either
2) Use a set of precedence rules to evaluate the options of REs in some order. Like other algebras mod in
mathematics.
i) The star operator precedes concatenation and concatenation precedes union (+) operator.
ii) It is also important to note that concatenation & union (+) operators are associative and union operation
is commutative.
Using these precedence rule, we find that the RE ab+c represents the language L(ab) L(c) i.e. it should be
grouped as ((ab)+c).
We can, of course change the order of precedence by using parentheses. For example, the language
represented by the RE is L(a)L(b+c).
a(b+c)
Example : The RE ab*+b is grouped as ((a(b*))+b) which describes the language L(a)(L(b))*
L(b)
Example : It is easy to see that the RE (0+1)*(0+11) represents the language of all strings over {0,1} which
are either ended with 0 or 11.
Example : The regular expression r =(00)*(11)*1 denotes the set of all strings with an even number of 0's
Note : The notation is used to represent the RE rr*. Similarly, represents the RE rr, denotes r,
and so on.
Exercise : Give a RE r over {0,1} s.t. L(r)={ has at least one pair of consecutive 1's}
Solution : Every string in L(r) must contain 00 somewhere, but what comes before and what goes before
is completely arbitrary. Considering these observations we can write the REs as (0+1)*11(0+1)*.
Example : Considering the above example it becomes clean that the RE (0+1)*11(0+1)*+(0+1)*00(0+1)*
represents the set of string over {0,1} that contains the substring 11 or 00.
39
CS2303 THEORY OF COMPUTATION
Example : Consider the RE 0*10*10*. It is not difficult to see that this RE describes the set of strings over {0,1}
that contains exactly two 1's. The presence of two 1's in the RE and any no of 0's before, between and after
the 1's ensure it.
Example : Consider the language of strings over {0,1} containing two or more 1's.
Solution : There must be at least two 1's in the RE somewhere and what comes before, between, and after is
completely arbitrary. Hence we can write the RE as (0+1)*1(0+1)*1(0+1)*. But following two REs also
represent the same language, each ensuring presence of least two 1's somewhere in the string
i) 0*10*1(0+1)*
ii) (0+1)*10*10*
Solution : Though it looks similar to ex ……., it is harder to construct to construct. We observer that, whenever
a 1 occurs, it must be immediately followed by a 0. This substring may be preceded & followed by any no of
0's. So the final RE must be a repetition of strings of the form: 00…0100….00 i.e. 0*100*. So it looks like the
RE is (0*100*)*. But in this case the strings ending in 1 or consisting of all 0's are not accounted for. Taking
these observations into consideration, the final RE is r = (0*100*)(1+ )+0*(1+ ).
Alternative Solution :
The language can be viewed as repetitions of the strings 0 and 01. Hence get the RE as r = (0+10)*(1+ ).This
is a shorter expression but represents the same language.
Regular Expression:
FA to regular expressions:
FA to RE (REs for Regular Languages) :
Lemma : If a language is regular, then there is a RE to describe it. i.e. if L = L(M) for some DFA M, then there is a
RE r such that L = L(r).
Notations : is a RE denoting the language which is the set of all strings w such that w is the label of a
path from state i to state j in M, and that path has no intermediate state whose number is
greater then k. ( i & j (begining and end pts) are not considered to be "intermediate" so i and /or j can be
40
CS2303 THEORY OF COMPUTATION
greater than k )
Basis : k = 0, i.e. the paths must not have any intermediate state ( since all states are numbered 1 or
above). There are only two possible paths meeting the above condition :
Induction :
Assume that there exists a path from state i to state j such that there is no intermediate state whose number is
1. The path dose not go through the state k at all i.e. number of all the intermediate states are less than
k. So, the label of the path from state i to state j is tha language described by the RE .
2. The path goes through the state k at least once. The path may go from i to j and k may appear
more than once. We can break the into pieces as shown in the figure 7.
41
CS2303 THEORY OF COMPUTATION
Figure 7
1. The first part from the state i to the state k which is the first recurence. In this path, all intermediate
states are less than k and it starts at iand ends at k. So the RE denotes the language of the
label of path.
2. The last part from the last occurence of the state k in the path to state j. In this path also, no
intermediate state is numbered greater than k. Hence the RE denoting the language of the
label of the path.
3. In the middle, for the first occurence of k to the last occurence of k , represents a loop which may be
taken zero times, once or any no of times. And all states between two consecutive k's are numbered
less than k.
Hence the label of the path of the part is denoted by the RE .The label of the path from state i to state
j is the concatenation of these 3 parts which is
Since either case 1 or case 2 may happen the labels of all paths from state i to j is denoted by the following RE
We can construct for all i, j {1,2,..., n} in increasing order of k starting with the basis k = 0 upto k = n since
depends only on expressions with a small superscript (and hence will be available). WLOG, assume
that state 1 is the start state and are the m final states where ji {1, 2, ... , n }, and
. According to the convention used, the language of the automatacan be denoted by the RE
42
CS2303 THEORY OF COMPUTATION
Since is the set of all strings that starts at start state 1 and finishes at final state following the transition of
the FA with any value of the intermediate state (1, 2, ... , n) and hence accepted by the automata.
Regular Grammar:
A grammar is right-linear if each production has one of the following three forms:
• A cB ,
• A c,
• A
Where A, B ( with A = B allowed) and . A grammar G is left-linear if each production has once of
the following three forms.
A Bc , A c, A
Regular grammar and Finite Automata are equivalent as stated in the following theorem.
Theorem : A language L is regular iff it has a regular grammar. We use the following two lemmas to prove the
above theorem.
Let and .
N=Q, and
[ Note: If , then ]
43
CS2303 THEORY OF COMPUTATION
and
By construction, the grammar G will have one production for each of the above transitions. Therefore, we have
the corresponding derivation.
Hence w L(g).
Conversely, if , then the derivation of w in G must have the form as given above.
But, then the construction of G from M implies that
is constructed as follows:
if
and , if .
44
CS2303 THEORY OF COMPUTATION
Conversely, if is accepted by M, then because is the only accepting state of M, the transitions
causing w to be accepted by M will be of the form given above. These transitions corresponds to a
Given any left-linear grammar G with production of the form , we can construct from it a right-
Putting the two lemmas and the discussions in the above paragraph together we get the proof of the theorem-
It is easy to see that G generates the language denoted by the regular expression
(01)*0. The construction of lemma 2 for this grammar produces the follwoing FA.
This FA accepts exactly (01)*1.
In this section, we examine some questions about CFLs we can answer. A CFL may be represented using a
CFG or PDA. But an algorithm that uses one representation can be made to work for the others, since we can
construct one from the other.
45
CS2303 THEORY OF COMPUTATION
Testing Emptiness :
Proof : Given any CFL L, there is a CFG G to generate it. We can determine, using the construction described
in the context of elimination of useless symbols, whether the start symbol is useless. If so, then ;
otherwise not.
Testing Membership :
Given a PDA P for L, simulating the PDA on input string x doesnot quite work, because the PDA can grow its
stack indefinitely on input, and the process may never terminate, even if the PDA is deterministic.
determine whether and it can easily be done using the technique given in the context of elimination of
Hence, if the length of the input string x is n, then it takes exactly n steps to derive x ( provided x is in ).
Let the maximum number of productions for any nonterminal in is K. So at every step in derivation, there are
atmost k choices. We may try out all these choices, systematically., to derive the string x in . Since
there are atmost i.e. choices. This algorithms is of exponential time complexity. We now present
an efficient (polynomial time) membership algorithm.
Pumping Lemma:
Limitations of Finite Automata and Non regular Languages :
The class of languages recognized by FA s is strictly the regular set. There are certain languages which are
non regular i.e. cannot be recognized by any FA
In order to accept is language, we find that, an automaton seems to need to remember when passing the
center point between a's and b's how many a's it has seen so far. Because it would have to compare that
with the number of b's to either accept (when the two numbers are same) or reject (when they are not same)
the input string.
46
CS2303 THEORY OF COMPUTATION
But the number of a's is not limited and may be much larger than the number of states since the string may
be arbitrarily long. So, the amount of information the automaton need to remember is unbounded.
A finite automaton cannot remember this with only finite memory (i.e. finite number of states). The fact that
FA s have finite memory imposes some limitations on the structure of the languages recognized. Inductively, we
can say that a language is regular only if in processing any string in this language, the information that has to
be remembered at any point is strictly limited. The argument given above to show that is non regular is
informal. We now present a formal method for showing that certain languages such as are non regular
Properties of CFL’s
Let . Then . All productions applied in their derivation are also in . Hence i.e.
Similarly, if , then
Thus .
47
CS2303 THEORY OF COMPUTATION
Conversely, let . Then and the first step in this derivation must be either or
Since and are disjoint, the derivation must use the productions of only ( which are also in
So, , as claimed
Again, we assume that and are disjoint, and is a nonterminal not in or . we construct the CFG
We claim that
To prove it, we first assume that and . Then and . We can derive the string xy in
as shown below.
For the converse, let . Then the derivation of w in will be of the form
i.e. the first step in the derivation must see the rule . Again, since and are
disjoint and and , some string x will be generated from using productions in ( which are
Thus
Hence and .
This means that w can be divided into two parts x, y such that and . Thus .This
completes the proof
Theorem : If L is a CFL, then so is .
Proof : Let be the CFG generating L. Let us construct the CFG from G
where .
First (n-1)-steps uses the production S SS producing the sentential form of n numbers of S 's. The
nonterminal S in the i-th position then generates using production in P ( which are also in )
It is also easy to see that G can generate the empty string, any string in L and any string for n >1 and
none other.
Hence
Proof : We prove it by giving a counter example. Consider the language .The following
CFG generates L1 and hence a CFL
49
CS2303 THEORY OF COMPUTATION
The nonterminal X generates strings of the form and C generates strings of the form ,
. These are the only types of strings generated by X and C. Hence, S generates .
Using similar reasoning, it can be shown that the following grammar and hence it is
also a CFL.
Hence proof.
Proof : Assume, for contradiction, that CFL's are closed under complementation. SInce, CFL's are also closed
under union, the language , where and are CFL's must be CFL. But by DeMorgan's law
This contradicts the already proved fact that CFL's are not closed under intersection.
But it can be shown that the CFL's are closed under intersection with a regular set.
where is defined as
contains iff
50
CS2303 THEORY OF COMPUTATION
and contains
The idea is that M simulates the moves of P and D parallely on input w, and accepts w iff both P and
D accepts. That means, we want to show that
iff
and
Basic Case is n=0. Hence , and . For this case it is trivially true
Let
From the definition of and considering the n-th move of the PDA M above, we have
and
Hence and
If and , then and we got that if M accepts w, then both P and D accepts it.
We can show that converse, in a similar way. Hence is a CFL ( since it is accepted by a PDA M )
This property is useful in showing that certain languages are not context-free.
Example : Consider the language
51
CS2303 THEORY OF COMPUTATION
iff .
The proof is by induction on n, the number of steps taken by the derivation. We assume, for simplicity (and of
course without loss of generality), that G and hence are in CNF.
The basis is n=1 in which case it is trivial. Because must be either or BC with .
Hence iff
Assume that it is true for (n-1)-steps. Let . Then the first step must apply a rule of the
form and it gives
where and
By constructing of G',
Hence
, let be a language (over any alphabet). This defines a function S, called substitution, on which is
.
Similarly, for any language L,
The following theorem shows that CFLs are closed under substitution.
Thereom : Let is a CFL, and s is a substitution on such that is a CFL for all , thus
s(L) is a CFL
Proof : Let L = L(G) for a CFG and for every , for some
. Without loss of generality, assume that the sets of nonterminals N and 's are
disjoint.
52
CS2303 THEORY OF COMPUTATION
• consists of
1. and
2. The production of P but with each terminal a in the right hand side of a production replaced by
everywhere.
If Part : Let then according to the definition there is some string and
From the construction of , we find that, there is a derivation corresponding to the string
(since contains all productions of G but every ai replaced with in the RHS of any
production).
Every is the start symbol of and all productions of are also included in .
Hence
Therefore,
Each ( ) can only generate a string , since each 's and N are disjoin.
Therefore, we get
since
53
CS2303 THEORY OF COMPUTATION
since
Proof : Let be a CFL, and h is a homomorphism on i.e for some alphabets . consider the
following substitution S:Replace each symbol by the language consisting of the only string h(a), i.e.
for all . Then, it is clear that, h(L) = s(L). Hence, CFL's being closed under substitution
must also be closed under homomorphism.
54
CS2303 THEORY OF COMPUTATION