Open Problems in Automata Theory and Formal
Languages
Jeffrey Shallit
School of Computer Science
University of Waterloo
Waterloo, Ontario N2L 3G1
Canada
shallit@cs.uwaterloo.ca
http://www.cs.uwaterloo.ca/~shallit
1 / 50
An Advertisement
Just out from Cambridge University Press! Order your copy today!
2 / 50
Separating Words with Automata
Motivation: a classical problem from the early days of automata
theory:
Given two DFA’s M1 and M2 , with m and n states, respectively,
with L(M1 ) 6= L(M2 ), what is a good bound on the length of the
shortest string accepted by one but not the other?
I The cross-product construction gives an upper bound of
mn − 1
I But an upper bound of m + n − 2 follows from the usual
algorithm for minimizing automata
I Furthermore, this bound is best possible.
I For NFA’s the bound is exponential in m and n
3 / 50
Separating Words with Automata
We now look at the inverse problem: given two distinct words, how
big an automaton do we need to separate them?
That is, given two words w and x of length ≤ n, what is the
smallest number of states in any DFA that accepts one but not the
other?
Call this number sep(w , x).
4 / 50
A machine M separates the word w from the word x if M accepts
w and rejects x, or vice versa.
For example, the machine below separates 1000 from 0010.
0 1
0
1
0, 1
However, no 2-state DFA can separate these two strings. So
sep(1000, 0010) = 3.
5 / 50
Separating Words with Automata
Easy case: if the two words are of different lengths, then we can
separate them with a DFA of size O(log n).
For by the prime number theorem, if x 6= y , and x, y ≤ n then
there is a prime p = O(log n) such that x 6≡ y (mod p).
So we can accept one string and reject the other by using a cycle
mod p, and the appropriate residue class.
6 / 50
Separating Words with Automata
A similar idea works if the strings have a different number of 1’s,
or if the 1’s are in different positions, or if the number of
occurrences of a short subword is different, etc.
7 / 50
Separating Words With Automata
I Let
S(n) := max sep(w , x),
|w |=|x|=n
w 6=x
the smallest number of states required to separate any two
strings of length n.
I The separation problem was first studied by Goralcik and
Koubek 1986, who proved S(n) = o(n).
I In 1989 Robson who obtained the best known bound:
S(n) = O(n2/5 (log n)3/5 ).
8 / 50
Separating Words with Automata
For equal-length strings, S(n) doesn’t depend on alphabet size
(provided it is at least 2).
Suppose x, y are distinct strings of length n an alphabet Σ of size
> 2.
Then they must differ in some position, say
x = x 0 a x 00
y = y 0 b y 00
for a 6= b.
Map a to 0, b to 1 and assign all other letters arbitrarily to either 0
or 1. This gives two new distinct strings X and Y of the same
length. If X and Y can be separated by an m-state DFA, then so
can x and y , by renaming transitions to be over Σ instead of 0 and
1. 9 / 50
Separating Words With Automata
A weaker result:
Theorem (Robson, 1996). We can separate words by computing
the parity of the number of 1’s occurring in positions congruent to
√
i (mod j), for i, j = O( n).
This gives the bound S(n) = O(n 1/2 ).
Open Problem 1: Improve Robson’s bound of O(n 2/5 (log n)3/5 )
on S(n).
10 / 50
Separating Words With Automata: Lower Bound
I Claim: S(n) = Ω(log n).
I To see this, consider the two strings
0t−1+lcm(1,2,...,t) 1t−1 and 0t−1 1t−1+lcm(1,2,...,t) .
Proof in pictures:
0-tail
1-tail
0-cycle
1-cycle
11 / 50
So no t-state machine can distinguish these strings.
Since lcm(1, 2, . . . , t) = e t+o(t) by the prime number theorem, the
lower bound S(n) = Ω(log n) follows.
12 / 50
Separating Words With Automata
Some data:
n S(n) n S(n)
1 2 10 4
2 2 11 4
3 2 12 4
4 3 13 4
5 3 14 4
6 3 15 4
7 3 16 4
8 3 17 4
9 3 18 5
13 / 50
Variations on Separating Words
I Separation by context-free grammars; count number of
productions
I Problem: right-hand sides can be arbitrarily complicated
I Solution: Use CFG’s in Chomsky normal form (CNF), where
all productions are of the form A → BC or A → a.
14 / 50
Variations on Separating Words
I In 1999 Currie, Petersen, Robson and JOS proved:
I If |w | 6= |x| then there is a CFG in CNF with O(log log n)
productions separating w from x. Furthermore, this bound is
optimal.
I If |w | = |x| there is a CFG in CNF with O(log n) productions
separating w from x. There is a lower bound of Ω( logloglogn n ).
Open Problem 2: Find matching upper and lower bounds in the
case |w | = |x|.
15 / 50
More Variations on Separating Words
I Separation by NFA. Do NFA give more power?
Yes,
sep(0001, 0111) = 3
but
nsep(0001, 0111) = 2.
16 / 50
More Variations on Separating Words
Is
sep(x, w )/nsep(x, w )
unbounded?
Yes.
Consider once again the strings
w = 0t−1+lcm(1,2,...,t) 1t−1 and x = 0t−1 1t−1+lcm(1,2,...,t)
where t = n2 − 3n + 2, n ≥ 4.
17 / 50
We know from before that any DFA separating these strings must
have at least t + 1 = n2 + 3n + 3 states.
Now consider the following NFA M:
0 0 0 0
0
0 loop of n states
0
0 0 0 0
loop of n-1 states
0
The language accepted by this NFA is {0a : a ∈ A}1∗ , where A is
the set of all integers representable by a non-negative integer linear
combination of n and n − 1.
18 / 50
0 0 0 0
0
0 loop of n states
0
0 0 0 0
loop of n-1 states
0
But t − 1 = n2 − 3n + 1 6∈ A.
On the other hand, every integer ≥ t is in A. Hence
w = 0t−1+lcm(1,2,...,t) 1t−1 is accepted by M but
x = 0t−1 1t−1+lcm(1,2,...,t) is not.
√
M has 2n = Θ( t) states, √ so p
sep(x, w )/nsep(x, w ) ≥ t = Ω( log |x|), which is unbounded.
19 / 50
More Variations on Separating Words
Open Problem 3: Find good bounds on nsep(w , x) for
|w | = |x| = n, as a function of n.
Open Problem 4: Find good bounds on sep(w , x)/nsep(w , x).
20 / 50
More Variations on Separating Words
I Must sep(w R , x R ) = sep(w , x)?
No, for w = 1000, x = 0010, we have
sep(w , x) = 3
but
sep(w R , x R ) = 2.
Open Problem 5:
Is ¯ ¯
¯sep(x, w ) − sep(x R , w R )¯
¯ ¯
unbounded?
21 / 50
More Variations on Separating Words
I Two words are conjugates if one is a cyclic shift of the other.
I Is the separating words problem any easier if restricted to
pairs of conjugates?
22 / 50
Another Kind of Separation
Suppose you have regular languages R1 , R2 with R1 ⊆ R2 and
R2 − R1 infinite.
Then it is easy to see that there is a regular language R3 such that
R1 ⊆ R3 ⊆ R2 such that R2 − R3 and R3 − R1 are both infinite.
This is a kind of topological separation property.
23 / 50
Another Kind of Separation
In 1980, Bucher asked:
Open Problem 6: Is the same true for context-free languages?
That is, given context-free languages L1 , L2 with L1 ⊆ L2 and
L2 − L1 infinite, need there be a context-free language L3 such
that L1 ⊆ L3 ⊆ L2 such that L2 − L3 and L3 − L1 are both infinite?
24 / 50
The Thue-Morse Sequence
The Thue-Morse sequence
t = t0 t1 t2 · · · = 011010011001011010010110 · · ·
can be described in many ways
I by the recurrence t2n = tn and t2n+1 = 1 − tn
I as the fixed point of the map 0 → 01 and 1 → 10
I as the sequence generated by the following DFA, where one
inputs n expressed in base 2
0 0
1
0 1
1
25 / 50
Runs in The Thue-Morse Sequence
An interesting sequence related to Thue-Morse is the associated
sequence d of run lengths:
d= 1 2 1 1 2 2 2 1
z}|{ z}|{ z}|{ z}|{ z}|{ z}|{ z}|{ z}|{ z}|{
t = 0 11 0 1 00 11 00 1 · · ·
d = d(0)d(1)d(2) · · · = 12112221121 · · · .
It is not difficult to prove that d is the fixed point of the morphism
h that sends
1 → 121
2 → 12221.
26 / 50
Runs in The Thue-Morse Sequence
Subsequences of the sequence d have some really remarkable
properties. For example:
d(16n + 1) = d(8n + 1) for 0 ≤ n ≤ 56
d(32n) = d(8n) for 0 ≤ n ≤ 14562
d(32n + 21) = d(8n + 5) for 0 ≤ n ≤ 29126
d(64n + 1) = d(16n + 1) for 0 ≤ n ≤ 1864134,
and the most amazing of all,
d(64n) = d(16n) for 0 ≤ n ≤ 119304646.
27 / 50
Runs in The Thue-Morse Sequence
However, each of these “identities” fails at the next index.
The reason for this behavior is still somewhat puzzling, although it
is certainly related to the fact that the incidence matrix of h has
eigenvalues 2 and 1.
28 / 50
Runs in The Thue-Morse Sequence
I can prove
4a
Theorem. For all integers a > b > 0, there exists an n = O(44 )
with d(4a n) 6= d(4b n).
But this is almost certainly not best possible.
29 / 50
The proof idea is to use representations in terms of the sequence
1
1, 1, 3, 5, 11, 21, 43, . . . , (2n − (−1)n ).
3
Open Problem 7: Determine a good lower bound on
min{n : d(4a n) 6= d(4b n)}.
This suggests all sorts of similar questions. More generally, given a
morphic sequence (generated as a fixed point of a morphism), for
how many terms can distinct linear subsequences agree, as a
function of the size of the morphism?
30 / 50
Avoidability
I One of the beautiful properties of the Thue-Morse word is
that it avoids overlaps
I An overlap is a subword (factor) of the form axaxa, where a
is a single letter, and x is a word
I We’d like to generalize this
I One possible generalization involves Hankel determinants
31 / 50
Avoidability
I An n × n Hankel determinant of a sequence (ai )i≥0 is a
determinant of a matrix of the form
ak ak+1 · · · ak+n−1
ak+1 ak+2 · · · ak+n
ak+2 ak+3 · · · ak+n+1
.
.. .. . . ..
. . . .
ak+n−1 ak+n · · · ak+2n−2
I If a sequence of real numbers has an overlap
a x a x a
z}|{ z }| { z }| { z }| { z }| {
ak ak+1 · · · ak+n−2 ak+n−1 ak+n · · · ak+2n−3 ak+2n−2
then the corresponding n × n Hankel matrix has the first and
last rows both equal to axa, and so the determinant is 0.
32 / 50
Avoidability
I Is there a sequence on two real numbers for which all the
Hankel determinants (of all orders) are nonzero?
I No - a simple backtracking argument proves that the longest
such sequence is of length 14. For example,
1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 2.
I How about sequences over three numbers?
Open Problem 8: Is there a sequence on three real numbers for
which all the Hankel determinants are nonzero?
33 / 50
Avoidability
Indeed, a backtracking algorithm easily finds such a sequence on
{1, 2, 3} with 200 terms.
What is interesting is that this algorithm never had to backtrack!
The sequence begins
1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 2, 3, 1, 1, 2, 1, ...
34 / 50
Avoidability
Furthermore, the following morphism seems to generate such a
sequence on 4 symbols:
1 → 12, 2 → 23, 3 → 14, 4 → 32.
This generates the sequence
1223231423141232 · · · .
Open Problem 9: Does this sequence have all nonzero Hankel
determinants? We have checked up to length 800.
35 / 50
A universality problem
Classical problem: given M, a machine of some type (e.g., DFA,
NFA, PDA), decide if L(M) = Σ∗ .
I Unsolvable, if M is a PDA;
I PSPACE-complete, if M is an NFA;
I Solvable in polynomial time if M is a DFA.
36 / 50
A universality problem
A simple variation:
Given M, does there exist an integer n ≥ 0 such that Σn ⊆ L(M)?
I Still unsolvable for PDA’s
I NP-complete for DFA’s
I PSPACE-hard for NFA’s - but is it in PSPACE? This is Open
Problem 10.
37 / 50
A universality problem
It would follow that this problem is in PSPACE if we knew that if
Σr ⊆ L(M) for some n, then there always exists a “small” such r
(e.g., r ≤ exp(p(n)) for some polynomial p).
To see this, on input the DFA, we just examine every length l up
to the bound r and nondeterministically guess a string of length l
(symbol-by-symbol) that fails to be accepted. All this can be done
in NPSPACE and hence PSPACE by Savitch’s theorem.
38 / 50
A related problem
Here is a related problem.
Open Problem 11: what is the complexity of the following
problem: given a finite language L, is L∗ infinite?
If L is represented by a regular expression, this problem is NP-hard.
39 / 50
Another related problem
Open Problem 12: What is the complexity of the following
problem: given a finite list of words L over an alphabet Σ, is
Fact(L∗ ) = Σ∗ ?
Here by “Fact” we mean the set of all factors (contiguous
subwords).
Similarly, we’d like to find good bounds on the length of the
shortest word in Σ∗ − Fact(S ∗ ), given that Fact(L∗ ) 6= Σ∗ . This is
Open Problem 13. Currently examples of quadratic length are
known, but the best upper bound is doubly-exponential in the
length of the longest word in S.
40 / 50
Problems on state complexity
Given a regular language L, we can measure its state complexity,
the number of states in the smallest DFA accepting L.
We write it as sc(L).
Maslov (1970) initiated the study of state complexity in the Soviet
Union, and since then there have been dozens of papers.
Many papers have focused on state complexity of various
transformations.
41 / 50
State complexity of transformations
Let L1 , L2 be regular languages with state complexity m and n,
respectively.
Transformation Complexity
L1 ∩ L 2 mn
L1 L2 m · 2n − 2n−1
L∗1 3 · 2n−2
LR
1 2n
42 / 50
A detour into topology
Consider a set S of real numbers, and let Cl(S) denote the
“closure” under the usual topology (S together with all its
accumulation points), and let S be the usual complement (with
respect to R).
Now consider all sets obtainable by starting from S and applying
closure and complement any number of times, in any order.
Kuratowski’s “14-theorem” says that this process gives at most 14
distinct sets.
For example,
S = [0, 1) ∪ (1, 2] ∪ {3} ∪ [4, 5] ∪ (Q ∩ (6, 7)).
43 / 50
State complexity
In analogy with Kuratowski’s “14 theorem”, one can prove the
following:
Theorem. Start with any language L and apply the
transformations L → L∗ and L → L in any order, any number of
times. Then at most 14 distinct languages are generated.
We’d like to know the state complexities of these transformation,
obtaining good upper and lower bounds.
All cases have been solved, except one: Open Problem 14:
L → (L∗ )∗ . What is the state complexity of this transformation? It
is potentially doubly-exponential, but is that achievable?
44 / 50
State complexity
Another open problem (Open Problem 15): what is the state
complexity of the transformation that maps a string to its
“boundary”, that is, to L∗ ∩ (L)∗ ? It is potentially as bad as
something like 4n , but is that achievable?
45 / 50
One More for Dessert: Pierce Expansions
Let a > b > 0 be integers. Define b0 = b and bi+1 = a mod bi for
i ≥ 0.
Let P(a, b) be the least index n such that bn = 0.
46 / 50
One More for Dessert: Pierce Expansions
An example with a = 35, b = 22:
b0 = 22
b1 = 35 mod 22 = 13
b2 = 35 mod 13 = 9
b3 = 35 mod 9 = 8
b4 = 35 mod 8 = 3
b5 = 35 mod 3 = 2
b6 = 35 mod 2 = 1
b7 = 35 mod 1 = 0.
So P(35, 22) = 7.
47 / 50
One More for Dessert: Pierce Expansions
Open Problem 16 : Find good estimates for how big P(a, b) can
be, as a function of a.
The problem is interesting because it is related to the so-called
“Pierce Expansion” of b/a:
22 1 1 1 1 1 1 1
= (1 − (1 − (1 − (1 − (1 − (1 − )))))).
35 1 2 3 4 11 17 35
This is called a Pierce expansion.
48 / 50
One More for Dessert: Pierce Expansions
It is known that P(a, b) = O(a1/3 ). However, the true behavior is
probably O((log a)2 ). There is a lower bound of Ω(log a), which
can be obtained by choosing
a = lcm(1, 2, . . . , n) − 1
b = n.
49 / 50
For Further Reading
I J. M. Robson, Separating words with machines and groups,
RAIRO Info. Theor. Appl. 30 (1996), 81–86.
I J. Currie, H. Petersen, J. M. Robson, and J. Shallit,
Separating words with small grammars, J. Autom. Lang.
Combin. 4 (1999), 101–110.
I N. Rampersad, J. Shallit, Z. Xu, The computational
complexity of universality problems for prefixes, suffixes,
factors, and subwords of regular languages,
http://arxiv.org/abs/0907.0159.
I G. Allouche, J.-P. Allouche, and J. Shallit, Kolams indiens,
dessins sur le sables aux ı̂les Vanuatu, courbe de Sierpiński, et
morphismes de monoı̈de, Annales de l’Institut Fourier 56
(2006), 2115–2130.
I J. A. Brzozowski, E. Grant, J. Shallit, Closures in formal
languages and Kuratowski’s theorem, in Developments in
Language Theory, 2009, pp. 125–144.
50 / 50