The Boyer-Moore Algorithm
Right-to-left scan
The string matching algorithm of choice.
12345678901234567
xpbctbxabpqxctbpg
tpabxab
Achieves expected sublinear run time using three
ideas:
Right-to-left scan
Bad character rule
Good suffix rule
Boyer-Moore
When a mismatch occurs, shift P right by some
amount.
If we always shift by one, run time is O(nm). Instead,
we try to shift by more than one, if possible.
Page 1
Definition. For each x in the alphabet, R(x) is the
index of the rightmost occurrence of x in P. R(x) is 0 if
x does not occur in P.
Example. = {a, b, c, d}
P = abadab
Boyer-Moore
Page 2
The bad character shift rule
Suppose that for some alignment of P against T, the
rightmost n i characters of P are matched, but P[i]
mismatches. Assume the mismatch is with T[k].
Then, shift P right by max[1, i R(T[k])] places.
x a b c d
R(x) 5 6 0 4
bacbcdabaaba
abadab
abadab
abadab
abadab
Boyer-Moore
Page 3
Boyer-Moore
Shift by 2
Shift by 3
Shift by 1
Page 4
The extended bad character shift rule
The good suffix rule. Suppose that for some
alignment of P and T, substring t of T matches a suffix
When a mismatch occurs at P[i] and T[k] and T[k] = x,
shift P to the right so that the closest x in P to the left
of P[i] is aligned with T[k].
of P, but a mismatch occurs at the next position. Find
the rightmost copy t of t in P such that t is not a
suffix of P and the character to the left of t in P differs
from the character to the left of t in P. Shift P so that
t in P is aligned with t in T.
T
P
x
P
x
x
x
y
T
x
t
P
Boyer-Moore
Page 5
. . . If there is no such t, shift the left end of P past the
left end of t in T by the least amount so that a prefix
of the shifted pattern matches a suffix of t in T.
Boyer-Moore
t
P
t
y
t
Page 6
P
s
. . . If no such shift is possible, shift P by n places to
the right (i.e., past t).
Boyer-Moore
s
y
Page 7
Boyer-Moore
Page 8
If an occurrence of P is found, shift P by the least
amount so that a proper prefix of the shifted P matches
123456789012345678
prstabstubabvqxrst
*
qcabdabdab
1234567890
a suffix of the occurrence of P in T. If no such shift is
possible, shift P by n places to the right (i.e., past t).
123456789012345678
prstabstubabvqxrst
qcabdabdab
1234567890
Boyer-Moore
Page 9
Definition. For each i, L(i) is the largest position less
than n such that P[i.. n] matches a suffix of
P[1..L(i)]. L(i) is 0 if no such position exists.
L(i)
Boyer-Moore
Definition. For each i, L(i) is the largest position less
than n such that P[i..n] matches a suffix of
P[1..L(i)] and such that the character preceding that
suffix is not equal to P[i1]. L(i) is 0 if no such
position exists.
n
P
Boyer-Moore
Page 10
Page 11
Boyer-Moore
z
1
y
L(i)
y
i
n
Page 12
Definition. Nj(P) is the length of the longest suffix of
P[1..j] that is also a suffix of the full string P.
Example
Nj(P)
P
Observation. N(P) is the reverse of Z(P). That is,
Nj(P) = Znj+1(Pr), where Pr is the reversal of P.
Example
j
P[j]
Nj(P)
1 2 3 4 5 6 7 8 9
c a b d a b d a b
0 0 2 0 0 5 0 0 *
Boyer-Moore
j
P[j]
Nj(P)
1 2 3 4 5 6 7 8 9
c a b d a b d a b
0 0 2 0 0 5 0 0 *
Pr[j]
Zj(Pr)
b a d b a d b a c
* 0 0 5 0 0 2 0 0
Thus, N(P) can be computed in O(n) time.
Page 13
Theorem.
L(i) is the largest j such that Nj(P) |P[i..n]|.
L(i) is the largest j such that Nj(P)= |P[i..n]|.
Boyer-Moore
Page 14
Definition. l(i) is the length of the longest suffix of
P[i..n] that is also a prefix of P. If no such suffix
exists, l(i) = 0.
l(i)
Computing the L(i)s
for i 1 to n do
L(i) 0
for j 1 to n1 do
i nNj(P) + 1
L(i) j
Theorem. l(i) equals the largest j |P[i..n]| such
that Nj(P) = j.
Thus, L can be computed in O(n) time.
Boyer-Moore
Page 15
Boyer-Moore
Page 16
Using L and l for the good suffix rule
If a mismatch occurs at position i 1 of P, then
If L(i) > 0, shift P right by n L(i) places
If L(i) = 0, shift P right by n l(i) places
If an occurrence of P is found, then shift P by
n l(2) places.
If P[n] mismatches, shift P one place to the right.
Boyer-Moore
Page 17
With the strong good suffix rule alone, the worst-case
run time of Boyer-Moore is
O(m) if P is not in T (Knuth, Morris, Pratt 1977,
Guibas & Odlyzko 1980, Cole 1994)
O(nm) if P is in T, but can be modified to achieve
O(n+m) time in all cases (Galil 1979, Apostolico
and Giancarlo 1986)
With the bad character rule alone
Worst-case time is O(nm)
Expected time on random strings is sublinear
Sublinear time observed in practice
Boyer-Moore
Page 19
Boyer-Moore(P,T)
compute L(i) and L(i) for each position i of P
compute R(x) for each x in the alphabet
kn
while k m do
i n; h k
while i > 0 and P[i] = T[h] do
i--; h-if i = 0 then
report occurrence of P in T ending at T[k]
k k + n l(2)
else
shift P (increase k) by the maximum
amount determined by the bad character
rule and the good suffix rule
Boyer-Moore
Page 18