String Matching Algorithms

This document summarizes three common string matching algorithms: Naive, Rabin-Karp, and Knuth-Morris-Pratt. The Naive algorithm runs in O(mn) time in the worst case because it compares characters at each index. Rabin-Karp improves this to expected O(m+n) by comparing hash values instead of characters. Knuth-Morris-Pratt achieves O(m+n) in the worst case by building a prefix function (a state machine) from the pattern so that already matched characters are not re-checked.


STRING MATCHING

Aditya Pratap Singh

215/CO/15
Netaji Subhas Institute Of Technology
CONTENTS

● Introduction
● String Matching
● Basic Classification
● Naive Algorithm
● Rabin-Karp Algorithm
○ String Hashing
○ Hash value for substrings
● Knuth-Morris-Pratt Algorithm
○ Prefix Function
○ KMP Matcher
● Summary
INTRODUCTION

● String matching algorithms are an important class of string algorithms that try to find one or more positions at which one or several strings (patterns) occur within a larger string (the text)

● Why do we need string matching?

String matching is used in various applications like spell
checkers, spam filters, search engines, plagiarism detectors,
bioinformatics and DNA sequencing etc.
STRING MATCHING

● To find all occurrences of a pattern in a given text

● Formally, given a pattern P[1..m] and a text T[1..n], find all
occurrences of P in T. Both P and T belong to Σ*, the set of finite strings over the alphabet Σ
● P occurs with shift s (beginning at position s+1) if P[1] = T[s+1], P[2] =
T[s+2], …, P[m] = T[s+m]
● If so, s is called a valid shift; otherwise it is an invalid shift
● Note: one occurrence can start inside another one, i.e.
overlapping is allowed. E.g. for P = abab and T = abcabababbc, P occurs
at s = 3 and s = 5.

*text is the string that we are searching

*pattern is the string that we are searching for
*Shift is an offset into a string
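
The definition of a valid shift can be checked directly in code. The following is a minimal sketch, assuming 0-based indexing (so a shift s starts at index s); the names is_valid_shift and valid_shifts are illustrative and not part of the original slides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True if pattern P occurs in text T at 0-based shift s,
// i.e. P[j] == T[s + j] for every j in [0, |P|).
bool is_valid_shift(const std::string& T, const std::string& P, std::size_t s) {
    assert(!P.empty());                          // assumption: non-empty pattern
    if (s + P.size() > T.size()) return false;   // pattern would run past the text
    return T.compare(s, P.size(), P) == 0;
}

// Collects every valid shift, overlapping occurrences included.
std::vector<std::size_t> valid_shifts(const std::string& T, const std::string& P) {
    std::vector<std::size_t> shifts;
    for (std::size_t s = 0; s + P.size() <= T.size(); ++s)
        if (is_valid_shift(T, P, s)) shifts.push_back(s);
    return shifts;
}
```

For P = abab and T = abcabababbc this reports the two overlapping shifts 3 and 5 from the note above.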
BASIC CLASSIFICATION

1. Naive Algorithm - Performs a brute-force comparison of each character of the pattern at each possible placement of the pattern in the text. It is O(mn) in the worst case.

2. Rabin-Karp Algorithm - Compares hash values of the strings rather than the strings themselves. It performs well in practice and generalizes to related problems such as 2D string matching.

3. Knuth-Morris-Pratt Algorithm - An improvement on the brute-force algorithm that runs in O(m+n) in the worst case. It improves the running time by taking advantage of the prefix function.
NAIVE ALGORITHM

One of the most obvious approaches to the string matching problem is to compare the first element of the pattern ‘p’ with the first element of the string ‘s’ in which ‘p’ is to be located.

If the first element of ‘p’ matches the first element of ‘s’, compare the second elements, and so on, until either the entire ‘p’ has been matched or a mismatch is found. If a mismatch is found at any position, shift ‘p’ one position to the right and restart the comparison from the first element of ‘p’.

This approach is easy to understand and implement, but it can be too slow in some cases.
In the worst case it takes about m*n character comparisons, where m is the length of ‘p’ and n is the length of ‘s’.
PSEUDOCODE

function naive(text[], pattern[]){
    n = length(text); m = length(pattern);
    for(i = 0; i + m <= n; i++) {                    // every possible shift
        for(j = 0; j < m; j++) {
            if(text[i + j] != pattern[j]) break;     // mismatch found
        }
        if(j == m) report(i);                        // match found at shift i
    }
}
ILLUSTRATION

String S = a b c a b a a b c a b a c
Pattern P = a b a a

Step 1: Compare P[1] with S[1]

abcabaabcabac

abaa

Step 2: Compare P[2] with S[2]

abcabaabcabac

abaa
ILLUSTRATION

Step 3: Compare P[3] with S[3]

abcabaabcabac

abaa

Since a mismatch is detected, shift ‘p’ one position to the right and
repeat the comparison from P[1], as in steps 1 to 3. Whenever a
mismatch is detected, shift ‘p’ one more position to the
right and repeat the matching procedure.
ILLUSTRATION

Finally, a match is found after shifting ‘p’ three times to the right
side.

abcabaabcabac
   abaa

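Drawbacks : If ‘m’ is the length of pattern P and ‘n’ is the length of text T, then the matching time is O(n*m) in the worst case, which is too slow for long texts and patterns. For example, T = aaa…a and P = aa…ab force almost m comparisons at every shift.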
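
The naive procedure illustrated above can be sketched as a runnable C++ function. This is only an illustration under the assumptions of 0-based indexing and a non-empty pattern; the name naive_search is not from the original slides.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Brute-force matcher: tries every shift i and compares the pattern
// character by character. Worst case O(n*m) comparisons.
std::vector<int> naive_search(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());                       // assumption: non-empty pattern
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    std::vector<int> matches;
    for (int i = 0; i + m <= n; ++i) {              // each shift where the pattern still fits
        int j = 0;
        while (j < m && text[i + j] == pattern[j]) ++j;   // stop at the first mismatch
        if (j == m) matches.push_back(i);           // all m characters matched
    }
    return matches;
}
```

Calling naive_search("abcabaabcabac", "abaa") finds the single match at index 3, which corresponds to the three right shifts in the illustration.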
RABIN-KARP ALGORITHM

This is essentially the naive approach augmented with a powerful programming technique: a hash function.

Algorithm :
1. Calculate the hash of the pattern P.
2. Calculate the hash values of all the prefixes of the text T.
3. Now we can compare any substring of T of length |P| with P in constant
time using the calculated hashes.

This algorithm was authored by Michael Rabin and Richard Karp

in 1987.
STRING HASHING

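Problem - Given a string S of length n = |S|, calculate the hash value of S.

Solution - Use the polynomial hash

$$\mathrm{hash}(S) = \left( \sum_{i=0}^{n-1} S[i] \cdot p^{i} \right) \bmod m$$

where S[i] is the numeric code of the i-th character (the code in this deck uses a = 1, b = 2, …), and p and m are suitably chosen prime numbers.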

CHOICE OF PARAMETERS

‘p’ should be taken roughly equal to the number of characters in

the input alphabet. If input is composed of only lowercase
characters of English alphabet, p=31 is a good choice. If the
input may contain both uppercase and lowercase letters, then
p=53 is a good choice.

‘m’ should be a large prime. A popular choice is m = 10^9+7 (the code in this deck uses 10^9+9, another prime of similar size).

This is a large number, but still small enough that the product of two values below m fits in a 64-bit integer.
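
As a minimal sketch of the hash described above, assuming lowercase input only, p = 31 and m = 10^9+9 (the values used by the Rabin-Karp code in this deck), with a mapped to 1, b to 2 and so on. The function name compute_hash is illustrative.

```cpp
#include <cassert>
#include <string>

// Polynomial hash: sum of code(s[i]) * p^i, taken modulo m.
long long compute_hash(const std::string& s) {
    const long long p = 31;              // roughly the alphabet size (lowercase letters)
    const long long m = 1000000009LL;    // 10^9 + 9, a large prime
    long long hash_value = 0;
    long long p_pow = 1;                 // p^i mod m for the current position i
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');    // assumption: lowercase English letters only
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}
```

For example, "ab" hashes to 1·31^0 + 2·31^1 = 63. Both factors stay below m, so every product fits in a 64-bit integer.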
HASH CALCULATION OF SUBSTRINGS OF GIVEN STRING

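Problem : Given a string S and indices i and j, find the hash value
of S[i..j].

Solution :
By definition we have,

$$\mathrm{hash}(S[i..j]) = \left( \sum_{k=i}^{j} S[k] \cdot p^{k-i} \right) \bmod m$$

Multiplying by p^i gives,

$$\mathrm{hash}(S[i..j]) \cdot p^{i} \equiv \sum_{k=i}^{j} S[k] \cdot p^{k} \equiv \mathrm{hash}(S[0..j]) - \mathrm{hash}(S[0..i-1]) \pmod m$$

So, knowing the hash value of each prefix of the string S, we can compute the hash of any substring, multiplied by p^i, in constant O(1) time. That is enough to compare two substrings: multiply each side by the appropriate power of p so that both carry the same factor.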
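
The prefix-hash idea can be sketched as a small helper, assuming lowercase input and the same p = 31, m = 10^9+9 as the rest of the deck. The type PrefixHasher and its member names are hypothetical, introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr long long kP = 31;             // assumed base
constexpr long long kM = 1000000009LL;   // assumed modulus, 10^9 + 9

struct PrefixHasher {
    std::vector<long long> h;       // h[k] = hash of s[0..k-1]
    std::vector<long long> p_pow;   // p_pow[k] = kP^k mod kM

    explicit PrefixHasher(const std::string& s)
        : h(s.size() + 1, 0), p_pow(s.size() + 1, 1) {
        for (std::size_t k = 0; k < s.size(); ++k) {
            assert(s[k] >= 'a' && s[k] <= 'z');   // assumption: lowercase only
            p_pow[k + 1] = p_pow[k] * kP % kM;
            h[k + 1] = (h[k] + (s[k] - 'a' + 1) * p_pow[k]) % kM;
        }
    }

    // Returns hash(s[i..j]) * kP^i mod kM in O(1), using two prefix hashes.
    long long shifted_hash(std::size_t i, std::size_t j) const {
        return (h[j + 1] + kM - h[i]) % kM;
    }

    // Compares s[i1..j1] with s[i2..j2] by hash; equal hashes mean the
    // substrings are equal with high probability (collisions are possible).
    bool probably_equal(std::size_t i1, std::size_t j1,
                        std::size_t i2, std::size_t j2) const {
        assert(j1 - i1 == j2 - i2);   // both ranges must have the same length
        // Multiply each side so that both carry the factor kP^(i1 + i2).
        return shifted_hash(i1, j1) * p_pow[i2] % kM ==
               shifted_hash(i2, j2) * p_pow[i1] % kM;
    }
};
```

With PrefixHasher ph("abcabc"), ph.probably_equal(0, 2, 3, 5) compares the two copies of "abc" without looking at the characters again.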
PSEUDOCODE
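#include <algorithm>
#include <string>
#include <vector>
using namespace std;

// Returns the 0-based start index of every window of text whose hash equals
// the hash of pat. Assumes input made of lowercase letters 'a'..'z'.
// Hashes are not verified, so a collision could report a false match.
vector<int> rabin_karp(string const& pat, string const& text) {
    const int p = 31, m = 1e9 + 9;
    int S = pat.size(), T = text.size();

    // p_pow[i] = p^i mod m (one extra slot so empty inputs are safe)
    vector<long long> p_pow(max(S, T) + 1);
    p_pow[0] = 1;
    for (int i = 1; i < (int)p_pow.size(); i++)
        p_pow[i] = (p_pow[i-1] * p) % m;

    // h[i] = hash of the prefix text[0..i-1]
    vector<long long> h(T + 1, 0);
    for (int i = 0; i < T; i++)
        h[i+1] = (h[i] + (text[i] - 'a' + 1) * p_pow[i]) % m;

    // h_s = hash of the pattern
    long long h_s = 0;
    for (int i = 0; i < S; i++)
        h_s = (h_s + (pat[i] - 'a' + 1) * p_pow[i]) % m;

    vector<int> occurrences;
    for (int i = 0; i + S - 1 < T; i++) {
        // cur_h = hash(text[i..i+S-1]) * p^i mod m
        long long cur_h = (h[i+S] + m - h[i]) % m;
        if (cur_h == h_s * p_pow[i] % m)
            occurrences.push_back(i);
    }
    return occurrences;
}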
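
The function above trusts the hash comparison alone. A common variant, sketched here as an assumption rather than as part of the original slides, confirms each hash hit with a direct character comparison so that a collision can never produce a false match. The name rabin_karp_verified is illustrative; lowercase input and a non-empty pattern are assumed.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<int> rabin_karp_verified(const std::string& pat, const std::string& text) {
    assert(!pat.empty());                           // assumption: non-empty pattern
    const long long p = 31, m = 1000000009LL;       // same parameters as the deck's code
    const int S = static_cast<int>(pat.size());
    const int T = static_cast<int>(text.size());
    std::vector<int> occurrences;
    if (S > T) return occurrences;

    std::vector<long long> p_pow(T + 1, 1);         // p_pow[i] = p^i mod m
    for (int i = 1; i <= T; ++i) p_pow[i] = p_pow[i - 1] * p % m;

    std::vector<long long> h(T + 1, 0);             // h[i] = hash of text[0..i-1]
    for (int i = 0; i < T; ++i)
        h[i + 1] = (h[i] + (text[i] - 'a' + 1) * p_pow[i]) % m;

    long long h_s = 0;                              // hash of the pattern
    for (int i = 0; i < S; ++i)
        h_s = (h_s + (pat[i] - 'a' + 1) * p_pow[i]) % m;

    for (int i = 0; i + S <= T; ++i) {
        const long long cur_h = (h[i + S] + m - h[i]) % m;
        // Hash hit first (cheap), then verify the characters (O(S), only on hits).
        if (cur_h == h_s * p_pow[i] % m && text.compare(i, S, pat) == 0)
            occurrences.push_back(i);
    }
    return occurrences;
}
```

The verification step keeps the expected running time at O(|pat| + |text|) when collisions are rare, at the cost of an O(|pat|·|text|) worst case.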
KNUTH-MORRIS-PRATT ALGORITHM

Knuth, Morris and Pratt proposed a linear-time algorithm for the
string matching problem.

A matching time of O(n) is achieved by never moving backwards in ‘S’: an element of ‘S’ that has already been matched against some element of the pattern ‘p’ is not compared again, i.e. backtracking on the string ‘S’ never occurs.

KMP makes use of the ‘prefix function’.

PREFIX FUNCTION

The prefix function of a string s of length n is defined as an array Ⲡ of length n, where Ⲡ[i] is the length of the longest proper prefix of the substring s[0..i] which is also a suffix of this substring.
A proper prefix of a string is a prefix that is not equal to the string itself. So by definition Ⲡ[0] = 0.

Mathematically,

$$\Pi[i] = \max_{k = 0 \ldots i} \{\, k : s[0 \ldots k-1] = s[i-(k-1) \ldots i] \,\}$$
EXAMPLE
S = “aabaaab”
i    PREFIX s[0..i]    Ⲡ[i]
0    a                 0
1    aa                1
2    aab               0
3    aaba              1
4    aabaa             2
5    aabaaa            2
6    aabaaab           3
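
The table can be reproduced straight from the definition. The sketch below is a deliberately slow O(n^3) check, not the efficient algorithm; the name prefix_function_naive is illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

// For each i, tries every proper prefix length k (largest first) and keeps
// the first k for which s[0..k-1] equals the suffix s[i-k+1..i].
std::vector<int> prefix_function_naive(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> pi(n, 0);
    for (int i = 0; i < n; ++i) {
        for (int k = i; k >= 1; --k) {              // k <= i keeps the prefix proper
            if (s.compare(0, k, s, i - k + 1, k) == 0) {
                pi[i] = k;                          // longest match found
                break;
            }
        }
    }
    return pi;
}
```

For S = "aabaaab" this yields 0, 1, 0, 1, 2, 2, 3, matching the table.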
ALGORITHM TO COMPUTE PREFIX FUNCTION

● We compute the prefix values Ⲡ[i] in a loop iterating from i=1 to i=n-1 (Ⲡ[0] is simply assigned 0)

● To calculate the current value Ⲡ[i], we keep a variable j denoting the length of the best suffix found for ‘i-1’. Initially j = Ⲡ[i-1]

● Test whether the suffix of length ‘j+1’ is also a prefix by comparing s[j] and s[i]. If they are equal, we assign Ⲡ[i] = j+1. Otherwise, we reduce j to Ⲡ[j-1] and repeat this step.

● If we have reached j=0 and s[0] still does not match s[i], we assign Ⲡ[i] = 0 and go on to the next index ‘i+1’
PSEUDOCODE

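#include <string>
#include <vector>
using namespace std;

vector<int> prefix_function(string s){
    int n = (int)s.length();
    vector<int> pi(n);                          // pi[0] = 0 by definition
    for(int i=1;i<n;i++){
        int j = pi[i-1];                        // best length found for s[0..i-1]
        while(j>0 and s[i]!=s[j]) j = pi[j-1];  // fall back to the next shorter candidate
        if(s[i] == s[j]) ++j;                   // the match extends by one character
        pi[i] = j;
    }
    return pi;
}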

Runtime - O(n)
KMP MATCHER

● This is a classical application of the prefix function, which we just learned
● Given a text T and a string S, we need to find all occurrences of S
in T
● Denote by n the length of the string S and by m the length
of the string T, i.e. n = |S| and m = |T|
● Generate the string S + # + T, where # is a separator that appears in neither S nor T. Now calculate the prefix function of this
string
● By definition, for a position ‘i’ inside the T part of this string, Ⲡ[i] is the length of the longest prefix of S that ends at position ‘i’
● Note: Ⲡ[i] cannot be larger than ‘n’ because of the separator #
that we used
● If Ⲡ[i] == n, then the string S appears completely, ending at
this position.
EXAMPLE

S = “aba”
T = “aababac”
Generated string(G) = “aba#aababac”
Index (i)    PREFIX of T    Ⲡ[i]
4            a              1
5            aa             1
6            aab            2
7            aaba           3
8            aabab          2
9            aababa         3
10           aababac        0

Ⲡ[i] = n (= 3) at positions i = 7 and i = 9 of G, which means the pattern S occurs in the text starting at indices i - 2n = 1 and i - 2n = 3.
PSEUDOCODE

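vector<int> kmp(string pattern,string text){
    // '#' must not occur in pattern or text
    string str = pattern + "#" + text;
    int n = pattern.length(), m = str.length();   // here m is the length of the joined string
    vector<int> pi = prefix_function(str);
    vector<int> ret;
    for(int i=n+1;i<m;i++) {
        // a full match ends at position i of str, so it starts at i-2n in text
        if(pi[i] == n) ret.push_back(i-2*n);
    }
    return ret;
}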

Runtime: O(n+m)
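
The separator trick needs a character that occurs in neither string and builds a joined copy of both. A common alternative, sketched here as an assumption and not taken from the slides, computes the prefix function of the pattern only and then walks the text with a running match length. The name kmp_search is illustrative, and a non-empty pattern is assumed.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<int> kmp_search(const std::string& pattern, const std::string& text) {
    assert(!pattern.empty());                       // assumption: non-empty pattern
    const int n = static_cast<int>(pattern.size());

    // Prefix function of the pattern alone.
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; ++i) {
        int j = pi[i - 1];
        while (j > 0 && pattern[i] != pattern[j]) j = pi[j - 1];
        if (pattern[i] == pattern[j]) ++j;
        pi[i] = j;
    }

    std::vector<int> result;
    int j = 0;                                      // length of the current partial match
    for (int i = 0; i < static_cast<int>(text.size()); ++i) {
        while (j > 0 && text[i] != pattern[j]) j = pi[j - 1];   // shrink the match on mismatch
        if (text[i] == pattern[j]) ++j;
        if (j == n) {                               // full match ends at text[i]
            result.push_back(i - n + 1);
            j = pi[n - 1];                          // keep going so overlaps are found
        }
    }
    return result;
}
```

The text index i never decreases, so the running time stays O(n+m) while the extra memory drops to O(n). For S = "aba" and T = "aababac" it reports the same indices 1 and 3 as the example above.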
SUMMARY

Algorithm              Time Complexity                                          Key Ideas                                                   Approach
Brute Force (Naive)    O(m*n)                                                   Tries the pattern at every position of the text            Linear Searching
Rabin-Karp             O(m+n) expected; O(m*n) worst case if hits are verified  Compares the text and the pattern using their hash values  Hashing Based
Knuth-Morris-Pratt     O(m+n)                                                   Constructs an automaton (prefix function) from the pattern Prefix Function Based

n = |pattern| , length of pattern

m = |text| , length of text
THANK YOU
