Originally Posted on SeedBx
Pattern searching is a classic question which refers to the act of
searching for the occurrences of a smaller string in a bigger string (in terms
of length).
The KMP Algorithm, which was proposed by Donald Knuth and Vaughan Pratt and independently by James Morris in 1970, solves this problem in linear time.
Note : In all the pseudo codes, 0-based indexing is used and the indentations are used to differentiate between block of codes.
Problem Statement
Let S be a given string and P be a given pattern such that |S|>|P| (where |•| refers to the cardinality of •). Find the number of occurrences of P in S.
Naive Approach
Loosely speaking we can iterate over
the two strings to check for the occurrences of P in S. More concretely we can
iterate over the indices of the string S and as soon as we encounter a position
pos such that S[pos] = P[0] (or P[1] depending on the language), we check for
all the positions in P. And if we reach the end of P, i.e. an occurrence of
pattern is found we increment the counter.
no_of_occurrences=0
for i from 0 up to S.Length-P.Length+1
for j from 0 up to P.length
if S[i+j]!=P[j]
break
if j==P.Length
no_of_occurrences++
Best-Case Analysis -
Let S = "abcdef" and P = "xyz".
If we run the naive pattern
searching algorithm on this pair of strings, we never traverse the inner loop
as there is no occurrence of P in S.
Hence, in best case, the time
complexity is O(|S|).
Worst-Case Analysis -
Let S = "aaaaaab" and P =
"aaab".
If we run the naive pattern
searching algorithm on this pair of strings, we will always traverse the inner
loop to the full extent. Hence, in worst case, the time complexity is
O(|S|*|P|).
Time Complexity : O(|S|*|P|)
Space Complexity : O(1)
Knuth-Morris-Pratt aka KMP Algorithm
If we look carefully at the worst-case analysis example, we will notice that a lot of computations while comparing S and P repeats. What KMP Algorithm does is to reduce these repeated comparisons to a minimum and rather use some pre-computed values to skip the comparisons.
LPS Array/ π Table -
The algorithm uses a very subtle
idea to skip comparisons. To achieve this, the algorithm uses a pre-processed
auxiliary array LPS[]. Let us denote this array by π. Hence for a given
position i, π(i) stores the maximum length of proper prefix which is also a
suffix (or equivalently maximum length of proper suffix which is also a prefix)
in the substring of length i of P i.e. P[0… π(i)] is equal to P[i- π(i)+1…i].
The term proper here implies that π(i) can never be equal to the length of the
substring P[0...i].
For example :
Let P = “aabcaad”
i |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
P[i] |
a |
a |
b |
c |
a |
a |
d |
π(i) |
0 |
1 |
0 |
0 |
1 |
2 |
0 |
- The substring P[0… π(i-1)] if it exists is also a suffix of P[0…i].
- The values of the π table can only increase by 1 at most i.e. π(i)- π(i-1) <=1.
Proof by Contradiction :Let us assume that the values of π(i)> π(i-1) +1.
Let us consider the substring by removing the last character of the string P[0… π(i)]. By doing so we end up with a LPS ending in position i with length π(i)-1, which is better than π(i-1).This is a contradiction.
- All the prefixes of P that are also a suffix of P can obtained recursively i.e. P[0… π(i)], P[0… π(π(i))], P[0… π(π(π(i)))] and so on are all prefix of P that are also suffix of P.
Intuition :
P[0… π(π(i))] is the LPS of P[0… π(i)], hence P[0… π(π(i))] = P[π(i)-π(π(i))+1… π(i)].
As P[0… π(i)]= P[i- π(i)+1…i] implies P[π(i)-π(π(i))+1… π(i)]=P[i-π(π(i))+1… i]. Therefore P[0… π(π(i))] = P[i-π(π(i))+1… i] and hence P[0… π(π(i))] is a lps of P[0…i].
Computing π Table/LPS array -
Basic Overview:
To calculate π(i), we start from j=i-1 and iteratively apply the π value. We set π(i)= πk(j)+1 whenever we find the least k for which P[πk(j)+1]=P[i], where πk(j) = π(π…k times(i-1)). If no such value of k is found then we set π(i)=0.
Pseudo Code - Π(0)=0
for i from 1 up to P.Length
j= Π(i-1)
while j>0 and P[i]!=P[j]
j=Π(j-1)
if P[i]==P[j]
j++
Π(i)=j
Using π Table/LPS Array -
Let us
look now at how the algorithm uses the π table to actually skip
comparisons by looking at an example.
S = “ABCABCAABD”
P = “ABCAABD”
Hence π =
(0, 0, 0, 1, 1, 2, 0)
- Start Matching at the first position of S.
- Mismatch at S[4].
- We matched i = 4 letters so far, and π(3)=1, thus there is no point in starting the comparison at S[1], S[2].
- Hence, we shift P by i- π(i-1) = 3 letters.
- All the characters match, we record this match and again shift P by 7- π(6)=7 characters, which terminates the loop.
Final Algorithm -
Combining all the ideas we developed so far, the final algorithm is
Pseudo Code –
no_of_occurences=0
i=0
j=0
while i<S.Length
if P[j]==S[i]
j++
i++
if j==P.Length
j=Π(j-1)
no_of_occurences++
else if i<S.Length and P[j]!=S[i]
if j!=0
j=Π(j-1)
else
i++
Time Complexity : O(|S|+|P|)
Space Complexity : O(|P|)