Tuesday, June 30, 2020

Pattern Searching : KMP Algorithm

Originally Posted on SeedBx


Pattern searching is a classic question which refers to the act of searching for the occurrences of a smaller string in a bigger string (in terms of length).

The KMP Algorithm, which was proposed by Donald Knuth and Vaughan Pratt and independently by James Morris in 1970, solves this problem in linear time.


Note : In all the pseudo codes, 0-based indexing is used and the indentations are used to differentiate between block of codes.


Problem Statement

Let S be a given string and P be a given pattern such that |S|>|P| (where |•| refers to the cardinality of •). Find the number of occurrences of P in S.


Naive Approach

Loosely speaking we can iterate over the two strings to check for the occurrences of P in S. More concretely we can iterate over the indices of the string S and as soon as we encounter a position pos such that S[pos] = P[0] (or P[1] depending on the language), we check for all the positions in P. And if we reach the end of P, i.e. an occurrence of pattern is found we increment the counter.

 

Pseudo Code -
 no_of_occurrences=0
for i from 0 up to S.Length-P.Length+1
    for j from 0 up to P.length
        if S[i+j]!=P[j]
            break
    if j==P.Length
        no_of_occurrences++

Best-Case Analysis - 

Let S = "abcdef" and P = "xyz".

If we run the naive pattern searching algorithm on this pair of strings, we never traverse the inner loop as there is no occurrence of P in S.

Hence, in best case, the time complexity is O(|S|).

Worst-Case Analysis - 

Let S = "aaaaaab" and P = "aaab".

If we run the naive pattern searching algorithm on this pair of strings, we will always traverse the inner loop to the full extent. Hence, in worst case, the time complexity is O(|S|*|P|).

 

Time Complexity : O(|S|*|P|)

Space Complexity : O(1)


Knuth-Morris-Pratt aka KMP Algorithm

If we look carefully at the worst-case analysis example, we will notice that a lot of computations while comparing S and P repeats. What KMP Algorithm does is to reduce these repeated comparisons to a minimum and rather use some pre-computed values to skip the comparisons.

LPS Array/ π Table -

The algorithm uses a very subtle idea to skip comparisons. To achieve this, the algorithm uses a pre-processed auxiliary array LPS[]. Let us denote this array by π. Hence for a given position i, π(i) stores the maximum length of proper prefix which is also a suffix (or equivalently maximum length of proper suffix which is also a prefix) in the substring of length i of P i.e. P[0… π(i)] is equal to P[i- π(i)+1…i]. The term proper here implies that π(i) can never be equal to the length of the substring P[0...i].

For example :

Let P = “aabcaad”

i

1

2

3

4

5

6

7

P[i]

a

a

b

c

a

a

d

π(i)

0

1

0

0

1

2

0


Key Observations : 
  • The substring P[0… π(i-1)] if it exists is also a suffix of P[0…i].
  • The values of the π table can only increase by 1 at most i.e. π(i)- π(i-1) <=1.
Proof by Contradiction :

Let us assume that the values of π(i)> π(i-1) +1.

Let us consider the substring by removing the last character of the string P[0… π(i)]. By doing so we end up with a LPS ending in position i with length π(i)-1, which is better than π(i-1).This is a contradiction. 

  • All the prefixes of P that are also a suffix of P can obtained recursively i.e. P[0… π(i)], P[0… π(π(i))], P[0… π(π(π(i)))] and so on are all prefix of P that are also suffix of P.

Intuition :

P[0… π(π(i))] is the LPS of P[0… π(i)], hence P[0… π(π(i))] = P[π(i)-π(π(i))+1… π(i)].

As P[0… π(i)]= P[i- π(i)+1…i] implies P[π(i)-π(π(i))+1… π(i)]=P[i-π(π(i))+1… i]. Therefore P[0… π(π(i))] = P[i-π(π(i))+1… i] and hence P[0… π(π(i))] is a lps of P[0…i].

 

Computing π Table/LPS array -

Basic Overview:

To calculate π(i), we start from j=i-1 and iteratively apply the π value. We set π(i)= πk(j)+1 whenever we find the least k for which P[πk(j)+1]=P[i], where πk(j) = π(π…k times(i-1)). If no such value of k is found then we set π(i)=0.

Pseudo Code -
 Π(0)=0 
for i from 1 up to P.Length
    j= Π(i-1)
     while j>0 and P[i]!=P[j]
         j=Π(j-1)
     if P[i]==P[j]
         j++
     Π(i)=j

 

Using π Table/LPS Array - 

Let us look now at how the algorithm uses the π table to actually skip comparisons by looking at an example.

S = “ABCABCAABD”

P = “ABCAABD”

Hence π = (0, 0, 0, 1, 1, 2, 0)

  • Start Matching at the first position of S.
Example Image

  • Mismatch at S[4].
  • We matched i = 4 letters so far, and π(3)=1, thus there is no point in starting the comparison at S[1], S[2].
  • Hence, we shift P by i- π(i-1) = 3 letters.
Example Image

  • All the characters match, we record this match and again shift P by 7- π(6)=7 characters, which terminates the loop.

Final Algorithm - 

Combining all the ideas we developed so far, the final algorithm is

Pseudo Code –
 no_of_occurences=0 
i=0
j=0
while i<S.Length
    if P[j]==S[i]
         j++
        i++
    if j==P.Length
         j=Π(j-1)
         no_of_occurences++
    else if i<S.Length and P[j]!=S[i]
         if j!=0
             j=Π(j-1)
         else
             i++

Time Complexity : O(|S|+|P|)

Space Complexity : O(|P|)


At last, KMP algorithm is a very powerful algorithm with it being applied in the field of DNA matching among many others.

Thanks for reading.
As always your valuable comments and suggestions are welcome.

No comments:

Post a Comment