Tuesday, June 30, 2020

Pattern Searching : KMP Algorithm

Originally Posted on SeedBx


Pattern searching is a classic question which refers to the act of searching for the occurrences of a smaller string in a bigger string (in terms of length).

The KMP Algorithm, which was proposed by Donald Knuth and Vaughan Pratt and independently by James Morris in 1970, solves this problem in linear time.


Note : In all the pseudo codes, 0-based indexing is used and the indentations are used to differentiate between block of codes.


Problem Statement

Let S be a given string and P be a given pattern such that |S|>|P| (where |•| refers to the cardinality of •). Find the number of occurrences of P in S.


Naive Approach

Loosely speaking we can iterate over the two strings to check for the occurrences of P in S. More concretely we can iterate over the indices of the string S and as soon as we encounter a position pos such that S[pos] = P[0] (or P[1] depending on the language), we check for all the positions in P. And if we reach the end of P, i.e. an occurrence of pattern is found we increment the counter.

 

Pseudo Code -
 no_of_occurrences=0
for i from 0 up to S.Length-P.Length+1
    for j from 0 up to P.length
        if S[i+j]!=P[j]
            break
    if j==P.Length
        no_of_occurrences++

Best-Case Analysis - 

Let S = "abcdef" and P = "xyz".

If we run the naive pattern searching algorithm on this pair of strings, we never traverse the inner loop as there is no occurrence of P in S.

Hence, in best case, the time complexity is O(|S|).

Worst-Case Analysis - 

Let S = "aaaaaab" and P = "aaab".

If we run the naive pattern searching algorithm on this pair of strings, we will always traverse the inner loop to the full extent. Hence, in worst case, the time complexity is O(|S|*|P|).

 

Time Complexity : O(|S|*|P|)

Space Complexity : O(1)


Knuth-Morris-Pratt aka KMP Algorithm

If we look carefully at the worst-case analysis example, we will notice that a lot of computations while comparing S and P repeats. What KMP Algorithm does is to reduce these repeated comparisons to a minimum and rather use some pre-computed values to skip the comparisons.

LPS Array/ π Table -

The algorithm uses a very subtle idea to skip comparisons. To achieve this, the algorithm uses a pre-processed auxiliary array LPS[]. Let us denote this array by π. Hence for a given position i, π(i) stores the maximum length of proper prefix which is also a suffix (or equivalently maximum length of proper suffix which is also a prefix) in the substring of length i of P i.e. P[0… π(i)] is equal to P[i- π(i)+1…i]. The term proper here implies that π(i) can never be equal to the length of the substring P[0...i].

For example :

Let P = “aabcaad”

i

1

2

3

4

5

6

7

P[i]

a

a

b

c

a

a

d

π(i)

0

1

0

0

1

2

0


Key Observations : 
  • The substring P[0… π(i-1)] if it exists is also a suffix of P[0…i].
  • The values of the π table can only increase by 1 at most i.e. π(i)- π(i-1) <=1.
Proof by Contradiction :

Let us assume that the values of π(i)> π(i-1) +1.

Let us consider the substring by removing the last character of the string P[0… π(i)]. By doing so we end up with a LPS ending in position i with length π(i)-1, which is better than π(i-1).This is a contradiction. 

  • All the prefixes of P that are also a suffix of P can obtained recursively i.e. P[0… π(i)], P[0… π(π(i))], P[0… π(π(π(i)))] and so on are all prefix of P that are also suffix of P.

Intuition :

P[0… π(π(i))] is the LPS of P[0… π(i)], hence P[0… π(π(i))] = P[π(i)-π(π(i))+1… π(i)].

As P[0… π(i)]= P[i- π(i)+1…i] implies P[π(i)-π(π(i))+1… π(i)]=P[i-π(π(i))+1… i]. Therefore P[0… π(π(i))] = P[i-π(π(i))+1… i] and hence P[0… π(π(i))] is a lps of P[0…i].

 

Computing π Table/LPS array -

Basic Overview:

To calculate π(i), we start from j=i-1 and iteratively apply the π value. We set π(i)= πk(j)+1 whenever we find the least k for which P[πk(j)+1]=P[i], where πk(j) = π(π…k times(i-1)). If no such value of k is found then we set π(i)=0.

Pseudo Code -
 Π(0)=0 
for i from 1 up to P.Length
    j= Π(i-1)
     while j>0 and P[i]!=P[j]
         j=Π(j-1)
     if P[i]==P[j]
         j++
     Π(i)=j

 

Using π Table/LPS Array - 

Let us look now at how the algorithm uses the π table to actually skip comparisons by looking at an example.

S = “ABCABCAABD”

P = “ABCAABD”

Hence π = (0, 0, 0, 1, 1, 2, 0)

  • Start Matching at the first position of S.
Example Image

  • Mismatch at S[4].
  • We matched i = 4 letters so far, and π(3)=1, thus there is no point in starting the comparison at S[1], S[2].
  • Hence, we shift P by i- π(i-1) = 3 letters.
Example Image

  • All the characters match, we record this match and again shift P by 7- π(6)=7 characters, which terminates the loop.

Final Algorithm - 

Combining all the ideas we developed so far, the final algorithm is

Pseudo Code –
 no_of_occurences=0 
i=0
j=0
while i<S.Length
    if P[j]==S[i]
         j++
        i++
    if j==P.Length
         j=Π(j-1)
         no_of_occurences++
    else if i<S.Length and P[j]!=S[i]
         if j!=0
             j=Π(j-1)
         else
             i++

Time Complexity : O(|S|+|P|)

Space Complexity : O(|P|)


At last, KMP algorithm is a very powerful algorithm with it being applied in the field of DNA matching among many others.

Thanks for reading.
As always your valuable comments and suggestions are welcome.

Monday, June 22, 2020

Gentle Introduction to C : Call by Value and Reference

Originally Posted on SeedBx


Welcome to the fourth part of the Gentle Introduction to C series. As said before this series aims to provide you a brief introduction to C language.

In this part we will be looking at call by value and call by reference.


Before talking about call by value and call by reference let’s get some more terminologies under our belt.

Function Call : It is a request made by the program by using the function name and a list of parameters (if any) enclosed in braces.

Formal Parameters :  These are the parameters which are defined in the function.

Actual Parameters : These are the parameters that the caller location passes to the function.

Interesting Fact : C is case-sensitive

 

Call By Value

Call by Value is a function calling technique in which the formal parameters and actual parameters are different independent variables, more so the formal parameters are separate variables which share the same value. Internally what happens is when a function is called by value the values of the actual parameters are copied to the formal parameters (dummy variables). As the formal parameters and actual parameters are different, any changes made to formal parameters are not reflected in actual parameters.

Example-

    Code :

    #include<stdio.h>
    void callByValue(int a)
    {
        a=3;
        printf("Value of Formal Parameter = %d\n", a);
    }
    int main()
    {
        int a;
        a=2;
        callByValue(a);
        printf("Value of Actual Parameter = %d", a);
        return 0;
    }

    Output :

        Value of Formal Parameter = 3
        Value of Actual Parameter = 2

As you can notice above, the value of 'a' in the main() function is not affected by the change in the value of 'a' in callByValue() function.

 

Call By Reference

Call by reference is a function calling technique in which the formal parameters and actual parameters are actually identical variables, i.e. they share the same memory location internally. Call by reference can be initiated by passing either pointers to the variable or passing reference to the variable. As the formal and actual parameter share the same memory location, any changes to the formal parameters is actually reflected in the actual parameters.

Example-

    Code :

    #include<stdio.h>
    void callByReference(int &a)
    {
        a=3;
        printf("Value of Formal Parameter = %d\n", a);
    }
    int main()
    {
        int a;
        a=2;
        callByReference(a);
        printf("Value of Actual Parameter = %d", a);
        return 0;
    }

    Output :

        Value of Formal Parameter = 3
        Value of Actual Parameter = 3

 

As you can notice above, the value of 'a' in the main() function is affected by the change in the value of 'a' in callByReference() function.

Both of these function calling techniques have their fair share of uses in actual applications, however the calling technique to be used totally depends on the use case.

 

As always your suggestions are welcome. 

Do comment out what are some interesting ways in which you use these calling techniques.

 

Wednesday, June 17, 2020

Gentle Introduction to C : main() Function

Originally Posted on SeedBx


Welcome to the third part of the Gentle Introduction to C series. As said before this series aims to provide you with a brief introduction to C language.

In this part we are going to look at one of the most important function in C, the main() function.

 

main()

The main() function, as many like to say, is the entry point of a C program. It is the function that is executed at the beginning of any program. Intuitively you can also understand the main() function as an user-defined function.

 

Note : Technically the main function is not the actual entry point as the entry point is compiler depended.

 

At the basic level, the main() function is similar to a normal function, having its own return value, parameters and definition. But unlike any other function, the main function is called by the operating system ( all other functions are actually called by the main function or any other function). This happens primarily because whenever a program runs, the operating system has to pass the control of the computer over to the program. The main() function is just what the operating system looks for in a C program to pass its control to. So technically if any parameter that’s needed to be passed on to the main() function must be passed during runtime. Also as the main() function is the part that receives the operating system’s control it is required to be present in the program (some exceptions do occur).

 

Interesting Fact : Despite its importance, main is not a keyword in C.

 

Example -

      Code :

    #include<stdio.h>

    // Defining a main function with int return type

    int main()

    {

        printf("Hello you are in the main function");

        return 0;

     }

      Output : 

     Hello you are in the main function

 

Lastly main() function is an important component of almost every C program and writing a good main() function is a very crucial and important step.

 

Thanks for reading. 

As always your valuable comments and suggestions are welcome.

Also See : FunctionCall by Value and ReferenceRecursion

 

Saturday, June 13, 2020

Gentle Introduction to C : Function

Originally Posted on SeedBx


Welcome to the second part of the Gentle Introduction to C series. As said before this series aims to provide you a brief introduction to C language.

In this part we will look into the one of the basic components of C, namely functions and how using it can improve your programming experience.

Function

Function is a block of statements that performs a specific functionality. Mainly its aim is to divide a complex routine in hand to simpler subroutines which can then be later combined to perform the more complex routine.

There are basically two types of function :

  1. Standard-library functions
  2. User-defined functions

We will delve more into these types at a later part, but for now you can intuitively understand them as predefined and as the name suggests user-defined functions respectively.

Interesting Fact : C is the only programming language to have not lost its popularity even after such a long duration.


Given below are some basic components related to functions.

Common Terminologies

Function Name : It is the name/identity of the function which is used to call the function.

Parameters/Arguments : It is the list of data items (variables, pointers, references etc.) that the function take as input in its block of statements to either do computation on it or to facilitate the computation. The number of parameters can be any finite integer (possibly zero). The process of passing the parameters can be done in two ways : by value or by reference.

 Function Declaration : It is an introductory statement which is used to inform the compiler of the number of the parameters and the type of each individual parameter associated with the given function name. It also specifies about the type ( int, char, float, double, void)  of the return value of the function.

Function Definition : It is a sequence of statements that define the functionality of the function along with all details such as the names of parameter variables. It starts with a statement like a function declaration (except that it is now also compulsory to provide with parameter's name)  and followed by curly braces which contain the function in itself.

Return Value : The return value is a data item (possibly void) that the functions returns to the caller. The value is returned using the return keyword

Example : Let us consider a function add which will return the sum of the two parameters.

    int add(int,int);   ---------------------> Function Declaration

    /* some code */

    int add(int num1,int num2)  -------------> Function Definition

    {

        int sum;

        sum = num1 + num2;

        return sum;

    }

Note :

1.     The use of function declaration can be skipped if the function definition comes before the statement(s) in which it is called. However it is a good practice to include function declaration (and is a must when the calling statement(s) precedes the function definition).

2.     The return statement is optional in case of void return type.

3.     The parameters should be separated by commas.

4.      It is not necessary to provide with parameter name in the function declaration.

So you may be wondering why will you ever use a function in the first place. However there are following advantages which you shouldn't overlook :

1.     It improves readability and makes the code look cleaner.

2.     This saves you from writing a block of code multiple times

Although for now you might be wondering these advantages are not at all significant but as you will start designing bigger and complex code these advantages of functions will save you a lot of time while debugging. Also as C is a relatively low level programming language in comparison to C++ or python, functions becomes a powerful tool to do low level stuff like memory management etc.

 

There is still more to be said about functions but I will keep them for different parts.

 

Side Note : 

People also often classify functions on the basis of arguments and return type.

1.     Functions with no arguments and no return type.

2.     Functions with arguments and no return type.

3.     Functions with no arguments and return type.

4.     Functions with arguments and a return type.

 

Do write in comments what other advantages you see of using functions.

Your suggestions as always are welcome.




Wednesday, June 3, 2020

Gentle Introduction to C : Header Files

Originally Posted on SeedBx


Welcome to the first part of the Gentle Introduction to C series. This series will aim to give you a brief introduction to C language.

In this part we will look at one of the most important take-for-granted tool in C, namely the header files.

Header Files

A header file is a file that contains macros and function declarations. These header files are saved using a ‘.h’ extension and can easily be included in a file using the #include pre-processor directive. 

C provides with two ways to include a header file :

  1.     #include <filename.h>
  2.     #include “filename.h”

However it is a common convention to use the former for standard files and the latter for user - defined files.


Example : 

    #include <stdio.h> - Includes the standard stdio header file in the program

    #include "hello.h" - Includes the user-defined hello header file in the program


Including a header file copies all the header file’s contents to file during pre-processing time. And rightly so, header files should not be included more than once as it will copy the contents more than once, which will result in multi declaration(s) error during compile time.


Interesting Fact : Original Windows kernel was written in a dialect of C language along with an assembly language. Even now also some OS kernels are written in C.

Header file provides a way to link to standard file(s) which usually contains basic yet important functions (like input and output). The header file makes the code look cleaner and less cumbersome, hence enhancing readability. It also provides the user with optimized functions with very subtle implementational details which could help to reduce the running time of the program.

Examples of  Standard header files : 

stdio.h - Contains standard I/O operations

stdlib.h - General purpose header file containing functions like memory allocation, process control, conversions and others

and many more


Lastly, in my view, the header files are C’s way of implementing modern day concepts like modularity in the code.


Thank you for reading. Do leave your valuable comments.


Also See : extern Keyword