Sunday, July 7, 2013

Sampling

Given a stream of numbers, choose a sample of k numbers. This idea is very useful in data analysis applications where a random sample is taken for analysis from a very large stream.

In the case where the stream size n is small, a Knuth's shuffle can be performed. The required sample is the first k numbers in the generated permutation.

In many applications, the size of the stream is very large and continuously getting updated. The stream size is so large that the numbers do not fit into the memory. In these cases, a variation of Knuth's shuffle can be performed to obtain a random sample. The algorithm is outlined below:

Step 1: Create an array R of size k. Copy the first k elements (index 0 through k-1) of the stream S to R.

for(int i = 0; i < k; ++i) {
  R[i] = S[i]
}

S1
S2
...
Sk-2
Sk-1
S6
S7
S8
...
Sn

S1
S2
...
Sk--2
Sk-1

If the size of R is <= k, no further steps are necessary. We have a sample of size <= k.

Step 2: Starting with the element at index k of the input, perform the following:
for(int i = k; i < n; ++i) {
  Generate a random number j in the range (0, i)
  if(j < k) {
    swap(R[j], S[i])
  }
}
Return R

The invariant of the algorithm is that as we iterate through the input array, at all time we maintain R to be a random sample of size k from the input we have seen so far i.e. if we have seen n elements in S, the probability of each element being in R is n / k.
When we see the (n + 1)'th element, if we swap it with an element in R with a probability of (n + 1) / k then we will continue maintaining the invariant for the newly seen element as well as for the already selected elements.

The time complexity of this algorithm is O(n) and requires O(k) additional space.

Friday, June 14, 2013

Shuffle

A shuffle refers to a random permutation. It is widely used in casinos and online gambling. A shuffle must be unpredictable and produce each of the permutations with equal probability.

In 2010, Microsoft Windows 7 browser ballot was tested. The idea was to prompt the users to install a browser providing a list of five browsers to select from. The order of the browsers had to be completely random. Due to a flaw in the shuffle generation algorithm, the browsers were not listed in a random manner. The IE browser was shown in the last position 50% of the time and the Chrome browser was shown in one of the first three positions most of the time. (http://www.robweir.com/blog/2010/02/microsoft-random-browser-ballot.html)

Considering a deck of cards, there are 52! possible permutations of cards. All these permutations must appear with same probability. To get a perfect shuffle, the Knuth's shuffling algorithm can be used. Here is the outline of this algorithm:

void shuffle(int* cards, int n) { //cards: an initial or previous permutation, n: 52
  for(int i = (n-1); i >= 0; --i) {
    j = a random number between 0 and i
    swap(cards[i], cards[j])
  }
}

The algorithm starts filling the permutation in a reverse order. Consider the last position, i = 51. As no other cards are chosen yet, this position can be filled in 52 possible ways. Once the 51st card is fixed, the 50th position can be filled by choosing one of the remaining 51 cards. In this manner, the entire permutation can be filled in 52 * 51 * 50 * … * 1 = 52! ways. The time complexity of this algorithm is O(n).

The random number generator used to generate j is crucial for a generating perfect shuffle. The use of a psuedo random generator for generating j produces a biased permutation.

An alternative to Knuth's shuffle is to perform a sort based shuffling. The 52 cards are assigned numbers at random. The cards are then arranged in the sorted order of these random numbers. Care must be taken care that the random numbers are generated properly and no two cards are assigned the same random number. In cases where the random numbers are duplicated, the sorting algorithm results in a biased permutation. The complexity of this sort based shuffle is O(n logn).


The shuffle concept is not restricted to cards, it can be applied to shuffle a set of numbers as well.

Thursday, May 30, 2013

Check for balanced parenthesis in an expression

Given a sequence of parenthesis, check if the parenthesis are balanced and make a valid expression. For example, an expression { [ ] }is balanced and correct, where as an expression { [ } ] or { [ { } ] is not balanced.

There are three kinds of parenthesis - {}, [], (). One approach to solve this problem is to use a stack. The idea is that a close parenthesis is always preceded by an open parenthesis of the same type. When an open parenthesis is encountered, it is pushed onto the stack. When a close parenthesis is encountered, it should always pair up with the open parenthesis on the top of the stack to make it a valid expression. This pair is then discarded. When the entire expression is traversed, an empty stack confirms that the input expression is balanced and valid.

Consider an expression { [ ( ) ] ( ) } as an example. Following shows the step-by-step traversal: (the top of the stack is highlighted)

Input character = {. Push it onto the stack.



{

Input character = [. Push it onto the stack.


[
{

Input character = (. Push it onto the stack.

(
[
{

Input character = ). Check it with the character on the top of the stack, which is (. It makes a valid pair. Pop ( from the stack.


[
{

Input character = ]. Check it with the character on the top of the stack, which is [. It makes a valid pair. Pop [ from the stack. If the input character is a } instead of a ], it does not make a valid pair and hence the parenthesis expression is not balanced.



{

Input character = (. Push it onto the stack.


(
{

Input character = ). Check it with the character on the top of the stack, which is (. It makes a valid pair. Pop ( from the stack.



{
Input character = }. Check it with the character on the top of the stack, which is {. It makes a valid pair. Pop { from the stack.

The end of input is reached and the stack is empty. Hence, the given input expression is balanced.

Following is the C++ code:

bool IsBalanced(const string& expression_input) {
  stack<char> p;
  int i = 0;
  int expression_input_length = expression_input.length();
  while(i < expression_input_length) {
    char input = expression_input[i];
    if(input == '{' || input == '[' || input == '(') {
      p.push(input);
    } else if(input == '}' || input == ']' || input == ')') {
        if(p.empty()) { //stack is empty
          return false;
        }
        char stack_top = p.top();
        p.pop();
        //does not make a valid pair
        if((stack_top == '(' && input != ')') || (stack_top == '{' && input != '}') || 
           (stack_top == '[' && input != ']') ) {
          return false;
        }
    } else { //a character other than parenthesis is encountered
        cout << "Invalid input character encountered" << endl;
        return false;
    }
   ++i;
  }
  if(p.empty()) { //end of input is reached; stack is empty
    return true;
  }
  return false;
}

















Implementing a queue using two stacks

Implement the enqueue, dequeue and peek functionalities of a queue using two stacks.

A queue is a FIFO (first-in, first-out) data structure and a stack is a LIFO (last-in, first-out) data structure. Consider two stacks stack_A and stack_B. A queue can be implemented by using stack_A for the enqueue() operation and stack_B for the dequeue() and peek() operations. Consider the following example: (the top of the stack is highlighted)

enqueue(10): Push 10 onto the stack_A

stack_A
stack_B












10


enqueue(20): Push 20 onto the stack_A

stack_A
stack_B










20

10


enqueue(30): Push 30 onto the stack_A

stack_A
stack_B








30

20

10



Peek(): The element which was inserted first has to be returned without popping it. In order to do this, first transfer all the elements from stack_A to stack_B if stack_B is empty. Return the topmost element from the stack_B. If the stack_B is not empty, just return the topmost element.

In this example, first pop the elements from stack_A and push them onto stack_B. Return the topmost element from stack_B, which is 10.

stack_A
stack_B









10

20

30

enqueue(40): Push 40 onto the stack_A

stack_A
stack_B









10

20
40
30

Peek(): As stack_B is not empty, just return the topmost element from stack_B. In this case, it is 10.

dequeue(): The same logic as for the peek() operation holds true here. The difference here is the topmost element from the stack_B has to be popped out instead of returning it.


stack_A
stack_B











20
40
30


Following is the C++ code:
#include <algorithm>
#include <stack>

class QueueStack {
  stack<int> stack_A;
  stack<int> stack_B;
public:
  void enqueue(int item);
  void dequeue();
  int peek();
};

void QueueStack::enqueue(int item) {
  stack_A.push(item);
}

void QueueStack::dequeue() {
  if(stack_B.empty()) {
    if(stack_A.empty()) {
      cout << "The queue is empty" << endl;
      return;
    } else {
      while(!stack_A.empty()) {
        stack_B.push(stack_A.top());
        stack_A.pop();
      }
      cout << "Removing " << stack_B.top() << " from the queue" << endl;
      stack_B.pop();
    }
  } else {
      cout << "Removing " << stack_B.top() << " from the queue" << endl;
      stack_B.pop();
  }
}

int QueueStack::peek() {
  if(stack_B.empty()) {
    if(stack_A.empty()) {
      cout << "The queue is empty" << endl;
      return -1;
    } else {
      while(!stack_A.empty()) {
        stack_B.push(stack_A.top());
        stack_A.pop();
      }
      cout << "The topmost element is " << stack_B.top() << endl;
      return stack_B.top();
    }
  } else {
      cout << "The topmost element is " << stack_B.top() << endl;
      return stack_B.top();
  }
}


In our case, the enqueue() worked only on stack_A and dequeue() worked on both stack_A and stack_B. A modified implementation could be to update stack_A and stack_B after each enqueue() and work on stack_B for dequeue().

A sequence of n combined enqueues and dequeues take O(n) time.

Phone number to strings


Consider that a person enters a sequence of numbers on a telephone key pad. What are all the possible strings that correspond to the dialed number?

For example, consider the number 23, the possible strings are ad,ae,af,bd,be,bf,cd,ce,cf.

On a telephone keypad, there is mapping between the numbers and the letters associated. Here is the mapping. The numbers 0 and 1 are not associated with any letters.
0
0
1
1
2
a, b, c
3
d, e, f
4
g, h, I
5
j, k, l
6
m, n, o
7
p, q, r, s
8
t, u, v
9
w, x, y, z
Given a number sequence, we have to consider all possible letter combinations. We can build a recursive solution by considering one digit at a time. We stop when all the digits in the input number are explored. Here is the C++ code:

void PhoneNumbers(const string& num, const string& curr, int start) {
  const char Letters[][4] = {
    {'0'},
    {'1'},
    {'a', 'b', 'c'}, //letters starts from 2
    {'d', 'e', 'f'},
    {'g', 'h', 'i'},
    {'j', 'k', 'l'},
    {'m', 'n', 'o'},
    {'p', 'q', 'r', 's'},
    {'t', 'u', 'v'},
    {'w', 'x', 'y', 'z'}
  };

  if(start == num.length()) {
    cout << curr << endl;
    return;
  }
  int digit = num[start] - '0';
  for(int i = 0; i < 4; ++i) { //all possible letters for this digit
    char letter = Letters[digit][i];
    if (letter != 0) {
      PhoneNumbers(num,curr+letter,start+1); 
    } 
  }
}

For an input number 328, invoke the function as:
string n = "328";
PhoneNumbers(n,"",0);

The algorithm is exponential in the length of the input.

As an extension, a dictionary can be used. For an input of reasonable length, only some of the strings make valid English language words. A dictionary can be used to eliminate invalid words. A data structure, such as a hash set can be used. If the length of the input number is fixed, considering a set of dictionary words of that length is sufficient. Invalid words can also be pruned earlier in the enumeration using a dictionary stored in the form of a Trie to detect prefixes that do not occur in the dictionary.