Sunday, July 7, 2013

Sampling

Given a stream of numbers, choose a sample of k numbers. This idea is very useful in data analysis applications where a random sample is taken for analysis from a very large stream.

In the case where the stream size n is small, a Knuth's shuffle can be performed. The required sample is the first k numbers in the generated permutation.

In many applications, the size of the stream is very large and continuously getting updated. The stream size is so large that the numbers do not fit into the memory. In these cases, a variation of Knuth's shuffle can be performed to obtain a random sample. The algorithm is outlined below:

Step 1: Create an array R of size k. Copy the first k elements (index 0 through k-1) of the stream S to R.

for(int i = 0; i < k; ++i) {
  R[i] = S[i]
}

S1
S2
...
Sk-2
Sk-1
S6
S7
S8
...
Sn

S1
S2
...
Sk--2
Sk-1

If the size of R is <= k, no further steps are necessary. We have a sample of size <= k.

Step 2: Starting with the element at index k of the input, perform the following:
for(int i = k; i < n; ++i) {
  Generate a random number j in the range (0, i)
  if(j < k) {
    swap(R[j], S[i])
  }
}
Return R

The invariant of the algorithm is that as we iterate through the input array, at all time we maintain R to be a random sample of size k from the input we have seen so far i.e. if we have seen n elements in S, the probability of each element being in R is n / k.
When we see the (n + 1)'th element, if we swap it with an element in R with a probability of (n + 1) / k then we will continue maintaining the invariant for the newly seen element as well as for the already selected elements.

The time complexity of this algorithm is O(n) and requires O(k) additional space.