Given
a stream of numbers, choose a sample of k numbers. This idea is very
useful in data analysis applications where a random sample is taken
for analysis from a very large stream.
In
the case where the stream size n is small, a Knuth's shuffle can be
performed. The required sample is the first k numbers in the
generated permutation.
In
many applications, the size of the stream is very large and
continuously getting updated. The stream size is so large that the
numbers do not fit into the memory. In these cases, a variation of
Knuth's shuffle can be performed to obtain a random sample. The
algorithm is outlined below:
Step
1: Create an array R of size k. Copy the first k elements (index 0
through k-1) of the stream S to R.
for(int
i = 0; i < k; ++i) {
R[i]
= S[i]
}
S1
|
S2
|
...
|
Sk-2
|
Sk-1
|
S6
|
S7
|
S8
|
...
|
Sn
|
S1
|
S2
|
...
|
Sk--2
|
Sk-1
|
If
the size of R is <= k, no further steps are necessary. We have a
sample of size <= k.
Step
2: Starting with the element at index k of the input, perform the
following:
for(int
i = k; i < n; ++i) {
Generate
a random number j in the range (0, i)
if(j
< k) {
swap(R[j],
S[i])
}
}
Return
R
The
invariant of the algorithm is that as we iterate through the input
array, at all time we maintain R to be a random sample of size k from
the input we have seen so far i.e. if we have seen n elements in S,
the probability of each element being in R is n / k.
When
we see the (n + 1)'th element, if we swap it with an element in R
with a probability of (n + 1) / k then we will continue maintaining
the invariant for the newly seen element as well as for the already
selected elements.
The
time complexity of this algorithm is O(n) and requires O(k)
additional space.