*I totally rewrote this post. Sorry for that, but the first version was a huge mistake. Likely the greatest blunder of my life. 🙂 (That statement isn’t true either: I have made plenty of mistakes in my life, and a bad article has no consequences as unpleasant as some of the others, so it cannot be the greatest. In other words, the previous statement is a mistake too.) I wanted to write a post about algorithm design, and I tried to construct different algorithms for a nice problem. It was a long article (about 1600 words), and I’m ashamed, because I constructed long and complicated algorithms for a very simple problem. There is a linear-time, constant-space algorithm for this problem, which is why I was completely on the wrong track. However, I won’t delete this article; I think it is quite educational.*

## Introduction

**People selection problem:** Suppose we have a very large database listing millions of people’s details, and the whole database is stored on magnetic tapes. We want to select a certain number of people at random, and we want to print their names on a printer. Give an efficient algorithm for this problem! We may assume that the memory of the computer is too small to store the names of all the people.

**Terminology and formalization:** The number of all people is N, and the number of selected people is K. Our job is to generate K unique random numbers that are greater than or equal to 0 and less than N. The generated random numbers must be sorted, because the magnetic tape storage is quite slow and we want a single sequential pass to find the names of the selected people. We assume that we have a small, fast algorithm which generates good (pseudo)random bits in **O(1)** (constant) time, and that it is already implemented on our computer, so we don’t need to deal with the generation of (pseudo)random numbers.
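As a concrete illustration of turning the assumed random-bit source into the `randomNumber(N)` calls used below, one standard technique is rejection sampling: take just enough bits to cover the range, and retry when the value falls outside it. This is my own sketch, not part of the original problem statement; `std::mt19937_64` merely stands in for the assumed O(1) random-bit generator.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: returns an unbiased random integer in [0, n),
// assuming n >= 1 fits in 64 bits. We take ceil(log2 n) random bits
// and reject out-of-range values; fewer than 2 draws are expected.
int64_t randomNumber(int64_t n, std::mt19937_64 &bits) {
    // Build a mask covering the smallest power of two >= n.
    uint64_t mask = 1;
    while (mask < (uint64_t)n) mask <<= 1;
    mask--;
    uint64_t v;
    do {
        v = bits() & mask;          // take just enough random bits
    } while (v >= (uint64_t)n);     // retry out-of-range draws
    return (int64_t)v;
}
```

Simply taking `bits() % n` would be biased whenever n does not divide 2^64, which is why the rejection loop is worth the (cheap) retries.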

**Constraints:** We work only with 8-, 16-, 32- and 64-bit integers. Of course we could emulate bigger integers (e.g. a 128-bit integer with two 64-bit integers), but that is discouraged. We guarantee that the total number of people can always be stored in 64 bits.

## Bad approaches

**Algorithm 1: the naive algorithm**

```cpp
std::unordered_set<int64_t> ht;     // a hash table to remove duplicates
std::vector<int64_t> numbers(K);
for (int64_t i = 0; i < K; i++) {
    int64_t cval = randomNumber(N);
    while (ht.count(cval)) {        // retry until the value is fresh
        cval = randomNumber(N);
    }
    ht.insert(cval);
    numbers[i] = cval;
}
std::sort(numbers.begin(), numbers.end());
```

Space complexity: O(K)

Average-case time complexity: O(K + K^2/(N-K) + K log K)
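A short sanity check of the retry term (my own derivation, not from the original post): when i values are already taken, a fresh draw succeeds with probability (N−i)/N, so the expected number of draws for the i-th value is N/(N−i). Summing over all K values:

```latex
\mathrm{E}[\text{draws}]
  = \sum_{i=0}^{K-1} \frac{N}{N-i}
  = \sum_{i=0}^{K-1} \left(1 + \frac{i}{N-i}\right)
  \le K + \frac{K^2}{N-K+1}
```

With O(1) expected-time hash operations this gives the first two terms; the final sort contributes the K log K.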

**Two-pass algorithms**

Okay, I designed two or three other algorithms, each of them a “two-pass” algorithm, but I don’t want to give the details, because these algorithms are big, slow, and inefficient. The basic idea of a two-pass algorithm: we divide the blocks into regions, and we create a map that says how many selected blocks fall into each region. In the first pass we generate the map; in the second pass we select the blocks within each region:

*Figure 1: The (bad) basic idea of two-pass algorithms*

The best two-pass algorithm I created used binary trees and subdivision; it managed the two passes at the same time, traversing a binary tree and storing only a single path in memory (not the whole tree). Its time complexity was O(N) and its space complexity was O(log N). But I introduced quite a few bad algorithms before the best one, and those algorithms were just a waste of time.

*Figure 2: My best experimental algorithm, the two-in-one-pass algorithm*

## The optimal solution

The answer is simple. Just open Volume 2 of Donald Knuth’s classic series, The Art of Computer Programming. Yes, it is that simple. I once knew a girl who was quite superstitious and believed that her tarot cards had magical power. She told me that if she drew a tarot card from the deck, she would draw exactly the card that the deck willed. Sometimes I think this book has magical power too, and it is able to show you the solution to your problem if it so wills.

So, if you take Volume 2, the book will open at Chapter 3, Section 3.4.2 (Random Sampling and Shuffling), on page 142 (if you have the Third Edition). By the way, it is a famous section: it contains a short description of the Knuth shuffle, which has become popular in certain circles. In our case, the first two algorithms in the section are the more important ones: Algorithm S (the selection sampling technique) and Algorithm R (reservoir sampling). Selection sampling is *the perfect solution* to the People Selection Problem: it solves it in O(N) time and O(1) space (assuming it is possible to print the names of the selected people directly to the printer, without storing them in an array in memory).
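The core of Algorithm S is easy to state: scan the records in order, and select record i with probability (number still needed)/(number still unscanned). Below is a minimal sketch in my own C++ rendering (not Knuth’s exact pseudocode); `std::mt19937_64` stands in for the assumed O(1) random-bit source, and the output vector stands in for printing the names.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Selection sampling (in the spirit of Knuth's Algorithm S): one pass over
// records 0..N-1, selecting record i with probability needed/(N - i).
// Exactly K indices are produced, already in increasing order, using O(1)
// extra space beyond the output itself.
std::vector<int64_t> selectionSample(int64_t N, int64_t K,
                                     std::mt19937_64 &rng) {
    std::vector<int64_t> out;
    out.reserve(K);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    int64_t needed = K;                 // how many we still have to select
    for (int64_t i = 0; i < N && needed > 0; i++) {
        // Select i with probability needed / (N - i).
        if (u(rng) * (double)(N - i) < (double)needed) {
            out.push_back(i);           // "print the name" in the post's setting
            needed--;
        }
    }
    return out;
}
```

Because the records are visited in order, the selected indices come out sorted automatically, which is exactly what the single sequential tape read requires.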

(Note: When I uploaded the picture of figure 2, I started to feel that there is a nice algorithm for generation of a random subset of a bigger set (the size of the bigger set is N, the size of the smaller set is K), which has O(K) time complexity and O(log K + K) = O(K) space complexity. It is just a feeling and I don’t know, whether it is true or not.)