## Freitag, April 25, 2014

### Finding duplicate elements in an array

This is a story about the power of randomness and universal hashing.

Suppose you are given an array of elements of some fixed type. How would you go about figuring out whether two elements of the array are equal?

At the time that I am writing this post, searching for its title leads almost universally to discussions about an interview-type puzzle question where you are given an array of size `N` that contains integers in the range from `0` to `N-1`. Indeed, clever solutions to this problem are proposed. Most of them amount to the same kind of trickery that radix sort employs to get a linear-time sorting algorithm. I hope the reader agrees that this is cute, but of limited use.

The most straightforward efficient solution to this problem that applies to general types is to sort the array using an arbitrary sorting function. Then a simple linear scan suffices to answer the question. The overall running time of this approach is `O(N log N)`, and Misra and Gries showed in the 1980s that no comparison-based algorithm can be asymptotically faster.

Let me present their argument, which should feel familiar to anybody who has seen the proof that comparison-based sorting requires time `Omega(n log n)` (the result in the paper is more general).

Every deterministic algorithm for finding duplicates in an array using comparisons can be thought of as a family of infinitely many ternary decision trees, one tree for every array size. For any given array size, the algorithm starts at the root node of the tree for that size, comparing the elements at the indices given by that node. Depending on how they compare, the algorithm moves down in the tree. This process continues until a leaf node is reached which is labelled either YES (duplicate elements have been found) or NO (all elements are unique). Here is a decision tree for an array of size 3: We will argue that every decision tree for arrays of size `N` must have more than `N!` leaves. Via Stirling's approximation, this implies that any such tree has depth at least `Omega(N log N)`, which in turn implies the same bound for the worst-case running time. In fact, the average depth of leaves is `Omega(N log N)`. This means that even a randomized algorithm, which we can think of as choosing one of many decision trees at random, has an expected worst-case running time of at least `Omega(N log N)`.

Let us fix a decision tree `T` for arrays of size `N`. There are `N!` different permutations of size `N`. We can think of such a permutation as an array of size `N` filled with the integers from `1` to `N` without duplicates. Let `P1` and `P2` be two different such permutations/arrays, and define the array `Q` as
```Q[i] := min(P1[i], P2[i])
```
The array `Q` contains duplicate entries. In fact, let `m` be the smallest number that appears at different positions in `P1` and `P2`. Then `m` appears twice in `Q`.

As a consequence, the computation for `Q` must end in a YES-leaf of `T`, while `P1` and `P2` must lead to NO-leaves. What if `P1` and `P2` lead to the same NO-leaf `N`? Let `V` be a node on the path to `N`, and let us say that `i` and `j` are the array indices that are compared at the node `V`. Then the comparison `P1[i]` vs. `P1[j]` has the same result as the comparison `P2[i]` vs. `P2[j]`. But then, by the definition of `Q`, the comparison `Q[i]` vs. `Q[j]` also has the same result. This means that the computation for `Q` must also follow the same path and end in a NO-leaf! That is clearly a contradiction, and so we can conclude that the computations for `P1` and `P2` must end in different NO-leaves. Hence, the decision tree `T` must have at least `N!` NO-leaves, and this completes the proof.

So, to summarize up to this point: As long as we can only compare array elements, no algorithm can beat the running time of `O(N log N)` that a simple sort followed by a linear scan offers. However, this isn't the end of the story.

Suppose you have a universal family of hash functions for the type of elements in the array (and indeed, such families are not too difficult to construct). That is, suppose you can randomly choose a hash function `h:U -> [m]`, with `m` within a constant factor of `N`, such that for different array elements `A[i]` and `A[j]` you know that the probability of a hash collision is low:
```Pr[h(A[i]) = h(A[j])] <= 1/N
```
Note that if the guarantee is slightly weaker, e.g. a bound of `2/N` on the probability of a collision, the following argument will still work with only minor modifications to the constants in the computations.

We will now follow a simple strategy: Build a hash table of the elements of `A` according to `h`. Since equal elements will be hashed into the same bucket, it is then sufficient to check for duplicate elements within each bucket by sorting and a linear scan, or by an even simpler quadratic check.

What is the running time of this approach? If `b(i)` is the number of array elements mapped into the `i`-th bucket, then even the running time of the simple quadratic check can be bounded by
```sum(b(i)2, i=1..m),
```
which is exactly the number of hash collision pairs:
```sum(b(i)2, i=1..m) = #{ (i,j) : h(A[i]) = h(A[j]) } = N + #{ (i,j) : i != j, h(A[i]) = h(A[j]) }
```
The expected cardinality of the last part is bounded by the number of pairs `(i,j)` with `i != j` times our bound on the probability that a pair results in a hash collision. This product is `N-1`. That is, the expected running time of the algorithm we have outlined is linear.

Observe that we must choose the hash function randomly for this approach to work. If we were to use a fixed, deterministic hash function, then an adversary could construct an array in which all elements hash to the same bucket (using the pigeonhole principle and assuming that the universe `U` of possible array elements is large enough). We would be back in the case where only comparisons are allowed, and hence a linear running time is impossible.

So we see that the combination of hashing and randomness, in the form of universal hashing, allows us to solve the original problem in a running time that is asymptotically better than what is achievable using deterministic algorithms.