Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
ABSTRACT
The nearest neighbor problem is the following: Given a set P of n points
in some metric space X, preprocess P so as to efficiently answer queries
which require finding the point in P closest to a query point q in X. This
fundamental geometric problem is of growing importance as an abstraction
of similarity search in areas such as information retrieval and image
processing. We will focus on the particularly interesting case of the
d-dimensional Euclidean space. Despite decades of effort, the current
solutions are far from satisfactory; in fact, for large d, in theory or in
practice, they provide little improvement over the brute-force algorithm
which compares the query point to each data point. Of late, there has been
some interest in the approximate nearest neighbors problem where we are
only required to produce a point p' whose distance from the query q is
within a factor (1+e) of the distance between q and its true nearest neighbor.
We will present two algorithmic results for the approximate version that
significantly improve the known bounds: (a) preprocessing cost polynomial
in n and d, and a truly sublinear query time (for e>1); and, (b) query
time polynomial in d and log n, and only a mildly exponential preprocessing
cost O(n/e^d). Further, applying a classical geometric lemma on random
projections (for which we give a simpler proof), we obtain the first known
algorithm with polynomial preprocessing and query time polynomial in d and
log n. Unfortunately, for small e, the latter is a purely theoretical result
since the exponent depends on 1/e. On the other hand, experimental results
indicate that our first algorithm offers extremely fast running times on real
data sets. Its key ingredient is the notion of locality-sensitive hashing
which may be of independent interest; in fact, we will give applications to
information retrieval, pattern recognition, dynamic closest-pairs, and fast
clustering algorithms.
Time permitting, we will also touch upon a new abstraction of similarity
search that leads to an interesting generalization of the nearest neighbors
problem.
(Joint work with Piotr Indyk.)