This is a collection of C++ procedures for performing k-means
clustering based on a combination of local search and Lloyd's
algorithm (also known as the k-means algorithm). Given any set
of k centers Z, for each center z in Z, let V(z) denote its
neighborhood, that is, the set of data points for which z
is the nearest neighbor. Each stage of Lloyd's algorithm moves
every center point z to the centroid of V(z) and then updates
V(z) by recomputing the distance from each point to its nearest
center. These steps are repeated until convergence.
However, Lloyd's algorithm can get stuck in locally minimal
solutions that are far from the optimal. For this reason it is
common to consider heuristics based on local search, in
which centers are swapped in and out of an existing solution
(typically at random). Such a swap is accepted if it decreases
the average distortion, and otherwise it is ignored. It is also
possible to combine these two approaches (Lloyd's algorithm and
local search), producing a type of hybrid solution.
This program provides a number of different algorithms for doing
k-means clustering based on these ideas and combinations.
Further information can be found in the software documentation and the above
research papers.
There are four different procedures for performing k-means,
which have been implemented here. The main issue is how the
neighbors are computed for each center.
- Lloyd's:
-
Repeatedly applies Lloyd's algorithm with randomly sampled
starting points.
- Swap:
-
A local search heuristic, which works by performing swaps
between existing centers and a set of candidate centers.
- EZ_Hybrid:
-
A simple hybrid algorithm, which does one swap
followed by some number of iterations of Lloyd's.
- Hybrid:
-
A more complex hybrid of Lloyd's and Swap, which performs
some number of swaps followed by some number of iterations
of Lloyd's algorithm. To avoid getting trapped in local
minima, an approach similar to simulated annealing is included
as well.
|