A Fast Implementation of the ISODATA Clustering Algorithm

Technical Overview:

This is a collection of C++ programs that implement the popular clustering algorithm known as ISODATA. (See Algorithms for Clustering Data by A. K. Jain and R. C. Dubes, Prentice Hall, 1988). Intuitively, the algorithm tries to find the best set of cluster centers for a given set of points in d-dimensional space through an iterative approach until some maximum number of iterations are performed. It uses a number of different heuristics to determine when to merge or split clusters.

At a high level, in each iteration of the algorithm the following takes place: points are assigned to their closest cluster centers, cluster centers are updated to be the centroid of their associated points, clusters with very few points are deleted, large clusters satisfying some heuristics are split, and small clusters satisfying other heuristics are merged. The algorithm continues until maximum number of iterations are performed. Here we go over the algorithm in more detail. See the related publications below for further information.

The code is written in C++ and has been successfully compiled and run on a SUN Blade 100 running Solaris 2.8, using the g++ compiler (version 2.95.3). After downloading and unzipping each version, type "make" to compile the program into the executable called "IsoClus". To run the code, type "IsoClus < filename" where "filename" is the name of the input file containing running time parameters. A sample input file listing all directives is available here.

Various implementations of the algorithm as mentioned in the paper below are available below for download. These are described below.

Brute-Force Implementations:

For the purposes of comparison against our efficient implementations, we have provided two brute-force implementations. These two implementations are both combined in one distribution and are described below.

Standard Implementation:

A straightforward implementation of the ISODATA algorithm. (Specifically this is a variant called ISOCLUS which is described in the publication below and here).

Hybrid Implementation:

A modification of the standard version using squared Euclidean distances and the square root of average squared Euclidean distances when computing cluster dispersion (in Steps 5 and 8, respectively of the algorithm).

To activate this, include the option "SQUARED 1" in the input file.

Download Brute-Force Software

Filtering Implementations:

These implementations use a more efficient approach, called filtering. In assigning each point to its nearest cluster center, this method employs a kd-tree data structure for storing the points, which permits many points to be dealt with in a aggregat manner. These two implementations are both combined in one distribution and are described below.

Filtering (Exact)

This implementation is functionally equivalent to the Hybrid version, but uses filtering.

Approximate Filtering

A modification to Filtering version in which, instead of performing filtering exactly it allows for an approximation. Given a value ε > 0, the algorithm may assign a point to a center whose distance is at most (1 + ε ) greater than the distance to its true nearest center.

To activate this, include the option "epsilon xx" in the input file, where "xx" is the value of ε.

Download Filtering Software

Conditions of Use:

This code is released under the terms of the GNU General Public License.

Disclaimer: The University of Maryland and NASA and the authors make no representations about the suitability or fitness of this software for any purpose. This software is provided "as is" without any warranty of any kind, either express, implied, or statutory, including, but not limited to, any warranty that the software will conform to specification, any implied warranties of merchantability, fitness for a particular purpose, and freedom from infringement, and any warranty that the documentation will conform to the program, or any warranty that the software will be error free.

Publications

Abstract: Clustering is central to many image processing and remote sensing applications. ISODATA is one of the most popular and widely used clustering methods in geoscience applications, but it can run slowly, particularly with large data sets. We present a more efficient approach to ISODATA clustering, which achieves better running times by storing the points in a kd-tree and through a modification of the way in which the algorithm estimates the dispersion of each cluster. We also present an approximate version of the algorithm which allows the user to further improve the running time, at the expense of lower fidelity in computing the nearest cluster center to each point. We provide both empirical and theoretical justification that our modified approach produces clusterings that are very similar to those produced by the standard ISODATA approach. We also provide empirical studies on both synthetic images and remotely sensed Landsat and Modis images that show that our approach has significantly lower running times.

Preliminary Version:: N. Memarsadeghi, D. M. Mount, N. S. Netanyahu, and J. Le Moigne,
A Fast Implementation of the ISOCLUS Algorithm,
Proc. Intl. Geosci. and Remote Sensing Symp. (IGARSS'03), Toulouse, France, 2003, Vol. III, 2057-2059.
Expanded Version:: N. Memarsadeghi, D. M. Mount, N. S. Netanyahu, and J. Le Moigne,
A Fast Implementation of the ISODATA Clustering Algorithm
International Journal of Computational Geometry and Applications, 17 (2007), 71-103.

Summary of Results

Landsat image (bands 3,4,5), and classifications using the standard and filtering algorithms.

CPU Times and Speed-ups for the various algorithms. (Click on images for full resolution.)

CPU Running Times vs Dimension for various algorithms. (Click on image for full resolution.)

Distortion error of approximate filtering version with respect to the standard version.

This is a joint collaboration between University of Maryland and NASA GSFC.

To Dave Mount's home page.

To Nargess Memarsadeghi's home page.

Last updated November 11, 2005.