This section gives a brief overview of previous work on general-purpose
clustering algorithms. The most popular and best understood clustering
algorithm is $k$-means. A robust variant, $k$-medoids, uses the medoid
instead of the mean to ensure that a cluster's representative is an
actually observed sample.
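For concreteness, $k$-means seeks a partition of the samples
$x_1, \dots, x_n$ into clusters $C_1, \dots, C_k$ that minimizes the
within-cluster scatter
\[
\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 ,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i ,
\]
whereas $k$-medoids replaces the centroid $\mu_j$ by the cluster member
with the smallest total dissimilarity to the other members of $C_j$.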
In agglomerative nearest-neighbor clustering, the $n$ singleton
clusters are successively merged into a tree structure in $n-1$ steps.
In each step the closest pair, the nearest neighbors, are joined.
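The following minimal sketch of the agglomerative procedure uses single
linkage as the cluster distance; the names agglomerate and dissim are
ours for illustration, and dissim may be any pairwise dissimilarity:

def agglomerate(samples, dissim):
    # Start with n singleton clusters, each a list of sample indices.
    clusters = [[i] for i in range(len(samples))]
    tree = list(clusters)  # records every merge, i.e., the dendrogram
    while len(clusters) > 1:
        # Find the closest pair of clusters (single linkage: the
        # smallest sample-to-sample dissimilarity between them).
        a, b = min(
            ((c1, c2) for i, c1 in enumerate(clusters)
                      for c2 in clusters[i + 1:]),
            key=lambda pair: min(dissim(samples[p], samples[q])
                                 for p in pair[0] for q in pair[1]),
        )
        clusters.remove(a)
        clusters.remove(b)
        merged = a + b          # join the nearest neighbors
        clusters.append(merged)
        tree.append(merged)     # one merge per step: n - 1 steps total
    return tree

With dissim(x, y) = abs(x - y), for example,
agglomerate([1.0, 1.1, 5.0], dissim) joins the two nearby samples first
and attaches the outlier in the final step.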
All three algorithms are generally intended for use with metric
distances. However, especially in our high-dimensional applications, it
may be useful to replace the metric distance with a more general
similarity measure. Also, artificial neural systems and
dimensionality-reducing techniques have been successfully applied to
clustering.
In statistical pattern recognition, the data is modeled as a mixture of
parametric density functions, which yields a theoretically optimal
treatment of the problem. However, the appropriate parametric families
are often not known in advance, and the growing number of parameters
raises estimation-error (variance) issues.
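In this framework the density of the data is written as a convex
combination of $k$ component densities,
\[
p(x) = \sum_{j=1}^{k} \pi_j \, p(x \mid \theta_j),
\qquad
\sum_{j=1}^{k} \pi_j = 1, \quad \pi_j \ge 0 ,
\]
where each component corresponds to one cluster and the parameters
$(\pi_j, \theta_j)$ are typically fitted by maximum likelihood, for
instance with the EM algorithm.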
Recently, the database community has proposed a variety of highly
scalable approaches for large data sets. However, the underlying models
are very simple (usually Euclidean/Gaussian) and may not fit the
properties of high-dimensional data. The graph metaphor provides a
well-known framework for posing the clustering problem. Efficient
graph-partitioning algorithms exist, but their theoretical properties
with respect to clustering, as well as experimental evaluations, still
need to be explored.
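As one classical instance of the graph view (a sketch under our own
naming, assuming the networkx library; it is not an algorithm from the
work discussed above): treating samples as vertices of a weighted
complete graph, clusters can be obtained by building a minimum spanning
tree and deleting its heaviest edges.

import networkx as nx  # assumed third-party dependency

def mst_clusters(samples, dissim, k):
    # Build the complete graph with pairwise dissimilarities as weights.
    g = nx.Graph()
    n = len(samples)
    for i in range(n):
        for j in range(i + 1, n):
            g.add_edge(i, j, weight=dissim(samples[i], samples[j]))
    # A spanning tree has n - 1 edges; deleting the k - 1 heaviest
    # ones leaves exactly k connected components.
    mst = nx.minimum_spanning_tree(g)
    heavy = sorted(mst.edges(data=True),
                   key=lambda e: e[2]["weight"], reverse=True)[:k - 1]
    mst.remove_edges_from((u, v) for u, v, _ in heavy)
    return [sorted(c) for c in nx.connected_components(mst)]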