Cluster Ensembles

The whole is more than the sum of its parts.
- Aristotle

It is widely recognized that combining multiple classification or regression models typically yields better results than using a single, well-tuned model. However, there are no well-known approaches to combining multiple non-hierarchical clusterings. The idea of combining object partitionings without accessing the objects' original features leads us to a general knowledge reuse framework that we call cluster ensembles. Our contribution in this chapter is to formally define the cluster ensemble problem as an optimization problem in terms of mutual information and to propose three effective and efficient combiners (consensus functions) for solving it.

The combiners are designed with the relationship-based approach developed in this dissertation, because similarity can be used naturally in the label space to infer the relationships between clusters and/or objects. The first combiner induces a pairwise similarity measure between objects from the partitionings and then reclusters the objects. The second combiner creates multi-fold relationships (hyperedges) among the objects and repartitions them using hypergraph partitioning. The third combiner uses the similarity of labels to group clusters into meta-clusters; the collapsed meta-clusters then compete for each object to determine the combined clustering. We also compare the three approaches in a controlled experiment and propose a supra-consensus function that combines all three.

We present three situations where our combiners can be used as wrappers to integrate sets of groupings: robust centralized clustering, object-distributed clustering, and feature-distributed clustering. Results on synthetic as well as real web data sets show that cluster ensembles can (i) improve quality and robustness, (ii) enable distributed clustering, and (iii) speed up processing significantly with little loss in quality.
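To make two of these ideas concrete, the following is a minimal sketch, not the implementation used in this chapter. It shows (a) how a pairwise similarity between objects can be induced from label vectors alone, as the first combiner does, and (b) how a candidate consensus labeling can be scored by its average normalized mutual information with the input partitionings. All function names are hypothetical, and normalizing mutual information by the geometric mean of the two entropies is an assumption made here for illustration.

```python
from collections import Counter
from math import log, sqrt


def co_association(labelings):
    """Similarity of objects i and j = fraction of partitionings that
    place them in the same cluster. Uses only labels, never features."""
    n, r = len(labelings[0]), len(labelings)
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / r for c in row] for row in counts]


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (natural log) between two labelings."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nab / n) * log(nab * n / (ca[x] * cb[y]))
               for (x, y), nab in cab.items())


def nmi(a, b):
    """Mutual information normalized by sqrt(H(a) * H(b)).
    Assumption: return 0 when either labeling has a single cluster."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0.0 or hb == 0.0:
        return 0.0
    return mutual_information(a, b) / sqrt(ha * hb)


def average_nmi(candidate, labelings):
    """Objective sketch: mean NMI between a candidate and all inputs."""
    return sum(nmi(candidate, l) for l in labelings) / len(labelings)


# Three partitionings of four objects; the first two are identical
# up to a renaming of the cluster labels.
labelings = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
similarity = co_association(labelings)
good = average_nmi([0, 0, 1, 1], labelings)
poor = average_nmi([0, 1, 0, 1], labelings)
```

In this toy example, objects 0 and 1 share a cluster in every partitioning, so their induced similarity is 1, and the candidate that agrees with the majority of the inputs scores higher than the one that cuts across them. Because mutual information depends only on how labels co-occur, the score is unaffected by renaming cluster labels, which is why no label correspondence between partitionings has to be solved first.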

Alexander Strehl 2002-05-03