Next: The Cluster Ensemble Problem Up: Cluster Ensembles Previous: Cluster Ensembles Contents

Motivation

The notion of integrating multiple data sources and/or learned models is found in several disciplines, for example, the combining of estimators in econometrics [Gra89], evidences in rule-based systems [Bar81] and multi-sensor data fusion [Das94]. A simple but effective type of such multi-learner systems are ensembles, wherein each component learner (typically a regressor or classifier) tries to solve the same task. Each learner may receive somewhat different subsets of the data for `training' or parameter estimation (as in bagging [Bre94] and boosting [DCJ$^+$94]), and may use different feature extractors on the same raw data. The system output is determined by combining the outputs of the individual learners via a variety of methods including voting, (weighted) averaging, order statistics, product rule, entropy, and stacking. In the past few years, a host of experimental results have shown that such ensembles provide statistically significant improvements in performance along with tighter confidence intervals, especially when the component models are `unstable' [Wol92,GBC92,Sha96,CS95]. Moreover, theoretical analysis has been developed for both regression [Per93,Has93] and classification [TG96] to estimate the gains achievable. Combining is an effective way of reducing model variance, and in certain situations it also reduces bias [Per93,TG99]. It works best when each learner is well trained, but different learners generalize in different ways, i.e., there is diversity in the ensemble [KV95,Die01].

In unsupervised learning, the clustering problem is concerned with partitioning a set of objects $\mathcal{X}$ into disjoint^5.2groups / clusters $\mathcal{C}_1,\ldots,\mathcal{C}_k$ such that similar objects are in the same cluster while dissimilar objects are in different clusters. Clustering is often posed as an optimization problem by defining a suitable error criterion to be minimized. Using this formulation, many standard algorithms such as -means and agglomerative clustering have been established over the past forty years [Eve74,JD88]. Recent interest in the data mining community has lead to a new breed of approaches such as CLARANS, DBScan, BIRCH, CLIQUE, and WaveCluster [RS99], that are oriented towards clustering large data-sets kept in databases and emphasize scalability.

Unlike classification problems, there are no well known approaches to combining multiple clusterings. We call the combination of multiple partitionings of the same underlying set of objects without accessing the original features as the cluster ensemble design problem. Since the combiner has no more access to the original features and only cluster labels are available, this is a framework for knowledge reuse [BG99]. This problem is more difficult than designing classifier ensembles since cluster labels are symbolic and so one must also solve a correspondence problem. In addition, the number and shape of clusters provided by the individual solutions may vary based on the clustering method as well as on the particular view of the data presented to that method. Moreover, the desired number of clusters is often not known in advance. In fact, the `right' number of clusters in a data-set depends on the scale at which the data is inspected, and sometimes, equally valid (but substantially different) answers can be obtained for the same data [CG96].

A clusterer consists of a particular clustering algorithm with a specific view of the data. A clustering is the output of a clusterer and consists of the group labels for some or all objects (a labeling). Cluster ensembles provide a tool for consolidation of results from a portfolio of individual clustering results. It is useful in a variety of contexts:

Quality and Robustness. Combining several clusterings can lead to improved quality and robustness of results. For classification or regression problems, it has been shown the gains from using ensemble methods are directly related to the amount of diversity among the individual component models [KV95,TG99]. One desires that each individual model be good, but at the same time, these models should have different inductive biases and thus generalize in distinct ways [Die01]. No wonder ensembles are most popular for integrating relatively unstable models such as decision trees and multi-layered perceptrons. In the clustering context, diversity can be created in numerous ways, including:
- using different features to represent the objects. For example, images can be represented by their pixels, histograms, location and parameters of perceptual primitives, 3D scene coordinates, etc.
- varying the number and/or location of initial cluster centers in iterative algorithms such as -means,
- varying the order of data presentation in on-line methods such as BIRCH,
- using a portfolio of very different clustering algorithms e.g. using density based, -means or soft variants such as fuzzy c-means, graph partitioning based, statistical mechanics based, etc.
It is well known that no clustering method is perfect, and the performance of a given method can vary significantly across data-sets. For example, the popular -means algorithm that performs respectably in several applications, has an underlying assumption that the data (after appropriate normalization such as using Mahalanobis distance) is generated by a mixture of Gaussians, each having identical, isotropic covariance matrices. If the actual data distribution is very different from this generative model, then, even with the `correct' guess for , its performance can be worse than a random partitioning, specially in sparse, high dimensional spaces (see chapter 3). In fact, for difficult data-sets, comparative studies across multiple clustering algorithms typically show much more variability in results than comparative studies on multiple classification methods, where any good approximator of Bayesian a posteriori class probabilities will provide comparable results [RL91]. Thus there should be a potential for greater gains when using an ensemble for clustering problems.
A different but related motivation for using a cluster ensemble is to build a robust clustering portfolio that can perform well over a wide range of data-sets with little hand-tuning. For example, by using an ensemble that includes approaches suitable for low-dimensional, metric spaces (e.g., -means, SOM [Koh95], DBSCAN [EKSX96]) as well as algorithms tailored for high-dimensional, sparse spaces (e.g., spherical -means [DM01], and Jaccard-based graph partitioning (e.g., see chapter 3), one may perform very well in 3D as well as in 30000D spaces without having to switch models. This characteristic is very attractive in practical settings.
Knowledge Reuse. Another important consideration is the reuse of existing clusterings. In several applications, a variety of clusterings for the objects under consideration may already exist. For example, on the web, pages are categorized e.g., by Yahoo! (according to a manually-crafted taxonomy), by your Internet service provider (according to request patterns and frequencies) and by your personal bookmarks (according to your preferences). In grouping customers for a direct marketing campaign, various legacy customer segmentations might already exist, based on demographics, credit rating, geographical region, or income tax bracket. In addition, customers can be clustered based on their purchasing history (e.g., market-baskets). Can we reuse such pre-existing knowledge to create a single consolidated clustering? Knowledge reuse in this context means that we exploit the information in the provided cluster labels without going back to the original features or the algorithms that were used to create the clusterings.
Distributed Computing. The ability to deal with clustering in a distributed fashion is useful to improve scalability, security, and reliability. Also, real applications often involve distributed databases due to organizational or operational constraints [KC00]. One can argue that by transferring all data to a single warehouse and performing a series of merges and joins, one can get a single (albeit very large), flat file. A traditional algorithm can be used after randomizing and subsampling this file. But in real applications this approach may not be feasible because of the computational, bandwidth and storage costs. In certain cases, it may not even be possible for a variety of practical reasons including security, privacy, proprietary nature of data, need for fault tolerant distribution of data and services, real-time processing requirements, statutory constraints imposed by law, etc. [PCS00].
A cluster ensemble can combine individual results from the distributed computing entities, under two scenarios:
- Object-distributed data. Each processor / clusterer has access to only a subset of all objects, and can thus only cluster the observed objects. For example, corporations tend to split their customers regionally for more efficient management. Analysis such as clustering is often performed locally, and a cluster ensemble provides a way of obtaining a holistic analysis without complete integration of the local data warehouses. Note that in order to successfully combine clusterings, the subsets of objects observed by the individual clusterers must have some overlap.
- Feature-distributed data. In this scenario, each processor / clusterer sees only a limited number of features or attributes of each object, i.e. it observes a particular aspect or view of the data. Aspects can be completely disjoint features or have partial overlaps.

Contributions. The idea of integrating independently computed clusterings into a combined clustering leads us to a general knowledge reuse framework that we call the cluster ensemble. Our contribution in this chapter is to formally define the design of a cluster ensemble as an optimization problem and to propose three effective and efficient frameworks for solving it. We show how any set of clusterings can be represented as a hypergraph. Using this hypergraph, three methods are proposed for computing a combined clustering based on similarity-based reclustering, hypergraph partitioning, and meta-clustering, respectively. We also describe applications of cluster ensembles for each of the three application scenarios described above and show results on real data.

Notation. Let $\mathcal{X} = \{ x_1, x_2,\ldots,x_n \}$ denote a set of objects / samples / points. A partitioning of these objects into clusters can be represented as a set of sets of objects $\{ \mathcal{C}_{\ell} \vert \ell = 1,\ldots,k \}$ or as a label vector $\mathbf{\lambda} \in \mathbb{N}^n$ . We also refer to a clustering / labeling as a model. A clusterer $\Phi$ is a function that delivers a label vector given an ordered set of objects. Figure 5.1 shows the basic setup of the cluster ensemble: A set of labelings $\mathbf{\lambda}^{(1,\ldots,r)}$ is combined into a single labeling $\mathbf{\lambda}$ using a consensus function $\Gamma$ . We call the original set of labelings, the profile and the resulting single labeling, the consensus clustering. Vector / matrix transposition is indicated with a superscript $\dagger$ . A superscript in brackets denotes an index and not an exponent.

**Figure 5.1:** The cluster ensemble. A consensus function $\Gamma$ combines clusterings $\mathbf{\lambda}^{(q)}$ from a variety of sources, without resorting to the original object features $\mathcal{X}$ or algorithms $\Phi$ .
$\resizebox{.7\textwidth}{!}{\includegraphics*{eps/absproc_consens_bw}}$

Organization. In the next section, we will introduce the cluster ensemble problem in more detail and in section 5.3 we propose and compare three possible combining schemes, $\Gamma$ . In section 5.4 we will apply these schemes to both centralized and distributed clustering problems and present case studies on a variety of data-sets to highlight these applications.

Next: The Cluster Ensemble Problem Up: Cluster Ensembles Previous: Cluster Ensembles Contents

Alexander Strehl 2002-05-03