

Motivation

The notion of integrating multiple data sources and/or learned models is found in several disciplines, for example, the combining of estimators in econometrics [Gra89], of evidence in rule-based systems [Bar81], and of multi-sensor data [Das94]. A simple but effective type of such multi-learner system is the ensemble, wherein each component learner (typically a regressor or classifier) tries to solve the same task. Each learner may receive a somewhat different subset of the data for `training' or parameter estimation (as in bagging [Bre94] and boosting [DCJ$^+$94]), and may use different feature extractors on the same raw data. The system output is determined by combining the outputs of the individual learners via a variety of methods, including voting, (weighted) averaging, order statistics, the product rule, entropy, and stacking. In the past few years, a host of experimental results have shown that such ensembles provide statistically significant improvements in performance along with tighter confidence intervals, especially when the component models are `unstable' [Wol92,GBC92,Sha96,CS95]. Moreover, theoretical analyses have been developed for both regression [Per93,Has93] and classification [TG96] to estimate the gains achievable. Combining is an effective way of reducing model variance, and in certain situations it also reduces bias [Per93,TG99]. It works best when each learner is well trained but different learners generalize in different ways, i.e., when there is diversity in the ensemble [KV95,Die01].
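As a concrete illustration of two of these combining rules, the following minimal Python sketch shows majority voting over class labels and (weighted) averaging over real-valued estimates. All names in the sketch are illustrative, not from the original text.

    from collections import Counter

    def vote(predictions):
        # majority vote over the class labels emitted by several classifiers
        return Counter(predictions).most_common(1)[0][0]

    def weighted_average(estimates, weights=None):
        # (weighted) average of real-valued estimates from several regressors
        if weights is None:
            weights = [1.0 / len(estimates)] * len(estimates)
        return sum(w * e for w, e in zip(weights, estimates))

    print(vote(['a', 'b', 'a']))              # 'a'
    print(weighted_average([1.0, 2.0, 3.0]))  # 2.0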

In unsupervised learning, the clustering problem is concerned with partitioning a set of objects $ \mathcal{X}$ into $ k$ disjoint groups / clusters $ \mathcal{C}_1,\ldots,\mathcal{C}_k$ such that similar objects are in the same cluster while dissimilar objects are in different clusters. Clustering is often posed as an optimization problem by defining a suitable error criterion to be minimized. Using this formulation, many standard algorithms, such as $ k$-means and agglomerative clustering, have been established over the past forty years [Eve74,JD88]. Recent interest in the data mining community has led to a new breed of approaches, such as CLARANS, DBScan, BIRCH, CLIQUE, and WaveCluster [RS99], that are oriented towards clustering large data-sets kept in databases and that emphasize scalability.
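To make the error-criterion view concrete, here is a minimal sketch (assuming numeric, Euclidean data) of $ k$-means, which locally minimizes the sum of squared distances between each object and the center of its assigned cluster. It is an illustration only, not the exact formulation used later in this chapter.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        # X: (n, d) array of n objects; returns a label vector of length n
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # assignment step: each object goes to its nearest center
            labels = ((X[:, None, :] - centers) ** 2).sum(-1).argmin(axis=1)
            # update step: each center moves to the mean of its members
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if (labels == j).any() else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels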

Unlike for classification problems, there are no well-known approaches to combining multiple clusterings. We call the combination of multiple partitionings of the same underlying set of objects, without accessing the original features, the cluster ensemble design problem. Since the combiner no longer has access to the original features and only cluster labels are available, this is a framework for knowledge reuse [BG99]. This problem is more difficult than designing classifier ensembles, since cluster labels are symbolic and one must therefore also solve a correspondence problem. In addition, the number and shape of clusters provided by the individual solutions may vary based on the clustering method as well as on the particular view of the data presented to that method. Moreover, the desired number of clusters is often not known in advance. In fact, the `right' number of clusters in a data-set depends on the scale at which the data is inspected, and sometimes equally valid (but substantially different) answers can be obtained for the same data [CG96].
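The correspondence problem can be seen in a two-line example: the label vectors $ (1,1,2,2)$ and $ (2,2,1,1)$ describe the identical partition under different symbols. The Python sketch below (the function name is ours, not the thesis's) tests this by comparing the co-membership relations induced by two labelings, which are invariant to label permutation.

    import numpy as np

    def same_partition(lam1, lam2):
        # two label vectors describe the same partition iff they induce the
        # same co-membership relation; the symbols themselves are arbitrary
        lam1, lam2 = np.asarray(lam1), np.asarray(lam2)
        co1 = lam1[:, None] == lam1[None, :]
        co2 = lam2[:, None] == lam2[None, :]
        return bool((co1 == co2).all())

    print(same_partition([1, 1, 2, 2], [2, 2, 1, 1]))  # True: same partition
    print(same_partition([1, 1, 2, 2], [1, 2, 1, 2]))  # False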

A clusterer consists of a particular clustering algorithm with a specific view of the data. A clustering is the output of a clusterer and consists of the group labels for some or all objects (a labeling). Cluster ensembles provide a tool for consolidating the results from a portfolio of individual clusterers. They are useful in a variety of contexts: improving the quality and robustness of a clustering by combining several individual results, reusing the knowledge implicit in existing (legacy) clusterings, and clustering in distributed settings where the original features or algorithms are not available at a single central site.

Contributions. The idea of integrating independently computed clusterings into a combined clustering leads us to a general knowledge reuse framework that we call the cluster ensemble. Our contribution in this chapter is to formally define the design of a cluster ensemble as an optimization problem and to propose three effective and efficient frameworks for solving it. We show how any set of clusterings can be represented as a hypergraph. Using this hypergraph, we propose three methods for computing a combined clustering, based on similarity-based reclustering, hypergraph partitioning, and meta-clustering, respectively. We also describe applications of cluster ensembles for each of the three application scenarios described above and show results on real data.
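The hypergraph representation mentioned above treats every cluster in every labeling as one hyperedge over the objects. The sketch below is an illustration of that idea (the function name is ours): it stacks the binary membership indicator columns of all clusterings into a single $ n \times \sum_q k^{(q)}$ matrix.

    import numpy as np

    def to_hypergraph(labelings):
        # each column of H is one hyperedge: the indicator vector of one
        # cluster from one of the labelings in the profile
        blocks = []
        for lam in labelings:
            lam = np.asarray(lam)
            for c in np.unique(lam):
                blocks.append((lam == c).astype(int))
        return np.column_stack(blocks)

    H = to_hypergraph([[1, 1, 2, 2], [1, 2, 2, 3]])
    print(H.shape)  # (4, 5): 4 objects, 2 + 3 hyperedges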

Notation. Let $ \mathcal{X} = \{ x_1, x_2,\ldots,x_n \}$ denote a set of objects / samples / points. A partitioning of these $ n$ objects into $ k$ clusters can be represented as a set of $ k$ sets of objects $ \{ \mathcal{C}_{\ell} \,\vert\, \ell = 1,\ldots,k \}$ or as a label vector $ \mathbf{\lambda} \in \mathbb{N}^n$. We also refer to a clustering / labeling as a model. A clusterer $ \Phi$ is a function that delivers a label vector given an ordered set of objects. Figure 5.1 shows the basic setup of the cluster ensemble: a set of $ r$ labelings $ \mathbf{\lambda}^{(1,\ldots,r)}$ is combined into a single labeling $ \mathbf{\lambda}$ using a consensus function $ \Gamma$. We call the original set of labelings the profile, and the resulting single labeling the consensus clustering. Vector / matrix transposition is indicated with a superscript $ \dagger$. A superscript in brackets denotes an index, not an exponent.
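This notation can be instantiated in a few lines of Python. The toy consensus function below is only a crude stand-in used to illustrate the setup, not one of the three consensus functions proposed in this chapter: it links objects that co-cluster in more than half of the labelings in the profile and returns the connected components as the consensus labeling.

    import numpy as np

    def gamma(profile):
        # profile: list of r label vectors lambda^(q), each of length n
        lams = [np.asarray(lam) for lam in profile]
        n = len(lams[0])
        # S[i, j] = fraction of labelings placing objects i and j together
        S = sum((lam[:, None] == lam[None, :]).astype(float)
                for lam in lams) / len(lams)
        adj = S > 0.5
        labels = -np.ones(n, dtype=int)      # consensus label vector lambda
        cur = 0
        for i in range(n):                   # connected components of adj
            if labels[i] < 0:
                stack, labels[i] = [i], cur
                while stack:
                    j = stack.pop()
                    for m in np.nonzero(adj[j] & (labels < 0))[0]:
                        labels[m] = cur
                        stack.append(m)
                cur += 1
        return labels

    profile = [[0, 0, 1, 1, 2], [1, 1, 0, 0, 0], [0, 0, 1, 2, 2]]
    print(gamma(profile))  # [0 0 1 1 1]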

Figure 5.1: The cluster ensemble. A consensus function $ \Gamma$ combines clusterings $ \mathbf{\lambda}^{(q)}$ from a variety of sources, without resorting to the original object features $ \mathcal{X}$ or algorithms $ \Phi$.

Organization. In the next section, we introduce the cluster ensemble problem in more detail. In section 5.3, we propose and compare three possible combining schemes $ \Gamma$, and in section 5.4 we apply these schemes to both centralized and distributed clustering problems and present case studies on a variety of data-sets to highlight these applications.

