

Objective Function for Cluster Ensembles

Given $r$ groupings, with the $q$-th grouping $\mathbf{\lambda}^{(q)}$ having $k^{(q)}$ clusters, a consensus function $\Gamma$ is a mapping $\mathbb{N}^{n \times r} \rightarrow \mathbb{N}^{n}$ from a set of clusterings to a single integrated clustering:

$\displaystyle \Gamma : \{\mathbf{\lambda}^{(q)} \mid q \in \{1,\ldots,r\} \} \rightarrow \mathbf{\lambda}.$ (5.1)

We denote the set of groupings by $\mathbf{\Lambda} = \{\mathbf{\lambda}^{(q)} \mid q \in \{1,\ldots,r\}\}$. The optimal combined clustering should share the most information with the original clusterings. But how do we measure shared information between clusterings? In information theory, mutual information is a symmetric measure that quantifies the statistical information shared between two distributions [CT91]. Suppose there are two labelings $\mathbf{\lambda}^{(a)}$ and $\mathbf{\lambda}^{(b)}$, with $k^{(a)}$ clusters in $\mathbf{\lambda}^{(a)}$ and $k^{(b)}$ clusters in $\mathbf{\lambda}^{(b)}$. Let $n^{(h)}$ be the number of objects in cluster $\mathcal{C}_h$ according to $\mathbf{\lambda}^{(a)}$, and $n_{\ell}$ the number of objects in cluster $\mathcal{C}_{\ell}$ according to $\mathbf{\lambda}^{(b)}$. Let $n_{\ell}^{(h)}$ denote the number of objects that are in cluster $h$ according to $\mathbf{\lambda}^{(a)}$ as well as in cluster $\ell$ according to $\mathbf{\lambda}^{(b)}$. Then, the $[0,1]$-normalized mutual information criterion $\phi^{(\mathrm{NMI})}$ is computed as follows (see appendix B.1 for details):

$\displaystyle \phi^{(\mathrm{NMI})} (\mathbf{\lambda}^{(a)},\mathbf{\lambda}^{(b)}) = \frac{2}{n} \sum_{h=1}^{k^{(a)}} \sum_{\ell=1}^{k^{(b)}} n_{\ell}^{(h)} \log_{k^{(a)} \cdot k^{(b)}} \left( \frac{ n_{\ell}^{(h)}\, n} { n^{(h)} n_{\ell} } \right)$ (5.2)
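
To make equation 5.2 concrete, here is a minimal Python sketch, assuming each labeling is an equal-length sequence of integer cluster labels; the function name nmi and the use of collections.Counter for the contingency counts are illustrative choices, not part of the original formulation.

import math
from collections import Counter

def nmi(la, lb):
    # phi^(NMI) of equation 5.2 for two label vectors la, lb.
    # Sketch only; assumes k^(a) * k^(b) > 1 so the log base is valid.
    n = len(la)
    ka, kb = len(set(la)), len(set(lb))
    n_h = Counter(la)            # cluster sizes n^(h) under lambda^(a)
    n_l = Counter(lb)            # cluster sizes n_l under lambda^(b)
    n_hl = Counter(zip(la, lb))  # joint counts n_l^(h)
    total = 0.0
    for (h, l), joint in n_hl.items():
        total += joint * math.log(joint * n / (n_h[h] * n_l[l]), ka * kb)
    return 2.0 * total / n

Two identical balanced labelings give nmi(...) = 1, and independent labelings give values near 0, consistent with the $[0,1]$ normalization.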

Based on this pairwise measure, the average normalized mutual information (ANMI) between a set of $r$ labelings, $\mathbf{\Lambda}$, and a single labeling $\hat{\mathbf{\lambda}}$ is defined as follows:

$\displaystyle \phi^{(\mathrm{ANMI})}(\mathbf{\Lambda}, \hat{\mathbf{\lambda}}) = \frac{1}{r} \sum_{q = 1}^{r} \phi^{(\mathrm{NMI})}(\hat{\mathbf{\lambda}},\mathbf{\lambda}^{(q)})$ (5.3)
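
Given the nmi sketch above, equation 5.3 is a one-line average over the ensemble:

def anmi(Lambda, lam_hat):
    # phi^(ANMI) of equation 5.3: average NMI between the candidate
    # labeling lam_hat and each of the r labelings in Lambda.
    return sum(nmi(lam_hat, lam_q) for lam_q in Lambda) / len(Lambda)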

We define the optimal combined clustering $\mathbf{\lambda}^{(k-\mathrm{opt})}$ to be the one that has maximal average normalized mutual information with all individual labelings $\mathbf{\lambda}^{(q)}$ in $\mathbf{\Lambda}$, given that the desired number of consensus clusters is $k$. Our objective function is thus the ANMI of equation 5.3, and $\mathbf{\lambda}^{(k-\mathrm{opt})}$ is defined as

$\displaystyle \mathbf{\lambda}^{(k-\mathrm{opt})} = \arg \max_{\hat{\mathbf{\lambda}}} \sum_{q = 1}^{r} \phi^{(\mathrm{NMI})}(\hat{\mathbf{\lambda}},\mathbf{\lambda}^{(q)})$ (5.4)

where $\hat{\mathbf{\lambda}}$ ranges over all possible $k$-partitions [SG02a].
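
For toy problem sizes, equation 5.4 can be evaluated literally by enumerating candidate $k$-partitions. The sketch below, again with hypothetical helper names and building on nmi and anmi above, yields one canonical labeling per partition (first occurrences of labels in increasing order), so that mere relabelings are not searched twice; equation 5.6 below quantifies why this is only feasible for very small $n$.

from itertools import product

def k_partitions(n, k):
    # Yield one canonical label vector per partition of n objects
    # into exactly k non-empty clusters.
    for labels in product(range(k), repeat=n):
        seen = []
        for x in labels:
            if x not in seen:
                seen.append(x)
        if seen == list(range(k)):
            yield labels

def consensus_exhaustive(Lambda, k):
    # lambda^(k-opt) of equation 5.4 by brute-force search.
    n = len(Lambda[0])
    return max(k_partitions(n, k), key=lambda lh: anmi(Lambda, lh))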

There may be situations in which not all labels are known for all objects, i.e., some entries of the label vectors are missing. For such cases, the consensus clustering objective from equation 5.4 can be generalized by computing a weighted average of the mutual information with the known labels, with weights proportional to the comprehensiveness of each labeling as measured by its fraction of known labels. Let $\mathcal{L}^{(q)}$ be the set of object indices with known labels for the $q$-th labeling. The generalized objective function then becomes:

$\displaystyle \mathbf{\lambda}^{(k-\mathrm{opt})} = \arg \max_{\hat{\mathbf{\lambda}}} \sum_{q = 1}^{r} \vert \mathcal{L}^{(q)} \vert \; \phi^{(\mathrm{NMI})}(\hat{\mathbf{\lambda}}_{\mathcal{L}^{(q)}}, \mathbf{\lambda}^{(q)}_{\mathcal{L}^{(q)}})$ (5.5)
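
Below is a sketch of the weighted sum inside equation 5.5, assuming known[q] holds the index set $\mathcal{L}^{(q)}$ (any collection of object indices); restricting both labelings to those indices and weighting by $\vert\mathcal{L}^{(q)}\vert$ mirrors the formula.

def generalized_objective(Lambda, known, lam_hat):
    # The sum in equation 5.5: each NMI term is computed on the
    # objects with known labels in the q-th labeling, weighted by
    # the number of such objects.
    score = 0.0
    for lam_q, L_q in zip(Lambda, known):
        idx = sorted(L_q)
        score += len(idx) * nmi([lam_hat[i] for i in idx],
                                [lam_q[i] for i in idx])
    return score

The consensus labeling of equation 5.5 is then max(k_partitions(n, k), key=lambda lh: generalized_objective(Lambda, known, lh)), exactly as in the exhaustive search above.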

For finite populations, the trivial solution is an exhaustive search through all possible clusterings with $k$ labels for the one with maximum ANMI. However, for $n$ objects and $k$ clusters there are

$\displaystyle \frac{1}{k!} \sum_{\ell=1}^{k} \binom{k}{\ell} (-1)^{k-\ell} \ell^{n}$ (5.6)

possible clusterings, or approximately $k^n / k!$ for $n \gg k$. For example, there are 171,798,901 ways to form 4 groups of 16 objects. For 8 groups of 32 objects this number grows to over $1.75\cdot10^{24}$, and for 16 groups of 64 objects to over $4.23\cdot10^{63}$. Clearly, this exponential growth in $n$ makes a full search prohibitive. Having defined the objective, we now propose three algorithms for finding good heuristic solutions. Greedy optimization of the ANMI objective is difficult because the underlying problem is a hard combinatorial one, so the three algorithms proposed in this chapter are built from intuitive ideas for maximizing the objective.
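
The count in equation 5.6 is the Stirling number of the second kind, $S(n,k)$, and the figures quoted above can be checked with exact integer arithmetic; a small sketch (math.comb requires Python 3.8+):

from math import comb, factorial

def num_k_partitions(n, k):
    # Equation 5.6: number of ways to partition n objects into
    # exactly k non-empty clusters (Stirling number S(n, k)).
    return sum((-1) ** (k - l) * comb(k, l) * l ** n
               for l in range(1, k + 1)) // factorial(k)

print(num_k_partitions(16, 4))   # 171798901
print(num_k_partitions(32, 8))   # roughly 1.75e24
print(num_k_partitions(64, 16))  # roughly 4.23e63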

