Objective Function for Cluster Ensembles

Given $r$ groupings, with the $q$-th grouping $\lambda^{(q)}$ having $k^{(q)}$ clusters, a consensus function $\Gamma$ is defined as a function mapping a set of clusterings to an integrated clustering:

$$\Gamma : \{\lambda^{(q)} \mid q \in \{1,\dots,r\}\} \rightarrow \lambda$$
We abbreviate the set of groupings as $\Lambda = \{\lambda^{(q)} \mid q \in \{1,\dots,r\}\}$. The optimal combined clustering should share the most information with the original clusterings. But how do we measure shared information between clusterings? In information theory, mutual information is a symmetric measure to quantify the statistical information shared between two distributions [CT91]. Suppose there are two labelings $\lambda^{(a)}$ and $\lambda^{(b)}$. Let there be $k^{(a)}$ groups in $\lambda^{(a)}$ and $k^{(b)}$ groups in $\lambda^{(b)}$. Let $n_h$ be the number of objects in cluster $C_h$ according to $\lambda^{(a)}$, and $n_\ell$ the number of objects in cluster $C_\ell$ according to $\lambda^{(b)}$. Let $n_{h,\ell}$ denote the number of objects that are in cluster $h$ according to $\lambda^{(a)}$ as well as in group $\ell$ according to $\lambda^{(b)}$. Then, a $[0,1]$-normalized mutual information criterion $\phi^{(\mathrm{NMI})}$ is computed as follows (see appendix B.1 for details):

$$\phi^{(\mathrm{NMI})}(\lambda^{(a)},\lambda^{(b)}) = \frac{2}{n} \sum_{h=1}^{k^{(a)}} \sum_{\ell=1}^{k^{(b)}} n_{h,\ell} \log_{k^{(a)} k^{(b)}} \left( \frac{n_{h,\ell}\, n}{n_h\, n_\ell} \right)$$
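As an illustrative sketch, the counts $n_h$, $n_\ell$, and $n_{h,\ell}$ can be accumulated directly from two label vectors; the function name and structure below are ours, not from the text:

```python
import math
from collections import Counter

def nmi(a, b):
    """[0,1]-normalized mutual information between two labelings.

    `a` and `b` are equal-length sequences of cluster labels; using log
    base k_a * k_b normalizes the score to [0, 1] (this is the log-base
    variant of NMI, not the entropy-ratio variant found elsewhere).
    """
    n = len(a)
    ka, kb = len(set(a)), len(set(b))
    if ka == 1 or kb == 1:          # a single cluster carries no information
        return 0.0
    n_h = Counter(a)                # cluster sizes under labeling a
    n_l = Counter(b)                # cluster sizes under labeling b
    n_hl = Counter(zip(a, b))       # joint counts n_{h,l}
    total = sum(c * math.log(c * n / (n_h[h] * n_l[l]), ka * kb)
                for (h, l), c in n_hl.items())
    return 2.0 / n * total
```

Two balanced, identical labelings score exactly 1, while independent labelings (e.g. `[0,0,1,1]` against `[0,1,0,1]`) score 0.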
Therefore, the average normalized mutual information (ANMI) between a set of $r$ labelings, $\Lambda$, and a labeling $\hat{\lambda}$ is defined as follows:

$$\phi^{(\mathrm{ANMI})}(\Lambda,\hat{\lambda}) = \frac{1}{r} \sum_{q=1}^{r} \phi^{(\mathrm{NMI})}(\hat{\lambda},\lambda^{(q)}) \qquad (5.3)$$
We define the optimal combined clustering $\lambda^{(k\text{-opt})}$ to be the one that has maximal average mutual information with all individual labelings $\lambda^{(q)}$ in $\Lambda$, given that the number of consensus clusters desired is $k$. Our objective function is the average normalized mutual information (ANMI, equation 5.3), and $\lambda^{(k\text{-opt})}$ is defined as

$$\lambda^{(k\text{-opt})} = \arg\max_{\hat{\lambda}} \sum_{q=1}^{r} \phi^{(\mathrm{NMI})}(\hat{\lambda},\lambda^{(q)}), \qquad (5.4)$$

where $\hat{\lambda}$ goes through all possible $k$-partitions [SG02a].

There may be situations where not all labels are known for all objects, i.e., there is missing data in the label vectors. For such cases, the consensus clustering objective from equation 5.4 can be generalized by computing a weighted average of the mutual information with the known labels, with the weights proportional to the comprehensiveness of the labelings as measured by the fraction of known labels. Let $L^{(q)}$ be the set of object indices with known labels for the $q$-th labeling. Then, the generalized objective function becomes:

$$\lambda^{(k\text{-opt})} = \arg\max_{\hat{\lambda}} \sum_{q=1}^{r} |L^{(q)}|\, \phi^{(\mathrm{NMI})}\!\left(\hat{\lambda}_{L^{(q)}}, \lambda^{(q)}_{L^{(q)}}\right),$$

where $\hat{\lambda}_{L^{(q)}}$ denotes the labeling $\hat{\lambda}$ restricted to the objects indexed by $L^{(q)}$.
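A minimal sketch of this weighted objective, assuming `None` marks an unknown label (the function name and that convention are ours; a compact NMI helper is repeated so the snippet stands alone):

```python
import math
from collections import Counter

def nmi(a, b):
    # [0,1]-normalized mutual information (log base k_a * k_b), as in the text.
    n, ka, kb = len(a), len(set(a)), len(set(b))
    if ka == 1 or kb == 1:
        return 0.0
    n_h, n_l, n_hl = Counter(a), Counter(b), Counter(zip(a, b))
    return 2.0 / n * sum(c * math.log(c * n / (n_h[h] * n_l[l]), ka * kb)
                         for (h, l), c in n_hl.items())

def weighted_anmi_objective(labelings, lam):
    """Sum of |L_q|-weighted NMI terms; `None` entries mark unknown labels.

    Each labeling q is scored only on its known indices L_q, and its NMI
    is weighted by |L_q|, so more comprehensive labelings count for more.
    """
    total = 0.0
    for lab in labelings:
        known = [i for i, v in enumerate(lab) if v is not None]
        if len(known) < 2:
            continue  # too few known labels to compare against
        total += len(known) * nmi([lam[i] for i in known],
                                  [lab[i] for i in known])
    return total
```

With no missing labels this reduces to $n$ times the unnormalized sum in equation 5.4; labelings with almost no known labels contribute almost nothing.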
For finite populations, the trivial solution is to exhaustively search through all possible clusterings with $k$ labels for the one with the maximum ANMI. However, for $n$ objects and $k$ partitions there are

$$S(n,k) = \frac{1}{k!} \sum_{\ell=1}^{k} (-1)^{k-\ell} \binom{k}{\ell} \ell^{\,n}$$

possible clusterings, or approximately $k^n / k!$ for $n \gg k$. For example, there are 171,798,901 ways to form 4 groups of 16 objects. For 8 groups of 32 objects this number grows to over $10^{24}$, and for 16 groups of 64 objects to over $10^{63}$. Clearly, this exponential growth in $n$ makes a full search prohibitive. Having defined the objective, we now propose three algorithms to find good heuristic solutions. Note that greedy optimization of the ANMI objective is difficult, since it is a hard combinatorial problem. So the three algorithms proposed in this chapter have been developed from intuitive ideas for maximizing the objective.
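The count above is the Stirling number of the second kind, which is easy to evaluate exactly and to compare against the $k^n / k!$ approximation (the function name below is ours):

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind: the number of ways to partition
    n labeled objects into exactly k non-empty, unlabeled groups."""
    return sum((-1) ** (k - l) * comb(k, l) * l ** n
               for l in range(1, k + 1)) // factorial(k)

print(stirling2(16, 4))          # 171798901: 4 groups of 16 objects, as in the text
print(4 ** 16 // factorial(4))   # ~1.8e8, the k^n / k! approximation (slight overcount)
```

The approximation overcounts only because it ignores assignments that leave some of the $k$ groups empty, a negligible fraction when $n \gg k$.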