
Ensembles and Knowledge Reuse

There is an extensive body of work on combining multiple classifiers or regression models, but so far little on combining multiple clusterings in the machine learning community. However, in traditional pattern recognition, there is a substantial body of largely theoretical work on consensus classification from the mid-1980s and earlier [NN86a,NN86b,BLM86]. These studies used the term `classification' in a very general sense, encompassing partitions, dendrograms and $n$-trees as well. Today, such structures are typically referred to as clusterings. In consensus classification, a profile is a set of classifications that is to be integrated into a single consensus classification. A representative work is that of [NN86a], which investigated techniques for strict consensus. The authors first construct a lattice over the set of all partitionings by using the refinement relation. Partitioning A is a refinement of partitioning B if every cluster in A is a subset of some cluster in B. This refinement relation defines a partial order, and thus for each pair of partitionings a supremum and an infimum exist. A strict consensus computes the supremum and infimum over all given clusterings, thus yielding the consensus interval.
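
To make the refinement relation and the consensus interval concrete, the following is a minimal sketch in Python (partitionings are encoded as label vectors over the same objects; this encoding and the function names are illustrative and not taken from [NN86a]). It checks refinement for a pair of partitionings and computes the two ends of their consensus interval: the intersection (product) partitioning, which is the infimum in the convention used here, and the transitive merge, which is the supremum.

    def is_refinement(a, b):
        """True iff partitioning `a` refines `b`, i.e. every cluster of `a`
        is contained within a single cluster of `b`."""
        image = {}
        for la, lb in zip(a, b):
            if image.setdefault(la, lb) != lb:
                return False
        return True

    def intersection(a, b):
        """Product partitioning: objects stay together only if they share a
        cluster in both `a` and `b` (the infimum of the pair)."""
        labels = {}
        return [labels.setdefault((la, lb), len(labels)) for la, lb in zip(a, b)]

    def join(a, b):
        """Coarsest partitioning refined by both `a` and `b`: transitively
        merge clusters that share any object (the supremum of the pair)."""
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for la, lb in zip(a, b):
            parent[find(('a', la))] = find(('b', lb))
        labels = {}
        return [labels.setdefault(find(('a', la)), len(labels)) for la in a]

    a = [0, 0, 1, 1, 2, 2]
    b = [0, 0, 0, 1, 1, 1]
    print(is_refinement(a, b))   # False: cluster 1 of `a` straddles two clusters of `b`
    print(intersection(a, b))    # [0, 0, 1, 2, 3, 3] -- finer than both inputs
    print(join(a, b))            # [0, 0, 0, 0, 0, 0] -- here everything merges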

Such work on strict consensus works well for small data-sets with little noise and little diversity. In the presence of strong noise, however, the results can be trivial: the supremum becomes the monolithic clustering (a single cluster) and the infimum the set of all singletons. Moreover, the computations are intractable for large data-sets. Another drawback is that the strict consensus is generally not at the same level of resolution as the original clusterings.

The most prominent application of strict consensus is found in the computational biology community, where it is used to obtain phylogenetic trees [KW99,KWY95]. A set of DNA sequences can be used to generate evolutionary trees using criteria such as maximum parsimony, but often one obtains several hundred trees with the same score function. In such cases, biologists look for the strict consensus tree, the `infimum', which has lower resolution but is compatible with all the individual trees. Note that such systems are different from the cluster ensembles proposed in chapter 5 in that (i) they combine hierarchical clusterings, typically represented as unrooted trees, (ii) they rely on domain-specific distance metrics (e.g., the Robinson-Foulds distance) and evaluation criteria such as parsimony, specificity and density, and (iii) strict consensus is a requirement.

The most important difference between our work in chapter 5 and the studies of the 1980s on `consensus classification' is that the latter obtains a consensus at a different level of refinement, whereas this dissertation focuses on a consensus at the same level or scale. For example, taking the intersection of two partitionings yields a partitioning with up to the product of the numbers of clusters in the original partitionings, thus moving the analysis to a much finer scale.

Two interesting applications of consensus ideas to classification and regression problems have emerged recently. Consensus decision trees have been used to generate a single decision tree from $N$-fold cross-validated C4.5 results [KLF01]. Objects are clustered according to their positions or paths in the $N$ decision trees. Then, only objects of the majority class in each cluster are selected to form a new training set that generates the consensus decision tree. The goal here is to obtain a single, simplified decision tree without compromising much on classification accuracy. In [Rag01], many models are obtained for a regression problem. These models are then clustered based on their pairwise mutual information. Finally, a smaller committee is obtained for the regression problem by selecting a representative model from each cluster. Note that neither of these works actually combines multiple partitionings of the input data; instead, both use clustering as an intermediate step in solving a classification or regression problem.
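
The consensus-decision-tree procedure can be sketched as follows; scikit-learn's CART trees stand in for C4.5, and grouping objects by the exact tuple of leaves they fall into is one plausible reading of `positions or paths', so the details may differ from [KLF01]:

    import numpy as np
    from collections import Counter, defaultdict
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # 1. One tree per cross-validation fold.
    trees = [DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr])
             for tr, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X)]

    # 2. Group objects by the tuple of leaves they reach across the N trees.
    leaf_ids = np.column_stack([t.apply(X) for t in trees])
    groups = defaultdict(list)
    for i, key in enumerate(map(tuple, leaf_ids)):
        groups[key].append(i)

    # 3. Within each group, keep only the objects of the majority class.
    keep = []
    for members in groups.values():
        majority = Counter(y[m] for m in members).most_common(1)[0][0]
        keep.extend(i for i in members if y[i] == majority)

    # 4. A single, simplified tree trained on the filtered set.
    consensus_tree = DecisionTreeClassifier(random_state=0).fit(X[keep], y[keep])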

In chapter 5, we will propose to exploit multiple existing groupings of the data. Several analogous approaches exist in supervised learning scenarios (where class labels are known), under categories such as `life-long learning' [Thr96], `learning to learn' [TP97] and `knowledge reuse' [BG98,BG99]. Several researchers have attempted to directly reuse the internal state information of classifiers, under the belief that related classification tasks may benefit from common internal features. One approach to this idea is to use the weights of hidden layers in a Multi-Layer Perceptron (MLP) classifier architecture to represent the knowledge to be shared among multiple tasks that are trained on simultaneously [Car95]. Pratt [Pra94] uses some of the trained weights from one MLP network to initialize weights in another MLP to be trained for a later, related task. In related work, Silver and Mercer [SM96] have developed a system consisting of task networks and an experience network. The experience network tries to learn the converged weights of related task networks in order to initialize the weights of target task networks. Both of these weight-initialization techniques resulted in improved learning speed. Besides weight reuse in MLP-type classifiers, other state-reuse methods have been developed. One approach, developed by Thrun [TO96], is based on a nearest-neighbor classifier in which each dimension of the input space is scaled so as to bring examples within a class closer together while pushing examples of different classes apart. The scaling vector derived for one classification task is then used in another, related task.

We have previously proposed a knowledge reuse framework wherein the labels produced by old classifiers are used to improve the generalization performance of a new classifier for a different but related task [BG98]. This improvement is facilitated by a supra-classifier that accesses only the outputs of the old and new classifiers, and does not need the training data that was used to create the old classifiers. Substantial gains are achieved when the training set for the new problem is small, since the shortage of data can then be compensated for by extracting information from the existing related solutions.
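
The supra-classifier idea can be illustrated with a small sketch; the toy data, the stand-in `OldModel' class and the choice of logistic regression are assumptions for illustration and do not reflect the actual architecture used in [BG98]:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_new = rng.normal(size=(60, 5))                      # small new-task training set
    y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)

    class OldModel:
        """Stand-in for a classifier trained earlier on a related task;
        only its predictions are available, not its training data."""
        def __init__(self, w):
            self.w = w
        def predict(self, X):
            return (X @ self.w > 0).astype(int)

    old_models = [OldModel(rng.normal(size=5)) for _ in range(3)]

    # Base classifier trained only on the (small) new-task data.
    base = LogisticRegression().fit(X_new, y_new)

    # Supra-classifier: sees only the categorical outputs of the old and new
    # classifiers, never the data the old classifiers were trained on.
    Z = np.column_stack([m.predict(X_new) for m in old_models] + [base.predict(X_new)])
    supra = LogisticRegression().fit(Z, y_new)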

More recently, a host of semi-supervised methods have emerged that augment a limited training set of labeled data with a larger amount of unlabeled data. One powerful idea is co-training [BM98], whose success hinges on the presence of multiple `redundantly sufficient' views of the data. For example, Muslea et al. introduced a multi-view algorithm that includes active sampling based on co-training [MMK01]. Nigam and Ghani investigated the effectiveness of co-training in semi-supervised settings [NG00].
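
A minimal co-training loop over two views might look as follows; Gaussian naive Bayes as the base learner and the confidence-based selection rule are simplifying assumptions rather than the exact schedule of [BM98]:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def co_train(X1, X2, y, labeled, rounds=10, add_per_round=5):
        """Hypothetical helper: X1, X2 are two views of the same objects,
        `labeled` is a boolean mask of initially labeled examples, and `y`
        holds labels for those examples (other entries are ignored)."""
        labeled, y = labeled.copy(), y.copy()
        for _ in range(rounds):
            for view in (X1, X2):
                clf = GaussianNB().fit(view[labeled], y[labeled])
                pool = np.where(~labeled)[0]
                if len(pool) == 0:
                    return labeled, y
                proba = clf.predict_proba(view[pool])
                # each view labels the unlabeled examples it is most confident about
                best = pool[np.argsort(proba.max(axis=1))[-add_per_round:]]
                y[best] = clf.predict(view[best])
                labeled[best] = True
        return labeled, y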

Another application of our cluster ensembles is to combine multiple clusterings that were each obtained using only a partial set of features. This problem has recently been approached as a case of collective data mining [KPHJ99]. In [JK99], a feasible approach to combining distributed agglomerative clusterings is introduced. First, each local site generates a dendrogram. The dendrograms are collected, and pairwise similarities for all objects are derived from them. The combined clustering is then computed from these similarities. In [KHSJ01], a distributed method of principal components analysis is introduced for clustering.
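
A sketch of this dendrogram-combination idea, under the assumption that cophenetic distances serve as the pairwise summary exchanged between sites (the similarity actually used in [JK99] may differ):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, cophenet, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 6))
    sites = [slice(0, 2), slice(2, 4), slice(4, 6)]   # three feature subsets, one per "site"

    # 1. Local step: each site builds a dendrogram on its own features and
    #    summarizes it by the condensed cophenetic distance vector.
    local = [cophenet(linkage(X[:, cols], method='average')) for cols in sites]

    # 2. Central step: average the pairwise structure across sites ...
    combined = np.mean(local, axis=0)

    # 3. ... and derive the combined clustering from it.
    labels = fcluster(linkage(combined, method='average'), t=3, criterion='maxclust')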

The usefulness of having multiple views of data for better clustering has been recognized by others as well. In multi-aspect clustering [MS00], several similarity matrices are computed separately and then integrated using a weighting scheme. Also, Mehrotra has proposed a multi-viewpoint clustering, where several clusterings are used to semi-automatically structure rules in a knowledge base [Meh99].
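
In outline, such a weighted integration of similarity matrices could look as follows; the RBF and cosine similarities and the fixed weights are placeholders, not the aspects or weighting scheme of [MS00]:

    import numpy as np
    from sklearn.cluster import SpectralClustering
    from sklearn.metrics.pairwise import rbf_kernel, cosine_similarity

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 8))

    # Two separately computed similarity matrices ("aspects") ...
    aspects = [rbf_kernel(X), cosine_similarity(X).clip(min=0)]
    weights = [0.7, 0.3]

    # ... integrated by a weighted sum and clustered on the combined affinity.
    S = sum(w * s for w, s in zip(weights, aspects))
    labels = SpectralClustering(n_clusters=3, affinity='precomputed',
                                random_state=0).fit_predict(S)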

In our proposed algorithms, we use hypergraph representations, which have been extensively studied [GJ79]. Hypergraphs have previously been used for (single) high-dimensional clustering [HKKM97], but not for combining multiple groupings. Mutual information [CT91] is a useful measure in a variety of contexts. For example, the information bottleneck method [ST00] is an information-theoretic approach that uses mutual information to perform dimensionality reduction (e.g., through clustering) while trying to preserve as much information about the class labels as possible.
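
As a small illustration of why mutual information is convenient for comparing groupings, the following uses scikit-learn's normalized variant; its normalization is one of several conventions and not necessarily the one adopted later in this dissertation:

    from sklearn.metrics import normalized_mutual_info_score

    a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
    b = [2, 2, 2, 0, 0, 0, 1, 1, 1]   # same grouping, cluster labels permuted
    c = [0, 1, 2, 0, 1, 2, 0, 1, 2]   # unrelated grouping

    print(normalized_mutual_info_score(a, b))   # 1.0 -- invariant to label renaming
    print(normalized_mutual_info_score(a, c))   # 0.0 -- no shared information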

