next up previous contents
Next: Contents   Contents

Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining

Alexander Strehl

Dissertation
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy


Supervisor: Joydeep Ghosh


Copyright by Alexander Strehl 2002


PDF Version of Dissertation

Abstract:

This dissertation takes a relationship-based approach to cluster analysis of high (1000 and more) dimensional data that side-steps the `curse of dimensionality' issue by working in a suitable similarity space instead of the original feature space. We propose two frameworks that leverage graph algorithms to achieve relationship-based clustering and visualization, respectively. In the visualization framework, the output from the clustering algorithm is used to re-order the data points so that the resulting permuted similarity matrix can be readily visualized in 2 dimensions, with clusters showing up as bands. Results on retail transaction, document (bag-of-words), and web-log data show that our approach can yield superior results while also taking additional balance constraints into account.

The choice of similarity is a critical step in relationship-based clustering and this motivates our systematic comparative study of the impact of similarity measures on the quality of document clusters. The key findings of our experimental study are: (i) Cosine, correlation, and extended Jaccard similarities perform comparably; (ii) Euclidean distances do not work well; (iii) graph partitioning tends to be superior to $ k$-means and SOMs especially when balanced clusters are desired; and (iv) performance curves generally do not cross. We also propose a cluster quality evaluation measure based on normalized mutual information and find an analytical relation between similarity measures.

It is widely recognized that combining multiple classification or regression models typically provides superior results compared to using a single, well-tuned model. However, there are no well known approaches to combining multiple clusterings. The idea of combining cluster labelings without accessing the original features leads to a general knowledge reuse framework that we call cluster ensembles. We propose a formal definition of the cluster ensemble as an optimization problem. Taking a relationship-based approach we propose three effective and efficient combining algorithms for solving it heuristically based on a hypergraph model. Results on synthetic as well as real data-sets show that cluster ensembles can (i) improve quality and robustness, and (ii) enable distributed clustering, and (iii) speed up processing significantly with little loss in quality.




next up previous contents
Next: Contents   Contents
Alexander Strehl 2002-05-03