
Contributions

The goal of this dissertation is to improve cluster analysis of complex, high-dimensional, and sparse data, especially when the application scenario imposes constraints on the desired results and on the distribution of and access to the data. This dissertation utilizes ideas from pattern recognition, machine learning, statistics, graph theory, matrix reordering, multi-learner systems, and information theory to build a novel paradigm for cluster analysis based on relationships. The specific contributions of this dissertation are as follows:

Development of a complete framework for behavioral customer segmentation. The framework extends previous work through domain-specific similarity measures, such as the extended Jaccard coefficient, and through constraints, such as revenue or customer balancing.
Proposal of an intuitive and interactive clustering visualization method based on a reordering of the similarity matrix.
Development of a comparative framework for semi-supervised text clustering and investigation of several popular clustering approaches on a variety of data sets. The empirical evaluation demonstrates how relationship-based methods improve both the quality and the balance of results.
Definition of the cluster ensemble problem as a counterpart to classification ensembles in unsupervised learning. The problem of combining previous clusterings without resorting to the original features is posed as a mutual information maximization problem.
Development and comparison of three relationship-based algorithms for the cluster ensemble problem. It is demonstrated that all of them work well on real data and are able to deal with missing labels and soft clusterings.
Application of cluster ensembles to foster robustness and to enable distributed clustering.
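
Two of the quantities named in the list above can be made concrete with a short sketch. It assumes the usual definition of the extended Jaccard coefficient, s(x, y) = x·y / (||x||^2 + ||y||^2 - x·y), and a mutual information between two labelings that is normalized by the geometric mean of their entropies. The function names, the normalization choice, and the handling of degenerate inputs are illustrative assumptions, not the dissertation's implementation.

```python
import math
from collections import Counter


def extended_jaccard(x, y):
    """Extended Jaccard similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # Assumption: two all-zero vectors are given similarity 0.
    return dot / denom if denom > 0 else 0.0


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(labels_a, labels_b):
    """Mutual information of two labelings divided by sqrt(H(a) * H(b))."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = sum(
        (n_ab / n) * math.log(n * n_ab / (count_a[a] * count_b[b]))
        for (a, b), n_ab in count_ab.items()
    )
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    # Assumption: a labeling with a single cluster carries no information.
    if h_a == 0 or h_b == 0:
        return 0.0
    return mi / math.sqrt(h_a * h_b)
```

Under these assumptions, a candidate combined clustering could be scored by averaging `normalized_mutual_information` between it and each of the given clusterings, which uses only cluster labels and never the original features.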


Alexander Strehl 2002-05-03