In this dissertation, cluster analysis was extended in several directions driven by information-rich, complex data. Real-life objects are often characterized by an abundance of features as well as by taxonomies from a variety of views and times. The main contributions of this dissertation are summarized in Figure 6.1. They are organized around three areas of cluster analysis activity, namely applications, algorithms, and result assessment.
Transactional shopping records motivated us to develop analysis tools
better suited to such data than existing ones such as Apriori association
rule mining or k-means. We developed a relationship-based clustering
framework that relies only on pairwise similarities to side-step the
curse of dimensionality. In this spirit, we developed an intuitive
clustering visualization using sparse seriation of the pairwise
similarity matrix. We improved similarity measures by proposing and
analyzing an extended Jaccard similarity measure. In the retail
domain, we also introduced application-driven value-balancing
constraints to cluster analysis (e.g., clusters with approximately
equal total revenue or number of customers). These constraints
turned out to be useful in a variety of other domains, such as
clustering web sessions and text documents.
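As an illustration of the extended Jaccard similarity mentioned above, the following minimal sketch assumes the commonly used definition s(x, y) = x·y / (||x||² + ||y||² − x·y), which reduces to the ordinary Jaccard coefficient for binary vectors. The function name, the zero-vector convention, and the example baskets are hypothetical and chosen only for illustration.

```python
def extended_jaccard(x, y):
    """Extended Jaccard similarity of two equal-length, non-negative vectors.

    Assumed definition: dot(x, y) / (|x|^2 + |y|^2 - dot(x, y)).
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x)
    norm_y = sum(b * b for b in y)
    denom = norm_x + norm_y - dot
    # Assumed convention: two all-zero vectors count as identical.
    return 1.0 if denom == 0 else dot / denom


# Hypothetical market baskets: 1 = product purchased, 0 = not purchased.
basket_a = [1, 1, 0, 1]
basket_b = [1, 0, 0, 1]
print(extended_jaccard(basket_a, basket_b))
```

For these binary baskets the value equals the size of the intersection divided by the size of the union of purchased products; for real-valued vectors (e.g., purchase quantities or revenue) the same formula also accounts for magnitudes.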
In our work with text document data, the availability of human-engineered document taxonomies inspired a clustering performance evaluation measure based on mutual information. The results obtained with this measure showed that relationship-based clustering is indeed superior when appropriate domain-specific similarities are chosen.
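A mutual-information-based evaluation of this kind can be sketched as follows. The sketch compares a clustering against a reference taxonomy and assumes one common normalization, the geometric mean of the two entropies; the exact normalization used elsewhere in this dissertation may differ. The function name and the convention for single-cluster labelings are assumptions made for illustration.

```python
from collections import Counter
from math import log, sqrt


def normalized_mutual_information(labels_a, labels_b):
    """Mutual information of two labelings, normalized by sqrt(H(a) * H(b))."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    # I(a; b) = sum over non-empty cells of p(a, b) * log(p(a, b) / (p(a) p(b)))
    mi = sum(
        (n_ab / n) * log(n * n_ab / (count_a[a] * count_b[b]))
        for (a, b), n_ab in count_ab.items()
    )
    h_a = -sum((c / n) * log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        # Assumed convention: a single-cluster labeling carries no information.
        return 0.0
    return mi / sqrt(h_a * h_b)


# Hypothetical example: cluster labels versus human-assigned categories.
clusters = [0, 0, 1, 1]
categories = ["sports", "sports", "politics", "politics"]
print(normalized_mutual_information(clusters, categories))
```

Because the measure depends only on how objects are grouped, it is invariant to how clusters are named, which is what makes it usable for comparing a clustering with an externally supplied taxonomy.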
When a multitude of taxonomies is already present, one might want to reuse such existing knowledge and integrate previous results instead of starting from scratch. This led us to develop cluster ensembles, an approach that adapts multi-learner systems to clustering. We formally defined the cluster ensemble problem and developed three effective and efficient algorithms to solve it. This combiner framework is useful in a variety of applications besides knowledge reuse. For example, it enables distributed clustering and can provide robustness by combining multiple clusterings.
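To give a feel for what combining clusterings involves, the sketch below builds a co-association matrix: for every pair of objects, the fraction of input clusterings that place them in the same cluster. This is only a generic building block shown under stated assumptions, not one of the three algorithms referred to above; the function name and the example labelings are hypothetical.

```python
def co_association(labelings):
    """Return an n x n matrix of pairwise co-membership frequencies.

    labelings: list of equal-length label lists, one per input clustering.
    Entry [i][j] is the fraction of clusterings that group i and j together.
    """
    n = len(labelings[0])
    r = len(labelings)
    return [
        [sum(lab[i] == lab[j] for lab in labelings) / r for j in range(n)]
        for i in range(n)
    ]


# Three hypothetical clusterings of four objects. The first two describe the
# same partition with different label names; the third disagrees on object 2.
labelings = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
for row in co_association(labelings):
    print(row)
```

The resulting matrix is itself a pairwise similarity matrix, so it can in principle be passed to any similarity-based clustering method to obtain a combined clustering, without access to the original features of the objects.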