
Relationship-based Clustering and Visualization

One picture is worth ten thousand words.
- Frederick R. Barnard^3.1

In several real-life data mining applications, data resides in a very high-dimensional space (1,000 dimensions or more), where both clustering techniques developed for low-dimensional spaces (k-means, BIRCH, CLARANS, CURE, DBSCAN, etc.) and visualization methods such as parallel coordinates or projective visualizations are rendered ineffective. This chapter proposes a relationship-based approach that alleviates both problems, side-stepping the 'curse of dimensionality' by working in a suitable similarity space instead of the original high-dimensional attribute space. This intermediary similarity space can be tailored to satisfy business criteria, such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph partitioning-based clustering techniques in this space. The output of the clustering algorithm is used to re-order the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. While two-dimensional visualization of a similarity matrix is not novel by itself, its combination with the order-sensitive partitioning of a graph that captures the relevant similarity measure between objects provides three powerful properties: (i) the high dimensionality of the data does not affect further processing once the similarity space is formed; (ii) it leads to clusters of (approximately) equal importance; and (iii) related clusters show up adjacent to one another, further facilitating the visualization of results. The visualization is very helpful for assessing and improving a clustering. For example, actionable recommendations for splitting or merging clusters can be easily derived, and the visualization also guides the user towards the right number of clusters. Results are presented on a real retail-industry data set of several thousand customers and products, as well as on the clustering of web-document collections and of web-log sessions.
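The re-ordering step described above can be illustrated with a minimal sketch. It is not the chapter's implementation: cosine similarity is assumed as the similarity measure, the cluster labels are simply given by hand rather than produced by graph partitioning, and the function names cosine_similarity_matrix and permute_by_cluster are hypothetical. The sketch only shows how sorting points by cluster label turns an unordered similarity matrix into one with bright diagonal blocks.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms
    return Xn @ Xn.T

def permute_by_cluster(S, labels):
    """Re-order rows and columns of S so same-cluster points are contiguous."""
    order = np.argsort(labels, kind="stable")
    return S[np.ix_(order, order)], order

# Toy data (assumption): 6 points in 4 dimensions, two groups interleaved.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.9, 1.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
])
# Assumed output of some clustering step (here written by hand).
labels = np.array([0, 1, 0, 1, 0, 1])

S = cosine_similarity_matrix(X)
S_perm, order = permute_by_cluster(S, labels)
```

Rendering S_perm as a gray-scale image (for instance with an image-plotting routine of your choice) would show two high-similarity blocks along the diagonal, one per cluster. Once S is formed, the dimensionality of X plays no further role, which is the first of the three properties listed above.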



Alexander Strehl 2002-05-03