Next: Impact of Similarity Measures
Up: Relationship-based Clustering and Visualization
Previous: Parallel Implementation
  Contents
Summary
A recent poll (June 2001) by KDNuggets (http://www.kdnuggets.com/)
indicated that clustering was by far the most popular type of analysis
in the last 12 months at 22% (followed by direct marketing at 14%
and cross-sell models at 12%).
The clustering process is characterized by extended exploratory phases
in which better domain understanding is gained. In this iterative
process, the crucially important definitions of features and/or
similarity are often refined. The visualization toolkit CLUSION
allows even non-specialists to gain an intuitive visual impression of
the grouping nature of objects that may originally be defined in a
high-dimensional space.
Taking CLUSION from a post-processing step into the loop can
significantly accelerate the process of discovering domain knowledge,
as it provides a powerful visual aid for assessing and improving
clustering. For example, actionable recommendations for splitting or
merging of clusters can be easily derived and readily applied via a
point-and-click user interface, and different similarity metrics can
be compared visually. It also guides the user towards the `right'
number of clusters. A demo of this tool can be found at
http://strehl.com/.
This chapter originally stemmed from our encounter with several retail
datasets where, even after substantial pre-processing, we were left
with records of over 1000 attributes, and where further attempts to
reduce the number of attributes by selection or projection led to the
loss of vital information.
Relationship-based clustering provides one way out by transforming the
data to another space (in time linear in the number of dimensions)
where the high dimensionality gets `hidden', since once similarity
is computed, the original dimensions are not encountered again.
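The transformation can be sketched as follows. Cosine similarity and the toy data sizes below are illustrative assumptions on our part, not the specific measure used in the chapter; the point is only that the similarity computation is linear in the number of dimensions, after which the original coordinates never reappear:

```python
import numpy as np

def similarity_matrix(X):
    """Map n objects in d dimensions to an n x n similarity matrix.

    Cost is O(n^2 * d), i.e., linear in the dimensionality d. Once
    this matrix is built, all further work happens in similarity
    space and the d original attributes are not touched again.
    Cosine similarity is used purely as an illustration.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # unit-normalize each row
    return Xn @ Xn.T                      # pairwise cosine similarities

rng = np.random.default_rng(0)
X = rng.random((5, 1000))    # 5 objects, 1000 attributes
S = similarity_matrix(X)
print(S.shape)               # (5, 5): the high dimensionality is now 'hidden'
```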
This suggests a connection of our approach with kernel-based methods,
such as support vector machines, that are currently very popular for
classification problems [Vap95,Joa98]. A kernel function of two
vectors is a generalized inner product between the corresponding
mappings of these vectors into a derived (and typically very high
dimensional) feature space. Thus one can view it as a similarity
measure between the two original vectors. It will be worthwhile to
further investigate this connection for a variety of applications
[JH99].
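As a concrete instance of this correspondence, a degree-2 polynomial kernel (chosen here purely for illustration) computed in the original space equals an ordinary inner product after an explicit mapping into a derived feature space of all degree-2 monomials:

```python
import numpy as np

def poly2_kernel(x, y):
    # k(x, y) = (x . y)^2, evaluated without leaving the input space
    return float(np.dot(x, y)) ** 2

def phi(x):
    # Explicit feature map into the derived space: all products x_i * x_j
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
# The kernel is a generalized inner product of the mapped vectors,
# and can thus be read as a similarity measure on the original ones:
assert np.isclose(poly2_kernel(x, y), np.dot(phi(x), phi(y)))
```

For d input dimensions the derived space here has d^2 coordinates, yet the kernel never materializes them; this is the same "hiding" of dimensionality that the relationship-based transformation exploits.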
The clustering algorithm presented in this chapter is largely geared
towards the needs of segmenting transactional data, with provisions
for obtaining balanced clusters and for letting the quantity of
interest (revenue, margins) influence the grouping. Thus, rather than
evaluating business objectives (such as revenue contribution) after
clustering is done, they are integrated directly into the clustering
algorithm. Moreover, the algorithm is a natural fit with the
visualization approach, and it extends to other domains, as
illustrated by our results on clustering documents and grouping web
logs. We also examined several ways of scaling the clustering routine
to a large number of data points, and elaborated on one approach that
can use sampling effectively because of the balanced nature of the
desired clusters.
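A rough back-of-the-envelope argument (a union bound of our own, offered as an illustrative sketch rather than the chapter's analysis) indicates why balance makes sampling effective: if all k clusters hold roughly a 1/k fraction of the data, the probability that a uniform sample of size s misses some cluster entirely shrinks geometrically in s.

```python
def miss_probability_bound(k, s):
    """Union bound on the chance that a uniform sample of size s
    contains no point from some cluster, assuming k perfectly
    balanced clusters (each holding a 1/k fraction of the data)."""
    return k * (1.0 - 1.0 / k) ** s

# With balanced clusters, a modest sample already represents every group:
for s in (50, 100, 200):
    print(s, miss_probability_bound(k=10, s=s))
```

With highly skewed cluster sizes no such guarantee holds, since a small cluster can easily be missed; this is the intuition behind sampling being effective precisely when balanced clusters are sought.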
Alexander Strehl
2002-05-03