Next: Impact of Similarity Measures
Up: Relationship-based Clustering and Visualization
Previous: Parallel Implementation
  Contents
Summary
A recent poll (June 2001) by KDNuggets (http://www.kdnuggets.com/)
indicated that clustering was by far the most popular type of analysis
in the last 12 months at 22% (followed by direct marketing at 14%
and cross-sell models at 12%).
The clustering process is characterized by extended exploratory phases
in which better domain understanding is gained. In this iterative
process, the crucially important definitions of features and/or
similarity are often refined. The visualization toolkit CLUSION
allows even non-specialists to gain an intuitive visual impression of
the grouping nature of objects that may originally be defined in a
high-dimensional space.
Taking CLUSION from a post-processing step into the loop can
significantly accelerate the process of discovering domain knowledge,
as it provides a powerful visual aid for assessing and improving
clustering. For example, actionable recommendations for splitting or
merging of clusters can be easily derived and readily applied via a
point-and-click user interface, and different similarity metrics can
be compared visually. It also guides the user towards the `right'
number of clusters. A demo of this tool can be found at
http://strehl.com/.
This chapter originally stemmed from our encounter with several retail
datasets where, even after substantial pre-processing, we were left
with records of over 1000 attributes, and where further attempts to
reduce the number of attributes by selection or projection led to the
loss of vital information.
Relationship-based clustering provides one way out by transforming the
data to another space (in time linear in the number of dimensions)
where the high dimensionality gets `hidden', since once similarity
is computed, the original dimensions are not encountered again.
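The transformation can be sketched as follows. Cosine similarity and the toy data sizes below are illustrative assumptions on our part, not the specific measure used in the chapter; the point is only that the similarity computation is linear in the number of dimensions, after which the original coordinates never reappear:

```python
import numpy as np

def similarity_matrix(X):
    """Map n objects in d dimensions to an n x n similarity matrix.

    Cost is O(n^2 * d), i.e., linear in the dimensionality d. Once
    this matrix is built, all further work happens in similarity
    space and the d original attributes are not touched again.
    Cosine similarity is used purely as an illustration.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # unit-normalize each row
    return Xn @ Xn.T                      # pairwise cosine similarities

rng = np.random.default_rng(0)
X = rng.random((5, 1000))    # 5 objects, 1000 attributes
S = similarity_matrix(X)
print(S.shape)               # (5, 5): the high dimensionality is now 'hidden'
```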
This suggests a connection of our approach with kernel-based methods,
such as support vector machines, that are currently very popular for
classification problems [Vap95,Joa98]. A kernel function of two
vectors is a generalized inner product between the corresponding
mappings of these vectors into a derived (and typically very high
dimensional) feature space. Thus one can view it as a similarity
measure between the two original vectors. It will be worthwhile to
further investigate this connection for a variety of applications
[JH99].
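As a concrete instance of this correspondence, a degree-2 polynomial kernel (chosen here purely for illustration) computed in the original space equals an ordinary inner product after an explicit mapping into a derived feature space of all degree-2 monomials:

```python
import numpy as np

def poly2_kernel(x, y):
    # k(x, y) = (x . y)^2, evaluated without leaving the input space
    return float(np.dot(x, y)) ** 2

def phi(x):
    # Explicit feature map into the derived space: all products x_i * x_j
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
# The kernel is a generalized inner product of the mapped vectors,
# and can thus be read as a similarity measure on the original ones:
assert np.isclose(poly2_kernel(x, y), np.dot(phi(x), phi(y)))
```

For d input dimensions the derived space here has d^2 coordinates, yet the kernel never materializes them; this is the same "hiding" of dimensionality that the relationship-based transformation exploits.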
The clustering algorithm presented in this chapter is largely geared
towards the needs of segmenting transactional data, with provisions
for obtaining balanced clusters and for letting the quantity of
interest (revenue, margins) influence the grouping. Thus, rather than
evaluating business objectives (such as revenue contribution) after
clustering is done, they are integrated directly into the clustering
algorithm. Moreover, the algorithm is a natural fit with the
visualization approach, and it extends to other domains, as
illustrated by our results on clustering documents and grouping web
logs. We also examined several ways of scaling the clustering routine
to a large number of data points, and elaborated on one approach that
can use sampling effectively because of the balanced nature of the
desired clusters.
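A rough back-of-the-envelope argument (a union bound of our own, offered as an illustrative sketch rather than the chapter's analysis) indicates why balance makes sampling effective: if all k clusters hold roughly a 1/k fraction of the data, the probability that a uniform sample of size s misses some cluster entirely shrinks geometrically in s.

```python
def miss_probability_bound(k, s):
    """Union bound on the chance that a uniform sample of size s
    contains no point from some cluster, assuming k perfectly
    balanced clusters (each holding a 1/k fraction of the data)."""
    return k * (1.0 - 1.0 / k) ** s

# With balanced clusters, a modest sample already represents every group:
for s in (50, 100, 200):
    print(s, miss_probability_bound(k=10, s=s))
```

With highly skewed cluster sizes no such guarantee holds, since a small cluster can easily be missed; this is the intuition behind sampling being effective precisely when balanced clusters are sought.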
Alexander Strehl
2002-05-03