One picture is worth ten thousand words.
- Frederick R. Barnard
In several real-life data mining applications, data resides in
very high-dimensional spaces (1000 dimensions and more), where both
clustering techniques developed for low-dimensional spaces (k-means,
BIRCH, CLARANS, CURE, DBSCAN, etc.) and visualization methods such as
parallel coordinates or projective visualizations are rendered
ineffective. This chapter proposes a relationship-based approach that
alleviates both problems, side-stepping the 'curse of
dimensionality' issue by working in a suitable similarity space
instead of the original high-dimensional attribute space. This
intermediary similarity space can be suitably tailored to satisfy
business criteria such as requiring customer clusters to represent
comparable amounts of revenue. We apply efficient and scalable
graph partitioning-based clustering techniques in this space. The
output from the clustering algorithm is used to re-order the data
points so that the resulting permuted similarity matrix can be readily
visualized in two dimensions, with clusters showing up as bands. While
two-dimensional visualization of a similarity matrix is by itself not
novel, its combination with the order-sensitive partitioning of a
graph that captures the relevant similarity measure between objects
provides three powerful properties: (i) the high dimensionality of the
data does not affect further processing once the similarity space is
formed; (ii) it leads to clusters of (approximately) equal importance;
and (iii) related clusters show up adjacent to one another, further
facilitating the visualization of results. The visualization is very
helpful for assessing and improving a clustering. For example,
actionable recommendations for splitting or merging clusters can be
easily derived from it, and it also guides the user towards the right
number of clusters. Results are presented on a real retail industry data set
of several thousand customers and products, as well as on clustering
of web-document collections and of web-log sessions.
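
The re-ordering step described above can be illustrated with a minimal
sketch. It is not the chapter's implementation: the toy data, the choice
of cosine similarity, the hard-coded cluster labels, and the helper
names cosine_similarity_matrix and permute_by_cluster are all
assumptions made for illustration. In particular, the labels stand in
for the output of a graph partitioner, and sorting by label only makes
each cluster contiguous; it does not reproduce the order-sensitive
partitioning that places related clusters next to one another.

```python
import math

def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between the row vectors in `rows`."""
    norms = [math.sqrt(sum(x * x for x in r)) for r in rows]
    n = len(rows)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            denom = norms[i] * norms[j]
            # Guard against all-zero vectors, which have no direction.
            sim[i][j] = dot / denom if denom > 0 else 0.0
    return sim

def permute_by_cluster(sim, labels):
    """Reorder rows and columns so that each cluster's members are contiguous."""
    # sorted() is stable, so objects keep their relative order within a cluster.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    permuted = [[sim[i][j] for j in order] for i in order]
    return order, permuted

# Hypothetical toy data: six customers described by four product counts.
rows = [
    [5, 0, 0, 1],
    [0, 4, 3, 0],
    [4, 1, 0, 0],
    [0, 3, 5, 0],
    [6, 0, 1, 1],
    [0, 5, 4, 1],
]
# Assumed cluster labels, standing in for the output of a clustering algorithm.
labels = [0, 1, 0, 1, 0, 1]

sim = cosine_similarity_matrix(rows)
order, permuted = permute_by_cluster(sim, labels)

# Crude text rendering: '#' marks a similarity above 0.5, '.' anything else.
for row in permuted:
    print("".join("#" if s > 0.5 else "." for s in row))
```

After the permutation, objects from the same cluster occupy adjacent
rows and columns, so high within-cluster similarities form dark square
blocks along the diagonal. From then on only the n-by-n similarity
matrix is used, regardless of how many attributes the original data
had.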