
Comparison

CLUSION gives a relationship-centered view, in contrast to common projective techniques, such as the selection of dominant features or optimal linear projections (PCA), which are object-centered. In CLUSION, the actual features are transparent; instead, all pair-wise relationships, which are the relevant aspect for the purpose of clustering, are displayed.
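
The relationship-centered idea can be illustrated with a minimal sketch: compute all pair-wise similarities and permute the objects so that members of the same cluster are adjacent, which places each cluster in a block along the diagonal. This is not the CLUSION implementation; the function name `reorder_similarity` and the choice of cosine similarity are assumptions made for illustration only.

```python
import numpy as np

def reorder_similarity(X, labels):
    """Cosine-similarity matrix of the rows of X, permuted so that
    objects with the same cluster label are adjacent.
    (Cosine similarity is an assumed choice, not taken from the text.)"""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Normalize rows to unit length; leave all-zero rows unchanged.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    U = X / norms
    S = U @ U.T  # all pair-wise similarities, shape (n, n)
    # A stable sort groups objects by cluster and keeps their original order.
    order = np.argsort(labels, kind="stable")
    return S[np.ix_(order, order)], order

# Toy data: four objects, two features, two clusters.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
labels = [0, 1, 0, 1]
S_perm, order = reorder_similarity(X, labels)
```

Displaying `S_perm` as a gray-level image would then show the features only indirectly, through the similarities they induce, while the block structure of the matrix carries the cluster information.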

Figure 3.3: Comparison of cluster visualization techniques. All tools work well on the 4-dimensional IRIS data (a). But on the 2903-dimensional YAHOO news document data (b), only CLUSION reveals that clusters 1 and 2 are actually highly related, cluster 3 is strong and interdisciplinary, 4 is weak, and 5 is strong.

Figure 3.3 compares CLUSION with some other popular visualizations. In Figure 3.3(a), parallel axis, PCA projection, CViz (projection through the plane defined by the centroids of clusters 1, 2, and 3), and CLUSION all succeed in visualizing the IRIS data (see also appendix A.2). Membership in cluster 1 / 2 / 3 is indicated by colors red / blue / green (parallel axis), colors red / blue / green and shapes $\circ$ / $\times$ / $+$ (PCA and CViz), and position on the diagonal from the upper left to the lower right corner (CLUSION), respectively. All four tools succeed in visualizing three clusters and in making apparent that clusters 2 and 3 are closer to each other than any other pair and that cluster 1 is very compact.

Figure 3.3(b) shows the same comparison for 293 documents from which 2903 word frequencies were extracted to be used as features. This data consists of 5 clusters selected from 40 clusters extracted from a Yahoo! news document collection, which will be described in more detail in subsection 3.5.2 (YAHOO). The colors black / magenta and the shapes $\Box$ / $\ast$ have been added to indicate cluster 4 / 5, respectively. The parallel axis plot becomes useless clutter due to the high number of dimensions as well as the large number of objects. PCA and CViz succeed in separating three clusters each (2, 3, 5 and 1, 2, 3, respectively) and show all others superimposed on the axis origin. They give no indication of which clusters are compact or which clusters are related. Only CLUSION suggests that clusters 1 and 2 are actually highly related, cluster 3 is interdisciplinary, 4 is weak, and 5 is a strong cluster. Indeed, the cluster descriptions (which might not be so easily available and understandable in all domains) confirm the intuitive interpretations revealed by CLUSION:
cluster  dominant category  purity  entropy  most frequent word stems
1        health (H)         100%    0.00     hiv, depress, immun
2        health (H)         100%    0.00     weight, infant, babi
3        online (o)          58%    0.43     apple, intel, electron
4        film (f)            38%    0.72     hbo, ali, alan
5        television (t)      83%    0.26     household, sitcom, timeslot
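
The purity and entropy columns above could be computed from the category labels of the documents in each cluster roughly as follows. This is a sketch under stated assumptions: the entropy is normalized by the logarithm of the number of categories so that it lies in [0, 1], which is consistent with the values in the table but is not defined in this section, and `purity_and_entropy` and `n_categories` are hypothetical names.

```python
import math
from collections import Counter

def purity_and_entropy(categories, n_categories):
    """Dominant category, purity, and normalized entropy of one cluster.

    categories   : category label of every document in the cluster
    n_categories : total number of categories in the collection
                   (assumed normalizer; not specified in the text)
    """
    counts = Counter(categories)
    n = len(categories)
    dominant, n_dominant = counts.most_common(1)[0]
    purity = n_dominant / n  # fraction of documents in the dominant category
    # Shannon entropy of the category distribution inside the cluster.
    entropy = sum(-(c / n) * math.log(c / n) for c in counts.values())
    if n_categories > 1:
        entropy /= math.log(n_categories)  # scale to [0, 1]
    return dominant, purity, entropy

# Hypothetical cluster with three health documents and one online document.
print(purity_and_entropy(["H", "H", "H", "o"], n_categories=4))
```

A cluster drawn from a single category has purity 1 and entropy 0, matching the pattern of clusters 1 and 2 in the table.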



Note that the dominant category, purity, and entropy are only available where a supervised categorization is given. Of course, the categorization cannot be used to tune the clustering. Clusters 1 and 2 contain only documents from the Health category, so they are highly related. The 4th cluster, which CLUSION indicates to be weak, has in fact the lowest purity in the group, with 38% of its documents from the most dominant category (film). CLUSION also suggests that cluster 3 is not only strong, as indicated by the dark diagonal region, but also has distinctly above-average relationships to all four other clusters. On inspecting the word stems typifying this cluster (Apple, Intel, and electron(ics)), it is apparent that this is because of the interdisciplinary appearance of technology-savvy words in recent news releases. Since such cluster descriptions might not be so easily available or well understood in all domains, the intuitive display of CLUSION is very useful.

CLUSION has several other powerful properties. For example, it can be integrated with product hierarchies (meta-data) to provide simultaneous customer and product clustering, as well as multi-level views / summaries. It also has a graphical user interface so that one can interactively browse / split / merge a data set, which is of great help in speeding up the iterations of analysis during a data mining project.
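
The visual reading of a dark diagonal region (a strong cluster) and of above-average off-diagonal regions (related clusters) can be approximated numerically by averaging the similarities within each block. The sketch below is illustrative only and is not part of CLUSION; `block_means` is a hypothetical helper, and the toy similarity matrix is made up.

```python
import numpy as np

def block_means(S, labels):
    """Mean similarity within and between clusters.

    Entry (a, b) of the result is the average similarity between objects
    of cluster a and objects of cluster b; self-similarities are excluded
    from the diagonal entries."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    B = np.full((len(ids), len(ids)), np.nan)
    for a, ca in enumerate(ids):
        for b, cb in enumerate(ids):
            block = S[np.ix_(labels == ca, labels == cb)]
            if a == b:
                n = block.shape[0]
                if n > 1:
                    # Drop the n self-similarities on the diagonal.
                    B[a, b] = (block.sum() - np.trace(block)) / (n * (n - 1))
            else:
                B[a, b] = block.mean()
    return ids, B

# Made-up symmetric similarity matrix for four objects in two clusters.
S = np.array([[1.0, 0.8, 0.1, 0.2],
              [0.8, 1.0, 0.3, 0.0],
              [0.1, 0.3, 1.0, 0.6],
              [0.2, 0.0, 0.6, 1.0]])
ids, B = block_means(S, [0, 0, 1, 1])
```

A large diagonal entry of `B` corresponds to a compact cluster, and a large off-diagonal entry corresponds to a pair of related clusters, such as clusters 1 and 2 in Figure 3.3(b).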
Alexander Strehl 2002-05-03