Visualization
Approaches to visualizing high-dimensional data clusters fall largely
into four popular groups:
- Dimensionality reduction by selection of 2 or 3 dimensions,
or, more generally, projecting the data down to 2 or 3 dimensions.
Often these dimensions correspond to principal components or a
scalable approximation thereof (e.g., FASTMAP
[FL95]). Chen, for example, creates a browsable
2-dimensional space of authors through co-citations [Che99].
Another noteworthy method is CViz [DMS98], which projects
onto the plane that passes through three selected cluster centroids to
yield a `discrimination optimal' 2-dimensional projection. These
projections are useful for a medium number of dimensions, i.e., if the
number of dimensions is not too large (fewer than about 100).
Nonlinear projections have also been studied [CG01].
Recreating a 2- or 3-dimensional space from a similarity graph can
also be done through multi-dimensional scaling [Tor52].
- Parallel axis plots show each object as a line across
parallel axes. However, this technique is rendered ineffective if the
number of dimensions or the number of objects gets too high.
- Kohonen's Self Organizing Map (SOM) [Koh90]
provides an innovative and powerful way of clustering while enforcing
constraints on a logical topology imposed on the cluster centers. If
this topology is 2-dimensional, one can readily "visualize" the
clustering of data. Essentially a 2-dimensional manifold is mapped
onto the (typically higher dimensional) feature space, trying to
approximate data density while maintaining topological constraints.
Since the mapping is not bijective, the quality can degrade very
rapidly with increasing dimensionality of feature space, unless the
data is largely confined to a much lower order manifold within this
space [CG01]. Multi-Dimensional Scaling (MDS) and associated
methods also face similar issues.
- Visualization can also be done by showing the data matrix as
an image by converting entries to brightness values. The ordering of
data points for visualization has previously been used in conjunction
with clustering in different contexts. For example, in OPTICS
[ABKS99] instead of producing an explicit clustering, an
augmented ordering of the database is produced. Subsequently, this
ordering is used to display various metrics such as reachability
values.
In cluster analysis of genome data [ESBB98], re-ordering the
primary data matrix and representing it graphically has been explored.
This visualization takes place in the primary data space rather than
in the relationship-space.
Sparse primary data matrix reorderings have also been considered for
browsing hypertext [BHR96].
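Two of the approaches above are simple enough to sketch in a few lines. The following is a minimal illustration, not the implementation of any cited system: the first function projects data onto the plane through three cluster centroids (in the spirit of the CViz idea described in the first item), and the other two reorder a data matrix by cluster label and rescale its entries to brightness values (as in the fourth item). All function names, the synthetic data, and the min-max brightness scaling are assumptions made for this example. The projection also assumes the three centroids are not collinear.

```python
import numpy as np


def centroid_plane_projection(X, centroids):
    """Project rows of X onto the plane through three centroids.

    Hypothetical helper: builds an orthonormal basis (u, v) of the plane
    by Gram-Schmidt and returns 2-D coordinates relative to centroid 0.
    Assumes the three centroids are not collinear.
    """
    c0, c1, c2 = np.asarray(centroids, dtype=float)
    u = c1 - c0
    u /= np.linalg.norm(u)
    v = c2 - c0
    v -= (v @ u) * u          # remove the component along u
    v /= np.linalg.norm(v)
    basis = np.stack([u, v], axis=1)   # shape (d, 2)
    return (np.asarray(X, dtype=float) - c0) @ basis


def reorder_by_cluster(X, labels):
    """Sort rows so that objects of the same cluster are adjacent."""
    order = np.argsort(labels, kind="stable")
    return X[order], order


def to_brightness(X):
    """Rescale matrix entries to [0, 1] (min-max scaling, an assumption)."""
    lo, hi = X.min(), X.max()
    if hi == lo:
        return np.zeros_like(X, dtype=float)
    return (X - lo) / (hi - lo)


# Synthetic example: 60 objects in 10 dimensions, 3 clusters of 20 each.
rng = np.random.default_rng(0)
true_centers = rng.normal(scale=5.0, size=(3, 10))
labels = rng.permutation(np.repeat(np.arange(3), 20))
X = true_centers[labels] + rng.normal(size=(60, 10))

# Centroids are estimated here from the known labels for simplicity.
centroids = np.stack([X[labels == k].mean(axis=0) for k in range(3)])

Y = centroid_plane_projection(X, centroids)   # (60, 2) scatter-plot coordinates
X_sorted, order = reorder_by_cluster(X, labels)
image = to_brightness(X_sorted)               # (60, 10) values in [0, 1]
```

The array `Y` can be passed to any scatter-plot routine, and `image` to any routine that renders a matrix as a grayscale picture. Because the projection basis is orthonormal and the plane contains all three centroids, the distances between the centroids are preserved exactly in the 2-dimensional view. Distances between other points are in general shortened.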
A useful survey of visualization methods for data mining in general
can be found in [KK96]. The popular books by E. Tufte
[Tuf83] on visualizing information are also recommended.
Alexander Strehl
2002-05-03