next up previous contents
Next: Clustering Objectives Up: Challenges Previous: The Problem of Scale   Contents


Curse of Dimensionality

When dealing with very high-dimensional data, one is faced with the `curse of dimensionality' [Fri94] and the associated sparsity issues. Essentially the amount of data to sustain a given spatial density increases exponentially with the dimensionality of the input space, or alternatively, the sparsity increases exponentially given a constant amount of data, with points tending to become equidistant from one another. In general, this will adversely impact any method based on spatial density, unless the data follows certain simple distributions. Figure 2.2 gives a simple illustration of the curse of dimensionality.

Figure 2.2: Curse of dimensionality illustrated with 256 $ d$-dimensional points from a [0,1] uniform distribution with $ d = 2$ (left), 4 (middle) and 32 (right). The top row shows the results of the 2D Principal Components Analysis (PCA). The bottom row shows how similarity (as a monotonically decreasing function of Euclidean distance) is distributed. As $ d$ increases, projections approach Gaussian distributions. Also, an average pair of points' similarity decreases rapidly and similarities become approximately equal for most pairs with increasing $ d$.
\resizebox{.9\columnwidth}{!}{\includegraphics*{epsexcl/cursebetterbw}}


next up previous contents
Next: Clustering Objectives Up: Challenges Previous: The Problem of Scale   Contents
Alexander Strehl 2002-05-03