Next: Contributions Up: Introduction Previous: Relationship-based Clustering Approach Contents

Current Challenges in Clustering

Many traditional clustering techniques [Har75,Nie81,JD88] do not perform satisfactorily in data mining scenarios due to a variety of reasons. These reasons can be divided into those arising from the data distribution and those caused by application constraints:

Data Distribution

Large number of samples.

The number of samples to be processed is very high. Algorithms have to be very conscious of scaling issues. Like many interesting problems, clustering in general is NP-hard, and practical and successful data mining algorithms usually scale linear or log-linear. Quadratic and cubic scaling may also be allowable but a linear behavior is highly desirable.

High dimensionality.

The number of features is very high and may even exceed the number of samples. So one has to face the curse of dimensionality [Fri94].

Sparsity.

Most features are zero for most samples, i.e. the object-feature matrix is sparse. This property strongly affects the measurements of similarity and the computational complexity.

Strong non-Gaussian distribution of feature values.

The data is so skewed that it can not be safely modeled by normal distributions.

Significant outliers.

Outliers may have significant importance. Finding these outliers is highly non-trivial, and removing them is not necessarily desirable.
Application context

Legacy clusterings.

Previous cluster analysis results are often available. This knowledge should be reused instead of starting each analysis from scratch.

Distributed data.

Large systems often have heterogeneous distributed data sources. Local cluster analysis results have to be integrated into global models.

For example, in market-basket analysis and text document clustering, we found the number of samples ranging from

, each sample having around

attributes. On average, over 99% of the attributes are zero, resulting in a very non-Gaussian feature value distribution. In fact, the distribution is often modeled best by a Poisson with a point mass at 0. Outliers are often present and important, such as restaurant owners in grocery shopping records, or index pages in document clustering.

In the document clustering application context, a multitude of legacy clusterings is available from several sources, such as Yahoo!, DMOZ, or Northern Light, which can be exploited for current analysis. The new challenges of high-dimensional, large-scale, heterogeneous databases create the need for new approaches to clustering.

Next: Contributions Up: Introduction Previous: Relationship-based Clustering Approach Contents

Alexander Strehl 2002-05-03