Next: Contributions Up: Introduction Previous: Relationship-based Clustering Approach   Contents

Current Challenges in Clustering

Many traditional clustering techniques [Har75,Nie81,JD88] do not perform satisfactorily in data mining scenarios, for a variety of reasons. These reasons fall into two groups: those arising from the data distribution and those caused by application constraints.

For example, in market-basket analysis and text document clustering, we found the number of samples to range from $10^3$ to $10^5$, with each sample having around $10^3$ to $10^5$ attributes. On average, over 99% of the attribute values are zero, resulting in a very non-Gaussian distribution of feature values. In fact, the distribution is often modeled best by a Poisson distribution with an additional point mass at 0. Outliers are often present and important, such as restaurant owners in grocery shopping records or index pages in document clustering.
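To make the sparsity figures above concrete, the following is a minimal sketch that draws synthetic count data from a Poisson distribution with an extra point mass at 0 and measures the fraction of zero entries. The function names and the parameter values (`p_zero`, `rate`, the matrix size) are assumptions chosen for illustration only; they are not taken from the data sets discussed here.

```python
import numpy as np


def sample_zero_inflated_poisson(n_samples, n_attributes, p_zero, rate, seed=0):
    """Draw a count matrix from a Poisson with an extra point mass at 0.

    p_zero and rate are illustrative assumptions, not fitted values.
    """
    rng = np.random.default_rng(seed)
    # Ordinary Poisson counts for every (sample, attribute) cell.
    counts = rng.poisson(lam=rate, size=(n_samples, n_attributes))
    # With probability p_zero a cell is forced to zero (the point mass).
    forced_zero = rng.random((n_samples, n_attributes)) < p_zero
    counts[forced_zero] = 0
    return counts


def zero_fraction(matrix):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(matrix == 0))


# Hypothetical data set at the lower end of the sizes quoted above.
X = sample_zero_inflated_poisson(
    n_samples=1000, n_attributes=1000, p_zero=0.99, rate=2.0
)
print(f"fraction of zero entries: {zero_fraction(X):.4f}")
```

Under these assumed parameters the zero fraction comes out slightly above 0.99, because the Poisson component also produces some zeros of its own. Data of this kind is usually stored in a sparse matrix format rather than as a dense array.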

In the document clustering application context, a multitude of legacy clusterings is available from several sources, such as Yahoo!, DMOZ, or Northern Light, which can be exploited for current analysis. The new challenges of high-dimensional, large-scale, heterogeneous databases create the need for new approaches to clustering.


Alexander Strehl 2002-05-03