
Recent Database-driven Approaches

In BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [ZRL97], the authors propose to incrementally build a balanced tree representing the data. Each node $ \mathcal{C}_{\ell}$ holds $ k$ Cluster-Features (CF), each defined as the triplet of the three sufficient statistics $ n_{\ell} = \vert \mathcal{C}_{\ell} \vert$, $ \mathrm{LS}_{\ell} = \sum_{\mathbf{x}_j \in \mathcal{C}_{\ell}} \mathbf{x}_j$, and $ \mathrm{SS}_{\ell} = \sum_{\mathbf{x}_j \in \mathcal{C}_{\ell}} \mathbf{x}_j^2$. These zeroth, first, and second statistical moments suffice to compute the centroid (mean), the radius (isotropic standard deviation), and the diameter (average pairwise intra-cluster Euclidean distance) at any time during the incremental algorithm. The user parameters are the branching factor (maximum number of children) and a splitting threshold on the diameter of a cluster. After building this tree, a clustering algorithm of the user's choice is used to cluster the leaf nodes of the BIRCH tree.
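
The following Python sketch (an illustration, not the code of [ZRL97]) shows how one such Cluster-Feature can be maintained incrementally and how centroid, radius, and diameter follow from the three statistics; the class and method names are chosen for this example only, and the tree construction is omitted.

import numpy as np

class ClusterFeature:
    # sufficient statistics (n, LS, SS) of one sub-cluster; illustrative only
    def __init__(self, dim):
        self.n = 0                  # zeroth moment: number of samples
        self.ls = np.zeros(dim)     # first moment: linear sum of the samples
        self.ss = np.zeros(dim)     # second moment: sum of the squared samples

    def insert(self, x):
        # absorb one sample incrementally
        self.n += 1
        self.ls += x
        self.ss += x * x

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # isotropic standard deviation of the samples around the centroid
        return np.sqrt(np.sum(self.ss) / self.n - np.sum(self.centroid() ** 2))

    def diameter(self):
        # square root of the average pairwise squared intra-cluster distance,
        # computable from n, LS, and SS alone (requires n >= 2)
        sq = 2.0 * (self.n * np.sum(self.ss) - np.sum(self.ls ** 2))
        return np.sqrt(sq / (self.n * (self.n - 1)))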

CURE (Clustering Using REpresentatives) [GRS98] uses a fixed number of well-scattered representatives for each cluster. This allows the algorithm to occupy the middle ground between storing only the centroid and retaining all samples. The representation is thus non-parametric, which allows for non-Gaussian distributions. A shrinking factor increases robustness by scaling the representatives towards the corresponding cluster center, which reduces the effect of outliers.
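
As a concrete illustration of representative selection and shrinking for a single cluster, the following Python sketch uses a greedy farthest-first heuristic; the function name and the parameters c (number of representatives) and alpha (shrinking factor) are assumptions for this example, not the exact procedure of [GRS98].

import numpy as np

def cure_representatives(points, c=5, alpha=0.3):
    # hypothetical helper: pick c well-scattered representatives of one cluster
    # (greedy farthest-first) and shrink them towards the cluster centroid
    centroid = points.mean(axis=0)
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(c, len(points)):
        # add the point farthest from its nearest current representative
        dist = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dist)])
    reps = np.asarray(reps)
    # moving each representative a fraction alpha towards the centroid
    # dampens the influence of outliers
    return reps + alpha * (centroid - reps)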

In CLARANS (Clustering Large Applications based upon RANdomized Search) [NH94], the authors propose to cluster a randomly selected subset of the data using a $ k$-medoids-based algorithm. Subsequently, neighboring solutions (clusterings with one medoid replaced) are explored, and the sub-sample is re-randomized until a local optimum is found. The entire process is repeated to de-localize the search and find a more global solution.
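
The randomized neighbor search can be sketched as follows in Python; the cost function (total distance of samples to their nearest medoid) and the parameter names numlocal and maxneighbor are illustrative, and the sub-sampling described above is omitted for brevity.

import random
import numpy as np

def clustering_cost(data, medoids):
    # total distance of each sample to its nearest medoid
    d = np.linalg.norm(data[:, None, :] - data[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans(data, k, numlocal=2, maxneighbor=50, seed=0):
    rng = random.Random(seed)
    n = len(data)
    best, best_cost = None, np.inf
    for _ in range(numlocal):                 # restarts de-localize the search
        current = rng.sample(range(n), k)     # random initial set of medoids
        current_cost = clustering_cost(data, current)
        tried = 0
        while tried < maxneighbor:
            # a neighboring solution replaces one medoid by a random non-medoid
            neighbor = list(current)
            neighbor[rng.randrange(k)] = rng.choice(
                [i for i in range(n) if i not in current])
            neighbor_cost = clustering_cost(data, neighbor)
            if neighbor_cost < current_cost:  # accept the improvement, reset counter
                current, current_cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        if current_cost < best_cost:          # keep the best local optimum found
            best, best_cost = current, current_cost
    return best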

When sampling, success depends strongly on the size of the subset, which has to be large enough to preserve the information in the original data. In large database applications, random sub-sampling is often as expensive as an entire scan, since the underlying data structures are not tuned for random access.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [EKSX96] is based on the notions of core objects (defined by having a minimum number of points within an $ \epsilon$-neighborhood), density-reachability (non-symmetric), and density-connectedness (symmetric). However, considering the high dimensionality of many data mining applications (and the consequent sparseness of data in feature space), it seems questionable whether the notion of density remains meaningful.
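
The two basic notions can be stated compactly; in the following Python sketch the brute-force neighborhood computation and the function names are illustrative only.

import numpy as np

def is_core_object(data, p, eps, min_pts):
    # p is a core object if at least min_pts samples (including itself)
    # lie within its eps-neighborhood
    return np.count_nonzero(np.linalg.norm(data - data[p], axis=1) <= eps) >= min_pts

def directly_density_reachable(data, q, p, eps, min_pts):
    # non-symmetric: q must lie in the eps-neighborhood of p, and p (but not
    # necessarily q) must be a core object
    return (np.linalg.norm(data[q] - data[p]) <= eps
            and is_core_object(data, p, eps, min_pts))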

