In BIRCH (Balanced Iterative Reducing and Clustering using
Hierarchies) [ZRL97], the authors propose to
incrementally build a balanced tree representing the data. Each node
holds
Cluster-Features (CF), defined as the
triplet of the three sufficient statistics $N$,
$\vec{LS} = \sum_{i=1}^{N} \vec{x}_i$, and
$SS = \sum_{i=1}^{N} \vec{x}_i^{\,2}$. These
zeroth, first and second statistical moments are sufficient to compute
the centroid (mean), the radius (isotropic standard deviation), and the
diameter (average pairwise intra-cluster Euclidean distance) at any time during
the incremental algorithm. The user parameters are the branching
factor (maximum number of children) and a splitting threshold on the
diameter of a cluster. After building this tree, a clustering
algorithm of the user's choice is used to cluster the leaf nodes of the
BIRCH tree.
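A minimal sketch of how such a CF triplet can be maintained and queried incrementally (our own illustration in Python with numpy; the class and attribute names are not from the paper):

```python
import numpy as np

class ClusterFeature:
    """Sketch of a BIRCH cluster feature: the triplet (N, LS, SS)."""

    def __init__(self, dim):
        self.n = 0                # zeroth moment: number of points
        self.ls = np.zeros(dim)   # first moment: linear sum of the points
        self.ss = 0.0             # second moment: sum of squared norms

    def insert(self, x):
        """Absorb one point; CFs are additive, so merging two CFs works the same way."""
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Isotropic standard deviation around the centroid.
        return np.sqrt(max(self.ss / self.n - (self.ls @ self.ls) / self.n ** 2, 0.0))

    def diameter(self):
        # Root of the average squared pairwise intra-cluster distance.
        if self.n < 2:
            return 0.0
        return np.sqrt(2.0 * (self.n * self.ss - self.ls @ self.ls)
                       / (self.n * (self.n - 1)))
```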
CURE (Clustering Using REpresentatives) [GRS98] uses a fixed number of well-scattered representatives for each cluster. This lets the algorithm adopt a middle ground between storing only the centroid and retaining all samples: the representation is non-parametric and can therefore capture non-Gaussian cluster shapes. A shrinking factor increases robustness by moving each representative toward the corresponding cluster center, which reduces the effect of outliers.
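As a rough illustration of the representative step (our own sketch; the greedy farthest-point selection and the parameter names `num_reps` and `alpha` are assumptions, not the paper's notation):

```python
import numpy as np

def shrunk_representatives(points, num_reps, alpha, rng=None):
    """Pick well-scattered representatives and shrink them toward the centroid."""
    rng = rng or np.random.default_rng(0)
    centroid = points.mean(axis=0)
    # Greedy farthest-point selection yields well-scattered representatives.
    reps = [points[int(np.argmax(np.linalg.norm(points - centroid, axis=1)))]]
    while len(reps) < min(num_reps, len(points)):
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[int(np.argmax(dists))])
    # Shrinking by alpha pulls each representative toward the center,
    # dampening the influence of outliers.
    return np.array([r + alpha * (centroid - r) for r in reps])
```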
In CLARANS (Clustering Large Applications based upon
RANdomized Search) [NH94], the authors propose to cluster on a
randomly selected sub-set of the data using a $k$-medoids based
algorithm. Subsequently, neighboring solutions (a clustering with one
replaced medoid) are explored and the sub-sample is re-randomized
until a local optimum is found. The entire process is repeated to
de-localize the search and find a more global solution.
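A compact sketch of this randomized search (our own code; `num_local` and `max_neighbor` stand for the restart and neighbor-trial limits described above, which are user parameters in the paper):

```python
import numpy as np

def clarans(data, k, num_local, max_neighbor, rng=None):
    """Randomized search over k-medoids solutions, in the spirit of CLARANS."""
    rng = rng or np.random.default_rng(0)

    def cost(medoids):
        # Total distance of every point to its nearest medoid.
        d = np.linalg.norm(data[:, None, :] - data[medoids][None, :, :], axis=2)
        return d.min(axis=1).sum()

    best, best_cost = None, np.inf
    for _ in range(num_local):                    # restarts de-localize the search
        current = list(rng.choice(len(data), size=k, replace=False))
        current_cost = cost(current)
        tries = 0
        while tries < max_neighbor:
            neighbor = current.copy()             # neighbor: one medoid replaced
            candidate = int(rng.integers(len(data)))
            if candidate in neighbor:
                continue
            neighbor[int(rng.integers(k))] = candidate
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:      # move downhill, reset trial counter
                current, current_cost, tries = neighbor, neighbor_cost, 0
            else:
                tries += 1
        if current_cost < best_cost:              # keep the best local optimum
            best, best_cost = current, current_cost
    return best, best_cost
```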
When sampling, success depends strongly on the size of the sub-set, which has to be large enough to preserve the structure of the original data. In large database applications, random sub-sampling is often as expensive as a full scan, since the underlying data structures are not tuned for random access.
DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) [EKSX96] is based on the notion of core objects
(defined by having a minimum number of points within an
$\varepsilon$-neighborhood), density-reachability (non-symmetric), and
density-connectedness (symmetric). However, considering the
high number of dimensions in data mining applications (and consequently the
sparseness of data in feature space), it seems questionable whether the
notion of density remains meaningful.
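For concreteness, the core-object predicate can be written as follows (a sketch under our own naming; `eps` and `min_pts` stand for the paper's two user parameters):

```python
import numpy as np

def is_core_object(data, i, eps, min_pts):
    """True if point i has at least min_pts points in its eps-neighborhood.

    Points in that neighborhood are directly density-reachable from a core
    object; chaining this relation gives density-reachability.
    """
    dists = np.linalg.norm(data - data[i], axis=1)
    # The neighborhood conventionally includes the query point itself.
    return int((dists <= eps).sum()) >= min_pts
```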