
The $ k$-means Framework

The $ k$-means algorithm is a simple yet powerful iterative technique to partition a data set into $ k$ disjoint clusters, where the value of $ k$ has to be pre-determined [DH73,Har75]. A generalized, similarity-based description of the algorithm can be given as follows (a code sketch is given after the list):

  1. Start at $ t=0$ with $ k$ randomly selected objects as the cluster centers $ \mathbf{c}_{\ell}^{(0)}$, $ \ell \in \{1,\ldots,k \}$.
  2. Assign each object $ \mathbf{x}_j$ to the cluster center with maximum similarity:

    $\displaystyle \lambda_j^{(t)} = \arg \max_{\ell \in \{1,\ldots,k \}} s(\mathbf{x}_j, \mathbf{c}_{\ell}^{(t)})$ (2.1)

  3. Update all $ k$ cluster means, where $ \mathcal{C}_{\ell}^{(t)} = \{ \mathbf{x}_j : \lambda_j^{(t)} = \ell \}$ denotes the set of objects assigned to cluster $ \ell$ at iteration $ t$:

    $\displaystyle \mathbf{c}_{\ell}^{(t+1)} = \frac{1}{\vert \mathcal{C}_{\ell}^{(t)} \vert } \sum_{\lambda_j^{(t)} = \ell} \mathbf{x}_j$ (2.2)

  4. If any $ \mathbf{c}_{\ell}^{(t+1)}$ differs from $ \mathbf{c}_{\ell}^{(t)}$, set $ t \leftarrow t+1$ and go to step 2, unless termination criteria (such as exceeding the maximum number of iterations) are met.
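The following is a minimal sketch of steps 1 through 4 in Python/NumPy. It is illustrative only: the function name kmeans, the max_iter cap, the handling of empty clusters, and the use of negative squared Euclidean distance as the similarity $ s$ are assumptions made here for concreteness, not part of the original description.

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=None):
        """Similarity-based k-means sketch; X is an (n, d) array of objects."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        # Step 1: k randomly selected objects as the initial cluster centers
        centers = X[rng.choice(n, size=k, replace=False)].copy()
        for t in range(max_iter):
            # Step 2: assign each x_j to the center with maximum similarity,
            # here s(x, c) = -||x - c||^2 (a strictly decreasing map of distance)
            sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = np.argmax(-sq_dist, axis=1)
            # Step 3: update all k cluster means (empty clusters keep their center)
            new_centers = np.array([X[labels == l].mean(axis=0)
                                    if np.any(labels == l) else centers[l]
                                    for l in range(k)])
            # Step 4: stop as soon as no center has moved
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

Calling labels, centers = kmeans(X, k) on an $ n \times d$ data matrix returns the cluster assignments $ \lambda_j$ and the final centroids under these assumptions.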
When similarity is based on a strictly monotonically decreasing mapping of Euclidean distance, $ k$-means greedily minimizes the sum of squared distances of the samples $ \mathbf{x}_j$ to their closest cluster centroids $ \mathbf{c}_{\lambda_j}$, as given by equation 2.3.

$\displaystyle \sum_{j=1}^n \Vert \mathbf{x}_j - \mathbf{c}_{\lambda_j} \Vert _2^2 = \sum_{\ell=1}^k \sum_{\mathbf{x} \in \mathcal{C}_{\ell}} \Vert \mathbf{x} - \mathbf{c}_{\ell} \Vert _2^2$ (2.3)
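To make the connection explicit (a supplementary remark, not part of the original text): if $ s(\mathbf{x}, \mathbf{c}) = f(\Vert \mathbf{x} - \mathbf{c} \Vert _2)$ for some strictly decreasing $ f$, the assignment in step 2 reduces to a nearest-centroid rule,

$\displaystyle \lambda_j^{(t)} = \arg \max_{\ell \in \{1,\ldots,k \}} f(\Vert \mathbf{x}_j - \mathbf{c}_{\ell}^{(t)} \Vert _2) = \arg \min_{\ell \in \{1,\ldots,k \}} \Vert \mathbf{x}_j - \mathbf{c}_{\ell}^{(t)} \Vert _2,$

and the mean update in step 3 minimizes the inner sum of equation 2.3 for fixed assignments. Each iteration therefore cannot increase the objective, which is what makes the procedure a greedy minimization.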

