Generalized -means (KM)

Next: Weighted Graph Partitioning (GP) Up: Algorithms Previous: Random Baseline (RND) Contents

Generalized -means (KM)

We also employed the well-known Euclidean -means algorithm and three variations of it using non-metric distance measures. The -means algorithm is an iterative algorithm to minimize the least squares error criterion [DH73]. A cluster $\mathcal{C}_{\ell}$ is represented by its center $\mu_{\ell}$ , the mean of all samples in $\mathcal{C}_{\ell}$ . The centers are initialized with a random selection of data objects. Each sample is then labeled with the index $\ell$ of the nearest or most similar center. In classical -means, `nearest' means the point with the smallest Euclidean distance. However, -means can be generalized by substituting nearness with any other notion of similarity. For our comparison, we will use all four definitions of the similarity $s(\mathbf{x}_a,\mathbf{x}_b)$ between two objects $\mathbf{x}_a$ and $\mathbf{x}_b$ as introduced in the previous section. Subsequent re-computing of the mean for each cluster and re-assigning the cluster labels is iterated until convergence to a fixed labeling after iterations. The complexity of this algorithm is $O(n \cdot d \cdot k \cdot m)$ .

Alexander Strehl 2002-05-03