
Generalized $k$-means (KM)

We also employed the well-known Euclidean $k$-means algorithm and three variations of it using non-metric distance measures. $k$-means is an iterative algorithm that minimizes the least-squares error criterion [DH73]. A cluster $\mathcal{C}_{\ell}$ is represented by its center $\mu_{\ell}$, the mean of all samples in $\mathcal{C}_{\ell}$. The centers are initialized with a random selection of $k$ data objects. Each sample is then labeled with the index $\ell$ of the nearest or most similar center. In classical $k$-means, `nearest' means smallest Euclidean distance. However, $k$-means can be generalized by substituting any other notion of similarity for nearness. For our comparison, we use all four definitions of the similarity $s(\mathbf{x}_a,\mathbf{x}_b)$ between two objects $\mathbf{x}_a$ and $\mathbf{x}_b$ introduced in the previous section. Re-computing the mean of each cluster and re-assigning the cluster labels is then iterated until convergence to a fixed labeling after $m$ iterations. The complexity of this algorithm is $O(n \cdot d \cdot k \cdot m)$.
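The iteration above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the two example similarity measures (negative Euclidean distance, recovering classical $k$-means, and cosine similarity as one stand-in for the measures of the previous section) are assumptions for the sketch.

```python
import numpy as np

def euclidean_similarity(a, b):
    # negative Euclidean distance: 'most similar' = 'nearest',
    # so this choice recovers classical k-means
    return -np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # one illustrative non-metric similarity s(x_a, x_b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def generalized_kmeans(X, k, sim, max_iter=100, seed=0):
    """Generalized k-means: 'nearest' is defined by the supplied
    similarity function sim(x_a, x_b) instead of Euclidean distance."""
    rng = np.random.default_rng(seed)
    # initialize centers with a random selection of k data objects
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # label each sample with the index of the most similar center
        new_labels = np.array(
            [max(range(k), key=lambda l: sim(x, centers[l])) for x in X]
        )
        if np.array_equal(new_labels, labels):
            break  # converged to a fixed labeling
        labels = new_labels
        # re-compute each center as the mean of its cluster
        for l in range(k):
            if np.any(labels == l):
                centers[l] = X[labels == l].mean(axis=0)
    return labels, centers
```

With $n$ samples of dimension $d$, each of the $m$ iterations evaluates the similarity of $n$ samples against $k$ centers at cost $O(d)$ each, giving the stated $O(n \cdot d \cdot k \cdot m)$ complexity.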



Alexander Strehl 2002-05-03