next up previous contents
Next: Hypergraph Partitioning (HGP) Up: Algorithms Previous: Generalized -means (KM)   Contents

Weighted Graph Partitioning (GP)

Clustering can be posed as a graph partitioning problem. The objects are viewed as the set of vertices $ \mathcal{V}$. Two documents $ \mathbf{x}_a$ and $ \mathbf{x}_b$ (or vertices $ v_a$ and $ v_b$) are connected with an undirected edge of positive weight $ s(\mathbf{x}_a,\mathbf{x}_b)$, or $ (a,b,s(\mathbf{x}_a,\mathbf{x}_b)) \in \mathcal{E}$. The cardinality of the set of edges $ \vert\mathcal{E}\vert$ equals the number of non-zero similarities between all pairs of samples. A set of edges whose removal partitions a graph $ \mathcal{G}=(\mathcal{V},\mathcal{E})$ into $ k$ pair-wise disjoint sub-graphs $ \mathcal{G}_{\ell}
=(\mathcal{V}_{\ell},\mathcal{E}_{\ell})$, is called an edge separator. The objective in graph partitioning is to find such a separator with a minimum sum of edge weights. While striving for the minimum cut objective, the number of objects in each cluster has to be kept approximately equal. We produce balanced (equal sized) clusters from the similarity matrix using the multi-level graph partitioner METIS [KK98a]. The most expensive step in this $ O(n^2
\cdot d)$ technique is the computation of the $ n \times n$ similarity matrix. In document clustering, sparsity can be induced by looking only at the $ v$ strongest edges or at the subgraph induced by pruning all edges except the $ v$ nearest-neighbors for each vertex. Sparsity makes this approach feasible for large data-sets. Sparsity is induced by particular similarities definitions based e.g., on the cosine of document vectors.


next up previous contents
Next: Hypergraph Partitioning (HGP) Up: Algorithms Previous: Generalized -means (KM)   Contents
Alexander Strehl 2002-05-03