Weighted Graph Partitioning (GP)

Next: Hypergraph Partitioning (HGP) Up: Algorithms Previous: Generalized -means (KM) Contents

Weighted Graph Partitioning (GP)

Clustering can be posed as a graph partitioning problem. The objects are viewed as the set of vertices $\mathcal{V}$ . Two documents $\mathbf{x}_a$ and $\mathbf{x}_b$ (or vertices and ) are connected with an undirected edge of positive weight $s(\mathbf{x}_a,\mathbf{x}_b)$ , or $(a,b,s(\mathbf{x}_a,\mathbf{x}_b)) \in \mathcal{E}$ . The cardinality of the set of edges $\vert\mathcal{E}\vert$ equals the number of non-zero similarities between all pairs of samples. A set of edges whose removal partitions a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ into pair-wise disjoint sub-graphs $\mathcal{G}_{\ell} =(\mathcal{V}_{\ell},\mathcal{E}_{\ell})$ , is called an edge separator. The objective in graph partitioning is to find such a separator with a minimum sum of edge weights. While striving for the minimum cut objective, the number of objects in each cluster has to be kept approximately equal. We produce balanced (equal sized) clusters from the similarity matrix using the multi-level graph partitioner METIS [KK98a]. The most expensive step in this $O(n^2 \cdot d)$ technique is the computation of the $n \times n$ similarity matrix. In document clustering, sparsity can be induced by looking only at the strongest edges or at the subgraph induced by pruning all edges except the nearest-neighbors for each vertex. Sparsity makes this approach feasible for large data-sets. Sparsity is induced by particular similarities definitions based e.g., on the cosine of document vectors.

Next: Hypergraph Partitioning (HGP) Up: Algorithms Previous: Generalized -means (KM) Contents

Alexander Strehl 2002-05-03