Clustering can be posed as a graph partitioning problem. The objects
are viewed as the set of vertices
. Two documents
and
(or vertices
and
) are
connected with an undirected edge of positive weight
, or
. The cardinality of the set of edges
equals the number of non-zero similarities between all pairs of
samples. A set of edges whose removal partitions a graph
into
pair-wise disjoint
sub-graphs
, is called an edge
separator. The objective in graph partitioning is to find such a
separator with a minimum sum of edge weights. While striving for the
minimum cut objective, the number of objects in each cluster has to be
kept approximately equal. We produce balanced (equal sized) clusters
from the similarity matrix using the multi-level
graph partitioner METIS [KK98a].
The most expensive step in this
technique is
the computation of the
similarity matrix. In document
clustering, sparsity can be induced by looking only at the
strongest edges or at the subgraph induced by pruning all edges except
the
nearest-neighbors for each vertex. Sparsity makes this
approach feasible for large data-sets. Sparsity is induced by
particular similarities definitions based e.g., on the cosine of
document vectors.