The objects to be clustered can be viewed as a set of vertices . Two web-pages and (or vertices and ) are connected with an undirected edge of positive weight , or . The cardinality of the set of edges equals the number of non-zero similarities between all pairs of samples. A set of edges whose removal partitions a graph into pair-wise disjoint sub-graphs , is called an edge separator. Our objective is to find such a separator with a minimum sum of edge weights. While striving for the minimum cut objective, the number of objects in each cluster has to be kept approximately equal. In this particular case of balancing, the problem is NP-hard and known as graph partitioning.
Balanced clusters are desirable because each cluster represents an equally important share of the data. However, some natural classes may not be equal size. By using a higher number of clusters we can account for multi-modal classes (e.g., XOR-problem) and clusters can be merged at a latter stage. The most expensive step in this technique is the computation of the similarity matrix. In document clustering, sparsity can be induced by looking only at the strongest edges or at the subgraph induced by pruning all edges except the nearest-neighbors for each vertex. Sparsity makes this approach feasible for large data-sets.
The Kernighan-Lin (KL) algorithm has been a very successful algorithm for graph partitioning [KL70]. The KL is based on iteratively conducting best sequences of swaps involving all vertices. In an improved implementation by Fiduccia and Mattheyses [FM82], the complexity was of the KL algorithm was reduced to .
In spectral bisection [HL95,PSL90] the partitioning problem is reduced to finding the second least dominant eigenvector of the Laplacian matrix . Consider the incidence matrix whose entries are defined as follows:
(2.16) |
(2.17) |
Currently, the best available algorithms use the KL-algorithm or spectral bisection in a multi-level framework which has three stages:
A hypergraph is a graph whose edges can connect more than two vertices (hyperedges). The clustering problem is then formulated as a finding the minimum-cut of a hypergraph. A minimum-cut is the removal of the set of hyperedges (with minimum edge weight) that separates the hypergraph into unconnected components. Again, an object maps to a vertex . Each feature maps to a hyperedge connecting all vertices with a non-zero value for this particular feature, so . The minimum-cut of this hypergraph into unconnected components gives the desired clustering. Hypergraphs are often used in VLSI design. An application to association rule hypergraph clustering can be found in [BGG$^+$99].
A variety of packages for graph partitioning is available. A Matlab kit for geometric mesh partitioning (requires coordinate information for vertices) [GMT95] and spectral bi-section [PSL90] by John R. Gilbert and Shang-Hua Teng is available from Xerox. Currently, the most popular programs for graph partitioning are CHACO (available from Bruce Hendrickson at Los Alamos National Labs) [HL94] and METIS (George Karypis, University of Minnesota) [KK98a,KK98b]. Both implement the currently fastest available algorithm, a multi-level version of Kernighan-Lin [KL70,FM82]. A popular package for hypergraph partitioning is HMETIS (George Karypis, University of Minnesota) [KAKS97].