The objects to be clustered can be viewed as a set of vertices
. Two web-pages
and
(or
vertices and ) are connected with an undirected edge of
positive weight
, or
. The cardinality
of the set of edges
equals the number of *non-zero* similarities between all pairs of samples. A set of edges
whose removal partitions a graph
into pair-wise disjoint
sub-graphs
, is called an edge
separator. Our objective is to find such a separator with a minimum
sum of edge weights. While striving for the minimum cut objective,
the number of objects in each cluster has to be kept approximately
equal. In this particular case of balancing, the problem is NP-hard
and known as *graph partitioning*.

Balanced clusters are desirable because each cluster represents an equally important share of the data. However, some natural classes may not be equal size. By using a higher number of clusters we can account for multi-modal classes (e.g., XOR-problem) and clusters can be merged at a latter stage. The most expensive step in this technique is the computation of the similarity matrix. In document clustering, sparsity can be induced by looking only at the strongest edges or at the subgraph induced by pruning all edges except the nearest-neighbors for each vertex. Sparsity makes this approach feasible for large data-sets.

The Kernighan-Lin (KL) algorithm has been a very successful algorithm for graph partitioning [KL70]. The KL is based on iteratively conducting best sequences of swaps involving all vertices. In an improved implementation by Fiduccia and Mattheyses [FM82], the complexity was of the KL algorithm was reduced to .

In *spectral bisection* [HL95,PSL90] the partitioning
problem is reduced to finding the second least dominant eigenvector of
the Laplacian matrix . Consider the incidence matrix
whose entries are defined as follows:

(2.16) |

The Laplacian can be written as

(2.17) |

and has the vertex degrees on the diagonal. Off-diagonal entries indicate -1 if an edge between the row- and column-vertex exists and 0 otherwise. By construction, is symmetric and thus has real eigenvalues and its eigenvectors are real and orthogonal. Moreover, all columns and rows sum up to 1. Consequently the smallest eigenvalue is 0 with eigenvector , or . For graph bi-section (), it turns out that the ratio-cut objective is maximized when splitting according to the signs of the entries in the second smallest eigenvector of the Laplacian, which is also called the Fiedler vector [Fie73,Fie75]. This algorithm can be extended to and weighted edges.

Currently, the best available algorithms use the KL-algorithm or
spectral bisection in a *multi-level framework* which has three
stages:

- Coarsen graph by collapsing appropriate vertices
- Initial partitioning of simplified graph
- Un-coarsen graph and refine partitioning

A hypergraph is a graph whose edges can connect more than two vertices (hyperedges). The clustering problem is then formulated as a finding the minimum-cut of a hypergraph. A minimum-cut is the removal of the set of hyperedges (with minimum edge weight) that separates the hypergraph into unconnected components. Again, an object maps to a vertex . Each feature maps to a hyperedge connecting all vertices with a non-zero value for this particular feature, so . The minimum-cut of this hypergraph into unconnected components gives the desired clustering. Hypergraphs are often used in VLSI design. An application to association rule hypergraph clustering can be found in [BGG$^+$99].

A variety of packages for graph partitioning is available. A Matlab kit for geometric mesh partitioning (requires coordinate information for vertices) [GMT95] and spectral bi-section [PSL90] by John R. Gilbert and Shang-Hua Teng is available from Xerox. Currently, the most popular programs for graph partitioning are CHACO (available from Bruce Hendrickson at Los Alamos National Labs) [HL94] and METIS (George Karypis, University of Minnesota) [KK98a,KK98b]. Both implement the currently fastest available algorithm, a multi-level version of Kernighan-Lin [KL70,FM82]. A popular package for hypergraph partitioning is HMETIS (George Karypis, University of Minnesota) [KAKS97].