Determining the Number of Clusters
Finding the `right' number of clusters $k$ for a dataset is a
difficult and often ill-posed problem, since even for the same
dataset, there can be several answers depending on the scale or
granularity one is interested in. In probabilistic approaches to
clustering, likelihood ratios, Bayesian techniques and Monte Carlo
cross-validation are popular. In non-probabilistic methods, a
regularization approach, which penalizes large $k$, is often
adopted. If the data is labeled, then mutual information between
cluster and class labels can be used to determine the number of
clusters. Other metrics, such as purity of clusters or entropy, are of
less use, as they are biased towards a larger number of clusters
(see chapter 4).
For transactional data, the number of clusters is often specified by the
end-user to be typically between 7 and 14 [BL97]. Otherwise, one can
employ a suitable heuristic to obtain an appropriate value of $k$
during the clustering process.
This subsection describes how we find a desirable clustering, with
high overall cluster quality $\Lambda$ and a small number of
clusters $k$.
Our objective is to maximize intra-cluster similarity and minimize
inter-cluster similarity, given by
\begin{displaymath}
\Gamma^{\rm (intra)}_{\ell} = \frac{1}{n_{\ell} (n_{\ell} - 1)}
\sum_{a :\, \lambda_a = \ell} \; \sum_{b \neq a :\, \lambda_b = \ell}
s(\mathbf{x}_a, \mathbf{x}_b)
\end{displaymath}
and
\begin{displaymath}
\Gamma^{\rm (inter)}_{\ell, \hat{\ell}} = \frac{1}{n_{\ell} n_{\hat{\ell}}}
\sum_{a :\, \lambda_a = \ell} \; \sum_{b :\, \lambda_b = \hat{\ell}}
s(\mathbf{x}_a, \mathbf{x}_b) ,
\end{displaymath}
respectively, where $\ell$ and $\hat{\ell}$ are cluster indices. Note that
intra-cluster similarity is undefined ($0/0$) for singleton
clusters. Hence, we define our quality measure $\Lambda \leq 1$
($\Lambda < 0$ in case of pathological / inverse clustering) based on
the ratio of weighted average inter-cluster to weighted average
intra-cluster similarity:
\begin{displaymath}
\Lambda = 1 - \frac{\left( \sum_{\ell} \sum_{\hat{\ell} \neq \ell}
n_{\ell} n_{\hat{\ell}} \, \Gamma^{\rm (inter)}_{\ell, \hat{\ell}} \right)
\Big/ \left( \sum_{\ell} \sum_{\hat{\ell} \neq \ell} n_{\ell} n_{\hat{\ell}} \right)}
{\left( \sum_{\ell} n_{\ell} \, \Gamma^{\rm (intra)}_{\ell} \right)
\Big/ \left( \sum_{\ell} n_{\ell} \right)}
\qquad (3.5)
\end{displaymath}
$\Lambda = 0$ indicates that samples within the same cluster are on
average not more similar than samples from different clusters. On the
contrary, $\Lambda = 1$ describes a clustering where every pair of
samples from different clusters has a similarity of 0 and at least
one sample pair from the same cluster has a non-zero similarity. Note
that our definition of quality does not take the `amount of balance'
into account, since balancing is already observed fairly strictly by
the constraints in the graph partitioning step.
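The quality measure can be sketched in a few lines of code. The following is a minimal illustration, assuming a symmetric similarity matrix `S` and an integer label vector as (hypothetical) inputs, and using pair counts as the averaging weights; the exact weighting in equation (3.5) may differ:

```python
import numpy as np

def clustering_quality(S, labels):
    """Sketch of the quality measure Lambda:
    1 - (weighted avg inter-cluster sim) / (weighted avg intra-cluster sim).
    S: (n, n) symmetric similarity matrix; labels: length-n cluster labels."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    intra_num = intra_den = 0.0  # weighted intra-cluster similarity
    inter_num = inter_den = 0.0  # weighted inter-cluster similarity
    for l in clusters:
        idx = np.flatnonzero(labels == l)
        n_l = len(idx)
        if n_l < 2:
            continue  # intra-cluster similarity is undefined (0/0) for singletons
        # average pairwise similarity within cluster l, excluding self-pairs
        block = S[np.ix_(idx, idx)]
        gamma_intra = (block.sum() - np.trace(block)) / (n_l * (n_l - 1))
        intra_num += n_l * gamma_intra
        intra_den += n_l
        for m in clusters:
            if m == l:
                continue
            jdx = np.flatnonzero(labels == m)
            # average similarity between clusters l and m
            gamma_inter = S[np.ix_(idx, jdx)].mean()
            inter_num += n_l * len(jdx) * gamma_inter
            inter_den += n_l * len(jdx)
    return 1.0 - (inter_num / inter_den) / (intra_num / intra_den)
```

For a block-diagonal similarity matrix (perfect clusters, zero similarity across) this returns 1; for a uniform similarity matrix it returns 0, matching the interpretation of $\Lambda$ above.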
To achieve a high quality $\Lambda$ as well as a low $k$, the target
function is the product of the quality $\Lambda$ and a penalty term,
which works very well in practice.
If $n \geq 4$ and $2 \leq k \leq n/2$, then there
exists at least one clustering with no singleton clusters. The
penalized quality $\tilde{\Lambda}$ is defined as
$\tilde{\Lambda} = \left( 1 - \frac{2k}{n} \right) \Lambda$.
A modest linear penalty was
chosen, since our quality criterion does not necessarily improve with
increasing $k$ (unlike e.g. the squared error criterion).
For large $n$, we search for the optimal $k$ in the entire window
$2 \leq k \leq n/2$. In many cases, however, a forward search starting
at $k = 2$ and stopping at the first downtick of penalized quality while
increasing $k$ is sufficient.
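The forward search can be sketched as follows. This is an illustrative fragment, not the thesis implementation: `quality_for_k` is a hypothetical caller-supplied function returning the quality for a given number of clusters, and the penalty term $(1 - 2k/n)$ follows the linear penalty assumed above:

```python
def forward_search_k(n, quality_for_k, k_max=None):
    """Forward search for k: increase k from 2 and stop at the first
    downtick of the penalized quality (1 - 2k/n) * Lambda(k).
    quality_for_k: hypothetical callable mapping k to the quality Lambda."""
    if k_max is None:
        k_max = n // 2  # penalty vanishes at k = n/2
    best_k, best_pq = None, -float("inf")
    for k in range(2, k_max + 1):
        pq = (1.0 - 2.0 * k / n) * quality_for_k(k)
        if pq < best_pq:
            break  # first downtick of penalized quality: stop searching
        best_k, best_pq = k, pq
    return best_k, best_pq
```

Since the quality criterion need not improve monotonically with $k$, stopping at the first downtick is a heuristic that trades exhaustiveness for speed.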
Finally, a practical alternative, as exemplified by the experimental
results later, is to first over-cluster and then use the visualization
aid to combine clusters as needed (subsection 3.5.2).
Alexander Strehl
2002-05-03