
External (model-free, semi-supervised) Quality

External quality measures require an external grouping, for example as indicated by category labels, that is assumed to be `correct'. However, unlike in classification, such ground truth is not available to the clustering algorithm. This class of evaluation measures can be used to compare the end-to-end performance of any kind of clustering, regardless of the models or similarities used. However, since clustering is an unsupervised problem, performance cannot be judged with the same certitude as for a classification problem: the external categorization might not be optimal at all. For example, the way web-pages are organized in the Yahoo! taxonomy is certainly not the best structure possible. Nevertheless, achieving a grouping similar to the Yahoo! taxonomy is indicative of successful clustering.

Given $ g$ categories (or classes) $ \mathcal{K}_h$ ($ h \in \{1,\ldots,g\}$), we denote by $ \mathbf{\kappa}$ the categorization label vector, where $ x_a \in \mathcal{K}_h \Leftrightarrow \kappa_a = h$. Let $ n^{(h)}$ be the number of objects in category $ \mathcal{K}_h$ according to $ \mathbf{\kappa}$, and $ n_{\ell}$ the number of objects in cluster $ \mathcal{C}_{\ell}$ according to $ \mathbf{\lambda}$. Let $ n_{\ell}^{(h)}$ denote the number of objects that are in cluster $ \ell$ according to $ \mathbf{\lambda}$ as well as in category $ h$ according to $ \mathbf{\kappa}$.
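These counts can be collected in a $ k \times g$ contingency table with entries $ n_{\ell}^{(h)}$. As a concrete illustration, the following Python sketch (a hypothetical helper, not part of the original derivation; the function name and the 0-based label encoding are assumptions) builds such a table from the two label vectors:

import numpy as np

def contingency(lam, kap, k, g):
    # N[l, h] = n_l^(h): number of objects assigned to cluster l by lambda
    # and to category h by kappa; labels are assumed to be 0-based integers.
    N = np.zeros((k, g), dtype=int)
    for l, h in zip(lam, kap):
        N[l, h] += 1
    return N

# Toy example with n = 8 objects, k = 3 clusters, g = 2 categories.
lam = [0, 0, 1, 1, 1, 2, 2, 2]
kap = [0, 0, 0, 1, 1, 1, 1, 0]
N = contingency(lam, kap, k=3, g=2)
n_l = N.sum(axis=1)   # cluster sizes n_l
n_h = N.sum(axis=0)   # category sizes n^(h)

There are several ways of comparing the class labels with the cluster labels: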

Purity.
Purity can be interpreted as classification accuracy under the assumption that all objects of a cluster are classified to be members of the dominant class for that cluster. For a single cluster $ \mathcal{C}_{\ell}$, purity is defined as the ratio of the number of objects in the dominant category to the total number of objects:

$\displaystyle \phi^{(\mathrm{A})} (\mathcal{C}_{\ell},\mathbf{\kappa}) = \frac{1}{n_{\ell}} \max_h (n_{\ell}^{(h)})$ (4.17)

To evaluate an entire clustering, one computes the average of the cluster-wise purities weighted by cluster size:

$\displaystyle \phi^{(\mathrm{A})} (\mathbf{\lambda},\mathbf{\kappa}) = \frac{1}{n} \sum_{\ell=1}^{k} \max_h (n_{\ell}^{(h)})$ (4.18)
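As an illustration, a minimal sketch of equations (4.17) and (4.18), assuming a contingency table N whose rows are clusters and whose columns are categories (as built above); the function and variable names are illustrative only.

import numpy as np

def purity(N):
    # Cluster-wise purity (4.17) and size-weighted overall purity (4.18).
    # N[l, h] = n_l^(h).
    n = N.sum()
    per_cluster = N.max(axis=1) / N.sum(axis=1)   # (4.17)
    overall = N.max(axis=1).sum() / n             # (4.18)
    return per_cluster, overall

N = np.array([[5, 1, 0],
              [2, 3, 3]])
per_cluster, overall = purity(N)   # approx. [0.833, 0.375] and 0.571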

Entropy [CT91].
Entropy is a more comprehensive measure than purity: rather than only considering the number of objects `in' and `not in' the dominant class, it takes the entire category distribution within the cluster into account. Since a cluster with all objects from the same category has an entropy of 0, we define the entropy-based quality as 1 minus the [0,1]-normalized entropy. For a single cluster this gives:

$\displaystyle \phi^{(\mathrm{E})} (\mathcal{C}_{\ell},\mathbf{\kappa}) = 1 + \sum_{h=1}^{g} \frac{n_{\ell}^{(h)}}{n_{\ell}} \log_g \left( \frac{n_{\ell}^{(h)}}{n_{\ell}} \right)$ (4.19)

Weighted averaging over the clusters then yields the total entropy-based quality measure:

$\displaystyle \phi^{(\mathrm{E})} (\mathbf{\lambda},\mathbf{\kappa}) = 1 + \frac{1}{n} \sum_{\ell=1}^{k} \sum_{h=1}^{g} n_{\ell}^{(h)} \log_g \left( \frac{n_{\ell}^{(h)}}{n_{\ell}} \right)$ (4.20)
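In the same sketch style, and under the same assumptions about the contingency table N, equations (4.19) and (4.20) can be computed as follows; note the base-$ g$ logarithm, which keeps the quality in [0,1].

import numpy as np

def entropy_quality(N):
    # Per-cluster entropy-based quality (4.19) and weighted total (4.20).
    # N[l, h] = n_l^(h); logarithms are taken to base g.
    n = N.sum()
    g = N.shape[1]
    n_l = N.sum(axis=1, keepdims=True)
    p = N / n_l                                   # p = n_l^(h) / n_l
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(N > 0, p * np.log(p) / np.log(g), 0.0)
    per_cluster = 1.0 + terms.sum(axis=1)         # (4.19)
    total = 1.0 + (n_l.ravel() / n) @ terms.sum(axis=1)   # (4.20)
    return per_cluster, total

N = np.array([[5, 1, 0],
              [2, 3, 3]])
per_cluster, total = entropy_quality(N)   # a pure cluster scores 1.0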

Both purity and entropy are biased towards a large number of clusters. In fact, for both criteria the globally optimal value is trivially reached when each cluster contains a single object!
Precision, recall & $ \mathrm{F}$-measure [vR79].
Precision and recall are standard measures in the information retrieval community. If a cluster is viewed as the result of a query for a particular category, then precision is the fraction of retrieved objects (i.e., objects in the cluster) that actually belong to that category:

$\displaystyle \phi^{(\mathrm{P})}(\mathcal{C}_{\ell},\mathcal{K}_{h}) = n_{\ell}^{(h)} / n_{\ell}$ (4.21)

Recall is the fraction of correctly retrieved objects out of all matching objects in the database:

$\displaystyle \phi^{(\mathrm{R})}(\mathcal{C}_{\ell},\mathcal{K}_{h}) = n_{\ell}^{(h)} / n^{(h)}$ (4.22)

The $ \mathrm{F}$-measure combines precision and recall into a single number according to a weighting factor; the $ \mathrm{F}_1$-measure weights both equally. When querying for a particular category $ \mathcal{K}_{h}$, the $ \mathrm{F}_1$-measure of the best-matching cluster is:

$\displaystyle \phi^{(\mathrm{F}_1)}(\mathcal{K}_{h}) = \max_{\ell} \frac{2 \, \phi^{(\mathrm{P})}(\mathcal{C}_{\ell},\mathcal{K}_{h}) \, \phi^{(\mathrm{R})}(\mathcal{C}_{\ell},\mathcal{K}_{h})}{\phi^{(\mathrm{P})}(\mathcal{C}_{\ell},\mathcal{K}_{h}) + \phi^{(\mathrm{R})}(\mathcal{C}_{\ell},\mathcal{K}_{h})} = \max_{\ell} \frac{2 n_{\ell}^{(h)}}{n_{\ell} + n^{(h)}}$ (4.23)

Hence, for the entire clustering the total $ \mathrm{F}_1$-measure is:

$\displaystyle \phi^{(\mathrm{F}_1)} (\mathbf{\lambda},\mathbf{\kappa}) = \frac{1}{n} \sum_{h=1}^{g} n^{(h)} \max_{\ell} \frac{2 n_{\ell}^{(h)}}{n_{\ell} + n^{(h)}}$ (4.24)
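A sketch of equations (4.21) through (4.24) under the same assumptions about N; it uses the simplification $ \mathrm{F}_1 = 2 n_{\ell}^{(h)} / (n_{\ell} + n^{(h)})$ from (4.23) and weights the per-category values by category size as in (4.24).

import numpy as np

def f1_measure(N):
    # Per-category F1 (4.23) and category-size-weighted total (4.24).
    # Precision (4.21) and recall (4.22) are folded into the simplified form.
    n = N.sum()
    n_l = N.sum(axis=1)                 # cluster sizes n_l
    n_h = N.sum(axis=0)                 # category sizes n^(h)
    f1_lh = 2.0 * N / (n_l[:, None] + n_h[None, :])
    per_category = f1_lh.max(axis=0)    # best-matching cluster per category
    total = (n_h * per_category).sum() / n
    return per_category, total

N = np.array([[5, 1, 0],
              [2, 3, 3]])
per_category, total = f1_measure(N)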

Unlike purity and entropy, the $ \mathrm{F}_1$-measure is not biased towards a larger number of clusters; in fact, it favors coarser clusterings. Another issue is that a random clustering is not necessarily evaluated near 0.
Mutual information [CT91].
Mutual information is the most theoretically well-founded of the considered quality measures. It is unbiased and symmetric in $ \mathbf{\kappa}$ and $ \mathbf{\lambda}$. We propose a [0,1]-normalized mutual-information-based quality criterion $ \phi^{(\mathrm{NMI})}$, which can be computed as follows:

$\displaystyle \phi^{(\mathrm{NMI})} (\mathbf{\lambda},\mathbf{\kappa}) = \frac{2}{n} \sum_{\ell=1}^{k} \sum_{h=1}^{g} n_{\ell}^{(h)} \log_{k \cdot g} \left( \frac{ n_{\ell}^{(h)} n} { n^{(h)} n_{\ell} } \right)$ (4.25)
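A corresponding sketch of equation (4.25), again assuming the contingency table N; zero counts contribute nothing to the sum, and the base-$ k \cdot g$ logarithm provides the [0,1] normalization.

import numpy as np

def nmi_quality(N):
    # [0,1]-normalized mutual information (4.25); logs to base k*g.
    n = N.sum()
    k, g = N.shape
    n_l = N.sum(axis=1, keepdims=True)   # cluster sizes n_l
    n_h = N.sum(axis=0, keepdims=True)   # category sizes n^(h)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = N * n / (n_l * n_h)
        terms = np.where(N > 0, N * np.log(ratio) / np.log(k * g), 0.0)
    return 2.0 / n * terms.sum()

N = np.array([[5, 1, 0],
              [2, 3, 3]])
phi_nmi = nmi_quality(N)   # 0 in the limit for random labels, at most 1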

A detailed derivation of this definition can be found in appendix B.1. Mutual information does not suffer from the biases that affect purity, entropy, and the $ \mathrm{F}_1$-measure: singletons are not evaluated as perfect, and random clustering has a mutual information of 0 in the limit. However, the best possible labeling evaluates to less than 1 unless the classes are balanced.4.5 Note that, unlike the standard mutual information, our normalization penalizes over-refinement.4.6

External criteria enable a fair comparison of different clustering methods, provided the external ground truth is of good quality. One could argue against external criteria on the grounds that clustering does not have to perform as well as classification. However, in many cases clustering is an interim step towards better understanding and characterizing a complex data-set before further analysis and modeling.

Normalized mutual information will be our preferred evaluation measure in the next section, because it is an unbiased measure of how useful the knowledge captured by the clustering is for predicting category labels.

