External quality measures require an external grouping, for example as indicated by category labels, that is assumed to be `correct'. However, unlike in classification such ground truth is not available to the clustering algorithm. This class of evaluation measures can be used to compare start-to-end performance of any kind of clustering regardless of the models or similarities used. However, since clustering is an unsupervised problem, the performance cannot be judged with the same certitude as for a classification problem. The external categorization might not be optimal at all. For example, the way web-pages are organized in the Yahoo! taxonomy is certainly not the best structure possible. However, achieving a grouping similar to the Yahoo! taxonomy is certainly indicative of successful clustering.

Given categories (or classes) ( ), we denote the categorization label vector , where . Let be the number of objects in category according to , and the number of objects in cluster according to . Let denote the number of objects that are in cluster according to as well as in category given by . There are several ways of comparing the class labels with cluster labels:

**Purity.**- Purity can be interpreted as classification accuracy under the
assumption that all objects of a cluster are classified to be members
of the dominant class for that cluster. For a single cluster
, purity is defined as the ratio of the number of
objects in the
*dominant*category to the total number of objects:

To evaluate an entire clustering, one computes the average of the cluster-wise purities weighted by cluster size:

**Entropy [CT91].**- Entropy is a more comprehensive measure than purity since rather than
just considering the number of objects `in' and `not in' the dominant
class, it takes the entire distribution into account. Since a cluster
with all objects from the same category has an entropy of 0, we define
entropy-based quality as 1 minus the
[0,1]-normalized entropy. We define entropy-based quality
for each cluster as:

And through weighted averaging, the total entropy quality measure falls out to be:

Both, purity and entropy are biased to favor large number of clusters. In fact, for both these criteria, the globally optimal value is trivially reached when each cluster is a single object! **Precision, recall & -measure [vR79].**- Precision and recall are standard measures in the information
retrieval community. If a cluster is viewed as the results of a query for
a particular category, then precision is the fraction of correctly
retrieved objects:

Recall is the fraction of correctly retrieved objects out of all matching objects in the database:

The -measure combines precision and recall into a single number given a weighting factor. The -measure combines precision and recall with equal weights. The following equation gives the -measure when querying for a particular category

Hence, for the entire clustering the total -measure is:

Unlike purity and entropy, the -measure is not biased towards a larger number of clusters. In fact, it favors coarser clusterings. Another issue is that random clustering tends not to be evaluated at 0. **Mutual information [CT91].**- The mutual information is the most theoretically well-founded among
the considered quality measures. It is unbiased and symmetric in terms
of
and
. We propose a
[0,1]-normalized mutual information-based quality criterion
which can be computed as follows:

A detailed derivation of this definition can be found in appendix B.1. Mutual information does not suffer from the biases like purity, entropy and the -measure. Singletons are not evaluated as perfect. Random clustering has mutual information of 0 in the limit. However, the best possible labeling evaluates to less than 1, unless classes are balanced.^{4.5}Note that our normalization penalizes over-refinements unlike the standard mutual information.^{4.6}

Normalized mutual information will be our preferred choice of evaluation in the next section, because it is an unbiased measure for the usefulness of the knowledge captured in the clustering in predicting category labels.