In this section, we present and compare the results of the eleven approaches on the four document datasets. Clustering quality is assessed in terms of mutual information and balance. For each dataset we set the number of clusters to twice the number of categories, except for the REUT dataset, where we used fewer clusters since there are many small categories. Using more clusters than classes allows each class to have a multimodal distribution. For example, in an XOR-like problem there are two classes but four clusters.
Let us first look at a representative result to illustrate the behavior of some algorithms and our evaluation methodology. Figure 4.3 shows confusion matrices illustrating the quality differences of the RND, KM E, KM C, and GP C approaches on a sample of 800 documents from N20. The horizontal and vertical axes correspond to categories and clusters, respectively. Clusters are sorted in increasing order of their dominant category, and the darkness of each entry indicates the number of documents in the corresponding cluster and category. As expected, random partitioning (RND) results in indiscriminating clusters with a mutual information score near zero. The purity score indicates that, on average, the dominant category contributes only 16% of the objects in a cluster. However, since labels are drawn from a uniform distribution, cluster sizes are fairly well balanced. KM E delivers one large cluster (cluster 15) and many small clusters. This strongly imbalanced clustering is characteristic of KM E on high-dimensional sparse data and is problematic because it usually defeats application-specific purposes such as browsing; it also results in sub-random mutual information. KM C performs well: a `diagonal' is clearly visible in the confusion matrix, indicating that the clusters align with the ground-truth categorization, which is reflected in the overall mutual information score. Balance is good as well. GP C exceeds KM C in both mutual information and balance: the `diagonal' is stronger and the clusters are very balanced.
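The two quality numbers read off a confusion matrix can be computed directly from its counts. The following sketch (function names are ours, not from the text) estimates mutual information and purity from a clusters-by-categories count matrix:

```python
import numpy as np

def mutual_information(C):
    """Mutual information (in bits) between cluster and category labels,
    estimated from a confusion matrix C of counts (clusters x categories)."""
    C = np.asarray(C, dtype=float)
    p = C / C.sum()                      # joint distribution
    pr = p.sum(axis=1)                   # cluster marginals
    pc = p.sum(axis=0)                   # category marginals
    nz = p > 0                           # skip empty cells (0 log 0 = 0)
    return float((p[nz] * np.log2(p[nz] / np.outer(pr, pc)[nz])).sum())

def purity(C):
    """Size-weighted average fraction of each cluster that is
    contributed by its dominant category."""
    C = np.asarray(C, dtype=float)
    return float(C.max(axis=1).sum() / C.sum())
```

For a perfectly diagonal confusion matrix over g equally sized classes, purity is 1 and the mutual information equals log2(g) bits; for a uniform matrix, as produced in expectation by RND, the mutual information is 0.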

The remaining results are given in summarized form rather than the more detailed treatment of the example above, since the comparative trends are clear even at this macro level. Examples of detailed confusion matrices and pairwise t-tests can be found in our earlier work [SGM00].
For a systematic comparison, ten experiments were performed for each of the random samples of sizes 50, 100, 200, 400, and 800. Figure 4.4 shows performance curves in terms of (relative) mutual information, comparing the 10 algorithms on the 4 datasets. Each curve shows the difference in mutual-information-based quality relative to random partitioning at the 5 sample sizes; error bars indicate ±1 standard deviation over the 10 experiments. Figure 4.5 shows quality in terms of balance for the same combinations of datasets and algorithms, again with ±1 standard deviation error bars over the 10 experiments. Figure 4.6 summarizes the results on all four datasets at the largest sample size (800 documents). We also conducted pairwise t-tests to ensure that differences in average performance are significant. For illustration and brevity, we show mean performance with standard deviation bars rather than the t-test results (see our previous work [SGM00]).
First, we look at quality in terms of mutual information (figures 4.4 and 4.6(a)). With increasing sample size, the quality of the clusterings tends to improve. Non-metric (cosine, correlation, extended Jaccard) graph partitioning approaches work best on text data, followed by non-metric k-means approaches. Clearly, a non-metric (e.g., dot-product-based) similarity measure is necessary for good quality. Due to the conservative normalization, the maximum obtainable mutual information (for a perfect classifier!) tends to be around 0.8 to 0.9, depending on the given dataset. A mutual-information-based quality between 0.4 and 0.5 (approximately 0.3 to 0.4 better than random at the largest sample size) is therefore an excellent result.^{4.7} Hypergraph partitioning constitutes the third tier. Euclidean techniques, including the SOM, perform rather poorly. Surprisingly, the SOM still delivers significantly better-than-random results despite the limited expressiveness of the implicitly used Euclidean distances. The success of the SOM is explained by the fact that the Euclidean distance becomes locally meaningful once the cell centroids lock onto a good cluster.
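The fact that even a perfect classifier cannot reach a normalized mutual information of 1 when the number of clusters exceeds the number of classes can be seen with a small example. The sketch below uses a symmetric square-root normalization as a stand-in assumption; the exact conservative normalization used in this work may differ:

```python
import numpy as np

def nmi(C):
    """Mutual information normalized by sqrt(H(clusters) * H(classes)).
    NOTE: this symmetric normalization is one common choice and is an
    assumption here, not necessarily the normalization used in the text."""
    C = np.asarray(C, dtype=float)
    p = C / C.sum()
    pr, pc = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = (p[nz] * np.log2(p[nz] / np.outer(pr, pc)[nz])).sum()
    h = lambda q: -(q[q > 0] * np.log2(q[q > 0])).sum()  # entropy in bits
    return float(mi / np.sqrt(h(pr) * h(pc)))

# A "perfect" clustering that splits each of 2 classes into 2 pure,
# equally sized clusters: MI equals H(classes) = 1 bit, but
# H(clusters) = 2 bits, so the normalized score stays below 1.
C_perfect_split = np.array([[25, 0], [25, 0], [0, 25], [0, 25]])
```

Here `nmi(C_perfect_split)` is 1/sqrt(2), about 0.71, even though every cluster is pure, which is why scores well below 1 can still indicate near-perfect clusterings when k is twice the number of categories.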
All approaches behaved consistently over the four datasets, with only slight differences in scale caused by the datasets' varying complexity. Performance was best on YAHOO and WEBKB, followed by N20 and REUT. Interestingly, the gap between the GP and KM techniques is wider on YAHOO than on WEBKB. The low performance on REUT is probably due to its high number of classes (82) and their widely varying sizes.
In order to assess which approaches are more suitable for a particular number of objects, we also looked for intersections in the performance curves of the top algorithms (non-metric GP and KM, HGP).^{4.8} In our experiments, the curves do not intersect, indicating that the ranking of the top performers does not change over the range of dataset sizes considered.
In terms of balance (figures 4.5 and 4.6(b)), the advantages of graph partitioning are clear, since it explicitly tries to achieve balanced clusters. The second tier is hypergraph partitioning, which is also an explicitly balanced technique, followed by the non-metric k-means approaches. Poor balance is shown by the SOM and Euclidean k-means. Interestingly, balance does not change significantly for the k-means-based approaches as the number of samples increases, whereas the graph-partitioning-based approaches quickly approach perfect balance, as would be expected since they are explicitly designed to do so.
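A simple balance measure with the behavior described above is the ratio of average to maximum cluster size, which is 1 for perfectly balanced clusterings and approaches 0 for degenerate ones. This definition is our reading of the balance score used here, sketched for illustration:

```python
def balance(cluster_sizes):
    """Ratio of average to maximum cluster size: 1.0 means all clusters
    are equally sized; values near 0 indicate one dominant cluster."""
    sizes = list(cluster_sizes)
    return (sum(sizes) / len(sizes)) / max(sizes)
```

For example, `balance([200, 200, 200, 200])` is 1.0, while a degenerate one-large-cluster split such as `balance([770, 10, 10, 10])` is only about 0.26, the kind of score produced by KM E on high-dimensional sparse data.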
Non-metric graph partitioning is significantly better in terms of both mutual information and balance. There is no significant difference in performance among the non-metric similarity measures (cosine, correlation, and extended Jaccard). Euclidean-distance-based approaches do not perform better than random clustering.
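The three non-metric similarity measures compared here can be sketched with their standard definitions (the correlation is the cosine of the mean-centered vectors, and the extended Jaccard reduces to the set Jaccard coefficient on binary term vectors):

```python
import numpy as np

def cosine_sim(x, y):
    """Dot product of the unit-length-normalized vectors."""
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def correlation_sim(x, y):
    """Pearson correlation: cosine similarity after mean-centering."""
    return cosine_sim(x - x.mean(), y - y.mean())

def extended_jaccard_sim(x, y):
    """Extended Jaccard: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = float(x @ y)
    return dot / (float(x @ x) + float(y @ y) - dot)
```

All three equal 1 for identical vectors, and on non-negative term vectors cosine and extended Jaccard are 0 exactly when two documents share no terms, which makes them dot-product-based in the sense used above.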


