The key contributions of this chapter lie in highlighting the advantages of working in a similarity space when clustering very high-dimensional sparse data such as text documents, and in providing a framework for comparing several clustering approaches across a variety of similarity spaces using mutual information. Another useful contribution is the conceptual assessment of a variety of similarity measures and evaluation criteria.
The comparative results indicate that for word frequency-based clustering of web documents, graph partitioning is better suited than generalized -means, hypergraph partitioning, and SOM. The search procedure implicit in graph partitioning is far less local than the hill-climbing approach of -means. Moreover, it also provides a way to obtain balanced clusters and exhibit a lower variance in results.
Metric distances (such as Euclidean distance) are not appropriate for high-dimensional, sparse domains. Cosine, correlation and extended Jaccard measures are successful and perform equivalently in capturing the similarities implicitly indicated by manual categorizations.