I have had my results for a long time: but I do not yet
know how I am to arrive at them.
- Karl Friedrich Gauß
In the last chapter, we explored the relationship-based approach to clustering in several domains. The work was initially motivated by retail data and extended naturally to other domains where high-dimensional representations are prevalent, such as text documents and web logs. A particularly interesting application is the clustering of text documents, which enables unsupervised categorization and facilitates browsing and search. A critical step in adapting relationship-based clustering to a specific domain is the choice of similarity measure. In this chapter, we investigate the impact of similarity measures on clustering quality. We first introduce similarity measures and algorithms for text clustering, then develop a general comparative framework, and finally conduct case studies on a variety of text corpora.