*Clustering* has been widely studied in several disciplines, especially
since the early 1960s [Har75,JMF99]. Some classic approaches
include partitional methods such as k-means, hierarchical
agglomerative clustering, unsupervised Bayes, and soft techniques, such
as those based on fuzzy logic or statistical mechanics
[CG96]. Conceptual clustering
[Fis87b], which maximizes category utility, a measure of
predictability improvement in attribute values given a clustering, is
also popular in the machine learning community.
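One common statement of category utility (the formula is not given in the source; this is the standard form due to Gluck and Corter, as used in COBWEB-style conceptual clustering) for a partition $\{C_1,\dots,C_K\}$ over nominal attributes $A_i$ with values $V_{ij}$ is:

```latex
CU(\{C_1,\dots,C_K\}) = \frac{1}{K} \sum_{k=1}^{K} P(C_k)
\left[ \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2
     - \sum_i \sum_j P(A_i = V_{ij})^2 \right]
```

The bracketed term measures how much better attribute values can be predicted within a cluster than without the clustering; dividing by $K$ penalizes trivially fine partitions.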
In most classical techniques, and even in fairly recent ones proposed
in the data mining community (CLARANS, DBSCAN, BIRCH, CLIQUE, CURE,
WAVECLUSTER, etc. [RS99,HKT01]), the objects to be clustered
only have numerical attributes and are represented by low-dimensional
feature vectors. The clustering algorithm is then based on distances
between the samples in the original vector space [SWY75]. Thus
these techniques are faced with the `curse of dimensionality' and
the associated sparsity issues when dealing with very
high-dimensional data such as text. Indeed, the performance of
such clustering algorithms is often demonstrated only on illustrative
2-dimensional examples.

Clustering algorithms may take an alternative view based on a notion
of *similarity* or dissimilarity. Similarity is often derived from the
inner product between vector representations, a popular way to
quantify document similarity. In [DM01], the authors present
a spherical k-means algorithm for document clustering using this
similarity measure. Graph-based clustering approaches that attempt
to avoid the `curse of dimensionality' by transforming the problem
formulation into a similarity space include
[KHK99,BGG$^+$99,SG00c].
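A minimal sketch of clustering unit-normalized document vectors by cosine similarity, in the spirit of spherical k-means (initialization and iteration details here are assumptions, not taken from [DM01]):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Sketch of spherical k-means: cluster unit-normalized document
    vectors by cosine similarity (inner product after normalization)."""
    rng = np.random.default_rng(seed)
    # Normalize rows so the inner product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialize concept vectors from k distinct random documents.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each document to the most similar concept vector.
        labels = (X @ centers.T).argmax(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                centers[j] = m / np.linalg.norm(m)  # re-normalize the mean
    return labels, centers
```

Unlike Euclidean k-means, each concept vector is the *normalized* mean of its members, so it stays on the unit sphere.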
Finally, when only pairwise similarities are available, techniques
such as Multi-Dimensional Scaling (MDS) [Tor52] have been used
to embed such points into a low-dimensional space such that the
stress (relative difference between embedded point distances and
actual distances) is minimized. Clustering can then be done in the
embedding space. However, in document clustering this is not commonly
used since for acceptable stress levels the dimensionality of the
embedding space is too high.
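The stress criterion above can be sketched as follows (assuming Kruskal's stress-1 variant; the text does not specify which normalization is used):

```python
import numpy as np

def stress(D, Y):
    """Kruskal-style stress-1: relative mismatch between target
    dissimilarities D[i, j] and distances in the embedding Y."""
    n = len(D)
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_hat = np.linalg.norm(Y[i] - Y[j])  # embedded distance
            num += (D[i, j] - d_hat) ** 2
            den += D[i, j] ** 2
    return np.sqrt(num / den)
```

MDS searches for an embedding `Y` (of fixed low dimensionality) minimizing this quantity; stress 0 means the pairwise distances are reproduced exactly.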

Clustering has also been studied for the purpose of *browsing*.
A 2-dimensional Self-Organizing Map (SOM) [Koh95] has been
applied to produce a map of Usenet postings in WEBSOM
[HKLK97]. The emphasis in WEBSOM is not to maximize cluster
quality but to produce a human interpretable 2-dimensional spatial
map of known categories (e.g., newsgroups).
In the Scatter/Gather approach [CKPT92], document clustering is
used for improved interactive browsing of large query results. The
focus of this work is mostly on speed/scalability and not necessarily
maximum cluster quality.
In [ZE98], the effectiveness of clustering was studied for web
documents.

There is also substantial work on *categorizing* documents.
Here, since at least some of the documents have labels, a variety of
supervised or semi-supervised techniques can be used
[MR99,NMTM98]. A technique using the support vector machine is
discussed in [Joa98]. There are several comparative studies on
document classification [YP97,Yan99].

*Dimensionality reduction* for text classification/clustering has been
studied as well. Often, the data is projected onto a small number of
dimensions corresponding to principal components or a scalable
approximation thereof (e.g., FASTMAP [FL95]). In Latent
Semantic Indexing (LSI) [DDL$^+$90] the term-document matrix is
modeled by a rank-k approximation using the top k singular values.
While LSI was originally used for improved query processing in
information retrieval, the base idea can be employed for clustering as
well.
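The rank-k truncation underlying LSI can be sketched with a standard SVD (the toy term-document counts below are invented for illustration):

```python
import numpy as np

# Hypothetical toy term-document matrix (terms x documents).
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

# Rank-k approximation from the top-k singular triplets, as in LSI.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents can then be represented (queried, or clustered) in the
# k-dimensional latent space given by the columns of diag(s_k) @ Vt_k.
docs_k = np.diag(s[:k]) @ Vt[:k, :]
```

By the Eckart-Young theorem, `A_k` is the best rank-k approximation of the term-document matrix in the Frobenius norm.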

In *bag-of-words* approaches, the term-frequency matrix contains
occurrence counts of terms in documents. Often, the matrix is
preprocessed in order to enhance discrimination between
documents. There are many schemes for selecting term and global
normalization components. One popular preprocessing is normalized Term
Frequency and Inverse Document Frequency (TF-IDF), which also comes in
several variants [Sal89,BY99]. However, this dissertation will
not discuss the properties of feature extraction, see e.g.,
[Kol97,Lew92] instead. In [YP97,Yan99] classification
performance of several other preprocessing schemes are compared.
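A minimal sketch of one TF-IDF variant (raw term frequency times log inverse document frequency, with cosine normalization; as the text notes, several variants exist, and the toy documents below are invented):

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]

n_docs = len(docs)
# Document frequency: number of documents containing each term.
df = {}
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1

def tfidf(doc):
    """Term frequency times log inverse document frequency,
    normalized to unit Euclidean length."""
    counts = {}
    for t in doc:
        counts[t] = counts.get(t, 0) + 1
    w = {t: tf * math.log(n_docs / df[t]) for t, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w

vectors = [tfidf(d) for d in docs]
```

The IDF factor down-weights terms that occur in many documents, and the unit-length normalization makes inner products between vectors directly usable as cosine similarities.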