Clustering has been widely studied in several disciplines, especially
since the early 1960s [Har75,JMF99]. Some classic approaches
include partitional methods such as k-means, hierarchical
agglomerative clustering, unsupervised Bayes, and soft techniques, such
as those based on fuzzy logic or statistical mechanics
[CG96]. Conceptual clustering
[Fis87b], which maximizes category utility, a measure of
predictability improvement in attribute values given a clustering, is
also popular in the machine learning community.
In most classical techniques, and even in fairly recent ones proposed
in the data mining community (CLARANS, DBSCAN, BIRCH, CLIQUE, CURE,
WAVECLUSTER, etc. [RS99,HKT01]), the objects to be clustered
only have numerical attributes and are represented by low-dimensional
feature vectors. The clustering algorithm is then based on distances
between the samples in the original vector space [SWY75]. Thus
these techniques are faced with the `curse of dimensionality' and
the associated sparsity issues when dealing with very
high-dimensional data such as text. Indeed, the performance of
such clustering algorithms is often demonstrated only on illustrative
2-dimensional examples.
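The sparsity effect can be illustrated with a small experiment: for points drawn uniformly at random from the unit hypercube, the relative gap between the nearest and the farthest neighbor of a query point shrinks as the dimension grows. The following is a minimal sketch and is not taken from the cited work; the function name `relative_contrast`, the sample size, and the uniform data model are assumptions chosen for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)
```

Calling `relative_contrast(dim)` for increasing values such as 2, 10, 100, and 1000 should show the contrast dropping, which suggests why distance-based methods lose discriminative power on high-dimensional data such as text.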
Clustering algorithms may take an alternative view based on a notion
of similarity or dissimilarity. Similarity is often derived from the
inner product between vector representations, a popular way to
quantify document similarity. In [DM01], the authors present
a spherical k-means algorithm for document clustering using this
similarity measure. Graph-based clustering approaches attempt
to avoid the `curse of dimensionality' by transforming the problem
formulation into a similarity space.
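As an illustration of inner-product-based similarity, the sketch below computes the inner product of two vectors after scaling both to unit length, which is the cosine of the angle between them. It shows only the similarity measure, not the clustering algorithm of [DM01]; the helper name `cosine_similarity` and the toy term-count vectors are assumptions made for this example.

```python
import math


def cosine_similarity(x, y):
    """Inner product of x and y after scaling both to unit Euclidean length."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention assumed here for empty documents
    return sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)


# Toy term-count vectors over a four-word vocabulary (made-up data).
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]   # same direction as doc_a, twice as long
doc_c = [0, 0, 3, 1]   # shares no terms with doc_a
```

Because the vectors are normalized, `doc_a` and `doc_b` are treated as maximally similar despite their different lengths, while `doc_a` and `doc_c` have zero similarity.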
Finally, when only pairwise similarities are available, techniques
such as Multi-Dimensional Scaling (MDS) [Tor52] have been used
to embed such points into a low-dimensional space such that the
stress (relative difference between embedded point distances and
actual distances) is minimized. Clustering can then be done in the
embedding space. However, in document clustering this is not commonly
used since for acceptable stress levels the dimensionality of the
embedding space is too high.
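The notion of stress can be made concrete with a small sketch. The normalization used below (squared differences divided by the squared original distances) is one common choice and is an assumption here; the definition in [Tor52] or in a particular MDS implementation may differ. The function name `stress` and the toy data are likewise illustrative.

```python
import math
from itertools import combinations


def stress(target_dist, embedded_points):
    """Normalized stress: sqrt(sum (d_ij - e_ij)^2 / sum d_ij^2), where d_ij
    are the given dissimilarities and e_ij the Euclidean distances between
    the embedded points. The normalization is an assumption of this sketch."""
    num = 0.0
    den = 0.0
    for i, j in combinations(range(len(embedded_points)), 2):
        e_ij = math.dist(embedded_points[i], embedded_points[j])
        d_ij = target_dist[i][j]
        num += (d_ij - e_ij) ** 2
        den += d_ij ** 2
    return math.sqrt(num / den)


# Three objects whose pairwise dissimilarities form a 3-4-5 right triangle.
target = [[0, 3, 4],
          [3, 0, 5],
          [4, 5, 0]]
exact_2d = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]   # reproduces all distances
forced_1d = [(0.0,), (3.0,), (-4.0,)]             # squeezed onto a line
```

The 2-dimensional configuration reproduces every distance and has zero stress, whereas the 1-dimensional one distorts the third distance (7 instead of 5) and incurs a stress of about 0.28. This mirrors, on a toy scale, the trade-off between embedding dimensionality and stress described above.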
Clustering has also been studied for the purpose of browsing. A 2-dimensional Self-Organizing Map (SOM) [Koh95] has been applied to produce a map of Usenet postings in WEBSOM [HKLK97]. The emphasis in WEBSOM is not to maximize cluster quality but to produce a human-interpretable 2-dimensional spatial map of known categories (e.g., newsgroups). In the Scatter/Gather approach [CKPT92], document clustering is used for improved interactive browsing of large query results. The focus of this work is mostly on speed and scalability and not necessarily on maximum cluster quality. In [ZE98], the effectiveness of clustering was studied on web documents.
There is also substantial work on categorizing documents. Here, since at least some of the documents have labels, a variety of supervised or semi-supervised techniques can be used [MR99,NMTM98]. A technique using the support vector machine is discussed in [Joa98]. There are several comparative studies on document classification [YP97,Yan99].
Dimensionality reduction for text classification/clustering has been
studied as well. Often, the data is projected onto a small number of
dimensions corresponding to principal components or a scalable
approximation thereof (e.g., FASTMAP [FL95]). In Latent
Semantic Indexing (LSI) [DDL+90] the term-document matrix is
modeled by a rank-k approximation using the top k
singular values.
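A rank-k approximation of this kind can be sketched with a truncated singular value decomposition. The example below uses NumPy's `np.linalg.svd`; the function name `rank_k_approximation`, the convention that documents are columns, and the toy matrix are assumptions for illustration and are not taken from [DDL+90].

```python
import numpy as np


def rank_k_approximation(A, k):
    """Return the rank-k approximation of A built from its top k singular
    values, plus k-dimensional coordinates for the documents (columns of A)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T   # one row per document
    return A_k, doc_coords


# Toy term-document matrix (rows: terms, columns: documents); made-up counts.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 3.0],
              [0.0, 0.0, 3.0, 1.0]])
A_2, docs_2d = rank_k_approximation(A, k=2)
```

The rows of `docs_2d` are reduced document representations to which a clustering algorithm could then be applied in place of the original term-space columns.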
While LSI was originally used for improved query processing in
information retrieval, the basic idea can be employed for clustering as well.
In bag-of-words approaches, the term-frequency matrix contains occurrence counts of terms in documents. Often, the matrix is preprocessed in order to enhance discrimination between documents. There are many schemes for selecting term and global normalization components. One popular preprocessing scheme is normalized Term Frequency and Inverse Document Frequency (TF-IDF), which also comes in several variants [Sal89,BY99]. However, this dissertation will not discuss the properties of feature extraction; see, e.g., [Kol97,Lew92] instead. In [YP97,Yan99], the classification performance of several other preprocessing schemes is compared.
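To make the preprocessing step concrete, the following sketch implements one TF-IDF variant: raw term counts weighted by log(N / df), followed by scaling each document vector to unit Euclidean length. This particular combination is an assumption chosen for illustration; as noted above, several variants exist, and the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df),
    followed by scaling each document vector to unit Euclidean length."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    weighted = []
    for c in counts:
        w = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in w.values()))
        weighted.append({t: (v / norm if norm > 0 else 0.0)
                         for t, v in w.items()})
    return weighted


# Made-up tokenized documents.
docs = [["cluster", "text", "text"],
        ["cluster", "graph"],
        ["cluster", "text", "vector"]]
weights = tfidf(docs)
```

In this toy corpus the term "cluster" occurs in every document and therefore receives weight zero, while rarer terms such as "vector" are emphasized, which is the intended discrimination effect.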