*Clustering* has been widely studied in several disciplines, especially
since the early 1960s [Har75,JMF99]. Some classic approaches
include partitional methods such as k-means, hierarchical
agglomerative clustering, unsupervised Bayes, and soft techniques, such
as those based on fuzzy logic or statistical mechanics
[CG96]. Conceptual clustering
[Fis87b], which maximizes category utility, a measure of
predictability improvement in attribute values given a clustering, is
also popular in the machine learning community.
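One common statement of category utility (the formula is not given in the source; this is the standard form due to Gluck and Corter, as used in COBWEB-style conceptual clustering) for a partition $\{C_1,\dots,C_K\}$ over nominal attributes $A_i$ with values $V_{ij}$ is:

```latex
CU(\{C_1,\dots,C_K\}) = \frac{1}{K} \sum_{k=1}^{K} P(C_k)
\left[ \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2
     - \sum_i \sum_j P(A_i = V_{ij})^2 \right]
```

The bracketed term measures how much better attribute values can be predicted within a cluster than without the clustering; dividing by $K$ penalizes trivially fine partitions.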
In most classical techniques, and even in fairly recent ones proposed
in the data mining community (CLARANS, DBSCAN, BIRCH, CLIQUE, CURE,
WAVECLUSTER, etc. [RS99,HKT01]), the objects to be clustered
only have numerical attributes and are represented by low-dimensional
feature vectors. The clustering algorithm is then based on distances
between the samples in the original vector space [SWY75]. Thus
these techniques are faced with the `curse of dimensionality' and
the associated sparsity issues when dealing with very
high-dimensional data such as text. Indeed, the performance of
such clustering algorithms is often demonstrated only on illustrative
2-dimensional examples.

Clustering algorithms may take an alternative view based on a notion
of *similarity* or dissimilarity. Similarity is often derived from the
inner product between vector representations, a popular way to
quantify document similarity. In [DM01], the authors present
a spherical k-means algorithm for document clustering using this
similarity measure. Graph-based clustering approaches that attempt
to avoid the `curse of dimensionality' by transforming the problem
formulation into a similarity space include
[KHK99,BGG$^+$99,SG00c].
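A minimal sketch of clustering unit-normalized document vectors by cosine similarity, in the spirit of spherical k-means (initialization and iteration details here are assumptions, not taken from [DM01]):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Sketch of spherical k-means: cluster unit-normalized document
    vectors by cosine similarity (inner product after normalization)."""
    rng = np.random.default_rng(seed)
    # Normalize rows so the inner product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialize concept vectors from k distinct random documents.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each document to the most similar concept vector.
        labels = (X @ centers.T).argmax(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                centers[j] = m / np.linalg.norm(m)  # re-normalize the mean
    return labels, centers
```

Unlike Euclidean k-means, each concept vector is the *normalized* mean of its members, so it stays on the unit sphere.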
Finally, when only pairwise similarities are available, techniques
such as Multi-Dimensional Scaling (MDS) [Tor52] have been used
to embed such points into a low-dimensional space such that the
stress (relative difference between embedded point distances and
actual distances) is minimized. Clustering can then be done in the
embedding space. However, in document clustering this is not commonly
used since for acceptable stress levels the dimensionality of the
embedding space is too high.
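The stress criterion above can be sketched as follows (assuming Kruskal's stress-1 variant; the text does not specify which normalization is used):

```python
import numpy as np

def stress(D, Y):
    """Kruskal-style stress-1: relative mismatch between target
    dissimilarities D[i, j] and distances in the embedding Y."""
    n = len(D)
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_hat = np.linalg.norm(Y[i] - Y[j])  # embedded distance
            num += (D[i, j] - d_hat) ** 2
            den += D[i, j] ** 2
    return np.sqrt(num / den)
```

MDS searches for an embedding `Y` (of fixed low dimensionality) minimizing this quantity; stress 0 means the pairwise distances are reproduced exactly.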

Clustering has also been studied for the purpose of *browsing*.
A 2-dimensional Self-Organizing Map (SOM) [Koh95] has been
applied to produce a map of Usenet postings in WEBSOM
[HKLK97]. The emphasis in WEBSOM is not to maximize cluster
quality but to produce a human interpretable 2-dimensional spatial
map of known categories (e.g., newsgroups).
In the Scatter/Gather approach [CKPT92], document clustering is
used for improved interactive browsing of large query results. The
focus of this work is mostly on speed/scalability and not necessarily
maximum cluster quality.
In [ZE98], the effectiveness of clustering was studied for web
documents.

There is also substantial work on *categorizing* documents.
Here, since at least some of the documents have labels, a variety of
supervised or semi-supervised techniques can be used
[MR99,NMTM98]. A technique using the support vector machine is
discussed in [Joa98]. There are several comparative studies on
document classification [YP97,Yan99].

*Dimensionality reduction* for text classification/clustering has been
studied as well. Often, the data is projected onto a small number of
dimensions corresponding to principal components or a scalable
approximation thereof (e.g., FASTMAP [FL95]). In Latent
Semantic Indexing (LSI) [DDL$^+$90] the term-document matrix is
modeled by a rank-k approximation using the top k singular values.
While LSI was originally used for improved query processing in
information retrieval, the base idea can be employed for clustering as
well.
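The rank-k truncation underlying LSI can be sketched with a standard SVD (the toy term-document counts below are invented for illustration):

```python
import numpy as np

# Hypothetical toy term-document matrix (terms x documents).
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

# Rank-k approximation from the top-k singular triplets, as in LSI.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents can then be represented (queried, or clustered) in the
# k-dimensional latent space given by the columns of diag(s_k) @ Vt_k.
docs_k = np.diag(s[:k]) @ Vt[:k, :]
```

By the Eckart-Young theorem, `A_k` is the best rank-k approximation of the term-document matrix in the Frobenius norm.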

In *bag-of-words* approaches, the term-frequency matrix contains
occurrence counts of terms in documents. Often, the matrix is
preprocessed in order to enhance discrimination between
documents. There are many schemes for selecting term and global
normalization components. One popular preprocessing is normalized Term
Frequency and Inverse Document Frequency (TF-IDF), which also comes in
several variants [Sal89,BY99]. However, this dissertation will
not discuss the properties of feature extraction, see e.g.,
[Kol97,Lew92] instead. In [YP97,Yan99] classification
performance of several other preprocessing schemes are compared.
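A minimal sketch of one TF-IDF variant (raw term frequency times log inverse document frequency, with cosine normalization; as the text notes, several variants exist, and the toy documents below are invented):

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]

n_docs = len(docs)
# Document frequency: number of documents containing each term.
df = {}
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1

def tfidf(doc):
    """Term frequency times log inverse document frequency,
    normalized to unit Euclidean length."""
    counts = {}
    for t in doc:
        counts[t] = counts.get(t, 0) + 1
    w = {t: tf * math.log(n_docs / df[t]) for t, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w

vectors = [tfidf(d) for d in docs]
```

The IDF factor down-weights terms that occur in many documents, and the unit-length normalization makes inner products between vectors directly usable as cosine similarities.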