Next: Mixture Density Estimation Up: Clustering Algorithms Previous: Artificial Neural Systems   Contents

## Projective Techniques

In many domains, the high-dimensional representation contains redundant information considering the clustering task. In document clustering, for example, many different words and phrases can be used to convey a similar meaning. Hence, the vector-space representation contains highly correlated evidence and the clustering process can be extended as follows:

1. Find an appropriate projection of the original data () into concept space ().
2. Perform clustering in the dimensionality reduced space.
Good sub-spaces for projections are characterized by preserving most of the `information' in the data. For a regular matrix the eigenvectors with the highest eigenvalues are the orthogonal directions of projection that explain the maximum possible of the variance in the data [Str88]. The extension of the eigen decomposition to non-square matrices is the Singular Value Decomposition (SVD). SVD is closely related to Principal Component Analysis (PCA) which is based on the SVD of the zero-mean normalized data. In Latent Semantic Indexing (LSI) the word-document matrix is modeled by a rank- approximation using the top singular values. While LSI is originally used for improved query processing in information retrieval the base idea may be employed for clustering as well. A matrix is unitary if and only if . Equation 2.4 gives the singular value decomposition (SVD) of the data matrix into a product of the unitary matrix , the diagonal matrix , and the unitary matrix .

 (2.4)

is the rank of . is the best rank -approximation of (with ) which is obtained by using , which is with the original upper-left diagonal entries and all others set to 0 (equation 2.5)

 (2.5)

The left-singular column vectors in are called the concept vectors with . In general, looses its sparsity since the projection is not axis-parallel. However, the concept space does not have to be instantiated explicitly, but can be rather stored functionally, as for example, a product of two sparse matrices. In the vector space model [SWY75,Kol97], a keyword search can be expressed as a matrix product as shown in 2.6. A binary valued query vector is projected linearly to a match vector .

 (2.6)

In the context of web-search, it has been claimed that projecting the query vector and the web-page vectors to concept space outperforms keyword search. The rationale for this observation lies in the fact that the rank- approximation is actually more representative than the original data matrix, because noise and redundancy have been removed. Also, it has been argued that by reducing the matrix complexity to a small number of concepts, implicitly the query is extended to encompass redundancy as introduced e.g., by synonyms. Using the optimal rank--approximation (LSI) yields equation 2.7, where is given by equation 2.5.

 (2.7)

Generally, the rank of has a magnitude of to and is around . Recently, it has been reported that random subspace projections perform very well for high-dimensional data and are close to the optimal projection as given by the SVD [SS97].

Next: Mixture Density Estimation Up: Clustering Algorithms Previous: Artificial Neural Systems   Contents
Alexander Strehl 2002-05-03