next up previous contents
Next: Mixture Density Estimation Up: Clustering Algorithms Previous: Artificial Neural Systems   Contents

Projective Techniques

In many domains, the high-dimensional representation contains redundant information considering the clustering task. In document clustering, for example, many different words and phrases can be used to convey a similar meaning. Hence, the vector-space representation contains highly correlated evidence and the clustering process can be extended as follows:

  1. Find an appropriate projection of the original data ($ D>10^5$) into concept space ($ d>10^2$).
  2. Perform clustering in the dimensionality reduced space.
Good sub-spaces for projections are characterized by preserving most of the `information' in the data. For a regular matrix the $ K$ eigenvectors with the $ K$ highest eigenvalues are the $ K$ orthogonal directions of projection that explain the maximum possible of the variance in the data [Str88]. The extension of the eigen decomposition to non-square matrices is the Singular Value Decomposition (SVD). SVD is closely related to Principal Component Analysis (PCA) which is based on the SVD of the zero-mean normalized data. In Latent Semantic Indexing (LSI) the $ d \times n$ word-document matrix $ \mathbf{X}$ is modeled by a rank-$ K$ approximation using the top $ K$ singular values. While LSI is originally used for improved query processing in information retrieval the base idea may be employed for clustering as well. A matrix $ \mathbf{U}$ is unitary if and only if $ \mathbf{U}^{\dagger}\mathbf{U} = \mathbf{I}$. Equation 2.4 gives the singular value decomposition (SVD) of the $ d \times n$ data matrix $ \mathbf{A}=\mathbf{X}$ into a product of the unitary $ d \times r$ matrix $ \mathbf{U}$, the diagonal $ r
\times r$ matrix $ \mathbf{\Sigma}$, and the unitary $ r \times n$ matrix $ \mathbf{V}^{\dagger}$.

$\displaystyle \mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^{\dagger} ,\: ...
...{V}^{\dagger} \mathbf{V} = \mathbf{I} ,\: \forall (i \neq j) : \sigma_{i,j} = 0$ (2.4)

$ r$ is the rank of $ \mathbf{X}$. $ \mathbf{A}_K$ is the best rank $ K$-approximation of $ \mathbf{A}$ (with $ K < r$) which is obtained by using $ \mathbf{\Sigma}_K$, which is $ \mathbf{\Sigma}$ with the original upper-left $ K$ diagonal entries and all others set to 0 (equation 2.5)

$\displaystyle \mathbf{A}_K = \mathbf{U}_K \mathbf{\Sigma}_K \mathbf{V}^{\dagger}_K = \sum_{i = 1}^{K} \sigma_i \mathbf{u}_i \mathbf{v}_i^{\dagger}$ (2.5)

The left-singular column vectors $ \mathbf{u} \in \mathbb{R}^{d}$ in $ \mathbf{U}$ are called the concept vectors with $ \Vert \mathbf{u} \Vert =
1$. In general, $ \mathbf{A}$ looses its sparsity since the projection is not axis-parallel. However, the concept space does not have to be instantiated explicitly, but can be rather stored functionally, as for example, a product of two sparse matrices. In the vector space model [SWY75,Kol97], a keyword search can be expressed as a matrix product as shown in 2.6. A binary valued query vector $ \mathbf{q}$ is projected linearly to a match vector $ \mathbf{p}$.

$\displaystyle \mathbf{p} = \mathbf{A}^{\dagger} \mathbf{q}$ (2.6)

In the context of web-search, it has been claimed that projecting the query vector and the web-page vectors to concept space outperforms keyword search. The rationale for this observation lies in the fact that the rank-$ K$ approximation is actually more representative than the original data matrix, because noise and redundancy have been removed. Also, it has been argued that by reducing the matrix complexity to a small number of concepts, implicitly the query is extended to encompass redundancy as introduced e.g., by synonyms. Using the optimal rank-$ K$-approximation (LSI) yields equation 2.7, where $ \mathbf{A}_K$ is given by equation 2.5.

$\displaystyle \hat{\mathbf{p}} = \mathbf{A}_K^{\dagger} \mathbf{q} = (\mathbf{V}_K \mathbf{\Sigma}_K) (\mathbf{U}_K \mathbf{q})$ (2.7)

Generally, the rank of $ A$ has a magnitude of $ 10^3$ to $ 10^5$ and $ K$ is around $ 10^2$. Recently, it has been reported that random subspace projections perform very well for high-dimensional data and are close to the optimal projection as given by the SVD [SS97].

next up previous contents
Next: Mixture Density Estimation Up: Clustering Algorithms Previous: Artificial Neural Systems   Contents
Alexander Strehl 2002-05-03