In many domains, the high-dimensional representation contains redundant information with respect to the clustering task. In document clustering, for example, many different words and phrases can be used to convey a similar meaning. Hence, the vector-space representation contains highly correlated evidence, and the clustering process can be extended as follows:

- Find an appropriate projection of the original data $A \in \mathbb{R}^{m \times n}$ into a $k$-dimensional concept space.
- Perform clustering in the dimensionality-reduced space.
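The two steps above can be sketched as follows (a minimal illustration using a random stand-in for the term-document matrix; the dimensions, the choice of $k$, and the number of clusters are arbitrary):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)

# Toy term-document matrix A (m terms x n documents); in practice this
# would be a sparse tf-idf matrix.
A = rng.random((50, 12))

# Step 1: project the documents into a k-dimensional concept space
# via the truncated SVD (each column of A becomes a k-dimensional vector).
k = 3
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T   # one row per document, k coordinates

# Step 2: cluster in the dimensionality-reduced space.
centroids, labels = kmeans2(docs_k, 4, minit="++", seed=0)
```

Any clustering algorithm can be substituted in step 2; k-means is used here only because it operates directly on the reduced coordinates.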

Let $A = U \Sigma V^T$ be the singular value decomposition of the data matrix, and let $r$ be the rank of $A$. $A_k$ is the best rank-$k$ approximation of $A$ (with $k \le r$), which is obtained by using $\Sigma_k$, i.e., $\Sigma$ with the original upper-left $k$ diagonal entries retained and all others set to 0 (equation 2.5):

$A_k = U \Sigma_k V^T$ (2.5)
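The construction of $A_k$ in equation 2.5 can be sketched with numpy (the matrix here is a random stand-in for a term-document matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Sigma_k: keep the first k singular values, set all others to 0.
Sigma_k = np.diag(np.where(np.arange(len(s)) < k, s, 0.0))
A_k = U @ Sigma_k @ Vt   # best rank-k approximation (Eckart-Young)

# The spectral-norm approximation error equals the (k+1)-th singular value.
err = np.linalg.norm(A - A_k, ord=2)
```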

The left-singular column vectors $u_i$ in $U_k$ are called the concept vectors, with $1 \le i \le k$. In general, the projected representation loses its sparsity, since the projection is not axis-parallel. However, the concept space representation does not have to be instantiated explicitly; it can rather be stored functionally, for example as the product of two sparse matrices. In the vector space model [SWY75, Kol97], a keyword search can be expressed as a matrix product as shown in equation 2.6: a binary-valued query vector $q$ is projected linearly to a match vector $s$:

$s = q^T A$ (2.6)
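Keyword search as a matrix product can be sketched on a hypothetical 3-term, 3-document matrix (the terms and counts are made up for illustration):

```python
import numpy as np

# Hypothetical term-document matrix: rows = terms, columns = documents.
# terms: "car", "auto", "flower"
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 0, 1]])

# Binary query vector for the single keyword "car".
q = np.array([1, 0, 0])

# Match vector: one score per document (here: raw term-occurrence counts).
s = q @ A
```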

In the context of web search, it has been claimed that projecting the query vector and the web-page vectors to concept space outperforms keyword search. The rationale for this observation is that the rank-$k$ approximation $A_k$ is actually more representative than the original data matrix, because noise and redundancy have been removed. It has also been argued that, by reducing the matrix to a small number of concepts, the query is implicitly extended to encompass redundancy as introduced, e.g., by synonyms. Using the optimal rank-$k$ approximation (LSI) yields equation 2.7, where $A_k$ is given by equation 2.5:

$s_k = q^T A_k$ (2.7)
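The implicit query extension can be illustrated on a tiny hypothetical example: document $d_3$ does not contain the query term "car", yet it receives a positive LSI score because "car" and "auto" co-occur in the other documents.

```python
import numpy as np

# Hypothetical matrix: rows = terms ("car", "auto"), columns = d1..d3.
A = np.array([[1., 1., 0.],    # "car"  occurs in d1, d2
              [1., 1., 1.]])   # "auto" occurs in d1, d2, d3

q = np.array([1., 0.])         # query: "car"

# Plain keyword match (equation 2.6): d3 scores zero.
s = q @ A

# LSI match (equation 2.7) with the rank-1 approximation A_1.
U, sig, Vt = np.linalg.svd(A, full_matrices=False)
A_1 = sig[0] * np.outer(U[:, 0], Vt[0, :])
s_1 = q @ A_1                  # d3 now receives a positive score
```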

Generally, the rank $r$ of $A$ is large for realistic document collections, while $k$ is chosen much smaller ($k \ll r$). Recently, it has been reported that random subspace projections perform very well for high-dimensional data and are close to the optimal projection as given by the SVD [SS97].
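The random-subspace idea can be sketched as follows: project onto a random Gaussian subspace and check that pairwise distances are roughly preserved, in the spirit of the Johnson-Lindenstrauss lemma (the dimensions and tolerance here are illustrative, not taken from [SS97]):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)

n, d, k = 20, 1000, 300
X = rng.standard_normal((n, d))          # n high-dimensional points

# Random projection: a d x k Gaussian matrix, scaled so that squared
# lengths are preserved in expectation.
R = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ R

# Ratio of pairwise distances after vs. before projection; values close
# to 1 mean the geometry, and hence the clustering, is largely preserved.
ratio = pdist(Y) / pdist(X)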