next up previous contents
Next: Visualization Up: CLUSION: Cluster Visualization Previous: CLUSION: Cluster Visualization   Contents

Coarse Seriation

When data is limited to 2 or 3 dimensions, the most powerful tool for judging cluster quality is usually the human eye. CLUSION, our CLUSter visualizatION toolkit, allows us to convert high-dimensional data into a perceptually more suitable format, and employ the human vision system to explore the relationships in the data, guide the clustering process, and verify the quality of the results. In our experience with two years of Dell customer data, we found CLUSION effective for getting clusters balanced w.r.t. number of customers or net dollar ($) amount, and even more so for conveying the results to marketing management. CLUSION looks at the output of a clustering routine, reorders the data points such that points with the same cluster label are contiguous, and then visualizes the resulting permuted similarity matrix, $ \mathbf{S}'$. More formally, the original $ n \times n$ similarity matrix $ \mathbf{S}$ is permuted with a $ n \times n$ permutation matrix $ \mathbf{P}$ which is defined as follows:

$\displaystyle p_{i,j} = \left\{ \begin{array}{l} 1 \text{ if } j = \sum_{a=1}...
...ell=1}^{\lambda_i - 1} n_{\ell} \   0 \text{ otherwise} \end{array} \right.$ (3.6)

$ l$ are entries in the binary $ n \times k$ cluster membership indicator matrix $ \mathbf{L}$:

$\displaystyle l_{i,j} = \left\{ \begin{array}{l} 1 \text{ if } \lambda_i = j \   0 \text{ otherwise} \end{array} \right.$ (3.7)

In other words, $ p_{i,j}$ is 1 if $ j$ is the sum of the number of points amongst the first $ i$ that belong to the same cluster and the number of points in the first $ \mathbf{\lambda}_i-1$ clusters. Now, the permuted similarity matrix $ \mathbf{S}'$ and the corresponding label vector $ \mathbf{\lambda}'$ and data matrix $ \mathbf{X}'$ are:

\begin{displaymath}\begin{array}{ccccc} \mathbf{S}' = \mathbf{P} \mathbf{S} \ma...
...\lambda} &,& \mathbf{X}' = \mathbf{P} \mathbf{X} \end{array}\end{displaymath} (3.8)

For a `good' clustering algorithm and $ k \rightarrow n$ this is related to sparse matrix reordering, for this results in the generation of a `banded matrix' where high entries should all fall near the diagonal line from the upper left to the lower right of the matrix. Since equation 3.8 is essentially a partial ordering operation we also refer to it as coarse seriation, a phrase used in disciplines such as anthropology and archaeology to describe the reordering of the primary data matrix so that similar structures (e.g., genetic sequences) are brought closer [Mur85,ESBB98].
next up previous contents
Next: Visualization Up: CLUSION: Cluster Visualization Previous: CLUSION: Cluster Visualization   Contents
Alexander Strehl 2002-05-03