

Domain Specific Features and Similarity Space

Notation. Let $n$ be the number of objects / samples / points (e.g., customers, documents, web-sessions) in the data and $d$ the number of features (e.g., products, words, web-pages) for each sample $\mathbf{x}_j$ with $j \in \{1,\ldots,n\}$. Let $k$ be the desired number of clusters. The input data can be represented by a $d \times n$ data matrix $\mathbf{X}$ with the $j$-th column vector representing the sample $\mathbf{x}_j$; $\mathbf{x}_j^{\dagger}$ denotes the transpose of $\mathbf{x}_j$. Hard clustering assigns a label $\lambda_j \in \{1,\ldots,k\}$ to each $d$-dimensional sample $\mathbf{x}_j$, such that similar samples receive the same label. In general the labels are treated as nominals with no inherent order, although in some cases, such as 1-dimensional SOMs, top-down recursive bisection approaches, and our proposed method, the labeling carries additional ordering information. Let $\mathcal{C}_{\ell}$ denote the set of all objects in the $\ell$-th cluster ($\ell \in \{1,\ldots,k\}$), with $\mathbf{x}_j \in \mathcal{C}_{\ell} \Leftrightarrow \lambda_j = \ell$ and $n_{\ell} = \vert \mathcal{C}_{\ell} \vert$. Figure 3.1 gives an overview of our relationship-based clustering process from a set of raw object descriptions $\mathcal{X}$ (residing in input space $\mathcal{I}$) via the vector space description $\mathbf{X}$ (in feature space $\mathcal{F}$) and the relationship description $\mathbf{S}$ (in similarity space $\mathcal{S}$) to the cluster labels $\mathbf{\lambda}$ (in output space $\mathcal{O}$): $(\mathcal{X} \in \mathcal{I}^{n}) \stackrel{\Upsilon}{\rightarrow} (\mathbf{X} \in \mathcal{F}^{n}) \stackrel{\Psi}{\rightarrow} (\mathbf{S} \in \mathcal{S}^{n \times n}) \stackrel{\Phi}{\rightarrow} (\mathbf{\lambda} \in \mathcal{O}^{n} = \{1,\ldots,k\}^{n})
$. For example, in web-page clustering, $\mathcal{X}$ is a collection of $n$ web-pages $x_j$ with $j \in \{1,\ldots,n\}$. Extracting features using $\Upsilon$ yields $\mathbf{X}$, the term frequencies of stemmed words, normalized such that $\Vert \mathbf{x} \Vert_2 = 1$ for all documents $\mathbf{x}$. Similarities are then computed using, e.g., the cosine-based similarity $\Psi$, yielding the $n \times n$ similarity matrix $\mathbf{S}$. Finally, the cluster label vector $\mathbf{\lambda}$ is computed using a clustering function $\Phi$, such as graph partitioning. In short, the basic process can be denoted as $\mathcal{X} \stackrel{\Upsilon}{\rightarrow} \mathbf{X} \stackrel{\Psi}{\rightarrow} \mathbf{S} \stackrel{\Phi}{\rightarrow} \mathbf{\lambda}$.
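
To make the pipeline concrete, the following is a minimal sketch in Python/NumPy (an illustration added here, not part of the original framework; the toy documents, the vocabulary, and the use of plain term frequencies are assumptions). It maps raw documents $\mathcal{X}$ to a unit-normalized feature matrix $\mathbf{X}$ via $\Upsilon$ and to a cosine-based similarity matrix $\mathbf{S}$ via $\Psi$; the partitioning step $\Phi$ is left as a placeholder comment.

import numpy as np

def upsilon(docs, vocabulary):
    """Feature extraction: term-frequency vectors, L2-normalized per document."""
    d, n = len(vocabulary), len(docs)
    X = np.zeros((d, n))                                 # d x n data matrix, one column per sample
    index = {term: i for i, term in enumerate(vocabulary)}
    for j, doc in enumerate(docs):
        for term in doc.lower().split():
            if term in index:
                X[index[term], j] += 1.0                 # raw term frequency
    X /= np.linalg.norm(X, axis=0, keepdims=True)        # ||x_j||_2 = 1 for every document
    return X

def psi(X):
    """Similarity computation: for unit-length columns, cosine similarity is X^T X."""
    return X.T @ X                                       # n x n similarity matrix S in [0,1]

docs = ["web mining clusters pages", "clusters of web pages", "retail basket data"]
vocabulary = ["web", "mining", "clusters", "pages", "retail", "basket", "data"]
X = upsilon(docs, vocabulary)
S = psi(X)
# Phi: a graph-partitioning step applied to the similarity graph of S would now
# produce the label vector lambda; it is omitted in this sketch.
print(np.round(S, 2))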

Figure 3.1: The relationship-based clustering framework.

Similarity Measures. In this dissertation, we prefer working in similarity space rather than in the original vector space in which the feature vectors reside. A similarity measure captures the relationship between two $d$-dimensional objects in a single number, using a number of computations on the order of the number of non-zero entries in the vectors (at worst $d$). Once this is done, the original high-dimensional space is not dealt with at all; we work only in the transformed similarity space, and subsequent processing is independent of $d$. A similarity measure $\in [0,1]$ captures how related two data-points $\mathbf{x}_a$ and $\mathbf{x}_b$ are. It should be symmetric ($s(\mathbf{x}_a,\mathbf{x}_b)=s(\mathbf{x}_b,\mathbf{x}_a)$), with self-similarity $s(\mathbf{x}_a,\mathbf{x}_a) = 1$. However, in general, similarity functions (respectively their distance function equivalents $\delta = \sqrt{-\log(s)}$, see below) do not obey the triangle inequality. An obvious way to compute similarity is through a suitable monotonic and inverse function of a Minkowski ($L_p$) distance $\delta$. Candidates include $s = 1 / (1 + \delta)$ and $s = e^{-\delta^2}$, the latter being preferable due to its maximum likelihood properties (see chapter 4); both conversions are illustrated in the short sketch following equation 3.1. Similarity can also be defined by the cosine of the angle between two vectors:

$\displaystyle s^{(\mathrm{C})} (\mathbf{x}_a,\mathbf{x}_b) = \frac{\mathbf{x}_a^{\dagger} \mathbf{x}_b}{\Vert\mathbf{x}_a\Vert _2 \cdot \Vert\mathbf{x}_b\Vert _2}$ (3.1)
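
As a small numerical illustration (added here; the two example vectors are arbitrary assumptions), the sketch below computes the Euclidean ($L_2$) distance between two points, converts it to a similarity via both $1/(1+\delta)$ and $e^{-\delta^2}$, and evaluates the cosine similarity of equation 3.1.

import numpy as np

x_a = np.array([2.0, 0.0, 1.0, 3.0])    # arbitrary example vectors
x_b = np.array([1.0, 0.0, 2.0, 3.0])

delta = np.linalg.norm(x_a - x_b)        # Minkowski (L_p) distance with p = 2

s_inv   = 1.0 / (1.0 + delta)            # s = 1 / (1 + delta)
s_gauss = np.exp(-delta**2)              # s = e^{-delta^2}, preferred (see chapter 4)

# Cosine similarity, equation (3.1): cosine of the angle between the two vectors.
s_cos = (x_a @ x_b) / (np.linalg.norm(x_a) * np.linalg.norm(x_b))

print(delta, s_inv, s_gauss, s_cos)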

Cosine similarity is widely used in text clustering because two documents with the same relative proportions of term occurrences but different lengths are often considered identical. In retail data, such normalization loses important information about the lifetime value of a customer, and we have recently shown that the extended Jaccard similarity measure is more appropriate [SG00c]. For binary features, the Jaccard coefficient [JD88] (also known as the Tanimoto coefficient [DHS01]) measures the ratio of the intersection to the union of the product sets corresponding to the transactions $\mathbf{x}_a$ and $\mathbf{x}_b$, each having binary (0/1) elements.

$\displaystyle s^{(\mathrm{J})} (\mathbf{x}_a,\mathbf{x}_b) = \frac{\mathbf{x}_a^{\dagger} \mathbf{x}_b}{\Vert \mathbf{x}_a \Vert _2^2 + \Vert \mathbf{x}_b \Vert _2^2 - \mathbf{x}_a^{\dagger} \mathbf{x}_b}$ (3.2)

The extended Jaccard coefficient is also given by equation 3.2, but allows the elements of $\mathbf{x}_a$ and $\mathbf{x}_b$ to be arbitrary positive real numbers. This coefficient captures a vector-length-sensitive measure of similarity. However, it is still invariant to scale (dilating $\mathbf{x}_a$ and $\mathbf{x}_b$ by the same factor does not change $s(\mathbf{x}_a,\mathbf{x}_b)$). A detailed discussion of the properties of various similarity measures can be found in chapter 4. Since, for general data distributions, one cannot avoid the `curse of dimensionality', there is no similarity metric that is optimal for all applications. Rather, one needs to determine an appropriate measure for the given application that captures the essential aspects of the class of high-dimensional data distributions being considered.
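
The following short sketch (an illustration added here; the example vectors are arbitrary assumptions) evaluates equation 3.2. It checks that for binary vectors the formula reduces to the intersection-over-union Jaccard coefficient, that a basket with twice the quantities of another is not treated as identical (unlike under cosine similarity), and that dilating both vectors by the same factor leaves the value unchanged.

import numpy as np

def extended_jaccard(x_a, x_b):
    """Extended Jaccard similarity, equation (3.2)."""
    dot = x_a @ x_b
    return dot / (x_a @ x_a + x_b @ x_b - dot)

# Binary (0/1) transactions: reduces to |intersection| / |union| of the product sets.
a = np.array([1, 1, 0, 1, 0], dtype=float)
b = np.array([1, 0, 0, 1, 1], dtype=float)
print(extended_jaccard(a, b))                 # 2 / 4 = 0.5

# Vector-length sensitivity: twice the quantities is not the same customer ...
x = np.array([3.0, 0.0, 2.0])
print(extended_jaccard(x, 2 * x))             # 2/3, whereas cosine similarity would be 1

# ... yet the measure is scale invariant: dilating both vectors changes nothing.
print(extended_jaccard(5 * x, 5 * (2 * x)))   # still 2/3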