Discussion

Clearly, if clusters are to be meaningful, the similarity measure should be invariant to transformations natural to the problem domain. Also, normalization may strongly affect clustering in a positive or negative way. The features have to be chosen carefully to be on comparable scales and similarity has to reflect the underlying semantics for the given task.

Euclidean similarity is translation invariant but scale sensitive while cosine is translation sensitive but scale invariant. The extended Jaccard has aspects of both properties as illustrated in figure 4.1. Iso-similarity lines at

, 0.5 and 0.75 for points $\mathbf{x}_1 = (3,1)^{\dagger}$ and $\mathbf{x}_2 = (1,2)^{\dagger}$ are shown for Euclidean, cosine, and the extended Jaccard. For cosine similarity only the 4 (out of 12) lines that are in the positive quadrant are plotted: The two lines in the lower right part are one of two lines from $\mathbf{x}_1$ at 0.5 and 0.75. The two lines in the upper left are for $\mathbf{x}_2$ at

and 0.75. The dashed line marks the locus of equal similarity to $\mathbf{x}_1$ and $\mathbf{x}_2$ which always passes through the origin for cosine and extended Jaccard similarity.

Using Euclidean similarity $s^{(\mathrm{E})}$ , iso-similarities are concentric hyperspheres around the considered point. Due to the finite range of similarity, the radius decreases hyperbolically as $s^{(\mathrm{E})}$ increases linearly. The radius does not depend on the center-point. The only location with similarity of 1 is the considered point itself and all finite locations have a similarity greater than 0. This last property tends to generate non-sparse similarity matrices. Using the cosine measure $s^{(\mathrm{C})}$ renders the iso-similarities to be hypercones all having their apex at the origin and axis aligned with the considered point. Locations with similarity 1 are on the 1-dimensional sub-space defined by this axis. The locus of points with similarity 0 is the hyperplane through the origin and perpendicular to this axis. For the extended Jaccard similarity $s^{(\mathrm{J})}$ , the iso-similarities are non-concentric hyperspheres. The only location with similarity 1 is the point itself. The hypersphere radius increases with the the distance of the considered point from the origin so that longer vectors turn out to be more tolerant in terms of similarity than smaller vectors. Sphere radius also increases with similarity and as $s^{(\mathrm{J})}$ approaches 0 the radius becomes infinite rendering the sphere to the same hyperplane as obtained for cosine similarity. Thus, for $s^{(\mathrm{J})} \rightarrow 0$ , extended Jaccard behaves like the cosine measure, and for $s^{(\mathrm{J})} \rightarrow 1$ , it behaves like the Euclidean distance.

Figure 4.1: Properties of (a) Euclidean-based, (b) cosine, and (c) extended Jaccard similarity measures illustrated in 2 dimensions. Two points $(1,2)^{\dagger}$ and $(3,1)^{\dagger}$ are marked with $\times$ s. For each point iso-similarity surfaces for

, 0.5, and 0.75 are shown with solid lines. The surface that is equi-similar to the two points is marked with a dashed line.

$\resizebox{\columnwidth}{!}{\includegraphics*{eps/simcompillu1}}$

(a)

(b)

(c)

Figure 4.2: More similarity properties shown on the 2-dimensional example of figure 4.1. The goodness of a location as the common representative of the two points is indicated with brightness. The best representative is marked with a $\star$ . The extended Jaccard (c) adopts the middle ground between Euclidean (a) and cosine-based similarity (b).

$\resizebox{\columnwidth}{!}{\includegraphics*{eps/simcompillu2}}$

(a)

(b)

(c)

In traditional Euclidean

-means clustering the optimal cluster representative $\mathbf{c}_{\ell}$ minimizes the sum of squared error criterion, i.e.,

$\displaystyle \mathbf{c}_{\ell} = \arg \min_{\mathbf{z} \in \mathcal{F}} \sum_{\mathbf{x}_j \in \mathcal{C}_{\ell}} \Vert \mathbf{x}_j - \mathbf{z} \Vert _2^2 .$

(4.5)

$\displaystyle f ( \mathcal{C}_{\ell} , \mathbf{z} ) = \sum_{\mathbf{x}_j \in \m... ...athbf{x}_j \in \mathcal{C}_{\ell}} \Vert \mathbf{x}_j - \mathbf{z} \Vert _2^2 .$

(4.6)

$\displaystyle f ( \mathcal{C}_{\ell} , \mathbf{z} ) = \sum_{\mathbf{x}_j \in \mathcal{C}_{\ell}} - \log(s(\mathbf{x}_j,\mathbf{z}))$

(4.7)

$\displaystyle \mathbf{c}_{\ell} = \arg \max_{\mathbf{z} \in \mathcal{F}} \prod_{\mathbf{x}_j \in \mathcal{C}_{\ell}} s(\mathbf{x}_j,\mathbf{z}) .$

(4.8)

To illustrate the values of the evaluation function $f'(\{ \mathbf{x}_1 , \mathbf{x}_2\},\mathbf{z})$ are used to shade the background in figure 4.2. The maximum likelihood representative of $\mathbf{x}_1$ and $\mathbf{x}_2$ is marked with a $\star$ in figure 4.2. For cosine similarity all points on the equi-similarity are optimal representatives. In a maximum likelihood interpretation, we constructed the distance similarity transformation such that $p(\mathbf{z} \vert \mathbf{c}_{\ell}) \sim s(\mathbf{z},\mathbf{c}_{\ell})$ . Consequently, we can use the dual interpretations of probabilities in similarity space and errors in distance space.