

Conversion from a Distance Metric

The Minkowski distances $ L_p(\mathbf{x}_a,\mathbf{x}_b) = \left( \sum_{i=1}^d \vert \mathbf{x}_{i,a} - \mathbf{x}_{i,b} \vert^p \right)^{1/p}$ are the standard metrics for geometrical problems. For $ p=2$ we obtain the Euclidean distance. There are several possibilities for converting such a distance metric (in $ [0,\infty)$, with 0 closest) into a similarity measure (in $ [0,1]$, with 1 closest) by a monotonically decreasing function. For Euclidean space, we chose to relate distances $ d$ and similarities $ s$ using $ s = e^{-d^2}$. Consequently, we define the Euclidean [0,1]-normalized similarity as

$\displaystyle s^{(\mathrm{E})} (\mathbf{x}_a,\mathbf{x}_b) = e^{ - \Vert \mathbf{x}_a - \mathbf{x}_b \Vert _2^2}$ (4.1)

which has important desirable properties (as we will see in the discussion) that the more commonly adopted $ s(\mathbf{x}_a,\mathbf{x}_b) = 1 / (1 + \Vert \mathbf{x}_a - \mathbf{x}_b \Vert _2 )$ lacks. Other distance functions can be used as well. The Mahalanobis distance, for example, normalizes the features using the covariance matrix. However, due to the high-dimensional nature of text data, covariance estimation is inaccurate and often computationally intractable; when normalization is needed, it is instead performed at the document representation stage itself, typically by applying TF-IDF.
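
The following is a minimal sketch (not part of the original text, assuming a NumPy-based setting with documents already represented as feature vectors) of the Minkowski distance and the Euclidean [0,1]-normalized similarity of Equation 4.1:

import numpy as np

def minkowski_distance(x_a, x_b, p=2):
    # Minkowski distance L_p; p=2 yields the Euclidean distance.
    return np.sum(np.abs(x_a - x_b) ** p) ** (1.0 / p)

def euclidean_similarity(x_a, x_b):
    # Euclidean [0,1]-normalized similarity: s^(E) = exp(-||x_a - x_b||_2^2).
    return np.exp(-np.sum((x_a - x_b) ** 2))

# Hypothetical example vectors: identical vectors give similarity 1,
# and the similarity decays toward 0 as the distance grows.
x_a = np.array([0.1, 0.3, 0.0, 0.2])
x_b = np.array([0.1, 0.0, 0.4, 0.2])
print(minkowski_distance(x_a, x_b, p=2))  # Euclidean distance in [0, inf)
print(euclidean_similarity(x_a, x_b))     # similarity in (0, 1]

Note that the exponential mapping is one of several possible monotonically decreasing conversions; the alternative $ 1/(1+d)$ mentioned above could be substituted by changing a single line.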

