

Conversion from a Distance Metric

The Minkowski distances $ L_p(\mathbf{x}_a,\mathbf{x}_b) = \left( \sum_{i=1}^d \vert \mathbf{x}_{i,a} - \mathbf{x}_{i,b} \vert^p \right)^{1/p}$ are the standard metrics for geometrical problems. For $ p=2$ we obtain the Euclidean distance. There are several possibilities for converting such a distance metric (in $ [0,\infty)$, with 0 closest) into a similarity measure (in $ [0,1]$, with 1 closest) by a monotonically decreasing function. For Euclidean space, we chose to relate distances $ d$ and similarities $ s$ using $ s = e^{-d^2}$. Consequently, we define the Euclidean [0,1]-normalized similarity as

$\displaystyle s^{(\mathrm{E})} (\mathbf{x}_a,\mathbf{x}_b) = e^{ - \Vert \mathbf{x}_a - \mathbf{x}_b \Vert _2^2}$ (4.1)

which has important desirable properties (as we will see in the discussion) that the more commonly adopted $ s(\mathbf{x}_a,\mathbf{x}_b) = 1 / (1 + \Vert \mathbf{x}_a - \mathbf{x}_b \Vert _2 )$ lacks. Other distance functions can be used as well. The Mahalanobis distance, for example, normalizes the features using the covariance matrix. However, due to the high-dimensional nature of text data, covariance estimation is inaccurate and often computationally intractable; when normalization is needed, it is instead performed at the document representation stage itself, typically by applying TF-IDF.
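
The following is a minimal sketch (not part of the original text, assuming a NumPy-based setting with documents already represented as feature vectors) of the Minkowski distance and the Euclidean [0,1]-normalized similarity of Equation 4.1:

import numpy as np

def minkowski_distance(x_a, x_b, p=2):
    # Minkowski distance L_p; p=2 yields the Euclidean distance.
    return np.sum(np.abs(x_a - x_b) ** p) ** (1.0 / p)

def euclidean_similarity(x_a, x_b):
    # Euclidean [0,1]-normalized similarity: s^(E) = exp(-||x_a - x_b||_2^2).
    return np.exp(-np.sum((x_a - x_b) ** 2))

# Hypothetical example vectors: identical vectors give similarity 1,
# and the similarity decays toward 0 as the distance grows.
x_a = np.array([0.1, 0.3, 0.0, 0.2])
x_b = np.array([0.1, 0.0, 0.4, 0.2])
print(minkowski_distance(x_a, x_b, p=2))  # Euclidean distance in [0, inf)
print(euclidean_similarity(x_a, x_b))     # similarity in (0, 1]

Note that the exponential mapping is one of several possible monotonically decreasing conversions; the alternative $ 1/(1+d)$ mentioned above could be substituted by changing a single line.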

