

Extended Jaccard Similarity

The binary Jaccard coefficient measures the degree of overlap between two sets and is computed as the ratio of the number of attributes (words) shared by $ \mathbf{x}_a$ AND $ \mathbf{x}_b$ to the number possessed by $ \mathbf{x}_a$ OR $ \mathbf{x}_b$. For example, given the binary indicator vectors of two sets, $ \mathbf{x}_a = (0, 1, 1, 0)^{\dagger}$ and $ \mathbf{x}_b = (1, 1, 0, 0)^{\dagger}$, the cardinality of their intersection is 1 and the cardinality of their union is 3, giving a Jaccard coefficient of 1/3. The binary Jaccard coefficient is often used in retail market-basket applications. In chapter 3, we extended the binary definition of the Jaccard coefficient to continuous or discrete non-negative features. The extended Jaccard is computed as

$\displaystyle s^{(\mathrm{J})} (\mathbf{x}_a,\mathbf{x}_b) = \frac{\mathbf{x}_a^{\dagger} \mathbf{x}_b}{\Vert \mathbf{x}_a \Vert _2^2 + \Vert \mathbf{x}_b \Vert _2^2 - \mathbf{x}_a^{\dagger} \mathbf{x}_b} ,$ (4.4)

which is equivalent to the binary version when the feature vector entries are binary. Extended Jaccard similarity [SG00c] retains the sparsity property of the cosine while allowing discrimination of collinear vectors, as we will show in the following subsection. Another similarity measure closely related to the extended Jaccard is the Dice coefficient ( $ s^{(\mathrm{D})} (\mathbf{x}_a,\mathbf{x}_b) = \frac{2 \mathbf{x}_a^{\dagger} \mathbf{x}_b} { \Vert \mathbf{x}_a \Vert _2^2 + \Vert \mathbf{x}_b \Vert _2^2} $). The Dice coefficient can be obtained from the extended Jaccard coefficient by adding $ \mathbf{x}_a^{\dagger} \mathbf{x}_b$ to both the numerator and the denominator. It is omitted here since it behaves very similarly to the extended Jaccard coefficient.
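
As an illustration only, the following minimal Python sketch evaluates Eq. (4.4) and the Dice coefficient on plain lists. The function names (`dot`, `extended_jaccard`, `dice`) are hypothetical, and returning 1 for two all-zero vectors is an assumption made here to avoid division by zero; the text does not specify that case.

```python
def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def extended_jaccard(x_a, x_b):
    """Extended Jaccard similarity following Eq. (4.4)."""
    shared = dot(x_a, x_b)
    denom = dot(x_a, x_a) + dot(x_b, x_b) - shared
    # Assumption (not from the text): two all-zero vectors count as identical.
    return shared / denom if denom else 1.0


def dice(x_a, x_b):
    """Dice coefficient: shared term added to numerator and denominator."""
    shared = dot(x_a, x_b)
    denom = dot(x_a, x_a) + dot(x_b, x_b)
    # Same assumption as above for the all-zero case.
    return 2 * shared / denom if denom else 1.0


# Binary example from the text: intersection 1, union 3.
binary = extended_jaccard([0, 1, 1, 0], [1, 1, 0, 0])   # 1 / (2 + 2 - 1)

# Collinear vectors (cosine would be 1) are still told apart.
collinear = extended_jaccard([1, 2, 0], [2, 4, 0])      # 10 / (5 + 20 - 10)

# Dice on the binary example.
dice_binary = dice([0, 1, 1, 0], [1, 1, 0, 0])          # 2 / (2 + 2)

print(binary, collinear, dice_binary)
```

On the binary vectors the sketch reproduces the value 1/3 from the example above, and the collinear pair yields 2/3 rather than 1, which is the kind of discrimination the cosine cannot provide. The two measures are monotonically related through $ s^{(\mathrm{D})} = 2 s^{(\mathrm{J})} / (1 + s^{(\mathrm{J})})$, consistent with the remark that they behave very similarly.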


Alexander Strehl 2002-05-03