Extended Jaccard Similarity
The binary Jaccard coefficient measures the degree of overlap between
two sets and is computed as the ratio of the number of shared
attributes (words) of $\mathbf{x}_a$ AND $\mathbf{x}_b$ to
the number possessed by $\mathbf{x}_a$ OR $\mathbf{x}_b$.
For example, if the binary indicator vectors $\mathbf{x}_a$ and
$\mathbf{x}_b$ of two sets have exactly one attribute in common and
three distinct attributes overall, the cardinality of their
intersection is 1 and the cardinality of their union is 3, rendering
their Jaccard coefficient 1/3. The binary Jaccard coefficient is often
used in retail market-basket applications. In Chapter 3, we extended
the binary definition of the Jaccard coefficient to continuous or
discrete non-negative features. The
extended Jaccard is computed as
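$$ s^{(\mathrm{J})}(\mathbf{x}_a, \mathbf{x}_b) = \frac{\mathbf{x}_a^{\top} \mathbf{x}_b}{\|\mathbf{x}_a\|_2^2 + \|\mathbf{x}_b\|_2^2 - \mathbf{x}_a^{\top} \mathbf{x}_b} \tag{4.4} $$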
which is equivalent to the binary version when the feature vector
entries are binary. Extended Jaccard similarity [SG00c]
retains the sparsity property of the cosine while allowing
discrimination of collinear vectors, as we will show in the following
subsection. Another similarity measure closely related to the extended
Jaccard is the Dice coefficient
($s^{(\mathrm{D})}(\mathbf{x}_a, \mathbf{x}_b) = \frac{2 \mathbf{x}_a^{\top} \mathbf{x}_b}{\|\mathbf{x}_a\|_2^2 + \|\mathbf{x}_b\|_2^2}$).
The Dice coefficient can be obtained from the extended Jaccard
coefficient by adding $\mathbf{x}_a^{\top} \mathbf{x}_b$ to both
the numerator and denominator. It is omitted here since it behaves
very similarly to the extended Jaccard coefficient.
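As a minimal sketch of the definitions above, the following Python
functions compute the extended Jaccard and Dice coefficients for
non-negative feature vectors given as plain lists. The function names,
the example vectors, and the convention of returning 0.0 for two
all-zero vectors are assumptions made for illustration only; they are
not taken from the text.

```python
def _dot(x_a, x_b):
    """Inner product of two equal-length sequences of numbers."""
    return sum(a * b for a, b in zip(x_a, x_b))


def extended_jaccard(x_a, x_b):
    """Extended Jaccard: x_a.x_b / (|x_a|^2 + |x_b|^2 - x_a.x_b)."""
    dot = _dot(x_a, x_b)
    denom = _dot(x_a, x_a) + _dot(x_b, x_b) - dot
    # Assumed convention: two all-zero vectors (0/0) get similarity 0.0.
    return dot / denom if denom else 0.0


def dice(x_a, x_b):
    """Dice: extended Jaccard with x_a.x_b added to numerator and denominator."""
    dot = _dot(x_a, x_b)
    denom = _dot(x_a, x_a) + _dot(x_b, x_b)
    # Same assumed convention for the 0/0 case.
    return 2 * dot / denom if denom else 0.0


# Illustrative binary indicator vectors (hypothetical): one shared
# attribute, three distinct attributes overall.
x_a = [0, 1, 1, 0]
x_b = [1, 1, 0, 0]
binary_jaccard = extended_jaccard(x_a, x_b)   # 1 / (2 + 2 - 1) = 1/3
binary_dice = dice(x_a, x_b)                  # 2 / (2 + 2) = 1/2

# Collinear vectors point in the same direction, so their cosine is 1,
# but the extended Jaccard still tells them apart: 10 / (5 + 20 - 10) = 2/3.
collinear = extended_jaccard([1, 2], [2, 4])
```

On binary input the inner product counts the shared attributes and each
squared norm counts the attributes of one set, so the denominator is the
size of the union and the result matches the binary Jaccard coefficient.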
Alexander Strehl
2002-05-03