Next: OPOSSUM Up: Relationship-based Clustering and Visualization Previous: Motivation   Contents

# Domain Specific Features and Similarity Space

Notation. Let be the number of objects / samples / points (e.g., customers, documents, web-sessions) in the data and the number of features (e.g., products, words, web-pages) for each sample with . Let be the desired number of clusters. The input data can be represented by a data matrix with the -th column vector representing the sample . denotes the transpose of . Hard clustering assigns a label to each -dimensional sample , such that similar samples get the same label. In general the labels are treated as nominals with no inherent order, though in some cases, such as 1-dimensional SOMs, any top-down recursive bisection approach as well as our proposed method, the labeling contains extra ordering information. Let denote the set of all objects in the -th cluster ( ), with and . Figure 3.1 gives an overview of our relationship-based clustering process from a set of raw object descriptions (residing in input space ) via the vector space description (in feature space ) and relationship description (in similarity space ) to the cluster labels (in output space ): . For example in web-page clustering, is a collection of web-pages with . Extracting features using yields , the term frequencies of stemmed words, normalized such that for all documents . Similarities are computed, using e.g., cosine-based similarity yielding the similarity matrix . Finally, the cluster label vector is computed using a clustering function , such as graph partitioning. In short, the basic process can be denoted as .

Similarity Measures. In this dissertation, we prefer working in similarity space rather than the original vector space in which the feature vectors reside. A similarity measure captures the relationship between two -dimensional objects in a single number (using on the order of the number of non-zero entries in the vectors or , at worst, computations). Once this is done, the original high-dimensional space is not dealt with at all, we only work in the transformed similarity space, and subsequent processing is independent of . A similarity measure captures how related two data-points and are. It should be symmetric ( ), with self-similarity . However, in general, similarity functions (respectively their distance function equivalents , see below) do not obey the triangle inequality. An obvious way to compute similarity is through a suitable monotonic and inverse function of a Minkowski () distance, . Candidates include and , the later being preferable due to maximum likelihood properties (see chapter 4). Similarity can also be defined by the cosine of the angle between two vectors:

 (3.1)

Cosine similarity is widely used in text clustering because two documents with the same proportions of term occurrences but different lengths are often considered identical. In retail data such normalization loses important information about the life-time customer value, and we have recently shown that the extended Jaccard similarity measure is more appropriate [SG00c]. For binary features, the Jaccard coefficient [JD88] (also known as the Tanimoto coefficient [DHS01]) measures the ratio of the intersection of the product sets to the union of the product sets corresponding to transactions and , each having binary (0/1) elements.

 (3.2)

The extended Jaccard coefficient is also given by equation 3.2, but allows elements of and to be arbitrary positive real numbers. This coefficient captures a vector-length-sensitive measure of similarity. However, it is still invariant to scale (dilating and by the same factor does not change ). A detailed discussion of the properties of various similarity measures can be found in chapter 4. Since, for general data distributions, one cannot avoid the `curse of dimensionality', there is no similarity metric that is optimal for all applications. Rather, one needs to to determine an appropriate measure for the given application, that captures the essential aspects of the class of high-dimensional data distributions being considered.

Next: OPOSSUM Up: Relationship-based Clustering and Visualization Previous: Motivation   Contents
Alexander Strehl 2002-05-03