Next: Notation Up: Introduction Previous: Introduction Contents

Cluster Analysis

The ability to form meaningful groups of objects is one of the most fundamental modes of intelligence. Humans perform this task with remarkable ease. In early childhood one learns to distinguish, for example, between cats and dogs or apples and oranges. However, enabling the computer to do this task of grouping automatically is a difficult and often ill-posed problem.

Cluster analysis is a tool for exploring the structure of data. The core of cluster analysis is clustering, the process of grouping objects into clusters such that objects from the same cluster are similar and objects from different clusters are dissimilar. Objects can be described in terms of measurements (e.g., attributes, features) or by relationships with other objects (e.g., pairwise distance, similarity). Unlike classification, clustering does not require assumptions about category labels that tag objects with prior identifiers. Therefore, clustering is an unsupervised learning technique versus classification, which belongs to supervised learning.

The need to structure and learn from the vigorously growing amounts of data has been a driving force for making clustering a highly active research area. Humans are not able to easily discover knowledge from the glut of information in databases without the use of summarization techniques. Basic statistics (such as mean, variance) or histograms can provide an initial feel for the data. However, more intricate relationships among the objects, among the features, and between both can be discovered through cluster analysis.

Clustering is used in many areas, including artificial intelligence, biology, customer relationship management, data compression, data mining, information retrieval, image processing, machine learning, marketing, medicine, pattern recognition, psychology, recommender systems and statistics. In biology, clustering is used, for example, to automatically build a taxonomy of species based on their features. Currently, there is considerable interest in estimation of phylogenetic trees from gene sequence data. Another application of clustering is to better understand gene functions in the biological processes in a cell. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. Another growing application area is customer relationship management, where data collected from multiple touch-points (e.g., web surfing, cash register transactions, call center activities) has become readily available. This data contains valuable knowledge of customer behavior that can help optimize marketing, bundling, and pricing strategies. Due to the massive amounts and great detail of data, extracting such knowledge is hard, and often even obvious insights are overlooked. Clustering is critical in the mining process because it can summarize data to a manageable level by forming, for example, groups of customers with similar profiles.

Approaches to clustering differ in a variety of aspects. Clustering algorithms can proceed divisively (top-down) or agglomeratively (bottom-up). In top-down processing, one starts with the objects as one group and splits it into clusters. Conversely, in agglomerative algorithms, each object starts as a singleton cluster and clusters are merged successively. Depending on the result of clustering, one can distinguish between hierarchical (multi-layer) and flat (single-layer) techniques. Hierarchical algorithms output a rooted tree structure that can be represented as a dendrogram. Within hierarchical clustering, techniques can be distinguished into those generating binary trees and those generating arbitrary trees. In the case of flat clustering, hard clustering induces a partitioning into non-overlapping groups. In contrast to this Boolean membership relation, a soft clustering gives the probabilities for each object being a member of a cluster. Based on the goals of clustering, it can be formalized as an optimization problem by itself (e.g., minimizing squared error) or in the context of an application (facilitating for further activities such as predictive modeling or browsing).

Clustering has been widely studied in several disciplines, especially since the early 1960's. A variety of classic [Fuk72,Eve74,Har75,Mur85,JD88] and recent books [KR90,HK00,DHS01] give great introductions to approaches and algorithms in cluster analysis.

The rest of this chapter is organized as follows. The next section introduces some basic notational conventions for this dissertation. Section 1.3 introduces our basic approach to clustering and the components involved. Section 1.4 elaborates on the motivations for this dissertation. In section 1.5, a brief summarization of the contributions is given. Section 1.6 contains the organization of the rest of this dissertation.

Next: Notation Up: Introduction Previous: Introduction Contents

Alexander Strehl 2002-05-03