Prof. Dr. Alexander Strehl's Publications

[SG02a] Alexander Strehl and Joydeep Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research (JMLR), 3:583-617, December 2002.
[ bib | .pdf | .ps.gz ]

This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. We first identify several application scenarios for the resultant 'knowledge reuse' framework that we call cluster ensembles. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared mutual information. In addition to a direct maximization approach, we propose three effective and efficient techniques for obtaining high-quality combiners (consensus functions). The first combiner induces a similarity measure from the partitionings and then reclusters the objects. The second combiner is based on hypergraph partitioning. The third one collapses groups of clusters into meta-clusters which then compete for each object to determine the combined clustering. Due to the low computational costs of our techniques, it is quite feasible to use a supra-consensus function that evaluates all three approaches against the objective function and picks the best solution for a given situation. We evaluate the effectiveness of cluster ensembles in three qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms worked on non-identical sets of objects, and (iii) where a common data-set is used and the main purpose of combining multiple clusterings is to improve the quality and robustness of the solution. Promising results are obtained in all three situations for synthetic as well as real data-sets.
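
The objective can be stated compactly; a sketch of the paper's formulation, with notation lightly simplified:

```latex
% Consensus labeling over r given partitionings \lambda^{(q)}; NMI normalizes
% mutual information by the geometric mean of the two entropies.
\hat{\lambda} \;=\; \arg\max_{\lambda} \, \frac{1}{r} \sum_{q=1}^{r}
  \mathrm{NMI}\!\left(\lambda, \lambda^{(q)}\right),
\qquad
\mathrm{NMI}(X, Y) \;=\; \frac{I(X; Y)}{\sqrt{H(X)\, H(Y)}}
```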

[GSM02] Joydeep Ghosh, Alexander Strehl, and Srujana Merugu. A consensus framework for integrating distributed clusterings under limited knowledge sharing. In Proc. NSF Workshop on Next Generation Data Mining, pages 99-108, November 2002.
[ bib | .pdf | .ps.gz ]

This paper examines the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. This problem is an abstraction of scenarios where different organizations have grouped some or all elements of a common underlying population, possibly using different features, algorithms or clustering criteria. Moreover, due to real-life constraints such as proprietary techniques, legal restrictions, and different data ownership, it is not feasible to pool all the data into a central location and then apply clustering techniques: the only information that can be shared is the symbolic cluster labels. The cluster ensemble problem is formalized as a combinatorial optimization problem that obtains a consensus function in terms of shared mutual information among individual solutions. Three effective and efficient techniques for obtaining high-quality consensus functions are described and studied empirically for the following qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms were applied to non-identical sets of objects, and (iii) where the individual solutions provide varying numbers of clusters. Promising results are obtained in all three situations for synthetic as well as real data sets, even under severe restrictions on data and knowledge sharing.

[SG02b] Alexander Strehl and Joydeep Ghosh. Cluster ensembles - a knowledge reuse framework for combining partitionings. In Proc. Conference on Artificial Intelligence (AAAI 2002), Edmonton, pages 93-98. AAAI/MIT Press, July 2002.
[ bib | .pdf | .ps.gz ]

It is widely recognized that combining multiple classification or regression models typically provides superior results compared to using a single, well-tuned model. However, there are no well-known approaches to combining multiple non-hierarchical clusterings. The idea of combining cluster labelings without accessing the original features leads us to a general knowledge reuse framework that we call cluster ensembles. Our contribution in this paper is to formally define the cluster ensemble problem as an optimization problem and to propose three effective and efficient combiners for solving it based on a hypergraph model. Results on synthetic as well as real data sets are given to show that cluster ensembles can (i) improve quality and robustness, and (ii) enable distributed clustering.
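
As an illustration of the first combiner, here is a minimal sketch (not the authors' implementation): the fraction of partitionings in which two objects co-occur induces a similarity, which is then reclustered. Average-linkage clustering stands in here for the graph-partitioning step used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_by_coassociation(labelings, k):
    """labelings: list of 1-D integer label arrays over the same n objects."""
    n = len(labelings[0])
    sim = np.zeros((n, n))
    for labels in labelings:
        sim += labels[:, None] == labels[None, :]   # 1 where co-clustered
    sim /= len(labelings)                           # fraction of agreement
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    # Recluster in the induced similarity space (average linkage here).
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")

# Three noisy 2-way partitionings of six objects (labels need not align)
parts = [np.array([0, 0, 0, 1, 1, 1]),
         np.array([0, 0, 1, 1, 1, 1]),
         np.array([1, 1, 0, 0, 0, 0])]
print(consensus_by_coassociation(parts, k=2))       # e.g. [1 1 1 2 2 2]
```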

[Str02] Alexander Strehl. Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. PhD thesis, The University of Texas at Austin, May 2002.
[ bib | .pdf ]

This dissertation takes a relationship-based approach to cluster analysis of high (1000 and more) dimensional data that side-steps the ``curse of dimensionality'' issue by working in a suitable similarity space instead of the original feature space. We propose two frameworks that leverage graph algorithms to achieve relationship-based clustering and visualization, respectively. In the visualization framework, the output from the clustering algorithm is used to re-order the data points so that the resulting permuted similarity matrix can be readily visualized in 2 dimensions, with clusters showing up as bands. Results on retail transaction, document (bag-of-words), and web-log data show that our approach can yield superior results while also taking additional balance constraints into account. The choice of similarity is a critical step in relationship-based clustering, and this motivates our systematic comparative study of the impact of similarity measures on the quality of document clusters. The key findings of our experimental study are: (i) cosine, correlation, and extended Jaccard similarities perform comparably; (ii) Euclidean distances do not work well; (iii) graph partitioning tends to be superior to k-means and SOMs, especially when balanced clusters are desired; and (iv) performance curves generally do not cross. We also propose a cluster quality evaluation measure based on normalized mutual information and find an analytical relation between similarity measures. It is widely recognized that combining multiple classification or regression models typically provides superior results compared to using a single, well-tuned model. However, there are no well-known approaches to combining multiple clusterings. The idea of combining cluster labelings without accessing the original features leads to a general knowledge reuse framework that we call cluster ensembles. We propose a formal definition of the cluster ensemble as an optimization problem. Taking a relationship-based approach, we propose three effective and efficient combining algorithms for solving it heuristically based on a hypergraph model. Results on synthetic as well as real data-sets show that cluster ensembles can (i) improve quality and robustness, (ii) enable distributed clustering, and (iii) speed up processing significantly with little loss in quality.
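
A usage sketch of the NMI-based quality evaluation with scikit-learn; the geometric averaging option is assumed to match the normalization by the square root of the product of the two entropies:

```python
# Hedged usage sketch: score a clustering against reference labels with
# normalized mutual information; average_method="geometric" corresponds
# to normalizing I(X;Y) by sqrt(H(X) H(Y)).
from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 0, 1, 1, 2]
labels_pred = [1, 1, 0, 0, 0, 2]
print(normalized_mutual_info_score(labels_true, labels_pred,
                                   average_method="geometric"))
```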

[GS02] Joydeep Ghosh and Alexander Strehl. Clustering and visualization of retail market baskets. In N. R. Pal and L. Jain, editors, Knowledge Discovery in Advanced Information Systems, AIP. Springer, 2002. In press.
[ bib | .pdf | .ps.gz ]

Transaction analysis, including clustering of market baskets, is a key application of data mining to the retail industry. This domain has specific requirements, such as the need for obtaining easily interpretable and actionable results. It also exhibits some very challenging characteristics, mostly stemming from the fact that the data has thousands of features, is highly non-Gaussian and sparse. This paper proposes a relationship-based approach to clustering such data that tries to side-step the ``curse of dimensionality'' issue by working in a suitable similarity space instead of the original high-dimensional feature space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning based clustering techniques in this space. The output from the clustering algorithm is used to re-order the data points so that the resulting permuted similarity matrix can be readily visualized in 2 dimensions, with clusters showing up as bands. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging of clusters can be easily derived, and it also guides the user towards a suitable number of clusters. Results are presented on a real retail industry data-set of several thousand customers and products.

[SG03] Alexander Strehl and Joydeep Ghosh. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS Journal on Computing, 15(2):208-230, Spring 2003.
[ bib | .pdf | .ps.gz ]

In several real-life data-mining applications, data reside in very high (1000 or more) dimensional space, where both clustering techniques developed for low-dimensional spaces (k-means, BIRCH, CLARANS, CURE, DBSCAN, etc.) as well as visualization methods such as parallel coordinates or projective visualizations, are rendered ineffective. This paper proposes a relationship-based approach that alleviates both problems, side-stepping the ``curse-of-dimensionality'' issue by working in a suitable similarity space instead of the original high-dimensional attribute space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to re-order the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. While two-dimensional visualization of a similarity matrix is by itself not novel, its combination with the order-sensitive partitioning of a graph that captures the relevant similarity measure between objects provides three powerful properties: (i) the high-dimensionality of the data does not affect further processing once the similarity space is formed; (ii) it leads to clusters of (approximately) equal importance, and (iii) related clusters show up adjacent to one another, further facilitating the visualization of results. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging of clusters can be easily derived, and it also guides the user toward the right number of clusters. Results are presented on a real retail industry dataset of several thousand customers and products, as well as on clustering of web-document collections and of web-log sessions.
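
A minimal sketch of the core idea under stated assumptions: cosine similarity forms the relationship space, and spectral clustering on the precomputed affinity stands in for the paper's graph-partitioning step.

```python
# Once the n x n similarity matrix S is formed, the original 2000-dimensional
# features are no longer needed; all later steps see only S.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((60, 2000))               # 60 objects in a high-dim space
S = cosine_similarity(X)                 # relationship (similarity) space

labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(S)
print(np.bincount(labels))               # cluster sizes
```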

[SG01] Alexander Strehl and Joydeep Ghosh. Relationship-based visualization of high-dimensional data clusters. In Proc. Workshop on Visual Data Mining (KDD 2001), San Francisco, pages 90-99. ACM, August 2001.
[ bib | .pdf | .ps.gz ]

In several real-life data mining applications, data resides in very high (> 1000) dimensional space, where both clustering techniques developed for low dimensional spaces (k-means, BIRCH, CLARANS, CURE, DBSCAN, etc.) as well as visualization methods such as parallel coordinates or projective visualizations, are rendered ineffective. This paper proposes a relationship-based approach to clustering that alleviates both problems, side-stepping the ``curse of dimensionality'' issue by working in a suitable similarity space instead of the original high-dimensional attribute space. The similarity measure used can be tailored to satisfy business criteria such as obtaining user clusters representing comparable amounts of revenue. The clustering algorithm is used to re-order the data points so that the resulting (rearranged) similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. While such visualization is not novel, the two key contributions of our method are: (i) it leads to clusters of (approximately) equal importance, and (ii) related clusters show up adjacent to one another, further facilitating the visualization of results. Both properties arise from the efficient and scalable top-down graph-partitioning approach used for clustering in similarity space. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging of clusters can be easily derived, and it also guides the user towards the right number of clusters. Results are presented on a real retail industry data-set of several thousand customers and products, as well as on clustering of web document collections.

[SG00b] Alexander Strehl and Joydeep Ghosh. A scalable approach to balanced, high-dimensional clustering of market-baskets. In Proc. High Performance Computing (HiPC 2000), Bangalore, volume 1970 of LNCS, pages 525-536. Springer, December 2000.
[ bib | .pdf | .ps.gz ]

This paper presents OPOSSUM, a novel similarity-based clustering approach based on constrained, weighted graph-partitioning. OPOSSUM is particularly attuned to real-life market baskets, characterized by very high-dimensional, highly sparse customer-product matrices with positive ordinal attribute values and a significant number of outliers. Since it is built on top of Metis, a well-known and highly efficient graph-partitioning algorithm, it inherits the scalable and easily parallelizable attributes of the latter algorithm. Results are presented on a real retail industry data-set of several thousand customers and products, with the help of CLUSION, a cluster visualization tool.
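
A hypothetical sketch of the underlying operation, balanced k-way partitioning of a graph with METIS via the pymetis bindings; OPOSSUM's market-basket similarities and balancing constraints are not reproduced here.

```python
# Hypothetical sketch: balanced 2-way partitioning of a small graph with
# METIS; OPOSSUM applies the same engine to weighted similarity graphs.
import pymetis

# Adjacency lists of a tiny graph with two natural groups {0,1,2} and {3,4,5}
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
cut_count, membership = pymetis.part_graph(2, adjacency=adjacency)
print(cut_count, membership)   # expect cut 1 and parts like [0, 0, 0, 1, 1, 1]
```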

[SG00a] Alexander Strehl and Joydeep Ghosh. Clustering guidance and quality evaluation using relationship-based visualization. In Proc. ANNIE 2000, St. Louis, volume 10, pages 483-488. ASME, November 2000.
[ bib | .pdf | .ps.gz ]

In this paper, we introduce CLUSION, a clustering visualization toolkit that facilitates data exploration and validation of clustering results. CLUSION is especially suitable for very high-dimensional spaces since it is based on pair-wise relationships of the data points rather than their original vector representation. Given a similarity semantic, guidance to answer the questions 'What is the right number of clusters?', 'Which cluster should be split?', and 'Which clusters should be merged?' can be drawn from the visualization. The visualization also induces a quality metric not biased to increase with the number of clusters. We present examples from market basket data characterized by high-dimensional (d > 10,000), highly sparse customer-product matrices with positive ordinal attribute values and a significant number of outliers.
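
A hedged sketch of the central idea (plotting details are illustrative, not the toolkit's): permute the similarity matrix by cluster membership so that coherent clusters appear as bright blocks along the diagonal.

```python
import numpy as np
import matplotlib.pyplot as plt

def clusion_plot(S, labels):
    """S: (n, n) similarity matrix; labels: cluster label per object."""
    order = np.argsort(labels, kind="stable")   # group rows/cols by cluster
    P = S[np.ix_(order, order)]                 # permuted similarity matrix
    plt.imshow(P, cmap="gray", interpolation="nearest")
    plt.title("Permuted similarity matrix (clusters as diagonal bands)")
    plt.show()
```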

[SGM00] Alexander Strehl, Joydeep Ghosh, and Raymond J. Mooney. Impact of similarity measures on web-page clustering. In Proc. AAAI Workshop on AI for Web Search (AAAI 2000), Austin, pages 58-64. AAAI/MIT Press, July 2000.
[ bib | .pdf | .ps.gz ]

Clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different metrics. We observe that in domains such as YAHOO that provide a categorization by human experts, a useful criterion for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation, and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hyper-graph partitioning, generalized k-means, weighted graph partitioning) on high-dimensional sparse data representing web documents. Performance is measured against a human-imposed classification into news categories and industry categories. We conduct a number of experiments and use t-tests to establish the statistical significance of results. Cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest. Also, weighted graph partitioning approaches are clearly superior to all others.
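
For reference, a small sketch of the four similarity measures compared above, written for dense numpy vectors (the study itself uses sparse document vectors):

```python
import numpy as np

def cosine(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    return cosine(x - x.mean(), y - y.mean())   # cosine of centered vectors

def extended_jaccard(x, y):
    return (x @ y) / (x @ x + y @ y - x @ y)

def euclidean_similarity(x, y):
    # one common conversion of Euclidean distance into a (0, 1] similarity
    return 1.0 / (1.0 + np.linalg.norm(x - y))
```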

[SA00b] Alexander Strehl and J. K. Aggarwal. A new Bayesian relaxation framework for the estimation and segmentation of multiple motions. In Proc. IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI 2000), Austin, pages 21-25. IEEE, April 2000.
[ bib | .pdf | .ps.gz ]

In this paper we propose a new probabilistic relaxation framework to perform robust multiple motion estimation and segmentation from a sequence of images. Our approach uses displacement information obtained from tracked features or raw sparse optical flow to iteratively estimate multiple motion models. Each iteration consists of a segmentation and a motion parameter estimation step. The motion models are used to compute probability density functions for all displacement vectors. Based on the estimated probabilities, a pixel-wise segmentation decision is made by a Bayesian classifier, which is optimal with respect to the minimum-error criterion. The updated segmentation then relaxes the motion parameter estimates. These two steps are iterated until the error of the fitted models is minimized. The Bayesian formulation provides a unified probabilistic framework for various motion models and induces inherent robustness through its rejection mechanism. An implementation of the proposed framework using translational and affine motion models is presented. Its superior performance on real image sequences containing multiple and fragmented motions is demonstrated.
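
An illustrative sketch of the alternating iteration under strong simplifying assumptions (two translational motions and equal isotropic Gaussian residuals, so the Bayes decision reduces to picking the nearest model); the paper's framework covers affine models and outlier rejection:

```python
import numpy as np

def two_motion_relaxation(disp, iters=20):
    """disp: (n, 2) displacement vectors from tracked features."""
    n = len(disp)
    assign = np.zeros(n, dtype=int)
    assign[n // 2:] = 1                        # arbitrary initial split
    for _ in range(iters):
        # Estimation step: refit each motion model (here: mean displacement)
        models = [disp[assign == k].mean(axis=0) if np.any(assign == k)
                  else disp.mean(axis=0) for k in (0, 1)]
        # Segmentation step: Bayes decision, reduced to minimum residual
        resid = np.stack([((disp - m) ** 2).sum(axis=1) for m in models])
        assign = resid.argmin(axis=0)
    return models, assign
```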

[SG00c] Alexander Strehl and Joydeep Ghosh. Value-based customer grouping from large retail data-sets. In Proc. SPIE Conference on Data Mining and Knowledge Discovery, Orlando, volume 4057, pages 33-42. SPIE, April 2000.
[ bib | .pdf | .ps.gz ]

In this paper, we propose OPOSSUM, a novel similarity-based clustering algorithm using constrained, weighted graph-partitioning. Instead of binary presence or absence of products in a market-basket, we use an extended `revenue per product' measure to better account for management objectives. Typically, the number of clusters desired in a database marketing application is only in the teens or below. OPOSSUM proceeds top-down, which is more efficient and takes a small number of steps to attain the desired number of clusters as compared to bottom-up agglomerative clustering approaches. OPOSSUM delivers clusters that are balanced in terms of either customers (samples) or revenue (value). To facilitate data exploration and validation of results we introduce CLUSION, a visualization toolkit for high-dimensional clustering problems. To enable closed-loop deployment of the algorithm, OPOSSUM has no user-specified parameters. Thresholding heuristics are avoided and the optimal number of clusters is automatically determined by a search for maximum performance. Results are presented on a real retail industry data-set of several thousand customers and products, to demonstrate the power of the proposed technique.
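
A generic sketch of the closed-loop idea: sweep candidate cluster counts and keep the best-scoring solution. The silhouette coefficient serves here as a stand-in for the paper's performance measure:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

def pick_num_clusters(S, k_range=range(2, 16)):
    """S: precomputed similarity matrix in [0, 1] with unit diagonal."""
    best_score, best_k, best_labels = -1.0, None, None
    for k in k_range:
        labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                    random_state=0).fit_predict(S)
        # 1 - S is a valid distance matrix given the assumptions on S
        score = silhouette_score(1.0 - S, labels, metric="precomputed")
        if score > best_score:
            best_score, best_k, best_labels = score, k, labels
    return best_k, best_labels
```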

[SA00a] Alexander Strehl and J. K. Aggarwal. MODEEP: A motion-based object detection and pose estimation method for airborne FLIR sequences. Machine Vision and Applications (MVA), 11(6):267-276, 2000.
[ bib | .pdf | .ps.gz ]

In this paper we present a method called MODEEP (Motion-based Object DEtection and Estimation of Pose) to detect independently moving objects (IMOs) in forward-looking infrared (FLIR) image sequences taken from an airborne, moving platform. Ego-motion effects are removed through a robust multi-scale affine image registration process. Thereafter, areas with residual motion indicate potential object activity. These areas are detected, refined and selected using a Bayesian classifier. The resulting regions are clustered into pairs such that each pair represents one object's front and rear end. Using motion and scene knowledge, we estimate object pose and establish a region-of-interest (ROI) for each pair. Edge elements within each ROI are used to segment the convex cover containing the IMO. We show detailed results on real, complex, cluttered and noisy sequences. Moreover, we outline the integration of our fast and robust system into a comprehensive automatic target recognition (ATR) and action classification system.
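
A hypothetical sketch of the ego-motion removal step, assuming OpenCV's ECC-based registration as a stand-in for the paper's robust multi-scale process; the Bayesian refinement of the candidate areas is not reproduced:

```python
import cv2
import numpy as np

def residual_motion_mask(frame0, frame1, thresh=25):
    """frame0, frame1: consecutive grayscale frames (uint8 arrays)."""
    warp = np.eye(2, 3, dtype=np.float32)
    # Fit an affine warp that registers frame1 onto frame0
    _, warp = cv2.findTransformECC(frame0, frame1, warp, cv2.MOTION_AFFINE)
    h, w = frame0.shape
    aligned = cv2.warpAffine(frame1, warp, (w, h),
                             flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
    residual = cv2.absdiff(frame0, aligned)
    return residual > thresh        # candidate independently moving areas
```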

[GSG99] Gunjan K. Gupta, Alexander Strehl, and Joydeep Ghosh. Distance based clustering of association rules. In Proc. ANNIE 1999, St. Louis, volume 9, pages 759-764. ASME, November 1999.
[ bib | .pdf | .ps.gz ]

Association rule mining is one of the most important procedures in data mining. In industry applications, often more than 10,000 rules are discovered. To allow manual inspection and support knowledge discovery, the number of rules has to be reduced significantly by techniques such as pruning and grouping. In this paper, we present a new normalized distance metric to group association rules. Based on these distances, an agglomerative clustering algorithm is used to cluster the rules. The rules are also embedded in a vector space by multi-dimensional scaling and clustered using a self-organizing feature map. The results are combined for visualization. We compare various distance measures and illustrate subjective and objective cluster purity on results obtained from data-sets.
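
A sketch of the embedding step described above: multi-dimensional scaling maps rules into a low-dimensional space from their pairwise distances (the paper's normalized rule distance itself is not reproduced):

```python
import numpy as np
from sklearn.manifold import MDS

def embed_rules(dist_matrix, dim=2):
    """dist_matrix: symmetric (n_rules x n_rules) rule-to-rule distances."""
    mds = MDS(n_components=dim, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(dist_matrix)   # (n_rules, dim) coordinates
```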

[SA99] Alexander Strehl and J. K. Aggarwal. Detecting moving objects in airborne forward looking infra-red sequences. In Proc. IEEE Workshop on Computer Vision Beyond the Visible Spectrum (CVPR 1999), Fort Collins, pages 3-12. IEEE, June 1999.
[ bib | .pdf | .ps.gz ]

In this paper we propose a system that detects independently moving objects (IMOs) in forward-looking infra-red (FLIR) image sequences taken from an airborne, moving platform. Ego-motion effects are removed through a robust multi-scale affine image registration process. Consequently, areas with residual motion indicate object activity. These areas are detected, refined and selected using a Bayes classifier. The remaining regions are clustered into pairs. Each pair represents an object's front and rear end. Using motion and scene knowledge we estimate object pose and establish a region-of-interest (ROI) for each pair. Edge elements within each ROI are used to segment the convex cover containing the IMO. We show detailed results on real, complex, cluttered and noisy sequences. Moreover, we outline the integration of our fast and robust system into a comprehensive automatic target recognition (ATR) and action classification system.

[Str98] Alexander Strehl. A new Bayesian relaxation algorithm for motion estimation and segmentation in the presence of two affine motions. Master's thesis, The University of Texas at Austin, August 1998.
[ bib | .pdf ]

Describing an image sequence in terms of a small number of coherently moving segments is useful for tasks ranging from autonomous vehicle navigation to video compression. A new probabilistic relaxation algorithm is proposed to perform motion estimation as well as object segmentation by using an iterative Bayesian approach. In the first stage, optical flow with local confidence estimates is computed. The second stage uses high-confidence locations to alternately estimate the parameters of two affine motions and segment the current image into two motion regions. This procedure is iterated in a stochastic relaxation framework to minimize the error of the fitted affine models until the motion parameter estimates converge. Applications of our motion estimation algorithm to stabilizing sequences, detecting and tracking objects, and obtaining their trajectories are presented. Our algorithm's performance is illustrated on synthetic and real image sequences.