

Data-sets

We illustrate the cluster ensemble applications on two real and two artificial data-sets. Table 5.2 summarizes some basic properties of the data-sets (left) and the parameter choices (right). The first data-set (2D2K) was artificially generated and contains 500 points from each of two 2-dimensional (2D) Gaussian clusters with means $ (-0.227, 0.077)^{\dagger}$ and $ (0.095, 0.323)^{\dagger}$ and equal variance of 0.1. The second data-set (8D5K) contains 1000 points from 5 multivariate Gaussian distributions (200 points each) in 8D space. Again, all clusters have the same variance (0.1) but different means, which were drawn from a uniform distribution within the unit hypercube. Both artificial data-sets are illustrated in appendix A.1 and are available for download at http://strehl.com/.
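The two artificial data-sets can be regenerated from the description above. The sketch below assumes the stated variance of 0.1 is the per-dimension variance (so the standard deviation is $\sqrt{0.1}$); the random seed and generator are arbitrary choices, not from the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
std = np.sqrt(0.1)  # assumed: 0.1 is the per-dimension variance

# 2D2K: 500 points from each of two 2-D Gaussians with the stated means.
means_2d = np.array([[-0.227, 0.077], [0.095, 0.323]])
d2k = np.vstack([rng.normal(m, std, size=(500, 2)) for m in means_2d])
labels_2d = np.repeat([0, 1], 500)

# 8D5K: 200 points from each of 5 Gaussians in 8-D space, with means
# drawn uniformly from the unit hypercube and the same variance.
means_8d = rng.uniform(0.0, 1.0, size=(5, 8))
d8k = np.vstack([rng.normal(m, std, size=(200, 8)) for m in means_8d])
labels_8d = np.repeat(np.arange(5), 200)
```

This yields a 1000 x 2 matrix for 2D2K and a 1000 x 8 matrix for 8D5K, with ground-truth labels for evaluation.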

The third data-set (PENDIG, appendix A.3) is for pen-based recognition of handwritten digits. It is publicly available from the UCI Machine Learning Repository and was contributed by Alpaydin and Alimoglu. It contains 16 spatial features for each of the 7494 training and 3498 test cases (objects). There are ten classes of roughly equal size (balanced clusters), corresponding to the digits 0 to 9.

The fourth data-set is for text clustering. The 20 original Yahoo! news categories in the data are Business, Entertainment (both without sub-category and with the sub-categories art, cable, culture, film, industry, media, multimedia, music, online, people, review, stage, television, and variety), Health, Politics, Sports, and Technology. The data is publicly available from ftp://ftp.cs.umn.edu/dept/users/boley/ (K1 series) and was used in [BGG$^+$99,SGM00]. The raw 21839 $ \times$ 2340 word-document matrix consists of the non-normalized occurrence frequencies of stemmed words, obtained with Porter's suffix stripping algorithm [FBY92]. Pruning all words that occur less than 0.01 or more than 0.10 times per document on average, because they are insignificant (e.g., haruspex) or too generic (e.g., new), respectively, results in $ d=2903$ remaining words. We call this data-set YAHOO (see also appendix A.5).
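The pruning step described above can be sketched as follows. This is a minimal illustration, not the original preprocessing code; whether the 0.01 and 0.10 thresholds are inclusive is an assumption, and the toy matrix stands in for the real 21839 x 2340 one.

```python
import numpy as np

def prune_words(word_doc, low=0.01, high=0.10):
    """Drop rows (words) whose mean per-document occurrence count is
    below `low` (too rare to be significant) or above `high` (too
    generic). `word_doc` is a (#words x #documents) count matrix.
    Returns the pruned matrix and the boolean keep-mask."""
    avg = word_doc.mean(axis=1)
    keep = (avg >= low) & (avg <= high)  # assumed inclusive bounds
    return word_doc[keep], keep

# Toy stand-in for the real word-document matrix:
rng = np.random.default_rng(1)
toy = rng.poisson(0.05, size=(100, 50)).astype(float)
pruned, mask = prune_words(toy)
```

Every retained word then has an average occurrence frequency within the stated band, reducing the 21839 raw words to the $d=2903$ reported in the text.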

For 2D2K, 8D5K, and PENDIG we use $ k=2$, 5, and 10, respectively. When clustering YAHOO, we use $ k=40$ clusters unless noted otherwise. We chose twice the number of categories, since preliminary runs and visualization indicated this to be the more natural number of clusters. For 2D2K, 8D5K, and PENDIG we use Euclidean-based similarity; for YAHOO we use cosine-based similarity.
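The two similarity notions can be sketched as below. Cosine similarity is standard for text vectors; for the Euclidean case, the particular distance-to-similarity conversion shown ($1/(1+\Vert x-y\Vert)$) is one common choice and is an assumption here, not necessarily the exact form used in the experiments.

```python
import numpy as np

def euclidean_similarity(x, y):
    # Assumed conversion: maps Euclidean distance into (0, 1],
    # with identical points getting similarity 1.
    return 1.0 / (1.0 + np.linalg.norm(x - y))

def cosine_similarity(x, y):
    # Cosine of the angle between the two vectors; standard for
    # sparse, non-negative text representations such as YAHOO.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Both measures equal 1 for identical inputs; cosine similarity ignores vector length, which is why it suits document frequency vectors better than Euclidean distance.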


Table 5.2: Overview of data-set properties and parameters for cluster ensemble experiments. Balance is defined as the ratio of the average category size to the largest category size.

name    feature type  #features  #categories  balance  similarity  #clusters
2D2K    real          2          2            1.00     Euclidean   2
8D5K    real          8          5            1.00     Euclidean   5
PENDIG  real          16         10           0.87     Euclidean   10
YAHOO   ordinal       2903       20           0.24     Cosine      40



Alexander Strehl 2002-05-03