

Data-sets

We illustrate the cluster ensemble applications on two real and two artificial data-sets. Table 5.2 summarizes some basic properties of the data-sets (left) and the parameter choices (right). The first data-set (2D2K) was artificially generated and contains 500 points from each of two 2-dimensional (2D) Gaussian clusters with means $ (-0.227, 0.077)^{\dagger}$ and $ (0.095, 0.323)^{\dagger}$ and equal variance of 0.1. The second data-set (8D5K) contains 1000 points from 5 multivariate Gaussian distributions (200 points each) in 8D space. Again, all clusters have the same variance (0.1) but different means, which were drawn from a uniform distribution within the unit hypercube. Both artificial data-sets are illustrated in appendix A.1 and are available for download at http://strehl.com/.
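The two artificial data-sets can be regenerated from the description above. The sketch below assumes the stated variance of 0.1 is the per-dimension variance (so the standard deviation is $\sqrt{0.1}$); the random seed and generator are arbitrary choices, not from the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
std = np.sqrt(0.1)  # assumed: 0.1 is the per-dimension variance

# 2D2K: 500 points from each of two 2-D Gaussians with the stated means.
means_2d = np.array([[-0.227, 0.077], [0.095, 0.323]])
d2k = np.vstack([rng.normal(m, std, size=(500, 2)) for m in means_2d])
labels_2d = np.repeat([0, 1], 500)

# 8D5K: 200 points from each of 5 Gaussians in 8-D space, with means
# drawn uniformly from the unit hypercube and the same variance.
means_8d = rng.uniform(0.0, 1.0, size=(5, 8))
d8k = np.vstack([rng.normal(m, std, size=(200, 8)) for m in means_8d])
labels_8d = np.repeat(np.arange(5), 200)
```

This yields a 1000 x 2 matrix for 2D2K and a 1000 x 8 matrix for 8D5K, with ground-truth labels for evaluation.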

The third data-set (PENDIG, appendix A.3) is for pen-based recognition of handwritten digits. It is publicly available from the UCI Machine Learning Repository and was contributed by Alpaydin and Alimoglu. It contains 16 spatial features for each of the 7494 training and 3498 test cases (objects). There are ten classes of roughly equal size (balanced clusters), corresponding to the digits 0 to 9.

The fourth data-set is for text clustering. The 20 original Yahoo! news categories in the data are Business, Entertainment (both without sub-category and with the sub-categories art, cable, culture, film, industry, media, multimedia, music, online, people, review, stage, television, and variety), Health, Politics, Sports, and Technology. The data is publicly available from ftp://ftp.cs.umn.edu/dept/users/boley/ (K1 series) and was used in [BGG$^+$99,SGM00]. The raw 21839 $ \times$ 2340 word-document matrix consists of the non-normalized occurrence frequencies of stemmed words, obtained with Porter's suffix stripping algorithm [FBY92]. Pruning all words that occur less than 0.01 or more than 0.10 times per document on average, because they are insignificant (e.g., haruspex) or too generic (e.g., new), respectively, results in $ d=2903$ remaining words. We call this data-set YAHOO (see also appendix A.5).
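The pruning step described above can be sketched as follows. This is a minimal illustration, not the original preprocessing code; whether the 0.01 and 0.10 thresholds are inclusive is an assumption, and the toy matrix stands in for the real 21839 x 2340 one.

```python
import numpy as np

def prune_words(word_doc, low=0.01, high=0.10):
    """Drop rows (words) whose mean per-document occurrence count is
    below `low` (too rare to be significant) or above `high` (too
    generic). `word_doc` is a (#words x #documents) count matrix.
    Returns the pruned matrix and the boolean keep-mask."""
    avg = word_doc.mean(axis=1)
    keep = (avg >= low) & (avg <= high)  # assumed inclusive bounds
    return word_doc[keep], keep

# Toy stand-in for the real word-document matrix:
rng = np.random.default_rng(1)
toy = rng.poisson(0.05, size=(100, 50)).astype(float)
pruned, mask = prune_words(toy)
```

Every retained word then has an average occurrence frequency within the stated band, reducing the 21839 raw words to the $d=2903$ reported in the text.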

For 2D2K, 8D5K, and PENDIG we use $ k=2$, 5, and 10, respectively. When clustering YAHOO, we use $ k=40$ clusters unless noted otherwise. We chose twice the number of categories, since preliminary runs and visualization indicated this to be the more natural number of clusters. For 2D2K, 8D5K, and PENDIG we use Euclidean-based similarity; for YAHOO we use cosine-based similarity.
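The two similarity notions can be sketched as below. Cosine similarity is standard for text vectors; for the Euclidean case, the particular distance-to-similarity conversion shown ($1/(1+\Vert x-y\Vert)$) is one common choice and is an assumption here, not necessarily the exact form used in the experiments.

```python
import numpy as np

def euclidean_similarity(x, y):
    # Assumed conversion: maps Euclidean distance into (0, 1],
    # with identical points getting similarity 1.
    return 1.0 / (1.0 + np.linalg.norm(x - y))

def cosine_similarity(x, y):
    # Cosine of the angle between the two vectors; standard for
    # sparse, non-negative text representations such as YAHOO.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Both measures equal 1 for identical inputs; cosine similarity ignores vector length, which is why it suits document frequency vectors better than Euclidean distance.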


Table 5.2: Overview of data-set properties and parameters for cluster ensemble experiments. Balance is defined as the ratio of the average category size to the largest category size.

name    feature type  #features  #categories  balance  similarity  #clusters
2D2K    real          2          2            1.00     Euclidean   2
8D5K    real          8          5            1.00     Euclidean   5
PENDIG  real          16         10           0.87     Euclidean   10
YAHOO   ordinal       2903       20           0.24     Cosine      40



Alexander Strehl 2002-05-03