

Web-document Clusters

In this subsection, we present results on documents from the Yahoo! news section. Each of the 2340 documents is characterized by a bag-of-words. The data is publicly available from ftp://ftp.cs.umn.edu/dept/users/boley/ (K1 series) and was used in [BGG$^+$99,SGM00]. The 20 original Yahoo! news categories are Business (B), Entertainment (no sub-category (E), art (a), cable (c), culture (cu), film (f), industry (i), media (m), multimedia (mm), music (mu), online (o), people (p), review (r), stage (s), television (t), variety (v)), Health (H), Politics (P), Sports (S), and Technology (T), corresponding to the category labels 1, ..., 20, respectively. The raw $ 21839 \times 2340$ word-by-document matrix contains the non-normalized occurrence frequencies of stemmed words, obtained with Porter's suffix-stripping algorithm [Fra92]. We prune all words that occur less than 0.01 or more than 0.10 times per document on average, since such words are insignificant (e.g., haruspex) or too generic (e.g., new), respectively; this leaves $ d=2903$ words. We call this data-set YAHOO (see also appendix A.5).

Let us point out some noteworthy differences between clustering market-baskets and clustering documents. Firstly, discriminating by vector length is no longer desirable: customer life-time value matters, but document length does not. Consequently, we use cosine similarity $ s^{(\mathrm{C})}$ instead of extended Jaccard similarity $ s^{(\mathrm{J})}$ (see the sketch at the end of this passage). Secondly, balancing is less of a concern in document clustering, since there are usually no direct monetary costs attached to the actions derived from the clusters. Consequently, we first over-cluster with sample-balanced OPOSSUM and then allow user-guided merging of clusters through CLUSION.

The YAHOO data-set is notorious for having diffuse groups with overlaps among categories, a few categories with multi-modal distributions, etc. These aspects can easily be explored by inspecting the class labels within each cluster, merging some clusters, and visualizing the result again. Figure 3.6 shows clusterings for three settings of $ k$. For $ k=10$ (figure 3.6(a)), most clusters are not dense enough, even though the first two clusters already appear as if they should not have been split. After increasing to $ k=40$ (figure 3.6(b)), CLUSION indicates that the clustering now has sufficiently compact clusters. We then successively merge pairs of highly related clusters until we obtain our final clustering with $ k=20$ (figure 3.6(c)). The merging process is guided by inter-cluster similarity (e.g., bright off-diagonal regions) augmented by cluster descriptions (e.g., related frequent words). In fact, in the graphical user interface of CLUSION, merging is as easy as clicking on a selected off-diagonal region.
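To make the choice of similarity measure concrete, the following is a minimal sketch in Python/NumPy (illustrative only, not part of the original experiments; function names are ours) of the two measures on non-negative count vectors, with the extended Jaccard similarity taken as $ s^{(\mathrm{J})}(x,y) = x^{\dagger}y / (\Vert x \Vert^2 + \Vert y \Vert^2 - x^{\dagger}y)$:

    import numpy as np

    def s_cosine(x, y):
        # s^(C): depends only on the directions of x and y, so documents
        # of different lengths can still be maximally similar
        return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def s_ext_jaccard(x, y):
        # s^(J): also discriminates by vector length (magnitude), which is
        # desirable for market-baskets but not for documents
        xy = float(x @ y)
        return xy / (float(x @ x) + float(y @ y) - xy)

    x = np.array([2.0, 1.0, 0.0, 3.0])  # a short document
    y = 3.0 * x                         # the same document, three times as long
    print(s_cosine(x, y))       # 1.0   -- length-invariant
    print(s_ext_jaccard(x, y))  # 0.429 -- penalized for the length difference

Under cosine similarity the two vectors are identical, whereas extended Jaccard penalizes the length mismatch; for market-baskets, where total purchase volume carries information, the latter behavior is the desired one.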

Figure 3.6: Comparison of various numbers of clusters $ k$ for the YAHOO news data: (a) under-clustering at $ k=10$; (b) over-clustering at $ k=40$; (c) good clustering at $ k=20$, obtained through interactive split and merge using CLUSION. See the color pictures in the soft-copy for cluster boundaries.
[Figure 3.6: three CLUSION similarity-matrix visualizations, panels (a), (b), and (c)]


Table 3.2: Cluster evaluations with their descriptive and discriminative terms (top), and the confusion matrix (bottom), for the YAHOO news example (see also figure 3.6(c)). For each cluster number $ \mathcal{C}_{\ell}$, the dominant category $ \mathcal{K}_{\hat{h}}$, purity $ \phi^{(\mathrm{A})}$, and entropy $ \phi^{(\mathrm{B})}$ are shown.
$ \mathcal{C}_{\ell}$  $ \mathcal{K}_{\hat{h}}$  $ \phi^{(\mathrm{A})}$  $ \phi^{(\mathrm{B})}$  top 3 descriptive terms  top 3 discriminative terms

 1   P    21.05%   0.73   israel, teeth, dental        mckinnei, prostat, weizman
 2   H    91.48%   0.15   breast, smok, surgeri        symptom, protein, vitamin
 3   S    68.39%   0.40   smith, player, coach         hingi, touchdown, rodman
 4   P    52.84%   0.60   republ, committe, reform     icke, veto, teamster
 5   T    63.79%   0.39   java, sun, card              nader, wireless, lucent
 6   o    57.63%   0.40   apple, intel, electron       pentium, ibm, compaq
 7   B    60.23%   0.48   cent, quarter, rose          dow, ahmanson, greenspan
 8   f    37.93%   0.66   hbo, ali, alan               phillip, lange, wendi
 9   cu   50.85%   0.48   bestsell, weekli, hardcov    hardcov, chicken, bestsell
10   p    36.21%   0.56   albert, nomin, winner        forcibl, meredith, sportscast
11   f    67.80%   0.33   miramax, chri, novel         cusack, cameron, man
12   f    77.59%   0.31   cast, shoot, indie           juliett, showtim, cast
13   r    47.28%   0.56   showbiz, sound, band         dialogu, prodigi, submiss
14   mu   44.07%   0.56   concert, artist, miami       bing, calla, goethe
15   p    50.00%   0.50   notabl, venic, classic       stamp, skelton, espn
16   mu   18.97%   0.71   fashion, sold, bbc           poetri, versac, worn
17   p    55.08%   0.54   funer, crash, royal          spencer, funer, manslaught
18   t    82.76%   0.24   househ, sitcom, timeslot     timeslot, slot, household
19   f    38.79%   0.58   king, japanes, movi          denot, winfrei, atop
20   f    69.49%   0.36   weekend, ticket, gross       weekend, gross, mimic


$ \mathcal{C}_{\ell}$   B E a c cu f i m mm mu o p r s t v H P S T
7 106 1 - 4 2 - 30 6 - 4 2 1 - - 5 2 - 2 - 11
9 - - - 3 30 17 - - 1 1 2 2 1 - 1 1 - - - -
8 - - 1 7 - 22 2 - - 3 1 5 8 1 5 2 - - 1 -
11 - - - 1 - 40 1 - - - - 1 2 - 1 13 - - - -
12 - - - 2 - 45 - - - - - 2 1 2 4 2 - - - -
19 - 1 - 3 1 45 1 - - 8 - 15 2 - 25 14 - - 1 -
20 - 1 1 - - 41 - - - 4 - - - 5 6 1 - - - -
14 - 2 8 - 4 2 - - - 26 1 12 - 2 1 - - 1 - -
16 - 1 4 1 9 9 2 2 1 11 - 11 - - 6 - - 1 - -
6 8 - - - - - 1 - 3 - 34 - - - - 1 - - - 12
10 - - - 3 1 4 - - 2 2 1 21 2 - 20 2 - - - -
15 - - 2 1 5 13 - - - 4 2 29 - - 2 - - - - -
17 - 1 - 2 6 5 1 6 - 12 1 65 3 - 12 4 - - - -
13 - - 1 1 9 22 6 1 3 33 9 58 139 7 2 3 - - - -
18 - - 1 2 - 1 - - - - - 2 - - 48 4 - - - -
2 2 - - 2 1 1 1 - 1 1 3 5 - - 6 - 483 5 17 -
1 3 2 2 1 - 4 - 1 - 4 - 10 - - 5 - 11 12 2 -
4 14 - 4 7 5 2 15 5 - 6 3 6 - 1 12 2 - 93 1 -
3 1 - - 1 1 5 10 - 3 5 - 3 - - 23 3 - - 119 -
5 8 - - 3 - - - - - 1 6 - - - 3 - - - - 37


Table 3.2(top) shows the cluster evaluations together with descriptive and discriminative word stems. Each cluster $ \mathcal{C}_{\ell}$ is evaluated using its dominant category $ \mathcal{K}_{\hat{h}}$, purity $ \phi^{(\mathrm{A})}$, and entropy $ \phi^{(\mathrm{B})}$. Let $ n_{\ell}^{(h)}$ denote the number of objects in cluster $ \mathcal{C}_{\ell}$ that belong to category $ h$ according to the original Yahoo! categorization. Cluster $ \mathcal{C}_{\ell}$'s purity is defined as

$\displaystyle \phi^{(\mathrm{A})} (\mathcal{C}_{\ell}) = \frac{1}{n_{\ell}} \max_h (n_{\ell}^{(h)}) .$ (3.9)

Purity can be interpreted as the classification rate under the assumption that all samples of a cluster are predicted to be members of the actual dominant class for that cluster. Alternatively, we also use [0,1]-normalized entropy, which is defined for a problem with $ g$ categories as

$\displaystyle \phi^{(\mathrm{B})} (\mathcal{C}_{\ell}) = - \sum_{h=1}^{g} \frac{n_{\ell}^{(h)}}{n_{\ell}} \log_g \left( \frac{n_{\ell}^{(h)}}{n_{\ell}} \right) .$ (3.10)

Entropy is a more comprehensive measure than purity: rather than considering only the number of objects `in' and `not in' the most frequent category, it considers the entire distribution. Table 3.2(bottom) gives the complete confusion matrix, which shows how clusters and categories are related. Note that neither category labels nor prior distribution information are used during the unsupervised clustering process. The clustering is in fact very good: it improves on the original categorization in terms of edge cut and similarity lift, and it provides a much better grouping when only word frequencies are considered. The evaluation metrics serve to validate that our results capture relevant categorizations; their importance for our purpose is limited, however, since we are solving a clustering problem and not a classification problem. The largest and best cluster is cluster $ \mathcal{C}_2$, with 483 out of 528 documents coming from the health category. Health-related documents use a very distinct vocabulary and can hence be separated cleanly. Small and poorly distinguished categories have been merged with other documents; for example, the arts category has mostly been absorbed by the music category to form clusters 14 and 16. This is inevitable, since the 20 categories vary widely in size, from 9 to 494 documents, while the clusters OPOSSUM provides are much more balanced (from 58 to 528 documents per cluster). The short sketch below verifies the tabulated purity and entropy of cluster $ \mathcal{C}_2$ directly from its confusion-matrix row.
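As a sanity check, here is a minimal sketch in Python/NumPy (again not part of the original work) that evaluates equations 3.9 and 3.10 on the confusion-matrix row of cluster $ \mathcal{C}_2$ from table 3.2; it reproduces the tabulated purity of 91.48% and entropy of 0.15:

    import numpy as np

    def purity(counts):
        # phi^(A): fraction of the cluster in its dominant category (eq. 3.9)
        counts = np.asarray(counts, dtype=float)
        return counts.max() / counts.sum()

    def normalized_entropy(counts, g=20):
        # phi^(B): [0,1]-normalized entropy with base-g logarithm (eq. 3.10)
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]  # convention: 0 * log(0) = 0
        return float(-(p * np.log(p) / np.log(g)).sum())

    # Confusion-matrix row of cluster 2 over the 20 categories
    # (B E a c cu f i m mm mu o p r s t v H P S T), from table 3.2:
    n2 = [2, 0, 0, 2, 1, 1, 1, 0, 1, 1, 3, 5, 0, 0, 6, 0, 483, 5, 17, 0]
    print(f"purity  = {purity(n2):.2%}")              # 91.48%
    print(f"entropy = {normalized_entropy(n2):.2f}")  # 0.15

Note that the normalization by $ \log g$ maps the entropy of a uniform distribution over all 20 categories to 1, so values near 0 indicate nearly pure clusters.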