Data-sets and Preprocessing

We chose four text data-sets for comparison. Each is briefly described below:

YAHOO. This data was parsed from Yahoo! news web-pages [BGG+99]. The 20 original categories for the pages are:
- Business
- Entertainment
  - no sub-category
  - art
  - cable
  - culture
  - film
  - industry
  - media
  - multimedia
  - music
  - online
  - people
  - review
  - stage
  - television
  - variety
- Health
- Politics
- Sports
- Technology
The data can be downloaded from ftp://ftp.cs.umn.edu/dept/users/boley/ (K1 series) (see also appendix A.5).

N20. The data contains roughly 1000 postings each from the following 20 newsgroup topics [Lan95]:
- alt.atheism
- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- misc.forsale
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.med
- sci.electronics
- sci.space
- soc.religion.christian
- talk.politics.guns
- talk.politics.mideast
- talk.politics.misc
- talk.religion.misc
The data can be found, e.g., at http://www.at.mit.edu/jrennie/20Newsgroups/ (see also appendix A.6).

WEBKB. From the CMU Web KB Project [CDF+98], web-pages from the following ten industry sectors according to Yahoo! were selected:
- airline
- computer hardware
- electronic instruments and controls
- forestry and wood products
- gold and silver
- mobile homes and rvs
- oil well services and equipment
- railroad
- software and programming
- trucking
Each industry contributes about 10% of the pages.

REUT. The Reuters-21578 collection, Distribution 1.0, is available from Lewis at http://www.research.att.com/lewis/. We use the primary topic keyword as the category. There are 82 unique primary topics in the data, and the categories are highly imbalanced.
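The use of a primary topic as the category, and the resulting imbalance, might be illustrated as follows. This is a minimal sketch, not the original preprocessing code: it assumes each document carries an ordered list of topic keywords and that the first keyword is the primary one. The helper names (primary_topic_labels, imbalance_ratio) and the toy topic lists are hypothetical, and skipping documents without any topic is likewise an assumption.

```python
from collections import Counter


def primary_topic_labels(doc_topics):
    """Return the first (assumed primary) topic of each document.

    Documents with an empty topic list are skipped (an assumption).
    """
    return [topics[0] for topics in doc_topics if topics]


def imbalance_ratio(labels):
    """Ratio of the largest to the smallest category size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy topic lists, made up for illustration only.
doc_topics = [["earn"], ["acq", "earn"], ["earn"], [], ["grain", "wheat"], ["earn"]]

labels = primary_topic_labels(doc_topics)
print(labels)                   # ['earn', 'acq', 'earn', 'grain', 'earn']
print(imbalance_ratio(labels))  # 3.0
```

A document with several topics contributes only to the category of its first topic, so any multi-label structure is discarded under this assumption.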

The data-sets encompass a large variety of text styles. In WEBKB, for example, documents vary significantly in length; some are in the wrong category, and some are dead links or have little content (e.g., are mostly images). Also, the hub pages that Yahoo! refers to are usually top-level branch pages. These tend to have more similar bag-of-words content across different classes (e.g., contact information, search windows, welcome messages) than news-content-oriented pages. In contrast, REUT consists of well-written news agency messages. However, these often belong to more than one category.

In YAHOO and REUT, words were stemmed using Porter's suffix-stripping algorithm [FBY92]. For all data-sets, only words occurring on average between 0.01 and 0.1 times per document were counted to yield the term-document matrix. This excludes stop words such as a and very generic words such as new, as well as overly rare words such as haruspex.
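The frequency-based word selection can be sketched as below. This is an illustrative sketch under stated assumptions, not the original implementation: documents are assumed to be already tokenized (and stemmed where applicable), the interval bounds are treated as inclusive, and the matrix is laid out with one row per term and one column per document. The function names (select_vocabulary, term_document_matrix) are hypothetical. The toy corpus has only ten documents, so wider thresholds (0.2 and 0.5) stand in for 0.01 and 0.1.

```python
from collections import Counter


def select_vocabulary(docs, low=0.01, high=0.1):
    """Keep words whose average count per document lies in [low, high]."""
    n_docs = len(docs)
    totals = Counter()
    for tokens in docs:
        totals.update(tokens)
    return sorted(w for w, c in totals.items() if low <= c / n_docs <= high)


def term_document_matrix(docs, vocabulary):
    """Count matrix with one row per term and one column per document."""
    index = {w: i for i, w in enumerate(vocabulary)}
    matrix = [[0] * len(docs) for _ in vocabulary]
    for j, tokens in enumerate(docs):
        for w in tokens:
            i = index.get(w)
            if i is not None:  # words outside the vocabulary are ignored
                matrix[i][j] += 1
    return matrix


# Toy corpus of 10 documents, made up for illustration only.
docs = [["a", "stock", "haruspex"], ["a", "stock"], ["a", "stock"]]
docs += [["a", "new"] for _ in range(7)]

# Average counts per document: a = 1.0, new = 0.7, stock = 0.3, haruspex = 0.1.
vocabulary = select_vocabulary(docs, low=0.2, high=0.5)
print(vocabulary)  # ['stock']
matrix = term_document_matrix(docs, vocabulary)
```

Here the stop word "a" and the generic word "new" exceed the upper threshold, while "haruspex" falls below the lower one, so only "stock" remains as a row of the matrix.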

Alexander Strehl 2002-05-03