
Data-sets and Preprocessing

We chose four text data-sets for comparison, which this subsection briefly describes.

The data-sets encompass a large variety of text styles. For example, in WEBKB the documents vary significantly in length; some are in the wrong category, and some are dead links or have little content (e.g., are mostly images). Also, the hub pages that Yahoo! refers to are usually top-level branch pages. These tend to have more similar bag-of-words content across different classes (e.g., contact information, search windows, welcome messages) than news-content-oriented pages. In contrast, REUT consists of well-written news agency messages. However, these messages often belong to more than one category.

In YAHOO and REUT, words were stemmed using Porter's suffix-stripping algorithm [FBY92]. For all data-sets, only words occurring on average between 0.01 and 0.1 times per document were counted to yield the term-document matrix. This excludes stop words such as "a" and very generic words such as "new", as well as overly rare words such as "haruspex".
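The frequency-based word selection described above can be illustrated with a minimal sketch. This is not the code used in the experiments: the function name build_term_document_matrix, the lower-casing and whitespace tokenization, the inclusive treatment of both thresholds, and the toy corpus are all assumptions made for illustration, and stemming is omitted.

```python
from collections import Counter


def build_term_document_matrix(docs, low=0.01, high=0.1):
    """Return (vocab, matrix), keeping only words whose average number of
    occurrences per document lies in [low, high].

    Assumptions (not from the original text): whitespace tokenization,
    lower-casing, inclusive bounds, and no stemming.
    """
    # Per-document word counts.
    counts = [Counter(doc.lower().split()) for doc in docs]

    # Total occurrences of each word over the whole collection.
    totals = Counter()
    for c in counts:
        totals.update(c)

    n_docs = len(docs)
    # Keep words whose mean count per document falls inside the band.
    vocab = sorted(w for w, n in totals.items() if low <= n / n_docs <= high)

    # One row per document, one column per retained word.
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix


# Toy corpus of 200 documents (hypothetical data).
docs = []
for i in range(200):
    words = ["a"]                 # stop word: 1.0 occurrences per document
    if i % 2 == 0:
        words.append("new")       # very generic word: 0.5 per document
    if i % 20 == 0:
        words.append("market")    # mid-frequency word: 0.05 per document
    if i == 7:
        words.append("haruspex")  # too rare: 0.005 per document
    docs.append(" ".join(words))

vocab, matrix = build_term_document_matrix(docs)
print(vocab)  # ['market']
```

In this toy corpus only "market" falls inside the band, so the resulting matrix has 200 rows and a single column. A real pipeline would apply a stemmer before counting for the data-sets where stemming was used.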


Alexander Strehl 2002-05-03