We chose four text data-sets for comparison and briefly describe them
in this subsection.
The data-sets encompass a large variety of text styles. For example,
in WEBKB the documents vary significantly in length; some are in the
wrong category, some are dead links, and some have little content
(e.g., are mostly images). Also, the hub pages that Yahoo! refers to
are usually top-level branch pages. These tend to have more similar
bag-of-words content across different classes (e.g., contact
information, search windows, welcome messages) than pages oriented
toward news content. In contrast, REUT consists of well-written news
agency messages. However, these often belong to more than one category.
In YAHOO and REUT, words were stemmed using Porter's suffix stripping
algorithm [FBY92]. For all data-sets, only words occurring on average
between 0.01 and 0.1 times per document were counted to yield the
term-document matrix. This excludes stop words such as "a" and very
generic words such as "new", as well as words that are too rare, such
as "haruspex".
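
The frequency-based word selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original preprocessing code: the function name build_term_document_matrix and its parameters are hypothetical, documents are assumed to be already tokenized (and stemmed where applicable), and the interval bounds are treated as inclusive, which the text does not specify.

```python
from collections import Counter


def build_term_document_matrix(docs, low=0.01, high=0.1):
    """Build a term-document count matrix from tokenized documents.

    docs: list of token lists, one per document.
    Only words whose average number of occurrences per document lies in
    [low, high] are kept (inclusive bounds are an assumption).
    Returns (vocab, matrix), where matrix[i][j] is the number of times
    vocab[i] occurs in docs[j].
    """
    if not docs:
        return [], []
    n_docs = len(docs)
    # Per-document word counts.
    counts = [Counter(tokens) for tokens in docs]
    # Total occurrences of each word over the whole collection.
    totals = Counter()
    for c in counts:
        totals.update(c)
    # Keep words whose average occurrence per document is within range.
    vocab = sorted(w for w, t in totals.items() if low <= t / n_docs <= high)
    # Rows are words, columns are documents.
    matrix = [[c[w] for c in counts] for w in vocab]
    return vocab, matrix
```

In such a sketch, a word that appears in nearly every document has an average well above 0.1 and is dropped, while a word that appears only once in a large collection falls below 0.01 and is dropped as well.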
Alexander Strehl
2002-05-03