The Cluster Hypothesis: A visual/statistical analysis

Access: Use of this item is restricted to the UNT Community
Description:

By allowing judgments based on a small number of exemplar documents to be applied to a larger number of unexamined documents, clustered presentation of search results represents an intuitively attractive possibility for reducing the cognitive resource demands on human users of information retrieval systems. However, clustered presentation of search results is sensible only to the extent that naturally occurring similarity relationships among documents correspond to topically coherent clusters. The Cluster Hypothesis posits just such a systematic relationship between document similarity and topical relevance. To date, experimental validation of the Cluster Hypothesis has proved problematic, with collection-specific results both supporting and failing to support this fundamental theoretical postulate.

The present study consists of two computational information visualization experiments, representing a two-tiered test of the Cluster Hypothesis under adverse conditions. Both experiments rely on multidimensionally scaled representations of interdocument similarity matrices. Experiment 1 is a term-reduction condition, in which descriptive titles are extracted from Associated Press news stories drawn from the TREC information retrieval test collection. The clustering behavior of these titles is compared to the behavior of the corresponding full text via statistical analysis of the visual characteristics of a two-dimensional similarity map. Experiment 2 is a dimensionality-reduction condition, in which inter-item similarity coefficients for full-text documents are scaled into a single dimension and then rendered as a two-dimensional visualization; the clustering behavior of relevant documents within these unidimensionally scaled representations is examined via visual and statistical methods.

Taken as a whole, results of both experiments lend strong though not unqualified support to the Cluster Hypothesis. In Experiment 1, semantically meaningful 6.6-word document surrogates systematically conform to the predictions of the Cluster Hypothesis. In Experiment 2, the majority of the unidimensionally scaled datasets exhibit a marked nonuniformity of distribution of relevant documents, further supporting the Cluster Hypothesis.

Results of the two experiments are profoundly question-specific. Post hoc analyses suggest that it may be possible to predict the success of clustered searching based on the lexical characteristics of users' natural-language expression of their information need.
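
Illustrative sketch (not part of the original record): the abstract describes scaling an interdocument similarity matrix into a single dimension. The Python below shows one way such a scaling could be done, under assumptions the record does not confirm: cosine similarity over raw word counts, dissimilarity taken as 1 minus similarity, and classical (Torgerson) scaling with power iteration for the leading eigenvector. The function names and the four toy documents are hypothetical, and the dissertation's actual similarity coefficient and scaling algorithm may differ.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scale_to_one_dimension(docs, iterations=200):
    """Place each document on a line using classical one-dimensional scaling.

    Assumption: dissimilarity = 1 - cosine similarity. This is not a true
    Euclidean distance, so the result is only an approximation.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    n = len(vectors)
    # Squared dissimilarities between every pair of documents.
    d2 = [[(1.0 - cosine_similarity(vectors[i], vectors[j])) ** 2
           for j in range(n)] for i in range(n)]
    # Double-centre the squared dissimilarities (matrix is symmetric).
    row_mean = [sum(row) / n for row in d2]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration for the eigenvector with the largest-magnitude eigenvalue.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue; clamp in case it is negative.
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in v]


# Hypothetical toy collection: two documents on each of two topics.
docs = [
    "oil prices rise as opec cuts output",
    "opec output cuts push oil prices higher",
    "senate votes on farm subsidy bill",
    "farm bill subsidy vote delayed in senate",
]
coords = scale_to_one_dimension(docs)
for text, x in zip(docs, coords):
    print(f"{x:+.3f}  {text}")
```

In this toy collection the two oil stories should land near each other on the line and far from the two farm-bill stories, which is the kind of nonuniform distribution of topically related documents that the Cluster Hypothesis predicts.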

Creator(s): Sullivan, Terry
Creation Date: May 2000
Partner(s):
UNT Libraries
Collection(s):
UNT Theses and Dissertations
Creator (Author): Sullivan, Terry
Publisher Info:
Publisher Name: University of North Texas
Place of Publication: Denton, Texas
Date(s):
  • Creation: May 2000
  • Digitized: June 28, 2007
Degree:
Level: Doctoral
Discipline: Information Science
Language(s):
Subject(s):
Keyword(s): Document search | Document collection | Information theory
Contributor(s):
Identifier:
  • OCLC: 47233220
  • UNTCAT: b2302287
  • ARK: ark:/67531/metadc2444
Resource Type: Thesis or Dissertation
Format: Text
Rights:
Access: Use restricted to UNT Community
License: Copyright
Holder: Sullivan, Terry
Statement: Copyright is held by the author, unless otherwise noted. All rights reserved.