Creating a criterion-based information agent through data mining for automated identification of scholarly research on the World Wide Web

Description:

This dissertation creates an information agent that correctly identifies Web pages containing scholarly research approximately 96% of the time. It does this by analyzing the Web page against a set of criteria and then using a classification tree to arrive at a decision.

The criteria were gathered from the literature on selecting print and electronic materials for academic libraries. A Delphi study was conducted with an international panel of librarians to expand and refine the criteria until a list of 41 operationalizable criteria was agreed upon. A Perl program was then designed to analyze a Web page and assign a numerical value to each criterion.
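
As an illustration of this scoring step, the sketch below shows how a criterion might be operationalized in Perl. The two criteria shown (a count of parenthetical citation years and a word count) are hypothetical stand-ins, not criteria taken from the dissertation's list of 41.

```perl
#!/usr/bin/perl
# Minimal sketch of criterion scoring; not the dissertation's actual program.
use strict;
use warnings;

# Hypothetical criterion: count "(1999)"-style parenthetical years,
# which often signal in-text citations or a reference list.
sub score_citation_years {
    my ($html) = @_;
    my $count = () = $html =~ /\(\d{4}\)/g;
    return $count;
}

# Score one page, producing a numeric value per criterion.
sub score_page {
    my ($html) = @_;
    my @words = split /\s+/, $html;
    return {
        citation_years => score_citation_years($html),
        word_count     => scalar @words,
    };
}

my $html   = do { local $/; <STDIN> };    # slurp a page from stdin
my $scores = score_page($html);
print join( ',', map { "$_=$scores->{$_}" } sort keys %$scores ), "\n";
```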

A collection of 10,000 Web pages was gathered: 5,000 pages containing the full work of scholarly research and 5,000 random pages, representative of user searches, that do not contain scholarly research. Datasets were built by running the Perl program on these Web pages. The datasets were then split into model-building and testing sets.
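
A minimal sketch of such a split, assuming the scored pages sit in one CSV file, appears below; the file names and the 60/40 random partition are illustrative assumptions rather than details from the dissertation.

```perl
#!/usr/bin/perl
# Sketch of a random model-building/test split; file names and the
# 60/40 proportion are illustrative assumptions.
use strict;
use warnings;

open my $in, '<', 'criteria_scores.csv' or die "open: $!";
my $header = <$in>;
my ( @model, @test );
while ( my $row = <$in> ) {
    # Assign each scored page to the model-building set with
    # probability 0.6, otherwise to the test set.
    if ( rand() < 0.6 ) { push @model, $row } else { push @test, $row }
}
close $in;

for ( [ 'model.csv', \@model ], [ 'test.csv', \@test ] ) {
    my ( $file, $rows ) = @$_;
    open my $out, '>', $file or die "open $file: $!";
    print $out $header, @$rows;
    close $out;
}
```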

Data mining was then used to create different classification models. Four techniques were used: logistic regression, nonparametric discriminant analysis, classification trees, and neural networks. The models were built with the model-building dataset and then tested against the test dataset. Precision and recall were used to judge the effectiveness of each model. In addition, a set of pages that were difficult to classify because of their similarity to scholarly research was gathered and classified with the models.
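
The evaluation step can be sketched as follows, assuming each model's output on the test set is available as (actual, predicted) label pairs; the label pairs shown are toy data.

```perl
#!/usr/bin/perl
# Sketch of the evaluation step: precision = TP/(TP+FP),
# recall = TP/(TP+FN). The label pairs below are toy data.
use strict;
use warnings;

sub precision_recall {
    my ($pairs) = @_;    # arrayref of [actual, predicted]; 1 = scholarly
    my ( $tp, $fp, $fn ) = ( 0, 0, 0 );
    for my $p (@$pairs) {
        my ( $actual, $predicted ) = @$p;
        $tp++ if $actual  && $predicted;
        $fp++ if !$actual && $predicted;
        $fn++ if $actual  && !$predicted;
    }
    return ( $tp / ( $tp + $fp ), $tp / ( $tp + $fn ) );
}

my @pairs = ( [ 1, 1 ], [ 1, 1 ], [ 1, 0 ], [ 0, 1 ], [ 0, 0 ] );
my ( $precision, $recall ) = precision_recall( \@pairs );
printf "precision %.2f, recall %.2f\n", $precision, $recall;
```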

The classification tree produced the most effective classification model, with a precision ratio of 96% and a recall ratio of 95.6%. The logistic regression model, however, correctly classified more of the problematic pages.
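
A classification tree of this kind reduces, once trained, to nested threshold tests over the criterion scores. The sketch below shows the general shape of such a rule; the criteria and cutoff values are invented for illustration and do not reproduce the tree reported in the dissertation.

```perl
#!/usr/bin/perl
# General shape of a learned classification-tree rule; the criteria
# and cutoff values are invented, not the tree from the dissertation.
use strict;
use warnings;

sub is_scholarly {
    my ($s) = @_;    # hashref of criterion scores for one page
    if ( $s->{citation_years} >= 5 ) {
        return $s->{word_count} >= 2000 ? 1 : 0;
    }
    return $s->{has_abstract} ? 1 : 0;
}

my %scores = ( citation_years => 12, word_count => 5400, has_abstract => 1 );
print is_scholarly( \%scores ) ? "scholarly\n" : "not scholarly\n";
```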

This agent can be used to create a database of scholarly research published on the Web. In addition, the technique can be used to create a database of any type of structured electronic information.

Creator(s): Nicholson, Scott
Creation Date: May 2000
Partner(s):
UNT Libraries
Collection(s):
UNT Theses and Dissertations
Publisher Info:
Publisher Name: University of North Texas
Place of Publication: Denton, Texas
Date(s):
  • Creation: May 2000
  • Digitized: June 28, 2007

Degree:
Level: Doctoral
Discipline: Information Science
Language(s): English
Subject(s):
Keyword(s): Research methodology | World Wide Web
Identifier:
  • OCLC: 47210576
  • UNTCAT: b2301130
  • ARK: ark:/67531/metadc2459
Resource Type: Thesis or Dissertation
Format: Text
Rights:
Access: Public
License: Copyright
Holder: Nicholson, Scott
Statement: Copyright is held by the author, unless otherwise noted. All rights reserved.