Developing Criteria for Extracting Principal Components and Assessing Multiple Significance Tests in Knowledge Discovery Applications

Description:

With advances in computer technology, organizations are able to store large amounts of data in data warehouses. There are two fundamental issues researchers must address: the dimensionality of data and the interpretation of multiple statistical tests. The first issue addressed by this research is the determination of the number of components to retain in principal components analysis. This research establishes regression, asymptotic theory, and neural network approaches for estimating mean and 95th percentile eigenvalues for implementing Horn's parallel analysis procedure for retaining components. Certain methods perform better for specific combinations of sample size and numbers of variables. The adjusted normal order statistic estimator (ANOSE), an asymptotic procedure, performs the best overall. Future research is warranted on combining methods to increase accuracy. The second issue involves interpreting multiple statistical tests. This study uses simulation to show that Parker and Rothenberg's technique using a density function with a mixture of betas to model p-values is viable for p-values from central and non-central t distributions. The simulation study shows that final estimates obtained in the proposed mixture approach reliably estimate the true proportion of the distributions associated with the null and nonnull hypotheses. Modeling the density of p-values allows for better control of the true experimentwise error rate and is used to provide insight into grouping hypothesis tests for clustering purposes. Future research will expand the simulation to include p-values generated from additional distributions. The techniques presented are applied to data from Lake Texoma where the size of the database and the number of hypotheses of interest call for nontraditional data mining techniques. The issue is to determine if information technology can be used to monitor the chlorophyll levels in the lake as chloride is removed upstream. A relationship established between chlorophyll and the energy reflectance, which can be measured by satellites, enables more comprehensive and frequent monitoring. The results have both economic and political ramifications.

Creator(s): Keeling, Kellie Bliss
Creation Date: August 1999
Partner(s):
UNT Libraries
Collection(s):
UNT Theses and Dissertations
Usage:
Total Uses: 53
Past 30 days: 0
Yesterday: 0
Creator (Author):
Publisher Info:
Publisher Name: University of North Texas
Place of Publication: Denton, Texas
Date(s):
  • Creation: August 1999
  • Digitized: June 13, 2007
Description:

With advances in computer technology, organizations are able to store large amounts of data in data warehouses. There are two fundamental issues researchers must address: the dimensionality of data and the interpretation of multiple statistical tests. The first issue addressed by this research is the determination of the number of components to retain in principal components analysis. This research establishes regression, asymptotic theory, and neural network approaches for estimating mean and 95th percentile eigenvalues for implementing Horn's parallel analysis procedure for retaining components. Certain methods perform better for specific combinations of sample size and numbers of variables. The adjusted normal order statistic estimator (ANOSE), an asymptotic procedure, performs the best overall. Future research is warranted on combining methods to increase accuracy. The second issue involves interpreting multiple statistical tests. This study uses simulation to show that Parker and Rothenberg's technique using a density function with a mixture of betas to model p-values is viable for p-values from central and non-central t distributions. The simulation study shows that final estimates obtained in the proposed mixture approach reliably estimate the true proportion of the distributions associated with the null and nonnull hypotheses. Modeling the density of p-values allows for better control of the true experimentwise error rate and is used to provide insight into grouping hypothesis tests for clustering purposes. Future research will expand the simulation to include p-values generated from additional distributions. The techniques presented are applied to data from Lake Texoma where the size of the database and the number of hypotheses of interest call for nontraditional data mining techniques. The issue is to determine if information technology can be used to monitor the chlorophyll levels in the lake as chloride is removed upstream. A relationship established between chlorophyll and the energy reflectance, which can be measured by satellites, enables more comprehensive and frequent monitoring. The results have both economic and political ramifications.

Degree:
Level: Doctoral
Discipline: Management Science
Language(s):
Subject(s):
Keyword(s): information technology | data mining | Lake Texoma
Contributor(s):
Partner:
UNT Libraries
Collection:
UNT Theses and Dissertations
Identifier:
  • OCLC: 45056324 |
  • UNTCAT: b2228421 |
  • ARK: ark:/67531/metadc2231
Resource Type: Thesis or Dissertation
Format: Text
Rights:
Access: Public
License: Copyright
Holder: Keeling, Kellie Bliss
Statement: Copyright is held by the author, unless otherwise noted. All rights reserved.