13 Matching Results

Supporting large-scale computational science

Description: Business needs have driven the development of commercial database systems since their inception. As a result, there has been a strong focus on supporting many users, minimizing the potential corruption or loss of data, and maximizing performance metrics like transactions per second, or TPC-C and TPC-D results. It turns out that these optimizations have little to do with the needs of the scientific community, and in particular have little impact on improving the management and use of large-scale high-dimensional data. At the same time, there is an unanswered need in the scientific community for many of the benefits offered by a robust DBMS. For example, tying an ad-hoc query language such as SQL together with a visualization toolkit would be a powerful enhancement to current capabilities (a minimal sketch of this idea appears after this record). Unfortunately, there has been little emphasis or discussion in the VLDB community on this mismatch over the last decade. The goal of the paper is to identify the specific issues that need to be resolved before large-scale scientific applications can make use of DBMS products. This topic is addressed in the context of an evaluation of commercial DBMS technology applied to the exploration of data generated by the Department of Energy's Accelerated Strategic Computing Initiative (ASCI). The paper describes the data being generated for ASCI as well as current capabilities for interacting with and exploring this data. The attraction of applying standard DBMS technology to this domain is discussed, as well as the technical and business issues that currently make this an infeasible solution.
Date: February 19, 1998
Creator: Musick, R.
Partner: UNT Libraries Government Documents Department
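
The description above floats the idea of tying an ad-hoc query language such as SQL to a visualization toolkit. The following minimal sketch illustrates that query-then-plot coupling; the zones table, its columns, and the synthetic values are invented for illustration, with SQLite and matplotlib standing in for a production DBMS and visualization toolkit.

```python
# Sketch: an ad-hoc SQL query feeding a visualization toolkit.
# The table name, columns, and values are hypothetical illustrations.
import sqlite3
import matplotlib.pyplot as plt

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE zones (zone_id INTEGER, timestep INTEGER, pressure REAL)")
con.executemany(
    "INSERT INTO zones VALUES (?, ?, ?)",
    [(1, t, 100.0 + 0.5 * t) for t in range(50)],
)

# Ad-hoc query: select a slice of the simulation data ...
rows = con.execute(
    "SELECT timestep, pressure FROM zones WHERE zone_id = 1 ORDER BY timestep"
).fetchall()

# ... and hand the result directly to the visualization layer.
steps, pressures = zip(*rows)
plt.plot(steps, pressures)
plt.xlabel("timestep")
plt.ylabel("pressure")
plt.title("Zone 1 pressure history (illustrative data)")
plt.show()
```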

Rethinking the learning of belief network probabilities

Description: Belief networks are a powerful tool for knowledge discovery that provide concise, understandable probabilistic models of data. There are methods grounded in probability theory to incrementally update the relationships described by the belief network when new information is seen, to perform complex inferences over any set of variables in the data, to incorporate domain expertise and prior knowledge into the model, and to automatically learn the model from data. This paper concentrates on part of the belief network induction problem, that of learning the quantitative structure (the conditional probabilities), given the qualitative structure. In particular, the current practice of rote learning the probabilities in belief networks can be significantly improved upon (a sketch of that rote, counting-based baseline appears after this record). We advance the idea of applying any learning algorithm to the task of conditional probability learning in belief networks, discuss potential benefits, and show results of applying neural networks and other algorithms to a medium-sized car insurance belief network. The results demonstrate improvements of 10 to 100% in model error rates over the current approaches.
Date: March 1, 1996
Creator: Musick, R.
Partner: UNT Libraries Government Documents Department
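
For context on what "rote learning the probabilities" means in the abstract above, the sketch below shows the baseline the paper argues can be improved upon: maximum-likelihood counting of a conditional probability table given a fixed qualitative structure. The car-insurance-flavored variables and the tiny data set are invented for illustration and are not taken from the paper.

```python
# Sketch: "rote" (maximum-likelihood) learning of a conditional probability
# table P(child | parents) from data, given a fixed qualitative structure.
# Variable names and the tiny data set are hypothetical.
from collections import Counter

# Records of (age_group, vehicle_type, claim) -- claim is the child variable,
# age_group and vehicle_type are its parents in the assumed network structure.
data = [
    ("young", "sport", 1), ("young", "sport", 1), ("young", "sedan", 0),
    ("adult", "sedan", 0), ("adult", "sport", 1), ("adult", "sedan", 0),
    ("young", "sedan", 1), ("adult", "sport", 0),
]

parent_counts = Counter((a, v) for a, v, _ in data)
joint_counts = Counter(((a, v), c) for a, v, c in data)

# CPT entry: P(claim = c | age_group = a, vehicle_type = v)
cpt = {}
for (a, v), n in parent_counts.items():
    for c in (0, 1):
        cpt[(a, v, c)] = joint_counts[((a, v), c)] / n

for (a, v, c), p in sorted(cpt.items()):
    print(f"P(claim={c} | age={a}, vehicle={v}) = {p:.2f}")
```

The paper's point is that this per-entry counting can be replaced by any learning algorithm, for example a neural network that predicts the child variable from its parents.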

Scalable pattern recognition for large-scale scientific data mining

Description: Our ability to generate data far outstrips our ability to explore and understand it. The true value of this data lies not in its final size or complexity, but rather in our ability to exploit the data to achieve scientific goals. The data generated by programs such as ASCI have such a large scale that it is impractical to manually analyze, explore, and understand it. As a result, useful information is overlooked, and the potential benefits of increased computational and data gathering capabilities are only partially realized. The difficulties that will be faced by ASCI applications in the near future are foreshadowed by the challenges currently facing astrophysicists in making full use of the data they have collected over the years. For example, among other difficulties, astrophysicists have expressed concern that the sheer size of their data restricts them to looking at very small, narrow portions at any one time. This narrow focus has resulted in the loss of "serendipitous" discoveries which have been so vital to progress in the area in the past. To solve this problem, a new generation of computational tools and techniques is needed to help automate the exploration and management of large scientific data. This whitepaper proposes applying and extending ideas from the area of data mining, in particular pattern recognition, to improve the way in which scientists interact with large, multi-dimensional, time-varying data.
Date: March 23, 1998
Creator: Kamath, C. & Musick, R.
Partner: UNT Libraries Government Documents Department

U1h shaft project

Description: The U1h shaft project is a design/build subcontract to construct one 20 foot (ft) finished diameter shaft to a depth of 1,045 ft at the Nevada Test Site. Atkinson Construction was subcontracted by Bechtel Nevada to construct the U1h Shaft for the Department of Energy. The project consists of furnishing and installing the sinking plant, construction of the 1,045 ft of concrete lined shaft, development of a shaft station at a depth of 976 ft, and construction of a loading pocket at the station. The outfitting of the shaft and installation of a new hoist may be incorporated into the project at a later date. This paper should be of interest to those involved with the construction of relatively deep shafts and underground excavations.
Date: June 30, 2000
Creator: Briggs, Brian & Musick, R. G.
Partner: UNT Libraries Government Documents Department

Shaft Sinking at the Nevada Test Site, U1h Shaft Project

Description: The U1h Shaft Project is a design/build subcontract to construct one 6.1 meter (m) (20 feet (ft)) finished diameter shaft to a depth of 321.6 m (1,055 ft.) at the Nevada Test Site. Atkinson Construction was subcontracted by Bechtel Nevada to construct the U1h Shaft for the U.S. Department of Energy. The project consists of furnishing and installing the sinking plant, construction of the 321.6 m (1,055 ft.) of concrete lined shaft, development of a shaft station at a depth of 297.5 m (976 ft.), and construction of a loading pocket at the station. The outfitting of the shaft and installation of a new hoist may be incorporated into the project at a later date. This paper will describe the design phase, the excavation and lining operation, shaft station construction and the contractual challenges encountered on this project.
Date: March 1, 2001
Creator: Briggs, B. & Musick, R.
Partner: UNT Libraries Government Documents Department

Large-scale data mining pilot project in human genome

Description: This whitepaper briefly describes a new, aggressive effort in large-scale data mining at Lawrence Livermore National Labs. The implications of "large-scale" will be clarified in a later section. In the short term, this effort will focus on several mission-critical questions of the Genome project. We will adapt current data mining techniques to the Genome domain, quantify the accuracy of inference results, and lay the groundwork for a more extensive effort in large-scale data mining. A major aspect of the approach is that we will build on a fully-staffed data warehousing effort in the human Genome area. The long-term goal is a strong applications-oriented research program in large-scale data mining. The tools and skill set gained will be directly applicable to a wide spectrum of tasks involving large spatial and multidimensional data. This includes applications in ensuring non-proliferation, stockpile stewardship, enabling Global Ecology (Materials Database, Industrial Ecology), advancing the Biosciences (Human Genome Project), and supporting data for others (Battlefield Management, Health Care).
Date: May 1, 1997
Creator: Musick, R.; Fidelis, R. & Slezak, T.
Partner: UNT Libraries Government Documents Department

DataFoundry: Warehousing techniques for dynamic environments

Description: Data warehouses and data marts have been successfully applied to a multitude of commercial business applications as tools for integrating and providing access to data located across an enterprise. Although the need for this capability is as vital in the scientific world as in the business domain, working warehouses in our community are scarce. A primary technical reason for this is that our understanding of the concepts being explored in an evolving scientific domain changes constantly, leading to rapid changes in the data representation. When any database providing information to a warehouse changes its format, the warehouse must be updated to reflect these changes, or it will not function properly. The cost of maintaining a warehouse using traditional techniques in this environment is prohibitive. This paper describes ideas for dramatically reducing the amount of work that must be done to keep a warehouse up to date in a dynamic, scientific environment. The ideas are being applied in a prototype warehouse called DataFoundry. DataFoundry, currently in use by structural biologists at LLNL, will eventually support scientists at the Department of Energy's Joint Genome Institute.
Date: January 29, 1998
Creator: Critchlow, T.; Fidelis, K.; Ganesh, M.; Musick, R. & Slezak, T., LLNL
Partner: UNT Libraries Government Documents Department

Detecting data and schema changes in scientific documents

Description: Data stored in a data warehouse must be kept consistent and up-to-date with the underlying information sources. By providing the capability to identify, categorize and detect changes in these sources, only the modified data needs to be transferred and entered into the warehouse. The alternative, periodically reloading from scratch, is obviously inefficient. When the schema of an information source changes, all components that interact with, or make use of, data originating from that source must be updated to conform to the new schema. In this paper, the authors present an approach to detecting data and schema changes in scientific documents. Scientific data is of particular interest because it is normally stored as semi-structured documents, and it incurs frequent schema updates. They address the change detection problem by detecting data and schema changes between two versions of the same semi-structured document. This paper presents a graph representation of semi-structured documents and their schema before describing their approach to detecting changes while parsing the document (a simplified sketch of the data-versus-schema distinction appears after this record). It also discusses how analysis of a collection of schema changes, obtained from comparing several individual documents, can be used to detect complex schema changes.
Date: June 8, 1999
Creator: Adiwijaya, I.; Critchlow, T. & Musick, R.
Partner: UNT Libraries Government Documents Department
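
The abstract above distinguishes data changes from schema changes in semi-structured documents. The sketch below illustrates that distinction on two versions of a toy document modeled as nested dictionaries; it is not the authors' graph-and-parse algorithm, and the field names are invented.

```python
# Sketch: distinguishing data changes from schema changes between two versions
# of a semi-structured document, here modelled as nested dictionaries.
# This is an illustration of the idea, not the authors' parsing-based algorithm.

def diff(old, new, path=""):
    """Yield (kind, path, detail) tuples describing differences."""
    old_keys, new_keys = set(old), set(new)
    for key in old_keys - new_keys:
        yield ("schema: element removed", f"{path}/{key}", old[key])
    for key in new_keys - old_keys:
        yield ("schema: element added", f"{path}/{key}", new[key])
    for key in old_keys & new_keys:
        o, n = old[key], new[key]
        if isinstance(o, dict) and isinstance(n, dict):
            yield from diff(o, n, f"{path}/{key}")
        elif o != n:
            yield ("data: value changed", f"{path}/{key}", (o, n))

old_doc = {"entry": {"id": "P1", "sequence": "MKT...", "length": 312}}
new_doc = {"entry": {"id": "P1", "sequence": "MKV...", "organism": "human"}}

for kind, path, detail in diff(old_doc, new_doc):
    print(kind, path, detail)
```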

Supporting large-scale computational science

Description: A study has been carried out to determine the feasibility of using commercial database management systems (DBMSs) to support large-scale computational science. Conventional wisdom in the past has been that DBMSs are too slow for such data. Several events over the past few years have muddied the clarity of this mindset: (1) several commercial DBMS systems have demonstrated storage and ad-hoc query access to terabyte data sets; (2) several large-scale science teams, such as EOSDIS [NAS91], high energy physics [MM97] and human genome [Kin93], have adopted (or make frequent use of) commercial DBMS systems as the central part of their data management scheme; (3) several major DBMS vendors have introduced their first object-relational products (ORDBMSs), which have the potential to support large, array-oriented data; and (4) in some cases, performance is a moot issue, in particular if the performance of legacy applications is not reduced while new, albeit slow, capabilities are added to the system. The basic assessment is still that DBMSs do not scale to large computational data. However, many of the reasons have changed, and there is an expiration date attached to that prognosis. This document expands on this conclusion, identifies the advantages and disadvantages of various commercial approaches, and describes the studies carried out in exploring this area. The document is meant to be brief, technical and informative, rather than a motivational pitch. The conclusions within are very likely to become outdated within the next 5-7 years, as market forces will have a significant impact on the state of the art in scientific data management over the next decade.
Date: October 1, 1998
Creator: Musick, R.
Partner: UNT Libraries Government Documents Department

An LLNL perspective on ASCI data mining and pattern recognition requirements

Description: This working document has been put together by the members of the Sapphire project at LLNL. The goal of Sapphire is to apply and extend techniques from data mining and pattern recognition in order to automatically detect the areas of interest in very large data sets. The intent is to help scientists address the problem of data overload by providing them with effective and efficient ways of exploring and analyzing massive data sets. One of the key areas where they expect this technology to be used is in the analysis of the output from ASCI simulations. It is expected that a simulation running on the 100 Tflop ASCI machine in the year 2004 will produce data at the rate of 12 TB/hour (the scale this implies is worked out briefly after this record). Given the difficulties they currently have in analyzing and visualizing a terabyte of data, it is imperative that they start planning now for ways that will make the analysis of petabyte data sets feasible. This document focuses on the relevance of data mining and pattern recognition to ASCI, discusses potential applications of these techniques in ASCI, and identifies research issues that arise as they apply the algorithms in these areas to massive data sets.
Date: January 1, 1999
Creator: Baldwin, C.; Kamath, C. & Musick, R.
Partner: UNT Libraries Government Documents Department
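
As a rough sense of the scale implied by the 12 TB/hour figure quoted above, the short calculation below (assuming decimal units, 1 PB = 1000 TB) shows how quickly such a simulation reaches petabyte territory.

```python
# Back-of-the-envelope check of the data rate quoted above (12 TB/hour from
# a 100 Tflop ASCI simulation). Purely arithmetic; no measured data involved.
TB_PER_HOUR = 12

per_day = TB_PER_HOUR * 24               # 288 TB/day
per_week = per_day * 7                   # 2,016 TB/week, i.e. roughly 2 PB
hours_to_petabyte = 1000 / TB_PER_HOUR   # ~83 hours, about 3.5 days

print(f"{per_day} TB/day, {per_week} TB/week")
print(f"~{hours_to_petabyte:.0f} hours (~{hours_to_petabyte / 24:.1f} days) to reach 1 PB")
```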

Establishment of a facility for intrusive characterization of transuranic waste at the Nevada Test Site

Description: This paper describes design and construction, project management, and testing results associated with the Waste Examination Facility (WEF) recently constructed at the Nevada Test Site (NTS). The WEF and associated systems were designed, procured, and constructed on an extremely tight budget and within a fast-track schedule. Part 1 of this paper focuses on design and construction activities, Part 2 discusses project management of WEF design and construction activities, and Part 3 describes the results of the transuranic (TRU) waste examination pilot project conducted at the WEF. In Part 1, the waste examination process is described within the context of Waste Isolation Pilot Plant (WIPP) characterization requirements. Design criteria are described from operational and radiological protection considerations. The WEF engineered systems are described. These systems include isolation barriers using a glove box and secondary containment structure, high efficiency particulate air (HEPA) filtration and ventilation systems, differential pressure monitoring systems, and fire protection systems. In Part 2, the project management techniques used for ensuring that stringent cost/schedule requirements were met are described. The critical attributes of these management systems are described with an emphasis on teamwork. In Part 3, the results of a pilot project directed at performing intrusive characterization (i.e., examination) of TRU waste at the WEF are described. Project activities included cold and hot operations. Cold operations included operator training, facility systems walkdown, and operational procedures validation. Hot operations included working with plutonium-contaminated TRU waste and consisted of waste container breaching, waste examination, waste segregation, data collection, and waste repackaging.
Date: January 1, 1998
Creator: Foster, B.D.; Musick, R.G.; Pedalino, J.P.; Cowley, J.L.; Karney, C.C. & Kremer, J.L.
Partner: UNT Libraries Government Documents Department

Metadata for balanced performance

Description: Data- and information-intensive industries require advanced data management capabilities incorporated with large-capacity storage. Performance in this environment is, in part, a function of individual storage and data management system performance, but most importantly a function of the level of their integration. This paper focuses on integration, in particular on the issue of how to use shared metadata to facilitate high-performance interfaces between Mass Storage Systems (MSS) and advanced data management clients. Current MSS interfaces are based on traditional file system interaction. Increasing functionality at the interface can enhance performance by permitting clients to influence data placement, generate accurate cost estimates of I/O, and describe impending I/O activity. Flexible mechanisms are needed for providing this functionality without compromising the generality of the interface; the authors are proposing active metadata sharing (a schematic sketch of such an interface appears after this record). They present an architecture that details how the shared metadata fits into the overall system architecture and control structure, along with a first cut at what the metadata model should look like.
Date: April 1996
Creator: Brown, P.; Troy, R.; Fisher, D.; Louis, S.; McGraw, J. R. & Musick, R.
Partner: UNT Libraries Government Documents Department
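
To make the abstract's notion of active metadata sharing concrete, here is a schematic sketch of the kind of interface it describes: clients hint at data placement, ask for I/O cost estimates, and declare impending I/O. Every class, method, and rate constant below is hypothetical and is not drawn from any actual MSS API or from the paper itself.

```python
# Sketch of a metadata-sharing interface between an MSS and its clients:
# clients can influence data placement, request I/O cost estimates, and
# declare impending I/O so the storage system can pre-stage data.
# All names and numbers here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ObjectMetadata:
    object_id: str
    size_bytes: int
    tier: str  # e.g. "disk" or "tape"

class SharedMetadataCatalog:
    TRANSFER_RATE = {"disk": 50e6, "tape": 5e6}  # bytes/sec, illustrative only

    def __init__(self):
        self._objects = {}

    def register(self, meta: ObjectMetadata, placement_hint: str = "disk"):
        # A client-supplied placement hint influences where the MSS puts the data.
        meta.tier = placement_hint
        self._objects[meta.object_id] = meta

    def estimate_read_seconds(self, object_id: str) -> float:
        # Cost estimate derived from shared metadata rather than a trial read.
        meta = self._objects[object_id]
        return meta.size_bytes / self.TRANSFER_RATE[meta.tier]

    def declare_intent(self, object_ids):
        # Advance notice of impending I/O lets the MSS pre-stage tape-resident data.
        for oid in object_ids:
            if self._objects[oid].tier == "tape":
                print(f"pre-staging {oid} from tape")

catalog = SharedMetadataCatalog()
catalog.register(ObjectMetadata("run42/pressure", 2_000_000_000, ""), placement_hint="tape")
print(f"{catalog.estimate_read_seconds('run42/pressure'):.0f} s estimated read time")
catalog.declare_intent(["run42/pressure"])
```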

Data Foundry: Data Warehousing and Integration for Scientific Data Management

Description: Data warehousing is an approach for managing data from multiple sources by representing them with a single, coherent point of view. Commercial data warehousing products have been produced by companies such as Red Brick, IBM, Brio, Andyne, Ardent, NCR, Information Advantage, Informatica, and others. Other companies have chosen to develop their own in-house data warehousing solution using relational databases, such as those sold by Oracle, IBM, Informix and Sybase. The typical approaches include federated systems and mediated data warehouses, each of which, to some extent, makes use of a series of source-specific wrapper and mediator layers to integrate the data into a consistent format which is then presented to users as a single virtual data store (a schematic sketch of this wrapper/mediator pattern appears after this record). These approaches are successful when applied to traditional business data because the data format used by the individual data sources tends to be rather static. Therefore, once a data source has been integrated into a data warehouse, there is relatively little work required to maintain that connection. However, that is not the case for all data sources. Data sources from scientific domains tend to regularly change their data model, format and interface. This is problematic because each change requires the warehouse administrator to update the wrapper, mediator, and warehouse interfaces to properly read, interpret, and represent the modified data source. Furthermore, the data that scientists require to carry out research is continuously changing as their understanding of a research question develops, or as their research objectives evolve. The difficulty and cost of these updates effectively limits the number of sources that can be integrated into a single data warehouse, or makes an approach based on warehousing too expensive to consider.
Date: February 29, 2000
Creator: Musick, R.; Critchlow, T.; Ganesh, M.; Fidelis, Z.; Zemla, A. & Slezak, T.
Partner: UNT Libraries Government Documents Department
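
The wrapper/mediator pattern mentioned in the abstract above can be sketched in a few lines: one wrapper per source maps source-specific records into a shared representation, and a mediator presents the union as a single virtual store. The source formats, field names, and sample records below are invented for illustration.

```python
# Sketch of the wrapper/mediator pattern: each wrapper maps a source-specific
# record format into one shared representation, and the mediator presents the
# union of all sources as a single virtual data store.
# Source formats and field names are illustrative assumptions.

def wrap_source_a(record):
    # Source A stores (accession, sequence) tuples.
    accession, seq = record
    return {"id": accession, "sequence": seq, "origin": "source_a"}

def wrap_source_b(record):
    # Source B stores dicts with different field names.
    return {"id": record["acc_no"], "sequence": record["residues"], "origin": "source_b"}

class Mediator:
    def __init__(self, wrapped_sources):
        self._sources = wrapped_sources  # list of (wrapper_fn, records)

    def query(self, predicate):
        # Single point of access over all integrated sources.
        for wrapper, records in self._sources:
            for rec in records:
                unified = wrapper(rec)
                if predicate(unified):
                    yield unified

mediator = Mediator([
    (wrap_source_a, [("P00750", "MKTAY..."), ("Q9Y6K9", "MEPAA...")]),
    (wrap_source_b, [{"acc_no": "P68871", "residues": "MVHLT..."}]),
])

for row in mediator.query(lambda r: r["sequence"].startswith("M")):
    print(row["origin"], row["id"])
```

The maintenance problem the abstract describes shows up here directly: whenever a source changes its format, its wrapper (and anything depending on the unified representation) must be rewritten by hand.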