17 Matching Results

Supporting large-scale computational science

Description: Business needs have driven the development of commercial database systems since their inception. As a result, there has been a strong focus on supporting many users, minimizing the potential corruption or loss of data, and maximizing performance metrics such as transactions per second and TPC-C and TPC-D benchmark results. It turns out that these optimizations have little to do with the needs of the scientific community, and in particular have little impact on improving the management and use of large-scale high-dimensional data. At the same time, there is an unanswered need in the scientific community for many of the benefits offered by a robust DBMS. For example, tying an ad-hoc query language such as SQL together with a visualization toolkit would be a powerful enhancement to current capabilities. Unfortunately, there has been little emphasis or discussion in the VLDB community on this mismatch over the last decade. The goal of the paper is to identify the specific issues that need to be resolved before large-scale scientific applications can make use of DBMS products. This topic is addressed in the context of an evaluation of commercial DBMS technology applied to the exploration of data generated by the Department of Energy's Accelerated Strategic Computing Initiative (ASCI). The paper describes the data being generated for ASCI as well as current capabilities for interacting with and exploring this data. The attraction of applying standard DBMS technology to this domain is discussed, as well as the technical and business issues that currently make this an infeasible solution.
Date: February 19, 1998
Creator: Musick, R.
Partner: UNT Libraries Government Documents Department

Rethinking the learning of belief network probabilities

Description: Belief networks are a powerful tool for knowledge discovery that provide concise, understandable probabilistic models of data. There are methods grounded in probability theory to incrementally update the relationships described by the belief network when new information is seen, to perform complex inferences over any set of variables in the data, to incorporate domain expertise and prior knowledge into the model, and to automatically learn the model from data. This paper concentrates on part of the belief network induction problem: learning the quantitative structure (the conditional probabilities), given the qualitative structure. In particular, the current practice of rote learning the probabilities in belief networks can be significantly improved upon. We advance the idea of applying any learning algorithm to the task of conditional probability learning in belief networks, discuss potential benefits, and show results of applying neural networks and other algorithms to a medium-sized car insurance belief network. The results demonstrate improvements of 10 to 100% in model error rates over current approaches. (An illustrative sketch of the rote-counting baseline follows this record.)
Date: March 1, 1996
Creator: Musick, R.
Partner: UNT Libraries Government Documents Department
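
The sketch below (Python) illustrates the rote baseline the abstract refers to: estimating a conditional probability table for a fixed qualitative structure by counting parent-child co-occurrences, with optional Laplace smoothing. The variable names, toy data, and smoothing constant are illustrative assumptions, not details taken from the paper, which instead proposes replacing this baseline with general-purpose learning algorithms such as neural networks.

# Illustrative sketch: maximum-likelihood ("rote") estimation of a
# conditional probability table P(child | parents) from tabular data,
# with optional Laplace smoothing. Variable names, toy data, and the
# smoothing constant are assumptions for illustration only.
from collections import Counter, defaultdict

def estimate_cpt(records, child, parents, child_values, alpha=1.0):
    """records: list of dicts mapping variable name -> discrete value."""
    counts = defaultdict(Counter)
    for r in records:
        key = tuple(r[p] for p in parents)
        counts[key][r[child]] += 1
    cpt = {}
    for key, c in counts.items():
        total = sum(c.values()) + alpha * len(child_values)
        cpt[key] = {v: (c[v] + alpha) / total for v in child_values}
    return cpt

# Toy usage with hypothetical variables from a car-insurance-style domain.
data = [
    {"age": "young", "risk": "high"},
    {"age": "young", "risk": "high"},
    {"age": "old", "risk": "low"},
    {"age": "old", "risk": "high"},
]
print(estimate_cpt(data, child="risk", parents=["age"],
                   child_values=["low", "high"]))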

Scalable pattern recognition for large-scale scientific data mining

Description: Our ability to generate data far outstrips our ability to explore and understand it. The true value of this data lies not in its final size or complexity, but rather in our ability to exploit the data to achieve scientific goals. The data generated by programs such as ASCI have such a large scale that it is impractical to manually analyze, explore, and understand it. As a result, useful information is overlooked, and the potential benefits of increased computational and data gathering capabilities are only partially realized. The difficulties that will be faced by ASCI applications in the near future are foreshadowed by the challenges currently facing astrophysicists in making full use of the data they have collected over the years. For example, among other difficulties, astrophysicists have expressed concern that the sheer size of their data restricts them to looking at very small, narrow portions at any one time. This narrow focus has resulted in the loss of "serendipitous" discoveries which have been so vital to progress in the area in the past. To solve this problem, a new generation of computational tools and techniques is needed to help automate the exploration and management of large scientific data. This whitepaper proposes applying and extending ideas from the area of data mining, in particular pattern recognition, to improve the way in which scientists interact with large, multi-dimensional, time-varying data.
Date: March 23, 1998
Creator: Kamath, C. & Musick, R.
Partner: UNT Libraries Government Documents Department

U1h shaft project

Description: The U1h shaft project is a design/build subcontract to construct one 20 foot (ft) finished diameter shaft to a depth of 1,045 ft at the Nevada Test Site. Atkinson Construction was subcontracted by Bechtel Nevada to construct the U1h Shaft for the Department of Energy. The project consists of furnishing and installing the sinking plant, construction of the 1,045 ft of concrete lined shaft, development of a shaft station at a depth of 976 ft, and construction of a loading pocket at the station. The outfitting of the shaft and installation of a new hoist may be incorporated into the project at a later date. This paper should be of interest to those involved with the construction of relatively deep shafts and underground excavations.
Date: June 30, 2000
Creator: Briggs, Brian & Musick, R. G.
Partner: UNT Libraries Government Documents Department

Shaft Sinking at the Nevada Test Site, U1h Shaft Project

Description: The U1h Shaft Project is a design/build subcontract to construct one 6.1 meter (m) (20 feet (ft)) finished diameter shaft to a depth of 321.6 m (1,055 ft.) at the Nevada Test Site. Atkinson Construction was subcontracted by Bechtel Nevada to construct the U1h Shaft for the U.S. Department of Energy. The project consists of furnishing and installing the sinking plant, construction of the 321.6 m (1,055 ft.) of concrete lined shaft, development of a shaft station at a depth of 297.5 m (976 ft.), and construction of a loading pocket at the station. The outfitting of the shaft and installation of a new hoist may be incorporated into the project at a later date. This paper will describe the design phase, the excavation and lining operation, shaft station construction and the contractual challenges encountered on this project.
Date: March 1, 2001
Creator: Briggs, B. & Musick, R.
Partner: UNT Libraries Government Documents Department

Large-scale data mining pilot project in human genome

Description: This whitepaper briefly describes a new, aggressive effort in large-scale data mining at Lawrence Livermore National Laboratory. The implications of "large-scale" are clarified in a later section. In the short term, this effort will focus on several mission-critical questions of the Human Genome project. We will adapt current data mining techniques to the Genome domain, quantify the accuracy of inference results, and lay the groundwork for a more extensive effort in large-scale data mining. A major aspect of the approach is that it will be coupled with a fully-staffed data warehousing effort in the human Genome area. The long-term goal is a strong applications-oriented research program in large-scale data mining. The tools and skill set gained will be directly applicable to a wide spectrum of tasks involving large spatial and multidimensional data. This includes applications in ensuring non-proliferation, stockpile stewardship, enabling Global Ecology (Materials Database, Industrial Ecology), advancing the Biosciences (Human Genome Project), and supporting data for others (Battlefield Management, Health Care).
Date: May 1, 1997
Creator: Musick, R.; Fidelis, R. & Slezak, T.
Partner: UNT Libraries Government Documents Department

DataFoundry: Warehousing techniques for dynamic environments

Description: Data warehouses and data marts have been successfully applied to a multitude of commercial business applications as tools for integrating and providing access to data located across an enterprise. Although the need for this capability is as vital in the scientific world as in the business domain, working warehouses in our community are scarce. A primary technical reason for this is that our understanding of the concepts being explored in an evolving scientific domain change constantly, leading to rapid changes in the data representation. When any database providing information to a warehouse changes its format, the warehouse must be updated to reflect these changes, or it will not function properly. The cost of maintaining a warehouse using traditional techniques in this environment is prohibitive. This paper describes ideas for dramatically reducing the amount of work that must be done to keep a warehouse up to date in a dynamic, scientific environment. The ideas are being applied in a prototype warehouse called DataFoundry. DataFoundry, currently in use by structural biologists at LLNL, will eventually support scientists at the Department of Energy's Joint Genome Institute.
Date: January 29, 1998
Creator: Critchlow, T.; Fidelis, K.; Ganesh, M.; Musick, R. & Slezak, T., LLNL
Partner: UNT Libraries Government Documents Department

Detecting data and schema changes in scientific documents

Description: Data stored in a data warehouse must be kept consistent and up-to-date with the underlying information sources. By providing the capability to identify, categorize and detect changes in these sources, only the modified data needs to be transferred and entered into the warehouse; the alternative, periodically reloading from scratch, is obviously inefficient. When the schema of an information source changes, all components that interact with, or make use of, data originating from that source must be updated to conform to the new schema. In this paper, the authors present an approach to detecting data and schema changes in scientific documents. Scientific data is of particular interest because it is normally stored as semi-structured documents, and it incurs frequent schema updates. They address the change detection problem by detecting data and schema changes between two versions of the same semi-structured document. This paper presents a graph representation of semi-structured documents and their schema before describing their approach to detecting changes while parsing the document. It also discusses how analysis of a collection of schema changes obtained from comparing several individual documents can be used to detect complex schema changes. (An illustrative sketch of the data-versus-schema distinction follows this record.)
Date: June 8, 1999
Creator: Adiwijaya, I; Critchlow, T & Musick, R
Partner: UNT Libraries Government Documents Department
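
As a rough illustration of the data-versus-schema distinction in the abstract, the Python sketch below compares two versions of a JSON-like nested document and labels differences as schema changes (fields added or removed) or data changes (values modified). The representation and classification rules are simplifying assumptions; the paper itself uses a graph representation and detects changes while parsing.

# Illustrative sketch: classify differences between two versions of a
# semi-structured (JSON-like) document as schema changes (fields added
# or removed) vs. data changes (values modified). A simplified stand-in
# for the paper's graph-based approach, not a reimplementation of it.
def diff_documents(old, new, path=""):
    changes = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in old.keys() - new.keys():
            changes.append(("schema: field removed", f"{path}/{key}"))
        for key in new.keys() - old.keys():
            changes.append(("schema: field added", f"{path}/{key}"))
        for key in old.keys() & new.keys():
            changes.extend(diff_documents(old[key], new[key], f"{path}/{key}"))
    elif old != new:
        changes.append(("data: value changed", path))
    return changes

# Toy usage on two hypothetical versions of a genome-style record.
v1 = {"gene": {"id": "G1", "length": 1200}, "organism": "human"}
v2 = {"gene": {"id": "G1", "length": 1250, "strand": "+"}, "organism": "human"}
for kind, where in diff_documents(v1, v2):
    print(kind, where)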

Supporting large-scale computational science

Description: A study has been carried out to determine the feasibility of using commercial database management systems (DBMSs) to support large-scale computational science. Conventional wisdom in the past has been that DBMSs are too slow for such data. Several events over the past few years have muddied the clarity of this mindset: (1) several commercial DBMS systems have demonstrated storage and ad-hoc query access to Terabyte data sets; (2) several large-scale science teams, such as EOSDIS [NAS91], high energy physics [MM97] and human genome [Kin93], have adopted (or make frequent use of) commercial DBMS systems as the central part of their data management scheme; (3) several major DBMS vendors have introduced their first object-relational products (ORDBMSs), which have the potential to support large, array-oriented data; and (4) in some cases performance is a moot issue, in particular when the performance of legacy applications is not reduced while new, albeit slow, capabilities are added to the system. The basic assessment is still that DBMSs do not scale to large computational data. However, many of the reasons have changed, and there is an expiration date attached to that prognosis. This document expands on this conclusion, identifies the advantages and disadvantages of various commercial approaches, and describes the studies carried out in exploring this area. The document is meant to be brief, technical and informative, rather than a motivational pitch. The conclusions within are very likely to become outdated within the next 5-7 years, as market forces will have a significant impact on the state of the art in scientific data management over the next decade.
Date: October 1, 1998
Creator: Musick, R
Partner: UNT Libraries Government Documents Department

An LLNL perspective on ASCI data mining and pattern recognition requirements

Description: This working document has been put together by members of the Sapphire project at LLNL. The goal of Sapphire is to apply and extend techniques from data mining and pattern recognition in order to automatically detect areas of interest in very large data sets. The intent is to help scientists address the problem of data overload by providing them with effective and efficient ways of exploring and analyzing massive data sets. One of the key areas where they expect this technology to be used is in the analysis of the output from ASCI simulations. It is expected that a simulation running on the 100 Tflop ASCI machine in the year 2004 will produce data at the rate of 12 TB/hour. Given the difficulties they currently have in analyzing and visualizing a terabyte of data, it is imperative that they start planning now for ways to make the analysis of petabyte data sets feasible. This document focuses on the relevance of data mining and pattern recognition to ASCI, discusses potential applications of these techniques in ASCI, and identifies research issues that arise as they apply the algorithms in these areas to massive data sets.
Date: January 1, 1999
Creator: Baldwin, C; Kamath, C & Musick, R
Partner: UNT Libraries Government Documents Department

Ad Hoc Query Support For Very Large Simulation Mesh Data: The Metadata Approach

Description: We present our approach to enabling approximate ad hoc queries on terabyte-scale mesh data generated from large scientific simulations through the extension and integration of database, statistical, and data mining techniques. There are several significant barriers to overcome in achieving this objective. First, large-scale simulation data is already at the multi-terabyte scale and growing quickly, thus rendering traditional forms of interactive data exploration and query processing untenable. Second, a priori knowledge of user queries is not available, making it impossible to tune special-purpose solutions. Third, the data has spatial and temporal aspects, as well as arbitrarily high dimensionality, which exacerbates the task of finding compact, accurate, and easy-to-compute data models. Our approach is to preprocess the mesh data to generate highly compressed, lossy models that are used in lieu of the original data to answer users' queries. This approach leads to interesting challenges of its own. First, the model (equivalently, the content-oriented metadata) being generated must be smaller than the original data by at least an order of magnitude. Second, the metadata representation must contain enough information to support a broad class of queries. Finally, the accuracy and speed of the queries must be within the tolerances required by users. In this paper we give an overview of ongoing development efforts with an emphasis on extracting metadata and using it in query processing. (An illustrative sketch of answering an aggregate query from per-partition metadata follows this record.)
Date: December 17, 2001
Creator: Lee, B; Snapp, R; Musick, R & Critchlow, T
Partner: UNT Libraries Government Documents Department
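
The preprocessing idea described above can be pictured as summarizing each partition of the mesh into a small statistical record and answering aggregate queries from those records alone. The Python sketch below assumes a simple 1-D binning with count/mean/bounding-box summaries; the actual content-oriented metadata and query classes in the paper are considerably richer.

# Illustrative sketch: build compact per-partition summaries (count, mean,
# bounding box) over mesh points and answer an approximate aggregate query
# from the metadata alone. The binning scheme and statistics are assumptions
# chosen for brevity; the paper's content-oriented metadata is more general.
from dataclasses import dataclass

@dataclass
class PartitionSummary:
    count: int
    mean_value: float
    x_min: float
    x_max: float

def build_metadata(points, bin_width=1.0):
    """points: list of (x, value) pairs on a 1-D mesh."""
    bins = {}
    for x, v in points:
        bins.setdefault(int(x // bin_width), []).append((x, v))
    summaries = []
    for pts in bins.values():
        xs = [x for x, _ in pts]
        vs = [v for _, v in pts]
        summaries.append(PartitionSummary(len(pts), sum(vs) / len(vs),
                                          min(xs), max(xs)))
    return summaries

def approx_average(summaries, x_lo, x_hi):
    """Approximate the mean value over [x_lo, x_hi] using only metadata."""
    total, n = 0.0, 0
    for s in summaries:
        if s.x_max >= x_lo and s.x_min <= x_hi:   # partition overlaps query
            total += s.mean_value * s.count
            n += s.count
    return total / n if n else None

meta = build_metadata([(0.2, 1.0), (0.8, 3.0), (1.5, 5.0), (2.7, 7.0)])
print(approx_average(meta, 0.0, 2.0))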

Use of Numerical Models as Data Proxies for Approximate Ad-Hoc Query Processing

Description: As datasets grow beyond the gigabyte scale, there is an increasing demand for techniques for dealing with and interacting with them. To this end, the DataFoundry team at the Lawrence Livermore National Laboratory has developed a software prototype called Approximate Adhoc Query Engine for Simulation Data (AQSim). The goal of AQSim is to provide a framework that allows scientists to interactively perform ad-hoc queries over terabyte-scale datasets using numerical models as proxies for the original data. The advantages of this system are several. First, by storing only the model parameters, each dataset occupies a smaller footprint compared to the original, increasing the shelf-life of such datasets before they are sent to archival storage. Second, the models are geared towards approximate querying as they are built at different resolutions, allowing the user to trade off model accuracy against query response time; this gives the user greater opportunity for exploratory data analysis. Lastly, several different models are allowed, each focusing on a different characteristic of the data, thereby enhancing the interpretability of the data compared to the original. The focus of this paper is on the modeling aspects of the AQSim framework.
Date: May 19, 2003
Creator: Kamimura, R; Abdulla, G; Baldwin, C; Critchlow, T; Lee, B; Lozares, I et al.
Partner: UNT Libraries Government Documents Department

The Framework for Approximate Queries on Simulation Data

Description: AQSim is a system intended to enable scientists to query and analyze a large volume of scientific simulation data. The system uses the state of the art in approximate query processing techniques to build a novel framework for progressive data analysis. These techniques are used to define a multi-resolution index, where each node contains multiple models of the data. The benefits of these models are two-fold: (1) they are compact representations, reconstructing only the information relevant to the analysis, and (2) the variety of models captures different aspects of the data which may be of interest to the user but are not readily apparent in their raw form. To be able to deal with the data interactively, AQSim allows the scientist to make an informed tradeoff between query response accuracy and time. In this paper, we present the framework of AQSim with a focus on its architectural design. We also show results from an initial proof-of-concept prototype developed at LLNL. The presented framework is generic enough to handle more than just simulation data. (An illustrative sketch of progressive query evaluation over a multi-resolution index follows this record.)
Date: September 27, 2001
Creator: Abdulla, G; Baldwin, C; Critchlow, T; Kamimura, R; Lee, B; Musick, R et al.
Partner: UNT Libraries Government Documents Department
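
One way to picture the multi-resolution index is as a tree whose nodes each carry a model of the data they cover together with an error bound; a query descends only until a node's model is accurate enough, which is how the accuracy/time tradeoff is exposed to the user. The node contents, error measure, and aggregation rule in the Python sketch below are illustrative assumptions, not the AQSim design.

# Illustrative sketch of progressive, approximate query evaluation over a
# multi-resolution index. Each node stores a simple model (here, a mean and
# an error bound); a query stops descending once the node's error is within
# the requested tolerance. Node structure and error measure are assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class IndexNode:
    mean: float             # model of the data this node covers
    error: float            # bound on the model's approximation error
    children: List["IndexNode"] = field(default_factory=list)

def progressive_estimate(node, tolerance):
    # Use this node's model if it is accurate enough, or if it is a leaf.
    if node.error <= tolerance or not node.children:
        return node.mean
    # Otherwise refine using the children's (more accurate) estimates.
    # Equal weighting of children is an illustrative simplification.
    estimates = [progressive_estimate(c, tolerance) for c in node.children]
    return sum(estimates) / len(estimates)

root = IndexNode(mean=10.0, error=4.0, children=[
    IndexNode(mean=8.0, error=0.5),
    IndexNode(mean=12.5, error=0.5),
])
print(progressive_estimate(root, tolerance=5.0))   # coarse, fast answer
print(progressive_estimate(root, tolerance=1.0))   # finer, slower answer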

Establishment of a facility for intrusive characterization of transuranic waste at the Nevada Test Site

Description: This paper describes design and construction, project management, and testing results associated with the Waste Examination Facility (WEF) recently constructed at the Nevada Test Site (NTS). The WEF and associated systems were designed, procured, and constructed on an extremely tight budget and within a fast track schedule. Part 1 of this paper focuses on design and construction activities, Part 2 discusses project management of WEF design and construction activities, and Part 3 describes the results of the transuranic (TRU) waste examination pilot project conducted at the WEF. In Part 1, the waste examination process is described within the context of Waste Isolation Pilot Plant (WIPP) characterization requirements. Design criteria are described from operational and radiological protection considerations. The WEF engineered systems are described. These systems include isolation barriers using a glove box and secondary containment structure, high efficiency particulate air (HEPA) filtration and ventilation systems, differential pressure monitoring systems, and fire protection systems. In Part 2, the project management techniques used for ensuring that stringent cost/schedule requirements were met are described. The critical attributes of these management systems are described with an emphasis on teamwork. In Part 3, the results of a pilot project directed at performing intrusive characterization (i.e., examination) of TRU waste at the WEF are described. Project activities included cold and hot operations. Cold operations included operator training, facility systems walk down, and operational procedures validation. Hot operations included working with plutonium contaminated TRU waste and consisted of waste container breaching, waste examination, waste segregation, data collection, and waste repackaging.
Date: January 1, 1998
Creator: Foster, B.D.; Musick, R.G.; Pedalino, J.P.; Cowley, J.L.; Karney, C.C. & Kremer, J.L.
Partner: UNT Libraries Government Documents Department

Metadata for balanced performance

Description: Data and information intensive industries require advanced data management capabilities incorporated with large capacity storage. Performance in this environment is, in part, a function of individual storage and data management system performance, but most importantly a function of the level of their integration. This paper focuses on integration, in particular on the issue of how to use shared metadata to facilitate high performance interfaces between Mass Storage Systems (MSS) and advanced data management clients. Current MSS interfaces are based on traditional file system interaction. Increasing functionality at the interface can enhance performance by permitting clients to influence data placement, generate accurate cost estimates of I/O, and describe impending I/O activity. Flexible mechanisms are needed for providing this functionality without compromising the generality of the interface; the authors propose active metadata sharing. They present an architecture that details how the shared metadata fits into the overall system architecture and control structure, along with a first cut at what the metadata model should look like. (An illustrative sketch of such a metadata-sharing interface follows this record.)
Date: April 1996
Creator: Brown, P.; Troy, R.; Fisher, D.; Louis, S.; McGraw, J. R. & Musick, R.
Partner: UNT Libraries Government Documents Department
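
The richer interface argued for above, in which clients influence data placement, obtain I/O cost estimates, and announce impending activity, can be pictured as a small metadata-sharing API between the mass storage system and its clients. The method names, cost model, and placement tiers in the Python sketch below are hypothetical; they only illustrate the idea of active metadata sharing.

# Illustrative sketch of an "active metadata" interface between a mass
# storage system and a data management client. Method names, the cost
# model, and the placement tiers are hypothetical, illustrating only the
# idea that sharing metadata lets clients plan and influence I/O.
class MassStorageMetadata:
    def __init__(self):
        self.placement = {}        # object id -> "disk" | "tape"
        self.pending_reads = []    # announced future I/O

    def hint_placement(self, object_id, tier):
        """Client suggests where an object should live (e.g., keep on disk)."""
        self.placement[object_id] = tier

    def estimate_read_cost(self, object_id, nbytes):
        """Rough cost estimate (seconds) based on where the object resides."""
        on_disk = self.placement.get(object_id) == "disk"
        bandwidth = 50e6 if on_disk else 5e6
        latency = 0.01 if on_disk else 30.0
        return latency + nbytes / bandwidth

    def announce_io(self, object_id, nbytes):
        """Client describes impending I/O so the MSS can pre-stage data."""
        self.pending_reads.append((object_id, nbytes))

mss = MassStorageMetadata()
mss.hint_placement("run42/temperature", "disk")
print(mss.estimate_read_cost("run42/temperature", 2e9))
mss.announce_io("run42/pressure", 5e9)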

Data Foundry: Data Warehousing and Integration for Scientific Data Management

Description: Data warehousing is an approach for managing data from multiple sources by representing them with a single, coherent point of view. Commercial data warehousing products have been produced by companies such as Red Brick, IBM, Brio, Andyne, Ardent, NCR, Information Advantage, Informatica, and others. Other companies have chosen to develop their own in-house data warehousing solution using relational databases, such as those sold by Oracle, IBM, Informix and Sybase. The typical approaches include federated systems and mediated data warehouses, each of which, to some extent, makes use of a series of source-specific wrapper and mediator layers to integrate the data into a consistent format which is then presented to users as a single virtual data store. These approaches are successful when applied to traditional business data because the data format used by the individual data sources tends to be rather static. Therefore, once a data source has been integrated into a data warehouse, there is relatively little work required to maintain that connection. However, that is not the case for all data sources. Data sources from scientific domains tend to regularly change their data model, format and interface. This is problematic because each change requires the warehouse administrator to update the wrapper, mediator, and warehouse interfaces to properly read, interpret, and represent the modified data source. Furthermore, the data that scientists require to carry out research is continuously changing as their understanding of a research question develops, or as their research objectives evolve. The difficulty and cost of these updates effectively limits the number of sources that can be integrated into a single data warehouse, or makes an approach based on warehousing too expensive to consider.
Date: February 29, 2000
Creator: Musick, R.; Critchlow, T.; Ganesh, M.; Fidelis, Z.; Zemla, A. & Slezak, T.
Partner: UNT Libraries Government Documents Department

Approximate ad-hoc query engine for simulation data

Description: In this paper, we describe AQSim, an ongoing effort to design and implement a system to manage terabytes of scientific simulation data. The goal of this project is to reduce data storage requirements and access times while permitting ad-hoc queries using statistical and mathematical models of the data. In order to facilitate data exchange between models based on different representations, we are evaluating the use of the ASCI common data model, which is composed of several layers of increasing semantic complexity. To support queries over the spatial-temporal, mesh-structured data, we are in the process of defining and implementing a grammar for MeshSQL. (An illustrative sketch of a simplified query parser follows this record.)
Date: February 1, 2001
Creator: Abdulla, G.; Baldwin, C.; Critchlow, T.; Kamimura, R.; Lozares, I.; Musick, R. et al.
Partner: UNT Libraries Government Documents Department
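
The abstract mentions a grammar for MeshSQL to query spatial-temporal mesh data. The Python sketch below parses a hypothetical, heavily simplified query form to suggest what a grammar-driven interface might look like; the syntax shown is an assumption for illustration and is not the MeshSQL grammar itself.

# Illustrative sketch: parse a hypothetical, simplified MeshSQL-like query
# of the form
#   SELECT <var> WHERE TIME BETWEEN <t0> AND <t1> AND REGION BOX(x0,y0,x1,y1)
# The syntax is an assumption for illustration; it is not the actual
# MeshSQL grammar referenced in the abstract.
import re

QUERY_RE = re.compile(
    r"SELECT\s+(?P<var>\w+)\s+WHERE\s+TIME\s+BETWEEN\s+(?P<t0>[\d.]+)"
    r"\s+AND\s+(?P<t1>[\d.]+)\s+AND\s+REGION\s+BOX\("
    r"(?P<x0>[\d.-]+),(?P<y0>[\d.-]+),(?P<x1>[\d.-]+),(?P<y1>[\d.-]+)\)",
    re.IGNORECASE,
)

def parse_query(text):
    m = QUERY_RE.match(text.strip())
    if not m:
        raise ValueError("query does not match the simplified grammar")
    g = m.groupdict()
    return {
        "variable": g["var"],
        "time_range": (float(g["t0"]), float(g["t1"])),
        "region": tuple(float(g[k]) for k in ("x0", "y0", "x1", "y1")),
    }

print(parse_query(
    "SELECT pressure WHERE TIME BETWEEN 0.0 AND 1.5 AND REGION BOX(0,0,10,10)"
))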