An Exploration in Implementing Fault Tolerance in Scientific Simulation Application Software

Physical Description

43 pages

Creation Information

DRAKE, RICHARD R. & SUMMERS, RANDALL M. May 1, 2003.

Context

This report is part of the collection entitled Office of Scientific & Technical Information Technical Reports and was provided by the UNT Libraries Government Documents Department to the UNT Digital Library, a digital repository hosted by the UNT Libraries. More information about this report can be viewed below.

Who

People and organizations associated with either the creation of this report or its content.

Publisher

  • Sandia National Laboratories
    Publisher Info: Sandia National Labs., Albuquerque, NM, and Livermore, CA (United States)
    Place of Publication: Albuquerque, New Mexico

Provided By

UNT Libraries Government Documents Department

Serving as both a federal and a state depository library, the UNT Libraries Government Documents Department maintains millions of items in a variety of formats. The department is a member of the FDLP Content Partnerships Program and an Affiliated Archive of the National Archives.

What

Descriptive information to help identify this report.

Description

The ability of scientific simulation software to detect and recover from errors and failures of supporting hardware and software layers is becoming more important due to the pressure to shift from large, specialized, multimillion-dollar ASCI computing platforms to smaller, less expensive interconnected machines consisting of off-the-shelf hardware. As evidenced by the CPlant™ experiences, fault tolerance can be necessary even on such a homogeneous system and may also prove useful in the next generation of ASCI platforms. This report describes a research effort intended to study, implement, and test the feasibility of various fault tolerance mechanisms controlled at the simulation code level. Errors and failures would be detected by underlying software layers, communicated to the application through a convenient interface, and then handled by the simulation code itself. Targeted faults included corrupt communication messages, processor node dropouts, and unacceptable slowdown of service from processing nodes. Recovery techniques such as re-sending communication messages and dynamic reallocation of failing processor nodes were considered. However, most fault tolerance mechanisms rely on underlying software layers, which were discovered to be lacking to such a degree that mechanisms at the application level could not be implemented. This research effort has been postponed and shifted to these supporting layers.

Source

  • Other Information: PBD: 1 May 2003

Identifier

Unique identifying numbers for this report in the Digital Library or other systems.

  • Report No.: SAND2003-1651
  • Grant Number: AC04-94AL85000
  • DOI: 10.2172/811162
  • Office of Scientific & Technical Information Report Number: 811162
  • Archival Resource Key: ark:/67531/metadc734070

Collections

This report is part of the following collection of related materials.

Office of Scientific & Technical Information Technical Reports

When

Dates and time periods associated with this report.

Creation Date

  • May 1, 2003

Added to The UNT Digital Library

  • Oct. 18, 2015, 6:40 p.m.

Description Last Updated

  • April 12, 2016, 6:30 p.m.

Citations, Rights, Re-Use

DRAKE, RICHARD R. & SUMMERS, RANDALL M. An Exploration in Implementing Fault Tolerance in Scientific Simulation Application Software, report, May 1, 2003; Albuquerque, New Mexico. (digital.library.unt.edu/ark:/67531/metadc734070/: accessed September 24, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT Libraries Government Documents Department.