Algorithm-dependent fault tolerance for distributed computing

PDF Version Also Available for Download.

Physical Description

14 p.

Creation Information

Hough, P. D.; Goldsby, M. E. & Walsh, E. J. February 1, 2000.

Context

This report is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided by UNT Libraries Government Documents Department to Digital Library, a digital repository hosted by the UNT Libraries. More information about this report can be viewed below.

Who

People and organizations associated with either the creation of this report or its content.

Publisher

  • Sandia National Laboratories
    Publisher Info: Sandia National Labs., Albuquerque, NM, and Livermore, CA
    Place of Publication: Albuquerque, New Mexico

Provided By

UNT Libraries Government Documents Department

Serving as both a federal and a state depository library, the UNT Libraries Government Documents Department maintains millions of items in a variety of formats. The department is a member of the FDLP Content Partnerships Program and an Affiliated Archive of the National Archives.

What

Descriptive information to help identify this report.

Description

Large-scale distributed systems assembled from commodity parts, such as CPlant, have become common tools in distributed computing. Because of their size and the diversity of their parts, these systems are prone to failures. Applications running on these systems are not equipped to deal with failures efficiently, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. Most programmers use checkpoints so that they can restart their applications, but this is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways to address failures. The goal of this project is to develop a software architecture for detecting and recovering from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is addressed in an application-dependent manner, which allows the programmer to exploit algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will make large-scale applications more robust in high-performance computing environments composed of clusters of commodity computers, such as CPlant and SMP clusters.
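
The split the description draws between generic fault detection and application-dependent recovery can be illustrated with a small sketch. It is not taken from the report: the heartbeat-timeout detector, the names `HeartbeatDetector`, `reassign_tasks`, and `check_and_recover`, and the round-robin task reassignment are all assumptions chosen for illustration. The idea shown is that a task-parallel application might recover by redistributing a failed node's work rather than restarting from a checkpoint.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HeartbeatDetector:
    """Hypothetical detector: a node is failed if it has been silent longer than `timeout`."""
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this node.
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> List[str]:
        # Nodes whose last heartbeat is older than the timeout, in sorted order.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


def reassign_tasks(assignments: Dict[str, List[int]], failed: str) -> None:
    """Hypothetical application-specific recovery: hand the failed node's
    tasks to the surviving nodes round-robin instead of restarting the job."""
    orphaned = assignments.pop(failed, [])
    survivors = sorted(assignments)
    if not survivors:
        raise RuntimeError("no surviving nodes to take over work")
    for i, task in enumerate(orphaned):
        assignments[survivors[i % len(survivors)]].append(task)


def check_and_recover(
    detector: HeartbeatDetector,
    assignments: Dict[str, List[int]],
    recover: Callable[[Dict[str, List[int]], str], None],
    now: float,
) -> List[str]:
    """Run one detection pass and invoke the application's recovery routine
    for each node found to have failed. Returns the failed node names."""
    failed = detector.failed_nodes(now)
    for node in failed:
        del detector.last_seen[node]  # stop tracking the dead node
        recover(assignments, node)
    return failed


if __name__ == "__main__":
    det = HeartbeatDetector(timeout=5.0)
    for name in ("a", "b", "c"):
        det.heartbeat(name, now=0.0)
    det.heartbeat("a", now=8.0)
    det.heartbeat("c", now=8.0)  # node "b" stays silent

    work = {"a": [1], "b": [2, 3], "c": [4]}
    print(check_and_recover(det, work, reassign_tasks, now=10.0))
    print(work)
```

The detector knows nothing about the application, while the recovery routine is passed in by the application, so a different algorithm could supply a different recovery function without changing the detection code.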

Notes

OSTI as DE00754901

Medium: P; Size: 14 pages

Source

  • Other Information: PBD: 1 Feb 2000

Identifier

Unique identifying numbers for this report in the Digital Library or other systems.

  • Report No.: SAND2000-8219
  • Grant Number: AC04-94AL85000
  • DOI: 10.2172/754901
  • Office of Scientific & Technical Information Report Number: 754901
  • Archival Resource Key: ark:/67531/metadc703580

Collections

This report is part of the following collection of related materials.

Office of Scientific & Technical Information Technical Reports

When

Dates and time periods associated with this report.

Creation Date

  • February 1, 2000

Added to The UNT Digital Library

  • Sept. 12, 2015, 6:31 a.m.

Description Last Updated

  • April 7, 2017, 2:04 p.m.

Citations, Rights, Re-Use

Hough, P. D.; Goldsby, M. E. & Walsh, E. J. Algorithm-dependent fault tolerance for distributed computing, report, February 1, 2000; Albuquerque, New Mexico. (digital.library.unt.edu/ark:/67531/metadc703580/: accessed September 22, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT Libraries Government Documents Department.