Peer-to-peer architectures for exascale computing : LDRD final report.

PDF Version Also Available for Download.

Description

The goal of this research was to investigate the potential for employing dynamic, decentralized software architectures to achieve reliability in future high-performance computing platforms. These architectures, inspired by peer-to-peer networks such as botnets that already scale to millions of unreliable nodes, hold promise for enabling scientific applications to run usefully on next-generation exascale platforms ({approx} 10{sup 18} operations per second). Traditional parallel programming techniques suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. Our studies suggest that new architectures, in which failures are treated as ubiquitous ... continued below

Physical Description

44 p.

Creation Information

Vorobeychik, Yevgeniy; Mayo, Jackson R.; Minnich, Ronald G.; Armstrong, Robert C. & Rudish, Donald W. September 1, 2010.

Context

This report is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided by UNT Libraries Government Documents Department to Digital Library, a digital repository hosted by the UNT Libraries. More information about this report can be viewed below.

Who

People and organizations associated with either the creation of this report or its content.

Provided By

UNT Libraries Government Documents Department

Serving as both a federal and a state depository library, the UNT Libraries Government Documents Department maintains millions of items in a variety of formats. The department is a member of the FDLP Content Partnerships Program and an Affiliated Archive of the National Archives.

Contact Us

What

Descriptive information to help identify this report. Follow the links below to find similar items on the Digital Library.

Description

The goal of this research was to investigate the potential for employing dynamic, decentralized software architectures to achieve reliability in future high-performance computing platforms. These architectures, inspired by peer-to-peer networks such as botnets that already scale to millions of unreliable nodes, hold promise for enabling scientific applications to run usefully on next-generation exascale platforms ({approx} 10{sup 18} operations per second). Traditional parallel programming techniques suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. Our studies suggest that new architectures, in which failures are treated as ubiquitous and their effects are considered as simply another controllable source of error in a scientific computation, can remove such obstacles to exascale computing for certain applications. We have developed a simulation framework, as well as a preliminary implementation in a large-scale emulation environment, for exploration of these 'fault-oblivious computing' approaches. High-performance computing (HPC) faces a fundamental problem of increasing total component failure rates due to increasing system sizes, which threaten to degrade system reliability to an unusable level by the time the exascale range is reached ({approx} 10{sup 18} operations per second, requiring of order millions of processors). As computer scientists seek a way to scale system software for next-generation exascale machines, it is worth considering peer-to-peer (P2P) architectures that are already capable of supporting 10{sup 6}-10{sup 7} unreliable nodes. Exascale platforms will require a different way of looking at systems and software because the machine will likely not be available in its entirety for a meaningful execution time. Realistic estimates of failure rates range from a few times per day to more than once per hour for these platforms. P2P architectures give us a starting point for crafting applications and system software for exascale. In the context of the Internet, P2P applications (e.g., file sharing, botnets) have already solved this problem for 10{sup 6}-10{sup 7} nodes. Usually based on a fractal distributed hash table structure, these systems have proven robust in practice to constant and unpredictable outages, failures, and even subversion. For example, a recent estimate of botnet turnover (i.e., the number of machines leaving and joining) is about 11% per week. Nonetheless, P2P networks remain effective despite these failures: The Conficker botnet has grown to {approx} 5 x 10{sup 6} peers. Unlike today's system software and applications, those for next-generation exascale machines cannot assume a static structure and, to be scalable over millions of nodes, must be decentralized. P2P architectures achieve both, and provide a promising model for 'fault-oblivious computing'. This project aimed to study the dynamics of P2P networks in the context of a design for exascale systems and applications. Having no single point of failure, the most successful P2P architectures are adaptive and self-organizing. While there has been some previous work applying P2P to message passing, little attention has been previously paid to the tightly coupled exascale domain. Typically, the per-node footprint of P2P systems is small, making them ideal for HPC use. The implementation on each peer node cooperates en masse to 'heal' disruptions rather than relying on a controlling 'master' node. Understanding this cooperative behavior from a complex systems viewpoint is essential to predicting useful environments for the inextricably unreliable exascale platforms of the future. We sought to obtain theoretical insight into the stability and large-scale behavior of candidate architectures, and to work toward leveraging Sandia's Emulytics platform to test promising candidates in a realistic (ultimately {ge} 10{sup 7} nodes) setting. Our primary example applications are drawn from linear algebra: a Jacobi relaxation solver for the heat equation, and the closely related technique of value iteration in optimization. We aimed to apply P2P concepts in designing implementations capable of surviving an unreliable machine of 10{sup 6} nodes.

Physical Description

44 p.

Language

Item Type

Identifier

Unique identifying numbers for this report in the Digital Library or other systems.

  • Report No.: SAND2010-6392
  • Grant Number: AC04-94AL85000
  • DOI: 10.2172/991537 | External Link
  • Office of Scientific & Technical Information Report Number: 991537
  • Archival Resource Key: ark:/67531/metadc1014674

Collections

This report is part of the following collection of related materials.

Office of Scientific & Technical Information Technical Reports

Reports, articles and other documents harvested from the Office of Scientific and Technical Information.

Office of Scientific and Technical Information (OSTI) is the Department of Energy (DOE) office that collects, preserves, and disseminates DOE-sponsored research and development (R&D) results that are the outcomes of R&D projects or other funded activities at DOE labs and facilities nationwide and grantees at universities and other institutions.

What responsibilities do I have when using this report?

When

Dates and time periods associated with this report.

Creation Date

  • September 1, 2010

Added to The UNT Digital Library

  • Oct. 14, 2017, 8:36 a.m.

Description Last Updated

  • Oct. 25, 2017, 1:57 p.m.

Usage Statistics

When was this report last used?

Congratulations! It looks like you are the first person to view this item online.

Interact With This Report

Here are some suggestions for what to do next.

Start Reading

PDF Version Also Available for Download.

International Image Interoperability Framework

IIF Logo

We support the IIIF Presentation API

Vorobeychik, Yevgeniy; Mayo, Jackson R.; Minnich, Ronald G.; Armstrong, Robert C. & Rudish, Donald W. Peer-to-peer architectures for exascale computing : LDRD final report., report, September 1, 2010; United States. (digital.library.unt.edu/ark:/67531/metadc1014674/: accessed July 18, 2018), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT Libraries Government Documents Department.