Search Results

open access

An extensible operating system design for large-scale parallel machines.

Description: Running untrusted user-level code inside an operating system kernel was studied in the 1990s but has not really caught on. We believe the time has come to resurrect kernel extensions for operating systems that run on highly parallel clusters and supercomputers. The reason is that the usage model for these machines differs significantly from a desktop machine or a server. In addition, vendors are starting to add features, such as floating-point accelerators, multicore processors, and recon…
Date: April 1, 2009
Creator: Riesen, Rolf E. & Ferreira, Kurt Brian
Partner: UNT Libraries Government Documents Department
open access

A simulation infrastructure for examining the performance of resilience strategies at scale.

Description: Fault tolerance is a major challenge for many current and future extreme-scale systems, with many studies showing it to be the key limiter to application scalability. While there are a number of studies investigating the performance of various resilience mechanisms, these are typically limited to scales orders of magnitude smaller than expected for next-generation systems and simple benchmark problems. In this paper we show how, with very minor changes, a previously published and validated simu…
Date: April 1, 2013
Creator: Ferreira, Kurt Brian; Levy, Scott N. & Bridges, Patrick G.
Partner: UNT Libraries Government Documents Department
open access

Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing

Description: Faults have become the norm rather than the exception for high-end computing on clusters with tens to hundreds of thousands of cores. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper studies the potential for redundancy to both detect and correct soft errors in MPI message-passing applications. Our study investigates the challenges inherent to d…
Date: July 1, 2012
Creator: Fiala, David J; Mueller, Frank; Engelmann, Christian; Ferreira, Kurt Brian; Brightwell, Ron & Riesen, Rolf
Partner: UNT Libraries Government Documents Department
open access

An examination of content similarity within the memory of HPC applications.

Description: Memory content similarity has been effectively exploited for more than a decade to reduce memory consumption. By consolidating duplicate and similar pages in the address space of an application, we can reduce the amount of memory it consumes without negatively affecting the application's perception of the memory resources available to it. In addition to memory de-duplication, there may be many other ways that we can exploit memory content similarity to improve system characteristics. In this pape…
Date: January 1, 2013
Creator: Levy, Scott N.; Bridges, Patrick G.; Ferreira, Kurt Brian; Thompson, Aidan Patrick & Trott, Christian Robert
Partner: UNT Libraries Government Documents Department
open access

Investigating an API for resilient exascale computing.

Description: Increased HPC capability comes with increased complexity, part counts, and fault occurrences. Increasing the resilience of systems and applications to faults is a critical requirement facing the viability of exascale systems, as the overhead of traditional checkpoint/restart is projected to outweigh its benefits due to fault rates outpacing I/O bandwidths. As faults occur and propagate throughout hardware and software layers, pervasive notification and handling mechanisms are necessary. This re…
Date: May 1, 2013
Creator: Stearley, Jon R.; Tomkins, James; VanDyke, John P.; Ferreira, Kurt Brian; Laros, James H., & Bridges, Patrick
Partner: UNT Libraries Government Documents Department
open access

Cooperative application/OS DRAM fault recovery.

Description: Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer ap…
Date: May 1, 2012
Creator: Ferreira, Kurt Brian; Bridges, Patrick G. (University of New Mexico, Albuquerque, NM); Heroux, Michael Allen; Hoemmen, Mark & Brightwell, Ronald Brian
Partner: UNT Libraries Government Documents Department
open access

Redundant computing for exascale systems.

Description: Exascale systems will have hundreds of thousands of compute nodes and millions of components, which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults. Even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total running time saving checkpoints, restarting, and redoing work that was lost. Redundant computing is a method that allows an application to continue working even when failures occ…
Date: December 1, 2010
Creator: Stearley, Jon R.; Riesen, Rolf E.; Laros, James H., III; Ferreira, Kurt Brian; Pedretti, Kevin Thomas Tauke; Oldfield, Ron A. et al.
Partner: UNT Libraries Government Documents Department
open access

rMPI : increasing fault resiliency in a message-passing environment.

Description: As High-End Computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scales due to excessive overheads predicted to more than double an application's time to solution. Redundant computation, long used in distributed and mission-critical systems, has been suggested as an alternative to checkpoint-restart on its own. In this paper we…
Date: April 1, 2011
Creator: Stearley, Jon R.; Laros, James H., III; Ferreira, Kurt Brian; Pedretti, Kevin Thomas Tauke; Oldfield, Ron A.; Riesen, Rolf (IBM Research, Ireland) et al.
Partner: UNT Libraries Government Documents Department
open access

Increasing fault resiliency in a message-passing environment.

Description: Petaflops systems will have tens to hundreds of thousands of compute nodes, which increases the likelihood of faults. Applications use checkpoint/restart to recover from these faults, but even under ideal conditions, applications running on more than 30,000 nodes will likely spend more than half of their total run time saving checkpoints, restarting, and redoing work that was lost. We created a library that performs redundant computations on additional nodes allocated to the application. An acti…
Date: October 1, 2009
Creator: Stearley, Jon R.; Riesen, Rolf E.; Laros, James H., III; Ferreira, Kurt Brian; Pedretti, Kevin Thomas Tauke; Oldfield, Ron A. et al.
Partner: UNT Libraries Government Documents Department
open access

LDRD final report : a lightweight operating system for multi-core capability class supercomputers.

Description: The two primary objectives of this LDRD project were to create a lightweight kernel (LWK) operating system (OS) designed to take maximum advantage of multi-core processors, and to leverage the virtualization capabilities in modern multi-core processors to create a more flexible and adaptable LWK environment. The most significant technical accomplishments of this project were the development of the Kitten lightweight kernel, the co-development of the SMARTMAP intra-node memory mapping technique, …
Date: September 1, 2010
Creator: Kelly, Suzanne Marie; Hudson, Trammell B. (OS Research); Ferreira, Kurt Brian; Bridges, Patrick G. (University of New Mexico); Pedretti, Kevin Thomas Tauke; Levenhagen, Michael J. et al.
Partner: UNT Libraries Government Documents Department
open access

Evaluating operating system vulnerability to memory errors.

Description: Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left r…
Date: May 1, 2012
Creator: Ferreira, Kurt Brian; Bridges, Patrick G. (University of New Mexico); Pedretti, Kevin Thomas Tauke; Mueller, Frank (North Carolina State University); Fiala, David (North Carolina State University) & Brightwell, Ronald Brian
Partner: UNT Libraries Government Documents Department
open access

Keeping checkpoint/restart viable for exascale systems.

Description: Next-generation exascale systems, those capable of performing a quintillion (10^18) operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current tec…
Date: September 1, 2011
Creator: Riesen, Rolf E.; Bridges, Patrick G. (IBM Research, Ireland, Mulhuddart, Dublin); Stearley, Jon R.; Laros, James H., III; Oldfield, Ron A.; Arnold, Dorian (University of New Mexico, Albuquerque, NM) et al.
Partner: UNT Libraries Government Documents Department
open access

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale.

Description: This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O)…
Date: March 1, 2012
Creator: Curry, Matthew L.; Ferreira, Kurt Brian; Pedretti, Kevin Thomas Tauke; Leung, Vitus Joseph; Moreland, Kenneth D.; Lofstead, Gerald Fredrick, II et al.
Partner: UNT Libraries Government Documents Department