
Petascale Computing Enabling Technologies Project Final Report

Description: The Petascale Computing Enabling Technologies (PCET) project addressed challenges arising from current trends in computer architecture that will lead to large-scale systems with many more nodes, each of which uses multicore chips. These factors will soon lead to systems that have over one million processors. Also, the use of multicore chips will lead to less memory and less memory bandwidth per core. We need fundamentally new algorithmic approaches to cope with these memory constraints and the huge number of processors. Further, correct, efficient code development is difficult even with the number of processors in current systems; more processors will only make it harder. The goal of PCET was to overcome these challenges by developing the computer science and mathematical underpinnings needed to realize the full potential of our future large-scale systems. Our research results will significantly increase the scientific output obtained from LLNL's large-scale computing resources by improving application scientist productivity and system utilization. Our successes include scalable mathematical algorithms that adapt to these emerging architecture trends, code correctness and performance methodologies that automate critical aspects of application development, and the foundations for application-level fault tolerance techniques. PCET's scope encompassed several research thrusts in computer science and mathematics: code correctness and performance methodologies, scalable mathematics algorithms appropriate for multicore systems, and application-level fault tolerance techniques. Due to funding limitations, we focused primarily on the first two thrusts, although our work also lays the foundation for the needed advances in fault tolerance. In the area of scalable mathematics algorithms, our preliminary work established that OpenMP performance of the AMG linear solver benchmark and important individual kernels on Atlas did not match the predictions of our simple initial model. Our investigations demonstrated that a poor default memory allocation mechanism degraded performance. We developed a prototype NUMA library to ...
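
The report attributes the OpenMP AMG shortfall on Atlas to the default memory allocation mechanism on its NUMA nodes. As a hedged illustration of the kind of remedy a NUMA-aware allocation library automates, the sketch below uses the standard first-touch idiom: data is initialized with the same parallel decomposition later used for computation, so pages land on the memory of the threads that access them. The array size and loops are illustrative, not taken from the report.

/* Sketch: first-touch placement for OpenMP on NUMA nodes.
 * Pages are physically allocated on the NUMA domain of the thread that
 * first writes them, so initializing data with the same static schedule
 * used in the compute loop keeps accesses local.  Values illustrative. */
#include <stdlib.h>

#define N (1 << 24)

int main(void) {
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);

    /* Parallel initialization: each thread touches the pages it will use. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* Compute loop with the same static schedule reuses local pages. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        y[i] += 3.0 * x[i];

    free(x);
    free(y);
    return 0;
}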
Date: February 14, 2010
Creator: de Supinski, B R
Partner: UNT Libraries Government Documents Department

Formal Specification of the OpenMP Memory Model

Description: OpenMP [1] is an important API for shared memory programming, combining shared memory's potential for performance with a simple programming interface. Unfortunately, OpenMP lacks a critical tool for demonstrating whether programs are correct: a formal memory model. Instead, the current official definition of the OpenMP memory model (the OpenMP 2.5 specification [1]) is in terms of informal prose. As a result, it is impossible to verify OpenMP applications formally since the prose does not provide a formal consistency model that precisely describes how reads and writes on different threads interact. This paper focuses on the formal verification of OpenMP programs through a proposed formal memory model that is derived from the existing prose model [1]. Our formalization provides a two-step process to verify whether an observed OpenMP execution is conformant. In addition to this formalization, our contributions include a discussion of ambiguities in the current prose-based memory model description. Although our formal model may not capture the current informal memory model perfectly, in part due to these ambiguities, our model reflects our understanding of the informal model's intent. We conclude with several examples that may indicate areas of the OpenMP memory model that need further refinement, however it is specified. Our goal is to motivate the OpenMP community to adopt those refinements eventually, ideally through a formal model, in later OpenMP specifications.
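
The consistency questions the paper formalizes arise in fragments like the following producer-consumer exchange, a generic illustration rather than an example from the paper: whether the consumer is guaranteed to see data = 42 depends on the flush operations, not on program order alone, and the prose model leaves several such interactions ambiguous.

/* Sketch: why OpenMP needs a precise memory model.
 * Without the flushes, the consumer may observe flag == 1 while still
 * reading a stale value of data; exactly when writes become visible to
 * other threads is what the formal model must pin down. */
#include <stdio.h>

int data = 0, flag = 0;

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        {                               /* producer */
            data = 42;
            #pragma omp flush(data, flag)
            flag = 1;
            #pragma omp flush(flag)
        }
        #pragma omp section
        {                               /* consumer */
            int f;
            do {
                #pragma omp flush(flag)
                f = flag;
            } while (f == 0);
            #pragma omp flush(data)
            printf("data = %d\n", data);
        }
    }
    return 0;
}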
Date: May 17, 2006
Creator: Bronevetsky, G & de Supinski, B R
Partner: UNT Libraries Government Documents Department

Formal Specification of the OpenMP Memory Model

Description: OpenMP [2] is an important API for shared memory programming, combining shared memory's potential for performance with a simple programming interface. Unfortunately, OpenMP lacks a critical tool for demonstrating whether programs are correct: a formal memory model. Instead, the current official definition of the OpenMP memory model (the OpenMP 2.5 specification [2]) is in terms of informal prose. As a result, it is impossible to verify OpenMP applications formally since the prose does not provide a formal consistency model that precisely describes how reads and writes on different threads interact. We expand on our previous work that focused on the formal verification of OpenMP programs through a formal memory model [?]. As in that work, our formalization, which is derived from the existing prose model [2], provides a two-step process to verify whether an observed OpenMP execution is conformant. This paper extends the model to cover the entire specification. In addition to this formalization, our contributions include a discussion of ambiguities in the current prose-based memory model description. Although our formal model may not capture the current informal memory model perfectly, in part due to these ambiguities, our model reflects our understanding of the informal model's intent. We conclude with several examples that may indicate areas of the OpenMP memory model that need further refinement, however it is specified. Our goal is to motivate the OpenMP community to adopt those refinements eventually, ideally through a formal model, in later OpenMP specifications.
Date: December 19, 2006
Creator: Bronevetsky, G & de Supinski, B
Partner: UNT Libraries Government Documents Department

Practical Differential Profiling

Description: Comparing performance profiles from two runs is an essential performance analysis step that users routinely perform. In this work we present eGprof, a tool that facilitates these comparisons through differential profiling inside gprof. We chose this approach, rather than designing a new tool, since gprof is one of the few performance analysis tools accepted and used by a large community of users. eGprof allows users to 'subtract' two performance profiles directly. It also includes callgraph visualization to highlight the differences in graphical form. Along with the design of this tool, we present several case studies that show how eGprof can be used to find and study the differences between two application executions quickly, and hence can aid the user in this most common step in performance analysis. We do this without requiring major changes on the user's side, the most important factor in guaranteeing the adoption of our tool by code teams.
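
A minimal sketch of the 'subtraction' eGprof performs on two profiles is given below. It diffs flat-profile records by function name; the record layout, function names, and times are hypothetical, since eGprof operates on gprof's own profile data rather than a structure like this.

/* Sketch: subtracting two flat profiles by function name, the core step
 * eGprof performs inside gprof.  Entries, names, and times are
 * hypothetical. */
#include <stdio.h>
#include <string.h>

struct prof_entry { const char *name; double self_seconds; };

static void diff_profiles(const struct prof_entry *a, size_t na,
                          const struct prof_entry *b, size_t nb) {
    for (size_t i = 0; i < na; i++) {
        double other = 0.0;
        for (size_t j = 0; j < nb; j++)
            if (strcmp(a[i].name, b[j].name) == 0) {
                other = b[j].self_seconds;
                break;
            }
        printf("%-16s %+8.3f s\n", a[i].name, a[i].self_seconds - other);
    }
}

int main(void) {
    struct prof_entry run1[] = { {"solve", 12.4}, {"assemble", 3.1} };
    struct prof_entry run2[] = { {"solve",  9.8}, {"assemble", 3.2} };
    diff_profiles(run1, 2, run2, 2);
    return 0;
}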
Date: February 4, 2007
Creator: Schulz, M & De Supinski, B R
Partner: UNT Libraries Government Documents Department

Soft Error Vulnerability of Iterative Linear Algebra Methods

Description: Devices become increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft errors primarily caused problems for space and high-atmospheric computing applications. Modern architectures now use such small feature sizes and such low voltages that soft errors are becoming significant even at terrestrial altitudes. The soft error vulnerability of iterative linear algebra methods, which many scientific applications use, is a critical aspect of the overall application vulnerability. These methods are often considered invulnerable to many soft errors because they converge from an imprecise solution to a precise one. However, we show that iterative methods can be vulnerable to soft errors, with a high rate of silent data corruptions. We quantify this vulnerability, with algorithms generating up to 8.5% erroneous results when subjected to a single bit-flip. Further, we show that detecting soft errors in an iterative method depends on its detailed convergence properties and requires more complex mechanisms than simply checking the residual. Finally, we explore inexpensive techniques to tolerate soft errors in these methods.
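
The observation that 'simply checking the residual' is insufficient can be made concrete: the residual is computed from the same, possibly corrupted, matrix and right-hand side that the iteration uses, so a silent bit-flip in that data can leave the residual small while the answer solves the wrong problem. The Jacobi solver below is an illustrative sketch, not code from the paper.

/* Sketch: residual checking in a Jacobi iteration.
 * The residual is measured against the same A and b the iteration uses,
 * so a silent bit-flip in A or b yields a "converged" answer to the
 * wrong problem; checking the residual alone cannot detect it.
 * Matrix, vector, and tolerance are illustrative. */
#include <math.h>
#include <stdio.h>

#define N 3

int main(void) {
    double A[N][N] = {{4,1,0},{1,4,1},{0,1,4}};
    double b[N] = {5, 6, 5};
    double x[N] = {0, 0, 0}, xn[N];

    for (int it = 0; it < 1000; it++) {
        for (int i = 0; i < N; i++) {             /* Jacobi sweep */
            double s = b[i];
            for (int j = 0; j < N; j++)
                if (j != i) s -= A[i][j] * x[j];
            xn[i] = s / A[i][i];
        }
        double res = 0.0;
        for (int i = 0; i < N; i++) {             /* residual of new iterate */
            x[i] = xn[i];
            double r = b[i];
            for (int j = 0; j < N; j++) r -= A[i][j] * x[j];
            res += r * r;
        }
        if (sqrt(res) < 1e-10) {
            printf("converged after %d iterations\n", it + 1);
            break;
        }
    }
    printf("x = %g %g %g\n", x[0], x[1], x[2]);
    return 0;
}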
Date: December 15, 2007
Creator: Bronevetsky, G & de Supinski, B
Partner: UNT Libraries Government Documents Department

Soft Error Vulnerability of Iterative Linear Algebra Methods

Description: Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft error rates were significant primarily in space and high-atmospheric computing. Modern architectures now use such small feature sizes and such low voltages that soft errors are becoming important even at terrestrial altitudes. Due to their large number of components, supercomputers are particularly susceptible to soft errors. Since many large-scale parallel scientific applications use iterative linear algebra methods, the soft error vulnerability of these methods constitutes a large fraction of the applications' overall vulnerability. Many users consider these methods invulnerable to most soft errors since they converge from an imprecise solution to a precise one. However, we show in this paper that iterative methods are vulnerable to soft errors, exhibiting both silent data corruptions and poor ability to detect errors. Further, we evaluate a variety of soft error detection and tolerance techniques, including checkpointing, linear matrix encodings, and residual tracking techniques.
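
One family of detection techniques the paper evaluates, linear (checksum) encodings of the matrix, can be sketched as follows: column sums of the matrix are carried alongside it, and after every product y = A x the sum of y must equal the dot product of the checksums with x, so a corrupted element is flagged. The dimensions, values, and tolerance below are illustrative.

/* Sketch: checksum-encoded matrix-vector product (a linear encoding).
 * Precompute column sums c[j] = sum_i A[i][j]; after y = A*x, sum_i y[i]
 * must equal dot(c, x) up to rounding.  A violated invariant flags a
 * corrupted element of A or y.  Sizes and tolerance are illustrative. */
#include <math.h>
#include <stdio.h>

#define N 4

int main(void) {
    double A[N][N] = {{2,1,0,0},{1,2,1,0},{0,1,2,1},{0,0,1,2}};
    double x[N] = {1, 2, 3, 4}, y[N], c[N] = {0};

    for (int j = 0; j < N; j++)                   /* encode: column checksums */
        for (int i = 0; i < N; i++)
            c[j] += A[i][j];

    double ysum = 0.0, check = 0.0;
    for (int i = 0; i < N; i++) {                 /* y = A * x */
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
        ysum += y[i];
    }
    for (int j = 0; j < N; j++)
        check += c[j] * x[j];

    if (fabs(ysum - check) > 1e-9 * fabs(check))
        printf("checksum violated: possible soft error\n");
    else
        printf("product verified\n");
    return 0;
}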
Date: January 19, 2008
Creator: Bronevetsky, G & de Supinski, B
Partner: UNT Libraries Government Documents Department

Exploiting Data Similarity to Reduce Memory Footprints

Description: Memory size has long limited large-scale applications on high-performance computing (HPC) systems. Since compute nodes frequently do not have swap space, physical memory often limits problem sizes. Increasing core counts per chip and power density constraints, which limit the number of DIMMs per node, have exacerbated this problem. Further, DRAM constitutes a significant portion of overall HPC system cost. Therefore, instead of adding more DRAM to the nodes, mechanisms to manage memory usage more efficiently - preferably transparently - could increase effective DRAM capacity and thus the benefit of multicore nodes for HPC systems. MPI application processes often exhibit significant data similarity. These data regions occupy multiple physical locations across the individual rank processes within a multicore node and thus offer a potential savings in memory capacity. These regions, primarily residing in heap, are dynamic, which makes them difficult to manage statically. Our novel memory allocation library, SBLLmalloc, automatically identifies identical memory blocks and merges them into a single copy. SBLLmalloc does not require application or OS changes since we implement it as a user-level library. Overall, we demonstrate that SBLLmalloc reduces the memory footprint of a range of MPI applications by 32.03% on average and up to 60.87%. Further, SBLLmalloc supports problem sizes for IRS over 21.36% larger than using standard memory management techniques, thus significantly increasing effective system size. Similarly, SBLLmalloc requires 43.75% fewer nodes than standard memory management techniques to solve an AMG problem.
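
A minimal sketch of the identical-block detection that SBLLmalloc performs transparently inside the allocator is shown below: candidate blocks are hashed and then compared byte for byte. The hash function and fixed block size are simplifying assumptions, and the actual merging step, which maps duplicates onto one shared copy-on-write physical copy, is omitted.

/* Sketch: detecting identical memory blocks by content, the first step
 * of the transparent merging SBLLmalloc performs.  Real merging remaps
 * duplicates onto a single shared, copy-on-write physical copy; here we
 * only flag them.  Hash and block size are illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

static uint64_t hash_block(const unsigned char *p) {
    uint64_t h = 0xcbf29ce484222325ULL;           /* FNV-1a 64-bit */
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Returns 1 if the two blocks could be merged into a single copy. */
static int mergeable(const unsigned char *a, const unsigned char *b) {
    return hash_block(a) == hash_block(b) &&
           memcmp(a, b, BLOCK_SIZE) == 0;
}

int main(void) {
    static unsigned char blk1[BLOCK_SIZE], blk2[BLOCK_SIZE];
    memset(blk1, 0x5a, BLOCK_SIZE);
    memset(blk2, 0x5a, BLOCK_SIZE);
    printf("identical: %s\n", mergeable(blk1, blk2) ? "yes" : "no");
    return 0;
}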
Date: January 28, 2011
Creator: Biswas, S; de Supinski, B R; Schulz, M; Franklin, D; Sherwood, T & Chong, F T
Partner: UNT Libraries Government Documents Department

Lightweight and Statistical Techniques for Petascale Debugging: Correctness on Petascale Systems (CoPS) Preliminary Report

Description: Petascale platforms with O(10^5) and O(10^6) processing cores are driving advancements in a wide range of scientific disciplines. These large systems create unprecedented application development challenges. Scalable correctness tools are critical to shorten the time-to-solution on these systems. Currently, many DOE application developers use primitive manual debugging based on printf or traditional debuggers such as TotalView or DDT. This paradigm breaks down beyond a few thousand cores, yet bugs often arise above that scale. Programmers must reproduce problems in smaller runs to analyze them with traditional tools, or else perform repeated runs at scale using only primitive techniques. Even when traditional tools run at scale, the approach wastes substantial effort and computation cycles. Continued scientific progress demands new paradigms for debugging large-scale applications. The Correctness on Petascale Systems (CoPS) project is developing a revolutionary debugging scheme that will reduce the debugging problem to a scale that human developers can comprehend. The scheme can provide precise diagnoses of the root causes of failure, including suggestions of the location and the type of errors down to the level of code regions or even a single execution point. Our fundamentally new strategy combines and expands three relatively new complementary debugging approaches. The Stack Trace Analysis Tool (STAT), a 2011 R&D 100 Award Winner, identifies behavioral equivalence classes of tasks in MPI jobs and highlights cases in which members of a class exhibit divergent behavior, often the first indicator of an error. The Cooperative Bug Isolation (CBI) project has developed statistical techniques for isolating programming errors in widely deployed code that we will adapt to large-scale parallel applications. Finally, we are developing a new approach to parallelizing expensive correctness analyses, such as analysis of memory usage in the Memgrind tool. In the first two years of the project, we have successfully extended STAT to determine ...
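
The idea behind STAT's equivalence classes can be sketched by bucketing tasks whose call stacks are identical; a small class is often the outlier worth examining. The stack strings and task count below are hypothetical, and the real tool merges stacks from all MPI tasks into a prefix tree rather than comparing flat strings.

/* Sketch: grouping tasks into behavior equivalence classes by call stack,
 * the idea behind STAT's scalable view of an MPI job.  Stacks and task
 * count are hypothetical. */
#include <stdio.h>
#include <string.h>

#define NTASKS 6

int main(void) {
    const char *stack[NTASKS] = {
        "main>solve>MPI_Allreduce", "main>solve>MPI_Allreduce",
        "main>solve>MPI_Allreduce", "main>io>MPI_File_write",
        "main>solve>MPI_Allreduce", "main>io>MPI_File_write",
    };
    const char *class_rep[NTASKS];
    int class_count[NTASKS], nclasses = 0;

    for (int t = 0; t < NTASKS; t++) {
        int c;
        for (c = 0; c < nclasses; c++)
            if (strcmp(stack[t], class_rep[c]) == 0) break;
        if (c == nclasses) {                      /* new equivalence class */
            class_rep[nclasses] = stack[t];
            class_count[nclasses++] = 0;
        }
        class_count[c]++;
    }
    for (int c = 0; c < nclasses; c++)            /* small classes are often the outliers */
        printf("%d task(s): %s\n", class_count[c], class_rep[c]);
    return 0;
}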
Date: September 13, 2011
Creator: de Supinski, B R; Miller, B P & Liblit, B
Partner: UNT Libraries Government Documents Department

Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library

Description: Applications running on today's supercomputers tolerate failures by periodically saving their state in checkpoint files on stable storage, such as a parallel file system. Although this approach is simple, the overhead of writing the checkpoints can be prohibitive, especially for large-scale jobs. In this paper, we present initial results of an enhancement to our Scalable Checkpoint/Restart Library (SCR). We employ MRNet, a tree-based overlay network library, to transfer checkpoints from the compute nodes to the parallel file system asynchronously. This enhancement increases application efficiency by removing the need for an application to block while checkpoints are transferred to the parallel file system. We show that the integration of SCR with MRNet can reduce the time spent in I/O operations by as much as 15x. However, our experiments exposed new scalability issues with our initial implementation. We discuss the sources of the scalability problems and our plans to address them.
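
A minimal sketch of the asynchronous drain this enhancement provides: the application writes its checkpoint to fast node-local storage and resumes computation while a helper thread copies the file toward the parallel file system. SCR performs that transfer over an MRNet tree; a plain thread and file copy stand in for it here, and the paths are hypothetical.

/* Sketch: asynchronous checkpoint drain.  The application writes a
 * checkpoint to node-local storage and keeps computing while a helper
 * thread copies it toward the parallel file system.  SCR performs this
 * transfer over an MRNet tree; a pthread and a file copy stand in for it
 * here, and the paths are hypothetical. */
#include <pthread.h>
#include <stdio.h>

static void *drain_checkpoint(void *arg) {
    const char *local = (const char *)arg;
    FILE *in = fopen(local, "rb");
    FILE *out = fopen("/pfs/job1234/ckpt.0", "wb");    /* hypothetical path */
    if (in && out) {
        char buf[1 << 16];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, out);
    }
    if (in) fclose(in);
    if (out) fclose(out);
    return NULL;
}

int main(void) {
    const char *local = "/tmp/ckpt.0";                 /* hypothetical path */
    FILE *f = fopen(local, "wb");                      /* fast local write */
    if (f) { fputs("application state\n", f); fclose(f); }

    pthread_t drain;
    pthread_create(&drain, NULL, drain_checkpoint, (void *)local);

    /* ... application resumes computation here instead of blocking ... */

    pthread_join(drain, NULL);                         /* before the next checkpoint */
    return 0;
}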
Date: March 20, 2012
Creator: Mohror, K.; Moody, A. & de Supinski, B. R.
Partner: UNT Libraries Government Documents Department

ScalaTrace: Tracing, Analysis and Modeling of HPC Codes at Scale

Description: Characterizing the communication behavior of large-scale applications is a difficult and costly task due to code/system complexity and their long execution times. An alternative to running actual codes is to gather their communication traces and then replay them, which facilitates application tuning and future procurements. While past approaches lacked lossless scalable trace collection, we contribute an approach that provides orders of magnitude smaller, if not near constant-size, communication traces regardless of the number of nodes while preserving structural information. We introduce intra- and inter-node compression techniques for MPI events, we develop a scheme to preserve the time and causality of communication events, and we present results of our implementation for BlueGene/L. Given this novel capability, we discuss its impact on communication tuning and on trace extrapolation. To the best of our knowledge, such a concise, scalable representation of MPI traces combined with time-preserving, deterministic MPI call replay is without precedent.
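
The intra-node compression exploits the regular, loop-structured repetition of MPI calls. The sketch below shows the simplest form of that idea, collapsing consecutive identical event records into counted entries; the event encoding and trace are hypothetical, and ScalaTrace itself matches far richer patterns across loop nests, call parameters, and ranks.

/* Sketch: run-length compression of an MPI event stream, the simplest
 * form of the intra-node pattern compression ScalaTrace builds on.
 * Each event is reduced to an opcode and a byte count; the stream and
 * encoding are hypothetical. */
#include <stdio.h>

struct event { int op; int bytes; };              /* e.g. 1 = Isend, 2 = Irecv, 3 = Waitall */

int main(void) {
    struct event trace[] = {
        {1, 8192}, {1, 8192}, {1, 8192},          /* three Isends from a loop  */
        {2, 8192}, {2, 8192}, {2, 8192},          /* three matching Irecvs     */
        {3, 0},                                   /* one Waitall               */
    };
    int n = sizeof trace / sizeof trace[0];

    for (int i = 0; i < n; ) {
        int run = 1;
        /* extend the run while the next record repeats this one */
        while (i + run < n &&
               trace[i + run].op == trace[i].op &&
               trace[i + run].bytes == trace[i].bytes)
            run++;
        printf("%d x (op=%d, bytes=%d)\n", run, trace[i].op, trace[i].bytes);
        i += run;
    }
    return 0;
}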
Date: March 31, 2010
Creator: Mueller, F; Wu, X; Schulz, M; de Supinski, B & Gamblin, T
Partner: UNT Libraries Government Documents Department

A Proposal for User-defined Reductions in OpenMP

Description: Reductions are commonly used in parallel programs to produce a global result from partial results computed in parallel. Currently, OpenMP only supports reductions for primitive data types and a limited set of base language operators. This is a significant limitation for those applications that employ user-defined data types (e.g., objects). Implementing manual reduction algorithms makes software development more complex and error-prone. Additionally, an OpenMP runtime system cannot optimize a manual reduction algorithm in ways typically applied to reductions on primitive types. In this paper, we propose new mechanisms to allow the use of most pre-existing binary functions on user-defined data types as User-Defined Reduction (UDR) operators. Our measurements show that our UDR prototype implementation provides consistently good performance across a range of thread counts without increasing general runtime overheads.
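
The manual workaround that the proposal aims to eliminate looks like the sketch below: each thread accumulates a private partial result of a user-defined type, and the combination is serialized in a critical section, which the runtime can neither reorder nor tree-reduce. The complex-number type and values are illustrative; the UDR syntax proposed in the paper (and the declare reduction form later adopted in OpenMP 4.0) lets the runtime perform this combination itself.

/* Sketch: a manual reduction over a user-defined type, the pattern UDRs
 * remove.  Each thread keeps a private partial sum and the combination
 * is serialized in a critical section.  Type and values illustrative. */
#include <stdio.h>

typedef struct { double re, im; } cplx;

int main(void) {
    cplx data[1000], total = {0.0, 0.0};
    for (int i = 0; i < 1000; i++) { data[i].re = 1.0; data[i].im = 0.5; }

    #pragma omp parallel
    {
        cplx partial = {0.0, 0.0};                /* thread-private partial result */
        #pragma omp for
        for (int i = 0; i < 1000; i++) {
            partial.re += data[i].re;
            partial.im += data[i].im;
        }
        #pragma omp critical                      /* serialized combination */
        {
            total.re += partial.re;
            total.im += partial.im;
        }
    }
    printf("sum = %g + %gi\n", total.re, total.im);
    return 0;
}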
Date: March 22, 2010
Creator: Duran, A; Ferrer, R; Klemm, M; de Supinski, B R & Ayguade, E
Partner: UNT Libraries Government Documents Department

An Approach to Performance Prediction for Parallel Applications

Description: Accurately modeling and predicting performance for large-scale applications becomes increasingly difficult as system complexity scales dramatically. Analytic predictive models are useful, but are difficult to construct, usually limited in scope, and often fail to capture subtle interactions between architecture and software. In contrast, we employ multilayer neural networks trained on input data from executions on the target platform. This approach is useful for predicting many aspects of performance, and it captures full system complexity. Our models are developed automatically from the training input set, avoiding the difficult and potentially error-prone process required to develop analytic models. This study focuses on the high-performance, parallel application SMG2000, a much studied code whose variations in execution times are still not well understood. Our model predicts performance on two large-scale parallel platforms within 5%-7% error across a large, multi-dimensional parameter space.
Date: May 17, 2005
Creator: Ipek, E; de Supinski, B R; Schulz, M & McKee, S A
Partner: UNT Libraries Government Documents Department

CLOMP: Accurately Characterizing OpenMP Application Overheads

Description: Despite its ease of use, OpenMP has failed to gain widespread use on large-scale systems, largely due to its failure to deliver sufficient performance. Our experience indicates that the cost of initiating OpenMP regions is simply too high for the desired OpenMP usage scenario of many applications. In this paper, we introduce CLOMP, a new benchmark to characterize this aspect of OpenMP implementations accurately. CLOMP complements the existing EPCC benchmark suite to provide simple, easy-to-understand measurements of OpenMP overheads in the context of application usage scenarios. Our results for several OpenMP implementations demonstrate that CLOMP identifies the amount of work required to compensate for the overheads observed with EPCC. Further, we show that CLOMP also captures limitations for OpenMP parallelization on NUMA systems.
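
The overhead CLOMP characterizes can be illustrated with a toy timing harness: run the same small workload serially and inside repeatedly entered parallel regions, and the per-iteration difference approximates the region startup cost. This harness is not CLOMP itself; the repetition count and workload are arbitrary.

/* Sketch: timing the cost of repeatedly entering an OpenMP parallel
 * region around a tiny workload, the overhead CLOMP characterizes.
 * This is a toy harness, not the CLOMP benchmark. */
#include <omp.h>
#include <stdio.h>

#define REPS 10000
#define WORK 1000

static double kernel(double x) {
    for (int i = 0; i < WORK; i++)
        x = x * 0.999 + 1e-6;
    return x;
}

int main(void) {
    volatile double sink = 0.0;

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        sink += kernel(1.0);                      /* serial baseline */
    double serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel                      /* region entered REPS times */
        {
            #pragma omp single
            sink += kernel(1.0);
        }
    }
    double parallel = omp_get_wtime() - t0;

    printf("per-region overhead ~ %.2f us\n",
           (parallel - serial) * 1e6 / REPS);
    return 0;
}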
Date: February 11, 2008
Creator: Bronevetsky, G; Gyllenhaal, J & de Supinski, B
Partner: UNT Libraries Government Documents Department

Automatic Fault Characterization via Abnormality-Enhanced Classification

Description: Enterprise and high-performance computing systems are growing extremely large and complex, employing hundreds to hundreds of thousands of processors and software/hardware stacks built by many people across many organizations. As the growing scale of these machines increases the frequency of faults, system complexity makes these faults difficult to detect and to diagnose. Current system management techniques, which focus primarily on efficient data access and query mechanisms, require system administrators to examine the behavior of various system services manually. Growing system complexity is making this manual process unmanageable: administrators require more effective management tools that can detect faults and help to identify their root causes. System administrators need timely notification when a fault manifests, including the type of fault, the time period in which it occurred, and the processor on which it originated. Statistical modeling approaches can accurately characterize system behavior. However, the complex effects of system faults make these tools difficult to apply effectively. This paper investigates the application of classification and clustering algorithms to fault detection and characterization. We show experimentally that naively applying these methods achieves poor accuracy. Further, we design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy. Our experiments demonstrate that these techniques can detect and characterize faults with 65% accuracy, compared to just 5% accuracy for naive approaches.
Date: December 20, 2010
Creator: Bronevetsky, G; Laguna, I & de Supinski, B R
Partner: UNT Libraries Government Documents Department