Search Results

Petascale Computing Enabling Technologies Project Final Report

Description: The Petascale Computing Enabling Technologies (PCET) project addressed challenges arising from current trends in computer architecture that will lead to large-scale systems with many more nodes, each of which uses multicore chips. These factors will soon lead to systems that have over one million processors. Also, the use of multicore chips will lead to less memory and less memory bandwidth per core. We need fundamentally new algorithmic approaches to cope with these memory constraints and the huge number of processors. Further, correct, efficient code development is difficult even with the number of processors in current systems; more processors will only make it harder. The goal of PCET was to overcome these challenges by developing the computer science and mathematical underpinnings needed to realize the full potential of our future large-scale systems. Our research results will significantly increase the scientific output obtained from LLNL large-scale computing resources by improving application scientist productivity and system utilization. Our successes include scalable mathematical algorithms that adapt to these emerging architecture trends, code correctness and performance methodologies that automate critical aspects of application development, and the foundations for application-level fault tolerance techniques. PCET's scope encompassed several research thrusts in computer science and mathematics: code correctness methodologies, performance methodologies, scalable mathematics algorithms appropriate for multicore systems, and application-level fault tolerance techniques. Due to funding limitations, we focused primarily on the first three thrusts, although our work also lays the foundation for the needed advances in fault tolerance. In the area of scalable mathematics algorithms, our preliminary work established that OpenMP performance of the AMG linear solver benchmark and important individual kernels on Atlas did not match the predictions of our simple initial model. Our investigations demonstrated that a poor default memory allocation mechanism degraded performance. We developed a prototype NUMA library to ...
Date: February 14, 2010
Creator: de Supinski, B R
Partner: UNT Libraries Government Documents Department
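
The report attributes the AMG OpenMP shortfall on Atlas to the default memory allocation mechanism; the prototype NUMA library itself is not described in the abstract. As background only, here is a minimal C/OpenMP sketch of the underlying first-touch placement issue on NUMA nodes (a generic example, not the project's library):

    #include <omp.h>
    #include <stdlib.h>

    #define N (1 << 24)

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);

        /* First-touch policy: each page lands on the NUMA domain of the
           thread that touches it first. Initializing in parallel with the
           same schedule as the compute loop keeps data local to each core. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; }

        /* The bandwidth-bound kernel now reads mostly local memory. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] += 2.0 * b[i];

        free(a); free(b);
        return 0;
    }

A serial initialization loop would instead place every page on one NUMA domain, which is one common way a default allocation policy can throttle a bandwidth-bound solver such as AMG.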

Formal Specification of the OpenMP Memory Model

Description: OpenMP [1] is an important API for shared memory programming, combining shared memory's potential for performance with a simple programming interface. Unfortunately, OpenMP lacks a critical tool for demonstrating whether programs are correct: a formal memory model. Instead, the current official definition of the OpenMP memory model (the OpenMP 2.5 specification [1]) is in terms of informal prose. As a result, it is impossible to verify OpenMP applications formally since the prose does not provide a formal consistency model that precisely describes how reads and writes on different threads interact. This paper focuses on the formal verification of OpenMP programs through a proposed formal memory model that is derived from the existing prose model [1]. Our formalization provides a two-step process to verify whether an observed OpenMP execution is conformant. In addition to this formalization, our contributions include a discussion of ambiguities in the current prose-based memory model description. Although our formal model may not capture the current informal memory model perfectly, in part due to these ambiguities, our model reflects our understanding of the informal model's intent. We conclude with several examples that may indicate areas of the OpenMP memory model that need further refinement, however it is specified. Our goal is to motivate the OpenMP community to adopt those refinements eventually, ideally through a formal model, in later OpenMP specifications.
Date: May 17, 2006
Creator: Bronevetsky, G & de Supinski, B R
Partner: UNT Libraries Government Documents Department
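
For illustration only (this example is not taken from the paper), the following C/OpenMP fragment is the kind of code whose correctness hinges on the memory model: a flag-based hand-off whose outcome depends on how flushes order the writes to data and flag.

    #include <omp.h>
    #include <stdio.h>

    int data = 0, flag = 0;

    int main(void) {
        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0) {
                data = 42;
                #pragma omp flush(data)      /* make the write to data visible  */
                flag = 1;
                #pragma omp flush(flag)      /* ...before the write to the flag */
            } else {
                int f;
                do {
                    #pragma omp flush(flag)  /* re-read flag from memory        */
                    f = flag;
                } while (!f);
                #pragma omp flush(data)
                printf("data = %d\n", data); /* whether 42 is guaranteed here is
                                                exactly what the memory model
                                                must pin down                   */
            }
        }
        return 0;
    }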

Practical Differential Profiling

Description: Comparing performance profiles from two runs is an essential performance analysis step that users routinely perform. In this work we present eGprof, a tool that facilitates these comparisons through differential profiling inside gprof. We chose this approach, rather than designing a new tool, since gprof is one of the few performance analysis tools accepted and used by a large community of users. eGprof allows users to 'subtract' two performance profiles directly. It also includes callgraph visualization to highlight the differences in graphical form. Along with the design of this tool, we present several case studies that show how eGprof can be used to find and study the differences between two application executions quickly, and hence can aid the user in this most common step in performance analysis. We do this without requiring major changes from the user, which is the most important factor in ensuring the adoption of our tool by code teams.
Date: February 4, 2007
Creator: Schulz, M & De Supinski, B R
Partner: UNT Libraries Government Documents Department
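
eGprof implements the subtraction inside gprof itself; the toy C sketch below (hypothetical data layout, not eGprof code) only illustrates the idea of differencing two flat profiles keyed by function name.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical flat-profile entry: function name and cumulative seconds. */
    struct entry { const char *name; double seconds; };

    /* Print per-function deltas (run B minus run A) for functions seen in A. */
    static void diff_profiles(const struct entry *a, int na,
                              const struct entry *b, int nb) {
        for (int i = 0; i < na; i++) {
            double other = 0.0;
            for (int j = 0; j < nb; j++)
                if (strcmp(a[i].name, b[j].name) == 0) { other = b[j].seconds; break; }
            printf("%-20s %+8.3f s\n", a[i].name, other - a[i].seconds);
        }
    }

    int main(void) {
        struct entry runA[] = { {"solve", 12.4}, {"exchange_halo", 3.1} };
        struct entry runB[] = { {"solve", 12.5}, {"exchange_halo", 5.9} };
        diff_profiles(runA, 2, runB, 2);
        return 0;
    }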

Benchmarking Pthreads performance

Description: The importance of the performance of threads libraries is growing as clusters of shared memory machines become more popular. POSIX threads, or Pthreads, is an industry-standard threads library. We have implemented the first Pthreads benchmark suite. In addition to measuring basic thread functions, such as thread creation, we apply the LogP model to standard Pthreads communication mechanisms. We present the results of our tests for several hardware platforms. These results demonstrate that the performance of existing Pthreads implementations varies widely; parts of nearly all of these implementations could be further optimized. Since hardware differences do not fully explain these performance variations, optimizations could improve the implementations. We have also incorporated our threads benchmarks into SKaMPI, an MPI benchmark suite that provides a general framework for performance analysis [7]. SKaMPI does not exhaustively test the MPI standard. Instead, it ...
Date: April 27, 1999
Creator: May, J M & de Supinski, B R
Partner: UNT Libraries Government Documents Department
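
A minimal sketch in the spirit of the suite described (not the benchmark code itself): timing repeated thread creation and join with a monotonic clock.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define REPS 1000

    static void *noop(void *arg) { return arg; }

    int main(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < REPS; i++) {
            pthread_t t;
            pthread_create(&t, NULL, noop, NULL);  /* measure create + join cost */
            pthread_join(t, NULL);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double us = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / 1e3;
        printf("thread create+join: %.2f us per iteration\n", us / REPS);
        return 0;
    }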

Accurately measuring MPI broadcasts in a computational grid

Description: An MPI library's implementation of broadcast communication can significantly affect the performance of applications built with that library. In order to choose between similar implementations or to evaluate available libraries, accurate measurements of broadcast performance are required. As we demonstrate, existing methods for measuring broadcast performance are either inaccurate or inadequate. Fortunately, we have designed an accurate method for measuring broadcast performance, even in a challenging grid environment. Measuring broadcast performance is not easy. Simply sending one broadcast after another allows them to proceed through the network concurrently, thus resulting in inaccurate per-broadcast timings. Existing methods either fail to eliminate this pipelining effect or eliminate it by introducing overheads that are as difficult to measure as the performance of the broadcast itself. This problem becomes even more challenging in grid environments. Latencies along different links can vary significantly. Thus, an algorithm's performance is difficult to predict from its communication pattern. Even when accurate prediction is possible, the pattern is often unknown. Our method introduces a measurable overhead to eliminate the pipelining effect, regardless of variations in link latencies. Accurate measurements allow users to choose between different available implementations. Also, accurate and complete measurements could guide use of a given implementation to improve application performance. These choices will become even more important as grid-enabled MPI libraries [6, 7] become more common since bad choices are likely to cost significantly more in grid environments. In short, the distributed processing community needs accurate, succinct and complete measurements of collective communications performance. Since successive collective communications can often proceed concurrently, accurately measuring them is difficult. Some benchmarks use knowledge of the communication algorithm to predict the timing of events and, thus, eliminate concurrency between the collective communications that they measure. However, accurate event timing predictions are often impossible since network delays and local processing overheads are stochastic. ...
Date: May 6, 1999
Creator: Karonis, N T & de Supinski, B R
Partner: UNT Libraries Government Documents Department
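
A sketch of the measurement problem the abstract describes, assuming at least two MPI ranks: back-to-back broadcasts pipeline through the network, while separating them with an acknowledgment adds an overhead that is itself hard to subtract out. The paper's own method instead introduces an overhead designed to be measurable; the acknowledgment here is only an illustration.

    #include <mpi.h>
    #include <stdio.h>

    #define REPS 100

    int main(int argc, char **argv) {
        int rank, buf[1024] = {0}, ack = 0;
        MPI_Init(&argc, &argv);                /* run with at least two ranks */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Naive: consecutive broadcasts can overlap in the network, so the
           average understates the cost of a single broadcast.              */
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++)
            MPI_Bcast(buf, 1024, MPI_INT, 0, MPI_COMM_WORLD);
        double naive = (MPI_Wtime() - t0) / REPS;

        /* Separated: an ack from one receiver keeps broadcasts from
           pipelining, at the cost of an extra round trip that is itself
           hard to factor out of the timing.                                */
        t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            MPI_Bcast(buf, 1024, MPI_INT, 0, MPI_COMM_WORLD);
            if (rank == 1) MPI_Send(&ack, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            if (rank == 0) MPI_Recv(&ack, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                                    MPI_STATUS_IGNORE);
        }
        double separated = (MPI_Wtime() - t0) / REPS;

        if (rank == 0)
            printf("naive %.6f s, separated %.6f s per broadcast\n",
                   naive, separated);
        MPI_Finalize();
        return 0;
    }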

Exploiting Data Similarity to Reduce Memory Footprints

Description: Memory size has long limited large-scale applications on high-performance computing (HPC) systems. Since compute nodes frequently do not have swap space, physical memory often limits problem sizes. Increasing core counts per chip and power density constraints, which limit the number of DIMMs per node, have exacerbated this problem. Further, DRAM constitutes a significant portion of overall HPC system cost. Therefore, instead of adding more DRAM to the nodes, mechanisms to manage memory usage more efficiently - preferably transparently - could increase effective DRAM capacity and thus the benefit of multicore nodes for HPC systems. MPI application processes often exhibit significant data similarity. These data regions occupy multiple physical locations across the individual rank processes within a multicore node and thus offer a potential savings in memory capacity. These regions, primarily residing in heap, are dynamic, which makes them difficult to manage statically. Our novel memory allocation library, SBLLmalloc, automatically identifies identical memory blocks and merges them into a single copy. SBLLmalloc does not require application or OS changes since we implement it as a user-level library. Overall, we demonstrate that SBLLmalloc reduces the memory footprint of a range of MPI applications by 32.03% on average and up to 60.87%. Further, SBLLmalloc supports problem sizes for IRS over 21.36% larger than using standard memory management techniques, thus significantly increasing effective system size. Similarly, SBLLmalloc requires 43.75% fewer nodes than standard memory management techniques to solve an AMG problem.
Date: January 28, 2011
Creator: Biswas, S; de Supinski, B R; Schulz, M; Franklin, D; Sherwood, T & Chong, F T
Partner: UNT Libraries Government Documents Department
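
SBLLmalloc's internals are not given in the abstract; the toy C sketch below covers only the detection step (hashing fixed-size blocks and confirming byte-for-byte equality) and omits the actual cross-process merging and copy-on-write handling.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define BLOCK 4096  /* compare at page granularity, as a user-level
                           allocator plausibly would                     */

    /* FNV-1a hash of one block; equal hashes are only merge candidates. */
    static uint64_t block_hash(const unsigned char *p) {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < BLOCK; i++) { h ^= p[i]; h *= 1099511628211ULL; }
        return h;
    }

    /* Returns 1 if two blocks are bytewise identical (hash check first). */
    static int blocks_identical(const unsigned char *a, const unsigned char *b) {
        return block_hash(a) == block_hash(b) && memcmp(a, b, BLOCK) == 0;
    }

    int main(void) {
        static unsigned char rank0_block[BLOCK], rank1_block[BLOCK];
        memset(rank0_block, 7, BLOCK);
        memset(rank1_block, 7, BLOCK);        /* same contents in two "ranks" */
        printf("mergeable: %d\n", blocks_identical(rank0_block, rank1_block));
        return 0;
    }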

Lightweight and Statistical Techniques for Petascale Debugging: Correctness on Petascale Systems (CoPS) Preliminary Report

Description: Petascale platforms with O(10^5) and O(10^6) processing cores are driving advancements in a wide range of scientific disciplines. These large systems create unprecedented application development challenges. Scalable correctness tools are critical to shorten the time-to-solution on these systems. Currently, many DOE application developers use primitive manual debugging based on printf or traditional debuggers such as TotalView or DDT. This paradigm breaks down beyond a few thousand cores, yet bugs often arise above that scale. Programmers must reproduce problems in smaller runs to analyze them with traditional tools, or else perform repeated runs at scale using only primitive techniques. Even when traditional tools run at scale, the approach wastes substantial effort and computation cycles. Continued scientific progress demands new paradigms for debugging large-scale applications. The Correctness on Petascale Systems (CoPS) project is developing a revolutionary debugging scheme that will reduce the debugging problem to a scale that human developers can comprehend. The scheme can provide precise diagnoses of the root causes of failure, including suggestions of the location and the type of errors down to the level of code regions or even a single execution point. Our fundamentally new strategy combines and expands three relatively new, complementary debugging approaches. The Stack Trace Analysis Tool (STAT), a 2011 R&D 100 Award winner, identifies behavioral equivalence classes in MPI jobs and highlights when elements of a class diverge, often the first indicator of an error. The Cooperative Bug Isolation (CBI) project has developed statistical techniques for isolating programming errors in widely deployed code that we will adapt to large-scale parallel applications. Finally, we are developing a new approach to parallelizing expensive correctness analyses, such as analysis of memory usage in the Memgrind tool. In the first two years of the project, we have successfully extended STAT to determine ...
Date: September 13, 2011
Creator: de Supinski, B R; Miller, B P & Liblit, B
Partner: UNT Libraries Government Documents Department

Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library

Description: Applications running on today's supercomputers tolerate failures by periodically saving their state in checkpoint files on stable storage, such as a parallel file system. Although this approach is simple, the overhead of writing the checkpoints can be prohibitive, especially for large-scale jobs. In this paper, we present initial results of an enhancement to our Scalable Checkpoint/Restart Library (SCR). We employ MRNet, a tree-based overlay network library, to transfer checkpoints from the compute nodes to the parallel file system asynchronously. This enhancement increases application efficiency by removing the need for an application to block while checkpoints are transferred to the parallel file system. We show that the integration of SCR with MRNet can reduce the time spent in I/O operations by as much as 15x. However, our experiments exposed new scalability issues with our initial implementation. We discuss the sources of the scalability problems and our plans to address them.
Date: March 20, 2012
Creator: Mohror, K.; Moody, A. & de Supinski, B. R.
Partner: UNT Libraries Government Documents Department
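
A hedged sketch of the general idea of asynchronous checkpoint transfer, not of SCR or MRNet themselves: a helper thread drains a node-local checkpoint file to slower storage while the application continues. The file paths are made up.

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical file names; in SCR the checkpoint first lands in
       node-local storage and is later drained to the parallel file system. */
    static const char *local_ckpt = "/tmp/ckpt.0";
    static const char *pfs_ckpt   = "/pfs/job/ckpt.0";

    static void *drain(void *arg) {
        (void)arg;
        FILE *in = fopen(local_ckpt, "rb"), *out = fopen(pfs_ckpt, "wb");
        if (in && out) {
            char buf[1 << 16];
            size_t n;
            while ((n = fread(buf, 1, sizeof buf, in)) > 0)
                fwrite(buf, 1, n, out);   /* copy proceeds off the critical path */
        }
        if (in) fclose(in);
        if (out) fclose(out);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        /* The application writes its checkpoint to fast local storage, then
           keeps computing while the helper thread drains it.                */
        pthread_create(&t, NULL, drain, NULL);
        /* ... computation continues here ... */
        pthread_join(t, NULL);
        return 0;
    }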

A Proposal for User-defined Reductions in OpenMP

Description: Reductions are commonly used in parallel programs to produce a global result from partial results computed in parallel. Currently, OpenMP only supports reductions for primitive data types and a limited set of base language operators. This is a significant limitation for those applications that employ user-defined data types (e. g., objects). Implementing manual reduction algorithms makes software development more complex and error-prone. Additionally, an OpenMP runtime system cannot optimize a manual reduction algorithm in ways typically applied to reductions on primitive types. In this paper, we propose new mechanisms to allow the use of most pre-existing binary functions on user-defined data types as User-Defined Reduction (UDR) operators. Our measurements show that our UDR prototype implementation provides consistently good performance across a range of thread counts without increasing general runtime overheads.
Date: March 22, 2010
Creator: Duran, A; Ferrer, R; Klemm, M; de Supinski, B R & Ayguade, E
Partner: UNT Libraries Government Documents Department
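
The paper proposes its own UDR mechanism; the example below instead uses the declare reduction construct that OpenMP later standardized (4.0) to show what registering a binary combiner for a user-defined type looks like.

    #include <stdio.h>

    typedef struct { double re, im; } cplx;

    /* Binary combiner for the reduction. */
    static cplx cadd_fn(cplx a, cplx b) {
        cplx r = { a.re + b.re, a.im + b.im };
        return r;
    }

    /* Register the combiner as a reduction operator for cplx
       (OpenMP 4.0 syntax, shown for illustration; the paper's proposed
       mechanism differs in its details).                                */
    #pragma omp declare reduction(cadd : cplx : omp_out = cadd_fn(omp_out, omp_in)) \
            initializer(omp_priv = (cplx){ 0.0, 0.0 })

    int main(void) {
        cplx sum = { 0.0, 0.0 };
        #pragma omp parallel for reduction(cadd : sum)
        for (int i = 0; i < 1000; i++) {
            sum.re += 1.0;
            sum.im += 0.5;
        }
        printf("%.1f + %.1fi\n", sum.re, sum.im);
        return 0;
    }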

An Approach to Performance Prediction for Parallel Applications

Description: Accurately modeling and predicting performance for large-scale applications becomes increasingly difficult as system complexity scales dramatically. Analytic predictive models are useful, but are difficult to construct, usually limited in scope, and often fail to capture subtle interactions between architecture and software. In contrast, we employ multilayer neural networks trained on input data from executions on the target platform. This approach is useful for predicting many aspects of performance, and it captures full system complexity. Our models are developed automatically from the training input set, avoiding the difficult and potentially error-prone process required to develop analytic models. This study focuses on the high-performance, parallel application SMG2000, a much studied code whose variations in execution times are still not well understood. Our model predicts performance on two large-scale parallel platforms within 5%-7% error across a large, multi-dimensional parameter space.
Date: May 17, 2005
Creator: Ipek, E; de Supinski, B R; Schulz, M & McKee, S A
Partner: UNT Libraries Government Documents Department
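
As a schematic only, the shape of such a predictor: a one-hidden-layer network mapping application and system parameters to a predicted run time. The inputs and weights below are placeholders, not values from the study.

    #include <math.h>
    #include <stdio.h>

    #define NIN 3   /* e.g., processor count and per-dimension problem sizes
                       (illustrative choice of inputs)                       */
    #define NHID 4

    /* Placeholder weights; in the paper these are learned from training runs. */
    static const double W1[NHID][NIN] = {{0.1,0.2,0.3},{0.0,0.1,0.2},
                                         {0.3,0.1,0.0},{0.2,0.2,0.1}};
    static const double b1[NHID] = {0.1, 0.0, 0.1, 0.0};
    static const double W2[NHID] = {0.5, 0.3, 0.2, 0.4};
    static const double b2 = 0.05;

    /* One hidden layer with sigmoid units, linear output: predicted run time. */
    static double predict(const double x[NIN]) {
        double y = b2;
        for (int j = 0; j < NHID; j++) {
            double h = b1[j];
            for (int i = 0; i < NIN; i++) h += W1[j][i] * x[i];
            y += W2[j] * (1.0 / (1.0 + exp(-h)));
        }
        return y;
    }

    int main(void) {
        double x[NIN] = {64.0, 32.0, 32.0};   /* a hypothetical configuration */
        printf("predicted time: %.3f\n", predict(x));
        return 0;
    }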

Automatic Fault Characterization via Abnormality-Enhanced Classification

Description: Enterprise and high-performance computing systems are growing extremely large and complex, employing hundreds to hundreds of thousands of processors and software/hardware stacks built by many people across many organizations. As the growing scale of these machines increases the frequency of faults, system complexity makes these faults difficult to detect and to diagnose. Current system management techniques, which focus primarily on efficient data access and query mechanisms, require system administrators to examine the behavior of various system services manually. Growing system complexity is making this manual process unmanageable: administrators require more effective management tools that can detect faults and help to identify their root causes. System administrators need timely notification when a fault is manifested, including the type of fault, the time period in which it occurred, and the processor on which it originated. Statistical modeling approaches can accurately characterize system behavior. However, the complex effects of system faults make these tools difficult to apply effectively. This paper investigates the application of classification and clustering algorithms to fault detection and characterization. We show experimentally that naively applying these methods achieves poor accuracy. Further, we design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy. Our experiments demonstrate that these techniques can detect and characterize faults with 65% accuracy, compared to just 5% accuracy for naive approaches.
Date: December 20, 2010
Creator: Bronevetsky, G; Laguna, I & de Supinski, B R
Partner: UNT Libraries Government Documents Department

OpenMP for Accelerators

Description: OpenMP [13] is the dominant programming model for shared-memory parallelism in C, C++ and Fortran due to its easy-to-use directive-based style, portability and broad support by compiler vendors. Similar characteristics are needed for a programming model for devices such as GPUs and DSPs that are gaining popularity to accelerate compute-intensive application regions. This paper presents extensions to OpenMP that provide that programming model. Our results demonstrate that a high-level programming model can provide accelerated performance comparable to hand-coded implementations in CUDA.
Date: March 15, 2011
Creator: Beyer, J C; Stotzer, E J; Hart, A & de Supinski, B R
Partner: UNT Libraries Government Documents Department
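
The paper's exact proposed directives are not reproduced in the abstract; the sketch below uses the target constructs that OpenMP later adopted (4.0) as an illustration of the directive-based accelerator style it argues for, not necessarily the paper's proposed syntax.

    #include <stdio.h>

    #define N 1024

    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Offload the loop to an attached accelerator, copying a and b in
           and c back out.                                                  */
        #pragma omp target map(to: a, b) map(from: c)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[10] = %.1f\n", c[10]);
        return 0;
    }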

MUST: A Scalable Approach to Runtime Error Detection in MPI Programs

Description: The Message-Passing Interface (MPI) is large and complex. Therefore, programming MPI is error prone. Several MPI runtime correctness tools address classes of usage errors, such as deadlocks or nonportable constructs. To our knowledge none of these tools scales to more than about 100 processes. However, some of the current HPC systems use more than 100,000 cores and future systems are expected to use far more. Since errors often depend on the task count used, we need correctness tools that scale to the full system size. We present a novel framework for scalable MPI correctness tools to address this need. Our fine-grained, module-based approach supports rapid prototyping and allows correctness tools built upon it to adapt to different architectures and use cases. The design uses PnMPI to instantiate a tool from a set of individual modules. We present an overview of our design, along with first performance results for a proof of concept implementation.
Date: March 24, 2010
Creator: Hilbrich, T; Schulz, M; de Supinski, B R & Muller, M
Partner: UNT Libraries Government Documents Department
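
As an illustration of the interposition layer such tools build on (not MUST itself): a single check module in the style of the PMPI profiling interface, which PnMPI uses to stack tool modules.

    #include <mpi.h>
    #include <stdio.h>

    /* Interpose on MPI_Send via the PMPI profiling interface: check the
       arguments, then forward to the real implementation. A framework such
       as PnMPI lets many independent modules of this form be layered.      */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm) {
        if (count < 0)
            fprintf(stderr, "[check] MPI_Send called with negative count %d\n",
                    count);
        if (tag < 0)
            fprintf(stderr, "[check] MPI_Send called with negative tag %d\n", tag);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }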

Regression Strategies for Parameter Space Exploration: A Case Study in Semicoarsening Multigrid and R

Description: Increasing system and algorithmic complexity, combined with a growing number of tunable application parameters, pose significant challenges for analytical performance modeling. This report outlines a series of robust techniques that enable efficient parameter space exploration based on empirical statistical modeling. In particular, this report applies statistical techniques such as clustering, association, and correlation analyses to understand the parameter space better. Results from these statistical techniques guide the construction of piecewise polynomial regression models. Residual and significance tests ensure the resulting model is unbiased and efficient. We demonstrate these techniques in R, a statistical computing environment, for predicting the performance of semicoarsening multigrid. 50 and 75 percent of predictions achieve error rates of 5.5 and 10.0 percent or less, respectively.
Date: September 28, 2006
Creator: Lee, B C; Schulz, M & de Supinski, B R
Partner: UNT Libraries Government Documents Department
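
As a sketch of the model family only (the report's models are multivariate and fit in R; the single parameter and knot below are illustrative), a piecewise polynomial regression augments a polynomial in a parameter x with truncated terms that switch on past a knot x_0:

    \hat{t}(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 \,(x - x_0)_+^2,
    \qquad (x - x_0)_+ = \max(0,\, x - x_0)

Residual and significance tests then check whether the fitted coefficients and the model's error structure justify the chosen terms.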

Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System

Description: High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. A potential solution to this problem is to use multi-level checkpointing, which employs multiple types of checkpoints with different costs and different levels of resiliency in a single run. The goal is to design light-weight checkpoints to handle the most common failure modes and rely on more expensive checkpoints for less common, but more severe failures. While this approach is theoretically promising, it has not been fully evaluated in a large-scale, production system context. To this end we have designed a system, called the Scalable Checkpoint/Restart (SCR) library, that writes checkpoints to storage on the compute nodes utilizing RAM, Flash, or disk, in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the load on the parallel file system by a factor of two on current and future systems.
Date: April 9, 2010
Creator: Moody, A T; Bronevetsky, G; Mohror, K M & de Supinski, B R
Partner: UNT Libraries Government Documents Department
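
A toy sketch of the scheduling idea only; the interval ratios below are arbitrary placeholders, whereas SCR chooses them from its Markov model of failure rates and checkpoint costs.

    #include <stdio.h>

    /* Checkpoint levels, from cheapest/least resilient to most expensive. */
    enum level { LOCAL_RAM, PARTNER_COPY, PARALLEL_FS };

    /* Pick a level for checkpoint number n: frequent cheap checkpoints for
       common failures, occasional expensive ones for severe failures.     */
    static enum level pick_level(int n) {
        if (n % 32 == 0) return PARALLEL_FS;   /* rare; survives large-scale loss */
        if (n % 4 == 0)  return PARTNER_COPY;  /* survives single-node failure    */
        return LOCAL_RAM;                      /* frequent; cheapest              */
    }

    int main(void) {
        for (int n = 1; n <= 64; n++)
            printf("checkpoint %2d -> level %d\n", n, (int)pick_level(n));
        return 0;
    }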

Exploiting hierarchy in parallel computer networks to optimize collective operations performance

Description: The efficient implementation of collective communication operations has received much attention. Initial efforts modeled network communication and produced optimal trees based on those models. However, the models used by these initial efforts assumed equal point-to-point latencies between any two processes. This assumption is violated in heterogeneous systems such as clusters of SMPs and wide-area computational grids, and as a result, collective operations that utilize the trees generated by these models perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. The authors present a strategy based upon a multilayer view of the network. By creating multilevel topology trees they take advantage of communication cost differences at every level in the network. They used this strategy to implement topology-aware versions of several MPI collective operations in MPICH-G, the Globus-enabled version of the popular MPICH implementation of the MPI standard. Using information about topology discovered by Globus, they construct these topology-aware trees automatically during execution, thus freeing the MPI application programmer from having to write special files or functions to describe the topology to the MPICH library. They present results demonstrating the advantages of their multilevel approach by comparing it to the default (topology-unaware) implementation provided by MPICH and a topology-aware two-layer implementation.
Date: February 4, 2000
Creator: Karonis, N. T.; de Supinski, B. R.; Foster, I.; Gropp, W.; Lusk, E. & Bresnahan, J.
Partner: UNT Libraries Government Documents Department
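
A sketch of the two-level case using present-day MPI communicator splitting, as an illustration rather than the MPICH-G implementation (which derives its topology from Globus and builds multilevel trees automatically): broadcast among one leader per node, then within each node.

    #include <mpi.h>

    /* Two-level broadcast: root to one leader per node over the slower
       network, then leader to local ranks over shared memory. A multilevel
       scheme repeats this at every layer of the topology.                  */
    static void hierarchical_bcast(int *buf, int count, MPI_Comm comm) {
        MPI_Comm node_comm, leader_comm;
        int node_rank;

        /* Group ranks that share a node. */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                            &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* One leader (local rank 0) per node joins the upper-level communicator. */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

        if (leader_comm != MPI_COMM_NULL) {
            MPI_Bcast(buf, count, MPI_INT, 0, leader_comm);  /* across nodes */
            MPI_Comm_free(&leader_comm);
        }
        MPI_Bcast(buf, count, MPI_INT, 0, node_comm);        /* within node  */
        MPI_Comm_free(&node_comm);
    }

    int main(int argc, char **argv) {
        int buf[16] = {0};
        MPI_Init(&argc, &argv);
        hierarchical_bcast(buf, 16, MPI_COMM_WORLD);   /* data originates at rank 0 */
        MPI_Finalize();
        return 0;
    }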