Search Results

open access

Overcoming Scalability Challenges for Tool Daemon Launching

Description: Many tools that target parallel and distributed environments must co-locate a set of daemons with the distributed processes of the target application. However, efficient and portable deployment of these daemons on large scale systems is an unsolved problem. We overcome this gap with LaunchMON, a scalable, robust, portable, secure, and general purpose infrastructure for launching tool daemons. Its API allows tool builders to identify all processes of a target job, launch daemons on the relevant … more
Date: February 15, 2008
Creator: Ahn, D H; Arnold, D C; de Supinski, B R; Lee, G L; Miller, B P & Schulz, M
Partner: UNT Libraries Government Documents Department
open access

BlueGene/L Applications: Parallelism on a Massive Scale

Description: BlueGene/L (BG/L), developed through a partnership between IBM and Lawrence Livermore National Laboratory (LLNL), is currently the world's largest system both in terms of scale with 131,072 processors and absolute performance with a peak rate of 367 TFlop/s. BG/L has led the Top500 list the last four times with a Linpack rate of 280.6 TFlop/s for the full machine installed at LLNL and is expected to remain the fastest computer in the next few editions. However, the real value of a machine like … more
Date: September 8, 2006
Creator: de Supinski, B. R.; Schulz, M.; Bulatov, V. V.; Cabot, W.; Chan, B.; Cook, A. W. et al.
Partner: UNT Libraries Government Documents Department
open access

Lessons learned at 208K: Towards Debugging Millions of Cores

Description: Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application--already, debugging the full Blue-Gene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach suc… more
Date: April 14, 2008
Creator: Lee, G L; Ahn, D H; Arnold, D C; de Supinski, B R; Legendre, M; Miller, B P et al.
Partner: UNT Libraries Government Documents Department
open access

The ASCI PSE Milepost: Run-Time Systems Performance Tests

Description: The Accelerated Strategic Computing Initiative (ASCI) Problem Solving Environment (PSE) consists of the tools and libraries needed for the development of ASCI simulation codes on ASCI machines. The recently completed ASCI PSE Milepost demonstrated that this software environment is available and functional at the scale used for application mileposts on ASCI White. As part of the PSE Milepost, we performed extensive performance testing of several critical run-time based systems. In this paper, we… more
Date: May 7, 2001
Creator: de Supinski, B R
Partner: UNT Libraries Government Documents Department
open access

AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks

Description: Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. This growing scale makes debugging the applications that run on them a daunting challenge. Few debugging tools perform well at this scale and most provide an overload of information about the entire job. Developers need tools that quickly direct them to the root cause of the problem. This paper presents AutomaDeD, a tool that identifies which tasks of a large-scale application first mani… more
Date: March 23, 2010
Creator: Bronevetsky, G; Laguna, I; Bagchi, S; de Supinski, B R; Ahn, D & Schulz, M
Partner: UNT Libraries Government Documents Department
open access

A Proposal for User-defined Reductions in OpenMP

Description: Reductions are commonly used in parallel programs to produce a global result from partial results computed in parallel. Currently, OpenMP only supports reductions for primitive data types and a limited set of base language operators. This is a significant limitation for those applications that employ user-defined data types (e.g., objects). Implementing manual reduction algorithms makes software development more complex and error-prone. Additionally, an OpenMP runtime system cannot optimize a … more
Date: March 22, 2010
Creator: Duran, A; Ferrer, R; Klemm, M; de Supinski, B R & Ayguade, E
Partner: UNT Libraries Government Documents Department
open access

OpenMP for Accelerators

Description: OpenMP [13] is the dominant programming model for shared-memory parallelism in C, C++ and Fortran due to its easy-to-use directive-based style, portability and broad support by compiler vendors. Similar characteristics are needed for a programming model for devices such as GPUs and DSPs that are gaining popularity to accelerate compute-intensive application regions. This paper presents extensions to OpenMP that provide that programming model. Our results demonstrate that a high-level programmin… more
Date: March 15, 2011
Creator: Beyer, J C; Stotzer, E J; Hart, A & de Supinski, B R
Partner: UNT Libraries Government Documents Department
open access

Automatic Fault Characterization via Abnormality-Enhanced Classification

Description: Enterprise and high-performance computing systems are growing extremely large and complex, employing hundreds to hundreds of thousands of processors and software/hardware stacks built by many people across many organizations. As the growing scale of these machines increases the frequency of faults, system complexity makes these faults difficult to detect and to diagnose. Current system management techniques, which focus primarily on efficient data access and query mechanisms, require system adm… more
Date: December 20, 2010
Creator: Bronevetsky, G; Laguna, I & de Supinski, B R
Partner: UNT Libraries Government Documents Department
open access

Massively Parallel Loading

Description: No Description Available.
Date: January 15, 2013
Creator: Frings, W.; Ahn, D. H.; LeGendre, M.; Gamblin, T.; de Supinski, B. R. & Wolf, F.
Partner: UNT Libraries Government Documents Department
open access

Statistical Fault Detection for Parallel Applications with AutomaDeD

Description: Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. The large component count means that these systems fail frequently and often in very complex ways, making them difficult to use and maintain. While prior work on fault detection and diagnosis has focused on faults that significantly reduce system functionality, the wide variety of failure modes in modern systems makes them likely to fail in complex ways that impair system performance but… more
Date: March 23, 2010
Creator: Bronevetsky, G; Laguna, I; Bagchi, S; de Supinski, B R; Ahn, D & Schulz, M
Partner: UNT Libraries Government Documents Department
open access

MUST: A Scalable Approach to Runtime Error Detection in MPI Programs

Description: The Message-Passing Interface (MPI) is large and complex. Therefore, programming MPI is error prone. Several MPI runtime correctness tools address classes of usage errors, such as deadlocks or nonportable constructs. To our knowledge none of these tools scales to more than about 100 processes. However, some of the current HPC systems use more than 100,000 cores and future systems are expected to use far more. Since errors often depend on the task count used, we need correctness tools that scale… more
Date: March 24, 2010
Creator: Hilbrich, T.; Schulz, M.; de Supinski, B. R. & Muller, M.
Partner: UNT Libraries Government Documents Department
open access

Frontiers of Performance Analysis on Leadership-Class Systems

Description: The number of cores in high-end systems for scientific computing is increasing rapidly. As a result, there is a pressing need for tools that can measure, model, and diagnose performance problems in highly parallel runs. We describe two tools that employ complementary approaches for analysis at scale and we illustrate their use on DOE leadership-class systems.
Date: June 15, 2009
Creator: Fowler, R. J.; Adhianto, L.; de Supinski, B. R.; Fagan, M.; Gamblin, T.; Krentel, M. et al.
Partner: UNT Libraries Government Documents Department
open access

Exploiting Data Similarity to Reduce Memory Footprints

Description: Memory size has long limited large-scale applications on high-performance computing (HPC) systems. Since compute nodes frequently do not have swap space, physical memory often limits problem sizes. Increasing core counts per chip and power density constraints, which limit the number of DIMMs per node, have exacerbated this problem. Further, DRAM constitutes a significant portion of overall HPC system cost. Therefore, instead of adding more DRAM to the nodes, mechanisms to manage memory usage mo… more
Date: January 28, 2011
Creator: Biswas, S; de Supinski, B R; Schulz, M; Franklin, D; Sherwood, T & Chong, F T
Partner: UNT Libraries Government Documents Department
open access

Formal Specification of the OpenMP Memory Model

Description: OpenMP [1] is an important API for shared memory programming, combining shared memory's potential for performance with a simple programming interface. Unfortunately, OpenMP lacks a critical tool for demonstrating whether programs are correct: a formal memory model. Instead, the current official definition of the OpenMP memory model (the OpenMP 2.5 specification [1]) is in terms of informal prose. As a result, it is impossible to verify OpenMP applications formally since the prose does not provi… more
Date: May 17, 2006
Creator: Bronevetsky, G & de Supinski, B R
Partner: UNT Libraries Government Documents Department
open access

Dynamic Program Phase Detection in Distributed Shared-Memory Multiprocessors

Description: We present a novel hardware mechanism for dynamic program phase detection in distributed shared-memory (DSM) multiprocessors. We show that successful hardware mechanisms for phase detection in uniprocessors do not necessarily work well in DSM systems, since they lack the ability to incorporate the parallel application's global execution information and memory access behavior based on data distribution. We then propose a hardware extension to a well-known uniprocessor mechanism that significantl… more
Date: March 6, 2006
Creator: Ipek, E; Martinez, J F; de Supinski, B R; McKee, S A & Schulz, M
Partner: UNT Libraries Government Documents Department
open access

Toward Enhancing OpenMP's Work-Sharing Directives

Description: OpenMP provides a portable programming interface for shared memory parallel computers (SMPs). Although this interface has proven successful for small SMPs, it requires greater flexibility in light of the steadily growing size of individual SMPs and the recent advent of multithreaded chips. In this paper, we describe two application development experiences that exposed these expressivity problems in the current OpenMP specification. We then propose mechanisms to overcome these limitations, inclu… more
Date: May 17, 2006
Creator: Chapman, B M; Huang, L; Jin, H; Jost, G & de Supinski, B R
Partner: UNT Libraries Government Documents Department
open access

Tera-scalable Algorithms for Variable-Density Elliptic Hydrodynamics with Spectral Accuracy

Description: A hybrid spectral/compact solver for variable-density viscous incompressible flow is described. Parallelization strategies for the FFTs and band-diagonal matrices are discussed and compared. Transpose methods are found to be highly competitive with direct block parallel methods when the problem is scaled to tens of thousands of processors. Various mapping strategies for the IBM BlueGene/L torus configuration of processors are explored. By optimizing the communication, we have achieved virtually… more
Date: April 13, 2005
Creator: Cook, A. W.; Cabot, W. H.; Welcome, M. L.; Williams, P. L.; Miller, B. J.; de Supinski, B. R. et al.
Partner: UNT Libraries Government Documents Department
open access

Scalable Dynamic Instrumentation for BlueGene/L

Description: Dynamic binary instrumentation for performance analysis on new, large scale architectures such as the IBM Blue Gene/L system (BG/L) poses new challenges. Their scale--with potentially hundreds of thousands of compute nodes--requires new, more scalable mechanisms to deploy and to organize binary instrumentation and to collect the resulting data gathered by the inserted probes. Further, many of these new machines don't support full operating systems on the compute nodes; rather, they rely on ligh… more
Date: September 8, 2005
Creator: Schulz, M; Ahn, D; Bernat, A; de Supinski, B R; Ko, S Y; Lee, G et al.
Partner: UNT Libraries Government Documents Department
open access

A Case for Including Transactions in OpenMP

Description: Transactional Memory (TM) has received significant attention recently as a mechanism to reduce the complexity of shared memory programming. We explore the potential of TM to improve OpenMP applications. We combine a software TM (STM) system to support transactions with an OpenMP implementation to start thread teams and provide task and loop-level parallelization. We apply this system to two application scenarios that reflect realistic TM use cases. Our results with this system demonstrate that … more
Date: January 25, 2010
Creator: Wong, M.; Bihari, B. L.; de Supinski, B. R.; Wu, P.; Michael, M.; Liu, Y. et al.
Partner: UNT Libraries Government Documents Department
open access

Benchmarking Pthreads performance

Description: The importance of the performance of threads libraries is growing as clusters of shared memory machines become more popular. POSIX threads, or Pthreads, is an industry threads library standard. We have implemented the first Pthreads benchmark suite. In addition to measuring basic thread functions, such as thread creation, we apply the LogP model to standard Pthreads communication mechanisms. We present the results of our tests for several hardware platforms. These results demonstrate that the p… more
Date: April 27, 1999
Creator: May, J M & de Supinski, B R
Partner: UNT Libraries Government Documents Department
open access

Semantic-driven Parallelization of Loops Operating on User-defined Containers

Description: The authors describe ROSE, a C++ infrastructure for source-to-source translation, that provides an interface for programmers to easily write their own translators for optimizing user-defined high-level abstractions. Utilizing the semantics of these high-level abstractions, they demonstrate the automatic parallelization of loops that iterate over user-defined containers that have interfaces similar to the lists, vectors and sets in the Standard Template Library (STL). The parallelization is real… more
Date: July 9, 2003
Creator: Quinlan, D; Schordan, M; Yi, Q & de Supinski, B R
Partner: UNT Libraries Government Documents Department