50 Matching Results

Search Results

Advanced search parameters have been applied.

in creator/contributor: "de Supinski, B R"

open access

Overcoming Scalability Challenges for Tool Daemon Launching

Many tools that target parallel and distributed environments must co-locate a set of daemons with the distributed processes of the target application. However, efficient and portable deployment of these daemons on large scale systems is an unsolved problem. We overcome this gap with LaunchMON, a scalable, robust, portable, secure, and general purpose infrastructure for launching tool daemons. Its API allows tool builders to identify all processes of a target job, launch daemons on the relevant … more

Date: February 15, 2008

Creator: Ahn, D H; Arnold, D C; de Supinski, B R; Lee, G L; Miller, B P & Schulz, M

Partner: UNT Libraries Government Documents Department

open access

BlueGene/L Applications: Parallelism on a Massive Scale

BlueGene/L (BG/L), developed through a partnership between IBM and Lawrence Livermore National Laboratory (LLNL), is currently the world's largest system both in terms of scale with 131,072 processors and absolute performance with a peak rate of 367 TFlop/s. BG/L has led the Top500 list the last four times with a Linpack rate of 280.6 TFlop/s for the full machine installed at LLNL and is expected to remain the fastest computer in the next few editions. However, the real value of a machine like … more

Date: September 8, 2006

Creator: de Supinski, B. R.; Schulz, M.; Bulatov, V. V.; Cabot, W.; Chan, B.; Cook, A. W. et al.

Partner: UNT Libraries Government Documents Department

open access

Lessons learned at 208K: Towards Debugging Millions of Cores

Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application--already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach suc… more

Date: April 14, 2008

Creator: Lee, G L; Ahn, D H; Arnold, D C; de Supinski, B R; Legendre, M; Miller, B P et al.

Partner: UNT Libraries Government Documents Department

open access

The ASCI PSE Milepost: Run-Time Systems Performance Tests

The Accelerated Strategic Computing Initiative (ASCI) Problem Solving Environment (PSE) consists of the tools and libraries needed for the development of ASCI simulation codes on ASCI machines. The recently completed ASCI PSE Milepost demonstrated that this software environment is available and functional at the scale used for application mileposts on ASCI White. As part of the PSE Milepost, we performed extensive performance testing of several critical run-time based systems. In this paper, we… more

Date: May 7, 2001

Creator: de Supinski, B R

Partner: UNT Libraries Government Documents Department

open access

AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks

Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. This growing scale makes debugging the applications that run on them a daunting challenge. Few debugging tools perform well at this scale and most provide an overload of information about the entire job. Developers need tools that quickly direct them to the root cause of the problem. This paper presents AutomaDeD, a tool that identifies which tasks of a large-scale application first mani… more

Date: March 23, 2010

Creator: Bronevetsky, G; Laguna, I; Bagchi, S; de Supinski, B R; Ahn, D & Schulz, M

Partner: UNT Libraries Government Documents Department

open access

A Proposal for User-defined Reductions in OpenMP

Reductions are commonly used in parallel programs to produce a global result from partial results computed in parallel. Currently, OpenMP only supports reductions for primitive data types and a limited set of base language operators. This is a significant limitation for those applications that employ user-defined data types (e.g., objects). Implementing manual reduction algorithms makes software development more complex and error-prone. Additionally, an OpenMP runtime system cannot optimize a … more

Date: March 22, 2010

Creator: Duran, A; Ferrer, R; Klemm, M; de Supinski, B R & Ayguade, E

Partner: UNT Libraries Government Documents Department

open access

OpenMP for Accelerators

OpenMP [13] is the dominant programming model for shared-memory parallelism in C, C++ and Fortran due to its easy-to-use directive-based style, portability and broad support by compiler vendors. Similar characteristics are needed for a programming model for devices such as GPUs and DSPs that are gaining popularity to accelerate compute-intensive application regions. This paper presents extensions to OpenMP that provide that programming model. Our results demonstrate that a high-level programmin… more

Date: March 15, 2011

Creator: Beyer, J C; Stotzer, E J; Hart, A & de Supinski, B R

Partner: UNT Libraries Government Documents Department

open access

Automatic Fault Characterization via Abnormality-Enhanced Classification

Enterprise and high-performance computing systems are growing extremely large and complex, employing hundreds to hundreds of thousands of processors and software/hardware stacks built by many people across many organizations. As the growing scale of these machines increases the frequency of faults, system complexity makes these faults difficult to detect and to diagnose. Current system management techniques, which focus primarily on efficient data access and query mechanisms, require system adm… more

Date: December 20, 2010

Creator: Bronevetsky, G; Laguna, I & de Supinski, B R

Partner: UNT Libraries Government Documents Department

open access

Massively Parallel Loading

No Description Available.

Date: January 15, 2013

Creator: Frings, W.; Ahn, D. H.; LeGendre, M.; Gamblin, T.; de Supinski, B. R. & Wolf, F.

Partner: UNT Libraries Government Documents Department

open access

Efficient and Scalable Retrieval Techniques for Global File Properties

No Description Available.

Date: April 30, 2012

Creator: Ahn, D H; Brim, M; de Supinski, B R; Gamblin, T; Lee, G L; LeGendre, M P et al.

Partner: UNT Libraries Government Documents Department

open access

Statistical Fault Detection for Parallel Applications with AutomaDeD

Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. The large component count means that these systems fail frequently and often in very complex ways, making them difficult to use and maintain. While prior work on fault detection and diagnosis has focused on faults that significantly reduce system functionality, the wide variety of failure modes in modern systems makes them likely to fail in complex ways that impair system performance but… more

Date: March 23, 2010

Creator: Bronevetsky, G; Laguna, I; Bagchi, S; de Supinski, B R; Ahn, D & Schulz, M

Partner: UNT Libraries Government Documents Department

open access

MUST: A Scalable Approach to Runtime Error Detection in MPI Programs

The Message-Passing Interface (MPI) is large and complex. Therefore, programming MPI is error prone. Several MPI runtime correctness tools address classes of usage errors, such as deadlocks or nonportable constructs. To our knowledge none of these tools scales to more than about 100 processes. However, some of the current HPC systems use more than 100,000 cores and future systems are expected to use far more. Since errors often depend on the task count used, we need correctness tools that scale… more

Date: March 24, 2010

Creator: Hilbrich, T.; Schulz, M.; de Supinski, B. R. & Muller, M.

Partner: UNT Libraries Government Documents Department

open access

Frontiers of Performance Analysis on Leadership-Class Systems

The number of cores employed in high-end systems for scientific computing is increasing rapidly. As a result, there is a pressing need for tools that can measure, model, and diagnose performance problems in highly parallel runs. We describe two tools that employ complementary approaches for analysis at scale and we illustrate their use on DOE leadership-class systems.

Date: June 15, 2009

Creator: Fowler, R. J.; Adhianto, L.; de Supinski, B. R.; Fagan, M.; Gamblin, T.; Krentel, M. et al.

Partner: UNT Libraries Government Documents Department

open access

Exploiting Data Similarity to Reduce Memory Footprints

Memory size has long limited large-scale applications on high-performance computing (HPC) systems. Since compute nodes frequently do not have swap space, physical memory often limits problem sizes. Increasing core counts per chip and power density constraints, which limit the number of DIMMs per node, have exacerbated this problem. Further, DRAM constitutes a significant portion of overall HPC system cost. Therefore, instead of adding more DRAM to the nodes, mechanisms to manage memory usage mo… more

Date: January 28, 2011

Creator: Biswas, S; de Supinski, B R; Schulz, M; Franklin, D; Sherwood, T & Chong, F T

Partner: UNT Libraries Government Documents Department

open access

Beyond DVFS: A First Look at Performance Under a Hardware-Enforced Power Bound

No Description Available.

Date: March 5, 2012

Creator: Rountree, B. R.; Ahn, D. H.; de Supinski, B. R.; Lowenthal, D. K. & Schulz, M.

Partner: UNT Libraries Government Documents Department

open access

Parallelizing Heavyweight Debugging Tools with MPIecho

No Description Available.

Date: April 22, 2011

Creator: Rountree, B L; Cobb, G X; Gamblin, G T; Schulz, M W; de Supinski, B R & Tufo, H M

Partner: UNT Libraries Government Documents Department

open access

Formal Specification of the OpenMP Memory Model

OpenMP [1] is an important API for shared memory programming, combining shared memory's potential for performance with a simple programming interface. Unfortunately, OpenMP lacks a critical tool for demonstrating whether programs are correct: a formal memory model. Instead, the current official definition of the OpenMP memory model (the OpenMP 2.5 specification [1]) is in terms of informal prose. As a result, it is impossible to verify OpenMP applications formally since the prose does not provi… more

Date: May 17, 2006

Creator: Bronevetsky, G & de Supinski, B R

Partner: UNT Libraries Government Documents Department

open access

Dynamic Program Phase Detection in Distributed Shared-Memory Multiprocessors

We present a novel hardware mechanism for dynamic program phase detection in distributed shared-memory (DSM) multiprocessors. We show that successful hardware mechanisms for phase detection in uniprocessors do not necessarily work well in DSM systems, since they lack the ability to incorporate the parallel application's global execution information and memory access behavior based on data distribution. We then propose a hardware extension to a well-known uniprocessor mechanism that significantl… more

Date: March 6, 2006

Creator: Ipek, E; Martinez, J F; de Supinski, B R; McKee, S A & Schulz, M

Partner: UNT Libraries Government Documents Department

open access

Toward Enhancing OpenMP's Work-Sharing Directives

OpenMP provides a portable programming interface for shared memory parallel computers (SMPs). Although this interface has proven successful for small SMPs, it requires greater flexibility in light of the steadily growing size of individual SMPs and the recent advent of multithreaded chips. In this paper, we describe two application development experiences that exposed these expressivity problems in the current OpenMP specification. We then propose mechanisms to overcome these limitations, inclu… more

Date: May 17, 2006

Creator: Chapman, B M; Huang, L; Jin, H; Jost, G & de Supinski, B R

Partner: UNT Libraries Government Documents Department

open access

Tera-scalable Algorithms for Variable-Density Elliptic Hydrodynamics with Spectral Accuracy

A hybrid spectral/compact solver for variable-density viscous incompressible flow is described. Parallelization strategies for the FFTs and band-diagonal matrices are discussed and compared. Transpose methods are found to be highly competitive with direct block parallel methods when the problem is scaled to tens of thousands of processors. Various mapping strategies for the IBM BlueGene/L torus configuration of processors are explored. By optimizing the communication, we have achieved virtually… more

Date: April 13, 2005

Creator: Cook, A. W.; Cabot, W. H.; Welcome, M. L.; Williams, P. L.; Miller, B. J.; de Supinski, B. R. et al.

Partner: UNT Libraries Government Documents Department

open access

Scalable Dynamic Instrumentation for BlueGene/L

Dynamic binary instrumentation for performance analysis on new, large scale architectures such as the IBM Blue Gene/L system (BG/L) poses new challenges. Their scale--with potentially hundreds of thousands of compute nodes--requires new, more scalable mechanisms to deploy and to organize binary instrumentation and to collect the resulting data gathered by the inserted probes. Further, many of these new machines don't support full operating systems on the compute nodes; rather, they rely on ligh… more

Date: September 8, 2005

Creator: Schulz, M; Ahn, D; Bernat, A; de Supinski, B R; Ko, S Y; Lee, G et al.

Partner: UNT Libraries Government Documents Department

open access

A Case for Including Transactions in OpenMP

Transactional Memory (TM) has received significant attention recently as a mechanism to reduce the complexity of shared memory programming. We explore the potential of TM to improve OpenMP applications. We combine a software TM (STM) system to support transactions with an OpenMP implementation to start thread teams and provide task and loop-level parallelization. We apply this system to two application scenarios that reflect realistic TM use cases. Our results with this system demonstrate that … more

Date: January 25, 2010

Creator: Wong, M.; Bihari, B. L.; de Supinski, B. R.; Wu, P.; Michael, M.; Liu, Y. et al.

Partner: UNT Libraries Government Documents Department

open access

Benchmarking Pthreads performance

The importance of the performance of threads libraries is growing as clusters of shared memory machines become more popular. POSIX threads, or Pthreads, is an industry threads library standard. We have implemented the first Pthreads benchmark suite. In addition to measuring basic thread functions, such as thread creation, we apply the LogP model to standard Pthreads communication mechanisms. We present the results of our tests for several hardware platforms. These results demonstrate that the p… more

Date: April 27, 1999

Creator: May, J M & de Supinski, B R

Partner: UNT Libraries Government Documents Department

open access

Semantic-driven Parallelization of Loops Operating on User-defined Containers

The authors describe ROSE, a C++ infrastructure for source-to-source translation, that provides an interface for programmers to easily write their own translators for optimizing user-defined high-level abstractions. Utilizing the semantics of these high-level abstractions, they demonstrate the automatic parallelization of loops that iterate over user-defined containers that have interfaces similar to the lists, vectors and sets in the Standard Template Library (STL). The parallelization is real… more

Date: July 9, 2003

Creator: Quinlan, D; Schordan, M; Yi, Q & de Supinski, B R

Partner: UNT Libraries Government Documents Department
