Award ER25844: Minimizing System Noise Effects for Extreme-Scale Scientific Simulation Through Function Delegation

PDF Version Also Available for Download.

Creation Information

Lumsdaine, Andrew. November 20, 2012.

Context

This report is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided by UNT Libraries Government Documents Department to Digital Library, a digital repository hosted by the UNT Libraries. More information about this report can be viewed below.

Who

People and organizations associated with either the creation of this report or its content.

Publisher

Provided By

UNT Libraries Government Documents Department

Serving as both a federal and a state depository library, the UNT Libraries Government Documents Department maintains millions of items in a variety of formats. The department is a member of the FDLP Content Partnerships Program and an Affiliated Archive of the National Archives.

What

Descriptive information to help identify this report.

Description

In software running on distributed computing clusters, time spent on communication between nodes can be a significant portion of the overall computation time. Background operating system tasks and other computational "noise" on the nodes can significantly increase the time this communication takes, especially on large systems. The research completed in this period has improved understanding of when such noise will have a significant impact. Specifically, it was demonstrated that not just noise on the nodes, but also noise on the network between nodes, can have a significant impact on computation time. It was also demonstrated that noise patterns matter more than noise intensity: very regular noise can cause less disruption than noise that is lighter on average but less regular. Finally, it was demonstrated that the effect of noise becomes more prominent as the speed of the network between nodes increases.

Furthermore, a tracing tool, Netgauge, was improved through our work, and a system simulator, LogGOPSim, was developed. Application developers can use them to improve the performance of their programs, and system designers can use them to mitigate the effects of noise by adjusting the noise characteristics of the operating system. Both have been made freely available as open source programs. In the course of developing these tools, we demonstrated weaknesses in existing methodologies for modeling communication, and we introduced a more detailed model, LogGOPS, for simulating systems.

Beyond exploring the deleterious effects of noise, we have also offered solutions. Our simulation studies of system noise have led to specific recommendations on tuning systems to mitigate noise. We have also improved existing approaches to mitigating noise.
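
The observation that noise patterns can matter more than average intensity can be illustrated with a toy model. The sketch below is not LogGOPSim and does not reproduce the report's methodology. It assumes a bulk-synchronous loop in which every iteration ends at a barrier, treats "regular" noise as a small detour that hits every process in every iteration, and treats "irregular" noise as a rare, longer detour that hits processes independently. All names and parameter values are hypothetical.

```python
import random

WORK = 1.0        # compute time per iteration (arbitrary units, hypothetical)
NUM_PROCS = 1024  # number of simulated processes (hypothetical)
NUM_ITERS = 200   # number of compute-then-barrier iterations (hypothetical)


def simulate(num_procs, num_iters, work, noise_fn):
    """Return the total run time of a loop where each iteration ends at a barrier."""
    total = 0.0
    for iteration in range(num_iters):
        # The barrier releases only when the slowest process arrives.
        total += max(work + noise_fn(rank, iteration) for rank in range(num_procs))
    return total


def no_noise(rank, iteration):
    return 0.0


def regular_noise(rank, iteration):
    # Assumption: every process loses the same 0.05 units in every iteration.
    return 0.05


def make_irregular_noise(seed, prob=0.01, detour=1.0):
    # Assumption: each process independently loses `detour` units with
    # probability `prob`, so the mean loss is prob * detour = 0.01 units.
    rng = random.Random(seed)

    def noise(rank, iteration):
        return detour if rng.random() < prob else 0.0

    return noise


baseline = simulate(NUM_PROCS, NUM_ITERS, WORK, no_noise)
regular = simulate(NUM_PROCS, NUM_ITERS, WORK, regular_noise)
irregular = simulate(NUM_PROCS, NUM_ITERS, WORK, make_irregular_noise(seed=0))

print(f"slowdown, regular noise:   {regular / baseline:.3f}")
print(f"slowdown, irregular noise: {irregular / baseline:.3f}")
```

Under these assumptions the irregular noise takes one fifth as much time per process on average (0.01 versus 0.05 units per iteration). With 1024 processes, however, almost every iteration has at least one delayed process, and the barrier makes all the others wait for the full detour, so the irregular case runs noticeably slower than the regular one.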
"Non-blocking collective communication" avoids the effects of noise by letting communication proceed concurrently with computation (hence "non-blocking"), so that delays in communication introduced by noise have a smaller impact on overall computation time. Potentially, the impact of noise can be reduced much further by "offloading" communication tasks to a processing element separate from the one the operating system is using. We have improved our library LibNBC, which provides an implementation of non-blocking collectives, through this work. During this research, our proposal to include non-blocking collectives (which used LibNBC as a reference implementation) in the upcoming MPI-3 standard was accepted. As MPI is a ubiquitous and important standard for communication in parallel computing, this demonstrates a certain acceptance of the practicality and desirability of non-blocking collectives. Now that non-blocking collectives are part of the standard, we can expect to see optimized platform-specific implementations. As part of this work we have also developed a language, GOAL (Global Operation Assembly Language), that can be used as a starting point for defining languages to express optimized communication algorithms.
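
The benefit of overlapping communication with computation can be sketched with simple arithmetic. The model below is an idealization, not a description of LibNBC's behavior or a measurement: it assumes perfect overlap and ignores the cost of starting and progressing the non-blocking operation. The function names and numbers are hypothetical. In MPI-3 the pattern corresponds to starting a collective with a call such as MPI_Iallreduce, computing, and then completing the operation with MPI_Wait.

```python
def blocking_time(compute, comm, noise_delay):
    """Communication runs after computation, so any delay adds directly."""
    return compute + comm + noise_delay


def overlapped_time(compute, comm, noise_delay):
    """Communication runs alongside computation (idealized perfect overlap)."""
    return max(compute, comm + noise_delay)


compute, comm = 10.0, 4.0  # hypothetical durations in arbitrary time units
for delay in (0.0, 3.0, 8.0):
    print(delay, blocking_time(compute, comm, delay), overlapped_time(compute, comm, delay))
```

With these numbers, a delay of 3 units lengthens the blocking version from 14 to 17 units, while the overlapped version stays at 10 because the delayed communication still finishes before the computation does. Only when the delay pushes communication past the end of the computation (the 8-unit case) does the overlapped version slow down, to 12 units.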

Subjects

STI Subject Categories

Language

Item Type

Identifier

Unique identifying numbers for this report in the Digital Library or other systems.

  • Report No.: DOE/ER/25844
  • Grant Number: FG02-08ER25844
  • DOI: 10.2172/1082306
  • Office of Scientific & Technical Information Report Number: 1082306
  • Archival Resource Key: ark:/67531/metadc840896

Collections

This report is part of the following collection of related materials.

Office of Scientific & Technical Information Technical Reports

When

Dates and time periods associated with this report.

Creation Date

  • November 20, 2012

Added to The UNT Digital Library

  • May 19, 2016, 9:45 a.m.

Description Last Updated

  • Nov. 22, 2016, 8:49 p.m.

Citations, Rights, Re-Use

Lumsdaine, Andrew. Award ER25844: Minimizing System Noise Effects for Extreme-Scale Scientific Simulation Through Function Delegation, report, November 20, 2012; United States. (digital.library.unt.edu/ark:/67531/metadc840896/: accessed August 17, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT Libraries Government Documents Department.