Scalable resource management in high performance computers.

PDF Version Also Available for Download.

Description

Clusters of workstations have emerged as an important platform for building cost-effective, scalable and highly-available computers. Although many hardware solutions are available today, the largest challenge in making large-scale clusters usable lies in the system software. In this paper we present STORM, a resource management tool designed to provide scalability, low overhead and the flexibility necessary to efficiently support and analyze a wide range of job scheduling algorithms. STORM achieves these feats by closely integrating the management daemons with the low-level features that are common in state-of-the-art high-performance system area networks. The architecture of STORM is based on three main ... continued below

Physical Description

10 p.

Creation Information

Frachtenberg, E. (Eitan); Petrini, F. (Fabrizio); Fernandez Peinador, J. (Juan) & Coll, S. (Salvador) January 1, 2002.

Context

This article is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided by UNT Libraries Government Documents Department to Digital Library, a digital repository hosted by the UNT Libraries. More information about this article can be viewed below.

Who

People and organizations associated with either the creation of this article or its content.

Provided By

UNT Libraries Government Documents Department

Serving as both a federal and a state depository library, the UNT Libraries Government Documents Department maintains millions of items in a variety of formats. The department is a member of the FDLP Content Partnerships Program and an Affiliated Archive of the National Archives.

Contact Us

What

Descriptive information to help identify this article. Follow the links below to find similar items on the Digital Library.

Description

Clusters of workstations have emerged as an important platform for building cost-effective, scalable and highly-available computers. Although many hardware solutions are available today, the largest challenge in making large-scale clusters usable lies in the system software. In this paper we present STORM, a resource management tool designed to provide scalability, low overhead and the flexibility necessary to efficiently support and analyze a wide range of job scheduling algorithms. STORM achieves these feats by closely integrating the management daemons with the low-level features that are common in state-of-the-art high-performance system area networks. The architecture of STORM is based on three main technical innovations. First, a sizable part of the scheduler runs in the thread processor located on the network interface. Second, we use hardware collectives that are highly scalable both for implementing control heartbeats and to distribute the binary of a parallel job in near-constant time, irrespective of job and machine sizes. Third, we use an I/O bypass protocol that allows fast data movements from the file system to the communication buffers in the network interface and vice versa. The experimental results show that STORM can launch a job with a binary of 12MB on a 64 processor/32 node cluster in less than 0.25 sec on an empty network, in less than 0.45 sec when all the processors are busy computing other jobs, and in less than 0.65 sec when the network is flooded with a background traffic. This paper provides experimental and analytical evidence that these results scale to a much larger number of nodes. To the best of our knowledge, STORM is at least two orders of magnitude faster than existing production schedulers in launching jobs, performing resource management tasks and gang scheduling.

Physical Description

10 p.

Source

  • Submitted to: International Supercomputing Conference, New York, June, 2002

Language

Item Type

Identifier

Unique identifying numbers for this article in the Digital Library or other systems.

  • Report No.: LA-UR-02-1672
  • Grant Number: none
  • Office of Scientific & Technical Information Report Number: 976130
  • Archival Resource Key: ark:/67531/metadc930699

Collections

This article is part of the following collection of related materials.

Office of Scientific & Technical Information Technical Reports

What responsibilities do I have when using this article?

When

Dates and time periods associated with this article.

Creation Date

  • January 1, 2002

Added to The UNT Digital Library

  • Nov. 13, 2016, 7:26 p.m.

Description Last Updated

  • Dec. 12, 2016, 6:16 p.m.

Usage Statistics

When was this article last used?

Yesterday: 0
Past 30 days: 0
Total Uses: 2

Interact With This Article

Here are some suggestions for what to do next.

Start Reading

PDF Version Also Available for Download.

Citations, Rights, Re-Use

Frachtenberg, E. (Eitan); Petrini, F. (Fabrizio); Fernandez Peinador, J. (Juan) & Coll, S. (Salvador). Scalable resource management in high performance computers., article, January 1, 2002; United States. (digital.library.unt.edu/ark:/67531/metadc930699/: accessed August 16, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT Libraries Government Documents Department.