FERMILAB-Conf-01/391-T December 2001
Benchmarking and tuning the MILC code on clusters and supercomputers
Steven Gottlieb a,b,*
a Department of Physics-SW117, Indiana University, Bloomington, IN 47405, USA
b Theory Group MS106, Fermilab, P.O. Box 500, Batavia, IL 60510-0500, USA
Recently, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium
and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes. Results will be presented for
many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the
KS conjugate gradient on processors with more advanced memory systems such as PIV, IBM SP and Alpha.
This contribution is a condensation of a 16-page
poster with 17 tables of benchmarks. The poster
is available on the web.
Benchmarks presented here are for the Conjugate
Gradient algorithm with Kogut-Susskind
quarks, not just for D-slash. They are done within the
context of a complete application for creation of
gauge fields using the R-algorithm. The application
uses even-odd checkerboarding, which
reduces possible reuse of data in cache. Even
the single-CPU benchmarks are done with a fully
parallel application that splits the computation
within D-slash into two stages to accommodate the
need to wait for boundary values that would come
from another node in a multi-CPU run. This also
reduces potential cache reuse. On some of the
architectures, we make use of assembly code for
basic SU(3) arithmetic routines or for prefetching
data to cache. We use Kogut-Susskind quarks for
benchmarking because they are used in our dy-
namical quark calculations. KS quarks are more
demanding than Wilson quarks in terms of mem-
ory bandwidth. In single precision, the former
require 1.45 bytes/flop of input data and produce
0.36 bytes/flop of output. For Wilson quarks, only
0.91 bytes/flop of input is required and the output is
unchanged. Thus, it should not be surprising to
find that a Wilson quark code can achieve higher
speed than reported here.
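
As a rough illustration of the bandwidth argument, the sketch below turns the bytes/flop figures quoted above into an upper bound on sustained speed for a given memory bandwidth. It assumes that memory traffic is the only limit, that input and output share one bandwidth, and that nothing is reused from cache. The 1000 MB/s figure and the function name are hypothetical choices for the example, not measurements or names from the MILC code.

```python
# Bytes of memory traffic per floating-point operation in single precision,
# as quoted in the text.
KS_INPUT = 1.45      # Kogut-Susskind quarks, input
WILSON_INPUT = 0.91  # Wilson quarks, input
OUTPUT = 0.36        # output, the same for both formulations


def bandwidth_bound_mflops(bandwidth_mb_per_s, input_bytes_per_flop,
                           output_bytes_per_flop=OUTPUT):
    """Upper bound on Mflop/s if memory traffic were the only limit.

    Simplifying assumption: input and output traffic share one memory
    bandwidth and no data is reused from cache.
    """
    total_bytes_per_flop = input_bytes_per_flop + output_bytes_per_flop
    return bandwidth_mb_per_s / total_bytes_per_flop


if __name__ == "__main__":
    bandwidth = 1000.0  # hypothetical sustained bandwidth in MB/s

    ks = bandwidth_bound_mflops(bandwidth, KS_INPUT)
    wilson = bandwidth_bound_mflops(bandwidth, WILSON_INPUT)

    print(f"KS bound:     {ks:.0f} Mflop/s")
    print(f"Wilson bound: {wilson:.0f} Mflop/s")
    print(f"Wilson/KS:    {wilson / ks:.2f}")
```

Under these assumptions the total traffic is 1.81 bytes/flop for KS quarks and 1.27 bytes/flop for Wilson quarks, so a purely bandwidth-limited Wilson code could run roughly 1.4 times faster on the same memory system. Real codes will deviate from this bound to the extent that cache reuse or arithmetic throughput matters.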
*At Fermilab until June 15, 2002.
Since August 2000, MILC has been working
with Intel and NCSA under a non-disclosure
agreement to tune our code for the Itanium pro-
cessor. In December 2000, we were allowed to re-
port first results without assembly code. Some
limited results with assembly code were reported
at Linux World last January. We may now talk
more freely about results on Itanium.
MILC has had several months of production
running on the initial Terascale Computer System
at the Pittsburgh Supercomputer Center. It is
based on Compaq ES40 nodes that contain 667
MHz EV67 Alpha chips. The full 6 TF computer
will be based on 1000 MHz EV68 chips. At the
end of March, we were given access to the first
ES45 node at PSC that contains that chip.
IBM SP tests have been run on either the In-
diana University SP or Blue Horizon at SDSC.
They have 375 MHz Power 3 chips deployed on
4-way and 8-way SMP nodes, respectively.
During the Spring, we had access to a 1.5 GHz
Pentium IV system and a dual 1.2 GHz Athlon
system, thanks to NCSA and Penguin Comput-
3. CODE CHANGES
The work on the Itanium processor was carried
out in conjunction with two Intel engineers, Gau-
tham Doshi and Brian Nickerson. Doshi worked
on in-lining and optimizing compiler flags for the
Gottlieb, Steven A. Benchmarking and tuning the MILC code on clusters and supercomputers, article, December 28, 2001; Batavia, Illinois. (digital.library.unt.edu/ark:/67531/metadc715120/m1/1/: accessed April 23, 2018), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT Libraries Government Documents Department.