PERI - Auto-tuning Memory Intensive Kernels for Multicore
Samuel Williams*†, Kaushik Datta†, Jonathan Carter*,
Leonid Oliker*†, John Shalf*, Katherine Yelick*†, David Bailey*
*CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
†Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, USA
E-mail: SWWilliams@lbl.gov, kdatta@eecs.berkeley.edu, JTCarter@lbl.gov,
LOliker@lbl.gov, JShalf@lbl.gov, KAYelick@lbl.gov, DHBailey@lbl.gov
Abstract.
We present an auto-tuning approach to optimize application performance on emerging multicore architectures.
The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT
libraries, to application-specific computational kernels. Our work applies this strategy to Sparse Matrix Vector
Multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann
application (LBMHD). We explore one of the broadest sets of multicore architectures in the HPC literature,
including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM
(STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for each kernel that
allows us to identify a highly optimized version for each platform, while amortizing the human programming effort.
Results show that our auto-tuned kernel applications often achieve a better than 4× improvement compared with
the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware
bottlenecks and software challenges for future multicore systems and applications.
1. Introduction
The computing revolution towards massive on-chip parallelism is moving forward with relatively little concrete
evidence on how best to use these technologies for real applications [1]. For the foreseeable future, high-performance computing (HPC) machines will almost certainly contain multicore chips, likely tied together into
(multi-socket) shared memory nodes as the machine building block. As a result, application scientists must
fully harness intra-node performance in order to effectively leverage the enormous computational potential of
emerging multicore-based supercomputers. Thus, understanding the most efficient design and utilization of
these systems, in the context of demanding numerical simulations, is of utmost priority to the HPC community.
In this paper, we present an application-centric approach for producing highly optimized multicore
implementations through a study of Sparse Matrix Vector Multiplication (SpMV), the explicit heat equation
PDE on a regular grid (Stencil), and a Lattice Boltzmann application (LBMHD). Our work uses a novel
approach to implementing the computations across one of the broadest sets of multicore platforms in existing
HPC literature, including the conventional multicore designs of the dual-socket × quad-core Intel Xeon E5345
(Clovertown) and AMD Opteron 2356 (Barcelona), as well as the hardware-multithreaded dual-socket ×
octal-core Sun Niagara2 (UltraSPARC T2+ T5140, Victoria Falls). In addition, we include the heterogeneous local-store based
architecture of the dual-socket × eight-core Sony-Toshiba-IBM (STI) Cell QS20 Blade.
Our work explores a number of important optimization strategies, which we analyze to identify the
microarchitecture bottlenecks in each system; this leads to several insights into how to build effective multicore
applications, compilers, tools and hardware. In particular, we discover that, although the original code versions
Bailey, David H.; Williams, Samuel; Datta, Kaushik; Carter, Jonathan; Oliker, Leonid; Shalf, John; Yelick, Katherine. PERI - Auto-tuning Memory Intensive Kernels for Multicore. Article, June 24, 2008; Berkeley, California. UNT Digital Library, https://digital.library.unt.edu/ark:/67531/metadc898896/m1/1/ (accessed April 25, 2024); crediting UNT Libraries Government Documents Department.