PERI - Auto-tuning Memory Intensive Kernels for Multicore
Samuel Williams*†, Kaushik Datta†, Jonathan Carter*,
Leonid Oliker*†, John Shalf*, Katherine Yelick*†, David Bailey*
*CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
†Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, USA
E-mail: SWWilliams@lbl.gov, kdatta@eecs.berkeley.edu, JTCarter@lbl.gov,
LOliker@lbl.gov, JShalf@lbl.gov, KAYelick@lbl.gov, DHBailey@lbl.gov
Abstract.
We present an auto-tuning approach to optimize application performance on emerging multicore architectures.
The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT
libraries, to application-specific computational kernels. Our work applies this strategy to Sparse Matrix Vector
Multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann
application (LBMHD). We explore one of the broadest sets of multicore architectures in the HPC literature,
including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM
(STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for each kernel that
allows us to identify a highly optimized version for each platform, while amortizing the human programming effort.
Results show that our auto-tuned kernel applications often achieve a better than 4× improvement compared with
the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware
bottlenecks and software challenges for future multicore systems and applications.
1. Introduction
The computing revolution towards massive on-chip parallelism is moving forward with relatively little concrete
evidence on how best to use these technologies for real applications [1]. For the foreseeable future, high-performance computing (HPC) machines will almost certainly contain multicore chips, likely tied together into
(multi-socket) shared memory nodes as the machine building block. As a result, application scientists must
fully harness intra-node performance in order to effectively leverage the enormous computational potential of
emerging multicore-based supercomputers. Thus, understanding the most efficient design and utilization of
these systems, in the context of demanding numerical simulations, is of utmost priority to the HPC community.
In this paper, we present an application-centric approach for producing highly optimized multicore
implementations through a study of Sparse Matrix Vector Multiplication (SpMV), the explicit heat equation
PDE on a regular grid (Stencil), and a Lattice Boltzmann application (LBMHD). Our work uses a novel
approach to implementing the computations across one of the broadest sets of multicore platforms in existing
HPC literature, including the conventional multicore designs of the dual-socket × quad-core Intel Xeon E5345
(Clovertown) and AMD Opteron 2356 (Barcelona), as well as the hardware-multithreaded dual-socket ×
octal-core Sun Niagara2 (UltraSPARC T2+ T5140, Victoria Falls). In addition, we include the heterogeneous local-store based
architecture of the dual-socket × eight-core Sony-Toshiba-IBM (STI) Cell QS20 Blade.
Our work explores a number of important optimization strategies, which we analyze to identify the
microarchitecture bottlenecks in each system; this leads to several insights into how to build effective multicore
applications, compilers, tools and hardware. In particular, we discover that, although the original code versions
Bailey, David H.; Williams, Samuel; Datta, Kaushik; Carter, Jonathan; Oliker, Leonid; Shalf, John; Yelick, Katherine. PERI - Auto-tuning Memory Intensive Kernels for Multicore. Article, June 24, 2008; Berkeley, California. UNT Digital Library, https://digital.library.unt.edu/ark:/67531/metadc898896/m1/1/ (accessed April 25, 2024); crediting UNT Libraries Government Documents Department.