Optimizing performance of superscalar codes for a single Cray X1MSP processor

Shan, Hongzhang; Strohmaier, Erich; Oliker, Leonid


This article is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided to UNT Digital Library by the UNT Libraries Government Documents Department.

The following text was automatically extracted from the image on this page using optical character recognition software:

[Figure 1: bar chart with a logarithmic Y-axis (ticks at 1, 10, 100, 1000, 10000) comparing the Original, Directed, and Optimized versions of NAS-CG, Ocean, NAS-FFT, 1-D FFT, Radix, MM, and Nbody.]
Fig. 1: The Performance of the Original, Directed, and Optimized Versions. The Y-axis is in log scale.
4.1 Effect of Using Compiler Directives
The compiler needs to make conservative assumptions about possible data dependencies to avoid
potential race conditions. Compiler directives are therefore often necessary to allow for the effective
vectorization and multistreaming of data-independent regions. Figure 1 shows that compiler directives had
almost no effect on the performance of CG and Radix. For CG, the main time-consuming component is the
set of loops that compute the sparse matrix-vector multiplication. Here, the compiler was able to identify the data
independence between iterations and to vectorize and multistream the loops automatically. Therefore, the
original version and the directed version deliver similar performance. The compiler directives have no
effect on the performance of Radix either. Unlike in CG, however, data dependencies within the loop iterations
prevent vectorization and multistreaming, so the code runs on the scalar units. In other cases, the
compiler directives can substantially improve the performance. For Ocean, 1-D FFT, NAS FFT, and MM,
the important loops can be both vectorized and multistreamed. For Nbody, however, the loop can be
multistreamed but not vectorized due to its code irregularity and complexity.
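As an illustration of why such directives are needed, the following minimal C sketch shows a loop whose iterations are independent only because of a property the compiler cannot see. It is not taken from the benchmarks discussed here: the function name `scatter_add` and the distinct-index guarantee are hypothetical, and the `#pragma _CRI ivdep` spelling is assumed to be the Cray C form of the directive and should be checked against the compiler manual. The guard on `_CRAYC` keeps the file portable to other compilers.

```c
#include <stddef.h>

/* Scatter-add: y[idx[i]] += x[i].
 * The compiler cannot prove that idx holds no repeated values, so it must
 * assume that two iterations may update the same element of y, and it may
 * therefore leave the loop scalar. */
void scatter_add(double *y, const double *x, const int *idx, size_t n)
{
    /* Assumption: the caller guarantees that idx[0..n-1] are all distinct.
     * The directive passes that guarantee on to the compiler. It is only
     * emitted when the Cray C compiler is in use; the exact spelling is an
     * assumption, not something stated in the text above. */
#if defined(_CRAYC)
#pragma _CRI ivdep
#endif
    for (size_t i = 0; i < n; i++) {
        y[idx[i]] += x[i];
    }
}
```

If the guarantee does not hold (for example, if `idx` contains duplicates), the directive would license an incorrect vectorization, which is why the compiler stays conservative unless told otherwise.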
4.2 Application Restructuring and Performance Optimization
Adding compiler directives can exploit the data parallelism within the loops. However, in order to
exploit the data parallelism across loops or functions, the programs have to be restructured to generate
more efficient executable code. The average vector lengths of the directed version for NAS CG, Ocean,
NAS FFT, 1-D FFT, Radix, MM, and Nbody are 46.38, 63.15, 16.70, 9.8, 1, 3.71, and 64 respectively. For
CG, further increasing the average vector length without increasing the data set size is difficult. The indirect
memory access limits its performance. However, its performance can be improved by about 10% if the inner
loop is explicitly unrolled eight times (the optimized version). For Ocean, the average vector length has
almost reached the vector register length of 64, and no further optimization was necessary. We focus on
increasing the vector length of the other applications. For MM, we found that a naive implementation using the
strided access that is intentionally avoided on superscalar processors delivers the best results on the Cray
X1.
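The explicit eight-way unrolling mentioned for CG can be sketched as follows. This is a minimal C illustration under stated assumptions, not the authors' code: it assumes the matrix is stored in compressed sparse row (CSR) form, and the function and parameter names (`spmv_csr`, `spmv_csr_unroll8`, `rowptr`, `col`, `val`) are hypothetical. The first function is the straightforward loop; the second handles eight nonzeros per trip and finishes each row with a remainder loop.

```c
#include <stddef.h>

/* Reference CSR sparse matrix-vector product: y = A * x. */
void spmv_csr(size_t nrows, const size_t *rowptr, const size_t *col,
              const double *val, const double *x, double *y)
{
    for (size_t r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (size_t k = rowptr[r]; k < rowptr[r + 1]; k++)
            sum += val[k] * x[col[k]];      /* indirect access through col[] */
        y[r] = sum;
    }
}

/* Same product with the inner loop unrolled eight times by hand.
 * The additions are grouped differently from the reference loop, so results
 * can differ in the last bits for general floating-point data. */
void spmv_csr_unroll8(size_t nrows, const size_t *rowptr, const size_t *col,
                      const double *val, const double *x, double *y)
{
    for (size_t r = 0; r < nrows; r++) {
        size_t k = rowptr[r];
        const size_t end = rowptr[r + 1];
        double sum = 0.0;

        /* Main body: eight products per trip. */
        for (; k + 8 <= end; k += 8) {
            sum += val[k]     * x[col[k]]     + val[k + 1] * x[col[k + 1]]
                 + val[k + 2] * x[col[k + 2]] + val[k + 3] * x[col[k + 3]]
                 + val[k + 4] * x[col[k + 4]] + val[k + 5] * x[col[k + 5]]
                 + val[k + 6] * x[col[k + 6]] + val[k + 7] * x[col[k + 7]];
        }

        /* Remainder: fewer than eight nonzeros left in this row. */
        for (; k < end; k++)
            sum += val[k] * x[col[k]];

        y[r] = sum;
    }
}
```

Unrolling does not remove the indirect access through `col[]`; it only gives the compiler more independent work per loop trip, which is consistent with the modest gain reported above.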
NAS FFT. There are two important implementation parameters in NAS FFT, fftblock and
transblock. The first parameter controls how many FFTs are done at a time. The second parameter is the
blocking factor for the transpose. The default values are 16 and 32 respectively, which are appropriate for
most superscalar machines to maximize cache reuse. As suggested in the code, on vector machines the
block size should be as large as possible, i.e., 256 for Class B. The result using longer vectors is shown in
Table 1, labeled Vec-full. It takes only half of the time needed by the directed version. In the NPB 2.4
version, in order to reduce the amount of memory required by the program, the time evolution array is
no longer stored for all time steps but just for the first. With this efficient memory usage, the performance


Shan, Hongzhang; Strohmaier, Erich & Oliker, Leonid. Optimizing performance of superscalar codes for a single Cray X1MSP processor, article, June 8, 2004; Berkeley, California. (https://digital.library.unt.edu/ark:/67531/metadc789037/m1/4/: accessed April 24, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT Libraries Government Documents Department.
