Evaluation of cache-based superscalar and cacheless vector architectures for scientific computations
This article is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided to UNT Digital Library by the UNT Libraries Government Documents Department.
more than 200 instructions. Advanced branch prediction hardware minimizes the effects of the relatively
long pipeline (six cycles) necessitated by the high frequency design.
Each processor contains its own private L1 cache (64KB instruction and 32KB data) with prefetch
hardware; however, both cores share a 1.5MB unified L2 cache. Certain data access patterns may therefore
cause L2 cache conflicts between the two processing units. The directory for the L3 cache is located on-chip,
but the memory itself resides off-chip. The L3 is designed as a stand-alone 32MB cache, or to be combined
with other L3s on the same MCM to create a larger interleaved cache of up to 128MB. Multi-node Power4
configurations are currently available employing IBM's Colony interconnect, but future large-scale systems
will use the lower latency Federation switch.
The Power4 experiments reported here were performed on a single node of the 27-node IBM pSeries
690 system (named Cheetah) running AIX 5.1 and operated by Oak Ridge National Laboratory.
2.3 SX-6
The NEC SX-6 vector processor uses a dramatically different architectural approach from that of conventional
cache-based systems. Vectorization exploits regularities in the computational structure to expedite uniform
operations on independent data sets. Vector arithmetic instructions involve identical operations on the ele-
ments of vector operands located in the vector register. Many scientific codes allow vectorization, since they
are characterized by predictable fine-grain data-parallelism that can be exploited with properly structured
program semantics and sophisticated compilers. The 500 MHz SX-6 processor contains an 8-way replicated
vector pipe capable of issuing a MADD each cycle, for a peak performance of 8 Gflops/s per CPU. The
processors contain 72 vector registers, each holding 256 64-bit words.
For non-vectorizable instructions, the SX-6 contains a 500 MHz scalar processor with a 64KB instruc-
tion cache, a 64KB data cache, and 128 general-purpose registers. The 4-way superscalar unit has a peak
of 1 Gflops/s and supports branch prediction, data prefetching, and out-of-order execution. Since the vector
unit of the SX-6 is significantly more powerful than its scalar processor, it is critical to achieve high vector
operation ratios, either via compiler discovery or explicitly through code (re-)organization.
Unlike conventional architectures, the SX-6 vector unit lacks data caches. Instead of relying on data lo-
cality to reduce memory overhead, memory latencies are masked by overlapping pipelined vector operations
with memory fetches. The SX-6 uses high speed SDRAM with peak bandwidth of 32GB/s per CPU: enough
to feed one operand per cycle to each of the replicated pipe sets. Each SMP contains eight processors that
share the node's memory. The nodes can be used as building blocks of large-scale multi-processor systems;
for instance, the Earth Simulator contains 640 SX-6 nodes, connected through a single-stage crossbar.
The vector results in this paper were obtained on the single-node (8-way) SX-6 system (named Rime)
running SUPER-UX at the Arctic Region Supercomputing Center (ARSC) of the University of Alaska.
3 Microbenchmarks
This section presents the performance of a microbenchmark suite that measures some low-level machine
characteristics such as memory subsystem behavior and scatter/gather hardware support using STREAM [7];
and point-to-point communication, network/memory contention, and barrier synchronizations via PMB [5].
3.1 Memory Access Performance
First we examine the low-level memory characteristics of the three architectures in our study. Table 2
presents asymptotic unit-stride memory bandwidth behavior of the triad summation a(i) = b(i) + s × c(i),
using the STREAM benchmark [7]. It effectively captures the peak bandwidth of the architectures, and
shows that the SX-6 achieves about 48 and 14 times the performance of the Power3 and Power4, respectively.
Oliker, Leonid; Canning, Andrew; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane et al. Evaluation of cache-based superscalar and cacheless vector architectures for scientific computations, article, May 1, 2003; Berkeley, California. (https://digital.library.unt.edu/ark:/67531/metadc784888/m1/3/: accessed April 19, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT Libraries Government Documents Department.