







# Customer Centric 64-bit Computing: Opteron vs Itanium






**Progressive 64-bit approach:** 32-bit instruction + prefix byte

- leverages x86 compiler technology → reliable compilers, easy porting
- code size increase is minimal (~5%) → large caches not required

#### x86 CPUs = RISC cores + CISC→RISC instruction decoders

- provides x86 processors <u>high clock frequency</u> and <u>legacy compatibility</u>
- the processor, not the compiler, manages the RISC core → recompile rarely
- Itanium is a slave to the compiler → recompile often

#### Out-of-order execution and register renaming

- Opteron manages its registers intelligently → less compiler reliant
- Itanium requires the compiler to think for it → strong compiler reliance

#### Both Opteron and Itanium are RISC, but Opteron doesn't require reinventing compilers, large caches & a mint to purchase

October 4, 2004

Computation Products Group



Opteron: INT and FP Execution Units



# All x86 RISC Cores Aren't Created Equal: Opteron vs Xeon EMT



#### Number of integer pipes and pipeline depth impact integer throughput

- Opteron has 3 integer pipes → +50% reg,reg move throughput
- Opteron has 3 ALU/AGU units → +50% +, -, logical, shift throughput
- # of pipeline stages differs → shorter instruction execution latency

#### Different register file sizes (Opteron 80-bit, Xeon 128-bit)

- size dictates # of RISC ops in an x86 instruction → instruction preference
- dictates # of bits written from FPU pipes → limits scalar SIMD throughput

#### Design of FPU and issue bandwidth from FP scheduler

- Opteron: ADD/MUL/ST pipes eat and write 240 bits per clock
- Xeon: ADD/MUL pipes eat and write 128 bits per clock

# Though Xeon64 and Opteron are instruction compatible, Xeon64 delivers ½ the throughput per clock on SIMD scalar code


# AMD Opteron<sup>TM</sup>, Pentium<sup>®</sup>4 (FPU analysis): Throughput of SSE, SSE2, x87 Operations

**AMD Opteron**

| Operation      | SSE scalar | SSE vector | SSE2 scalar | SSE2 vector | x87       |
|----------------|------------|------------|-------------|-------------|-----------|
| Add            | 1 / cycle  | 2 / cycle  | 1 / cycle   | 1 / cycle   | 1 / cycle |
| Multiply       | 1 / cycle  | 2 / cycle  | 1 / cycle   | 1 / cycle   | 1 / cycle |
| Add & Multiply | 2 / cycle  | 4 / cycle  | 2 / cycle   | 2 / cycle   | 2 / cycle |

**Pentium 4**

| Operation      | SSE scalar   | SSE vector | SSE2 scalar  | SSE2 vector | x87       |
|----------------|--------------|------------|--------------|-------------|-----------|
| Add            | 1 / 2 cycles | 2 / cycle  | 1 / 2 cycles | 1 / cycle   | 1 / cycle |
| Multiply       | 1 / 2 cycles | 2 / cycle  | 1 / 2 cycles | 1 / cycle   | 2         |
| Add & Multiply | 1 / cycle    | 4 / cycle  | 1 / cycle    | 2 / cycle   | 1         |
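The half-throughput claim for scalar SIMD code can be checked with a little arithmetic on the table entries. The sketch below assumes the first table describes Opteron and the second describes Pentium 4 (suggested by the slide title order), and that adds and multiplies issue to separate pipes so their rates sum. The dictionary and function names are illustrative only.

```python
from fractions import Fraction

# Scalar SSE/SSE2 throughput in operations per cycle, read from the tables.
# Assumption: first table = Opteron, second table = Pentium 4.
OPTERON_SCALAR = {"add": Fraction(1, 1), "multiply": Fraction(1, 1)}
PENTIUM4_SCALAR = {"add": Fraction(1, 2), "multiply": Fraction(1, 2)}


def combined_throughput(per_op):
    """Ops per cycle when adds and multiplies run in separate pipes."""
    return sum(per_op.values())


def relative_throughput(cpu, baseline):
    """Ratio of combined add+multiply throughput of cpu to baseline."""
    return combined_throughput(cpu) / combined_throughput(baseline)


if __name__ == "__main__":
    print("Opteron add+mul per cycle:", combined_throughput(OPTERON_SCALAR))
    print("Pentium 4 add+mul per cycle:", combined_throughput(PENTIUM4_SCALAR))
    print("Pentium 4 relative to Opteron:",
          relative_throughput(PENTIUM4_SCALAR, OPTERON_SCALAR))
```

The summed rates agree with the "Add & Multiply" rows (2 / cycle and 1 / cycle), which is where the factor of one half comes from.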

# AMD Opteron<sup>TM</sup>, Pentium<sup>®</sup>4 (ALU Analysis): Throughput and Latency Comparison

| Operation    | Opteron 32-bit               | Opteron 64-bit | Pentium 4 32-bit              | Pentium 4 64-bit |
|--------------|------------------------------|----------------|-------------------------------|------------------|
| ADD/SUB      | 3 / cycle                    | 3 / cycle      | 2 / cycle                     | NA               |
| MUL signed   | 1 / cycle<br>4 cycle latency | 1 / 2 cycles   | 1 / cycle<br>18 cycle latency | NA               |
| MUL unsigned | 1 / cycle<br>4 cycle latency | 1 / 2 cycles   | 1 / cycle<br>10 cycle latency | NA               |
| MOV mem,reg  | 2 / cycle                    | 2 / cycle      | 2 / cycle                     | NA               |
| MOV reg,reg  | 3 / cycle                    | 3 / cycle      | 2 / cycle                     | NA               |
| XOR/AND/OR   | 3 / cycle                    | 3 / cycle      | 2 / cycle                     | NA               |
| Shift/Rotate | 3 / cycle                    | 3 / cycle      | 2 / cycle                     | NA               |
| DIV signed   | 42 cycle latency             |                | 80 cycle latency              | NA               |
| DIV unsigned | 39 cycle latency             |                | 80 cycle latency              | NA               |
| LEA          | 3 / cycle                    | 3 / cycle      | (2–0.5) / cycle               | NA               |
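The "+50%" figures quoted earlier for register moves and simple ALU operations follow directly from the 3 / cycle versus 2 / cycle entries. A minimal sketch of that arithmetic, assuming the left-hand columns of the ALU table are Opteron and the right-hand columns are Pentium 4 (the names below are illustrative):

```python
# 32-bit throughput in operations per cycle, read from the ALU table.
# Assumption: left column group = Opteron, right column group = Pentium 4.
OPTERON = {"ADD/SUB": 3, "MOV reg,reg": 3, "XOR/AND/OR": 3, "Shift/Rotate": 3}
PENTIUM4 = {"ADD/SUB": 2, "MOV reg,reg": 2, "XOR/AND/OR": 2, "Shift/Rotate": 2}


def percent_gain(new, old):
    """Relative throughput gain of new over old, in percent."""
    return (new - old) / old * 100.0


GAINS = {op: percent_gain(OPTERON[op], PENTIUM4[op]) for op in OPTERON}

if __name__ == "__main__":
    for op, gain in GAINS.items():
        print(f"{op}: +{gain:.0f}%")
```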







# Scalable Memory Bandwidth and IO: Opteron's on-die IO controller




#### Hypertransport

- asynchronous coherent communication → maintains MP cache coherency
- high rate of communication → low impact on MP memory latency

#### Memory Bandwidth

- scales linearly with # of processors in system
- greater % of theoretical peak delivered → *low latency memory access*

#### Memory Latency

- memory requests retired rapidly → enhances memory bandwidth
- doesn't scale linearly with # of CPUs → scalable SMP performance
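Linear bandwidth scaling means that each processor contributes its own memory controller, so aggregate peak bandwidth is the per-socket figure times the socket count. The sketch below illustrates only that relationship; the per-socket value is a hypothetical placeholder, not a number taken from these slides.

```python
def aggregate_peak_bandwidth(sockets, per_socket_gb_s):
    """Peak memory bandwidth of a system where every socket has its own controller."""
    if sockets < 1:
        raise ValueError("a system needs at least one socket")
    return sockets * per_socket_gb_s


# Hypothetical per-socket peak, chosen only for illustration.
PER_SOCKET_GB_S = 6.4

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "sockets:", aggregate_peak_bandwidth(n, PER_SOCKET_GB_S), "GB/s")
```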

# PGI 5.2 Agenda: Compiler Enhancements Driven by DYNA




## Overview of enhancements in PGI 5.2

- all vector code isn't created equal
- addressing of common block variables
- loop peeling & optimal vector code
- packing scalars into vector format
- shuffling data in loops with GPRs
- excessive prefetching: caveats to using SW prefetch
- tuning of the unrolling heuristic: less register pressure
- expanded class of vectorizable loops
- F90 pointer addressing support for objects greater than 2 GB


# All Vectorized Code Isn't Created Equal: Minimizing bubbles in the FPU pipeline

Consider the following loop:

DO i=1,N<br>
&nbsp;&nbsp;a(i) = a(i) + b(i) \* [c(i)+d(i)]<br>
ENDDO

| A\*\*\*\*\*\* \*\*\*\* | PGI 5.1.3             | PGI 5.2.\*            |
|------------------------|-----------------------|-----------------------|
| movlps (d),%xmm0       | movlps (d),%xmm0      | movaps (d),%xmm0      |
| movhps 8(d),%xmm0      | movhps 8(d),%xmm0     | addps (c),%xmm0       |
| movlps (c),%xmm1       | movlps (c),%xmm1      | mulps (b),%xmm0       |
| movhps 8(c),%xmm1      | movhps 8(c),%xmm1     | addps (a),%xmm0       |
| addps %xmm0,%xmm1      | addps %xmm0,%xmm1     |                       |
| movlps (b),%xmm2       | movlps (b),%xmm0      |                       |
| movhps 8(b),%xmm2      | movhps 8(b),%xmm0     |                       |
| mulps %xmm2,%xmm1      | mulps %xmm0,%xmm1     |                       |
| movlps (a),%xmm3       | movlps (a),%xmm0      |                       |
| movhps 8(a),%xmm3      | movhps 8(a),%xmm0     |                       |
| addps %xmm3,%xmm1      | addps %xmm0,%xmm1     |                       |
| ✓ uses 4 registers     | ✓ uses 2 registers    | ✓ uses 1 register     |
| ✓ generates 8 bubbles  | ✓ generates 8 bubbles | ✓ generates 2 bubbles |

- First → second column: use register renaming
- Second → third column: operate from memory on 16-byte aligned addresses
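The register counts quoted for the three instruction sequences can be checked mechanically. The sketch below transcribes the sequences and counts distinct XMM registers and instructions. It assumes the third sequence uses %xmm0 throughout, in line with its "uses 1 register" note; the variable and function names are illustrative, and bubble counts are not modelled.

```python
import re

# Sequence that keeps every operand in its own register.
SEQ_FOUR_REGS = """
movlps (d),%xmm0
movhps 8(d),%xmm0
movlps (c),%xmm1
movhps 8(c),%xmm1
addps %xmm0,%xmm1
movlps (b),%xmm2
movhps 8(b),%xmm2
mulps %xmm2,%xmm1
movlps (a),%xmm3
movhps 8(a),%xmm3
addps %xmm3,%xmm1
"""

# Same work, reusing %xmm0 and relying on hardware register renaming.
SEQ_TWO_REGS = """
movlps (d),%xmm0
movhps 8(d),%xmm0
movlps (c),%xmm1
movhps 8(c),%xmm1
addps %xmm0,%xmm1
movlps (b),%xmm0
movhps 8(b),%xmm0
mulps %xmm0,%xmm1
movlps (a),%xmm0
movhps 8(a),%xmm0
addps %xmm0,%xmm1
"""

# Operating directly from 16-byte aligned memory (assumed %xmm0 throughout).
SEQ_MEMORY_OPERANDS = """
movaps (d),%xmm0
addps (c),%xmm0
mulps (b),%xmm0
addps (a),%xmm0
"""


def xmm_registers(listing):
    """Distinct XMM register names used in an AT&T-syntax listing."""
    return set(re.findall(r"%xmm\d+", listing))


def instruction_count(listing):
    """Number of non-empty lines, one instruction per line."""
    return sum(1 for line in listing.splitlines() if line.strip())


if __name__ == "__main__":
    for name, seq in [("four-register", SEQ_FOUR_REGS),
                      ("two-register", SEQ_TWO_REGS),
                      ("memory-operand", SEQ_MEMORY_OPERANDS)]:
        print(name, len(xmm_registers(seq)), "registers,",
              instruction_count(seq), "instructions")
```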










# Loop Peeling & Optimal Vector Code: Common Block Illustration

#### Efficient code vectorization requires:

- uniform relative alignment of pointers in loops
  - Can be achieved via use of common blocks
  - Span of arrays covered in each loop iteration should be a multiple of 4 or 2 in single or double precision
- loop peeling to adjust common pointers to 16-byte aligned locations
- performing \*,+,- from memory (requires 16-byte alignment)

#### PGI 5.2.\* implements peeling of code in **CB** loops

| Common/test1/vx2(N), vx3(N), vx4(N), vy2(N), vy3(N), vy4(N), vz2(N), vz3(N), vz4(N) |
|--------------------------------------------------------------------------------------|
| Common/test2/g1(N), g2(N), g3(N), g4(N)                                              |
| Common/test3/ax(N), ay(N), bz(N)                                                     |
| do i=1,N                                                                             |
| &nbsp;&nbsp;ax(i)=g2(i)\*vx2(i)+g3(i)\*vx3(i)+g4(i)\*vx4(i)                          |
| &nbsp;&nbsp;ay(i)=g2(i)\*vy2(i)+g3(i)\*vy3(i)+g4(i)\*vy4(i)                          |
| &nbsp;&nbsp;bz(i)=g2(i)\*vz2(i)+g3(i)\*vz3(i)+g4(i)\*vz4(i)                          |
| enddo                                                                                |

- Check relative alignment of test1, test2 and test3 common block pointers
- If pointers are aligned to 16-byte boundaries, JUMP TO VECTOR SSE LOOP
- Scalar SSE loop (+6, \*9): used to align CB pointers on a 16-byte boundary
- Vector SSE loop (+6, \*9): used to perform most of the computation
- Scalar SSE loop (+6, \*9): final iterations not covered by the vector SSE loop
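The peeling steps amount to a small calculation: check that all pointers share the same offset within a 16-byte block, then split the trip count into a scalar prologue, a vector body, and a scalar epilogue. The following is a minimal sketch of that bookkeeping, not the compiler's actual logic; the function names and the example addresses are hypothetical.

```python
VECTOR_BYTES = 16  # SSE register width; aligned operands need 16-byte addresses


def uniform_relative_alignment(addresses, align=VECTOR_BYTES):
    """True when every pointer has the same offset within a 16-byte block."""
    return len({addr % align for addr in addresses}) <= 1


def peel_plan(base_addr, n, elem_bytes=4, align=VECTOR_BYTES):
    """Split n iterations into (prologue, body, epilogue) element counts.

    prologue: scalar iterations needed to reach a 16-byte boundary
    body:     elements handled by the vector loop (a multiple of the lane count)
    epilogue: leftover scalar iterations
    """
    if base_addr % elem_bytes:
        raise ValueError("base address must be element-aligned")
    lanes = align // elem_bytes                      # 4 floats or 2 doubles
    misalignment = base_addr % align
    prologue = min(n, ((align - misalignment) % align) // elem_bytes)
    body = (n - prologue) // lanes * lanes
    epilogue = n - prologue - body
    return prologue, body, epilogue


if __name__ == "__main__":
    # Hypothetical addresses: three arrays, each 8 bytes past a 16-byte boundary.
    pointers = [0x1008, 0x2018, 0x3028]
    print(uniform_relative_alignment(pointers))   # same offset, so one peel fits all
    print(peel_plan(pointers[0], 100))            # single precision
    print(peel_plan(pointers[0], 10, elem_bytes=8))  # double precision
```

When the relative alignment is not uniform, no single prologue length aligns every pointer, which is why the common block layout matters.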



Some vector loops require performing vector operations of scalar data upon vector quantities:

 $a(i) = a(i) + \frac{b(j,i)}{c(i)} + \frac{d(j,i)}{c(i)} + \frac{d(j,i)}{c(i)}$ 

- PGI 5.1.5 does this by reading 4 floats, storing them to the stack, and then reading them back in a 128-bit load:
  - Creates load / store dependencies
  - Requires an excessive # of rops to perform this function
  - Requires 8 x 32-bit movss loads / stores, 1 movaps read (14 rops)
  - Creates 8 bubbles down the FPU pipes
- PGI 5.2.\* does this by interleaving floats:
  - 4 x movss reads, 2 x unpcklps, 1 x movlhps (11 rops)
  - Creates 4 bubbles down the FPU pipes

#### Much shorter latency
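The interleaving approach can be followed lane by lane. The sketch below models an XMM register as a four-element list and implements the documented lane behaviour of movss (load from memory), unpcklps, and movlhps; the helper names and the choice of which registers are combined are assumptions made for illustration.

```python
def movss_load(x):
    """movss from memory: scalar goes to lane 0, upper lanes are zeroed."""
    return [x, 0.0, 0.0, 0.0]


def unpcklps(dst, src):
    """Interleave the low two lanes of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]


def movlhps(dst, src):
    """Copy the low two lanes of src into the high two lanes of dst."""
    return [dst[0], dst[1], src[0], src[1]]


def pack_four_scalars(x0, x1, x2, x3):
    """4 x movss, 2 x unpcklps, 1 x movlhps: no round trip through the stack."""
    r0, r1, r2, r3 = (movss_load(x) for x in (x0, x1, x2, x3))
    r0 = unpcklps(r0, r1)    # [x0, x1, 0, 0]
    r2 = unpcklps(r2, r3)    # [x2, x3, 0, 0]
    return movlhps(r0, r2)   # [x0, x1, x2, x3]


if __name__ == "__main__":
    print(pack_four_scalars(1.0, 2.0, 3.0, 4.0))
```

Because the packed vector is built entirely in registers, nothing has to wait for a store to the stack to complete before the 128-bit value can be used.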


# Use GPRs to shuffle data: Not an absolute statement but almost



- GPRs have the following advantages in loops that "only" shuffle data around:
  - movss decodes to 2 rops; a GPR mov decodes to 1 rop
  - the FPU pipe can perform only 1 32-bit or 64-bit store per cycle, while the ALU can perform 2 of either
  - pseudo-vector copies of floats can be performed with 64-bit GPRs, moving 2 floats at a time; this uses the full throughput of the ALU and load/store unit
  - double precision moves should still be more efficient, because the ALU can perform 2 x 64-bit stores per cycle whereas the FPU can perform only 1 x 64-bit store per cycle
  - take care not to generate excessive register pressure
  - ALU throughput can suffer if many other ALU ops (add, sub, lea, etc.) occur in addition to the loads and stores
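A pseudo-vector copy treats each pair of adjacent 32-bit floats as one opaque 64-bit word, which is what a 64-bit GPR load and store would move. The sketch below models that with byte buffers; it only demonstrates that the copy is bit-exact and halves the number of moves, and the function name is hypothetical.

```python
import struct


def copy_floats_as_qwords(src):
    """Copy a buffer of 32-bit floats by moving 64-bit words, two floats each."""
    if len(src) % 8:
        raise ValueError("buffer must hold an even number of 32-bit floats")
    count = len(src) // 8
    qwords = struct.unpack(f"<{count}Q", src)    # one 64-bit load per pair
    return struct.pack(f"<{count}Q", *qwords)    # one 64-bit store per pair


if __name__ == "__main__":
    floats = struct.pack("<4f", 1.5, -2.0, 3.25, 0.0)
    copied = copy_floats_as_qwords(floats)
    print(struct.unpack("<4f", copied))
    print("moves needed:", len(floats) // 8, "instead of", len(floats) // 4)
```

The data is never interpreted as floating point during the copy, so the FPU pipes stay free for real arithmetic.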


## **Excessive Prefetching**: Caveats about software prefetch




- Prefetching can bring data into the cache in advance of its use, but:
  - Opteron has a very robust HW prefetcher for sequential data accesses
    - HW prefetches move data into L2 (12 vs 3 cycle latency compared to L1)
    - HW prefetches do not consume execution dispatch bandwidth; SW prefetches do
  - SW prefetches across 4KB page boundaries are dropped and suffer a 90 cycle latency penalty
  - SW prefetch of non-sequential data accesses offers little benefit
    - only 4-8 bytes of every 64 bytes fetched are useful
    - the rate of cache evictions is very high, so useful data now has to be fetched from L2
    - MAB units in the processor are consumed quickly, which prevents loads from occurring
  - SW prefetches consume 1 of the 3 execution dispatch slots per clock cycle, limiting throughput through the FPU and IPC
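Two of these caveats reduce to simple arithmetic: whether a prefetch target lands on a different 4KB page than the current access, and how much of a 64-byte line is actually used by a non-sequential access. A minimal sketch follows, with hypothetical helper names and example addresses.

```python
PAGE_BYTES = 4096   # page size quoted on the slide
LINE_BYTES = 64     # cache line size quoted on the slide


def crosses_page(addr, distance):
    """True when addr + distance falls on a different 4KB page than addr."""
    return addr // PAGE_BYTES != (addr + distance) // PAGE_BYTES


def useful_fraction(useful_bytes):
    """Share of a fetched cache line that the program actually consumes."""
    return useful_bytes / LINE_BYTES


if __name__ == "__main__":
    print(crosses_page(0x1FC0, 64))     # last line of a page, prefetching one line ahead
    print(crosses_page(0x1000, 64))     # start of a page, same distance
    print(useful_fraction(4), useful_fraction(8))
```

With only 4 to 8 useful bytes per line, between 6.25% and 12.5% of the fetched data is consumed, which is why scattered prefetches mostly generate evictions.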


# **Tuning of Unrolling Heuristic**: Less is sometimes more



- Excessive unrolling of some classes of loops increases register pressure:
  - Loops that do not benefit from compiler unrolling:
    - multi-dimensional arrays (i, j, ...) in which *i* isn't the fastest moving index
    - arrays whose index must be loaded before it is known, e.g. x( BIN(i) )
    - large loops that exceed the # of floating-point registers
  - GPR and FP registers are spilled to memory, causing:
    - excess RISC operation counts → more work required, more execution time
    - address generation held up by register load dependencies
    - limited out-of-order execution, because the data to process cannot be loaded

**PGI 5.2** unrolls less aggressively, allowing out-of-order execution within the processor, rather than compiler unrolling, to mask latency
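A rough way to see why heavy unrolling backfires is to count how many values must be live at once. The sketch below is a deliberately crude model, not the compiler's heuristic: it assumes each unrolled copy of the loop body keeps a fixed number of values live and compares the total with the 16 XMM registers available in 64-bit mode. The function name and the example numbers are hypothetical.

```python
XMM_REGISTERS = 16  # architectural XMM registers available in 64-bit mode


def spilled_values(live_per_iteration, unroll_factor, registers=XMM_REGISTERS):
    """Crude estimate of values that no longer fit in registers after unrolling."""
    return max(0, live_per_iteration * unroll_factor - registers)


if __name__ == "__main__":
    # Hypothetical loop body that keeps 5 floating-point values live.
    for unroll in (1, 2, 4, 8):
        print("unroll x", unroll, "->", spilled_values(5, unroll), "values spilled")
```

Every spilled value adds a store and a later reload, which is the extra RISC operation count the slide refers to.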


Loops with the following constructs now vectorize in **PGI 5.2**:

- loops containing SIGN or MERGE intrinsics
- large loops containing more than a preset limit of instructions
- loops whose vectorization was previously blocked by Loop-Carried Redundancy Elimination (LRE)
- loops with invariant IF / ELSE constructs: constructs that do not depend on loop variables are hoisted outside the loop, and the loop is replicated for each case of the IF / ELSE statement
- loops that operate upon data objects > 2 GB
- loops in programs compiled with the -i8 switch
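The invariant IF / ELSE transformation is easiest to see side by side. The sketch below shows a loop with a test that does not depend on the loop variable, and the hoisted form in which the test runs once and each case gets its own branch-free copy of the loop. It illustrates the idea only; the function names are hypothetical and the compiler performs this on Fortran loops, not Python.

```python
def scale_or_shift(values, use_scale, factor):
    """Original form: the loop-invariant test runs on every iteration."""
    out = []
    for v in values:
        if use_scale:
            out.append(v * factor)
        else:
            out.append(v + factor)
    return out


def scale_or_shift_hoisted(values, use_scale, factor):
    """Hoisted form: test once, then run a branch-free copy of the loop."""
    if use_scale:
        return [v * factor for v in values]
    return [v + factor for v in values]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(scale_or_shift(data, True, 2.0), scale_or_shift_hoisted(data, True, 2.0))
    print(scale_or_shift(data, False, 2.0), scale_or_shift_hoisted(data, False, 2.0))
```

Each replicated loop body is straight-line code, which is what makes it a candidate for vectorization.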




# 64-bit LS-DYNA v5434 Neon Model Performance









#### LS-DYNA 3-Car Benchmark Performance









# **Trademark Attribution**




AMD, the AMD Arrow Logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Other product names used in this presentation are for identification purposes only and may be trademarks of their respective companies.
