08 Performance Overview

This document summarizes the results of microbenchmarks that analyze key performance characteristics of GPUs, including memory latency and bandwidth, instruction throughput, and branching behavior. The benchmarks show that GPU architectures differ in their strengths: the ATI X1900XTX hides memory latency better than the NVIDIA 7900GTX, and the NVIDIA 8800GTX doubles the bandwidth of the 7900GTX. Branching performance depends heavily on coherence: the NVIDIA 7900GTX needs more than 32x32 pixel coherence to avoid overhead, and the 8800GTX more than 16x16. These benchmarks provide insight into GPU behavior that can help optimize algorithms for different architectures.

Copyright
© Attribution Non-Commercial (BY-NC)

Understanding GPUs Through Benchmarking

Mike Houston
Stanford University

Introduction
Key areas for GPGPU
  Memory latency behavior
  Memory bandwidths
  Upload/Download
  Instruction rates
  Branching performance

Chips analyzed
  ATI X1900XTX (R580)
  NVIDIA 7900GTX (G71)
  NVIDIA 8800GTX (G80)
  AMD HD2900 (R600)

GPUBench
An open-source suite of micro-benchmarks
  GL
  DX9 (alpha version)

Developed at Stanford to aid our understanding of GPUs
  Vendors wouldn't directly tell us architecture details
  Behavior under GPGPU apps is different from games and other benchmarks

Library of results
  http://graphics.stanford.edu/projects/gpubench/

Memory latency
Questions
  Can latency be hidden?
  Does access pattern affect latency?

Methodology
  Try different numbers of texture fetches
  Different access patterns:
    Cache hit: every fetch to the same texel
    Sequential: every fetch increments address by 1
    Random: dependent lookup with random texture
  Increase the ALU ops of the shader
  ALU ops must be dependent to avoid optimization
  GPUBench test: fetchcost
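The methodology above can be sketched as a small generator that emits one test shader per data point. This is only an illustration under stated assumptions: the function name, the uniform names (`tex`, `texel`), and the GLSL 1.x style are hypothetical and are not taken from GPUBench's fetchcost source.

```python
def make_fetchcost_shader(num_fetches, num_alu_ops, pattern):
    """Return GLSL-style fragment shader source for one fetch-cost data point.

    Hypothetical helper for illustration; GPUBench's real shaders may differ.
    """
    if pattern not in ("cache", "sequential", "random"):
        raise ValueError("pattern must be 'cache', 'sequential' or 'random'")
    lines = [
        "uniform sampler2D tex;",
        "uniform vec2 texel;  // size of one texel in texture coordinates",
        "varying vec2 uv;",
        "void main() {",
        "    vec2 coord = uv;",
        "    vec4 acc = vec4(0.0);",
    ]
    for i in range(num_fetches):
        if pattern == "cache":
            # Every fetch reads the same texel.
            lines.append("    acc += texture2D(tex, vec2(0.5, 0.5));")
        elif pattern == "sequential":
            # Each fetch moves one texel further along x.
            lines.append(
                f"    acc += texture2D(tex, coord + vec2({i}.0 * texel.x, 0.0));"
            )
        else:
            # Dependent lookup: the fetched value becomes the next address,
            # so `tex` is assumed to hold random coordinates.
            lines.append("    coord = texture2D(tex, coord).xy;")
    if pattern == "random":
        # Keep the lookup chain live so it is not optimized away.
        lines.append("    acc += vec4(coord, 0.0, 0.0);")
    for _ in range(num_alu_ops):
        # Each MAD reads the previous result, so the ops form one dependent chain.
        lines.append("    acc = acc * acc + acc;")
    lines.append("    gl_FragColor = acc;")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(make_fetchcost_shader(num_fetches=2, num_alu_ops=4, pattern="sequential"))
```

Sweeping `num_alu_ops` for a fixed `num_fetches` and timing each shader would give the kind of curve the fetch-cost slides annotate. A real compiler may still merge the identical cache-hit fetches, so results need checking.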

Fetch cost - ATI - cache hit
[Plots: ATI X1800XT, ATI X1900XTX]
  ATI X1800XT: 4 ALU ops
  ATI X1900XTX: 12 ALU ops
  X1900XTX has 3X the ALUs per pipe
  Cost = max(ALU, TEX)

Fetch cost - ATI - cache hit
[Plots: ATI X1900XTX, AMD HD2900XT]
  ATI X1900XTX: 12 ALU ops
  AMD HD2900XT: 5 ALU ops
  Cost = max(ALU, TEX)

Fetch cost - ATI - sequential
[Plots: ATI X1800XT, ATI X1900XTX]
  ATI X1800XT: 8 ALU ops
  ATI X1900XTX: 24 ALU ops
  X1900XTX has 3X the ALUs per pipe
  Cost = max(ALU, TEX)

Fetch cost - ATI - sequential
[Plots: ATI X1900XTX, AMD HD2900XT]
  ATI X1900XTX: 24 ALU ops
  AMD HD2900XT: 20 ALU ops
  Cost = max(ALU, TEX)

Fetch cost - NVIDIA - cache hit
[Plot: NVIDIA 7900 GTX]
  4 ALU op penalty
  Cost = sum(ALU, TEX)
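The two cost models named on the fetch-cost slides, max(ALU, TEX) and sum(ALU, TEX), can be compared with a short sketch. It assumes that the annotated numbers (for example 12 ALU ops on the X1900XTX and 4 ALU ops on the 7900 GTX, both for cache hits) are the cost of one fetch expressed in ALU-op units. That reading of the plot annotations is an assumption, and the function names are hypothetical.

```python
def cost_overlapped(alu_ops, fetches, fetch_cost):
    """Cost = max(ALU, TEX): fetch time is hidden behind ALU work, or vice versa."""
    return max(alu_ops, fetches * fetch_cost)


def cost_serialized(alu_ops, fetches, fetch_cost):
    """Cost = sum(ALU, TEX): every fetch adds its full cost on top of the ALU work."""
    return alu_ops + fetches * fetch_cost


def free_alu_ops(alu_ops, fetches, fetch_cost):
    """ALU ops that can still be added at no cost under the max() model."""
    return max(0, fetches * fetch_cost - alu_ops)


if __name__ == "__main__":
    # One cached fetch, sweeping the number of dependent ALU ops.
    for alu in (0, 4, 8, 12, 16):
        print(
            alu,
            cost_overlapped(alu, 1, 12),  # assumed X1900XTX-style behavior
            cost_serialized(alu, 1, 4),   # assumed 7900 GTX-style behavior
        )
```

Under the max() model the curve stays flat until the ALU work exceeds the fetch cost. Under the sum() model it rises from the first ALU op, which is why latency appears unhidden on that chip.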

Fetch cost - NVIDIA - sequential
[Plot: NVIDIA 7900 GTX]
  8 ALU op issue penalty
  Cost = sum(ALU, TEX)

Fetch cost - NVIDIA 8800 GTX
[Plots: NVIDIA 8800 GTX, cache hit and sequential]
  Plot annotations: 8 ALU ops, 4 ALU ops
  Cost = max(ALU, TEX)

Bandwidth to ALUs
Questions
  Cache performance?
  Sequential performance?
  Random-read performance?

Methodology
Cache hit
  Use a constant as index to texture(s)

Sequential
  Use fragment position to index texture(s)

Random
  Index a seeded texture with fragment position to look up into input texture(s)

GPUBench test: inputfloatbandwidth
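The three access patterns in this methodology can be illustrated on the CPU by listing which texel each fragment would read. This is a sketch only: the function names are hypothetical, the "seeded texture" is modeled as a dictionary of random coordinates, and the bandwidth formula simply divides bytes fetched by elapsed time.

```python
import random


def fetch_coords(width, height, pattern, seed=0):
    """Return the (x, y) texel read by each fragment, in row-major fragment order."""
    frags = [(x, y) for y in range(height) for x in range(width)]
    if pattern == "cache":
        # Constant index: every fragment reads the same texel.
        return [(0, 0) for _ in frags]
    if pattern == "sequential":
        # Fragment position is used directly as the index.
        return frags
    if pattern == "random":
        # Seeded index texture: each texel holds a random coordinate, and the
        # fragment position selects which entry to follow.
        rng = random.Random(seed)
        index_tex = {
            pos: (rng.randrange(width), rng.randrange(height)) for pos in frags
        }
        return [index_tex[pos] for pos in frags]
    raise ValueError("pattern must be 'cache', 'sequential' or 'random'")


def bandwidth_gb_per_s(num_fetches, bytes_per_texel, seconds):
    """Effective bandwidth: bytes fetched divided by elapsed time, in GB/s."""
    return num_fetches * bytes_per_texel / seconds / 1e9


if __name__ == "__main__":
    print(fetch_coords(4, 2, "sequential"))
    print(fetch_coords(4, 2, "random", seed=1))
    # A float4 texel is 16 bytes; the timing here is a made-up placeholder.
    print(bandwidth_gb_per_s(num_fetches=1_000_000, bytes_per_texel=16, seconds=0.5))
```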

Results
[Plots: ATI X1900XTX, NVIDIA 7900GTX]
  Better random bandwidth
  Better effective cache bandwidth
  Sequential bandwidth (SEQ) about the same

Results
[Plots: NVIDIA 7900GTX, NVIDIA 8800GTX]
  8800GTX has 2X the bandwidth of 7900GTX

Results
[Plots: NVIDIA 8800GTX, AMD HD2900XT]

Off-board bandwidth
Questions
  How fast can we get data on the board (download)?
  How fast can we get data off the board (readback)?

GPUBench tests:
  download
  readback

Download
Host to GPU is slow
[Plots: ATI X1900XTX, NVIDIA 7900GTX]

Download
Next generation not much better
[Plots: NVIDIA 7900GTX, NVIDIA 8800GTX]

Readback
GPU to host is slow
ATI GL readback performance is abysmal
[Plots: ATI X1900XTX, NVIDIA 7900GTX]

Readback
Next generation not much better
[Plots: NVIDIA 7900GTX, NVIDIA 8800GTX]

Instruction Issue Rate
Questions
  What is the raw performance achievable?
  Do different instructions have different costs?
  Vector vs. scalar issue differences?

Methodology
Write long shaders with dependent instructions
  >100 instructions
  All instructions dependent
    But try to structure to allow for multi-issue

Test float1 vs. float4 performance
GPUBench tests:
  instrissue
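One way to read "all instructions dependent, but structured to allow for multi-issue" is to interleave several chains: each chain depends on itself, but neighboring instructions belong to different chains. The generator below is a sketch under that assumption. The names, the `chains` parameter, and the GLSL 1.x style are hypothetical and are not taken from GPUBench's instrissue test.

```python
# GLSL statement templates for a few instruction types (illustrative subset).
OPS = {
    "ADD": "{r} = {r} + c;",
    "MUL": "{r} = {r} * c;",
    "MAD": "{r} = {r} * c + c;",
}


def make_instrissue_shader(op, length, chains=1, vector=True):
    """Return shader source with `length` instructions of type `op`.

    vector=True tests float4 issue, vector=False tests float1 issue.
    """
    if op not in OPS:
        raise ValueError(f"unsupported op: {op}")
    if length < 1 or chains < 1:
        raise ValueError("length and chains must be positive")
    ty = "vec4" if vector else "float"
    regs = [f"r{i}" for i in range(chains)]
    lines = [f"uniform {ty} c;", "varying vec2 uv;", "void main() {"]
    for i, r in enumerate(regs):
        # Give each chain a different start value so the chains are not identical.
        init = f"vec4(uv, {i}.0, 1.0)" if vector else f"uv.x + {i}.0"
        lines.append(f"    {ty} {r} = {init};")
    for i in range(length):
        # Round-robin over chains: adjacent instructions are independent,
        # but each chain is a strict dependency sequence.
        lines.append("    " + OPS[op].format(r=regs[i % chains]))
    # Sum all chains into the output so none of them is dead code.
    lines.append(f"    gl_FragColor = vec4({' + '.join(regs)});")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(make_instrissue_shader("MAD", length=8, chains=2, vector=True))
```

Timing the same `op` with `vector=True` and `vector=False` gives the float4 versus float1 comparison. Whether a given compiler keeps the chains intact is not guaranteed.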

Results - Vector issue
[Plots: ATI X1900XTX, NVIDIA 7900GTX]
  Legend: marked instructions are more costly than others

Results - Vector issue
[Plots: ATI X1900XTX, NVIDIA 7900GTX]
  Faster ADD/SUB
  Peak (single instruction) FLOPS with MAD

Results - Vector issue
[Plots: NVIDIA 7900GTX, NVIDIA 8800GTX]
  8800GTX is 37% faster (peak)

Results - Vector issue
[Plots: AMD HD2900XT, NVIDIA 8800GTX]

When benchmarks go wrong
  Smart compilers subverting testing and optimizing away shaders
  Bug found in previous subtract test
  No clever way to write RCP test found yet
  Always sanity check results against theoretical peak!!!
[Plot: NVIDIA 7800GTX, GPUBench 1.2]
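The sanity check against theoretical peak can be automated with a few lines. The sketch below uses hypothetical function names, and the hardware figures in the example (32 ALUs at 500 MHz) are made up for illustration and do not describe any of the chips analyzed here.

```python
def theoretical_peak_gflops(num_alus, clock_mhz, flops_per_alu_per_clock):
    """Peak rate if every ALU retires its maximum flops on every clock."""
    return num_alus * clock_mhz * 1e6 * flops_per_alu_per_clock / 1e9


def is_plausible(measured_gflops, peak_gflops, tolerance=0.05):
    """False when a measurement beats the peak by more than timing noise allows.

    A result above peak usually means the compiler removed part of the shader.
    """
    return measured_gflops <= peak_gflops * (1.0 + tolerance)


if __name__ == "__main__":
    # Hypothetical part: a float4 MAD counts as 8 flops per ALU per clock.
    peak = theoretical_peak_gflops(num_alus=32, clock_mhz=500, flops_per_alu_per_clock=8)
    for measured in (120.0, 400.0):
        status = "ok" if is_plausible(measured, peak) else "suspicious"
        print(f"measured {measured} GFLOPS vs peak {peak} GFLOPS: {status}")
```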

Results - Scalar issue
[Plots: NVIDIA 7900GTX, NVIDIA 8800GTX]
  8800GTX is a scalar issue processor

Branching Performance
Questions
  Is predication better than branching?
  Is using Early-Z culling a better option?
  What is the cost of branching?
  What branching granularity is required?
  How much can I really save branching around heavy computation?

Methodology
Early-Z
  Set a Z-buffer and compare function to mask out compute
  Change coherence of blocks
  Change sizes of blocks
  Set differing amounts of pixels to be drawn

Shader Branching
  if { do a little } else { LOTS of math }
  Change coherence of blocks
  Change sizes of blocks
  Have differing amounts of pixels execute heavy math branch

GPUBench tests:
  branching
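Why block coherence matters for shader branching can be shown with a toy lockstep model: pixels execute in fixed square batches, and a batch containing both kinds of pixel pays for both sides of the branch. This is a simplified sketch, not a description of any real scheduler. The function names, the batch size, and the costs are all hypothetical.

```python
import random


def block_mask(size, block, heavy_fraction, seed=0):
    """Build a size x size mask where each block x block tile is all heavy or all light."""
    if size % block != 0:
        raise ValueError("size must be a multiple of block")
    rng = random.Random(seed)
    tiles = size // block
    tile_heavy = [
        [rng.random() < heavy_fraction for _ in range(tiles)] for _ in range(tiles)
    ]
    return [
        [tile_heavy[y // block][x // block] for x in range(size)] for y in range(size)
    ]


def ideal_cost(mask, light_cost, heavy_cost):
    """Cost if every pixel paid only for the branch it actually takes."""
    return sum(heavy_cost if heavy else light_cost for row in mask for heavy in row)


def simd_cost(mask, batch, light_cost, heavy_cost):
    """Cost when batch x batch pixel groups run in lockstep.

    Assumes the mask size is a multiple of `batch`.
    """
    size = len(mask)
    total = 0
    for by in range(0, size, batch):
        for bx in range(0, size, batch):
            kinds = {
                mask[y][x]
                for y in range(by, by + batch)
                for x in range(bx, bx + batch)
            }
            # A mixed batch executes both sides of the branch for every pixel.
            per_pixel = (heavy_cost if True in kinds else 0) + (
                light_cost if False in kinds else 0
            )
            total += per_pixel * batch * batch
    return total


if __name__ == "__main__":
    for block in (1, 4, 16, 32):
        mask = block_mask(64, block, heavy_fraction=0.5, seed=1)
        print(block, simd_cost(mask, 16, 1, 100), ideal_cost(mask, 1, 100))
```

In this model the lockstep cost matches the ideal cost only once the coherent blocks are at least as large as, and aligned with, the batch. Below that size almost every pixel pays for the heavy branch.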

Results - Early-Z - NVIDIA
[Plot: NVIDIA 7900GTX]
  Random is bad!
  4x4 coherence is almost perfect!

Results - Branching - NVIDIA
[Plot: NVIDIA 7900GTX]
  Fully coherent has good performance
  But overhead

Results - Branching - NVIDIA
[Plot: NVIDIA 7900GTX]
  Performance increases with branch coherence
  Need > 32x32 branch coherence

Results - Branching - NVIDIA
[Plot: NVIDIA 8800GTX]
  Performance increases with branch coherence
  Need > 16x16 branch coherence
  (Turns out 16x4 is as good as 16x16)

Summary
Benchmarks can help discern app behavior and architecture characteristics
We use these benchmarks as predictive models when designing algorithms
  Folding@Home
  ClawHMMer
  CFD

Be wary of driver optimizations
Driver revisions change behavior
  Raster order, scheduler, compiler
