Understanding GPUs Through Benchmarking
Mike Houston, Stanford University
Introduction
Key areas for GPGPU
Memory latency behavior
Memory bandwidths
Upload/Download
Instruction rates
Branching performance
Chips analyzed
ATI X1900XTX (R580)
NVIDIA 7900GTX (G71)
NVIDIA 8800GTX (G80)
AMD HD2900 (R600)
GPUBench
An open-source suite of micro-benchmarks
GL
DX9 (alpha version)
Developed at Stanford to aid our understanding of GPUs
Vendors wouldn't directly tell us architecture details
Behavior under GPGPU apps differs from games and other benchmarks
Library of results
http://graphics.stanford.edu/projects/gpubench/
Memory latency
Questions
Can latency be hidden?
Does access pattern affect latency?
Methodology
Try different numbers of texture fetches
Different access patterns:
Cache hit: every fetch to the same texel
Sequential: every fetch increments address by 1
Random: dependent lookup with random texture
Increase the ALU ops of the shader
ALU ops must be dependent to avoid optimization
GPUBench test: fetchcost
Fetch cost: ATI, cache hit
X1900XTX has 3X the ALUs per pipe
ATI X1800XT: 4 ALU ops
ATI X1900XTX: 12 ALU ops
Cost = max(ALU, TEX)
Fetch cost: ATI, cache hit
ATI X1900XTX: 12 ALU ops
AMD HD2900XT: 5 ALU ops
Cost = max(ALU, TEX)
Fetch cost: ATI, sequential
X1900XTX has 3X the ALUs per pipe
ATI X1800XT: 8 ALU ops
ATI X1900XTX: 24 ALU ops
Cost = max(ALU, TEX)
Fetch cost: ATI, sequential
ATI X1900XTX: 24 ALU ops
AMD HD2900XT: 20 ALU ops
Cost = max(ALU, TEX)
Fetch cost: NVIDIA 7900 GTX, cache hit
4 ALU op penalty
Cost = sum(ALU, TEX)
Fetch cost: NVIDIA 7900 GTX, sequential
8 ALU op issue penalty
Cost = sum(ALU, TEX)
Fetch cost: NVIDIA 8800 GTX
Panels: cache hit, sequential
Annotations: 8 ALU ops, 4 ALU ops
Cost = max(ALU, TEX)
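The two cost models on the fetch-cost slides, Cost = max(ALU, TEX) and Cost = sum(ALU, TEX), can be compared with a small sketch. It assumes costs are expressed in abstract ALU-op units and that a fetch has a single fixed cost; the function names and the value 12 are illustrative only, not GPUBench output.

```python
def overlapped_cost(alu_ops, tex_cost):
    """Cost = max(ALU, TEX): ALU work and the fetch overlap."""
    return max(alu_ops, tex_cost)


def serialized_cost(alu_ops, tex_cost):
    """Cost = sum(ALU, TEX): ALU work and the fetch do not overlap."""
    return alu_ops + tex_cost


def free_alu_ops(tex_cost, model=overlapped_cost):
    """Largest whole number of ALU ops that adds nothing to the cost of one fetch."""
    baseline = model(0, tex_cost)
    n = 0
    while model(n + 1, tex_cost) <= baseline:
        n += 1
    return n


if __name__ == "__main__":
    tex = 12  # hypothetical fetch cost in ALU-op units
    for alu in (0, 4, 12, 16):
        print(alu, overlapped_cost(alu, tex), serialized_cost(alu, tex))
```

Under the max model, ALU ops are free until they exceed the fetch cost; under the sum model, every ALU op adds to the total.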
Bandwidth to ALUs
Questions
Cache performance?
Sequential performance?
Random-read performance?
Methodology
Cache hit
Use a constant as index to texture(s)
Sequential
Use fragment position to index texture(s)
Random
Index a seeded texture with fragment position to look up into input texture(s)
GPUBench test: inputfloatbandwidth
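The three access patterns above can be sketched on the CPU as lists of texel addresses, one per fragment. This is only an illustration of the indexing scheme, not GPUBench code; the function names, the fixed texel (0, 0) for the cache-hit case, and the use of Python's `random` module to build the seeded index texture are all assumptions.

```python
import random


def cache_hit_indices(width, height):
    """Every fragment reads the same texel (a constant index)."""
    return [(0, 0) for _ in range(width * height)]


def sequential_indices(width, height):
    """Each fragment reads the texel at its own position."""
    return [(x, y) for y in range(height) for x in range(width)]


def random_indices(width, height, seed=0):
    """Each fragment looks up its address in a seeded index texture."""
    rng = random.Random(seed)
    index_texture = [
        (rng.randrange(width), rng.randrange(height))
        for _ in range(width * height)
    ]
    # Dependent lookup: fragment position -> index texture -> input texture address.
    return [index_texture[y * width + x] for y in range(height) for x in range(width)]
```

Seeding the generator keeps the random pattern repeatable, so runs on different hardware read the same addresses.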
Results
Better random bandwidth
Better effective cache bandwidth
ATI X1900XTX
NVIDIA 7900GTX
Sequential bandwidth (SEQ) about the same
Results
NVIDIA 7900GTX
NVIDIA 8800GTX: 2X the bandwidth of the 7900GTX
Results
NVIDIA 8800GTX
AMD HD2900XT
Off-board bandwidth
Questions
How fast can we get data on the board (download)?
How fast can we get data off the board (readback)?
GPUBench tests:
download
readback
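A transfer-bandwidth measurement reduces to bytes moved divided by elapsed time. The sketch below shows that arithmetic with a hypothetical `measure_bandwidth` helper; the `transfer` callable here is a plain host-memory copy standing in for the real operation. An actual download or readback test would presumably wrap the texture upload or a `glReadPixels` call and wait for completion (for example with `glFinish`) before stopping the timer; that part is an assumption, not a description of GPUBench.

```python
import time


def measure_bandwidth(transfer, num_bytes, repeats=5):
    """Return the best observed bandwidth in MB/s for a callable moving num_bytes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        transfer()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading on a coarse timer
    return num_bytes / best / 1e6


if __name__ == "__main__":
    size = 16 * 1024 * 1024
    buf = bytearray(size)
    # Stand-in transfer: copy a host buffer. Replace with a real upload/readback.
    print(f"{measure_bandwidth(lambda: bytes(buf), size):.1f} MB/s")
```

Taking the best of several repeats reduces the influence of one-time setup costs on the reported figure.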
Download
Host to GPU is slow
ATI X1900XTX
NVIDIA 7900GTX
Download
Next generation not much better
NVIDIA 7900GTX
NVIDIA 8800GTX
Readback
GPU to host is slow
ATI GL Readback performance is abysmal
ATI X1900XTX
NVIDIA 7900GTX
Readback
Next generation not much better
NVIDIA 7900GTX
NVIDIA 8800GTX
Instruction Issue Rate
Questions
What is the raw performance achievable?
Do different instructions have different costs?
Vector vs. scalar issue differences?
Methodology
Write long shaders with dependent instructions
>100 instructions
All instructions dependent
But try to structure to allow for multi-issue
Test float1 vs. float4 performance
GPUBench tests:
instrissue
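One way to build a long shader of dependent instructions is to generate the source text. The sketch below emits a GLSL fragment shader in which every statement reads the result of the previous one, in either float4 (`vec4`) or float1 (`float`) form. It is an illustration under assumptions, not the instrissue shader: the function name, the uniforms `c` and `d`, and the choice of GLSL 1.10-style built-ins are hypothetical, and it makes no attempt to structure the chain for multi-issue.

```python
def make_dependent_shader(op="MAD", count=128, vector=True):
    """Return GLSL fragment shader source with `count` chained ops on one register."""
    statements = {
        "ADD": "r = r + c;",
        "MUL": "r = r * c;",
        "MAD": "r = r * c + d;",
    }
    if op not in statements:
        raise ValueError(f"unsupported op: {op}")
    ty = "vec4" if vector else "float"
    seed = "gl_TexCoord[0]" if vector else "gl_TexCoord[0].x"
    lines = [
        f"uniform {ty} c;",  # uniforms keep the operands unknown at compile time
        f"uniform {ty} d;",
        "void main() {",
        f"    {ty} r = {seed};",
    ]
    # Each statement depends on the previous value of r.
    lines += [f"    {statements[op]}"] * count
    # Writing r to the output keeps the chain live.
    lines += ["    gl_FragColor = vec4(r);", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(make_dependent_shader("MAD", count=4, vector=False))
```

Feeding the chain from uniforms and an interpolated input, and writing the final value out, is meant to make it harder for the compiler to fold or discard the work, though a driver may still optimize more than expected.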
Results: Vector issue
ATI X1900XTX
NVIDIA 7900GTX
= More costly than others
Results: Vector issue
Faster ADD/SUB
Peak (single instruction) FLOPS with MAD
ATI X1900XTX
NVIDIA 7900GTX
Results: Vector issue
NVIDIA 7900GTX
8800GTX is 37% faster (peak)
NVIDIA 8800GTX
Results: Vector issue
AMD HD2900XT
NVIDIA 8800GTX
When benchmarks go wrong
Smart compilers subvert testing and optimize away shaders.
Bug found in previous subtract test.
No clever way to write an RCP test found yet.
Always sanity check results against theoretical peak!
NVIDIA 7800GTX, GPUBench 1.2
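A sanity check against theoretical peak can be automated. The sketch below assumes peak throughput is ALU count times clock rate times FLOPs per ALU per clock; the function names, the 2% tolerance, and the example numbers are hypothetical and do not describe any of the chips above.

```python
def theoretical_peak_gflops(num_alus, clock_ghz, flops_per_alu_per_clock):
    """Peak GFLOPS = ALUs x clock (GHz) x FLOPs issued per ALU per clock."""
    return num_alus * clock_ghz * flops_per_alu_per_clock


def is_plausible(measured_gflops, peak_gflops, tolerance=0.02):
    """A measurement above peak (plus timing slack) means the test is broken."""
    return measured_gflops <= peak_gflops * (1.0 + tolerance)


if __name__ == "__main__":
    # Made-up part: 48 ALUs at 0.5 GHz, each issuing a 4-wide MAD (8 FLOPs) per clock.
    peak = theoretical_peak_gflops(48, 0.5, 8)
    for measured in (150.0, 400.0):
        print(measured, "ok" if is_plausible(measured, peak) else "suspicious")
```

A result flagged as suspicious usually points to the compiler having removed part of the shader rather than to unexpectedly fast hardware.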
Results: Scalar issue
NVIDIA 7900GTX
NVIDIA 8800GTX
8800GTX is a scalar issue processor
Branching Performance
Questions
Is predication better than branching?
Is using Early-Z culling a better option?
What is the cost of branching?
What branching granularity is required?
How much can I really save branching around heavy computation?
Methodology
Early-Z
Set a Z-buffer and compare function to mask out compute
Change coherence of blocks
Change sizes of blocks
Set differing amounts of pixels to be drawn
Shader Branching
if { do a little } else { LOTS of math }
Change coherence of blocks
Change sizes of blocks
Have differing amounts of pixels execute heavy math branch
GPUBench tests:
branching
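The interaction between block size and branching granularity can be explored with a toy model. The sketch below assumes the hardware evaluates branches over square tiles of `granularity` x `granularity` pixels and that a tile containing both outcomes pays for both paths. This is a simplification for illustration, not a description of how any of these GPUs schedule work; all names and costs are hypothetical.

```python
import random


def branch_mask(size, block, heavy_fraction, seed=0):
    """size x size mask; each block x block region takes the heavy path or not."""
    rng = random.Random(seed)
    blocks = (size + block - 1) // block
    choice = [[rng.random() < heavy_fraction for _ in range(blocks)]
              for _ in range(blocks)]
    return [[choice[y // block][x // block] for x in range(size)]
            for y in range(size)]


def relative_cost(mask, granularity, light_cost=1.0, heavy_cost=100.0):
    """Modelled cost divided by the ideal cost of perfect per-pixel branching."""
    size = len(mask)
    total = 0.0
    for ty in range(0, size, granularity):
        for tx in range(0, size, granularity):
            tile = [mask[y][x]
                    for y in range(ty, min(ty + granularity, size))
                    for x in range(tx, min(tx + granularity, size))]
            per_pixel = 0.0
            if any(tile):      # at least one pixel needs the heavy path
                per_pixel += heavy_cost
            if not all(tile):  # at least one pixel needs the light path
                per_pixel += light_cost
            total += per_pixel * len(tile)
    ideal = sum(heavy_cost if p else light_cost for row in mask for p in row)
    return total / ideal
```

In this model the relative cost is 1.0 when blocks align with the tile size, and it rises as blocks shrink below it, which mirrors the trend of performance improving with branch coherence.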
Results: Early-Z - NVIDIA
Random is bad!
4x4 coherence is almost perfect!
NVIDIA 7900GTX
Results: Branching - NVIDIA
Fully coherent has good performance
But overhead
NVIDIA 7900GTX
Results: Branching - NVIDIA
Performance increases with branch coherence
NVIDIA 7900GTX
Need > 32x32 branch coherence
Results: Branching - NVIDIA
Performance increases with branch coherence
NVIDIA 8800GTX
Need > 16x16 branch coherence (turns out 16x4 is as good as 16x16)
Summary
Benchmarks can help discern app behavior and architecture characteristics
We use these benchmarks as predictive models when designing algorithms
Folding@Home, ClawHMMer, CFD
Be wary of driver optimizations
Driver revisions change behavior
Raster order, scheduler, compiler