08 Performance Overview

This document summarizes microbenchmark results that characterize key aspects of GPU performance: memory latency and bandwidth, upload/download rates, instruction throughput, and branching behavior. The benchmarks show that GPU architectures have different strengths. For example, the ATI X1900XTX overlaps texture fetches with ALU work (cost = max(ALU, TEX)), whereas on the NVIDIA 7900GTX the two costs add (cost = sum(ALU, TEX)), and the NVIDIA 8800GTX delivers about 2X the bandwidth of the 7900GTX. Branching performance depends heavily on coherence: the NVIDIA 7900GTX needs more than 32x32 pixel coherence and the 8800GTX more than 16x16 to avoid overhead. These results can guide algorithm design for different architectures.

Understanding GPUs Through Benchmarking

Mike Houston Stanford University

Introduction
Key areas for GPGPU
Memory latency behavior
Memory bandwidths
Upload/Download
Instruction rates
Branching performance

Chips analyzed
ATI X1900XTX (R580)
NVIDIA 7900GTX (G71)
NVIDIA 8800GTX (G80)
AMD HD2900 (R600)

GPUBench
An open-source suite of micro-benchmarks
GL
DX9 (alpha version)

Developed at Stanford to aid our understanding of GPUs
Vendors wouldn't directly tell us architecture details
Behavior under GPGPU apps is different than under games and other benchmarks

Library of results
http://graphics.stanford.edu/projects/gpubench/

Memory latency
Questions
Can latency be hidden?
Does access pattern affect latency?

Methodology
Try different numbers of texture fetches
Different access patterns:
Cache hit: every fetch to the same texel
Sequential: every fetch increments address by 1
Random: dependent lookup with random texture
Increase the ALU ops of the shader
ALU ops must be dependent to avoid optimization

GPUBench test: fetchcost
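The methodology above (a fixed number of fetches followed by a growing chain of dependent ALU ops) can be sketched as a small shader generator. This is an illustrative sketch only, not GPUBench's actual fetchcost shader: the function name `make_fetch_shader`, the uniform names `tex` and `texel`, and the choice of a multiply-add as the ALU op are all assumptions.

```python
def make_fetch_shader(num_fetches, num_alu_ops):
    """Return GLSL-style fragment shader source (hypothetical layout).

    Every ALU op reads the result of the previous one, so a compiler
    cannot drop or reorder the chain without changing the output.
    """
    lines = [
        "uniform sampler2D tex;",
        "uniform vec2 texel;   // assumed: size of one texel in texture coords",
        "varying vec2 uv;",
        "void main() {",
        "    vec4 acc = vec4(0.0);",
    ]
    # Sequential-style access: each fetch steps one texel further along.
    for i in range(num_fetches):
        lines.append(f"    acc += texture2D(tex, uv + {i}.0 * texel);")
    # Dependent ALU chain: each multiply-add consumes the previous acc.
    for _ in range(num_alu_ops):
        lines.append("    acc = acc * acc + acc;")
    # Writing acc to the output keeps the whole chain live.
    lines.append("    gl_FragColor = acc;")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(make_fetch_shader(num_fetches=2, num_alu_ops=4))
```

Sweeping `num_alu_ops` for a fixed `num_fetches` and timing each variant would give the kind of cost curve the fetch-cost slides report.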

Fetch cost: ATI, cache hit
[Plots: ATI X1800XT and ATI X1900XTX]
ATI X1800XT: 4 ALU ops
ATI X1900XTX: 12 ALU ops
X1900XTX has 3X the ALUs per pipe
Cost = max(ALU, TEX)

Fetch cost: ATI, cache hit
[Plots: ATI X1900XTX and AMD HD2900XT]
ATI X1900XTX: 12 ALU ops
AMD HD2900XT: 5 ALU ops
Cost = max(ALU, TEX)

Fetch cost: ATI, sequential
[Plots: ATI X1800XT and ATI X1900XTX]
ATI X1800XT: 8 ALU ops
ATI X1900XTX: 24 ALU ops
X1900XTX has 3X the ALUs per pipe
Cost = max(ALU, TEX)

Fetch cost: ATI, sequential
[Plots: ATI X1900XTX and AMD HD2900XT]
ATI X1900XTX: 24 ALU ops
AMD HD2900XT: 20 ALU ops
Cost = max(ALU, TEX)

Fetch cost: NVIDIA, cache hit
[Plot: NVIDIA 7900 GTX]
4 ALU op penalty
Cost = sum(ALU, TEX)

Fetch cost: NVIDIA, sequential
[Plot: NVIDIA 7900 GTX]
8 ALU op issue penalty
Cost = sum(ALU, TEX)

Fetch cost: NVIDIA 8800 GTX
[Plots: NVIDIA 8800 GTX, cache hit and sequential]
Plot annotations: 8 ALU ops, 4 ALU ops
Cost = max(ALU, TEX)
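The two cost rules on the fetch-cost slides, max(ALU, TEX) and sum(ALU, TEX), can be compared with a small model. This is a sketch under stated assumptions: the function names are hypothetical, and texture cost is expressed in the same units as ALU ops purely for illustration. The numbers below are not measurements.

```python
def cost_overlapped(alu_ops, tex_cost):
    """Fetch latency hidden behind ALU work: cost = max(ALU, TEX)."""
    return max(alu_ops, tex_cost)


def cost_serialized(alu_ops, tex_cost):
    """Fetch and ALU work do not overlap: cost = sum(ALU, TEX)."""
    return alu_ops + tex_cost


def free_alu_ops(tex_cost):
    """Under the max() model, ALU ops up to the fetch cost add no time."""
    return tex_cost


if __name__ == "__main__":
    tex_cost = 12  # hypothetical fetch cost, in ALU-op units
    for alu_ops in (0, 4, 8, 12, 16, 24):
        print(alu_ops,
              cost_overlapped(alu_ops, tex_cost),
              cost_serialized(alu_ops, tex_cost))
```

Under the max() rule the curve stays flat until the ALU work exceeds the fetch cost, which is one way to read the "N ALU ops" annotations on the plots. Under the sum() rule every added ALU op costs extra from the start.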

Bandwidth to ALUs
Questions
Cache performance?
Sequential performance?
Random-read performance?

Methodology
Cache hit
Use a constant as index to texture(s)

Sequential
Use fragment position to index texture(s)

Random
Index a seeded texture with fragment position to look up into input texture(s)

GPUBench test: inputfloatbandwidth
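The three access patterns in this methodology can be modeled on the CPU as lists of texel addresses, one per fragment. This is a minimal sketch, not the inputfloatbandwidth implementation: the function name, the flat one-dimensional addressing, and the use of Python's `random` module to fill the "seeded texture" are assumptions.

```python
import random


def access_addresses(pattern, width, height, seed=0):
    """Return the texel address each fragment reads, in fragment order."""
    num_texels = width * height
    fragment_positions = range(num_texels)
    if pattern == "cache_hit":
        # A constant index: every fragment reads the same texel.
        return [0 for _ in fragment_positions]
    if pattern == "sequential":
        # The fragment position itself is the index.
        return list(fragment_positions)
    if pattern == "random":
        # A seeded index texture holds a random address per position;
        # the fragment position indexes it (a dependent lookup).
        rng = random.Random(seed)
        index_texture = [rng.randrange(num_texels) for _ in range(num_texels)]
        return [index_texture[pos] for pos in fragment_positions]
    raise ValueError(f"unknown pattern: {pattern}")


if __name__ == "__main__":
    for name in ("cache_hit", "sequential", "random"):
        print(name, access_addresses(name, 4, 2))
```

Fixing the seed keeps the random pattern reproducible across runs, so results from different chips can be compared on identical address streams.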

Results
[Plots: ATI X1900XTX and NVIDIA 7900GTX]
Better random bandwidth
Better effective cache bandwidth
Sequential bandwidth (SEQ) about the same

Results
[Plots: NVIDIA 7900GTX and NVIDIA 8800GTX]
NVIDIA 8800GTX has 2X the bandwidth of the 7900GTX

Results
[Plots: NVIDIA 8800GTX and AMD HD2900XT]

Off-board bandwidth
Questions
How fast can we get data on the board (download)?
How fast can we get data off the board (readback)?

GPUBench tests:
download
readback
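Both tests reduce to the same arithmetic: bytes moved divided by elapsed time. The helpers below are a sketch of that calculation only. The function names, the float32 RGBA texture layout, and the sample timing are assumptions for illustration, not GPUBench code or measured data.

```python
def texture_bytes(width, height, channels=4, bytes_per_channel=4):
    """Size of a texture in bytes (default: RGBA with 32-bit floats)."""
    return width * height * channels * bytes_per_channel


def bandwidth_mib_per_s(num_bytes, seconds):
    """Transfer rate in MiB/s for a transfer that took `seconds`."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_bytes / seconds / (1024 * 1024)


if __name__ == "__main__":
    size = texture_bytes(1024, 1024)   # a 1024x1024 float4 texture
    elapsed = 0.016                    # hypothetical timing, in seconds
    print(size, bandwidth_mib_per_s(size, elapsed))
```

In practice the timed region should cover many transfers and end with a synchronizing call, since a single asynchronous upload can return before the data has actually moved.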

Download
Host to GPU is slow
[Plots: ATI X1900XTX and NVIDIA 7900GTX]

Download
Next generation not much better
[Plots: NVIDIA 7900GTX and NVIDIA 8800GTX]

Readback
GPU to host is slow
ATI GL readback performance is abysmal
[Plots: ATI X1900XTX and NVIDIA 7900GTX]

Readback
Next generation not much better
[Plots: NVIDIA 7900GTX and NVIDIA 8800GTX]

Instruction Issue Rate

Questions
What is the raw performance achievable?
Do different instructions have different costs?
Vector vs. scalar issue differences?

Methodology
Write long shaders with dependent instructions
>100 instructions
All instructions dependent
But try to structure to allow for multi-issue
Test float1 vs. float4 performance

GPUBench test: instrissue

Results: vector issue
[Plots: ATI X1900XTX and NVIDIA 7900GTX]
Legend: marked instructions are more costly than others

Results: vector issue
[Plots: ATI X1900XTX and NVIDIA 7900GTX]
Faster ADD/SUB
Peak (single instruction) FLOPS with MAD

Results: vector issue
[Plots: NVIDIA 7900GTX and NVIDIA 8800GTX]
8800GTX is 37% faster (peak)

Results: vector issue
[Plots: AMD HD2900XT and NVIDIA 8800GTX]

When benchmarks go wrong
Smart compilers subverting testing and optimizing away shaders
Bug found in previous subtract test
No clever way to write RCP test found yet
Always sanity check results against theoretical peak!
[Plot: NVIDIA 7800GTX, GPUBench 1.2]
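The advice to sanity check against theoretical peak can be automated with a few lines. This is a sketch under assumptions: the function names and the tolerance are hypothetical, and the chip parameters in the example are made-up round numbers, not the specification of any real GPU.

```python
def theoretical_peak_gflops(num_alus, clock_mhz, flops_per_alu_per_clock):
    """Peak rate in GFLOPS from ALU count, clock, and FLOPs per clock."""
    return num_alus * clock_mhz * flops_per_alu_per_clock / 1000.0


def is_plausible(measured_gflops, peak_gflops, tolerance=0.02):
    """A measurement above peak (plus timing slack) is suspect.

    A result that beats the hardware limit usually means the compiler
    removed part of the shader, not that the chip is faster than spec.
    """
    return measured_gflops <= peak_gflops * (1.0 + tolerance)


if __name__ == "__main__":
    # Hypothetical chip: 32 ALUs at 500 MHz, each issuing a 4-wide MAD
    # (4 lanes x 2 FLOPs) per clock.
    peak = theoretical_peak_gflops(32, 500, 8)
    for measured in (120.0, 400.0):
        print(measured, peak, is_plausible(measured, peak))
```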

Results: scalar issue
[Plots: NVIDIA 7900GTX and NVIDIA 8800GTX]
8800GTX is a scalar issue processor

Branching Performance
Questions
Is predication better than branching?
Is using Early-Z culling a better option?
What is the cost of branching?
What branching granularity is required?
How much can I really save branching around heavy computation?

Methodology
Early-Z
Set a Z-buffer and compare function to mask out compute
Change coherence of blocks
Change sizes of blocks
Set differing amounts of pixels to be drawn

Shader Branching
if { do a little } else { LOTS of math }
Change coherence of blocks
Change sizes of blocks
Have differing amounts of pixels execute heavy math branch

GPUBench tests:
branching
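The "change coherence of blocks" and "change sizes of blocks" steps can be sketched as a mask generator plus a divergence counter. This is an illustrative model, not the GPUBench branching test: the function names are hypothetical, and the `group` parameter stands in for whatever pixel granularity the hardware actually branches at, which this sketch does not claim to know.

```python
import random


def block_mask(width, height, block, fraction, seed=0):
    """Build a mask where each block x block tile takes one branch.

    `fraction` is the probability that a tile takes the heavy branch.
    block=1 gives fully random, per-pixel branching.
    """
    rng = random.Random(seed)
    tiles_x = (width + block - 1) // block
    tiles_y = (height + block - 1) // block
    tile_heavy = [[rng.random() < fraction for _ in range(tiles_x)]
                  for _ in range(tiles_y)]
    return [[tile_heavy[y // block][x // block] for x in range(width)]
            for y in range(height)]


def divergent_groups(mask, group):
    """Count group x group pixel groups that contain both outcomes."""
    height, width = len(mask), len(mask[0])
    count = 0
    for gy in range(0, height, group):
        for gx in range(0, width, group):
            outcomes = {mask[y][x]
                        for y in range(gy, min(gy + group, height))
                        for x in range(gx, min(gx + group, width))}
            if len(outcomes) > 1:
                count += 1
    return count


if __name__ == "__main__":
    for block in (1, 4, 16):
        mask = block_mask(64, 64, block, fraction=0.5)
        print(block, divergent_groups(mask, group=16))
```

When the block size is a multiple of the group size, no group sees both outcomes, which is the situation the coherence sweeps are looking for.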

Results: Early-Z, NVIDIA
[Plot: NVIDIA 7900GTX]
Random is bad!
4x4 coherence is almost perfect!

Results: Branching, NVIDIA
[Plot: NVIDIA 7900GTX]
Fully coherent has good performance
But overhead

Results: Branching, NVIDIA
[Plot: NVIDIA 7900GTX]
Performance increases with branch coherence
Need > 32x32 branch coherence

Results: Branching, NVIDIA
[Plot: NVIDIA 8800GTX]
Performance increases with branch coherence
Need > 16x16 branch coherence
(Turns out 16x4 is as good as 16x16)
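One way to see why random branching performs badly is a simple probabilistic model. It assumes, as a simplification not stated on the slides, that a group of pixels pays for the heavy branch whenever any pixel in the group takes it, and that pixels choose independently. The function name is hypothetical.

```python
def heavy_path_probability(p_heavy, group_pixels):
    """Probability that at least one pixel in a group takes the heavy branch.

    With independent per-pixel choices, the group avoids the heavy path
    only if every pixel skips it: (1 - p) ** n.
    """
    if not 0.0 <= p_heavy <= 1.0:
        raise ValueError("p_heavy must be in [0, 1]")
    return 1.0 - (1.0 - p_heavy) ** group_pixels


if __name__ == "__main__":
    # Even when only 10% of pixels need the heavy branch, large groups
    # almost always contain one of them under random placement.
    for side in (1, 4, 16, 32):
        print(side, heavy_path_probability(0.10, side * side))
```

Under this model, branching around heavy computation only pays off when the pixels that skip it are clustered into regions at least as large as the group, which is consistent with the slides' observation that performance increases with branch coherence.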

Summary
Benchmarks can help discern app behavior and architecture characteristics
We use these benchmarks as predictive models when designing algorithms
Folding@Home
ClawHMMer
CFD

Be wary of driver optimizations
Driver revisions change behavior
Raster order, scheduler, compiler
