GPGPU-Sim Tutorial
Zhen Lin
North Carolina State University
Based on GPGPU-Sim Tutorial and Manual by UBC
Outline
GPGPU-Sim Overview
Demo1: Setup & Configuration
GPGPU-Sim Internals
Demo2: Scheduling Study
GPGPU-Sim in a Nutshell
Microarchitecture timing model of contemporary GPUs
Runs unmodified CUDA/OpenCL applications
What GPGPU-Sim Simulates
Functional model
PTX
SASS
Timing model for the compute part of a GPU
Not for CPU or PCIe
Only model microarchitecture timing relevant to compute
Functional model
PTX
A low-level, data-parallel virtual machine and instruction set architecture (ISA)
Between CUDA and hardware ISA (SASS)
Stable ISA that spans multiple GPU generations
SASS/PTXPLUS
Hardware native ISA
PTX -> Translate + Optimize -> SASS
More accurate, but not well supported
CUDA tool chain
Functional Model (PTX)
Scalar ISA
SSA representation: register allocation not done in PTX
Timing Model for GPU Micro-Architecture
GPGPU-Sim simulates the timing model of a GPU running each launched CUDA kernel
Reports stats (e.g. # cycles) for each kernel
Excludes any time spent on data transfer over the PCIe bus
CPU is assumed to be idle when the GPU is working
Compilation Path
Demo1
Setup
Stats
Configuration
Overview of the Architecture
Inside a SIMT Core
Pipeline stages
Fetch
Decode
Issue
Read operand
Execution
Writeback
Fetch + Decode
Arbitrate the I-cache among warps
Cache miss handled by fetching again later
Fetched instruction is decoded and then stored in the I-Buffer
1 or more entries / warp
Only warps with vacant entries are considered in fetch
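The fetch arbitration described above can be sketched as a small selection function. This is only an illustration: the round-robin search order, the function name select_fetch_warp, and the list-of-free-slots representation are assumptions made for this sketch, not GPGPU-Sim's actual code.

```python
def select_fetch_warp(ibuffer_free, last_fetched):
    """Pick the next warp to fetch for, or None if no warp is eligible.

    ibuffer_free[w] is the number of vacant I-Buffer entries of warp w.
    Warps are searched round-robin, starting after the last fetched warp
    (an assumed arbitration order, for illustration only).
    """
    num_warps = len(ibuffer_free)
    for offset in range(1, num_warps + 1):
        warp = (last_fetched + offset) % num_warps
        if ibuffer_free[warp] > 0:  # only warps with vacant entries compete
            return warp
    return None


# Warps 0 and 2 have full I-Buffers, so the search skips them.
chosen = select_fetch_warp([0, 2, 0, 1], last_fetched=1)
nobody = select_fetch_warp([0, 0], last_fetched=0)
```

On an I-cache miss nothing would be written to the I-Buffer, so the same warp stays eligible and is simply selected again in a later cycle.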
Issue
Selects a warp with a ready instruction
Acquires the active mask from the top of stack (TOS) of the SIMT stack
Invalidates the issued instruction's I-Buffer entry
Scoreboard
Checks for RAW and WAW dependency hazards
Flags instructions with hazards as not ready in the I-Buffer (masking them out from the scheduler)
Instructions reserve their destination registers at issue
Reserved registers are released at writeback
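A minimal sketch of the reserve/check/release behaviour listed above, assuming one set of pending destination registers per warp. The class and method names are hypothetical and chosen for illustration; they are not GPGPU-Sim's own identifiers.

```python
class Scoreboard:
    """Toy scoreboard: tracks registers with writes still in flight."""

    def __init__(self):
        self.pending = {}  # warp id -> set of reserved destination registers

    def has_hazard(self, warp, srcs, dst):
        reserved = self.pending.get(warp, set())
        # RAW: a source register is still being written.
        # WAW: the destination register is still being written.
        return any(r in reserved for r in srcs) or dst in reserved

    def reserve(self, warp, dst):
        """Called at issue."""
        self.pending.setdefault(warp, set()).add(dst)

    def release(self, warp, dst):
        """Called at writeback."""
        self.pending.get(warp, set()).discard(dst)


sb = Scoreboard()
sb.reserve(0, "R3")                                   # add.s32 R3, R1, R2 issues
raw = sb.has_hazard(0, srcs=["R3", "R0"], dst="R5")   # reads R3 too early
waw = sb.has_hazard(0, srcs=["R0", "R4"], dst="R3")   # rewrites R3 too early
other_warp = sb.has_hazard(1, srcs=["R3"], dst="R5")  # different warp, no hazard
sb.release(0, "R3")                                   # writeback of the add
after_wb = sb.has_hazard(0, srcs=["R3", "R0"], dst="R5")
```

Keeping the state per warp reflects that registers of different warps are independent, so a pending write in one warp never blocks another.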
December 2012
GPGPU-Sim Tutorial (MICRO 2012) 4: Microarchitecture Model
4.17
Read Operand
Register file banks (registers interleaved across the four banks):
Bank 0: R0, R4, R8
Bank 1: R1, R5, R9
Bank 2: R2, R6, R10
Bank 3: R3, R7, R11
add.s32 R3, R1, R2;  (no conflict)
mul.s32 R3, R0, R4;  (conflict at bank 0)
Operand Collector Architecture (US Patent: 7834881)
Interleave operand fetch from different threads to achieve full utilization
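The two example instructions above can be checked with a short sketch. It assumes that a register's bank is its register number modulo four (consistent with R0 and R4 colliding in bank 0) and that each bank serves one read per cycle. Both are assumptions for illustration, and the function names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 4  # four banks, as in the example


def bank_of(reg):
    """Map a register name such as 'R5' to its bank (assumed: number mod 4)."""
    return int(reg[1:]) % NUM_BANKS


def conflicting_banks(src_regs):
    """Banks that must serve more than one source operand."""
    per_bank = Counter(bank_of(r) for r in src_regs)
    return sorted(b for b, n in per_bank.items() if n > 1)


def read_cycles(src_regs):
    """Cycles needed to read all sources with one read port per bank."""
    per_bank = Counter(bank_of(r) for r in src_regs)
    return max(per_bank.values())


add_conflicts = conflicting_banks(["R1", "R2"])  # add.s32 R3, R1, R2
mul_conflicts = conflicting_banks(["R0", "R4"])  # mul.s32 R3, R0, R4
mul_cycles = read_cycles(["R0", "R4"])
```

The conflicting mul needs two cycles on bank 0, which is the idle time an operand collector can fill by reading operands for other warps in the meantime.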
Operand Collector
[Figure: operand collector, fed from the instruction issue stage and followed by dispatch]
Execution
ALU
Streaming processor (SP)
Special function unit (SFU)
MEM
Shared memory
Local memory
Global memory
Texture memory
Constant memory
ALU Pipelines
SIMD Execution Unit
Fully Pipelined
Each pipe may execute a subset of instructions
Configurable bandwidth and latency (depending on the instruction)
Default: SP + SFU pipes
Memory Unit
Model timing for memory instructions
Support half-warp (16 threads)
Double clock the unit
Each cycle services half the warp
Has a private writeback path
[Figure: memory unit datapath with AGU, bank conflict, shared memory, access coalescing, MSHR, data cache, constant cache, texture cache, and memory port]
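To make the half-warp granularity and the access coalescing stage concrete, the sketch below counts how many memory segments one half-warp touches. The 128-byte segment size and the function name coalesce are assumptions for illustration only; the real transaction size depends on the configured architecture.

```python
HALF_WARP = 16        # threads serviced per cycle, as stated on the slide
SEGMENT_BYTES = 128   # assumed transaction size, for illustration only


def coalesce(addresses, segment_bytes=SEGMENT_BYTES):
    """Return the sorted base addresses of the segments a half-warp touches."""
    return sorted({(a // segment_bytes) * segment_bytes for a in addresses})


# Consecutive 4-byte words fall into a single segment.
unit_stride = [4 * t for t in range(HALF_WARP)]
# A 128-byte stride puts every thread in its own segment.
wide_stride = [128 * t for t in range(HALF_WARP)]

one_access = coalesce(unit_stride)
many_accesses = coalesce(wide_stride)
```

Each distinct segment would become one memory access, so the second pattern generates sixteen times the traffic of the first for the same amount of useful data.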
Writeback
Write result to register file
Scoreboard updates the r-bit
Stack-Based Branch Divergence Hardware
When a branch diverges
New entries are pushed onto the SIMT stack
RPC (reconvergence PC) is set to the immediate post-dominator
Active mask indicates which threads are active
PC is sent to the fetch unit
When the RPC is reached
Pop the TOS
PC of the new TOS is sent to the fetch unit
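The push and pop rules above can be traced with a toy stack. This is a sketch under simplifying assumptions: masks are plain integers, PCs are arbitrary numbers, and diverge is only called when the branch really splits the warp. The class SIMTStack and its methods are hypothetical names, not GPGPU-Sim's classes.

```python
class SIMTStack:
    """Toy per-warp reconvergence stack; entries are [pc, rpc, active_mask]."""

    def __init__(self, start_pc, mask):
        self.stack = [[start_pc, None, mask]]

    def top(self):
        return self.stack[-1]

    def diverge(self, taken_pc, fallthrough_pc, taken_mask, rpc):
        mask = self.top()[2]
        taken = mask & taken_mask
        not_taken = mask & ~taken_mask
        # The current entry waits at the reconvergence point with the full mask.
        self.top()[0] = rpc
        # Push one entry per path; each carries the RPC and its own mask.
        self.stack.append([fallthrough_pc, rpc, not_taken])
        self.stack.append([taken_pc, rpc, taken])

    def advance(self, next_pc):
        """Update the TOS PC; pop when the reconvergence PC is reached."""
        entry = self.top()
        entry[0] = next_pc
        if entry[1] is not None and next_pc == entry[1]:
            self.stack.pop()
        return self.top()[0], self.top()[2]  # PC and mask sent to fetch


warp = SIMTStack(start_pc=0, mask=0b1111)
warp.diverge(taken_pc=10, fallthrough_pc=4, taken_mask=0b0011, rpc=20)
first_path = tuple(warp.top())   # taken path runs first
second_path = warp.advance(20)   # taken path reaches RPC, pop
reconverged = warp.advance(20)   # fall-through path reaches RPC, pop
```

After both paths reach the reconvergence PC, the original entry is back on top with all four threads active.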
Demo2
Software framework overview
Monitor the warp scheduling order
Compare different scheduling policies
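As a rough idea of what such a comparison looks like, the sketch below replays the same per-cycle sets of ready warps under two policies, loose round-robin and greedy-then-oldest. It is a standalone toy model rather than simulator code: the function names, the input format, and the use of the warp id as the warp's age are all assumptions made for illustration.

```python
def lrr_order(ready_per_cycle, num_warps):
    """Loose round-robin: search starts after the most recently issued warp."""
    order, last = [], num_warps - 1
    for ready in ready_per_cycle:
        pick = None
        for offset in range(1, num_warps + 1):
            warp = (last + offset) % num_warps
            if warp in ready:
                pick = warp
                break
        if pick is not None:
            last = pick
        order.append(pick)
    return order


def gto_order(ready_per_cycle):
    """Greedy-then-oldest: keep issuing the same warp, else the oldest ready."""
    order, last = [], None
    for ready in ready_per_cycle:
        if last in ready:
            pick = last
        else:
            pick = min(ready) if ready else None  # lower id = older (assumed)
        if pick is not None:
            last = pick
        order.append(pick)
    return order


# Warp 0 stalls in the third cycle; all warps are ready otherwise.
ready = [{0, 1, 2}, {0, 1, 2}, {1, 2}, {0, 1, 2}]
lrr = lrr_order(ready, num_warps=3)
gto = gto_order(ready)
```

Printing the two lists side by side shows the difference in issue order: round-robin rotates through the warps, while the greedy policy stays on one warp until it stalls.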
For More Information
http://www.gpgpu-sim.org/
Thanks! Questions?