Tiling/Performance
A Common Programming Strategy
• Global memory resides in device memory (DRAM)
- much slower access than shared memory
• So, a profitable way of performing computation on the device
is to tile data to take advantage of fast shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory, using multiple
threads to exploit memory-level parallelism
• Performing the computation on the subset from shared memory; each thread
can efficiently make multiple passes over any data element
• Copying results from shared memory back to global memory (see the sketch below)
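A minimal sketch of this load/compute/store pattern: a hypothetical kernel (tileAverageKernel, TILE and the averaging computation are illustrative only, not from these slides) in which each block stages one tile in shared memory and every thread then reads the whole tile from on-chip memory instead of from DRAM.

#define TILE 256

// Each block loads one tile of the input into shared memory, then every
// thread re-reads the whole tile from shared memory instead of global memory.
__global__ void tileAverageKernel(const float* in, float* out, int n)
{
  __shared__ float tile[TILE];

  int idx = blockIdx.x * TILE + threadIdx.x;

  // 1. Cooperative load: one element per thread
  tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
  __syncthreads();                     // tile fully loaded

  // 2. Compute from shared memory; each element is re-read TILE times
  float sum = 0.0f;
  for (int k = 0; k < TILE; ++k)
    sum += tile[k];

  // 3. Copy the result back to global memory
  if (idx < n)
    out[idx] = sum / TILE;
}

A launch of the form tileAverageKernel<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n) gives one thread per element.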
A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory (DRAM)
- much slower access than shared memory
– But… cached!
– Highly efficient access for read-only data
• Carefully divide data according to access patterns (illustrated in the sketch after this list)
– R/Only -> constant memory (very fast if in cache)
– R/W shared within Block -> shared memory (very fast)
– R/W within each thread -> registers (very fast)
– R/W inputs/results -> global memory (very slow)
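A hypothetical kernel illustrating this mapping; the filter coefficients, the 256-thread block size and all names here are illustrative assumptions, not from the slides.

#define N_COEF 8
#define BLOCK  256

// Read-only data used by every thread: constant memory (cached, very fast on hit)
__constant__ float coef[N_COEF];

// Assumes blocks of exactly BLOCK threads.
__global__ void applyFilter(const float* in, float* out, int n)
{
  // Data shared (R/W) within the block: shared memory
  __shared__ float window[BLOCK];

  int i = blockIdx.x * BLOCK + threadIdx.x;

  // Per-thread working value: a register
  float acc = 0.0f;

  // Inputs arrive through global memory
  window[threadIdx.x] = (i < n) ? in[i] : 0.0f;
  __syncthreads();

  // Toy computation mixing constant-memory and shared-memory reads
  for (int k = 0; k < N_COEF; ++k)
    acc += coef[k] * window[(threadIdx.x + k) % BLOCK];

  // Results go back out through global memory
  if (i < n)
    out[i] = acc;
}

The host would fill coef with cudaMemcpyToSymbol before launching the kernel.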
Idea: Use Shared Memory to reuse global memory data
• Each input element is read by Width threads.
• Load each element into Shared Memory and have several threads use the local
version to reduce the demand on global memory bandwidth
– Tiled algorithms
[Figure: WIDTH x WIDTH matrix multiplication, showing how thread (tx, ty) reads a full
row and a full column of the input matrices]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in
each phase are focused on one subset (tile) of Md and Nd
[Figure: Md, Nd and Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) and
thread (tx, ty) identify the Pdsub tile and the element within it]
Breaking Md and Nd into Tiles
[Figure: a small example with Md, Nd and Pd broken into 2x2 tiles of elements
Md0,0 … Md3,1, Nd0,0 … Nd1,3 and Pd0,0 … Pd3,3]
Each phase of a Thread Block uses
one tile from Md and one from Nd
Phases of each thread (time runs left to right within each phase):

T0,0: Phase 1 – loads Md0,0 into Mds0,0 and Nd0,0 into Nds0,0;
              PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
      Phase 2 – loads Md2,0 into Mds0,0 and Nd0,2 into Nds0,0;
              PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1

T1,0: Phase 1 – loads Md1,0 into Mds1,0 and Nd1,0 into Nds1,0;
              PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
      Phase 2 – loads Md3,0 into Mds1,0 and Nd1,2 into Nds1,0;
              PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1

T0,1: Phase 1 – loads Md0,1 into Mds0,1 and Nd0,1 into Nds0,1;
              PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
      Phase 2 – loads Md2,1 into Mds0,1 and Nd0,3 into Nds0,1;
              PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1

T1,1: Phase 1 – loads Md1,1 into Mds1,1 and Nd1,1 into Nds1,1;
              PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1
      Phase 2 – loads Md3,1 into Mds1,1 and Nd1,3 into Nds1,1;
              PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1
Threads, Warps, Blocks
• There are (up to) 32 threads in a Warp
– Only <32 when there are fewer than 32 total
threads
• There are (up to) 16 Warps in a Block
• Each Block (and thus, each Warp) executes on a single SM
• G80 has 16 SMs
• At least 16 Blocks required to “fill” the device
• More is better
– If resources (registers, thread space, shared memory) allow, more than 1
Block can occupy each SM
First-order Size Considerations in G80
• Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
– A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks
• Each thread block performs 2*256 = 512 float loads from global
memory for 256 * (2*16) = 8,192 mul/add operations (worked out in the sketch below).
– Memory bandwidth is no longer a limiting factor
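The size arithmetic above, written out as a small host-side calculation (plain C; the constants are the ones from this slide):

#include <stdio.h>

int main(void)
{
  const int TILE_WIDTH = 16;
  const int Width      = 1024;

  int threads_per_block = TILE_WIDTH * TILE_WIDTH;                     /* 256        */
  int blocks            = (Width / TILE_WIDTH) * (Width / TILE_WIDTH); /* 4096       */
  int loads_per_block   = 2 * threads_per_block;                       /* 512 floats */
  int ops_per_block     = threads_per_block * 2 * TILE_WIDTH;          /* 8192       */

  printf("%d blocks of %d threads\n", blocks, threads_per_block);
  printf("%d float loads for %d mul/add ops => %.0f ops per loaded float\n",
         loads_per_block, ops_per_block,
         (double)ops_per_block / loads_per_block);                     /* 16         */
  return 0;
}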
How about performance on a GPU?
– All threads access global memory for their input matrix elements
– One memory access (4 bytes) per floating-point addition
– 4B/s of memory bandwidth needed per FLOPS
– Assume a GPU with
– Peak floating-point rate of 1,500 GFLOPS and 200 GB/s DRAM bandwidth
– 4*1,500 = 6,000 GB/s required to achieve the peak FLOPS rating
– The 200 GB/s memory bandwidth limits the execution to 200/4 = 50 GFLOPS
– This limits the execution rate to 3.3% (50/1500) of the peak
floating-point execution rate of the device!
– Need to drastically cut down memory accesses to get close to
the 1,500 GFLOPS (see the calculation below)
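The same bound as a small calculation (plain C; 1,500 GFLOPS and 200 GB/s are the assumed GPU figures from this slide):

#include <stdio.h>

int main(void)
{
  const double peak_gflops  = 1500.0; /* assumed peak floating-point rate    */
  const double dram_gbs     = 200.0;  /* assumed DRAM bandwidth (GB/s)       */
  const double bytes_per_op = 4.0;    /* one 4-byte access per FP operation  */

  double required_gbs   = bytes_per_op * peak_gflops; /* 6,000 GB/s needed   */
  double limited_gflops = dram_gbs / bytes_per_op;    /* 50 GFLOPS achievable */

  printf("Bandwidth needed for peak: %.0f GB/s\n", required_gbs);
  printf("Achievable without data reuse: %.0f GFLOPS (%.1f%% of peak)\n",
         limited_gflops, 100.0 * limited_gflops / peak_gflops);
  return 0;
}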
Outline of Tiling Technique
– Identify a tile of global memory contents that are accessed by multiple threads
– Load the tile from global memory into on-chip memory
– Use barrier synchronization to make sure that all threads are ready to start the phase
– Have the multiple threads access their data from the on-chip memory
– Use barrier synchronization to make sure that all threads have completed the
current phase
– Move on to the next tile (a skeleton of this phase structure follows below)
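A condensed skeleton of that phase/barrier structure. It is shown with a toy row-sum computation only so that it compiles; the index arithmetic and the computation are placeholders, and the real tiled matrix-multiplication kernel follows on the next slides. Assumes a single block of TILE_WIDTH x TILE_WIDTH threads.

#define TILE_WIDTH 16

// Toy kernel: sums each of the first TILE_WIDTH rows of a width-column matrix,
// staging one TILE_WIDTH x TILE_WIDTH tile in shared memory per phase.
__global__ void tiledSkeleton(const float* in, float* rowSums, int width)
{
  __shared__ float tile[TILE_WIDTH][TILE_WIDTH];   // on-chip copy of one tile
  float result = 0.0f;

  for (int t = 0; t < width / TILE_WIDTH; ++t) {
    // 1. Cooperative load: each thread fetches one element of tile t
    tile[threadIdx.y][threadIdx.x] =
        in[threadIdx.y * width + t * TILE_WIDTH + threadIdx.x];

    __syncthreads();   // 2. all loads done before this phase's computation

    // 3. Compute using the tile held in shared memory
    for (int k = 0; k < TILE_WIDTH; ++k)
      result += tile[threadIdx.y][k];

    __syncthreads();   // 4. all threads finished before the tile is overwritten
  }

  // 5. Move the result back to global memory (one writer per row)
  if (threadIdx.x == 0)
    rowSums[threadIdx.y] = result;
}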
Objective
– To understand the design of a tiled parallel algorithm
for matrix multiplication
– Loading a tile
– Phased execution
– Barrier Synchronization
Loading a Tile
– All threads in a block participate
– Each thread loads one M element and one N element in tiled
code
CUDA Code – Kernel Execution Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
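With this configuration, the kernel on the next slide is launched as follows; note that it assumes Width is a multiple of TILE_WIDTH, since the kernel has no boundary checks.

// e.g. Width = 1024, TILE_WIDTH = 16  ->  dimGrid = (64, 64), dimBlock = (16, 16),
// one thread per Pd element
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);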
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float Pvalue = 0;

  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {

    // Collaborative loading of one Md tile and one Nd tile into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();   // wait until the whole tile is loaded

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();   // wait before the tile is overwritten in the next phase
  }

  Pd[Row*Width + Col] = Pvalue;
}
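A minimal host-side driver for the kernel above, as a sketch (error checking omitted; MatrixMulOnDevice is an assumed wrapper name, and Width is assumed to be a multiple of TILE_WIDTH):

#include <cuda_runtime.h>

void MatrixMulOnDevice(const float* M, const float* N, float* P, int Width)
{
  size_t size = Width * Width * sizeof(float);
  float *Md, *Nd, *Pd;

  // Allocate device memory and copy the input matrices over
  cudaMalloc(&Md, size);  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
  cudaMalloc(&Nd, size);  cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
  cudaMalloc(&Pd, size);

  // One TILE_WIDTH x TILE_WIDTH block per output tile
  dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
  dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
  MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

  // Copy the result back and release device memory
  cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
  cudaFree(Md);  cudaFree(Nd);  cudaFree(Pd);
}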
Tiled Multiply
bx
0 1
• Each block computes one tx
012
square sub-matrix Pdsub of size Nd
TILE_WIDTH TILE_WIDTH
TILE_W
m IDTH-1
TILE_WIDTH
WIDTH
• Each thread computes one bx k
element of Pdsub
Md Pd
by
0
m
0
TILE_WIDTHE
1
WIDTH
Pdsub
by 1
ty 2
k
TILE_WIDTH-1
TILE_WIDTH TILE_WIDTH TILE_WIDTH
2 WIDTH WIDTH
G80 Shared Memory and Threading
• Each SM in G80 has 16KB shared memory
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
– Can potentially have up to 8 Thread Blocks actively executing
• This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
– The next TILE_WIDTH of 32 would lead to 2*32*32*4B = 8KB of shared memory usage per thread block,
allowing only up to two thread blocks active at the same time
• Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
– The 86.4GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS! (see the calculation below)
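The shared-memory occupancy numbers above as a small calculation (plain C; 16KB per SM is the G80 figure from this slide, and only the shared-memory limit is considered):

#include <stdio.h>

int main(void)
{
  const int sm_shared_bytes = 16 * 1024;        /* G80: 16KB shared memory per SM */

  for (int tw = 16; tw <= 32; tw += 16) {
    int bytes_per_block = 2 * tw * tw * 4;      /* Mds + Nds tiles of 4B floats   */
    int max_blocks      = sm_shared_bytes / bytes_per_block;
    printf("TILE_WIDTH %2d: %dKB per block, shared memory allows %d blocks per SM\n",
           tw, bytes_per_block / 1024, max_blocks);
  }
  return 0;
}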
Tiling Size Effects
[Figure: measured GFLOPS (0–100) for the untiled kernel and for 4x4, 8x8, 12x12 and
16x16 tiles, each shown "tiled only" and "tiled & unrolled"]