Performance (Memory) Optimization
National Tsing Hua University
2024, Fall Semester
Communication vs Computation
Peak performance for Kepler
The peak processing performance is 3935 GFLOPS.
The memory bandwidth is 250 GB/s, which is about 63 G single-precision floating-point values per second.
The ratio is about 60 times (worked out below).
Instruction execution
Each computation instruction takes about 1~4 cycles
Each load/store instruction for global memory access takes about 400~800 cycles
A memory access to shared memory takes about 1~20 cycles
The ratio is about 100 times
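Worked out from the peak-performance numbers above, assuming 4-byte single-precision values:
\[ \frac{3935\ \text{GFLOPS}}{250\ \text{GB/s} \,/\, 4\ \text{B}} = \frac{3935}{62.5} \approx 63 \]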
Data Pre-fetch and Reuse
The GPU has faster (but smaller) memory spaces
Shared memory / L1 cache
Register file
Solution:
Hardware: prefetch data into shared memory or registers for later computation
Software/programmer: minimize memory usage and reuse the data in shared memory or registers as many times as possible
Outline
Host memory
Pinned memory
Asynchronous computation & data transfer
Streams
Global/Local memory
Memory coalescing
Tiled algorithm
Shared memory
Bank conflict avoidance
Memory padding
Address linearization
1. Page-Locked Data Transfers
cudaMallocHost() allows allocation of page-
locked (“pinned”) host memory
cudaMalloc ( &dev1, size ) ;
cudaMallocHost( &host1, size ) ;
…
cudaMemcpy ( dev1, host1, size, H2D ) ;
Enables highest cudaMemcpy performance
Use with caution!!
Allocating too much page-locked memory can
reduce overall system (host) performance
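A minimal end-to-end sketch of the pattern above (h_buf, d_buf, and N are illustrative names):
size_t size = N * sizeof(float);   // N is assumed to be defined elsewhere
float *h_buf, *d_buf;
cudaMallocHost(&h_buf, size);      // page-locked (pinned) host memory
cudaMalloc(&d_buf, size);          // device memory
// ... fill h_buf on the host ...
cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
// ... launch kernels that use d_buf ...
cudaFreeHost(h_buf);               // release pinned memory when done
cudaFree(d_buf);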
2. Overlap CPU & GPU Computations
To facilitate concurrent execution between host
and device, some function calls are asynchronous:
Control is returned to the host thread before the
device has completed the requested task.
Asynchronous functions:
Kernel launches
Asynchronous memory copy and memory set: cudaMemcpyAsync, cudaMemsetAsync
cudaMemcpy within the same device
H2D cudaMemcpy of 64 KB or less
Synchronous Computation
cudaMalloc(&dev1, size);
double* host1 = (double*) malloc(size);
…
// cudaMemcpy blocks until the copy is completed
cudaMemcpy(dev1, host1, size, H2D);
// the two kernels are serialized and executed on the device
kernel2 <<< grid, block >>> ( …, dev2, … );
kernel3 <<< grid, block >>> ( …, dev3, … );
// cudaMemcpy starts after the kernels finish and blocks until the copy is completed
cudaMemcpy(host4, dev4, size, D2H);
CPU_func();
[Timeline: the GPU executes cudaMemcpy, kernel2, kernel3, cudaMemcpy in order; CPU_func() only starts after the final blocking cudaMemcpy returns]
Kernels launched from a single host thread (the default stream) are always serialized and cannot overlap on the GPU; the CPU and GPU are synchronized by the blocking cudaMemcpy calls.
Asynchronous Computation
cudaMalloc(&dev1, size);
double* host1 = (double*) malloc(size);
…
cudaMemcpy(dev1, host1, size, H2D);
// kernel launches return immediately, so the CPU and GPU are overlapped
kernel2 <<< grid, block >>> ( …, dev1, … );
kernel3 <<< grid, block >>> ( …, dev1, … );
CPU_func();   // runs on the CPU while kernel2 and kernel3 run on the GPU
cudaMemcpy(host1, dev1, size, D2H);
…
[Timeline: kernel2 and kernel3 execute on the GPU while CPU_func() executes on the CPU; the final cudaMemcpy waits for the kernels to finish]
Asynchronous Data Transfers
Asynchronous host-device memory copy returns control
immediately to CPU
cudaMemcpyAsync(dst, src, size, dir, stream);
requires pinned host memory (allocated by “cudaMallocHost”)
Overlap CPU computation with data transfer
0 = default stream
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cudaMemcpyAsync(a_h, a_d, size, cudaMemcpyDeviceToHost, 0);
CPU_method();   // overlapped with the asynchronous copies and the kernel
3. CUDA Streams
CUDA streams are a technique to overlap kernel execution with data transfers and hide the transfer delay from the computation
Operations in different streams can be interleaved and, when possible, they can even run concurrently
Operations in the same stream are still serialized and executed in order
Consider a kernel that processes a huge dataset
Without streams, the kernel computation can only start after the whole dataset has been transferred:
H2D → kernel → D2H
With streams, we can partition the dataset, assign each partition to a stream, and execute the partitions in a pipeline:
H1 H2 H3
   K1 K2 K3
      D1 D2 D3
CUDA Streams
kernel launch
kernel<<<grid,block,0,stream-id>>>(/*…*/);
The stream must be created before use and destroyed afterwards
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaStreamDestroy(stream);
Memory copies can be either synchronous or asynchronous, but a synchronous memcpy prevents streams from running in parallel
If asynchronous copy is used, host memory must be
pinned
CUDA Streams
cudaStream_t stream[2];
cudaStreamCreate(&stream[0]);
cudaStreamCreate(&stream[1]);
cudaMallocHost(&hostPtr, 2 * size);   // pinned (page-locked) host memory
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(/*…*/, cudaMemcpyHostToDevice, stream[i]);
    kernel<<<100, 512, 0, stream[i]>>>(/*…*/);
    cudaMemcpyAsync(/*…*/, cudaMemcpyDeviceToHost, stream[i]);
}
cudaStreamDestroy(stream[0]);
cudaStreamDestroy(stream[1]);
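A fuller sketch of the same loop with the elided arguments spelled out; hostPtr/devPtr hold two equal chunks and the kernel is assumed to take a pointer and an element count (all names are illustrative):
float *hostPtr, *devPtr;
cudaMallocHost(&hostPtr, 2 * size);           // pinned host memory, 2 chunks
cudaMalloc(&devPtr, 2 * size);
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);
int n = size / sizeof(float);                 // elements per chunk
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(devPtr + i * n, hostPtr + i * n, size,
                    cudaMemcpyHostToDevice, stream[i]);
    kernel<<<100, 512, 0, stream[i]>>>(devPtr + i * n, n);
    cudaMemcpyAsync(hostPtr + i * n, devPtr + i * n, size,
                    cudaMemcpyDeviceToHost, stream[i]);
}
cudaDeviceSynchronize();                      // wait for both streams to finish
for (int i = 0; i < 2; ++i) cudaStreamDestroy(stream[i]);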
Stream based Synchronization
cudaStreamSynchronize(stream-id)
Blocks host until all CUDA calls in stream stream-id
complete
cudaEventRecord (event, stream-id )
Insert ‘events‘ in streams
Event is recorded when GPU reaches it in a stream
cudaEventSynchronize (event)
Blocks CPU thread until event is recorded
cudaStreamWaitEvent(stream-id, event, 0)
Blocks a GPU stream until the event reports completion
Example: Explicit Sync between Streams
cudaEvent_t event;
cudaEventCreate(&event);                          // create event
// 1) H2D copy of new input
cudaMemcpyAsync(d_in, in, size, H2D, stream1);
cudaEventRecord(event, stream1);                  // record event
// 2) D2H copy of previous result
cudaMemcpyAsync(out, d_out, size, D2H, stream2);
// wait for the event in stream1
cudaStreamWaitEvent(stream2, event, 0);
// 3) must wait for 1) and 2)
kernel <<< , , , stream2 >>> (d_in, d_out);
asynchronousCPUmethod( … );                       // overlapped CPU work
[Timeline: stream1 runs the H2D copy and records the event; stream2 runs the D2H copy, waits for the event, then runs the kernel]
Outline
Host memory
Pinned memory
Asynchronous computation & data transfer
Streams
Global/Local memory
Memory coalescing
Tiled algorithm
Shared memory
Bank conflict avoidance
Memory padding
Address linearization
Local Memory Cache
The L1 and L2 caches are used to cache local memory contents
L1: on-chip memory, the same hardware as shared memory
Programmers can decide the split between shared memory and L1 cache
L2: cache for off-chip (global) memory accesses
[Figure: on-chip memory vs. off-chip (on-board) memory]
Coalesced Memory Access
Accessing data in the global memory is critical to the
performance of a CUDA application
DRAM is slow compared to on-chip memory
Recall that all threads in a warp execute the same
instruction
When all threads in a warp execute a load instruction, the
hardware detects whether the threads access consecutive
memory locations
In this favorable case, the hardware coalesces all memory
accesses into a consolidated access (single transaction) to
consecutive DRAM locations (off-chip memory)
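A minimal sketch contrasting a coalesced and a strided (uncoalesced) access pattern; the kernels and launch setup are illustrative, and bounds checks are omitted:
__global__ void coalesced(float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];            // consecutive threads touch consecutive addresses: one transaction per warp
}
__global__ void strided(float* in, float* out, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];            // threads touch addresses stride*4 bytes apart: many transactions per warp
}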
Coalesced Memory Access
Coalesced access
Unaligned sequential addresses that fit into two 128-
byte L1-cache lines
Misaligned Access Without Caching
Misaligned sequential addresses that fall within five
32-byte L2 cache segments
No extra data is read
Sometimes this can be faster than (L1-)cached memory access
If the data are not reused
Example: Matrix Transpose
SDK Sample (“transpose”)
Illustrates coalescing using shared memory
Speedups even for small matrices
Uncoalesced Transpose
B[i,j] = A[j,i]
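A minimal sketch of this uncoalesced version, assuming N×N row-major matrices and a 2D launch (names are illustrative):
__global__ void transpose_naive(float* B, const float* A, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < N && y < N)
        B[x * N + y] = A[y * N + x];   // read is coalesced, write is strided (uncoalesced)
}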
Coalesced Transpose
Coalescing through shared memory
Make both global-memory reads and writes contiguous
__shared__ S[];
S[i,j] = A[i,j];
B[i,j] = S[j,i];
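A sketch of the coalesced version as a full kernel; the 32×32 tile size and the +1 padding column (which also avoids the bank conflicts discussed later) are common choices assumed here, not taken from the slide:
#define TILE 32
__global__ void transpose_coalesced(float* B, const float* A, int N) {
    __shared__ float tile[TILE][TILE + 1];               // +1 column avoids shared-memory bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < N && y < N)
        tile[threadIdx.y][threadIdx.x] = A[y * N + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;                 // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < N && y < N)
        B[y * N + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}
// launch: dim3 block(TILE, TILE), grid(N/TILE, N/TILE) (assumed)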
Outline
Host memory
Pinned memory
Asynchronous computation & data transfer
Streams
Global/Local memory
Memory coalescing
Tiled algorithm
Shared memory
Bank conflict avoidance
Memory padding
Address linearization
Example: Matrix Multiply
Compute C = A × B, where A, B, C are N-by-N matrices
(Let each thread compute one element C[i][j])
For i = 1:N
  For j = 1:N
    For k = 1:N
      C[i][j] += A[i][k] * B[k][j]
Compute-to-Global-Memory-Access (CGMA) ratio
In the inner loop: compute = 1 multiplication + 1 addition; memory accesses = 2
CGMA = 1
K20X (Kepler)
Compute = 3950 GFLOPS; global memory BW = 250 GB/s (a floating-point value takes 4 bytes)
Compute / Comm. = 3950 × 4 / 250 ≈ 64
CGMA must increase to about 64!
Load Everything to Shared Memory
Shared memory is about 100 times faster than global memory
If N^2 threads are used:
Each thread only needs to load 2 elements and performs 2N operations
CGMA = N (when N > 64, memory access is no longer the bottleneck)
For i = 1:N
  For j = 1:N
    For k = 1:N
      C[i][j] += A[i][k] * B[k][j]
But shared memory is small
The data that needs to be stored is 3N^2 integers or floats
If N = 1024, size = 12 MB (i.e., 3 × 1024 × 1024 × 4 bytes)
Load Everything to Shared Memory
dim3 block(N, N);
Matrix_Mul<<<1, block, 2*N*N*sizeof(int)>>>(A, B, C, N);
The third launch parameter is the dynamic shared memory size (in bytes).
__device__ inline int Addr(int matrixIdx, int i, int j, int N) {
    return (N*N*matrixIdx + i*N + j);
}
__global__ void Matrix_Mul(int* A, int* B, int* C, int N) {
    extern __shared__ int S[];      // holds a copy of A (matrixIdx 0) and B (matrixIdx 1)
    int i = threadIdx.x;
    int j = threadIdx.y;
    // move data to shared memory
    S[Addr(0, i, j, N)] = A[Addr(0, i, j, N)];
    S[Addr(1, i, j, N)] = B[Addr(0, i, j, N)];
    __syncthreads();
    // do computation
    int sum = 0;
    for (int k = 0; k < N; k++)
        sum += S[Addr(0, i, k, N)] * S[Addr(1, k, j, N)];
    C[Addr(0, i, j, N)] = sum;
}
Block(Tiled) Algorithm
Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of the data
Not all problems can be partitioned
into independent subsets
Block(Tiled) Algorithm
Rewrite the for-loop with TILE_WIDTH:
For i' = 1:N step TILE_WIDTH
  For j' = 1:N step TILE_WIDTH
    For k' = 1:N step TILE_WIDTH
      For i = i' : i' + TILE_WIDTH - 1
        For j = j' : j' + TILE_WIDTH - 1
          For k = k' : k' + TILE_WIDTH - 1
            C[i][j] += A[i][k] * B[k][j]
Per tile: total required data accesses = 2 × (TILE_WIDTH)^2; total computation = 2 × (TILE_WIDTH)^3
We can find a small enough TILE_WIDTH, such that all the
values needed by C[i][j] are in shared memory
Every data element is re-used TILE_WIDTH times
Given 48 KB of shared memory (including the output tile C[][]):
Max tile width = (48KB / 4B / 3)^(1/2) = 64
CGMA = number of data re-uses = TILE_WIDTH = 64!
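Worked out explicitly: three tiles (A, B, and the output C), each with TILE_WIDTH^2 elements of 4 bytes, must fit in the 48 KB of shared memory:
\[ 3 \times \text{TILE\_WIDTH}^2 \times 4\,\text{B} \le 48\,\text{KB} \;\Rightarrow\; \text{TILE\_WIDTH} \le \sqrt{49152 / 12} = 64 \]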
__device__ inline int Addr(int matrixIdx, int i, int j, int N) {
    return (N*N*matrixIdx + i*N + j);
}
// here N is the tile width (TILE_WIDTH); d_A and d_B each hold one input tile,
// and d_C accumulates the corresponding output tile
__global__ void Matrix_Mul(int* A, int* B, int* C, int N) {
    extern __shared__ int S[];
    int i = threadIdx.x;
    int j = threadIdx.y;
    // move the two input tiles to shared memory
    S[Addr(0, i, j, N)] = A[Addr(0, i, j, N)];
    S[Addr(1, i, j, N)] = B[Addr(0, i, j, N)];
    __syncthreads();
    // do computation: accumulate this tile pair's partial products into C
    int sum = 0;
    for (int k = 0; k < N; k++)
        sum += S[Addr(0, i, k, N)] * S[Addr(1, k, j, N)];
    C[Addr(0, i, j, N)] += sum;
}
int main() {
    dim3 tile(TILE_WIDTH, TILE_WIDTH);
    for (int i = 0; i < N; i += TILE_WIDTH)
        for (int j = 0; j < N; j += TILE_WIDTH) {
            cudaMemset(d_C, 0, sizeof(int)*TILE_WIDTH*TILE_WIDTH);
            // a k-loop over tiles is needed so that C[i..][j..] accumulates
            // contributions from every pair of A and B tiles
            for (int k = 0; k < N; k += TILE_WIDTH) {
                // conceptual sketch: each copy below stands for moving one tile
                // (a real implementation would use cudaMemcpy2D or pre-packed tiles)
                cudaMemcpy(d_A, &A[i][k], sizeof(int)*TILE_WIDTH*TILE_WIDTH, H2D);
                cudaMemcpy(d_B, &B[k][j], sizeof(int)*TILE_WIDTH*TILE_WIDTH, H2D);
                Matrix_Mul<<<1, tile, 2*TILE_WIDTH*TILE_WIDTH*sizeof(int)>>>(d_A, d_B, d_C, TILE_WIDTH);
            }
            cudaMemcpy(&C[i][j], d_C, sizeof(int)*TILE_WIDTH*TILE_WIDTH, D2H);
        }
}
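For reference, the tiling is more commonly done inside a single kernel, with each thread block looping over the tiles it needs; the sketch below is the standard pattern, not code from the slides (TILE_WIDTH, the kernel name, and the assumption that TILE_WIDTH divides N are illustrative):
#define TILE_WIDTH 32
__global__ void TiledMatMul(const int* A, const int* B, int* C, int N) {
    __shared__ int As[TILE_WIDTH][TILE_WIDTH];
    __shared__ int Bs[TILE_WIDTH][TILE_WIDTH];
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    int sum = 0;
    for (int t = 0; t < N / TILE_WIDTH; t++) {         // loop over tiles along the k dimension
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();                               // both tiles fully loaded
        for (int k = 0; k < TILE_WIDTH; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                               // done with this tile before overwriting it
    }
    C[row * N + col] = sum;
}
// launch: dim3 block(TILE_WIDTH, TILE_WIDTH), grid(N/TILE_WIDTH, N/TILE_WIDTH);
// TiledMatMul<<<grid, block>>>(d_A, d_B, d_C, N);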
Tiled Algorithm
Block algorithms or tiled algorithms:
Split the inputs into blocks to fit into shared (cache) memory
Increase data reuse, minimize global memory access
A larger CGMA ratio does not always guarantee better performance
The CGMA ratio should be large enough to hide the communication cost; it is not a case of "the larger the better"
Block algorithms introduce overhead by increasing the amount of computation or the number of thread blocks
Outline
Host memory
Pinned memory
Asynchronous computation & data transfer
Streams
Global/Local memory
Memory coalescing
Tiled algorithm
Shared memory
Bank conflict avoidance
Memory padding
Address linearization
Shared Memory Architecture
Many threads access memory simultaneously, so shared memory is divided into banks
Successive 32-bit (4-byte) words are assigned to successive banks
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to the same bank result in a bank conflict
Conflicting accesses are serialized
Shared memory is as fast as registers if there is no bank conflict
[Figure: shared memory organized as banks Bank0 … Bank15]
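The mapping from an address to a bank follows from the word interleaving above (16 banks in this figure; 32 banks on Fermi and later GPUs, which the conflict example below assumes):
\[ \text{bank} = \left\lfloor \frac{\text{byte address}}{4} \right\rfloor \bmod (\text{number of banks}) \]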
Example: No bank Conflict
[Figure: two access patterns with no bank conflict: linear addressing (Thread i → Bank i) and a random 1:1 permutation of threads to banks]
Example: No bank Conflict
If all threads of a half-warp read the identical address, there is no bank conflict (the value is broadcast)
Assume the warp size is 8
Thread0~3 access the same data and are in the same half-warp
The rest of the threads form a 1:1 permutation, so there is no conflict
But this does not hold for write accesses
[Figure: Thread0~Thread7 mapped to banks, with Thread0~3 reading the same word]
Example: Bank Conflict
n-way bank conflict
Each bank receives n accesses to n different addresses
Ex: 2-way bank conflict
__shared__ int array[2][32];
int offset = threadIdx.x*2;
int temp = array[offset/32][offset%32];
With 32 banks, thread t accesses word 2t, so threads t and t+16 map to the same bank: a 2-way conflict
[Figure: the 64 words of the array laid out across the banks, with two accessed words per bank]
Bank Conflict Avoidance
Change shared memory access pattern
Linear addressing access
1:1 permutation
Broadcast: all threads of a half-warp read the identical address
Memory padding
Add additional memory space to avoid bank conflicts
Example: 2D array
32x32 SMEM array
Warp accesses a column:
32-way bank conflicts (threads in a warp access
the same bank)
Memory Padding
Add a column for padding:
32x33 SMEM array
Warp accesses a column:
32 different banks, no bank conflicts
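A minimal sketch of the padding trick described above (the extra column only shifts the bank mapping and is never used for data):
__shared__ float tile[32][33];   // pad each row with one extra element
int t = threadIdx.x;
// column access: thread t reads tile[t][c]
// without padding, the address is t*32 + c, so every thread hits bank (c % 32): a 32-way conflict
// with padding, the address is t*33 + c, so thread t hits bank ((t + c) % 32): 32 different banks
float v = tile[t][0];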
Address linearization (SoA)
Address linearization can avoid bank conflicts in shared memory, and provide memory coalescing for local or constant memory
An array of structures (AoS) behaves like row-major accesses
struct Point { double x; double y; double z; } A[N];
A[threadIdx.x].x = …
Memory layout: A[1].x A[1].y A[1].z A[2].x A[2].y A[2].z A[3].x A[3].y A[3].z …
A structure of arrays (SoA) behaves like column-major accesses
struct PointList { double *x; double *y; double *z; } A;
A.x[threadIdx.x] = …
Memory layout: A.x[1] A.x[2] A.x[3] … A.y[1] A.y[2] A.y[3] … A.z[1] A.z[2] A.z[3] …
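A small sketch of how the two layouts behave inside a kernel (Point and PointList as declared above; the kernels and launch setup are illustrative):
__global__ void scale_aos(Point* A, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) A[i].x *= s;     // consecutive threads are 24 bytes apart: poorly coalesced
}
__global__ void scale_soa(PointList A, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) A.x[i] *= s;     // consecutive threads touch consecutive doubles: coalesced
}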
Slides from Mark Harris, NVIDIA Developer Technology
An Example of CUDA Performance Optimization
Performance!
30x Speedup!
[Figures: the work is split across thread blocks (run on block1, run on block2); one block needs a partial result produced by another block]
If the maximum number of threads per block is 8:
// input/output data is initialized in global memory
// use shared memory for the computation
// wait for other threads to finish moving data
// sync between threads in the same block
Executed by one Multiprocessor
If WARP = 4:
[Figure: the block is executed by one multiprocessor; successive steps use 4 warps, then 2 warps, then 1 warp]
Highly divergent warps (thread IDs 0~14)
If WARP = 4:
[Figure: each step is now executed by 1 warp]
Highly divergent memory access locations
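The surviving captions suggest these figures walk through the parallel-reduction example from the Harris reference at the end of this deck. Assuming so, the sketch below illustrates the two variants being contrasted; it is not the original slide code (launch with blockDim.x * sizeof(int) bytes of dynamic shared memory):
// Version with highly divergent warps: in each step, only threads whose index
// is a multiple of the stride do work, so active and idle threads mix inside every warp.
__global__ void reduce_divergent(const int* in, int* out) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)                 // divergent branch within each warp
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial result per block
}
// Version without divergence: active threads are packed into the low warps,
// and each step halves the number of active (whole) warps.
__global__ void reduce_sequential(const int* in, int* out) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)                            // whole warps are either active or idle
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}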
Half of the threads are idle since the 1st iteration!
Details in backup slides
Reference
NVIDIA Advanced CUDA Webinar: Memory Optimizations
http://on-demand.gputechconf.com/gtc-express/2011/presentations/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf
NVIDIA CUDA C/C++ Streams and Concurrency
http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
Mark Harris, NVIDIA Developer Technology
http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf