Performance (Memory) Optimization
National Tsing Hua University
2024, Fall Semester
Communication vs Computation
Peak performance for Kepler
The peak processing performance is 3935 GFLOPS.
The memory bandwidth is 250 GB/s, which is about 63 G single-precision floating-point values per second.
The ratio is about 60 times (worked out below).
Instruction execution
Each computation instruction takes about 1~4 cycles
Each load/store instruction for global memory access takes about 400~800 cycles
A memory access to shared memory takes about 1~20 cycles
The ratio is about 100 times
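Worked out from the peak-performance numbers above, assuming 4-byte single-precision values:
\[ \frac{3935\ \text{GFLOPS}}{250\ \text{GB/s} \,/\, 4\ \text{B}} = \frac{3935}{62.5} \approx 63 \]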
Data Pre-fetch and Reuse
The GPU has faster (but smaller) memory spaces
Shared memory / L1 cache
Register file
Solution:
Hardware: prefetch data into shared memory or registers for later computation
Software/programmer: minimize memory usage and reuse the data in shared memory or registers as many times as possible
Outline
Host memory
Pinned memory
Asynchronous computation & data transfer
Streams
Global/Local memory
Memory coalescing
Tiled algorithm
Shared memory
Bank conflict avoidance
Memory padding
Address linearization
1. Page-Locked Data Transfers
cudaMallocHost() allows allocation of page-
locked (“pinned”) host memory
cudaMalloc ( &dev1, size ) ;
cudaMallocHost( &host1, size ) ;
…
cudaMemcpy ( dev1, host1, size, H2D ) ;
Enables highest cudaMemcpy performance
Use with caution!!
Allocating too much page-locked memory can
reduce overall system (host) performance
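A minimal end-to-end sketch of the pattern above (h_buf, d_buf, and N are illustrative names):
size_t size = N * sizeof(float);   // N is assumed to be defined elsewhere
float *h_buf, *d_buf;
cudaMallocHost(&h_buf, size);      // page-locked (pinned) host memory
cudaMalloc(&d_buf, size);          // device memory
// ... fill h_buf on the host ...
cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
// ... launch kernels that use d_buf ...
cudaFreeHost(h_buf);               // release pinned memory when done
cudaFree(d_buf);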
2. Overlap CPU & GPU Computations
To facilitate concurrent execution between host
and device, some function calls are asynchronous:
Control is returned to the host thread before the
device has completed the requested task.
Asynchronous functions:
Kernel launches
Asynchronous memory copy and memory set: cudaMemcpyAsync, cudaMemsetAsync
cudaMemcpy within the same device
H2D cudaMemcpy of 64 KB or less
Synchronous Computation
cudaMalloc(&dev1, size);
double* host1 = (double*) malloc(size);
…
// cudaMemcpy blocks until the copy is completed
cudaMemcpy(dev1, host1, size, H2D);
// the two kernels are serialized and executed on the device
kernel2 <<< grid, block >>> ( …, dev2, … );
kernel3 <<< grid, block >>> ( …, dev3, … );
// cudaMemcpy starts after the kernels finish and blocks until the copy is completed
cudaMemcpy(host4, dev4, size, D2H);
CPU_func();
[Timeline: the GPU executes cudaMemcpy, kernel2, kernel3, cudaMemcpy in order; CPU_func() only starts after the final blocking cudaMemcpy returns]
Kernels launched from a single host thread (the default stream) are always serialized and cannot overlap on the GPU; the CPU and GPU are synchronized by the blocking cudaMemcpy calls.
Asynchronous Computation
cudaMalloc(&dev1, size);
double* host1 = (double*) malloc(size);
…
cudaMemcpy(dev1, host1, size, H2D);
// kernel launches return immediately, so the CPU and GPU are overlapped
kernel2 <<< grid, block >>> ( …, dev1, … );
kernel3 <<< grid, block >>> ( …, dev1, … );
CPU_func();   // runs on the CPU while kernel2 and kernel3 run on the GPU
cudaMemcpy(host1, dev1, size, D2H);
…
[Timeline: kernel2 and kernel3 execute on the GPU while CPU_func() executes on the CPU; the final cudaMemcpy waits for the kernels to finish]
Asynchronous Data Transfers
Asynchronous host-device memory copy returns control
immediately to CPU
cudaMemcpyAsync(dst, src, size, dir, stream);
requires pinned host memory (allocated by “cudaMallocHost”)
Overlap CPU computation with data transfer
0 = default stream
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cudaMemcpyAsync(a_h, a_d, size, cudaMemcpyDeviceToHost, 0);
CPU_method();   // overlapped with the asynchronous copies and the kernel
3. CUDA Streams
CUDA streams are a technique to overlap kernel execution with data transfers and hide the transfer delay from the computation
Operations in different streams can be interleaved and, when possible, they can even run concurrently
Operations in the same stream are still serialized and executed in order
Consider a kernel that processes a huge dataset
Without streams, the kernel computation can only start after the whole dataset has been transferred:
H2D → kernel → D2H
With streams, we can partition the dataset, assign each partition to a stream, and execute the partitions in a pipeline:
H1 H2 H3
   K1 K2 K3
      D1 D2 D3
CUDA Streams
kernel launch
kernel<<<grid,block,0,stream-id>>>(/*…*/);
The stream must be created before use and destroyed afterwards
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaStreamDestroy(stream);
Memory copies can be either synchronous or asynchronous, but a synchronous memcpy prevents streams from running in parallel
If asynchronous copy is used, host memory must be
pinned
CUDA Streams
cudaStream_t stream[2];
cudaStreamCreate(&stream[0]);
cudaStreamCreate(&stream[1]);
cudaMallocHost(&hostPtr, 2 * size);   // pinned (page-locked) host memory
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(/*…*/, cudaMemcpyHostToDevice, stream[i]);
    kernel<<<100, 512, 0, stream[i]>>>(/*…*/);
    cudaMemcpyAsync(/*…*/, cudaMemcpyDeviceToHost, stream[i]);
}
cudaStreamDestroy(stream[0]);
cudaStreamDestroy(stream[1]);
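A fuller sketch of the same loop with the elided arguments spelled out; hostPtr/devPtr hold two equal chunks and the kernel is assumed to take a pointer and an element count (all names are illustrative):
float *hostPtr, *devPtr;
cudaMallocHost(&hostPtr, 2 * size);           // pinned host memory, 2 chunks
cudaMalloc(&devPtr, 2 * size);
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);
int n = size / sizeof(float);                 // elements per chunk
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(devPtr + i * n, hostPtr + i * n, size,
                    cudaMemcpyHostToDevice, stream[i]);
    kernel<<<100, 512, 0, stream[i]>>>(devPtr + i * n, n);
    cudaMemcpyAsync(hostPtr + i * n, devPtr + i * n, size,
                    cudaMemcpyDeviceToHost, stream[i]);
}
cudaDeviceSynchronize();                      // wait for both streams to finish
for (int i = 0; i < 2; ++i) cudaStreamDestroy(stream[i]);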
Stream based Synchronization
cudaStreamSynchronize(stream-id)
Blocks host until all CUDA calls in stream stream-id
complete
cudaEventRecord (event, stream-id )
Insert ‘events‘ in streams
Event is recorded when GPU reaches it in a stream
cudaEventSynchronize (event)
Blocks CPU thread until event is recorded
cudaStreamWaitEvent(stream-id, event, 0)
Blocks a GPU stream until the event reports completion
Example: Explicit Sync between Streams
cudaEvent_t event;
cudaEventCreate(&event);                          // create event
// 1) H2D copy of new input
cudaMemcpyAsync(d_in, in, size, H2D, stream1);
cudaEventRecord(event, stream1);                  // record event
// 2) D2H copy of previous result
cudaMemcpyAsync(out, d_out, size, D2H, stream2);
// wait for the event in stream1
cudaStreamWaitEvent(stream2, event, 0);
// 3) must wait for 1) and 2)
kernel <<< , , , stream2 >>> (d_in, d_out);
asynchronousCPUmethod( … );                       // overlapped CPU work
[Timeline: stream1 runs the H2D copy and records the event; stream2 runs the D2H copy, waits for the event, then runs the kernel]
Outline
Host memory
Pinned memory
Asynchronous computation & data transfer
Streams
Global/Local memory
Memory coalescing
Tiled algorithm
Shared memory
Bank conflict avoidance
Memory padding
Address linearization
Local Memory Cache
The L1 and L2 caches are used to cache local memory contents
L1: on-chip memory, the same hardware as shared memory
Programmers can decide the split between shared memory and L1 cache
L2: cache for off-chip (global) memory accesses
[Figure: on-chip memory vs. off-chip (on-board) memory]
Coalesced Memory Access
Accessing data in the global memory is critical to the
performance of a CUDA application
DRAM is slow compared to on-chip memory
Recall that all threads in a warp execute the same
instruction
When all threads in a warp execute a load instruction, the
hardware detects whether the threads access consecutive
memory locations
In this favorable case, the hardware coalesces all memory
accesses into a consolidated access (single transaction) to
consecutive DRAM locations (off-chip memory)
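A minimal sketch contrasting a coalesced and a strided (uncoalesced) access pattern; the kernels and launch setup are illustrative, and bounds checks are omitted:
__global__ void coalesced(float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];            // consecutive threads touch consecutive addresses: one transaction per warp
}
__global__ void strided(float* in, float* out, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];            // threads touch addresses stride*4 bytes apart: many transactions per warp
}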
Coalesced Memory Access
Coalesced access
Unaligned sequential addresses that fit into two 128-
byte L1-cache lines
Misaligned Access Without Caching
Misaligned sequential addresses that fall within five
32-byte L2 cache segments
No extra data is read
Sometimes this can be faster than (L1-)cached memory access
If the data are not reused
Example: Matrix Transpose
SDK Sample (“transpose”)
Illustrates coalescing using shared memory
Speedups even for small matrices
Uncoalesced Transpose
B[i,j] = A[j,i]
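A minimal sketch of this uncoalesced version, assuming N×N row-major matrices and a 2D launch (names are illustrative):
__global__ void transpose_naive(float* B, const float* A, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < N && y < N)
        B[x * N + y] = A[y * N + x];   // read is coalesced, write is strided (uncoalesced)
}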
Coalesced Transpose
Coalescing through shared memory
Make both global-memory reads and writes contiguous
__shared__ S[];
S[i,j] = A[i,j];
B[i,j] = S[j,i];
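A sketch of the coalesced version as a full kernel; the 32×32 tile size and the +1 padding column (which also avoids the bank conflicts discussed later) are common choices assumed here, not taken from the slide:
#define TILE 32
__global__ void transpose_coalesced(float* B, const float* A, int N) {
    __shared__ float tile[TILE][TILE + 1];               // +1 column avoids shared-memory bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < N && y < N)
        tile[threadIdx.y][threadIdx.x] = A[y * N + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;                 // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < N && y < N)
        B[y * N + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}
// launch: dim3 block(TILE, TILE), grid(N/TILE, N/TILE) (assumed)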
Outline
Host memory
Pinned memory
Asynchronous computation & data transfer
Streams
Global/Local memory
Memory coalescing
Tiled algorithm
Shared memory
Bank conflict avoidance
Memory padding
Address linearization
Example: Matrix Multiply
Compute C = A × B, where A, B, C are N-by-N matrices
(Let each thread compute one element C[i][j])
For i = 1:N
  For j = 1:N
    For k = 1:N
      C[i][j] += A[i][k] * B[k][j]
Compute-to-Global-Memory-Access (CGMA) ratio
In the inner loop: compute = 1 multiplication + 1 addition; memory accesses = 2
CGMA = 1
K20X (Kepler)
Compute = 3950 GFLOPS; global memory BW = 250 GB/s (a floating-point value takes 4 bytes)
Compute / Comm. = 3950 × 4 / 250 ≈ 64
CGMA must increase to about 64!
Load Everything to Shared Memory
Shared memory is about 100 times faster than global memory
If N^2 threads are used:
Each thread only needs to load 2 elements and performs 2N operations
CGMA = N (when N > 64, memory access is no longer the bottleneck)
For i = 1:N
  For j = 1:N
    For k = 1:N
      C[i][j] += A[i][k] * B[k][j]
But shared memory is small
The data that needs to be stored is 3N^2 integers or floats
If N = 1024, size = 12 MB (i.e., 3 × 1024 × 1024 × 4 bytes)
Load Everything to Shared Memory
dim3 block(N, N);
Matrix_Mul<<<1, block, 2*N*N*sizeof(int)>>>(A, B, C, N);
The third launch parameter is the dynamic shared memory size (in bytes).
__device__ inline int Addr(int matrixIdx, int i, int j, int N) {
    return (N*N*matrixIdx + i*N + j);
}
__global__ void Matrix_Mul(int* A, int* B, int* C, int N) {
    extern __shared__ int S[];      // holds a copy of A (matrixIdx 0) and B (matrixIdx 1)
    int i = threadIdx.x;
    int j = threadIdx.y;
    // move data to shared memory
    S[Addr(0, i, j, N)] = A[Addr(0, i, j, N)];
    S[Addr(1, i, j, N)] = B[Addr(0, i, j, N)];
    __syncthreads();
    // do computation
    int sum = 0;
    for (int k = 0; k < N; k++)
        sum += S[Addr(0, i, k, N)] * S[Addr(1, k, j, N)];
    C[Addr(0, i, j, N)] = sum;
}
Block(Tiled) Algorithm
Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of the data
Not all problems can be partitioned
into independent subsets
Block(Tiled) Algorithm
Rewrite the for-loop with TILE_WIDTH:
For i' = 1:N step TILE_WIDTH
  For j' = 1:N step TILE_WIDTH
    For k' = 1:N step TILE_WIDTH
      For i = i' : i' + TILE_WIDTH - 1
        For j = j' : j' + TILE_WIDTH - 1
          For k = k' : k' + TILE_WIDTH - 1
            C[i][j] += A[i][k] * B[k][j]
Per tile: total required data accesses = 2 × (TILE_WIDTH)^2; total computation = 2 × (TILE_WIDTH)^3
We can find a small enough TILE_WIDTH, such that all the
values needed by C[i][j] are in shared memory
Every data element is re-used TILE_WIDTH times
Given 48 KB of shared memory (including the output tile C[][]):
Max tile width = (48KB / 4B / 3)^(1/2) = 64
CGMA = number of data re-uses = TILE_WIDTH = 64!
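Worked out explicitly: three tiles (A, B, and the output C), each with TILE_WIDTH^2 elements of 4 bytes, must fit in the 48 KB of shared memory:
\[ 3 \times \text{TILE\_WIDTH}^2 \times 4\,\text{B} \le 48\,\text{KB} \;\Rightarrow\; \text{TILE\_WIDTH} \le \sqrt{49152 / 12} = 64 \]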
__device__ inline int Addr(int matrixIdx, int i, int j, int N) {
    return (N*N*matrixIdx + i*N + j);
}
// here N is the tile width (TILE_WIDTH); d_A and d_B each hold one input tile,
// and d_C accumulates the corresponding output tile
__global__ void Matrix_Mul(int* A, int* B, int* C, int N) {
    extern __shared__ int S[];
    int i = threadIdx.x;
    int j = threadIdx.y;
    // move the two input tiles to shared memory
    S[Addr(0, i, j, N)] = A[Addr(0, i, j, N)];
    S[Addr(1, i, j, N)] = B[Addr(0, i, j, N)];
    __syncthreads();
    // do computation: accumulate this tile pair's partial products into C
    int sum = 0;
    for (int k = 0; k < N; k++)
        sum += S[Addr(0, i, k, N)] * S[Addr(1, k, j, N)];
    C[Addr(0, i, j, N)] += sum;
}
int main() {
    dim3 tile(TILE_WIDTH, TILE_WIDTH);
    for (int i = 0; i < N; i += TILE_WIDTH)
        for (int j = 0; j < N; j += TILE_WIDTH) {
            cudaMemset(d_C, 0, sizeof(int)*TILE_WIDTH*TILE_WIDTH);
            // a k-loop over tiles is needed so that C[i..][j..] accumulates
            // contributions from every pair of A and B tiles
            for (int k = 0; k < N; k += TILE_WIDTH) {
                // conceptual sketch: each copy below stands for moving one tile
                // (a real implementation would use cudaMemcpy2D or pre-packed tiles)
                cudaMemcpy(d_A, &A[i][k], sizeof(int)*TILE_WIDTH*TILE_WIDTH, H2D);
                cudaMemcpy(d_B, &B[k][j], sizeof(int)*TILE_WIDTH*TILE_WIDTH, H2D);
                Matrix_Mul<<<1, tile, 2*TILE_WIDTH*TILE_WIDTH*sizeof(int)>>>(d_A, d_B, d_C, TILE_WIDTH);
            }
            cudaMemcpy(&C[i][j], d_C, sizeof(int)*TILE_WIDTH*TILE_WIDTH, D2H);
        }
}
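For reference, the tiling is more commonly done inside a single kernel, with each thread block looping over the tiles it needs; the sketch below is the standard pattern, not code from the slides (TILE_WIDTH, the kernel name, and the assumption that TILE_WIDTH divides N are illustrative):
#define TILE_WIDTH 32
__global__ void TiledMatMul(const int* A, const int* B, int* C, int N) {
    __shared__ int As[TILE_WIDTH][TILE_WIDTH];
    __shared__ int Bs[TILE_WIDTH][TILE_WIDTH];
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    int sum = 0;
    for (int t = 0; t < N / TILE_WIDTH; t++) {         // loop over tiles along the k dimension
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();                               // both tiles fully loaded
        for (int k = 0; k < TILE_WIDTH; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                               // done with this tile before overwriting it
    }
    C[row * N + col] = sum;
}
// launch: dim3 block(TILE_WIDTH, TILE_WIDTH), grid(N/TILE_WIDTH, N/TILE_WIDTH);
// TiledMatMul<<<grid, block>>>(d_A, d_B, d_C, N);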
Tiled Algorithm
Block algorithms or tiled algorithms:
Split the inputs into blocks to fit into shared (cache) memory
Increase data reuse, minimize global memory access
A larger CGMA ratio does not always guarantee better performance
The CGMA ratio should be large enough to hide the communication cost; it is not a case of "the larger the better"
Block algorithms introduce overhead by increasing the amount of computation or the number of thread blocks
Outline
Host memory
Pinned memory
Asynchronous computation & data transfer
Streams
Global/Local memory
Memory coalescing
Tiled algorithm
Shared memory
Bank conflict avoidance
Memory padding
Address linearization
Shared Memory Architecture
Many threads access memory simultaneously, so shared memory is divided into banks
Successive 32-bit (4-byte) words are assigned to successive banks
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to the same bank result in a bank conflict
Conflicting accesses are serialized
Shared memory is as fast as registers if there is no bank conflict
[Figure: shared memory organized as banks Bank0 … Bank15]
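The mapping from an address to a bank follows from the word interleaving above (16 banks in this figure; 32 banks on Fermi and later GPUs, which the conflict example below assumes):
\[ \text{bank} = \left\lfloor \frac{\text{byte address}}{4} \right\rfloor \bmod (\text{number of banks}) \]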
Example: No bank Conflict
[Figure: two access patterns with no bank conflict: linear addressing (Thread i → Bank i) and a random 1:1 permutation of threads to banks]
Example: No bank Conflict
If all threads of a half-warp read the identical address, there is no bank conflict (the value is broadcast)
Assume the warp size is 8
Thread0~3 access the same data and are in the same half-warp
The rest of the threads form a 1:1 permutation, so there is no conflict
But this does not hold for write accesses
[Figure: Thread0~Thread7 mapped to banks, with Thread0~3 reading the same word]
Example: Bank Conflict
n-way bank conflict
Each bank receives n accesses to n different addresses
Ex: 2-way bank conflict
__shared__ int array[2][32];
int offset = threadIdx.x*2;
int temp = array[offset/32][offset%32];
With 32 banks, thread t accesses word 2t, so threads t and t+16 map to the same bank: a 2-way conflict
[Figure: the 64 words of the array laid out across the banks, with two accessed words per bank]
Bank Conflict Avoidance
Change shared memory access pattern
Linear addressing access
1:1 permutation
Broadcast: all threads of a half-warp read the identical address
Memory padding
Add additional memory space to avoid bank conflicts
Example: 2D array
32x32 SMEM array
Warp accesses a column:
32-way bank conflicts (threads in a warp access
the same bank)
Memory Padding
Add a column for padding:
32x33 SMEM array
Warp accesses a column:
32 different banks, no bank conflicts
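A minimal sketch of the padding trick described above (the extra column only shifts the bank mapping and is never used for data):
__shared__ float tile[32][33];   // pad each row with one extra element
int t = threadIdx.x;
// column access: thread t reads tile[t][c]
// without padding, the address is t*32 + c, so every thread hits bank (c % 32): a 32-way conflict
// with padding, the address is t*33 + c, so thread t hits bank ((t + c) % 32): 32 different banks
float v = tile[t][0];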
Address linearization (SoA)
Address linearization can avoid bank conflicts in shared memory, and provide memory coalescing for local or constant memory
An array of structures (AoS) behaves like row-major accesses
struct Point { double x; double y; double z; } A[N];
A[threadIdx.x].x = …
Memory layout: A[1].x A[1].y A[1].z A[2].x A[2].y A[2].z A[3].x A[3].y A[3].z …
A structure of arrays (SoA) behaves like column-major accesses
struct PointList { double *x; double *y; double *z; } A;
A.x[threadIdx.x] = …
Memory layout: A.x[1] A.x[2] A.x[3] … A.y[1] A.y[2] A.y[3] … A.z[1] A.z[2] A.z[3] …
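A small sketch of how the two layouts behave inside a kernel (Point and PointList as declared above; the kernels and launch setup are illustrative):
__global__ void scale_aos(Point* A, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) A[i].x *= s;     // consecutive threads are 24 bytes apart: poorly coalesced
}
__global__ void scale_soa(PointList A, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) A.x[i] *= s;     // consecutive threads touch consecutive doubles: coalesced
}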
Slides from Mark Harris, NVIDIA Developer Technology
An Example of CUDA Performance Optimization
Performance!
30x Speedup!
[Figures: the work is split across thread blocks (run on block1, run on block2); one block needs a partial result produced by another block]
If the maximum number of threads per block is 8:
// input/output data is initialized in global memory
// use shared memory for the computation
// wait for other threads to finish moving data
// sync between threads in the same block
Executed by one Multiprocessor
If WARP = 4:
[Figure: the block is executed by one multiprocessor; successive steps use 4 warps, then 2 warps, then 1 warp]
Highly divergent warps (thread IDs 0~14)
If WARP = 4:
[Figure: each step is now executed by 1 warp]
Highly divergent memory access locations
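The surviving captions suggest these figures walk through the parallel-reduction example from the Harris reference at the end of this deck. Assuming so, the sketch below illustrates the two variants being contrasted; it is not the original slide code (launch with blockDim.x * sizeof(int) bytes of dynamic shared memory):
// Version with highly divergent warps: in each step, only threads whose index
// is a multiple of the stride do work, so active and idle threads mix inside every warp.
__global__ void reduce_divergent(const int* in, int* out) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)                 // divergent branch within each warp
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial result per block
}
// Version without divergence: active threads are packed into the low warps,
// and each step halves the number of active (whole) warps.
__global__ void reduce_sequential(const int* in, int* out) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)                            // whole warps are either active or idle
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}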
Half of the threads are idle since the 1st iteration!
Details in backup slides
Reference
NVIDIA Advanced CUDA Webinar: Memory Optimizations
http://on-demand.gputechconf.com/gtc-express/2011/presentations/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf
NVIDIA CUDA C/C++ Streams and Concurrency
http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
Mark Harris, NVIDIA Developer Technology
http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf