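Week 03
University of Chinese Academy of Sciences, School of Computer Science: Master's Elective Course
GPU Architecture and Programming
Lecture 2: Introduction to CUDA C
Zhao Di (赵地)
Institute of Computing Technology, Chinese Academy of Sciences
Fall Semester 2023
Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory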
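#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main(void) {
    size_t inputLength = 500;
    // Input vectors in host memory
    thrust::host_vector<float> hostInput1(inputLength);
    thrust::host_vector<float> hostInput2(inputLength);
    // Input and output vectors in device memory
    thrust::device_vector<float> deviceInput1(inputLength);
    thrust::device_vector<float> deviceInput2(inputLength);
    thrust::device_vector<float> deviceOutput(inputLength);
    // Copy both inputs from host to device
    thrust::copy(hostInput1.begin(), hostInput1.end(),
                 deviceInput1.begin());
    thrust::copy(hostInput2.begin(), hostInput2.end(),
                 deviceInput2.begin());
    // Element-wise addition on the device
    thrust::transform(deviceInput1.begin(), deviceInput1.end(),
                      deviceInput2.begin(), deviceOutput.begin(),
                      thrust::plus<float>());
    return 0;
}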
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
C# Hybridizer
int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    …
    vecAdd(h_A, h_B, h_C, N);
}
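As a point of comparison for the GPU version, the following is a minimal sequential sketch of what vecAdd computes, written in plain C++ so that it runs without a GPU. The names vecAddHost and vecAddDemo are hypothetical and are not part of the original slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference: C[i] = A[i] + B[i] for every i in [0, n).
void vecAddHost(const float* h_A, const float* h_B, float* h_C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        h_C[i] = h_A[i] + h_B[i];
    }
}

// Hypothetical helper: adds two small, hard-coded vectors.
std::vector<float> vecAddDemo() {
    std::vector<float> a = {1.0f, 2.0f, 3.0f};
    std::vector<float> b = {10.0f, 20.0f, 30.0f};
    std::vector<float> c(a.size());
    vecAddHost(a.data(), b.data(), c.data(), a.size());
    return c;
}
```

In the CUDA version, each iteration of this loop is intended to become the work of one thread.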
// Part 2
// Part 3
[Figure: Host and device Global Memory]
– Host code can
  – Transfer data to/from per-grid global memory
We will cover more memory types and more sophisticated memory models later.
– cudaMalloc()
  – Allocates an object in the device global memory
  – Two parameters
    – Address of a pointer to the allocated object
    – Size of the allocated object in bytes
– cudaFree()
  – Frees an object from device global memory
  – One parameter
    – Pointer to the freed object
Host-Device Data Transfer API Functions
– cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    – Pointer to destination
    – Pointer to source
    – Number of bytes copied
    – Type/Direction of transfer
  – Transfer to device is synchronous with respect to the host
Unified Memory
– cudaMallocManaged(void** ptr, size_t size)
  – Single memory space for all CPUs/GPUs
  – Maintain single copy of data
  – CUDA-managed data
  – On-demand page migration
– Compatible with cudaMalloc(), cudaFree()
– Can be optimized
  – cudaMemAdvise(), cudaMemPrefetchAsync(), cudaMemcpyAsync()
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
…
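To make the index expression concrete, here is a small host-side sketch in plain C++ that emulates a 1D launch by looping over every (block, thread) pair. The function names and the bounds guard are assumptions added for illustration; on a GPU these iterations would run as parallel threads rather than as a loop.

```cpp
#include <cassert>
#include <vector>

// Global index of one thread: i = blockIdx.x * blockDim.x + threadIdx.x.
int globalIndex(int blockIdxX, int blockDimX, int threadIdxX) {
    return blockIdxX * blockDimX + threadIdxX;
}

// Emulates gridDimX blocks of blockDimX threads; each (block, thread)
// pair handles at most one element of the output.
std::vector<float> emulateVecAdd(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 int gridDimX, int blockDimX) {
    std::vector<float> C(A.size(), 0.0f);
    for (int b = 0; b < gridDimX; ++b) {
        for (int t = 0; t < blockDimX; ++t) {
            int i = globalIndex(b, blockDimX, t);
            if (i < static_cast<int>(A.size())) {  // guard for surplus threads
                C[i] = A[i] + B[i];
            }
        }
    }
    return C;
}
```

For example, thread 5 of block 2 with 256 threads per block works on element 2 * 256 + 5 = 517.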
[Figure: every thread block executes the same code, each on its own range of i]
…
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
…
• Simplifies memory addressing when processing
  – …
NVCC Compiler
– NVIDIA provides a CUDA C compiler
  – nvcc
NVIDIA Provided
3rd Party
https://developer.nvidia.com/debugging-solutions
NVIDIA Provided
3rd Party: TAU, VampirTrace
https://developer.nvidia.com/performance-analysis-tools
Profiling Tools
– Phasing out: nvprof, NVVP
– Current: Nsight Compute, Nsight Systems
● cudaMallocManaged()
  ● Two parameters, with an optional third parameter.
    ● Address of a pointer to the allocated object
    ● Size of the allocated object in terms of bytes
    ● [Optional] Flag indicating if memory can be accessed from any device or stream
● cudaFree()
[Figure: Host with Host Memory; device grid with Global Memory; Unified Memory spanning both]
// m_A, m_B get initialized on the host
// Kernel invocation code - to be shown later
// The device performs the actual vector addition
cudaFree(m_A); cudaFree(m_B); cudaFree(m_C);
}
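University of Chinese Academy of Sciences, School of Computer Science: Master's Elective Course
GPU Architecture and Programming
Lecture 2: CUDA Parallelism Model
Zhao Di (赵地)
Institute of Computing Technology, Chinese Academy of Sciences
Fall Semester 2023
Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling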
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
Example: Vector Addition Kernel Launch (Host Code)
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B, d_C allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
The ceiling function makes sure that there are enough threads to cover all elements.
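The block-count arithmetic can be checked on the host. The sketch below, in plain C++, computes the number of blocks both with the floating-point ceiling used on the slide and with an integer-only form that is common in CUDA code; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Number of blocks as written on the slide: ceil(n / 256.0).
int numBlocksCeil(int n, int threadsPerBlock) {
    return static_cast<int>(std::ceil(n / static_cast<double>(threadsPerBlock)));
}

// Integer-only ceiling division; gives the same result for n >= 0.
int numBlocksInt(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

With n = 1000 and 256 threads per block, both forms give 4 blocks, so 1024 threads are launched and the last 24 are filtered out by the `if (i < n)` test in the kernel.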
[Figure: host and device. Kernel 1 runs on the device as Grid 1, which contains Block (0,0), Block (0,1), Block (1,0), and Block (1,1). Block (1,0) is expanded to show threads (1,0,0), (1,0,1), (1,0,2), (1,0,3). Grid 2 is also shown.]
[Figure: a 62×76 picture covered by 16×16 blocks]
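The grid needed to cover such a picture can be worked out per dimension. The sketch below is plain C++ and assumes the picture is 62 rows by 76 columns (the slide does not say which dimension is which); Dim2, gridFor, and idleThreads are hypothetical names.

```cpp
#include <cassert>

struct Dim2 {
    int x;  // count along the horizontal direction
    int y;  // count along the vertical direction
};

// Smallest grid of blockX x blockY thread blocks that covers a
// width x height picture (integer ceiling division per dimension).
Dim2 gridFor(int width, int height, int blockX, int blockY) {
    return Dim2{(width + blockX - 1) / blockX, (height + blockY - 1) / blockY};
}

// Threads that are launched but fall outside the picture.
int idleThreads(int width, int height, int blockX, int blockY) {
    Dim2 g = gridFor(width, height, blockX, blockY);
    return g.x * blockX * g.y * blockY - width * height;
}
```

Under that assumption, a width of 76 and a height of 62 need a 5×4 grid of 16×16 blocks: 5120 threads for 4712 pixels, so 408 threads must be disabled by a bounds test in the kernel.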
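Row-major layout of a 4×4 matrix M
2D elements in row-major order:
M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3
Linearized elements:
M0 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14 M15
Example: M2,1 is at Row*Width+Col = 2*4+1 = 9, i.e. M9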
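The row-major mapping and its inverse can be written as two short functions. This is a plain C++ sketch; RowCol, rowMajorIndex, and rowMajorCoords are hypothetical names used only for illustration.

```cpp
#include <cassert>

struct RowCol {
    int row;
    int col;
};

// Linear index of element (row, col) in a row-major matrix of the given width.
int rowMajorIndex(int row, int col, int width) {
    return row * width + col;
}

// Inverse mapping: recovers (row, col) from a linear index.
RowCol rowMajorCoords(int index, int width) {
    return RowCol{index / width, index % width};
}
```

For the 4×4 case, element (2, 1) maps to index 9 and index 9 maps back to (2, 1).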
Not all threads in a Block will follow the same control flow path.
[Figure: color-to-grayscale conversion weights 0.21, 0.71, 0.07]
__global__ void colorConvert(unsigned char * grayImage,
                             unsigned char * rgbImage,
                             int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    …
    // one can think of the RGB image as having
    // CHANNELS times as many columns as the grayscale image
    int rgbOffset = grayOffset*CHANNELS;
    unsigned char r = rgbImage[rgbOffset]; // red value for pixel
    …
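Since only part of the kernel survives here, the following is a host-side sketch in plain C++ of the same per-pixel computation, with loops standing in for the thread grid. It assumes interleaved RGB storage with CHANNELS = 3 and assumes the weights are 0.21 for red, 0.71 for green, and 0.07 for blue; the function name colorConvertHost is hypothetical.

```cpp
#include <cassert>
#include <vector>

constexpr int CHANNELS = 3;  // assumption: interleaved R, G, B bytes

// Converts an interleaved RGB image to grayscale, one pixel per iteration.
std::vector<unsigned char> colorConvertHost(
        const std::vector<unsigned char>& rgbImage, int width, int height) {
    std::vector<unsigned char> grayImage(width * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int grayOffset = y * width + x;           // 1D offset in the gray image
            int rgbOffset = grayOffset * CHANNELS;    // RGB rows are CHANNELS times wider
            unsigned char r = rgbImage[rgbOffset];
            unsigned char g = rgbImage[rgbOffset + 1];
            unsigned char b = rgbImage[rgbOffset + 2];
            // Assumed weights: 0.21 (red), 0.71 (green), 0.07 (blue).
            grayImage[grayOffset] =
                static_cast<unsigned char>(0.21f * r + 0.71f * g + 0.07f * b);
        }
    }
    return grayImage;
}
```

In the kernel, the two loops disappear: x and y come from the thread and block indices, and a bounds test keeps surplus threads from writing outside the image.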
Image Blurring
[Figure: blurring box; pixels processed by a thread block]
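The blur kernel itself is not recoverable from this slide, so the following is only a sketch of a box blur in plain C++, under stated assumptions: a single-channel image, a box of (2 * BLUR_SIZE + 1) pixels per side with BLUR_SIZE = 1, and averaging over only those box pixels that lie inside the image. The names blurHost and BLUR_SIZE are illustrative.

```cpp
#include <cassert>
#include <vector>

constexpr int BLUR_SIZE = 1;  // assumption: 3x3 blurring box

// Each output pixel is the average of the input pixels inside the
// blurring box centered on it; box pixels outside the image are skipped.
std::vector<unsigned char> blurHost(const std::vector<unsigned char>& in,
                                    int w, int h) {
    std::vector<unsigned char> out(w * h);
    for (int row = 0; row < h; ++row) {
        for (int col = 0; col < w; ++col) {
            int pixVal = 0;
            int pixels = 0;
            for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr) {
                for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) {
                    int curRow = row + dr;
                    int curCol = col + dc;
                    if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                        pixVal += in[curRow * w + curCol];
                        ++pixels;
                    }
                }
            }
            out[row * w + col] = static_cast<unsigned char>(pixVal / pixels);
        }
    }
    return out;
}
```

On a GPU, the two outer loops would be replaced by one thread per output pixel, which is why neighboring threads in a block read overlapping input pixels.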
Transparent Scalability
[Figure: the same thread grid of Blocks 0 to 7 runs on two devices. One device executes two blocks at a time (Blocks 0-1, then 2-3, then 4-5, then 6-7 over time); the other executes four blocks at a time (Blocks 0-3, then 4-7).]
Warp Example
• If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM?
  – Each Block is divided into 256/32 = 8 Warps
  – There are 8 * 3 = 24 Warps
[Figure: Block 0 Warps, Block 1 Warps, Block 2 Warps, each made of threads t0 t1 t2 …]
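The same arithmetic can be expressed as a short plain C++ sketch. The warp size of 32 comes from the slide; the function names are hypothetical, and the rounding up for block sizes that are not a multiple of 32 reflects that a partially filled warp still occupies a full warp.

```cpp
#include <cassert>

constexpr int WARP_SIZE = 32;

// Warps needed for one block; a partially filled warp still counts as one.
int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + WARP_SIZE - 1) / WARP_SIZE;
}

// Warps resident on an SM that holds blocksPerSM blocks.
int warpsPerSM(int blocksPerSM, int threadsPerBlock) {
    return blocksPerSM * warpsPerBlock(threadsPerBlock);
}
```

For 3 blocks of 256 threads this gives 3 * 8 = 24 warps, matching the slide.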
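University of Chinese Academy of Sciences, School of Computer Science: Master's Elective Course
GPU Architecture and Programming
Lecture 2: Memory and Data Locality
Zhao Di (赵地)
Institute of Computing Technology, Chinese Academy of Sciences
Fall Semester 2023
Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms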
[Figure: matrices M and P, each WIDTH × WIDTH; a BLOCK_WIDTH × BLOCK_WIDTH block of P is located by Row and Col]
[Figure: P (elements P0,0 through P2,3 shown) divided among Block(0,0), Block(0,1), Block(1,0), and Block(1,1) with BLOCK_WIDTH = 2; within a block, Thread(0,0), Thread(0,1), Thread(1,0), and Thread(1,1) each compute one P element]
Calculation of P0,0 and P0,1
[Figure: columns 0 and 1 of N (N0,0 through N3,0 and N0,1 through N3,1)]
[Figure: Grid; Constant Memory]
Example: Shared Memory Variable Declaration
__global__ void blurKernel(unsigned char * in,
                           unsigned char * out, int w, int h)
{
    __shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];
    …
}
[Figure: Thread 1, Thread 2, …]
[Figure: two timing diagrams of Thread 1 and Thread 2 along a Time axis]
Matrix Multiplication
– Data access pattern
  – Each thread: a row of M and a column of N
  – Each thread block: a strip of M and a strip of N
[Figure: M, N, and P, each WIDTH × WIDTH; a BLOCK_WIDTH strip located by Row and Col]
– … so that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
– The tile is of BLOCK_SIZE elements in each dimension
[Figure: M, N, and P, each WIDTH × WIDTH, divided into BLOCK_WIDTH tiles; Row and Col locate the P element]
Loading a Tile
Barrier Synchronization
– Synchronize all threads in a block
– __syncthreads()
int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;
2D indexing for accessing Tile 0:
    M[Row][tx]
    N[ty][Col]
[Figure: M, N, and P, each WIDTH × WIDTH, with TILE_WIDTH tiles; Row and Col locate the P element]
2D indexing for accessing Tile 1:
    M[Row][1*TILE_WIDTH + tx]
    N[1*TILE_WIDTH + ty][Col]
General 2D indexing and its linearized (row-major) form:
    M[Row][p*TILE_WIDTH+tx] is M[Row*Width + p*TILE_WIDTH + tx]
    N[p*TILE_WIDTH+ty][Col] is N[(p*TILE_WIDTH+ty)*Width + Col]
where p is the sequence number of the current phase
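These two linearized expressions can be checked with a small plain C++ sketch. The function names mIndex and nIndex are hypothetical; the parameters mirror the symbols used above.

```cpp
#include <cassert>

// Linear index of the M element a thread loads in phase p:
// M[Row][p*TILE_WIDTH + tx] in row-major form.
int mIndex(int Row, int p, int tx, int Width, int TILE_WIDTH) {
    return Row * Width + p * TILE_WIDTH + tx;
}

// Linear index of the N element a thread loads in phase p:
// N[p*TILE_WIDTH + ty][Col] in row-major form.
int nIndex(int Col, int p, int ty, int Width, int TILE_WIDTH) {
    return (p * TILE_WIDTH + ty) * Width + Col;
}
```

For Width = 4 and TILE_WIDTH = 2, the thread with tx = 1, ty = 0, Row = 0, Col = 1 loads M[0][3] (linear index 3) and N[2][1] (linear index 9) in phase p = 1.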
float Pvalue = 0;
// Loop over the M and N tiles required to compute the P element
for (int p = 0; p < Width/TILE_WIDTH; ++p) {
    ...
    for (int i = 0; i < TILE_WIDTH; ++i)
        Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
}
P[Row*Width+Col] = Pvalue;
}
[Figure: Shared Memory; 3×3 matrices M (M0,0 through M2,2) and P (P0,0 through P2,2)]
A “Simple” Solution
– Need to test
  – …
– Need to test
  – (p*TILE_WIDTH+ty < Width) && (Col < Width)
  – If true, load N element
  – Else, load 0
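To show how the boundary tests fit into the tiled algorithm, here is a host-side emulation in plain C++, with loops standing in for blocks and threads and local arrays standing in for shared memory. The N test is the one listed above; the M test, `(Row < Width) && (p*TILE_WIDTH+tx < Width)`, is assumed by symmetry, and TILE_WIDTH = 2 and the name tiledMatMulHost are illustrative choices.

```cpp
#include <cassert>
#include <vector>

constexpr int TILE_WIDTH = 2;  // illustrative tile size

// Emulates tiled P = M * N for Width x Width row-major matrices,
// loading 0 for tile elements that fall outside the matrices.
std::vector<float> tiledMatMulHost(const std::vector<float>& M,
                                   const std::vector<float>& N, int Width) {
    std::vector<float> P(Width * Width, 0.0f);
    int numTiles = (Width + TILE_WIDTH - 1) / TILE_WIDTH;
    for (int by = 0; by < numTiles; ++by) {
        for (int bx = 0; bx < numTiles; ++bx) {
            float Pvalue[TILE_WIDTH][TILE_WIDTH] = {};  // one value per thread
            for (int p = 0; p < numTiles; ++p) {
                float ds_M[TILE_WIDTH][TILE_WIDTH];
                float ds_N[TILE_WIDTH][TILE_WIDTH];
                // Loading phase: every thread loads one M and one N element.
                for (int ty = 0; ty < TILE_WIDTH; ++ty) {
                    for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                        int Row = by * TILE_WIDTH + ty;
                        int Col = bx * TILE_WIDTH + tx;
                        ds_M[ty][tx] = (Row < Width && p * TILE_WIDTH + tx < Width)
                            ? M[Row * Width + p * TILE_WIDTH + tx] : 0.0f;
                        ds_N[ty][tx] = (p * TILE_WIDTH + ty < Width && Col < Width)
                            ? N[(p * TILE_WIDTH + ty) * Width + Col] : 0.0f;
                    }
                }
                // Compute phase (on a GPU, after a barrier).
                for (int ty = 0; ty < TILE_WIDTH; ++ty)
                    for (int tx = 0; tx < TILE_WIDTH; ++tx)
                        for (int i = 0; i < TILE_WIDTH; ++i)
                            Pvalue[ty][tx] += ds_M[ty][i] * ds_N[i][tx];
            }
            // Store only the P elements that exist.
            for (int ty = 0; ty < TILE_WIDTH; ++ty) {
                for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                    int Row = by * TILE_WIDTH + ty;
                    int Col = bx * TILE_WIDTH + tx;
                    if (Row < Width && Col < Width)
                        P[Row * Width + Col] = Pvalue[ty][tx];
                }
            }
        }
    }
    return P;
}
```

Loading 0 for out-of-range elements keeps the inner product loop unchanged, since the extra terms contribute nothing to Pvalue; with Width = 3 and TILE_WIDTH = 2 the last tile in each dimension is only half filled.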