
2023秋

Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming (GPU架构与编程)
Lecture 2: Introduction to CUDA C
Zhao Di (赵地)
Institute of Computing Technology, Chinese Academy of Sciences (中科院计算所)
Fall Semester 2023

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

1
2023秋

Libraries: Easy, High-Quality Acceleration


//Vector Addition in Thrust
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main(void) {
  size_t inputLength = 500;
  thrust::host_vector<float> hostInput1(inputLength);
  thrust::host_vector<float> hostInput2(inputLength);
  thrust::device_vector<float> deviceInput1(inputLength);
  thrust::device_vector<float> deviceInput2(inputLength);
  thrust::device_vector<float> deviceOutput(inputLength);

  thrust::copy(hostInput1.begin(), hostInput1.end(), deviceInput1.begin());
  thrust::copy(hostInput2.begin(), hostInput2.end(), deviceInput2.begin());

  thrust::transform(deviceInput1.begin(), deviceInput1.end(),
                    deviceInput2.begin(), deviceOutput.begin(),
                    thrust::plus<float>());
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Compiler Directives: Easy, Portable Acceleration

– OpenACC: compiler directives for C, C++, and FORTRAN

#pragma acc parallel loop \
  copyin(input1[0:inputLength], input2[0:inputLength]), \
  copyout(output[0:inputLength])
for (i = 0; i < inputLength; ++i) {
  output[i] = input1[i] + input2[i];
}

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

2
2023秋

Programming Languages: Most Performance and Flexible Acceleration

Numerical analytics: MATLAB, Mathematica, LabVIEW
Python: PyCUDA, Numba
Fortran: CUDA Fortran, OpenACC
C: CUDA C, OpenACC
C++: CUDA C++, Thrust
C#: Hybridizer

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

3
2023秋

Data Parallelism - Vector Addition Example

vector A A[0] A[1] A[2] … A[N-1]

vector B B[0] B[1] B[2] … B[N-1]

+ + + +

vector C C[0] C[1] C[2] … C[N-1]

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Vector Addition – Traditional C Code


// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int i;
for (i = 0; i<n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
// Memory allocation for h_A, h_B, and h_C
// I/O to read h_A and h_B, N elements

vecAdd(h_A, h_B, h_C, N);
}

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

4
2023秋

Heterogeneous Computing: vecAdd CUDA Host Code

#include <cuda.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n) {
  int size = n * sizeof(float);
  float *d_A, *d_B, *d_C;

  // Part 1
  // Allocate device memory for A, B, and C
  // Copy A and B to device memory

  // Part 2
  // Kernel launch code – the device performs the actual vector addition

  // Part 3
  // Copy C from the device memory
}

(Figure: host memory on the CPU, device memory on the GPU; Parts 1–3 move data between them.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Partial Overview of CUDA Memories

– Device code can:
  – R/W per-thread registers
  – R/W all-shared global memory

– Host code can:
  – Transfer data to/from per-grid global memory

(Figure: a device grid with blocks of threads, per-thread registers, and global memory accessible from the host.)

We will cover more memory types and more sophisticated memory models later.
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

5
2023秋

CUDA Device Memory Management API Functions

– cudaMalloc()
  – Allocates an object in the device global memory
  – Two parameters
    – Address of a pointer to the allocated object
    – Size of the allocated object in bytes

– cudaFree()
  – Frees an object from device global memory
  – One parameter
    – Pointer to the freed object
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Host-Device Data Transfer API Functions

– cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    – Pointer to destination
    – Pointer to source
    – Number of bytes copied
    – Type/direction of transfer
  – Transfer to device is synchronous with respect to the host
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

6
2023秋

Vector Addition, Explicit Memory Management


// … Allocate h_A, h_B, h_C …
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
  int size = n * sizeof(float);
  float *d_A, *d_B, *d_C;

  cudaMalloc((void **) &d_A, size);
  cudaMalloc((void **) &d_B, size);
  cudaMalloc((void **) &d_C, size);

  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

  // Kernel invocation code – to be shown later

  cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
// … Free h_A, h_B, h_C …

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Unified Memory

– cudaMallocManaged(void** ptr, size_t size)
  – Single memory space for all CPUs/GPUs
  – Maintains a single copy of data
  – CUDA-managed data
  – On-demand page migration
  – Compatible with cudaMalloc(), cudaFree()
– Can be optimized with
  – cudaMemAdvise(), cudaMemPrefetchAsync(), cudaMemcpyAsync()
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
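The prefetch and advice hints above can be attached to a managed allocation before a kernel launch. A minimal sketch, assuming device 0, a 256-thread launch configuration, and a hypothetical kernel name (none of these come from the slide):

// Sketch: allocate managed memory, give the driver hints, then launch.
float *A;
size_t bytes = n * sizeof(float);
cudaMallocManaged(&A, bytes);
// Hint: A will mostly be read on device 0 (illustrative assumption).
cudaMemAdvise(A, bytes, cudaMemAdviseSetReadMostly, 0);
// Move the pages to device 0 before the kernel needs them.
cudaMemPrefetchAsync(A, bytes, 0);
someKernel<<<(n + 255) / 256, 256>>>(A, n);   // hypothetical kernel
cudaDeviceSynchronize();
cudaFree(A);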

7
2023秋

Vector Addition, Unified Memory

float *A, *B, *C;
cudaMallocManaged(&A, n * sizeof(float));
cudaMallocManaged(&B, n * sizeof(float));
cudaMallocManaged(&C, n * sizeof(float));
// Initialize A, B
void vecAdd(float *A, float *B, float *C, int n)
{
// Kernel invocation code – to be shown later
}
cudaFree(A);
cudaFree(B);
cudaFree(C);

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

In Practice, Check for API Errors in Host Code


cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
  printf("%s in %s at line %d\n",
         cudaGetErrorString(err), __FILE__, __LINE__);
  exit(EXIT_FAILURE);
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
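Repeating this check after every call gets verbose, so a common convention is to wrap it in a macro. A minimal sketch; the macro name CHECK_CUDA is our own, not a CUDA API:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro: wraps any CUDA runtime call and aborts on error.
#define CHECK_CUDA(call)                                            \
  do {                                                              \
    cudaError_t err_ = (call);                                      \
    if (err_ != cudaSuccess) {                                      \
      printf("%s in %s at line %d\n",                               \
             cudaGetErrorString(err_), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

// Usage:
//   CHECK_CUDA(cudaMalloc((void **)&d_A, size));
//   CHECK_CUDA(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));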

8
2023秋

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

Data Parallelism - Vector Addition Example

vector A A[0] A[1] A[2] … A[N-1]

vector B B[0] B[1] B[2] … B[N-1]

+ + + +

vector C C[0] C[1] C[2] … C[N-1]

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

9
2023秋

CUDA Execution Model

– Heterogeneous: host (CPU) + device (GPU) application C program
  – Serial parts in host C code
  – Parallel parts in device SPMD kernel code

Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); ...
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args); ...

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Arrays of Parallel Threads

• A CUDA kernel is executed by a grid (array) of threads
  – All threads in a grid run the same kernel code (Single Program Multiple Data)
  – Each thread has indexes that it uses to compute memory addresses and make control decisions

(Figure: threads 0, 1, 2, …, 254, 255 each execute)
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

10
2023秋

Thread Blocks: Scalable Cooperation

(Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threads 0 … 255 executing:
  i = blockIdx.x * blockDim.x + threadIdx.x;
  C[i] = A[i] + B[i];)

– Divide the thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
– Threads in different blocks do not interact
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

blockIdx and threadIdx

• Each thread uses indices to decide what data to work on
  – blockIdx: 1D, 2D, or 3D (since CUDA 4.0)
  – threadIdx: 1D, 2D, or 3D

• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …

(Figure: a 2×2 grid of blocks; Block (1,1) is expanded into a 2×4 array of threads with 3D indices (0,0,0) … (1,0,3).)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

11
2023秋

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

NVCC Compiler

– NVIDIA provides a CUDA C compiler: nvcc
– NVCC compiles device code, then forwards the host code on to the host compiler (e.g. g++)
– Can be used to compile & link host-only applications
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
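As a usage sketch, a single .cu file containing both host and device code can be built with nvcc directly (the file name and the architecture flag below are illustrative assumptions):

nvcc -o vecadd vecAdd.cu               # compile and link host + device code
nvcc -arch=sm_70 -o vecadd vecAdd.cu   # optionally target a specific architecture (e.g. Volta)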

12
2023秋

Developer Tools - Debuggers

NVIDIA provided: Nsight, Nsight Systems, CUDA-GDB, CUDA MEMCHECK
3rd party tools are listed at:
https://developer.nvidia.com/debugging-solutions

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Developer Tools - Profilers

NVIDIA provided: NSIGHT, NVVP, NVPROF
3rd party: TAU, VampirTrace
https://developer.nvidia.com/performance-analysis-tools

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

13
2023秋

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

Profiling Tools

Phasing out: nvprof, NVVP
Current: Nsight Compute, Nsight Systems

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
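As a usage sketch, both current tools have command-line front ends that wrap an existing executable (the program name ./vecadd is an illustrative assumption):

nsys profile -o report ./vecadd   # Nsight Systems: whole-application timeline
ncu ./vecadd                      # Nsight Compute: per-kernel profiling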

14
2023秋

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

Partial Overview of CUDA Memories

● Device code can:
  ● R/W per-thread registers
  ● R/W all-shared global memory
  ● R/W managed memory (Unified Memory)

● Host code can:
  ● Transfer data to/from per-grid global memory
  ● R/W managed memory

(Figure: host memory and device global memory backed by a single Unified Memory space.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

15
2023秋

Partial Overview of CUDA Memories

● cudaMallocManaged()
  ● Allocates an object in the Unified Memory address space
  ● Two parameters, with an optional third parameter
    ● Address of a pointer to the allocated object
    ● Size of the allocated object in bytes
    ● [Optional] Flag indicating whether the memory can be accessed from any device or stream

● cudaFree()
  ● Frees an object from Unified Memory
  ● One parameter
    ● Pointer to the freed object
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Partial Overview of CUDA Memories

● cudaMemcpy()
  ● Memory data transfer
  ● Requires four parameters
    ● Pointer to destination
    ● Pointer to source
    ● Number of bytes copied
    ● Type/direction of transfer
  ● Depending on the transfer type, the driver may decide to use the memory on the host or the device
  ● With Unified Memory this function is used to copy data between different arrays, regardless of where they reside
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

16
2023秋

Putting it all together: vecAdd CUDA host code using Unified Memory

int main() {
  float *m_A, *m_B, *m_C;
  int n;
  int size = n * sizeof(float);

  // Allocation of managed memory
  cudaMallocManaged((void**) &m_A, size);
  cudaMallocManaged((void**) &m_B, size);
  cudaMallocManaged((void**) &m_C, size);

  // Memory initialization on the host: m_A, m_B get initialized here

  // Kernel invocation code - to be shown later
  // (the device performs the actual vector addition)

  cudaFree(m_A); cudaFree(m_B); cudaFree(m_C);
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

CUDA Unified Memory for Different Architectures

Prior to compute capability 6.x:
● No specialized hardware units to improve efficiency
● For data migration, the full memory block is copied synchronously by the driver
● No memory oversubscription

Compute capability 6.x onwards:
● Specialized hardware units manage UM page faulting
● Data is migrated on demand, meaning data gets copied only on a page fault
● Possibility to oversubscribe memory, enabling arrays larger than the device memory size
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

17
2023秋

Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming (GPU架构与编程)
Lecture 2: CUDA Parallelism Model
Zhao Di (赵地)
Institute of Computing Technology, Chinese Academy of Sciences (中科院计算所)
Fall Semester 2023

18
2023秋

Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling

Example: Vector Addition Kernel


Device Code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition

__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n) C[i] = A[i] + B[i];
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

19
2023秋

Example: Vector Addition Kernel Launch (Host Code)

Host Code
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
  // d_A, d_B, d_C allocations and copies omitted
  // Run ceil(n/256.0) blocks of 256 threads each
  vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
The ceiling function makes sure that there are enough threads to cover all elements.
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

More on Kernel Launch (Host Code)

Host Code
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
  dim3 DimGrid((n-1)/256 + 1, 1, 1);   // an equivalent way to express the ceiling function
  dim3 DimBlock(256, 1, 1);
  vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

20
2023秋

More on CUDA Function Declarations

                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void  KernelFunc()     device             host
__host__   float HostFunc()       host               host

− __global__ defines a kernel function
  − Each "__" consists of two underscore characters
  − A kernel function must return void
− __device__ and __host__ can be used together
− __host__ is optional if used alone
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
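As a small illustration of combining qualifiers, the same helper can be compiled for both host and device and called from a kernel. The function names below are our own, for illustration only:

// A helper usable from both host and device code.
__host__ __device__ float square(float x) { return x * x; }

// A kernel (must return void) that calls the device-side version.
__global__ void squareKernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = square(data[i]);
}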

Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling

21
2023秋

A Multi-Dimensional Grid Example

(Figure: the host launches Kernel 1 on Grid 1, a 2×2 arrangement of Blocks (0,0), (0,1), (1,0), (1,1), and then launches Grid 2. Block (1,0) is expanded to show a 2×4 array of threads with 3D indices (0,0,0) … (1,0,3).)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Processing a Picture with a 2D Grid

16×16 blocks

62×76 picture

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

22
2023秋

Row-Major Layout in C/C++

(Figure: a 4×4 matrix M stored linearly as M0 … M15; element M2,1 maps to index Row*Width+Col = 2*4+1 = 9.)

M0,0 M0,1 M0,2 M0,3
M1,0 M1,1 M1,2 M1,3
M2,0 M2,1 M2,2 M2,3
M3,0 M3,1 M3,2 M3,3

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
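The linearization rule above is easy to capture in a small helper; a sketch (the function name idx2D is our own):

// Row-major linearization: element (row, col) of a matrix with `width` columns.
__host__ __device__ int idx2D(int row, int col, int width) {
  return row * width + col;
}
// Example: idx2D(2, 1, 4) == 9, matching the figure above.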

Source Code of a PictureKernel


__global__ void PictureKernel(float* d_Pin, float* d_Pout,
                              int height, int width)
{
  // Calculate the row # of the d_Pin and d_Pout element
  int Row = blockIdx.y*blockDim.y + threadIdx.y;

  // Calculate the column # of the d_Pin and d_Pout element
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  // Each thread computes one element of d_Pout if in range
  if ((Row < height) && (Col < width)) {
    d_Pout[Row*width+Col] = 2.0*d_Pin[Row*width+Col];   // scale every pixel value by 2.0
  }
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

23
2023秋

Host Code for Launching PictureKernel


// assume that the picture is m × n,
// m pixels in y dimension and n pixels in x dimension
// input d_Pin has been allocated on and copied to device
// output d_Pout has been allocated on device

dim3 DimGrid((n-1)/16 + 1, (m-1)/16+1, 1);
dim3 DimBlock(16, 16, 1);
PictureKernel<<<DimGrid,DimBlock>>>(d_Pin, d_Pout, m, n);

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Covering a 62×76 Picture with 16×16 Blocks

Not all threads in a Block will follow the same control flow path.

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

24
2023秋

Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling

RGB Color Image Representation


● Each pixel in an image is an RGB value

● The format of an image’s row is


(r g b) (r g b) … (r g b)

● RGB ranges are not distributed uniformly
● Many different color spaces exist; here we show the constants to convert to the AdobeRGB color space
  ● The vertical axis (y value) and horizontal axis (x value) show the fraction of the pixel intensity that should be allocated to G and B. The remaining fraction (1−y−x) of the pixel intensity should be assigned to R
  ● The triangle contains all the representable colors in this color space
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

25
2023秋

RGB to Grayscale Conversion

A grayscale digital image is an image in which the value


of each pixel carries only intensity information.
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Color Calculating Formula

● For each pixel (r g b) at (I, J) do:
  grayPixel[I,J] = 0.21*r + 0.71*g + 0.07*b
● This is just a dot product <[r,g,b], [0.21,0.71,0.07]>, with the constants being specific to the input RGB space

(Figure: bars showing the relative weights 0.21, 0.71, 0.07.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

26
2023秋

RGB to Grayscale Conversion Code

#define CHANNELS 3  // we have 3 channels corresponding to RGB
// The input image is encoded as unsigned characters [0, 255]
__global__ void colorConvert(unsigned char * grayImage,
                             unsigned char * rgbImage,
                             int width, int height) {
  int x = threadIdx.x + blockIdx.x * blockDim.x;
  int y = threadIdx.y + blockIdx.y * blockDim.y;

  if (x < width && y < height) {
    // get 1D coordinate for the grayscale image
    int grayOffset = y*width + x;
    // one can think of the RGB image having CHANNELS times more columns
    // than the grayscale image
    int rgbOffset = grayOffset*CHANNELS;
    unsigned char r = rgbImage[rgbOffset    ];   // red value for pixel
    unsigned char g = rgbImage[rgbOffset + 1];   // green value for pixel
    unsigned char b = rgbImage[rgbOffset + 2];   // blue value for pixel
    // perform the rescaling and store it
    // We multiply by floating point constants
    grayImage[grayOffset] = 0.21f*r + 0.71f*g + 0.07f*b;
  }
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.


27
2023秋


Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling

28
2023秋

Image Blurring

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Blurring Box

(Figure: the box of pixels processed by one thread block.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

29
2023秋

Image Blur as a 2D Kernel


__global__
void blurKernel(unsigned char * in, unsigned char * out, int w, int h)
{
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  int Row = blockIdx.y * blockDim.y + threadIdx.y;

  if (Col < w && Row < h) {
    ... // Rest of our kernel
  }
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

__global__
void blurKernel(unsigned char * in, unsigned char * out, int w, int h) {
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  int Row = blockIdx.y * blockDim.y + threadIdx.y;

  if (Col < w && Row < h) {
    int pixVal = 0;
    int pixels = 0;

    // Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
    for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
      for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
          pixVal += in[curRow * w + curCol];
          pixels++; // Keep track of number of pixels in the accumulated total
        }
      }
    }

    // Write our new pixel value out
    out[Row * w + Col] = (unsigned char)(pixVal / pixels);
  }
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

30
2023秋

Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling

Transparent Scalability

(Figure: the same thread grid of Blocks 0–7 mapped onto a small device that runs two blocks at a time and onto a larger device that runs four blocks at a time.)

● Each block can execute in any order relative to others
● Hardware is free to assign blocks to any processor at any time
● A kernel scales to any number of parallel processors

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

31
2023秋

Example: Executing Thread Blocks

– Threads are assigned to Streaming Multiprocessors (SMs) in block granularity
  – Up to 32 blocks to each SM, as resources allow
  – A Volta SM can take up to 2048 threads
    – Could be 256 (threads/block) * 8 blocks
    – Or 512 (threads/block) * 4 blocks, etc.
– The SM maintains thread/block idx #s
– The SM manages/schedules thread execution

(Figure: an SM with SPs, shared memory, and blocks of threads t0, t1, t2, …, tm assigned to it.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Warps as Scheduling Units

• Each Block is executed as 32-thread Warps


–An implementation decision, not part of the
CUDA programming model
–Warps are scheduling units in SM
–Threads in a warp execute in SIMD
–Future GPUs may have different number of
threads in each warp
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

32
2023秋

Warp Example
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  – Each block is divided into 256/32 = 8 warps
  – There are 8 * 3 = 24 warps

(Figure: Block 0, Block 1, and Block 2 warps, each warp covering threads t0 … t31.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Example: Thread Scheduling (Cont.)


–SM implements zero-overhead warp
scheduling
–Warps whose next instruction has its operands
ready for consumption are eligible for
execution
–Eligible Warps are selected for execution based
on a prioritized scheduling policy
–All threads in a warp execute the same
instruction when selected
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

33
2023秋

Block Granularity Considerations


– For Matrix Multiplication using multiple blocks, should
each block have 4X4, 8X8 or 30X30 threads for Volta?
– For 4X4, we have 16 threads per Block. Each SM can take
up to 2048 threads, which translates to 128 Blocks. However,
each SM can only take up to 32 Blocks, so only 512 threads
will go into each SM!
– For 8X8, we have 64 threads per Block. Since each SM can
take up to 2048 threads, it can take up to 32 Blocks and
achieve full capacity unless other resource considerations
overrule.
– For 30X30, we would have 900 threads per Block. Only two
blocks could fit into an SM for Volta, so only 1800/2048 of
the SM thread capacity would be utilized.
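These limits differ between GPU generations, so in practice they are usually queried at run time rather than hard-coded. A minimal sketch using the CUDA runtime; querying device 0 is an assumption for illustration:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // query device 0
  printf("max threads per block:          %d\n", prop.maxThreadsPerBlock);
  printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
  printf("shared memory per block:        %zu bytes\n", prop.sharedMemPerBlock);
  return 0;
}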

34
2023秋

Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming (GPU架构与编程)
Lecture 2: Memory and Data Locality
Zhao Di (赵地)
Institute of Computing Technology, Chinese Academy of Sciences (中科院计算所)
Fall Semester 2023

Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms

35
2023秋

Review: Image Blur Kernel.


// Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
  for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
    int curRow = Row + blurRow;
    int curCol = Col + blurCol;
    // Verify we have a valid image pixel
    if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
      pixVal += in[curRow * w + curCol];
      pixels++; // Keep track of number of pixels in the accumulated total
    }
  }
}
// Write our new pixel value out
out[Row * w + Col] = (unsigned char)(pixVal / pixels);
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

How about performance on a GPU?

– All threads access global memory for their input matrix elements
  – One memory access (4 bytes) per floating-point addition
  – 4 bytes of memory traffic per FLOP

– Assume a GPU with
  – Peak floating-point rate of 1,600 GFLOPS and 600 GB/s DRAM bandwidth
  – 4*1,600 = 6,400 GB/s required to achieve the peak FLOPS rating
  – The 600 GB/s memory bandwidth limits execution to 150 GFLOPS

– This limits the execution rate to 9.3% (150/1,600) of the peak floating-point execution rate of the device!
– Need to drastically cut down memory accesses to get close to the 1,600 GFLOPS

36
2023秋

Example – Matrix Multiplication

(Figure: P = M × N, all of size WIDTH × WIDTH; each thread computes the P element at (Row, Col) from row Row of M and column Col of N; BLOCK_WIDTH marks the thread-block tile.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

A Basic Matrix Multiplication


__global__ void MatrixMulKernel(float* M, float* N, float* P, int
Width) {
// Calculate the row index of the P element and M
int Row = blockIdx.y*blockDim.y+threadIdx.y;
// Calculate the column index of P and N
int Col = blockIdx.x*blockDim.x+threadIdx.x;
if ((Row < Width) && (Col < Width)) {
float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k) {
Pvalue += M[Row*Width+k]*N[k*Width+Col];
}
P[Row*Width+Col] = Pvalue;
}
}

37
2023秋


A Toy Example: Thread to P Data Mapping

(Figure: a 4×4 P matrix divided into four 2×2 blocks, BLOCK_WIDTH = 2. Block(0,0) covers P0,0 P0,1 P1,0 P1,1 with Thread(0,0) … Thread(1,1); Block(0,1), Block(1,0), and Block(1,1) cover the remaining 2×2 tiles.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

38
2023秋

Calculation of P0,0 and P0,1

(Figure: P0,0 is the dot product of row 0 of M (M0,0 … M0,3) with column 0 of N (N0,0 … N3,0); P0,1 uses the same row of M with column 1 of N.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Programmer View of CUDA Memories

(Figure: a grid of blocks; each block has its own shared memory and per-thread registers; all blocks and the host access global memory and constant memory.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

39
2023秋

Declaring CUDA Variables

Variable declaration                          Memory     Scope    Lifetime
int LocalVar;                                 register   thread   thread
__device__ __shared__   int SharedVar;        shared     block    block
__device__              int GlobalVar;        global     grid     application
__device__ __constant__ int ConstantVar;      constant   grid     application

– __device__ is optional when used with __shared__ or __constant__
– Automatic variables reside in a register
  – Except per-thread arrays, which reside in global memory
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Example: Shared Memory Variable Declaration

__global__ void blurKernel(unsigned char * in, unsigned char * out, int w, int h)
{
  __shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];
  ...
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

40
2023秋

Where to Declare Variables?

Can the host access it?
  – Yes → declare outside of any function: global, constant
  – No → declare in the kernel: register, shared

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Shared Memory in CUDA

– A special type of memory whose contents are explicitly defined and used in the kernel source code
  – One in each SM
  – Accessed at much higher speed (in both latency and throughput) than global memory
  – Scope of access and sharing: thread block
  – Lifetime: thread block; contents disappear after the corresponding thread block terminates execution
  – Accessed by memory load/store instructions
  – A form of scratchpad memory in computer architecture

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

41
2023秋

Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms

Global Memory Access Pattern of the Basic Matrix Multiplication Kernel

(Figure: Thread 1 and Thread 2 each stream their M and N elements directly from global memory.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

42
2023秋

Basic Concept of Tiling


– In a congested traffic system, a significant reduction in the number of vehicles can greatly reduce the delay seen by all vehicles
  – Carpooling for commuters
  – Tiling for global memory accesses
    – drivers = threads accessing their memory data operands
    – cars = memory access requests

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Some Computations are More Challenging to Tile


– Some carpools may be easier than others
– Car pool participants need to have similar work schedule
– Some vehicles may be more suitable for carpooling

– Similar challenges exist in tiling

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

43
2023秋

Carpools need synchronization.

– Good: when people have similar schedules

(Figure: Worker A and Worker B both follow sleep → work → dinner on the same timeline, so they can ride together.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Carpools need synchronization.

– Bad: when people have very different schedules

(Figure: Worker A follows party → sleep → work while Worker B follows sleep → work → dinner, so their timelines do not line up.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

44
2023秋

Same with Tiling

– Good: when threads have similar access timing
– Bad: when threads have very different timing

(Figure: timelines for Thread 1 and Thread 2; overlapping accesses can be served by one tile load, widely separated accesses cannot.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Outline of the Tiling Technique

– Identify a tile of global memory contents that is accessed by multiple threads
– Load the tile from global memory into on-chip memory
– Use barrier synchronization to make sure that all threads are ready to start the phase
– Have the multiple threads access their data from the on-chip memory
– Use barrier synchronization to make sure that all threads have completed the current phase
– Move on to the next tile
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

45
2023秋

Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms

Matrix Multiplication

– Data access pattern
  – Each thread: a row of M and a column of N
  – Each thread block: a strip of M and a strip of N

(Figure: P = M × N with the strip of M rows and strip of N columns read by one thread block highlighted; BLOCK_WIDTH tiles within WIDTH × WIDTH matrices.)

46
2023秋

Tiled Matrix Multiplication

– Break up the execution of each thread into phases
  – so that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
– The tile is BLOCK_SIZE elements in each dimension

(Figure: P = M × N with one BLOCK_WIDTH × BLOCK_WIDTH tile of M and one tile of N highlighted for the current phase.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Loading a Tile

– All threads in a block participate
– Each thread loads one M element and one N element in the tiled code

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

47
2023秋

Phase 0 Load for Block (0,0)

(Figure: the 2×2 tiles M0,0 M0,1 M1,0 M1,1 and N0,0 N0,1 N1,0 N1,1 are copied from the 4×4 matrices into shared memory.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Phase 0 Use for Block (0,0) (iteration 0)

(Figure: iteration 0 of the inner loop combines column 0 of the M tile with row 0 of the N tile held in shared memory, accumulating into P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

48
2023秋

Phase 0 Use for Block (0,0) (iteration 1)

(Figure: iteration 1 combines column 1 of the M tile with row 1 of the N tile in shared memory, accumulating into P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Phase 1 Load for Block (0,0)

(Figure: the next tiles, M0,2 M0,3 M1,2 M1,3 and N2,0 N2,1 N3,0 N3,1, are copied into shared memory.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

49
2023秋

Phase 1 Use for Block (0,0) (iteration 0)

(Figure: iteration 0 of phase 1 combines column 0 of the new M tile with row 0 of the new N tile, continuing the accumulation into P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Phase 1 Use for Block (0,0) (iteration 1)

(Figure: iteration 1 of phase 1 combines column 1 of the M tile with row 1 of the N tile, completing P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

50
2023秋

Execution Phases of Toy Example

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Execution Phases of Toy Example (cont.)

Shared memory allows each value to be accessed by multiple threads.
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

51
2023秋

Barrier Synchronization
– Synchronize all threads in a block
  – __syncthreads()
– All threads in the same block must reach the __syncthreads() before any of them can move on
– Best used to coordinate the phased execution of tiled algorithms
  – To ensure that all elements of a tile are loaded at the beginning of a phase
  – To ensure that all elements of a tile are consumed at the end of a phase
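A small, self-contained toy example of the first use: every thread writes one element into shared memory, the barrier guarantees the whole tile is present, and only then do threads read elements written by other threads. The kernel name and the assumption of 256-thread blocks are ours, not from the slide:

// Toy example: reverse the elements handled by one block.
__global__ void reverseInBlock(int* data) {
  __shared__ int tmp[256];                       // assumes blockDim.x == 256
  int t = threadIdx.x;
  tmp[t] = data[blockIdx.x * blockDim.x + t];    // each thread loads one element
  __syncthreads();                               // whole tile loaded before anyone reads it
  data[blockIdx.x * blockDim.x + t] = tmp[blockDim.x - 1 - t];
}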

Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms

52
2023秋

Loading Input Tile 0 of M (Phase 0)

– Have each thread load an M element and an N element at the same relative position as its P element.

int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;

2D indexing for accessing Tile 0:
  M[Row][tx]
  N[ty][Col]

(Figure: the TILE_WIDTH × TILE_WIDTH tile of M at the start of row Row, mapped to the P tile at (Row, Col).)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Loading Input Tile 0 of N (Phase 0)

– Same scheme: each thread loads one N element at the same relative position as its P element.

int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;

2D indexing for accessing Tile 0:
  M[Row][tx]
  N[ty][Col]

(Figure: the BLOCK_WIDTH × BLOCK_WIDTH tile of N at the top of column Col.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

53
2023秋

Loading Input Tile 1 of M (Phase 1)

2D indexing for accessing Tile 1:
  M[Row][1*TILE_WIDTH + tx]
  N[1*TILE_WIDTH + ty][Col]

(Figure: the second tile of M along row Row.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Loading Input Tile 1 of N (Phase 1)

2D indexing for accessing Tile 1:
  M[Row][1*TILE_WIDTH + tx]
  N[1*TILE_WIDTH + ty][Col]

(Figure: the second BLOCK_WIDTH tile of N along column Col.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

54
2023秋

M and N are dynamically allocated - use 1D indexing

M[Row][p*TILE_WIDTH+tx]
M[Row*Width + p*TILE_WIDTH + tx]

N[p*TILE_WIDTH+ty][Col]
N[(p*TILE_WIDTH+ty)*Width + Col]
where p is the sequence number of the current phase
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  int Row = by * blockDim.y + ty;
  int Col = bx * blockDim.x + tx;
  float Pvalue = 0;

  // Loop over the M and N tiles required to compute the P element
  for (int p = 0; p < Width/TILE_WIDTH; ++p) {
    // Collaborative loading of M and N tiles into shared memory
    ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
    ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int i = 0; i < TILE_WIDTH; ++i)
      Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
  }
  P[Row*Width + Col] = Pvalue;
}

55
2023秋



56
2023秋

Tile (Thread Block) Size Considerations


– Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
– TILE_WIDTH of 32 gives 32*32 = 1024 threads

– For 16, in each phase, each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations (16 floating-point operations for each memory load)
– For 32, in each phase, each block performs 2*1024 = 2,048 float loads from global memory for 1024 * (2*32) = 65,536 mul/add operations (32 floating-point operations for each memory load)

Shared Memory and Threading


– For an SM with 16KB shared memory
  – Shared memory size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
  – With 16KB of shared memory, one can potentially have up to 8 thread blocks executing
    – This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only 2 thread blocks to be active at the same time
    – However, in a GPU where the thread count is limited to 1536 threads per SM, the number of blocks per SM is reduced to one!

– Each __syncthreads() can reduce the number of active threads for a block
– More thread blocks can be advantageous
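When the tile size is only known at run time, the shared array can instead be declared extern and its size passed as the third launch parameter. A minimal sketch; the kernel and variable names are our own:

// Dynamically sized shared memory: size is supplied at launch time.
__global__ void scaleWithSharedBuffer(float* data, int n) {
  extern __shared__ float buf[];          // sized by the launch configuration
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    buf[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = 2.0f * buf[threadIdx.x];
  }
}

// Launch: the third <<<>>> argument is the shared memory size in bytes per block.
//   scaleWithSharedBuffer<<<numBlocks, 256, 256 * sizeof(float)>>>(d_data, n);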

57
2023秋

Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms

Handling Matrix of Arbitrary Size

• The tiled matrix multiplication kernel we presented so far can handle only square matrices whose dimensions (Width) are multiples of the tile width (TILE_WIDTH)
• However, real applications need to handle arbitrarily sized matrices
• One could pad (add elements to) the rows and columns into multiples of the tile size, but this would have significant space and data transfer time overhead
• We will take a different approach
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

58
2023秋

Phase 1 Loads for Block (0,0) for a 3x3 Example

(Figure: with a 3×3 matrix and 2×2 tiles, phase 1 loads M0,2 M1,2 and N2,0 N2,1 into shared memory; the remaining tile slots fall outside the matrices.)

– Threads (1,0) and (1,1) need special treatment in loading the N tile
– Threads (0,1) and (1,1) need special treatment in loading the M tile

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Phase 1 Use for Block (0,0) (iteration 0)

(Figure: iteration 0 combines the valid shared-memory elements M0,2, M1,2 with N2,0, N2,1, accumulating into P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

59
2023秋

Phase 1 Use for Block (0,0) (iteration 1)

(Figure: iteration 1 would read tile elements that lie outside the 3×3 matrices.)

– All threads need special treatment. None of them should introduce invalid contributions to their P elements.

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Phase 0 Loads for Block (1,1) for a 3x3 Example

(Figure: Block (1,1) loads M2,0 M2,1 and N0,2 N1,2 into shared memory; the other tile slots fall outside the matrices.)

– Threads (0,1) and (1,1) need special treatment in loading the N tile
– Threads (1,0) and (1,1) need special treatment in loading the M tile

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

60
2023秋

Major Cases in Toy Example

– Threads that do not calculate valid P elements but still need to participate in loading the input tiles
  – Phase 0 of Block(1,1), Thread(1,0): assigned to calculate the non-existent P[3,2] but needs to participate in loading tile element N[1,2]

– Threads that calculate valid P elements may attempt to load non-existent input elements when loading input tiles
  – Phase 0 of Block(0,0), Thread(1,0): assigned to calculate the valid P[1,0] but attempts to load the non-existent N[3,0]

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

A “Simple” Solution

– When a thread is to load any input element, test whether it is in the valid index range
  – If valid, proceed to load
  – Else, do not load; just write a 0

– Rationale: a 0 value will ensure that the multiply-add step does not affect the final value of the output element
– The condition tested for loading input elements is different from the test for calculating the output P element
  – A thread that does not calculate a valid P element can still participate in loading input tile elements
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

61
2023秋

Phase 1 Use for Block (0,0) (iteration 1)

(Figure: the out-of-range tile slots hold 0, so iteration 1 adds nothing to P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Boundary Condition for Input M Tile

– Each thread loads
  – M[Row][p*TILE_WIDTH+tx]
  – M[Row*Width + p*TILE_WIDTH+tx]
– Need to test
  – (Row < Width) && (p*TILE_WIDTH+tx < Width)
  – If true, load the M element
  – Else, load 0

(Figure: a TILE_WIDTH × TILE_WIDTH tile hanging off the right edge of M.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

62
2023秋

Boundary Condition for Input N Tile

– Each thread loads
  – N[p*TILE_WIDTH+ty][Col]
  – N[(p*TILE_WIDTH+ty)*Width + Col]
– Need to test
  – (p*TILE_WIDTH+ty < Width) && (Col < Width)
  – If true, load the N element
  – Else, load 0

(Figure: a tile hanging off the bottom edge of N.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Loading Elements – with boundary check

for (int p = 0; p < (Width-1) / TILE_WIDTH + 1; ++p) {
  if (Row < Width && p * TILE_WIDTH + tx < Width) {
    ds_M[ty][tx] = M[Row * Width + p * TILE_WIDTH + tx];
  } else {
    ds_M[ty][tx] = 0.0;
  }
  if (p * TILE_WIDTH + ty < Width && Col < Width) {
    ds_N[ty][tx] = N[(p * TILE_WIDTH + ty) * Width + Col];
  } else {
    ds_N[ty][tx] = 0.0;
  }
  __syncthreads();

63
2023秋

Inner Product – Before and After

  if (Row < Width && Col < Width) {
    for (int i = 0; i < TILE_WIDTH; ++i) {
      Pvalue += ds_M[ty][i] * ds_N[i][tx];
    }
  }
  __syncthreads();
} /* end of outer for loop */

if (Row < Width && Col < Width)
  P[Row*Width + Col] = Pvalue;
} /* end of kernel */

Some Important Points

– For each thread the conditions are different for
  – Loading an M element
  – Loading an N element
  – Calculating and storing output elements

– The effect of control divergence should be small for large matrices
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

64
2023秋

Handling General Rectangular Matrices


– In general, the matrix multiplication is defined in
terms of rectangular matrices
– A j x k M matrix multiplied with a k x l N matrix results in a j x l P matrix

– We have presented square matrix multiplication, a


special case
– The kernel function needs to be generalized to
handle general rectangular matrices
– The Width argument is replaced by three arguments: j, k, l
– When Width is used to refer to the height of M or height of P, replace it with j
– When Width is used to refer to the width of M or height of N, replace it with k
– When Width is used to refer to the width of N or width of P, replace it with l
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
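A sketch of how the loop bounds change once Width is split into the three dimensions described above. This generalization is ours, following the substitution rules on the slide (M is j×k, N is k×l, P is j×l), shown without tiling for brevity:

__global__ void MatrixMulRectKernel(float* M, float* N, float* P,
                                    int j, int k, int l) {
  int Row = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. j-1 (rows of M and P)
  int Col = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. l-1 (columns of N and P)
  if (Row < j && Col < l) {
    float Pvalue = 0;
    for (int i = 0; i < k; ++i)                      // inner dimension is k
      Pvalue += M[Row * k + i] * N[i * l + Col];
    P[Row * l + Col] = Pvalue;
  }
}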

65
