
2023秋

Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming (GPU架构与编程)
Lecture 2: Introduction to CUDA C
Zhao Di (赵地)
Institute of Computing Technology, Chinese Academy of Sciences (中科院计算所)
Fall Semester 2023

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

1
2023秋

Libraries: Easy, High-Quality Acceleration


//Vector Addition in Thrust
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main(void) {
  size_t inputLength = 500;
  thrust::host_vector<float> hostInput1(inputLength);
  thrust::host_vector<float> hostInput2(inputLength);
  thrust::device_vector<float> deviceInput1(inputLength);
  thrust::device_vector<float> deviceInput2(inputLength);
  thrust::device_vector<float> deviceOutput(inputLength);

  thrust::copy(hostInput1.begin(), hostInput1.end(), deviceInput1.begin());
  thrust::copy(hostInput2.begin(), hostInput2.end(), deviceInput2.begin());

  thrust::transform(deviceInput1.begin(), deviceInput1.end(),
                    deviceInput2.begin(), deviceOutput.begin(),
                    thrust::plus<float>());
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Compiler Directives: Easy, Portable Acceleration

– OpenACC: compiler directives for C, C++, and FORTRAN

#pragma acc parallel loop \
  copyin(input1[0:inputLength], input2[0:inputLength]), \
  copyout(output[0:inputLength])
for (i = 0; i < inputLength; ++i) {
  output[i] = input1[i] + input2[i];
}

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

2
2023秋

Programming Languages: Most Performance and Flexible Acceleration

Numerical analytics: MATLAB, Mathematica, LabVIEW
Python: PyCUDA, Numba
Fortran: CUDA Fortran, OpenACC
C: CUDA C, OpenACC
C++: CUDA C++, Thrust
C#: Hybridizer

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

3
2023秋

Data Parallelism - Vector Addition Example

vector A A[0] A[1] A[2] … A[N-1]

vector B B[0] B[1] B[2] … B[N-1]

+ + + +

vector C C[0] C[1] C[2] … C[N-1]

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Vector Addition – Traditional C Code


// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int i;
for (i = 0; i<n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
// Memory allocation for h_A, h_B, and h_C
// I/O to read h_A and h_B, N elements

vecAdd(h_A, h_B, h_C, N);
}

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

4
2023秋

Heterogeneous Computing: vecAdd CUDA Host Code

#include <cuda.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n) {
  int size = n * sizeof(float);
  float *d_A, *d_B, *d_C;

  // Part 1
  // Allocate device memory for A, B, and C
  // Copy A and B to device memory

  // Part 2
  // Kernel launch code – the device performs the actual vector addition

  // Part 3
  // Copy C from the device memory
}

(Figure: host memory on the CPU, device memory on the GPU; Parts 1–3 move data between them.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Partial Overview of CUDA Memories

– Device code can:
  – R/W per-thread registers
  – R/W all-shared global memory

– Host code can:
  – Transfer data to/from per-grid global memory

(Figure: a device grid with blocks of threads, per-thread registers, and global memory accessible from the host.)

We will cover more memory types and more sophisticated memory models later.
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

5
2023秋

CUDA Device Memory Management API Functions

– cudaMalloc()
  – Allocates an object in the device global memory
  – Two parameters
    – Address of a pointer to the allocated object
    – Size of the allocated object in bytes

– cudaFree()
  – Frees an object from device global memory
  – One parameter
    – Pointer to the freed object
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Host-Device Data Transfer API Functions

– cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    – Pointer to destination
    – Pointer to source
    – Number of bytes copied
    – Type/direction of transfer
  – Transfer to device is synchronous with respect to the host
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

6
2023秋

Vector Addition, Explicit Memory Management


// … Allocate h_A, h_B, h_C …
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
  int size = n * sizeof(float);
  float *d_A, *d_B, *d_C;

  cudaMalloc((void **) &d_A, size);
  cudaMalloc((void **) &d_B, size);
  cudaMalloc((void **) &d_C, size);

  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

  // Kernel invocation code – to be shown later

  cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
// … Free h_A, h_B, h_C …

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Unified Memory

– cudaMallocManaged(void** ptr, size_t size)
  – Single memory space for all CPUs/GPUs
  – Maintains a single copy of data
  – CUDA-managed data
  – On-demand page migration
  – Compatible with cudaMalloc(), cudaFree()
– Can be optimized with
  – cudaMemAdvise(), cudaMemPrefetchAsync(), cudaMemcpyAsync()
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
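The prefetch and advice hints above can be attached to a managed allocation before a kernel launch. A minimal sketch, assuming device 0, a 256-thread launch configuration, and a hypothetical kernel name (none of these come from the slide):

// Sketch: allocate managed memory, give the driver hints, then launch.
float *A;
size_t bytes = n * sizeof(float);
cudaMallocManaged(&A, bytes);
// Hint: A will mostly be read on device 0 (illustrative assumption).
cudaMemAdvise(A, bytes, cudaMemAdviseSetReadMostly, 0);
// Move the pages to device 0 before the kernel needs them.
cudaMemPrefetchAsync(A, bytes, 0);
someKernel<<<(n + 255) / 256, 256>>>(A, n);   // hypothetical kernel
cudaDeviceSynchronize();
cudaFree(A);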

7
2023秋

Vector Addition, Unified Memory

float *A, *B, *C;
cudaMallocManaged(&A, n * sizeof(float));
cudaMallocManaged(&B, n * sizeof(float));
cudaMallocManaged(&C, n * sizeof(float));
// Initialize A, B
void vecAdd(float *A, float *B, float *C, int n)
{
// Kernel invocation code – to be shown later
}
cudaFree(A);
cudaFree(B);
cudaFree(C);

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

In Practice, Check for API Errors in Host Code


cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
  printf("%s in %s at line %d\n",
         cudaGetErrorString(err), __FILE__, __LINE__);
  exit(EXIT_FAILURE);
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
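Repeating this check after every call gets verbose, so a common convention is to wrap it in a macro. A minimal sketch; the macro name CHECK_CUDA is our own, not a CUDA API:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro: wraps any CUDA runtime call and aborts on error.
#define CHECK_CUDA(call)                                            \
  do {                                                              \
    cudaError_t err_ = (call);                                      \
    if (err_ != cudaSuccess) {                                      \
      printf("%s in %s at line %d\n",                               \
             cudaGetErrorString(err_), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

// Usage:
//   CHECK_CUDA(cudaMalloc((void **)&d_A, size));
//   CHECK_CUDA(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));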

8
2023秋

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

Data Parallelism - Vector Addition Example

vector A A[0] A[1] A[2] … A[N-1]

vector B B[0] B[1] B[2] … B[N-1]

+ + + +

vector C C[0] C[1] C[2] … C[N-1]

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

9
2023秋

CUDA Execution Model

– Heterogeneous: host (CPU) + device (GPU) application C program
  – Serial parts in host C code
  – Parallel parts in device SPMD kernel code

Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); ...
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args); ...

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Arrays of Parallel Threads

• A CUDA kernel is executed by a grid (array) of threads
  – All threads in a grid run the same kernel code (Single Program Multiple Data)
  – Each thread has indexes that it uses to compute memory addresses and make control decisions

(Figure: threads 0, 1, 2, …, 254, 255 each execute)
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

10
2023秋

Thread Blocks: Scalable Cooperation

(Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threads 0 … 255 executing:
  i = blockIdx.x * blockDim.x + threadIdx.x;
  C[i] = A[i] + B[i];)

– Divide the thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
– Threads in different blocks do not interact
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

blockIdx and threadIdx

• Each thread uses indices to decide what data to work on
  – blockIdx: 1D, 2D, or 3D (since CUDA 4.0)
  – threadIdx: 1D, 2D, or 3D

• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …

(Figure: a 2×2 grid of blocks; Block (1,1) is expanded into a 2×4 array of threads with 3D indices (0,0,0) … (1,0,3).)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

11
2023秋

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

NVCC Compiler

– NVIDIA provides a CUDA C compiler: nvcc
– NVCC compiles device code, then forwards the host code on to the host compiler (e.g. g++)
– Can be used to compile & link host-only applications
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
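As a usage sketch, a single .cu file containing both host and device code can be built with nvcc directly (the file name and the architecture flag below are illustrative assumptions):

nvcc -o vecadd vecAdd.cu               # compile and link host + device code
nvcc -arch=sm_70 -o vecadd vecAdd.cu   # optionally target a specific architecture (e.g. Volta)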

12
2023秋

Developer Tools - Debuggers

NVIDIA provided: Nsight, Nsight Systems, CUDA-GDB, CUDA MEMCHECK
3rd party tools are listed at:
https://developer.nvidia.com/debugging-solutions

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Developer Tools - Profilers

NVIDIA provided: NSIGHT, NVVP, NVPROF
3rd party: TAU, VampirTrace
https://developer.nvidia.com/performance-analysis-tools

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

13
2023秋

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

Profiling Tools

Phasing out: nvprof, NVVP
Current: Nsight Compute, Nsight Systems

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
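As a usage sketch, both current tools have command-line front ends that wrap an existing executable (the program name ./vecadd is an illustrative assumption):

nsys profile -o report ./vecadd   # Nsight Systems: whole-application timeline
ncu ./vecadd                      # Nsight Compute: per-kernel profiling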

14
2023秋

Lecture Contents
– CUDA C vs. Thrust vs. CUDA Libraries
– Memory Allocation and Data Movement API Functions
– Threads and Kernel Functions
– Introduction to the CUDA Toolkit
– Nsight Compute and Nsight Systems
– Unified Memory

Partial Overview of CUDA Memories

● Device code can:
  ● R/W per-thread registers
  ● R/W all-shared global memory
  ● R/W managed memory (Unified Memory)

● Host code can:
  ● Transfer data to/from per-grid global memory
  ● R/W managed memory

(Figure: host memory and device global memory backed by a single Unified Memory space.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

15
2023秋

Partial Overview of CUDA Memories

● cudaMallocManaged()
  ● Allocates an object in the Unified Memory address space
  ● Two parameters, with an optional third parameter
    ● Address of a pointer to the allocated object
    ● Size of the allocated object in bytes
    ● [Optional] Flag indicating whether the memory can be accessed from any device or stream

● cudaFree()
  ● Frees an object from Unified Memory
  ● One parameter
    ● Pointer to the freed object
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Partial Overview of CUDA Memories

● cudaMemcpy()
  ● Memory data transfer
  ● Requires four parameters
    ● Pointer to destination
    ● Pointer to source
    ● Number of bytes copied
    ● Type/direction of transfer
  ● Depending on the transfer type, the driver may decide to use the memory on the host or the device
  ● With Unified Memory this function is used to copy data between different arrays, regardless of where they reside
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

16
2023秋

Putting it all together: vecAdd CUDA host code using Unified Memory

int main() {
  float *m_A, *m_B, *m_C;
  int n;
  int size = n * sizeof(float);

  // Allocation of managed memory
  cudaMallocManaged((void**) &m_A, size);
  cudaMallocManaged((void**) &m_B, size);
  cudaMallocManaged((void**) &m_C, size);

  // Memory initialization on the host: m_A, m_B get initialized here

  // Kernel invocation code - to be shown later
  // (the device performs the actual vector addition)

  cudaFree(m_A); cudaFree(m_B); cudaFree(m_C);
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

CUDA Unified Memory for Different Architectures

Prior to compute capability 6.x:
● No specialized hardware units to improve efficiency
● For data migration, the full memory block is copied synchronously by the driver
● No memory oversubscription

Compute capability 6.x onwards:
● Specialized hardware units manage UM page faulting
● Data is migrated on demand, meaning data gets copied only on a page fault
● Possibility to oversubscribe memory, enabling arrays larger than the device memory size
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

17
2023秋

Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming (GPU架构与编程)
Lecture 2: CUDA Parallelism Model
Zhao Di (赵地)
Institute of Computing Technology, Chinese Academy of Sciences (中科院计算所)
Fall Semester 2023

18
2023秋

Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling

Example: Vector Addition Kernel


Device Code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition

__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n) C[i] = A[i] + B[i];
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

19
2023秋

Example: Vector Addition Kernel Launch (Host Code)

Host Code
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
  // d_A, d_B, d_C allocations and copies omitted
  // Run ceil(n/256.0) blocks of 256 threads each
  vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
The ceiling function makes sure that there are enough threads to cover all elements.
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

More on Kernel Launch (Host Code)

Host Code
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
  dim3 DimGrid((n-1)/256 + 1, 1, 1);   // an equivalent way to express the ceiling function
  dim3 DimBlock(256, 1, 1);
  vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

20
2023秋

More on CUDA Function Declarations

                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void  KernelFunc()     device             host
__host__   float HostFunc()       host               host

− __global__ defines a kernel function
  − Each "__" consists of two underscore characters
  − A kernel function must return void
− __device__ and __host__ can be used together
− __host__ is optional if used alone
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
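As a small illustration of combining qualifiers, the same helper can be compiled for both host and device and called from a kernel. The function names below are our own, for illustration only:

// A helper usable from both host and device code.
__host__ __device__ float square(float x) { return x * x; }

// A kernel (must return void) that calls the device-side version.
__global__ void squareKernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = square(data[i]);
}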

Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling

21
2023秋

A Multi-Dimensional Grid Example

(Figure: the host launches Kernel 1 on Grid 1, a 2×2 arrangement of Blocks (0,0), (0,1), (1,0), (1,1), and then launches Grid 2. Block (1,0) is expanded to show a 2×4 array of threads with 3D indices (0,0,0) … (1,0,3).)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Processing a Picture with a 2D Grid

16×16 blocks

62×76 picture

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

22
2023秋

Row-Major Layout in C/C++

(Figure: a 4×4 matrix M stored linearly as M0 … M15; element M2,1 maps to index Row*Width+Col = 2*4+1 = 9.)

M0,0 M0,1 M0,2 M0,3
M1,0 M1,1 M1,2 M1,3
M2,0 M2,1 M2,2 M2,3
M3,0 M3,1 M3,2 M3,3

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
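The linearization rule above is easy to capture in a small helper; a sketch (the function name idx2D is our own):

// Row-major linearization: element (row, col) of a matrix with `width` columns.
__host__ __device__ int idx2D(int row, int col, int width) {
  return row * width + col;
}
// Example: idx2D(2, 1, 4) == 9, matching the figure above.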

Source Code of a PictureKernel


__global__ void PictureKernel(float* d_Pin, float* d_Pout,
                              int height, int width)
{
  // Calculate the row # of the d_Pin and d_Pout element
  int Row = blockIdx.y*blockDim.y + threadIdx.y;

  // Calculate the column # of the d_Pin and d_Pout element
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  // Each thread computes one element of d_Pout if in range
  if ((Row < height) && (Col < width)) {
    d_Pout[Row*width+Col] = 2.0*d_Pin[Row*width+Col];   // scale every pixel value by 2.0
  }
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

23
2023秋

Host Code for Launching PictureKernel


// assume that the picture is m × n,
// m pixels in y dimension and n pixels in x dimension
// input d_Pin has been allocated on and copied to device
// output d_Pout has been allocated on device

dim3 DimGrid((n-1)/16 + 1, (m-1)/16+1, 1);
dim3 DimBlock(16, 16, 1);
PictureKernel<<<DimGrid,DimBlock>>>(d_Pin, d_Pout, m, n);

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Covering a 62×76 Picture with 16×16 Blocks

Not all threads in a Block will follow the same control flow path.

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

24
2023秋

Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling

RGB Color Image Representation


● Each pixel in an image is an RGB value

● The format of an image’s row is


(r g b) (r g b) … (r g b)

● RGB ranges are not distributed uniformly
● Many different color spaces exist; here we show the constants to convert to the AdobeRGB color space
  ● The vertical axis (y value) and horizontal axis (x value) show the fraction of the pixel intensity that should be allocated to G and B. The remaining fraction (1−y−x) of the pixel intensity should be assigned to R
  ● The triangle contains all the representable colors in this color space
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

25
2023秋

RGB to Grayscale Conversion

A grayscale digital image is an image in which the value


of each pixel carries only intensity information.
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Color Calculating Formula

● For each pixel (r g b) at (I, J) do:
  grayPixel[I,J] = 0.21*r + 0.71*g + 0.07*b
● This is just a dot product <[r,g,b], [0.21,0.71,0.07]>, with the constants being specific to the input RGB space

(Figure: bars showing the relative weights 0.21, 0.71, 0.07.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

26
2023秋

RGB to Grayscale Conversion Code

#define CHANNELS 3  // we have 3 channels corresponding to RGB
// The input image is encoded as unsigned characters [0, 255]
__global__ void colorConvert(unsigned char * grayImage,
                             unsigned char * rgbImage,
                             int width, int height) {
  int x = threadIdx.x + blockIdx.x * blockDim.x;
  int y = threadIdx.y + blockIdx.y * blockDim.y;

  if (x < width && y < height) {
    // get 1D coordinate for the grayscale image
    int grayOffset = y*width + x;
    // one can think of the RGB image having CHANNELS times more columns
    // than the grayscale image
    int rgbOffset = grayOffset*CHANNELS;
    unsigned char r = rgbImage[rgbOffset    ];   // red value for pixel
    unsigned char g = rgbImage[rgbOffset + 1];   // green value for pixel
    unsigned char b = rgbImage[rgbOffset + 2];   // blue value for pixel
    // perform the rescaling and store it
    // We multiply by floating point constants
    grayImage[grayOffset] = 0.21f*r + 0.71f*g + 0.07f*b;
  }
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.


27
2023秋


Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling

28
2023秋

Image Blurring

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Blurring Box

(Figure: the box of pixels processed by one thread block.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

29
2023秋

Image Blur as a 2D Kernel


__global__
void blurKernel(unsigned char * in, unsigned char * out, int w, int h)
{
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  int Row = blockIdx.y * blockDim.y + threadIdx.y;

  if (Col < w && Row < h) {
    ... // Rest of our kernel
  }
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

__global__
void blurKernel(unsigned char * in, unsigned char * out, int w, int h) {
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  int Row = blockIdx.y * blockDim.y + threadIdx.y;

  if (Col < w && Row < h) {
    int pixVal = 0;
    int pixels = 0;

    // Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
    for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
      for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
          pixVal += in[curRow * w + curCol];
          pixels++; // Keep track of number of pixels in the accumulated total
        }
      }
    }

    // Write our new pixel value out
    out[Row * w + Col] = (unsigned char)(pixVal / pixels);
  }
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

30
2023秋

Lecture Contents
– Kernel-Based SPMD Parallel Programming
– Multidimensional Kernel Configuration
– Color-to-Grayscale Image Processing Example
– Image Blur Example
– Thread Scheduling

Transparent Scalability

(Figure: the same thread grid of Blocks 0–7 mapped onto a small device that runs two blocks at a time and onto a larger device that runs four blocks at a time.)

● Each block can execute in any order relative to others
● Hardware is free to assign blocks to any processor at any time
● A kernel scales to any number of parallel processors

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

31
2023秋

Example: Executing Thread Blocks

– Threads are assigned to Streaming Multiprocessors (SMs) in block granularity
  – Up to 32 blocks to each SM, as resources allow
  – A Volta SM can take up to 2048 threads
    – Could be 256 (threads/block) * 8 blocks
    – Or 512 (threads/block) * 4 blocks, etc.
– The SM maintains thread/block idx #s
– The SM manages/schedules thread execution

(Figure: an SM with SPs, shared memory, and blocks of threads t0, t1, t2, …, tm assigned to it.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Warps as Scheduling Units

• Each Block is executed as 32-thread Warps


–An implementation decision, not part of the
CUDA programming model
–Warps are scheduling units in SM
–Threads in a warp execute in SIMD
–Future GPUs may have different number of
threads in each warp
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

32
2023秋

Warp Example
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  – Each block is divided into 256/32 = 8 warps
  – There are 8 * 3 = 24 warps

(Figure: Block 0, Block 1, and Block 2 warps, each warp covering threads t0 … t31.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Example: Thread Scheduling (Cont.)


–SM implements zero-overhead warp
scheduling
–Warps whose next instruction has its operands
ready for consumption are eligible for
execution
–Eligible Warps are selected for execution based
on a prioritized scheduling policy
–All threads in a warp execute the same
instruction when selected
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

33
2023秋

Block Granularity Considerations


– For Matrix Multiplication using multiple blocks, should
each block have 4X4, 8X8 or 30X30 threads for Volta?
– For 4X4, we have 16 threads per Block. Each SM can take
up to 2048 threads, which translates to 128 Blocks. However,
each SM can only take up to 32 Blocks, so only 512 threads
will go into each SM!
– For 8X8, we have 64 threads per Block. Since each SM can
take up to 2048 threads, it can take up to 32 Blocks and
achieve full capacity unless other resource considerations
overrule.
– For 30X30, we would have 900 threads per Block. Only two
blocks could fit into an SM for Volta, so only 1800/2048 of
the SM thread capacity would be utilized.
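These limits differ between GPU generations, so in practice they are usually queried at run time rather than hard-coded. A minimal sketch using the CUDA runtime; querying device 0 is an assumption for illustration:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // query device 0
  printf("max threads per block:          %d\n", prop.maxThreadsPerBlock);
  printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
  printf("shared memory per block:        %zu bytes\n", prop.sharedMemPerBlock);
  return 0;
}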

34
2023秋

Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming (GPU架构与编程)
Lecture 2: Memory and Data Locality
Zhao Di (赵地)
Institute of Computing Technology, Chinese Academy of Sciences (中科院计算所)
Fall Semester 2023

Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms

35
2023秋

Review: Image Blur Kernel.


// Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
  for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
    int curRow = Row + blurRow;
    int curCol = Col + blurCol;
    // Verify we have a valid image pixel
    if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
      pixVal += in[curRow * w + curCol];
      pixels++; // Keep track of number of pixels in the accumulated total
    }
  }
}
// Write our new pixel value out
out[Row * w + Col] = (unsigned char)(pixVal / pixels);
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

How about performance on a GPU?

– All threads access global memory for their input matrix elements
  – One memory access (4 bytes) per floating-point addition
  – 4 bytes of memory traffic per FLOP

– Assume a GPU with
  – Peak floating-point rate of 1,600 GFLOPS and 600 GB/s DRAM bandwidth
  – 4*1,600 = 6,400 GB/s required to achieve the peak FLOPS rating
  – The 600 GB/s memory bandwidth limits execution to 150 GFLOPS

– This limits the execution rate to 9.3% (150/1,600) of the peak floating-point execution rate of the device!
– Need to drastically cut down memory accesses to get close to the 1,600 GFLOPS

36
2023秋

Example – Matrix Multiplication

(Figure: P = M × N, all of size WIDTH × WIDTH; each thread computes the P element at (Row, Col) from row Row of M and column Col of N; BLOCK_WIDTH marks the thread-block tile.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

A Basic Matrix Multiplication


__global__ void MatrixMulKernel(float* M, float* N, float* P, int
Width) {
// Calculate the row index of the P element and M
int Row = blockIdx.y*blockDim.y+threadIdx.y;
// Calculate the column index of P and N
int Col = blockIdx.x*blockDim.x+threadIdx.x;
if ((Row < Width) && (Col < Width)) {
float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k) {
Pvalue += M[Row*Width+k]*N[k*Width+Col];
}
P[Row*Width+Col] = Pvalue;
}
}

37
2023秋


A Toy Example: Thread to P Data Mapping

(Figure: a 4×4 P matrix divided into four 2×2 blocks, BLOCK_WIDTH = 2. Block(0,0) covers P0,0 P0,1 P1,0 P1,1 with Thread(0,0) … Thread(1,1); Block(0,1), Block(1,0), and Block(1,1) cover the remaining 2×2 tiles.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

38
2023秋

Calculation of P0,0 and P0,1

(Figure: P0,0 is the dot product of row 0 of M (M0,0 … M0,3) with column 0 of N (N0,0 … N3,0); P0,1 uses the same row of M with column 1 of N.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Programmer View of CUDA Memories

(Figure: a grid of blocks; each block has its own shared memory and per-thread registers; all blocks and the host access global memory and constant memory.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

39
2023秋

Declaring CUDA Variables

Variable declaration                          Memory     Scope    Lifetime
int LocalVar;                                 register   thread   thread
__device__ __shared__   int SharedVar;        shared     block    block
__device__              int GlobalVar;        global     grid     application
__device__ __constant__ int ConstantVar;      constant   grid     application

– __device__ is optional when used with __shared__ or __constant__
– Automatic variables reside in a register
  – Except per-thread arrays, which reside in global memory
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Example: Shared Memory Variable Declaration

__global__ void blurKernel(unsigned char * in, unsigned char * out, int w, int h)
{
  __shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];
  ...
}
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

40
2023秋

Where to Declare Variables?

Can the host access it?
  – Yes → declare outside of any function: global, constant
  – No → declare in the kernel: register, shared

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Shared Memory in CUDA

– A special type of memory whose contents are explicitly defined and used in the kernel source code
  – One in each SM
  – Accessed at much higher speed (in both latency and throughput) than global memory
  – Scope of access and sharing: thread block
  – Lifetime: thread block; contents disappear after the corresponding thread block terminates execution
  – Accessed by memory load/store instructions
  – A form of scratchpad memory in computer architecture

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

41
2023秋

Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms

Global Memory Access Pattern of the Basic Matrix Multiplication Kernel

(Figure: Thread 1 and Thread 2 each stream their M and N elements directly from global memory.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

42
2023秋

Basic Concept of Tiling


– In a congested traffic system, a significant reduction in the number of vehicles can greatly reduce the delay seen by all vehicles
  – Carpooling for commuters
  – Tiling for global memory accesses
    – drivers = threads accessing their memory data operands
    – cars = memory access requests

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Some Computations are More Challenging to Tile


– Some carpools may be easier than others
– Car pool participants need to have similar work schedule
– Some vehicles may be more suitable for carpooling

– Similar challenges exist in tiling

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

43
2023秋

Carpools need synchronization.

– Good: when people have similar schedules

(Figure: Worker A and Worker B both follow sleep → work → dinner on the same timeline, so they can ride together.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Carpools need synchronization.

– Bad: when people have very different schedules

(Figure: Worker A follows party → sleep → work while Worker B follows sleep → work → dinner, so their timelines do not line up.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

44
2023秋

Same with Tiling

– Good: when threads have similar access timing
– Bad: when threads have very different timing

(Figure: timelines for Thread 1 and Thread 2; overlapping accesses can be served by one tile load, widely separated accesses cannot.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Outline of the Tiling Technique

– Identify a tile of global memory contents that is accessed by multiple threads
– Load the tile from global memory into on-chip memory
– Use barrier synchronization to make sure that all threads are ready to start the phase
– Have the multiple threads access their data from the on-chip memory
– Use barrier synchronization to make sure that all threads have completed the current phase
– Move on to the next tile
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

45
2023秋

Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms

Matrix Multiplication

– Data access pattern
  – Each thread: a row of M and a column of N
  – Each thread block: a strip of M and a strip of N

(Figure: P = M × N with the strip of M rows and strip of N columns read by one thread block highlighted; BLOCK_WIDTH tiles within WIDTH × WIDTH matrices.)

46
2023秋

Tiled Matrix Multiplication

– Break up the execution of each thread into phases
  – so that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
– The tile is BLOCK_SIZE elements in each dimension

(Figure: P = M × N with one BLOCK_WIDTH × BLOCK_WIDTH tile of M and one tile of N highlighted for the current phase.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Loading a Tile

– All threads in a block participate
– Each thread loads one M element and one N element in the tiled code

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

47
2023秋

Phase 0 Load for Block (0,0)

(Figure: the 2×2 tiles M0,0 M0,1 M1,0 M1,1 and N0,0 N0,1 N1,0 N1,1 are copied from the 4×4 matrices into shared memory.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Phase 0 Use for Block (0,0) (iteration 0)

(Figure: iteration 0 of the inner loop combines column 0 of the M tile with row 0 of the N tile held in shared memory, accumulating into P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

48
2023秋

Phase 0 Use for Block (0,0) (iteration 1)

(Figure: iteration 1 combines column 1 of the M tile with row 1 of the N tile in shared memory, accumulating into P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Phase 1 Load for Block (0,0)

(Figure: the next tiles, M0,2 M0,3 M1,2 M1,3 and N2,0 N2,1 N3,0 N3,1, are copied into shared memory.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

49
2023秋

Phase 1 Use for Block (0,0) (iteration 0)

(Figure: iteration 0 of phase 1 combines column 0 of the new M tile with row 0 of the new N tile, continuing the accumulation into P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Phase 1 Use for Block (0,0) (iteration 1)

(Figure: iteration 1 of phase 1 combines column 1 of the M tile with row 1 of the N tile, completing P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

50
2023秋

Execution Phases of Toy Example

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Execution Phases of Toy Example (cont.)

Shared memory allows each value to be accessed by multiple threads.
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

51
2023秋

Barrier Synchronization
– Synchronize all threads in a block
  – __syncthreads()
– All threads in the same block must reach the __syncthreads() before any of them can move on
– Best used to coordinate the phased execution of tiled algorithms
  – To ensure that all elements of a tile are loaded at the beginning of a phase
  – To ensure that all elements of a tile are consumed at the end of a phase
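A small, self-contained toy example of the first use: every thread writes one element into shared memory, the barrier guarantees the whole tile is present, and only then do threads read elements written by other threads. The kernel name and the assumption of 256-thread blocks are ours, not from the slide:

// Toy example: reverse the elements handled by one block.
__global__ void reverseInBlock(int* data) {
  __shared__ int tmp[256];                       // assumes blockDim.x == 256
  int t = threadIdx.x;
  tmp[t] = data[blockIdx.x * blockDim.x + t];    // each thread loads one element
  __syncthreads();                               // whole tile loaded before anyone reads it
  data[blockIdx.x * blockDim.x + t] = tmp[blockDim.x - 1 - t];
}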

Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms

52
2023秋

Loading Input Tile 0 of M (Phase 0)

– Have each thread load an M element and an N element at the same relative position as its P element.

int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;

2D indexing for accessing Tile 0:
  M[Row][tx]
  N[ty][Col]

(Figure: the TILE_WIDTH × TILE_WIDTH tile of M at the start of row Row, mapped to the P tile at (Row, Col).)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Loading Input Tile 0 of N (Phase 0)

– Same scheme: each thread loads one N element at the same relative position as its P element.

int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;

2D indexing for accessing Tile 0:
  M[Row][tx]
  N[ty][Col]

(Figure: the BLOCK_WIDTH × BLOCK_WIDTH tile of N at the top of column Col.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

53
2023秋

Loading Input Tile 1 of M (Phase 1)

2D indexing for accessing Tile 1:
  M[Row][1*TILE_WIDTH + tx]
  N[1*TILE_WIDTH + ty][Col]

(Figure: the second tile of M along row Row.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Loading Input Tile 1 of N (Phase 1)

2D indexing for accessing Tile 1:
  M[Row][1*TILE_WIDTH + tx]
  N[1*TILE_WIDTH + ty][Col]

(Figure: the second BLOCK_WIDTH tile of N along column Col.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

54
2023秋

M and N are dynamically allocated - use 1D indexing

M[Row][p*TILE_WIDTH+tx]
M[Row*Width + p*TILE_WIDTH + tx]

N[p*TILE_WIDTH+ty][Col]
N[(p*TILE_WIDTH+ty)*Width + Col]
where p is the sequence number of the current phase
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  int Row = by * blockDim.y + ty;
  int Col = bx * blockDim.x + tx;
  float Pvalue = 0;

  // Loop over the M and N tiles required to compute the P element
  for (int p = 0; p < Width/TILE_WIDTH; ++p) {
    // Collaborative loading of M and N tiles into shared memory
    ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
    ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int i = 0; i < TILE_WIDTH; ++i)
      Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
  }
  P[Row*Width + Col] = Pvalue;
}

55
2023秋



56
2023秋

Tile (Thread Block) Size Considerations


– Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
– TILE_WIDTH of 32 gives 32*32 = 1024 threads

– For 16, in each phase, each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations (16 floating-point operations for each memory load)
– For 32, in each phase, each block performs 2*1024 = 2,048 float loads from global memory for 1024 * (2*32) = 65,536 mul/add operations (32 floating-point operations for each memory load)

Shared Memory and Threading


– For an SM with 16KB shared memory
  – Shared memory size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
  – With 16KB of shared memory, one can potentially have up to 8 thread blocks executing
    – This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only 2 thread blocks to be active at the same time
    – However, in a GPU where the thread count is limited to 1536 threads per SM, the number of blocks per SM is reduced to one!

– Each __syncthreads() can reduce the number of active threads for a block
– More thread blocks can be advantageous
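When the tile size is only known at run time, the shared array can instead be declared extern and its size passed as the third launch parameter. A minimal sketch; the kernel and variable names are our own:

// Dynamically sized shared memory: size is supplied at launch time.
__global__ void scaleWithSharedBuffer(float* data, int n) {
  extern __shared__ float buf[];          // sized by the launch configuration
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    buf[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = 2.0f * buf[threadIdx.x];
  }
}

// Launch: the third <<<>>> argument is the shared memory size in bytes per block.
//   scaleWithSharedBuffer<<<numBlocks, 256, 256 * sizeof(float)>>>(d_data, n);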

57
2023秋

Lecture Contents
– CUDA Memories
– Tiled Parallel Algorithms
– Tiled Matrix Multiplication
– Tiled Matrix Multiplication Kernel
– Handling Arbitrary Matrix Sizes in Tiled Algorithms

Handling Matrix of Arbitrary Size

• The tiled matrix multiplication kernel we presented so far can handle only square matrices whose dimensions (Width) are multiples of the tile width (TILE_WIDTH)
• However, real applications need to handle arbitrarily sized matrices
• One could pad (add elements to) the rows and columns into multiples of the tile size, but this would have significant space and data transfer time overhead
• We will take a different approach
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

58
2023秋

Phase 1 Loads for Block (0,0) for a 3x3 Example

(Figure: with a 3×3 matrix and 2×2 tiles, phase 1 loads M0,2 M1,2 and N2,0 N2,1 into shared memory; the remaining tile slots fall outside the matrices.)

– Threads (1,0) and (1,1) need special treatment in loading the N tile
– Threads (0,1) and (1,1) need special treatment in loading the M tile

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Phase 1 Use for Block (0,0) (iteration 0)

(Figure: iteration 0 combines the valid shared-memory elements M0,2, M1,2 with N2,0, N2,1, accumulating into P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

59
2023秋

Phase 1 Use for Block (0,0) (iteration 1)

(Figure: iteration 1 would read tile elements that lie outside the 3×3 matrices.)

– All threads need special treatment. None of them should introduce invalid contributions to their P elements.

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Phase 0 Loads for Block (1,1) for a 3x3 Example

(Figure: Block (1,1) loads M2,0 M2,1 and N0,2 N1,2 into shared memory; the other tile slots fall outside the matrices.)

– Threads (0,1) and (1,1) need special treatment in loading the N tile
– Threads (1,0) and (1,1) need special treatment in loading the M tile

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

60
2023秋

Major Cases in Toy Example

– Threads that do not calculate valid P elements but still need to participate in loading the input tiles
  – Phase 0 of Block(1,1), Thread(1,0): assigned to calculate the non-existent P[3,2] but needs to participate in loading tile element N[1,2]

– Threads that calculate valid P elements may attempt to load non-existent input elements when loading input tiles
  – Phase 0 of Block(0,0), Thread(1,0): assigned to calculate the valid P[1,0] but attempts to load the non-existent N[3,0]

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

A “Simple” Solution

– When a thread is to load any input element, test whether it is in the valid index range
  – If valid, proceed to load
  – Else, do not load; just write a 0

– Rationale: a 0 value will ensure that the multiply-add step does not affect the final value of the output element
– The condition tested for loading input elements is different from the test for calculating the output P element
  – A thread that does not calculate a valid P element can still participate in loading input tile elements
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

61
2023秋

Phase 1 Use for Block (0,0) (iteration 1)

(Figure: the out-of-range tile slots hold 0, so iteration 1 adds nothing to P0,0 P0,1 P1,0 P1,1.)

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Boundary Condition for Input M Tile

– Each thread loads
  – M[Row][p*TILE_WIDTH+tx]
  – M[Row*Width + p*TILE_WIDTH+tx]
– Need to test
  – (Row < Width) && (p*TILE_WIDTH+tx < Width)
  – If true, load the M element
  – Else, load 0

(Figure: a TILE_WIDTH × TILE_WIDTH tile hanging off the right edge of M.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

62
2023秋

Boundary Condition for Input N Tile

– Each thread loads
  – N[p*TILE_WIDTH+ty][Col]
  – N[(p*TILE_WIDTH+ty)*Width + Col]
– Need to test
  – (p*TILE_WIDTH+ty < Width) && (Col < Width)
  – If true, load the N element
  – Else, load 0

(Figure: a tile hanging off the bottom edge of N.)
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Loading Elements – with boundary check

for (int p = 0; p < (Width-1) / TILE_WIDTH + 1; ++p) {
  if (Row < Width && p * TILE_WIDTH + tx < Width) {
    ds_M[ty][tx] = M[Row * Width + p * TILE_WIDTH + tx];
  } else {
    ds_M[ty][tx] = 0.0;
  }
  if (p * TILE_WIDTH + ty < Width && Col < Width) {
    ds_N[ty][tx] = N[(p * TILE_WIDTH + ty) * Width + Col];
  } else {
    ds_N[ty][tx] = 0.0;
  }
  __syncthreads();

63
2023秋

Inner Product – Before and After

  if (Row < Width && Col < Width) {
    for (int i = 0; i < TILE_WIDTH; ++i) {
      Pvalue += ds_M[ty][i] * ds_N[i][tx];
    }
  }
  __syncthreads();
} /* end of outer for loop */

if (Row < Width && Col < Width)
  P[Row*Width + Col] = Pvalue;
} /* end of kernel */

Some Important Points

– For each thread the conditions are different for
  – Loading an M element
  – Loading an N element
  – Calculating and storing output elements

– The effect of control divergence should be small for large matrices
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

64
2023秋

Handling General Rectangular Matrices


– In general, the matrix multiplication is defined in
terms of rectangular matrices
– A j x k M matrix multiplied with a k x l N matrix results in a j x l P matrix

– We have presented square matrix multiplication, a


special case
– The kernel function needs to be generalized to
handle general rectangular matrices
– The Width argument is replaced by three arguments: j, k, l
– When Width is used to refer to the height of M or height of P, replace it with j
– When Width is used to refer to the width of M or height of N, replace it with k
– When Width is used to refer to the width of N or width of P, replace it with l
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
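A sketch of how the loop bounds change once Width is split into the three dimensions described above. This generalization is ours, following the substitution rules on the slide (M is j×k, N is k×l, P is j×l), shown without tiling for brevity:

__global__ void MatrixMulRectKernel(float* M, float* N, float* P,
                                    int j, int k, int l) {
  int Row = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. j-1 (rows of M and P)
  int Col = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. l-1 (columns of N and P)
  if (Row < j && Col < l) {
    float Pvalue = 0;
    for (int i = 0; i < k; ++i)                      // inner dimension is k
      Pvalue += M[Row * k + i] * N[i * l + Col];
    P[Row * l + Col] = Pvalue;
  }
}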

65
