Programming languages
Pr. Yahia Benmoussa (yahia.benmoussa@gmail.com)
GPU Programming
●
NVIDIA GPU hardware architecture
●
CUDA Programming model
GPU vs CPU
●
The GPU and the CPU exist because they are designed
with different goals
– CPU is designed to execute a sequence of operations,
called a thread, as fast as possible.
●
transistors are devoted to instruction control
– GPU is designed to execute thousands of threads in parallel
●
transistors are devoted to data processing
GPU architecture
What is CUDA
The CUDA parallel programming model is
designed to overcome the challenge of
transparently scaling parallelism across GPUs with
varying numbers of cores, while maintaining a low
learning curve for programmers familiar with
standard programming languages such as C.
CUDA Programming model
●
The CUDA programming model offers three key
abstractions:
– Hierarchy of thread groups
– Shared memories
– Barrier synchronization
●
These abstractions are simply exposed to the
programmer as a minimal set of language extensions.
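A minimal sketch of how these abstractions appear in code (hypothetical kernel name; a fixed block size of 256 is assumed): one block of threads cooperates through __shared__ memory and a __syncthreads() barrier.

// Assumes the kernel is launched with 256 threads per block and that the
// data size is a multiple of the block size.
__global__ void reverseWithinBlock(int *data)
{
    __shared__ int tile[256];                      // visible to one block only
    int i = blockIdx.x * blockDim.x + threadIdx.x; // thread hierarchy: block + thread index
    tile[threadIdx.x] = data[i];                   // each thread loads one element
    __syncthreads();                               // barrier: the whole block waits here
    data[i] = tile[blockDim.x - 1 - threadIdx.x];  // threads cooperate via shared memory
}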
Thread Hierarchy
●
The programmer can partition the problem into:
– coarse sub-problems that can be solved independently,
in parallel, by blocks of threads,
– finer pieces within each sub-problem that can be solved
cooperatively, in parallel, by all threads within a block
– the grid is composed of all the blocks (see the sketch below)
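A sketch of this two-level decomposition, with a hypothetical kernel and tile size: each block handles one coarse 16×16 tile of a matrix, and each thread within the block handles one element.

__global__ void scaleMatrix(float *M, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row handled by this thread
    if (x < width && y < height)
        M[y * width + x] *= 2.0f;                    // finer piece: one element
}

void launchScale(float *d_M, int width, int height)  // d_M is a device pointer
{
    dim3 threadsPerBlock(16, 16);                    // 256 threads cooperate per block
    dim3 numBlocks((width + 15) / 16, (height + 15) / 16);  // one block per coarse tile
    scaleMatrix<<<numBlocks, threadsPerBlock>>>(d_M, width, height);
}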
Thread Hierarchy
●
This decomposition preserves language expressivity by
allowing threads to cooperate when solving each sub-
problem
●
At the same time, it enables automatic scalability: each
block of threads can be scheduled on any of the available
multiprocessors within a GPU, in any order, concurrently or
sequentially, so that a compiled CUDA program can
execute on any number of multiprocessors
What is CUDA
●
CUDA C++ extends C++ by allowing the
programmer to define C++ functions, called
kernels, that, when called, are executed N
times in parallel by N different CUDA threads,
as opposed to only once like regular C++
functions.
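For illustration, a kernel sketch in the spirit of the vector-addition example from the CUDA C++ Programming Guide: calling it with N threads performs N element-wise additions in parallel, one per thread.

__global__ void VecAdd(const float *A, const float *B, float *C)
{
    int i = threadIdx.x;     // each of the N threads works on a different index
    C[i] = A[i] + B[i];      // executed N times in parallel, once per thread
}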
CUDA Programming model
●
There is a limit to the number of threads per block,
since all threads of a block are expected to reside
on the same streaming multiprocessor core and
must share the limited memory resources of that core
– On current GPUs, a thread block may contain up to
1024 threads
●
The size of the grid depends on the data size
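A small sketch (hypothetical names) of deriving the grid size from the data size N while staying well under the 1024-thread-per-block limit:

void configureLaunch(int N, int &blocksPerGrid, int &threadsPerBlock)
{
    threadsPerBlock = 256;                                        // <= 1024 on current GPUs
    blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover all N elements
}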
CUDA Programming model
●
A kernel is defined using the __global__
declaration specifier and the number of CUDA
threads that execute that kernel for a given
kernel call is specified using a
new <<<...>>> execution configuration syntax
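A complete host-plus-device sketch of this syntax, following the guide's VecAdd example; unified memory (cudaMallocManaged) is used here only to keep the sketch short.

#include <cuda_runtime.h>

__global__ void VecAdd(const float *A, const float *B, float *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;
    float *A, *B, *C;
    cudaMallocManaged(&A, N * sizeof(float));   // accessible from host and device
    cudaMallocManaged(&B, N * sizeof(float));
    cudaMallocManaged(&C, N * sizeof(float));
    for (int i = 0; i < N; ++i) { A[i] = i; B[i] = 2 * i; }

    // Execution configuration: 1 block of N threads.
    VecAdd<<<1, N>>>(A, B, C);
    cudaDeviceSynchronize();                    // wait for the kernel to finish

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}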
CUDA built-in variables
●
threadIdx.x → This variable contains the
thread index within the block
●
blockDim.x → This variable contains the
number of threads per block
●
blockIdx.x → This variable contains the
block index within the grid
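These variables are typically combined to compute a global index, as in this sketch (hypothetical kernel name); the bounds check guards against the extra threads of the last block.

__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index within the grid
    if (i < n)                                      // last block may have threads beyond n
        data[i] += 1;
}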
Dimensions of the block/grid
CUDA Programming model
Where to run your CUDA code?
●
On your PC if it has an NVIDIA GPU!
– CUDA Installation Guide for Linux
●
On the cloud
– Google Colab: 3 types of NVIDIA GPU:
●
T4
●
A100
●
L4
– Kaggle
– Amazon SageMaker Studio Lab
T4 GPU
●
Number of SM = 40
●
Number of cores per SM = 64
●
Total number of cores = 2560
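These figures can be checked at runtime with the CUDA runtime API, as in this small sketch that prints the multiprocessor count of device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s: %d multiprocessors\n", prop.name, prop.multiProcessorCount);
    return 0;
}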
References
●
CUDA C++ Programming Guide