The Architecture of the Graphics Processing Unit (GPU)
P. Bakowski
Evolution of parallel architectures
We can distinguish three generations of massively parallel architectures for scientific computing:
(1) Supercomputers with special processors for vector calculation (Single Instruction, Multiple Data). The Cray-1 (1976) contained 200,000 integrated circuits and could perform 100 million floating-point operations per second (100 MFLOPS). Price: $5 to $8.8 million. Units sold: 85.
Evolution of parallel architectures
(2) Supercomputers with standard microprocessors adapted for massive multiprocessing, operating as Multiple Instruction, Multiple Data (MIMD) computers.
IBM Roadrunner: PowerXCell 8i CPUs and 6480 dual-core AMD Opterons, running Linux.
Consumption: 2.35 MW. Surface: 296 racks, 560 m2. Memory: 103.6 TiB. Performance: 1.042 petaflops. Price: USD $125M.
Evolution of GPU architectures
(3) General-Purpose computing on Graphics Processing Units (GPGPU): a technology based on the circuits integrated into graphics cards.
GPU-based processing
GPUs (Graphics Processing Units) contain hundreds to thousands of arithmetic units. These capacities may be used to accelerate a wide range of computing applications. Example - the nVIDIA GT200, 300, 400, 500 series: up to 48 CUDA cores per streaming multiprocessor.
CPUs and SSE extensions
Modern CPUs integrate specific SIMD units for graphics processing. These units implement the SSE2, SSE3, and SSE4 instruction sets and contain 4 arithmetic units that may operate in parallel on 4 fixed-point or floating-point data elements.
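As an illustration (not from the slides), a minimal C sketch using the standard SSE intrinsics from <xmmintrin.h>: a single _mm_add_ps instruction adds 4 pairs of floats in parallel.

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* 4 packed floats */
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);
    __m128 c = _mm_add_ps(a, b);   /* one instruction, 4 additions in parallel */
    float r[4];
    _mm_storeu_ps(r, c);
    printf("%.1f %.1f %.1f %.1f\n", r[0], r[1], r[2], r[3]);
    return 0;
}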
CPUs and GPUs
GPUs are based on multiple processing units with multiple processing cores (8/16/32 cores per processing unit); they contain register files and shared memories. A graphics card contains a global memory that can be used by all processors (including the CPU), a local memory for each processing unit, and special memories for constant values.
GPUs : streaming multi-processors
The streaming multiprocessors (SMs) integrated in GPUs are SIMD blocks with several arithmetic cores. Each core contains one floating-point (FP) unit and one integer (INT) unit.
8/16/32/48 cores per SM
CPUs and cache memories
CPUs use cache memories to reduce the access latency to main memory. CPU caches occupy an ever-growing share of the processor die area and consume a lot of energy.
Cache memory : latency
GPUs and cache memories
GPUs use caches or shared memory to increase the effective memory bandwidth.
(figure: Global Memory)
GPU memory : transfer data rate
Each GPU multiprocessor has its own memory controller. For example, each memory controller of the nVIDIA GT200 chip provides 8 64-bit communication channels.
(figure: SMs, Shared Memory, Raster Output, 8 * 64-bit channels)
GPU memory : transfer data rate
data_rate = (interface_width / 8) * memory_clock * 2

For the GTX 275:
- bytes on the bus: 448 bits / 8 = 56
- data rate in bytes: 56 * 1224 MHz = 68,544 MB/s
- 68,544 MB/s * 2 = 137,088 MB/s = 137.1 GB/s

The factor of 2 reflects two reads/writes per clock cycle (DDR).
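A quick check of this arithmetic in plain C, using the slide's GTX 275 figures:

#include <stdio.h>

int main(void) {
    int interface_width = 448;   /* memory bus width in bits */
    int memory_clock = 1224;     /* memory clock in MHz */
    /* bytes per transfer * clock * 2 transfers per cycle (DDR) */
    long rate = (long)(interface_width / 8) * memory_clock * 2;
    printf("%ld MB/s = %.1f GB/s\n", rate, rate / 1000.0);  /* 137088 MB/s */
    return 0;
}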
CPU/GPU : execution threads
The GPU hides memory latency through the simultaneous execution of thousands of threads: if one thread is waiting on a memory access, another one may execute in the meantime.
(figure: timeline - thread executes / thread waits / thread executes)
CPU/GPU : execution threads
A CPU may execute 1-2 threads per core; a GPU multiprocessor may maintain up to 1024 threads. The cost of a thread context switch on a CPU core is tens or hundreds of memory cycles; a GPU may switch between threads every clock cycle.
SIMD versus SIMT
SIMD: CPUs exploit their vector processing units for SIMD processing (a single instruction is executed on multiple data elements) - within a single execution thread!
SIMT: GPUs use the SIMT operational mode, where a single instruction is executed by multiple threads. SIMT processing does not require transforming the data into vectors, and it allows arbitrary branching in the threads.
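A minimal CUDA sketch of SIMT (not the slides' code): every thread executes the same kernel but may take its own branch; the kernel name and the operations are illustrative.

__global__ void clamp_or_double(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
    if (i < n) {
        if (data[i] < 0.0f)        /* an arbitrary per-thread branch: */
            data[i] = 0.0f;        /* natural in SIMT, */
        else                       /* awkward in pure SIMD */
            data[i] *= 2.0f;
    }
}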
GPUs and high density computing
GPUs give excellent results when the same sequence of operations is applied to a large number of data elements. The best results are obtained when the number of arithmetic operations greatly exceeds the number of memory accesses. Such high computational density does not require the large cache memory that is necessary in CPUs.
(figure: density of calculations versus memory accesses, from low to high)
GPUs : performance
GPU-based computation
In many cases the performance of GPU-based processing is 5-30 times greater than that of CPU-based processing. The biggest difference - a performance gain of up to 100 times! - is observed for code that is not suited to SSE instructions but maps well onto the GPU's functions.
GPU-based computation
Some examples of synthetic code accelerated by GPUs, compared to the same code vectorized for SSE:
- processing for fluorescence microscopy: 12x
- modeling of molecular dynamics: 8-16x
- modeling of electrostatic fields: 40-120x and 7x
GPU-based computation: speed-up
Comparison of the speed-ups relative to SSE:
From GeForce8 to Tesla
(figures: 8-16 CUDA cores; how many CUDA cores?)
Tesla system S1070
NVIDIA and CUDA
CUDA technology is a software architecture based on nVIDIA hardware. The CUDA language is an extension of the C programming language. It gives access to the GPU instructions and to the video memory for parallel computations. CUDA makes it possible to implement algorithms that run on GeForce 8 cards and on all more recent GPU chips (GeForce 9, GeForce 200, GeForce 300, GeForce 400, GeForce 500), as well as Quadro and Tesla.
NVIDIA and CUDA
NVIDIA and CUDA
The CUDA Toolkit contains:
- the nvcc compiler
- FFT and BLAS libraries
- a profiler
- a gdb debugger for the GPU
- the CUDA runtime driver, included in the nVIDIA drivers
- a programming guide
- the CUDA developer SDK: source code examples and documentation
CUDA : compilation phases
CUDA C code is compiled with nvcc, which is a driver script that invokes other programs: cudacc, g++, cl, etc.
CUDA : compilation phases
nvcc generates: the CPU (host) code, written in pure C and compiled together with the other parts of the application, and the PTX object code for the GPU.
CUDA : compilation phases
Executable files with CUDA code require the CUDA runtime library (cudart) and the base CUDA library.
CUDA : advantages
The main CUDA advantage for GPGPU computing results from the new GPU architecture, designed for the efficient implementation of non-graphics computations, and from the use of the C programming language. There is no need to convert algorithms into the pipelined format required for graphics calculations, and GPGPU code does not go through the graphics API and the corresponding drivers.
CUDA : advantages
CUDA provides:
- access to 16 KB of memory per SM, shared by the SM's threads
- efficient data transfer between system memory and video memory (the global GPU memory)
- memory with a linear addressing scheme and random access to any location
- hardware-implemented operations on floating-point numbers, integers, and bits
CUDA : limitations
Limitations:
- no recursive functions (no stack)
- a minimum processing block of 32 threads (one warp)
- CUDA is a proprietary architecture of nVIDIA
CUDA : programming model
The CUDA programming model is based on groups of threads. Thread blocks, organized in grids of one or two dimensions, cooperate via shared memory and synchronization points. A kernel program is executed on a grid of thread blocks, and only one grid is executed at a time. Each block may be built in one, two, or three dimensions and contain up to 512 threads.
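A minimal sketch of how such a configuration is expressed in CUDA C (my_kernel and d_data are illustrative names; dim3 and the <<< >>> launch syntax are standard):

dim3 grid(16, 16);     /* a two-dimensional grid of thread blocks */
dim3 block(8, 8, 4);   /* a three-dimensional block: 8*8*4 = 256 threads (<= 512) */
my_kernel<<<grid, block>>>(d_data);   /* one kernel runs on one grid at a time */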
CUDA : programming model
Thread blocks are executed in groups of 32 threads called warps. A warp is the minimal unit of work processed by the streaming multiprocessors. CUDA works with thread blocks containing from 32 to 512 threads.
CUDA : memory model
Local and Global Memory are not cached. They are implemented in separate circuits, and their access time is much longer than the register access time.
CUDA : memory model
There are 1024 register entries per SM. Access to these registers is very fast. Each register may store one 32-bit integer or floating-point number.
CUDA : memory model
Global Memory ranges from 256 MB to 2 GB (up to 4 GB in Tesla). Its bandwidth may exceed 100 GB/s, but the latency is high (several hundred clock cycles). There is no cache for Global Memory. Global Memory is used for global data and instructions.
CUDA : memory model
Shared Memory: 16 KB of shared memory for all the cores executing a block of threads. Shared Memory is as fast as the registers.
CUDA : memory model
Constant Memory: 64 KB, read-only for all SM units. Constant Memory is a high-latency memory with an access time of several hundred clock cycles.
CUDA : memory model
That is why Constant Memory data are cached in 8 KB blocks for each SM.
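A minimal sketch of how Constant Memory is declared and filled from the host (the array name coeffs is illustrative; __constant__ and cudaMemcpyToSymbol are standard CUDA):

#include <cuda_runtime.h>

__constant__ float coeffs[64];   /* read-only for the SMs, cached in 8 KB blocks */

int main(void) {
    float h_coeffs[64];
    for (int i = 0; i < 64; i++) h_coeffs[i] = 0.5f * i;
    /* copy from host memory into the GPU's Constant Memory */
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));
    return 0;
}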
CUDA : memory model
Texture Memory is accessible (read-only) to all SMs. Texture data are used directly by the GPU and may be interpolated linearly without additional operations.
CUDA : memory model
Texture Memory has long access latency and is cached.
CUDA : memory model
Typical use of CUDA memories (a sketch follows below):
- divide the task into several sub-tasks
- decompose the input data into blocks that match the Shared Memory size
- each block of data is processed by one block of threads
- load the data blocks from Global Memory into Shared Memory
- process the data in Shared Memory
- copy the results from Shared Memory back to Global Memory
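A minimal CUDA sketch of this pattern, assuming blocks of TILE threads and a simple scaling operation (the kernel name and the operation are illustrative):

#define TILE 256

__global__ void scale_tile(float *g_data, float factor, int n) {
    __shared__ float tile[TILE];       /* per-block Shared Memory */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = g_data[i]; /* load: Global -> Shared */
    __syncthreads();                   /* synchronization point */
    if (i < n) {
        tile[threadIdx.x] *= factor;   /* process in Shared Memory */
        g_data[i] = tile[threadIdx.x]; /* store: Shared -> Global */
    }
}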
CUDA : program example
main() - the function on the CPU side
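The listing itself appears only as an image in the original slides. Below is a minimal sketch of such a host-side main(), assuming an array of 10 integers incremented on the GPU (matching the 10 threads and threadIdx.x used by the kernel on the GPU side, shown below); the names h_a, d_a and inc_kernel are illustrative.

#include <stdio.h>
#include <cuda_runtime.h>

#define N 10

__global__ void inc_kernel(int *a);   /* the GPU-side kernel, defined below */

int main(void) {
    int h_a[N];
    int *d_a;
    for (int i = 0; i < N; i++) h_a[i] = i;          /* prepare host data */

    cudaMalloc((void **)&d_a, N * sizeof(int));      /* allocate Global Memory */
    cudaMemcpy(d_a, h_a, N * sizeof(int),
               cudaMemcpyHostToDevice);              /* system -> video memory */

    inc_kernel<<<1, N>>>(d_a);                       /* 1 block of 10 threads */

    cudaMemcpy(h_a, d_a, N * sizeof(int),
               cudaMemcpyDeviceToHost);              /* video -> system memory */
    cudaFree(d_a);

    for (int i = 0; i < N; i++) printf("%d ", h_a[i]);
    printf("\n");
    return 0;
}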
CUDA : program example
kernel function - on the GPU side: 10 threads; no loop, but several threads, each with its own index threadIdx.x.
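A minimal sketch of the corresponding kernel, assuming an element-wise increment (consistent with the description above: no loop, 10 threads, each indexed by threadIdx.x):

__global__ void inc_kernel(int *a) {
    int i = threadIdx.x;   /* each of the 10 threads picks one element */
    a[i] = a[i] + 1;       /* no loop: the threads replace it */
}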
CUDA and graphic APIs
CUDA programs may exploit the graphics functions provided by graphics APIs (DirectX, OpenGL). These functions provide the image processing operations needed for rasterization, shading, and rendering of images on the screen. This module does not deal with these primitives; however, some OpenGL operations may be used in the practical classes to display images directly from GPU memory.
Summary
- Evolution of multiprocessing
- CPUs and GPUs
- SIMD and SIMT processing modes
- Performance of GPUs
- NVIDIA and CUDA
- CUDA processing model
- CUDA memory model
- A simple example