The Architecture of the Graphics Processing Unit (GPU)
P. Bakowski
Evolution of parallel architectures
We can distinguish three generations of massively parallel architectures for scientific computing:
(1) Supercomputers with special processors for vector calculation (Single Instruction, Multiple Data). The Cray-1 (1976) contained 200,000 integrated circuits and could perform 100 million floating-point operations per second (100 MFLOPS). Price: $5 to $8.8 million. Units sold: 85.
Evolution of parallel architectures
(2) Supercomputers with standard microprocessors adapted for massive multiprocessing, operating as Multiple Instruction, Multiple Data (MIMD) computers.
IBM Roadrunner: PowerXCell 8i CPUs and 6480 dual-core AMD Opterons, running Linux.
Consumption: 2.35 MW. Surface: 296 racks, 560 m2. Memory: 103.6 TiB. Performance: 1.042 petaflops. Price: USD $125M.
Evolution of GPU architectures
(3) General-Purpose computing on Graphics Processing Units (GPGPU): a technology based on the circuits integrated into graphics cards.
GPU-based processing
GPUs (Graphics Processing Units) contain hundreds to thousands of arithmetic units. These capacities may be used to accelerate a wide range of computing applications. Example - the nVIDIA GT200, 300, 400, 500 series: up to 48 CUDA cores per streaming multiprocessor.
CPUs and SSE extensions
Modern CPUs integrate specific SIMD units for graphics processing. These units implement the SSE2, SSE3, and SSE4 instruction sets and contain 4 arithmetic units that may operate in parallel on 4 fixed-point or floating-point data elements.
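As an illustration (not from the slides), a minimal C sketch using the standard SSE intrinsics from <xmmintrin.h>: a single _mm_add_ps instruction adds 4 pairs of floats in parallel.

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* 4 packed floats */
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);
    __m128 c = _mm_add_ps(a, b);   /* one instruction, 4 additions in parallel */
    float r[4];
    _mm_storeu_ps(r, c);
    printf("%.1f %.1f %.1f %.1f\n", r[0], r[1], r[2], r[3]);
    return 0;
}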
CPUs and GPUs
GPUs are based on multiple processing units with multiple processing cores (8/16/32 cores per processing unit); they contain register files and shared memories. A graphics card contains a global memory that can be used by all processors (including the CPU), a local memory for each processing unit, and special memories for constant values.
GPUs : streaming multi-processors
The streaming multiprocessors (SMs) integrated in GPUs are SIMD blocks with several arithmetic cores. Each core contains one floating-point (FP) unit and one integer (INT) unit.
8/16/32/48 cores per SM
CPUs and cache memories
CPUs use cache memories to reduce the access latency to main memory. CPU caches occupy an ever-growing share of the processor die area and consume a lot of energy.
Cache memory : latency
GPUs and cache memories
GPUs use caches or shared memory to increase the effective memory bandwidth.
(figure: Global Memory)
GPU memory : transfer data rate
Each GPU multiprocessor has its own memory controller. For example, each memory controller of the nVIDIA GT200 chip provides 8 64-bit communication channels.
(figure: SMs, Shared Memory, Raster Output, 8 * 64-bit channels)
GPU memory : transfer data rate
data_rate = (interface_width / 8) * memory_clock * 2

For the GTX 275:
- bytes on the bus: 448 bits / 8 = 56
- data rate in bytes: 56 * 1224 MHz = 68,544 MB/s
- 68,544 MB/s * 2 = 137,088 MB/s = 137.1 GB/s

The factor of 2 reflects two reads/writes per clock cycle (DDR).
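A quick check of this arithmetic in plain C, using the slide's GTX 275 figures:

#include <stdio.h>

int main(void) {
    int interface_width = 448;   /* memory bus width in bits */
    int memory_clock = 1224;     /* memory clock in MHz */
    /* bytes per transfer * clock * 2 transfers per cycle (DDR) */
    long rate = (long)(interface_width / 8) * memory_clock * 2;
    printf("%ld MB/s = %.1f GB/s\n", rate, rate / 1000.0);  /* 137088 MB/s */
    return 0;
}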
CPU/GPU : execution threads
The GPU hides memory latency through the simultaneous execution of thousands of threads: if one thread is waiting on a memory access, another one may execute in the meantime.
(figure: timeline - thread executes / thread waits / thread executes)
CPU/GPU : execution threads
A CPU may execute 1-2 threads per core; a GPU multiprocessor may maintain up to 1024 threads. The cost of a thread context switch on a CPU core is tens or hundreds of memory cycles; a GPU may switch between threads every clock cycle.
SIMD versus SIMT
SIMD: CPUs exploit their vector processing units for SIMD processing (a single instruction is executed on multiple data elements) - within a single execution thread!
SIMT: GPUs use the SIMT operational mode, where a single instruction is executed by multiple threads. SIMT processing does not require transforming the data into vectors, and it allows arbitrary branching in the threads.
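A minimal CUDA sketch of SIMT (not the slides' code): every thread executes the same kernel but may take its own branch; the kernel name and the operations are illustrative.

__global__ void clamp_or_double(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
    if (i < n) {
        if (data[i] < 0.0f)        /* an arbitrary per-thread branch: */
            data[i] = 0.0f;        /* natural in SIMT, */
        else                       /* awkward in pure SIMD */
            data[i] *= 2.0f;
    }
}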
GPUs and high density computing
GPUs give excellent results when the same sequence of operations is applied to a large number of data elements. The best results are obtained when the number of arithmetic operations greatly exceeds the number of memory accesses. Such high computational density does not require the large cache memory that is necessary in CPUs.
(figure: density of calculations versus memory accesses, from low to high)
GPUs : performance
GPU-based computation
In many cases the performance of GPU-based processing is 5-30 times greater than that of CPU-based processing. The biggest difference - a performance gain of up to 100 times! - is observed for code that is not suited to SSE instructions but maps well onto the GPU's functions.
GPU-based computation
Some examples of synthetic code accelerated by GPUs, compared to the same code vectorized for SSE:
- processing for fluorescence microscopy: 12x
- modeling of molecular dynamics: 8-16x
- modeling of electrostatic fields: 40-120x and 7x
GPU-based computation: speed-up
Comparison of the speed-ups relative to SSE:
From GeForce8 to Tesla
(figures: 8-16 CUDA cores; how many CUDA cores?)
Tesla system S1070
NVIDIA and CUDA
CUDA technology is a software architecture based on nVIDIA hardware. The CUDA language is an extension of the C programming language. It gives access to the GPU instructions and to the video memory for parallel computations. CUDA makes it possible to implement algorithms that run on GeForce 8 cards and on all more recent GPU chips (GeForce 9, GeForce 200, GeForce 300, GeForce 400, GeForce 500), as well as Quadro and Tesla.
NVIDIA and CUDA
NVIDIA and CUDA
The CUDA Toolkit contains:
- the nvcc compiler
- FFT and BLAS libraries
- a profiler
- a gdb debugger for the GPU
- the CUDA runtime driver, included in the nVIDIA drivers
- a programming guide
- the CUDA developer SDK: source code examples and documentation
CUDA : compilation phases
CUDA C code is compiled with nvcc, which is a driver script that invokes other programs: cudacc, g++, cl, etc.
CUDA : compilation phases
nvcc generates: the CPU (host) code, written in pure C and compiled together with the other parts of the application, and the PTX object code for the GPU.
CUDA : compilation phases
Executable files with CUDA code require the CUDA runtime library (cudart) and the base CUDA library.
CUDA : advantages
The main CUDA advantage for GPGPU computing results from the new GPU architecture, designed for the efficient implementation of non-graphics computations, and from the use of the C programming language. There is no need to convert algorithms into the pipelined format required for graphics calculations, and GPGPU code does not go through the graphics API and the corresponding drivers.
CUDA : advantages
CUDA provides:
- access to 16 KB of memory per SM, shared by the SM's threads
- efficient data transfer between system memory and video memory (the global GPU memory)
- memory with a linear addressing scheme and random access to any location
- hardware-implemented operations on floating-point numbers, integers, and bits
CUDA : limitations
Limitations:
- no recursive functions (no stack)
- a minimum processing block of 32 threads (one warp)
- CUDA is a proprietary architecture of nVIDIA
CUDA : programming model
The CUDA programming model is based on groups of threads. Thread blocks, organized in grids of one or two dimensions, cooperate via shared memory and synchronization points. A kernel program is executed on a grid of thread blocks, and only one grid is executed at a time. Each block may be built in one, two, or three dimensions and contain up to 512 threads.
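A minimal sketch of how such a configuration is expressed in CUDA C (my_kernel and d_data are illustrative names; dim3 and the <<< >>> launch syntax are standard):

dim3 grid(16, 16);     /* a two-dimensional grid of thread blocks */
dim3 block(8, 8, 4);   /* a three-dimensional block: 8*8*4 = 256 threads (<= 512) */
my_kernel<<<grid, block>>>(d_data);   /* one kernel runs on one grid at a time */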
CUDA : programming model
Thread blocks are executed in groups of 32 threads called warps. A warp is the minimal unit of work processed by the streaming multiprocessors. CUDA works with thread blocks containing from 32 to 512 threads.
CUDA : memory model
Local and Global Memory are not cached. They are implemented in separate circuits, and their access time is much longer than the register access time.
CUDA : memory model
There are 1024 register entries per SM. Access to these registers is very fast. Each register may store one 32-bit integer or floating-point number.
CUDA : memory model
Global Memory ranges from 256 MB to 2 GB (up to 4 GB in Tesla). Its bandwidth may exceed 100 GB/s, but the latency is high (several hundred clock cycles). There is no cache for Global Memory. Global Memory is used for global data and instructions.
CUDA : memory model
Shared Memory: 16 KB of shared memory for all the cores executing a block of threads. Shared Memory is as fast as the registers.
CUDA : memory model
Constant Memory: 64 KB, read-only for all SM units. Constant Memory is a high-latency memory with an access time of several hundred clock cycles.
CUDA : memory model
That is why Constant Memory data are cached in 8 KB blocks for each SM.
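A minimal sketch of how Constant Memory is declared and filled from the host (the array name coeffs is illustrative; __constant__ and cudaMemcpyToSymbol are standard CUDA):

#include <cuda_runtime.h>

__constant__ float coeffs[64];   /* read-only for the SMs, cached in 8 KB blocks */

int main(void) {
    float h_coeffs[64];
    for (int i = 0; i < 64; i++) h_coeffs[i] = 0.5f * i;
    /* copy from host memory into the GPU's Constant Memory */
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));
    return 0;
}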
CUDA : memory model
Texture Memory is accessible (read-only) to all SMs. Texture data are used directly by the GPU and may be interpolated linearly without additional operations.
CUDA : memory model
Texture Memory has long access latency and is cached.
CUDA : memory model
Typical use of CUDA memories (a sketch follows below):
- divide the task into several sub-tasks
- decompose the input data into blocks that match the Shared Memory size
- each block of data is processed by one block of threads
- load the data blocks from Global Memory into Shared Memory
- process the data in Shared Memory
- copy the results from Shared Memory back to Global Memory
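A minimal CUDA sketch of this pattern, assuming blocks of TILE threads and a simple scaling operation (the kernel name and the operation are illustrative):

#define TILE 256

__global__ void scale_tile(float *g_data, float factor, int n) {
    __shared__ float tile[TILE];       /* per-block Shared Memory */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = g_data[i]; /* load: Global -> Shared */
    __syncthreads();                   /* synchronization point */
    if (i < n) {
        tile[threadIdx.x] *= factor;   /* process in Shared Memory */
        g_data[i] = tile[threadIdx.x]; /* store: Shared -> Global */
    }
}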
CUDA : program example
main() - the function on the CPU side
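The listing itself appears only as an image in the original slides. Below is a minimal sketch of such a host-side main(), assuming an array of 10 integers incremented on the GPU (matching the 10 threads and threadIdx.x used by the kernel on the GPU side, shown below); the names h_a, d_a and inc_kernel are illustrative.

#include <stdio.h>
#include <cuda_runtime.h>

#define N 10

__global__ void inc_kernel(int *a);   /* the GPU-side kernel, defined below */

int main(void) {
    int h_a[N];
    int *d_a;
    for (int i = 0; i < N; i++) h_a[i] = i;          /* prepare host data */

    cudaMalloc((void **)&d_a, N * sizeof(int));      /* allocate Global Memory */
    cudaMemcpy(d_a, h_a, N * sizeof(int),
               cudaMemcpyHostToDevice);              /* system -> video memory */

    inc_kernel<<<1, N>>>(d_a);                       /* 1 block of 10 threads */

    cudaMemcpy(h_a, d_a, N * sizeof(int),
               cudaMemcpyDeviceToHost);              /* video -> system memory */
    cudaFree(d_a);

    for (int i = 0; i < N; i++) printf("%d ", h_a[i]);
    printf("\n");
    return 0;
}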
CUDA : program example
kernel function - on the GPU side: 10 threads; no loop, but several threads, each with its own index threadIdx.x.
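A minimal sketch of the corresponding kernel, assuming an element-wise increment (consistent with the description above: no loop, 10 threads, each indexed by threadIdx.x):

__global__ void inc_kernel(int *a) {
    int i = threadIdx.x;   /* each of the 10 threads picks one element */
    a[i] = a[i] + 1;       /* no loop: the threads replace it */
}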
CUDA and graphic APIs
CUDA programs may exploit the graphics functions provided by graphics APIs (DirectX, OpenGL). These functions provide the image processing operations needed for rasterization, shading, and rendering of images on the screen. This module does not deal with these primitives; however, some OpenGL operations may be used in the practical classes to display images directly from GPU memory.
Summary
- Evolution of multiprocessing
- CPUs and GPUs
- SIMD and SIMT processing modes
- Performance of GPUs
- NVIDIA and CUDA
- CUDA processing model
- CUDA memory model
- A simple example