Lecture 6 3

The document is an introduction to GPU computing, covering topics such as GPU architecture, programming models, and performance optimization techniques. It contrasts CPU and GPU systems, detailing their components, functionalities, and operational focuses. Additionally, it discusses the evolution of GPUs and their applications beyond graphics processing, highlighting the differences between integrated and dedicated GPUs.


Introduction to GPU Computing

Dr. Sabbi Vamshi Krishna


Course Outline

 GPU Server Board
• GPU connection on the server board.
 GPU Architecture
• Detailed study of GPU cores, memory hierarchies, and compute units.
 Vector Pipeline
• Understanding vector processing in GPUs.
 SIMD vs SIMT
• Comparison of SIMD (Single Instruction, Multiple Data) and SIMT (Single Instruction, Multiple Threads).
 Row-Major and Column-Major Order
• Memory layout and access patterns.
 GPU Programming Models
• Introduction to different programming models for GPUs.
 CUDA Programming Models
• Detailed exploration of CUDA programming paradigms.
 CUDA Memories
• Understanding the various CUDA memory types (global, shared, local).
 Thread Divergence
• Impact of divergent branches on performance.
 Page Fault and Zero-Copy Concepts
• Memory management techniques, including page faults and zero-copy.
 GPU Occupancy
• Concepts of the Streaming Multiprocessor (SM) and GPU occupancy.
 Profiler
• Tools and techniques for profiling GPU performance.
 Performance Optimization Techniques
• Techniques for optimizing GPU performance, such as memory coalescing and occupancy tuning.
 Python Numba Programming
• Using Numba for GPU acceleration in Python.



CPU System and GPU System



Server Boards: Placement of Components

[Figure: a CPU compute node and a GPU compute node, each showing a motherboard with CPU0/CPU1 sockets and storage]

 CPU, GPU, RAM, and storage are the critical components at the heart of computing infrastructure.
 Server boards use different architectures to meet diverse processing needs.

CPU Server
 Multiple sockets accommodate high-performance CPUs.
 Handles multiple threads simultaneously and performs several tasks in parallel.
 Focused on balanced performance: robust memory bandwidth, high-speed interconnects, and support for large amounts of RAM.
 The backbone of traditional computing infrastructures.

GPU Server Board
 Includes multiple PCIe slots or specialized connectors.
 Equipped with advanced features such as NVMe storage for high-speed data access, and networking capabilities to handle large-scale data transfers.
Peripheral Component Interconnect Express (PCIe) Slots
 General-purpose, high-speed interface standard.
 Used to connect various components such as GPUs, storage, and network cards.

Characteristics
 Universal standard, compatible with a wide range of devices beyond just GPUs.
 PCIe 4.0 runs at 16 GT/s per lane (16 billion transfers per second) and uses 128b/130b encoding: of every 130 bits transmitted, 128 bits are actual data and 2 bits are encoding overhead.
 Effective bit rate = (128/130) × 16 GT/s ≈ 15.75 Gb/s per lane.
 Data rate = 15.75 / 8 ≈ 1.97 GB/s per lane ≈ 2 GB/s.
 Bandwidth for multiple lanes: 16 lanes × 2 GB/s = 32 GB/s in each direction.
 Operates on a point-to-point connection model.
 Higher latency than specialized interconnects, due to its general-purpose design.

https://serverfault.com/questions/11633/whats-the-bandwidth-and-form-factor-for-pcie-x1-x4-x8-and-x16
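The bandwidth arithmetic above can be reproduced with a short sketch. The function name and defaults are my own; the figures (16 GT/s, 128b/130b encoding, 16 lanes) come from the slide. Note that carrying the exact 1.97 GB/s per lane through gives about 31.5 GB/s for x16; the 32 GB/s figure on the slide uses the rounded 2 GB/s per lane.

```python
# Sketch of the PCIe effective-bandwidth arithmetic from the slide.
# Defaults assume PCIe 4.0: 16 GT/s per lane, 128b/130b encoding, x16 link.

def pcie_bandwidth(gt_per_s=16, data_bits=128, total_bits=130, lanes=16):
    """Return (GB/s per lane, GB/s per direction for the full link)."""
    effective_gbit = gt_per_s * data_bits / total_bits  # ~15.75 Gb/s per lane
    per_lane_gb = effective_gbit / 8                    # ~1.97 GB/s per lane
    return per_lane_gb, per_lane_gb * lanes

per_lane, total = pcie_bandwidth()
print(f"{per_lane:.2f} GB/s per lane, {total:.1f} GB/s per direction for x16")
```

The same function covers other generations by swapping the transfer rate and encoding, e.g. `pcie_bandwidth(gt_per_s=8)` for PCIe 3.0.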
NVLink
 High-speed interconnect technology developed by Nvidia.
 Provides GPU-to-GPU and GPU-to-CPU connectivity in Nvidia GPU architectures.
 Designed to overcome the limitations of the PCIe interconnect, particularly in high-performance computing and deep-learning applications.

Characteristics
 Offers higher bandwidth than PCIe: NVLink 2.0 (Tesla V100) provides up to 25 GB/s per direction per link, with up to 300 GB/s total when fully connected; NVLink 3.0 (A100) supports 50 GB/s per link, with total GPU-to-GPU bandwidth up to 600 GB/s.
 Designed for low latency, which is critical for applications where GPUs need to exchange data frequently and rapidly.
 Supports a mesh network topology: GPUs connect directly to each other at full NVLink speed, without going through the CPU as PCIe requires.
 More power efficient than PCIe and effective in dense computing environments.
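The aggregate figures quoted above can be cross-checked with a small sketch. The link counts (6 on V100, 12 on A100) are assumptions not stated on the slide; dividing the totals by them shows each link carries 50 GB/s of bidirectional traffic, i.e. 25 GB/s in each direction.

```python
# Sketch: cross-checking the aggregate NVLink bandwidth figures.
# Link counts (6 for V100, 12 for A100) are assumed, not from the slide.

def per_link_bandwidth(total_gb_s, links):
    """Bidirectional bandwidth per link, derived from the aggregate figure."""
    return total_gb_s / links

v100 = per_link_bandwidth(300, 6)   # NVLink 2.0, Tesla V100
a100 = per_link_bandwidth(600, 12)  # NVLink 3.0, A100
print(v100, a100)  # 50.0 GB/s bidirectional per link in both cases
```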
Computing Device

Multi-Core Device (CPU)
 Primary component of computing architecture.
 Designed with a few powerful cores capable of executing complex instructions.
 Allows concurrent processing of multiple tasks.
 Effective for sequential task execution and complex decision-making processes.
 Can handle different aspects of a workload simultaneously.
 Able to handle a variety of instructions and execute them with high precision.
 Has inherent limitations when dealing with highly parallel tasks.
 Excels at single-threaded performance, but may struggle with workloads that require massive parallelism.

Many-Core Device (GPU)
 Initially designed to accelerate rendering tasks in graphical applications.
 Characterized by a highly parallel structure.
 Has thousands of smaller cores designed to handle many operations simultaneously.
 Well suited for tasks that can be parallelized, such as graphics rendering, complex simulations, and large-scale data processing.


Nvidia GPU Node View

https://www.broadberry.co.uk/tesla-gpu-rackmount-servers


GPU Evolution

The evolution of GPU hardware has been closely linked with changing usage patterns: Origin → Fixed-Function Hardware → Programmability → Modern Architecture, driven by new capabilities and current use.

Origin
 In the 1990s the Central Processing Unit (CPU) was used for visual rendering, which made computers perform slowly.
 Pairing a GPU with a CPU improved computer speed, since the GPU could conduct several computations simultaneously.
 GPUs came into existence as specialized hardware for graphics computation, with the primary role of accelerating visual output.

Driving Forces
 The appetite for greater realism, more sophisticated effects, higher screen resolutions, and increased frames per second drove early GPU development.

Fixed-Function Hardware
 Early GPUs operated with a fixed-function pipeline, dedicated to executing specific graphical tasks.

Programmability
 Programmability was gradually introduced across multiple stages of the pipeline.
 This enabled GPUs to undertake a broader range of tasks beyond graphics alone.

Current Use
 Today, GPUs serve as powerful, programmable accelerators for a wide range of data-parallel workloads, including graphics.
 They support diverse fields such as machine learning, scientific computation, and large-scale simulation, wherever massive parallelism and high throughput are needed.



GPGPU vs GPU

A General-Purpose Graphics Processing Unit (GPGPU) is a graphics processing unit (GPU) that can be used for purposes beyond graphical processing, such as performing computations typically conducted by a Central Processing Unit (CPU).

"Extending the use of the graphics processor to non-graphic workloads is known as General-Purpose GPU (GPGPU) computing."



Why Called an Accelerator?

"A special-purpose device that supplements the main general-purpose CPU to speed up certain operations."

"A GPU is an additional hardware component that can perform operations alongside a CPU."



GPU Types

GPUs come in two flavors:

Integrated GPUs
 A graphics processor engine contained within the CPU.
 Do not have dedicated memory; use the system memory instead.

Dedicated GPUs
 A GPU on a separate peripheral card.
 Have their own dedicated memory.


Hardware Performance Comparison: CPU vs GPU
Exploring Key Differences

1. Clock Speed
   CPU: High clock speed
   GPU: Lower clock speed

2. Cores and Threads
   CPU: Few cores, but each is faster
   GPU: Many cores, but each is slower

3. Function
   CPU: Generalized component that handles the main processing functions
   GPU: Specialized component for parallel computing

4. Processing
   CPU: Designed for serial instruction processing
   GPU: Designed for parallel instruction processing

5. Suited For
   CPU: General-purpose computing applications
   GPU: High-performance computing applications


6. Operational Focus
   CPU: Low latency
   GPU: High throughput

7. Interaction with Other Components
   CPU: Interacts with more computer components, such as RAM, ROM, the basic input/output system (BIOS), and input/output (I/O) ports
   GPU: Interacts mainly with RAM and the display

8. Versatility
   CPU: More versatile (executes numerous kinds of tasks)
   GPU: Less versatile (executes a limited set of tasks)

9. API Limitations
   CPU: No API limitations
   GPU: Limited to specific APIs (such as CUDA and OpenCL)

10. Context-Switch Latency
    CPU: Switches slowly between multiple threads
    GPU: No inter-warp context-switching cost; warps are scheduled in hardware
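The serial-vs-parallel contrast in point 4 can be seen in miniature below. This is an illustrative CPU-side Python sketch with names of my own choosing; the vectorized form expresses the computation as a single operation over all elements, which is the shape of work a GPU spreads across its many cores.

```python
# Sketch: serial vs data-parallel formulation of the same sum of squares.
import numpy as np

x = np.arange(10, dtype=np.float64)

# Serial style (CPU-like): one element per loop iteration.
total_serial = 0.0
for v in x:
    total_serial += v * v

# Data-parallel style (GPU-like shape): one expression over the whole array.
total_parallel = float(np.dot(x, x))

print(total_serial, total_parallel)  # 285.0 285.0
```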
