Introduction to GPU Computing
Dr. Sabbi Vamshi Krishna
Course Outline
GPU Server Board
• GPU connections on the server board.
GPU Architecture
• Detailed study of GPU cores, memory hierarchies, and compute units.
Vector Pipeline
• Understanding vector processing in GPUs.
SIMD vs SIMT
• Comparison of SIMD (Single Instruction, Multiple Data) and SIMT (Single Instruction, Multiple Threads).
Row-Major and Column-Major Order
• Memory layout and access patterns.
GPU Programming Models
• Introduction to different programming models for GPUs.
CUDA Programming Models
• Detailed exploration of CUDA programming paradigms.
CUDA Memories
• Understanding various CUDA memory types (global, shared, local).
Thread Divergence
• Impact of divergent branches on performance.
Page Fault and Zero-Copy Concepts
• Memory management techniques including page faults and zero-copy.
GPU Occupancy
• Concepts of Streaming Multiprocessor (SM) and GPU occupancy.
Profiler
• Tools and techniques for profiling GPU performance.
Performance Optimization Techniques
• Techniques for optimizing GPU performance, such as memory coalescing and occupancy tuning.
Python Numba Programming
• Using Numba for GPU acceleration in Python.
CPU System and GPU System
Server Boards
Placement of Components
• CPU, GPU, RAM, and storage: critical components at the heart of computing infrastructure.
• Boards use different architectures to meet diverse processing needs.

CPU Server Board
• Multiple sockets to accommodate high-performance CPUs.
• Handles multiple threads simultaneously and performs several tasks in parallel.
• Focused on balanced performance, providing robust memory bandwidth, high-speed interconnects, and support for large amounts of RAM.
• The backbone of traditional computing infrastructures.

GPU Server Board
• Includes multiple PCIe slots or specialized connectors for GPUs.
• Equipped with advanced features like NVMe storage for high-speed data access, and networking capabilities to handle large-scale data transfers.

[Board diagrams: CPU compute node and GPU compute node layouts, each showing CPU0/CPU1 sockets, motherboard, and storage.]
Peripheral Component Interconnect Express (PCIe) Slots
General-purpose, high-speed interface standard.
Used to connect various components such as GPUs, storage, and network cards.
Characteristics
Universal standard, compatible with a wide range of devices beyond just GPUs.
PCIe 4.0 runs at 16 GT/s per lane (16 billion transfers per second) and uses 128b/130b encoding: of every 130 bits transmitted, 128 bits are actual data and 2 bits are encoding overhead.
Effective bit rate = (128 / 130) × 16 GT/s ≈ 15.75 Gb/s per lane
Data rate = 15.75 Gb/s ÷ 8 ≈ 1.97 GB/s per lane ≈ 2 GB/s
Bandwidth for multiple lanes: 16 lanes × 2 GB/s = 32 GB/s in each direction (see the measurement sketch below)
Operates on a point-to-point connection model.
Higher latency than specialized interconnects, owing to its general-purpose design.
https://serverfault.com/questions/11633/whats-the-bandwidth-and-form-factor-for-pcie-x1-x4-x8-and-x16
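How much of that theoretical 32 GB/s a system actually delivers can be checked by timing a transfer. Below is a minimal CUDA sketch, assuming a CUDA-capable GPU; the 256 MB buffer size is an arbitrary choice, and error checking is omitted for brevity:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256UL << 20;           // 256 MB test buffer
    float *h_buf = nullptr, *d_buf = nullptr;
    cudaMallocHost((void **)&h_buf, bytes);     // pinned host memory: pageable copies fall well short of peak
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Compare against the ~32 GB/s ceiling of a PCIe 4.0 x16 slot.
    printf("Host-to-device bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}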
NVLink
High-speed interconnect technology developed by Nvidia.
Provides GPU-to-GPU and GPU-to-CPU connectivity in Nvidia GPU architectures.
Designed to overcome the limitations of the PCIe interconnect, particularly in high-performance computing and deep learning applications.
Characteristics
Offers higher bandwidth than PCIe: NVLink 2.0 (used in the Tesla V100) provides 50 GB/s of bidirectional bandwidth per link (25 GB/s in each direction), with six links for up to 300 GB/s per GPU when fully connected; NVLink 3.0 (used in the A100) keeps 50 GB/s per link and doubles the link count to twelve, for total GPU-to-GPU bandwidth up to 600 GB/s.
Designed for low latency, which is critical for applications where GPUs need to exchange data frequently and rapidly.
Supports mesh network topologies: connects GPUs directly to each other, enabling them to communicate at full NVLink speed without going through the CPU as PCIe transfers do (see the peer-access sketch below).
More power-efficient than PCIe and effective in dense computing environments.
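The direct GPU-to-GPU path described above is exposed through CUDA's peer-access API. A minimal sketch, assuming a system with at least two GPUs; note that the runtime reports peer capability whether the physical path is NVLink or PCIe:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can device 0 reach device 1 directly?
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);       // second argument (flags) must be 0
        printf("Peer access 0 -> 1 enabled; copies can now bypass the CPU.\n");
    } else {
        printf("No direct peer path; transfers are staged through host memory.\n");
    }
    return 0;
}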
Computing Device

Multi-Core Device (CPU)
• Primary component of computing architecture.
• Designed with a few powerful cores capable of executing complex instructions.
• Allows for concurrent processing of multiple tasks and can handle different aspects of a workload simultaneously.
• Effective for sequential task execution and complex decision-making processes.
• Able to handle a variety of instructions and execute them with high precision.
• Has inherent limitations when dealing with highly parallel tasks: excels at single-threaded performance but may struggle with workloads that require massive parallelism.

Many-Core Device (GPU)
• Initially designed to accelerate rendering tasks in graphical applications.
• Characterized by a highly parallel structure: thousands of smaller cores designed to handle many operations simultaneously.
• Well-suited for tasks that can be parallelized, such as graphics rendering, complex simulations, and large-scale data processing (see the sketch below).
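As a concrete illustration of the many-core model, here is a minimal CUDA sketch of SAXPY (a hypothetical example, not taken from the slides): work a CPU would process in one long serial loop is spread across roughly a million lightweight GPU threads, one element each:

#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: each of the GPU's many cores does a tiny piece of work.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // global thread index
    if (i < n) y[i] = a * x[i] + y[i];                  // guard the final partial block
}

int main() {
    const int n = 1 << 20;                              // one million elements
    float *x = nullptr, *y = nullptr;
    cudaMallocManaged((void **)&x, n * sizeof(float));  // unified memory keeps the sketch short
    cudaMallocManaged((void **)&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);     // 4096 blocks of 256 threads
    cudaDeviceSynchronize();

    printf("y[0] = %.1f (expected 4.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}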
Nvidia GPU Node View
Source: https://www.broadberry.co.uk/tesla-gpu-rackmount-servers
GPU Evolution
The evolution of GPU hardware has been closely linked with changing usage patterns.

Origin
• Came into existence as specialized hardware for graphics computation, with the primary role of accelerating visual output; relying on the CPU for visual rendering (circa 1990) made computers perform slowly.

Driving Forces
• Early GPU development was driven by the appetite for greater realism, more sophisticated effects, higher screen resolutions, and increased frames per second.

Fixed-Function Hardware
• Early GPUs operated with a fixed-function pipeline, dedicated to executing specific graphical tasks.

Programmability
• Programmability was gradually introduced across multiple stages of the pipeline, enabling GPUs to undertake a broader range of tasks beyond graphics alone.

New Capabilities
• Utilizing a GPU in combination with a CPU improved computer speed, since the GPU could conduct several computations simultaneously.

Current Use
• Today, GPUs serve as powerful, programmable accelerators for a wide range of data-parallel workloads, including graphics.

Modern Architecture
• Supports diverse fields such as machine learning, scientific computation, and large-scale simulation, wherever massive parallelism and high throughput are needed.
GPGPU vs GPU
A General-Purpose Graphics Processing Unit (GPGPU) is a graphics processing unit (GPU) that can be used for purposes beyond graphical processing, such as performing computations typically conducted by a Central Processing Unit (CPU).
“Extending the use of the graphics processor to non-graphics workloads is known as General-Purpose GPU (GPGPU) computing.”
Why Called an Accelerator or Co-processor?
“A special-purpose device that supplements the main general-purpose CPU to speed up certain operations.”
“A GPU is an additional hardware component that can perform operations alongside a CPU.”
GPU Types
GPUs come in two flavors.

Integrated GPUs
• A graphics processing engine contained within the CPU.
• Has no dedicated memory; uses the system memory.

Dedicated GPUs
• A GPU on a separate peripheral card.
• Has its own dedicated memory.
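Which flavor a given device is can be checked at runtime. A minimal sketch using CUDA's device-query API; the integrated property distinguishes GPUs that share system memory from dedicated cards with their own memory:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);  // fills name, memory size, integrated flag, etc.
        printf("GPU %d: %s, %.1f GB of %s memory\n",
               d, prop.name, prop.totalGlobalMem / 1e9,
               prop.integrated ? "shared system (integrated)" : "dedicated");
    }
    return 0;
}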
Hardware Performance Comparison: CPU vs GPU
Exploring Key Differences

1. Clock Speed
CPU: high clock speed
GPU: lower clock speed

2. Cores and Threads
CPU: few cores, each individually faster
GPU: many cores, each individually slower

3. Function
CPU: generalized component that handles the main processing functions
GPU: specialized component for parallel computing

4. Processing
CPU: designed for serial instruction processing
GPU: designed for parallel instruction processing (contrasted in the sketch after this list)

5. Suited For
CPU: general-purpose computing applications
GPU: high-performance computing applications

6. Operational Focus
CPU: low latency
GPU: high throughput

7. Interaction with Other Components
CPU: interacts with more computer components, such as RAM, ROM, the basic input/output system (BIOS), and input/output (I/O) ports
GPU: interacts mainly with RAM and the display

8. Versatility
CPU: more versatile, able to execute numerous kinds of tasks
GPU: less versatile, limited to a narrower set of tasks

9. API Limitations
CPU: no API limitations
GPU: restricted to specific APIs

10. Context Switch Latency
CPU: switches slowly between multiple threads
GPU: negligible between warps, since warp scheduling is handled in hardware
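To make difference 4 concrete, here is a short sketch contrasting serial and parallel processing of the same hypothetical scaling operation (not taken from the slides): the CPU version iterates over every element, while the GPU version replaces the loop with one thread per element:

// CPU: one core walks the array serially, element by element.
void scale_cpu(int n, float a, float *v) {
    for (int i = 0; i < n; i++) v[i] *= a;          // n sequential iterations
}

// GPU: the loop disappears; each of n threads handles one element in parallel.
__global__ void scale_gpu(int n, float a, float *v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) v[i] *= a;                           // executed once per thread
}
// Launched as, e.g.: scale_gpu<<<(n + 255) / 256, 256>>>(n, a, d_v);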