High Performance Computing
• What is HPC?
• Who needs high performance systems?
• How do you achieve high performance?
• How to analyze or evaluate performance?
• Power-Performance Tradeoff: Green Computing
• Best architecture/design for a problem
• Parallel Architecture: Design and Programming
• Cloud Computing, FOG/EDGE Computing/IoT
What are Supercomputers Used For?
Scientific simulations
Animated graphics
Analysis of geological data
Nuclear energy research and meteorology
Computational fluid dynamics
Analysis of business data
Online Sales
Analysis of social data
Social media: Facebook, YouTube, LinkedIn, …
How do you achieve high performance?
Performance: FLOPS or MIPS
High Performance => Increase FLOPS
How?
How do you achieve high performance?
How?
Increase the number of FPUs in the system
Increase the number of processors in the system
Increase the amount of registers/cache/RAM in the system
Use a different cache/RAM mapping/management policy
Restructure the program, use a different language
Use a different compiler
Use different algorithms/approaches for the same problem
Constraints: Cost, AMC, Power Consumption
How?
Increase the number of FPUs in the system
Vector Processors (SIMD, SSE, MMX), GPU Accelerators
Increase the number of processors in the system
Core i3/i5/i7, Ryzen R3/R5/R7: Dual/Quad/Hexa/Octa cores
Intel Xeon: 4, 6, 8, 10, 12, 16, 18, 20, 24, 38 cores
Intel Xeon Phi (KNL): 72 cores / 288 threads
AMD Threadripper: 8, 16, 32, 64 cores
Increase the amount of registers/cache/RAM in the system
Big register file/cache: power hungry
RAM/NVRAM/SSD: no disk movement, fast but costly
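The software knobs above (restructuring the program, cache-optimized code) can matter as much as hardware. A minimal C sketch (illustrative only, not from the slides): the same matrix sum, where loop order alone decides whether memory is walked with unit stride or with stride N, and hence how well the cache is used.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1024

/* C stores 2-D arrays row-major, so the row-by-row loop walks
 * memory contiguously (cache friendly), while the column-by-column
 * loop jumps N doubles per access and misses far more often. */
double sum_row_major(double (*a)[N])
{
    double s = 0.0;
    for (int i = 0; i < N; i++)        /* rows outer              */
        for (int j = 0; j < N; j++)    /* cols inner: unit stride */
            s += a[i][j];
    return s;
}

double sum_col_major(double (*a)[N])
{
    double s = 0.0;
    for (int j = 0; j < N; j++)        /* cols outer              */
        for (int i = 0; i < N; i++)    /* rows inner: stride N    */
            s += a[i][j];
    return s;
}

int main(void)
{
    double (*a)[N] = malloc(sizeof(double[N][N]));
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    /* Same result, very different cache behavior and runtime. */
    printf("%f %f\n", sum_row_major(a), sum_col_major(a));
    free(a);
    return 0;
}
```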
Technology Trends
• Desktop 8086/80386
– Processor, Motherboard, Co-Processor (Floating Point Unit), Graphics Card, RAM, Audio, Ethernet
• Desktop Pentium
– Processor (Co-Processor inside) + Motherboard (Audio, Ethernet) + Graphics Card
• Desktop PIV
– Processor + Motherboard (Graphics + Audio + Ethernet)
• Desktop Core
– Processor (Graphics inside) + Board (Audio, Ethernet)
• Mobile SoC
– Processor + Graphics + Board (almost everything in one chip)
Technology Trend
• Performance is no longer the main issue
– Power, Energy, Cost
– DVFS: run at a lower frequency to reduce power/energy consumption
• Most modern-day servers are
– Underutilized (cores, RAM)
– Same for laptops/desktops/mobiles
• Underutilization
– Wastes resources that could be shared with others
– Sharing methodology: virtualization
– Leads to Cloud Computing
Technology Trend
• Cloud Computing
– Economy: similar to OLA/UBER
– Renting model
• IoT: many things on the Internet
– Control and management at large scale
– Sensors and actuators
• Fog
– Peer computing, multiple levels
• Edge
– Computing at the edge, latency sensitive
Cloud/IoT/Edge/Fog
[Figure: users (e.g., in India) connect through compute nodes to a cloud system (Amazon EC2, Google Cloud, MS Azure) hosted on servers in the USA.]
[Figure: the same setup with edge servers located in India between the Indian users and the USA cloud servers.]
[Figure: fog computing (FC) nodes form intermediate layers between users and the cloud system.]
Technology Trend
• Single processor/single computer
– Single processor with SIMD instructions
• Multi-computer
– Cluster; data needs to travel outside the PC via LAN cables
• Multiprocessor
– Tightly coupled; data need not travel outside the PC or outside the board
• Processor + accelerator
– PCI or board-level communication
• Processor and accelerator on the same chip
– On-chip, high BW, e.g., Intel Core (graphics in chip)
• 3D chips
Quest for Performance
• Pipelining
• Superscalar Architecture
• Out-of-Order Execution
• Caches, SMT
• ISA Advancements
(single-processor techniques: past research)
• Parallelism
– Multi-core processors
– Clusters
– Grid, Cloud Systems
(this is the current and future trend)
Trend of HPC
• HPC system
– Multiple nodes/computers/blades
– Programming model: MPI
• Nodes are multicore
– Nodes have accelerators
– Programming models: OpenMP, OpenCL/CUDA
• Core
– Multithreaded
– With vector instructions
– 4-issue OoO pipelines, multilevel caches
– Programming model: gcc-optimized, vectorized code, OpenMP (a hybrid sketch follows below)
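To make the layering concrete, here is a minimal hybrid sketch (not from the slides; assumes an MPI installation and an OpenMP-capable compiler): MPI distributes work across nodes, OpenMP threads use the cores within a node, and the inner loop is simple enough for the compiler to vectorize.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);          /* MPI: typically one process per node */
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank sums its own slice; OpenMP threads use the cores,
     * and the loop body is vectorizable by the compiler
     * (e.g., build with: mpicc -O3 -fopenmp). */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < N; i += nprocs)
        local += (double)i * 0.5;

    double total = 0.0;              /* MPI: combine results across nodes */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```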
Need to study in HPC: User Perspective
• Single processor
– Architecture: core pipeline, core multithreading, cache hierarchy, SIMD
– C/C++ optimization methods: gcc, OpenMP, cache-optimized code
• Multicore node
– Multicore, accelerators, interconnections
– OpenMP model, CUDA model, accelerated model
• HPC server
– Multiple nodes/blades, interconnection, storage
Param Ishan HPC
• HPC system: a data center
• Many racks, many rack servers per rack
• Nodes are multicore: one rack server is one node
[Figure: node/rack server]
Param Ishan SC
• Login nodes:
– 2x CPU login nodes, 1x GPU login node
• Head and management:
– 1 pair of head nodes (in redundant mode), 1x management node
• Compute nodes:
– 126x compute nodes
– 4x high-memory compute nodes
– 16x CPU-GPU hybrid compute nodes
– 16x CPU-MIC hybrid compute nodes
• Network: FDR InfiniBand
• Storage:
– 150TB high-throughput scratch space
– 100TB high-throughput home area
– 50TB archival for long-term data storage
HPC: overall
• Top 500 HPC: multiprocessor, accelerator based
• Applications: programming model, management
• Cost of HPC: initial cost (system: racks, rack servers, SAS), space, AC, …
• Running cost of HPC: AMC, energy, management
• HPC on rent:
– VMs, management, revenue model, cost model
– Cloud model: IaaS, PaaS, SaaS (Infra/Platform/Software)
Processor Trends
In the “old days” of scientific supercomputing, leading-edge high performance systems were specially designed for the HPC market by companies like Cray, CDC, NEC, Fujitsu, or Thinking Machines.
Today the HPC world is dominated by cost-effective, off-the-shelf systems with processors that were not primarily designed for scientific computing.
Stored-program computer architecture (SISD)
During the last decade, multicore processors have superseded traditional single-core designs. In a multicore chip, several processors (“cores”) execute code concurrently.
Performance metrics and benchmarks brought architectural changes, such as L2 caches and floating-point units, to increase speed.
Transistors galore: Moore’s Law
• Increasing chip transistor counts and clock speeds have enabled processor designers to implement many advanced techniques.
• A multitude of concepts have been developed, including the following:
1. Pipelined functional units
2. Instruction-level parallelism (ILP)
3. Superscalar architecture: “direct” instruction-level parallelism, enabling an instruction throughput of more than one per cycle
4. Data parallelism through SIMD instructions. SIMD (Single Instruction Multiple Data) instructions issue identical operations on a whole array of integer or FP operands held in special registers
5. Out-of-order execution: if arguments to instructions are not available in registers “on time,” the hardware can execute later instructions whose operands are ready, out of program order
6. Larger caches
7. The RISC paradigm
8. Multicore processors
9. Pipelining
• Out-of-order execution and compiler optimization must work together to fully exploit superscalarity.
• However, even on the most advanced architectures it is extremely hard for compiler-generated code to achieve a throughput of more than 2–3 instructions per cycle (see the vectorization sketch below).
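As an illustration of points 3–5, a minimal C sketch of a loop that compilers auto-vectorize (assuming gcc on a SIMD-capable CPU; the function name is just for illustration):

```c
#include <stddef.h>

/* SAXPY-style loop: y[i] += a * x[i].
 * With gcc -O3 -march=native this typically compiles to packed
 * SIMD instructions (SSE/AVX), processing several floats per cycle
 * and keeping the pipelined FP units busy. 'restrict' promises the
 * arrays do not alias, which is what lets the compiler vectorize. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```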
Cache mapping
So far we have implicitly assumed that there is no restriction on which cache line can be associated with which memory locations. A cache design that follows this rule is called fully associative. The decision of which cache line to replace next when the cache is full is made by some algorithm implemented in hardware:
- least recently used (LRU)
- NRU (not recently used) or random replacement are also possible
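In contrast, direct-mapped and set-associative caches restrict where a line can go. A minimal sketch of how the hardware splits an address (the geometry here, 64-byte lines and 512 sets, is an assumed example, e.g., a 32 KiB direct-mapped cache):

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BITS 6   /* 64-byte line -> 6 offset bits */
#define SET_BITS  9   /* 512 sets     -> 9 index bits  */

int main(void)
{
    /* An address is sliced into tag | set index | line offset.
     * The set index says which set the line must live in; only
     * the tag needs to be stored and compared on a lookup. */
    uint64_t addr   = 0x7ffd12345678ULL;   /* arbitrary example address */
    uint64_t offset = addr & ((1ULL << LINE_BITS) - 1);
    uint64_t set    = (addr >> LINE_BITS) & ((1ULL << SET_BITS) - 1);
    uint64_t tag    = addr >> (LINE_BITS + SET_BITS);

    printf("offset=%llu set=%llu tag=%#llx\n",
           (unsigned long long)offset, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}
```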
Modern processors
• Multicore processors
• Multithreaded processors
• Vector processors
They follow the SIMD paradigm, which demands that a single machine instruction be automatically applied to a large number of arguments of the same type, i.e., a vector.
Multiprocessors
Mobile SoC Example
• Heterogeneous
• Different hardware for different purposes
• Efficiency in terms of
– Performance
– Energy
• All in one chip
Mobile SoC + Peripherals
Similar to a motherboard-and-components assembly; for every component there are dozens of varieties to choose from.
Mobile SoC
• Apple: A15, M1, M1X, M2
– 2x 3.23GHz (Firestorm) + 4x 2GHz (Icestorm), or 8 cores; Neural Engine, GPU
• Qualcomm: SD888, SD870
– 1x Kryo X1 @2.8, 3x A78 @2.4, 4x A55 @1.8, AI, 5G, GPU
• Samsung: Exynos 9611
– 4x A73 @2.3GHz, 4x A53 @1.7GHz, Mali G72, 5G, codecs
• Huawei: HiSilicon Kirin 9000
– 1x A77 @3.13, 3x A77 @2.54, 4x A55 @2.0, Mali MP24, AI, 5G, neural engine
• MediaTek: Dimensity 1200
– 1x A78 @3.0, 3x A78 @2.6, 4x A55 @2.0, Mali MP24, 5G, AI
• Benchmarking: AnTuTu 9, Geekbench 5, 3DMark
• Saturation of single-processor performance
• Speed limit not to cross: ~4GHz
– The ultimate point
– Power consumption is proportional to the cube of frequency: P = k·f³
• Single processor
– Branch prediction accuracy has gone up to 95%
– L1 cache hit rates have gone up to 80%
– ILP (instruction-level parallelism) exploited by a uniprocessor is up to 8
– Thread/data-level parallelism needs to be exploited
Power Aware Scheduling
• P = ½·C·V²·F = k·F³ (with V–F pairs, V ∝ F)
– Running a processor at 3GHz consumes 27 times more power than running it at 1GHz
• E = k·F³·T
• Running a task at F and at F/3 (say 3GHz and 1GHz):
– E_F = k·F³·T
– E_{F/3} = k·(F/3)³·(3T) = k·F³·T/9 = E_F/9
– 3 times slower but 9 times more energy efficient
• If time permits, reduce the frequency
• If a task has enough slack before its deadline, reduce the frequency (see the sketch below)
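A tiny numeric check of the slide’s model (a sketch; k, F, and T are arbitrary example values, not real processor parameters):

```c
#include <stdio.h>

/* Cubic DVFS model from the slide: P = k*F^3, E = P*T.
 * Running at F/3 takes 3x the time but uses 1/9 the energy. */
int main(void)
{
    const double k = 1.0;   /* arbitrary model constant       */
    const double F = 3.0;   /* GHz                            */
    const double T = 1.0;   /* runtime (seconds) at frequency F */

    double e_full = k * F * F * F * T;              /* E_F = 27 */
    double f3     = F / 3.0;
    double e_slow = k * f3 * f3 * f3 * (3.0 * T);   /* E_{F/3} = 3 */

    printf("E_F = %.2f, E_{F/3} = %.2f, ratio = %.1f\n",
           e_full, e_slow, e_full / e_slow);        /* ratio = 9 */
    return 0;
}
```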
• Application-specific ICs (ASICs)
– Higher performance and lower power than a processor
– But the complexity of ASIC design is very high
– Example: 50MP + UHD video, GPS camera inside a mobile handset
– Fixed for one application
• VLSI technology offers high integration density
• Moore’s Law (Gordon Moore’s 1965 prediction)
– Exponential growth of the number of transistors on an IC
– Doubled every 26 months for the past three decades
• Why more transistors per IC? Smaller transistors, larger dice
• Many applications are highly parallel
– Take benefit of all parallelism (instruction, data, and thread)
• Multiprocessors are
– Flexible, programmable, high performance
• Processors are programmable compared to ASICs
• Flexible in terms of portability compared to ASICs
• Higher performance than a single processor
• Multiprocessors are likely to be cost/power-effective solutions
– They share lots of resources
• A personal room is costlier than a dormitory
• You cannot allocate a bungalow to each student: it would be too costly
– A hostel room with shared facilities is sufficient
– They need not run at very high frequency
– Lots of replication makes them easy to manage and cost-effective to design
– Sharing resources raises many other problems (see the lock sketch below)
• Critical sections
– Lock and barrier design
• Coherence
– Shared data at all places should be the same
• Consistency
– Ordering should be equivalent to some serial order
• One processor interferes with others
– Share efficiently using some policy
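A minimal OpenMP sketch of the critical-section point above (illustrative; without the lock, the shared update races and increments are lost):

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    long counter = 0;

    /* Many threads update one shared counter. The read-modify-write
     * (load, add, store) is not atomic, so without protection two
     * threads can overwrite each other's update. The critical
     * section serializes the update, at some performance cost. */
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        #pragma omp critical
        counter++;
    }

    printf("counter = %ld\n", counter);   /* 1000000 with the lock */
    return 0;
}
```

In real code a reduction or atomic update would be preferred here; the critical section stands in for the general lock/barrier design problem named above.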
• Many applications are highly parallel
– Take benefit of all parallelism (instruction, data, and thread)
– But most coders write sequential code
– Who will extract the parallelism from applications?
– There is no successful auto-parallelization tool to date
» Attempts: Cetus, SUIF, SolarisCC
• Good news: CNN/DNN Python parallel libraries are quite successful in the GPU domain
• Task scheduling on multiprocessors
– Deterministic task scheduling on a multiprocessor with two or more processors is an NP-complete problem
• Simple example
– 8 tasks with execution times 2, 4, 8, 5, 6, 4, 3, 20
– To be executed non-preemptively on two processors P1 and P2
– So that the overall execution time (makespan) is minimized
– Solution: divide the 8 tasks into two subsets such that the difference of their sums is minimized; this is the partition variant of the Subset Sum Problem
The subset sum problem (SSP) is a decision problem in computer science. In its most general formulation, there is a multiset S of integers and a target sum T, and the question is to decide whether any subset of the integers sums to precisely T. The problem is known to be NP-complete.
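A small sketch (not from the slides) solving the example instance with the classic subset-sum dynamic program; the reachable sum closest to half the total (52/2 = 26) gives the minimum makespan:

```c
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    /* The slide's example: 8 task times, two processors. */
    int t[] = {2, 4, 8, 5, 6, 4, 3, 20};
    int n = 8, total = 0;
    for (int i = 0; i < n; i++) total += t[i];   /* total = 52 */

    /* reach[s] == true if some subset of the tasks sums to exactly s.
     * Items outermost, sums downward, so each task is used at most once. */
    bool reach[64] = { false };                  /* big enough for 52 */
    reach[0] = true;
    for (int i = 0; i < n; i++)
        for (int s = total; s >= t[i]; s--)
            if (reach[s - t[i]]) reach[s] = true;

    /* Best split: largest reachable sum not exceeding total/2. */
    int best = 0;
    for (int s = total / 2; s >= 0; s--)
        if (reach[s]) { best = s; break; }

    /* Makespan is the heavier side of the split. */
    printf("P1 load = %d, P2 load = %d, makespan = %d\n",
           best, total - best, total - best);
    return 0;
}
```

Here 20 + 6 = 26 (or 20 + 4 + 2), leaving 26 on the other processor, so both finish at time 26, the optimal makespan.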