
Blue Gene/L – A Low Power Supercomputer

George Chiu

Modified for CS 416 Parallel & Distributed Computing. NK

Introduction
Environment
BG/L Architecture
Software
Applications

BlueGene and BlueGene/L

 BlueGene is a family of supercomputers made by IBM.

 It uses a massive collection of low-power CPUs instead
of a moderate-sized collection of high-power CPUs.
 BlueGene systems occupied the #1 and #8 slots (Top500) in 2004, and
the #1, #2, and #9 slots in 2005.

 BlueGene/L is the BlueGene flagship,
located at Lawrence Livermore National
Laboratory (LLNL).
TWO MAIN GOALS OF BLUE GENE/L

 To build a new family of supercomputers
optimized for bandwidth, scalability, and
the ability to handle large amounts of
data, while consuming a fraction of the
power and floor space required by today's
fastest systems.

 To analyze scientific and biological
problems (protein folding).

BlueGene/L program
 December 1999: IBM Research announced a five-year, $100M (US) effort to build a
petaflop/s-scale supercomputer to attack science problems such as protein
folding.
Goals:
 Advance the state of the art of scientific simulation.
 Advance the state of the art in computer design and software for
capability and capacity markets.
 November 2001: Announced research partnership with Lawrence Livermore
National Laboratory (LLNL).
 November 2002: Announced planned acquisition of a BG/L machine by LLNL as
part of the ASCI Purple contract.
 May 11, 2004: Four racks of DD1 hardware (4096 nodes at 500 MHz) ran Linpack at
11.68 TFlops/s, ranked #4 on the 23rd Top500 list.
 June 2, 2004: Two racks of DD2 hardware (1024 nodes at 700 MHz) ran Linpack at
8.655 TFlops/s, ranked #8 on the 23rd Top500 list.
 September 16, 2004: 8 racks ran Linpack at 36.01 TFlops/s.
 November 8, 2004: 16 racks ran Linpack at 70.72 TFlops/s, ranked #1 on
the 24th Top500 list.
 December 21, 2004: First 16 racks of BG/L accepted by LLNL.
Note: LINPACK is a collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems.
COST
 The initial cost was $1.5M per rack.

 The current cost is $2M per rack.

 March 2005: IBM started renting the machine for
about $10,000 per week for the use of one-eighth of a Blue
Gene/L rack.
BG/L 16384 nodes (IBM Rochester)
Linpack: 70.72 TF/s sustained, 91.7504 TF/s peak
1 TF = 1,000,000,000,000 (10^12) flop/s

Comparing Systems
                      ASCI White   ASCI Q   Earth Simulator   Blue Gene/L
Machine Peak (TF/s)         12.3       30             40.96           367
Total Mem. (TBytes)            8       33                10            32
Footprint (sq ft)         10,000   20,000            34,000         2,500
Power (MW)                     1      3.8             6-8.5           1.5
Cost ($M)                    100      200               400           100
# Nodes                      512     4096               640        65,536
Clock (MHz)                  375     1000               500           700
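The power and footprint columns are the point of the comparison. As a quick sanity check, the minimal C sketch below computes peak performance per megawatt and per 1000 sq ft from the figures in the table; the only assumption is taking the Earth Simulator's 6-8.5 MW range at its midpoint.

#include <stdio.h>

int main(void) {
    /* Figures taken from the comparison table above; the Earth Simulator's
       6-8.5 MW power range is taken at its midpoint (an assumption). */
    const char *name[]  = {"ASCI White", "ASCI Q", "Earth Simulator", "Blue Gene/L"};
    double peak_tf[]    = {12.3, 30.0, 40.96, 367.0};     /* peak TF/s */
    double power_mw[]   = {1.0, 3.8, 7.25, 1.5};          /* MW        */
    double footprint[]  = {10000, 20000, 34000, 2500};    /* sq ft     */

    for (int i = 0; i < 4; i++)
        printf("%-16s %7.1f TF/s per MW   %7.1f TF/s per 1000 sq ft\n",
               name[i],
               peak_tf[i] / power_mw[i],
               peak_tf[i] / (footprint[i] / 1000.0));
    return 0;
}

Blue Gene/L comes out roughly an order of magnitude ahead of the other machines on both ratios, which is the deck's central claim.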
Supercomputer Peak Performance
[Chart: peak speed (flops, 1E+2 to 1E+17) versus year introduced (1940-2010); doubling time = 1.5 yr. The progression runs from ENIAC and UNIVAC (vacuum tubes), through IBM 701/704/7090 and IBM Stretch (transistors), CDC 6600/7600 and ILLIAC IV (ICs), CDC STAR-100 and CRAY-1 (vectors), Cyber 205, X-MP, Y-MP, CRAY-2, S-810/20, SX-2, VP2600/10 (parallel vectors), then CM-5, Delta, Paragon, T3D/T3E, i860-based MPPs, NWT, SX-3/44, SX-4, SX-5, CP-PACS, ASCI Red, Blue Pacific, ASCI White, ASCI Q (MPPs), up to Thunder, Red Storm, the Earth Simulator, and Blue Gene/L approaching the petaflop and multi-petaflop range.]
Microprocessor Power Density Growth

System Power Comparison

 BG/L (2048 processors): 20.1 kW
 450 ThinkPads: 20.3 kW

(LS Mok, 4/2002)
BlueGene/L System Buildup

Chip: 2 processors; 2.8/5.6 GF/s; 4 MB
Compute Card: 2 chips (1x2x1); 5.6/11.2 GF/s; 1.0 GB
Node Card: 32 chips (4x4x2), 16 compute and 0-2 I/O cards; 90/180 GF/s; 16 GB
Rack: 32 node cards; 2.8/5.6 TF/s; 512 GB
System: 64 racks (64x32x32); 180/360 TF/s; 32 TB
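The buildup numbers follow from straight multiplication. The C sketch below (an illustration, not IBM code) reproduces them, assuming each core's double FPU delivers 4 floating-point operations per cycle at 700 MHz, i.e. 2.8 GF/s per core.

#include <stdio.h>

int main(void) {
    /* Hierarchy from the slide: 64 racks x 32 node cards x 32 chips, 2 cores per chip. */
    long chips = 64L * 32 * 32;   /* 65,536 compute ASICs       */
    long cores = chips * 2;       /* 131,072 PowerPC 440 cores  */

    /* Assumption: double FPU = 2 fused multiply-adds = 4 flops per 700 MHz cycle. */
    double gf_per_core = 0.7 * 4;    /* 2.8 GF/s per core */

    printf("chips = %ld, cores = %ld\n", chips, cores);
    printf("peak, coprocessor mode (1 compute core per chip): %.1f TF/s\n",
           chips * gf_per_core / 1000.0);   /* ~183.5 TF/s, quoted as 180 on the slide */
    printf("peak, virtual node mode (2 compute cores per chip): %.1f TF/s\n",
           cores * gf_per_core / 1000.0);   /* ~367 TF/s, quoted as 360 on the slide */
    return 0;
}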
THE BLUE GENE/L
THE COMPUTE CARD

THE BLUE GENE/L
THE COMPUTE CARD

THE BLUE GENE/L
THE NODE CARD

THE BLUE GENE/L
THE RACK
THE BLUE GENE/L
THE RACK
BlueGene/L Compute ASIC
[Block diagram of the compute ASIC: two PowerPC 440 CPU cores, each with a "Double FPU" and 32 KB/32 KB L1 caches (the second core serving as the I/O processor), connected via a PLB (4:1) to L2 prefetch buffers, a shared multiported SRAM, and an L3 directory with 4 MB of shared embedded DRAM (EDRAM) L3 cache, with ECC. On-chip interfaces: a 144-bit-wide DDR controller with ECC for 256/512 MB of external memory, Gbit Ethernet, JTAG access and control, the torus network (6 links out and 6 in, each at 1.4 Gbit/s), the tree network (3 links out and 3 in, each at 2.8 Gbit/s), and 4 global barriers or interrupts.]

Chip technology: IBM CU-11, 0.13 µm; 11 x 11 mm die size; 25 x 32 mm CBGA (ceramic ball grid array) package; 474 pins, 328 signal; 1.5/2.5 Volt.

Note: Joint Test Action Group (JTAG) is the common name used for the IEEE 1149.1 standard, entitled "Standard Test Access Port and Boundary-Scan Architecture", for test access ports used for testing printed circuit boards. JTAG is often used as an IC debug or probing port.
BlueGene/L Compute ASIC

Only Definition of Barrier is relevant

BlueGene/L Interconnection Networks
3 Dimensional Torus
 Interconnects all compute nodes (65,536)
 Virtual cut-through hardware routing
 1.4Gb/s on all 12 node links (2.1 GB/s per node)
 1 µs latency between nearest neighbors, 5 µs to the
farthest
 4 µs latency for one hop with MPI, 10 µs to the farthest
 Communications backbone for computations
 0.7/1.4 TB/s bisection bandwidth, 68TB/s total bandwidth
Global Tree
 One-to-all broadcast functionality
 Reduction operations functionality
 2.8 Gb/s of bandwidth per link
 Latency of one way tree traversal 2.5 µs
 ~23TB/s total binary tree bandwidth (64k machine)
 Interconnects all compute and I/O nodes (1024)
Ethernet
 Incorporated into every node ASIC
 Active in the I/O nodes (1:64)
 All external comm. (file I/O, control, user interaction, etc.)
Low Latency Global Barrier and Interrupt
 Latency of round trip 1.3 µs
Control Network
3D Torus Network
 It is used for general-purpose, point-to-point message
passing and multicast operations to a selected “class” of
nodes.
 The topology is a three-dimensional torus constructed with
point-to-point, serial links between routers embedded within
the BlueGene/L ASICs.
 Each ASIC has six nearest-neighbor connections
 Virtual cut-through routing with multipacket buffering on
collision
 Minimal, Adaptive, Deadlock Free

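To make the "six nearest-neighbor connections" concrete, here is a small C sketch (an illustration, not BG/L system code) that computes a node's six neighbors on a torus with wraparound; the 64x32x32 dimensions are the full-system figures from the buildup slide.

#include <stdio.h>

/* Wrap a coordinate onto a torus dimension of the given size. */
static int wrap(int c, int size) { return ((c % size) + size) % size; }

int main(void) {
    const int X = 64, Y = 32, Z = 32;   /* full-system torus dimensions */
    int x = 0, y = 0, z = 0;            /* an example node at one corner */

    /* The six nearest-neighbor links: +/-1 in each dimension, with
       wraparound, which is what makes it a torus rather than a mesh. */
    printf("x neighbors: (%d,%d,%d) and (%d,%d,%d)\n",
           wrap(x - 1, X), y, z, wrap(x + 1, X), y, z);
    printf("y neighbors: (%d,%d,%d) and (%d,%d,%d)\n",
           x, wrap(y - 1, Y), z, x, wrap(y + 1, Y), z);
    printf("z neighbors: (%d,%d,%d) and (%d,%d,%d)\n",
           x, y, wrap(z - 1, Z), x, y, wrap(z + 1, Z));
    return 0;
}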
Torus Network (cont’d)
 Class Routing Capability (Deadlock-free
Hardware Multicast)
 Packets can be deposited along route to
specified destination.
 Allows for efficient one to many in some
instances
 Active messages allow for fast transposes, as
required in FFTs.
 Independent on-chip network interfaces enable
concurrent access.

Other Networks

 A global combining/broadcast tree for
collective operations
 A Gigabit Ethernet network for connection to
other systems, such as hosts and file
systems
 A global barrier and interrupt network
 Another Gigabit Ethernet-to-JTAG
network for machine control

Collective Network
 It has tree structure
 One-to-all broadcast functionality
 Reduction operations functionality
 2.8 Gb/s of bandwidth per link; Latency of tree
traversal 2.5 µs
 ~23TB/s total binary tree bandwidth (64k machine)
 Interconnects all compute and I/O nodes (1024)

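The broadcast and reduction functionality of the collective network is what an application sees through MPI collectives. The fragment below is generic MPI in C, not BG/L-specific code; on BG/L it is the MPI library that routes such a reduction over the tree network.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes one value; reducing and broadcasting the result
       is exactly the pattern the tree network implements in hardware. */
    double local = (double)rank, global_sum = 0.0;
    MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over all ranks = %g\n", global_sum);
    MPI_Finalize();
    return 0;
}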
Gb Ethernet Disk/Host I/O Network

 I/O nodes are leaves on the collective network.

 Compute and I/O nodes use the same ASIC, but:
 I/O nodes have Ethernet but no torus; this isolates
I/O traffic from the application.
 Compute nodes have the torus but no Ethernet: no need for
65,536 Ethernet cables.
 Configurable ratio of I/O to compute nodes =
1:8, 16, 32, 64, or 128.
 Applications run on compute nodes, not I/O nodes.

Fast Barrier/Interrupt Network
 Four Independent Barrier or Interrupt Channels
 Independently Configurable as "or" or "and"
 Asynchronous Propagation
 Halt operation quickly (current estimate is 1.3 µs worst-case
round trip)
 3/4 of this delay is time-of-flight.
 Sticky bit operation
 Allows global barriers with a single channel.
 User Space Accessible
 System selectable
 It is partitioned along the same boundaries as the Tree and
Torus
 Each user partition contains its own set of barrier/interrupt
signals
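From an application's point of view this hardware is reached through an ordinary MPI_Barrier call. The timing sketch below is generic MPI (an illustration, not a BG/L benchmark); on BG/L the library can implement the barrier on the dedicated network with roughly the 1.3 µs round trip quoted above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* warm-up synchronization */
    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);          /* timed global barrier    */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("barrier took %.2f us on rank 0\n", (t1 - t0) * 1e6);
    MPI_Finalize();
    return 0;
}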
Control Network
 JTAG interface to 100Mb Ethernet
 direct access to all nodes.
 boot, system debug availability.
 runtime noninvasive RAS support.
 non-invasive access to performance counters
 direct access to shared SRAM in every node
 Control, configuration and monitoring:
 Make all active devices accessible through JTAG, I2C,
or other “simple” bus. (Only clock buffers & DRAM are
not accessible)

BlueGene/L System

BLUE GENE/L I/O ARCHITECTURE

BlueGene/L Software Hierarchical Organization
 Compute nodes are dedicated to running the user application, and
almost nothing else: a simple compute node kernel (CNK)
 I/O nodes run Linux and provide a more complete range of OS
services: files, sockets, process launch, signaling, debugging,
and termination
 The service node performs system management services (e.g.,
heartbeating, monitoring errors), transparent to application
software

Parallel I/O on BG via GPFS

BG/L System Software

 Simplicity
 Space-sharing
 Single-threaded
 No demand paging
 Familiarity
 MPI (MPICH2)
 IBM XL Compilers for PowerPC

Operating Systems

 Front-end nodes are commodity PCs
running Linux
 I/O nodes run a customized Linux
kernel
 Compute nodes use an extremely
lightweight custom kernel
 The service node is a single multiprocessor
machine running a custom OS
Compute Node Kernel (CNK)

 Single user, dual-threaded


 Flat address space, no paging
 Physical resources are memory-mapped
 Provides standard POSIX functionality
(mostly)
 Two execution modes:
 Virtual node mode
 Coprocessor mode

Service Node OS

 Core Management and Control System (CMCS):
BG/L’s “global” operating system

 MMCS: Midplane Monitoring and
Control System
 CIOMAN: Control and I/O Manager

 DB2 relational database
Running a User Job

 Jobs are compiled on, and submitted from, a
front-end node.
 An external scheduler is used.
 The service node sets up a partition and transfers
the user’s code to the compute nodes.
 All file I/O is done using standard Unix calls
(via the I/O nodes).
 Post-facto debugging is done on the front-end
nodes.
Performance Issues

 User code is easily ported to BG/L.

 However, an efficient MPI implementation requires
considerable effort and skill:
 Torus topology instead of a crossbar (see the
MPI topology sketch below)
 Special hardware, such as the collective
network.

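One common way to work with the torus rather than against it is to let MPI arrange the ranks in a periodic Cartesian grid, so that an application's logical neighbors can be mapped onto physical nearest-neighbor links. The sketch below uses only standard MPI calls (MPI_Dims_create, MPI_Cart_create, MPI_Cart_shift); whether the BG/L MPI library actually remaps ranks onto torus coordinates is up to the implementation, so treat this as an illustrative pattern rather than a guaranteed optimization.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Factor the ranks into a 3-D grid, periodic in every dimension so the
       logical topology matches a torus. */
    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Dims_create(nprocs, 3, dims);

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods,
                    1 /* allow rank reordering */, &torus);

    /* Ranks of the two neighbors along the x dimension; communicating only
       with such neighbors keeps traffic on nearest-neighbor links. */
    int left, right;
    MPI_Cart_shift(torus, 0, 1, &left, &right);

    if (rank == 0)
        printf("grid = %d x %d x %d\n", dims[0], dims[1], dims[2]);
    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}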
Application Match

 BlueGene provides capability for:
 Applications with extremely large compute demand
 Applications where current capability is constrained by boundary
effects

 Most important impact: new science
 Nanoscale systems (from condensed matter to protein structure)
 Astrophysics
 Fluid simulation, structural analysis

Navigator: Viewing active HTC jobs running on Blue Gene partitions

Navigator: Viewing HTC Jobs

Closing Points
 Blue Gene represents an innovative way to scale to multi-teraflops
capability
 Massive scalability
 Efficient packaging for low power and floor-space consumption
 Unique in the market for its balance between massive scale-out capacity
and preservation of familiar user/administrator environments
 Better than COTS clusters by virtue of density, scalability and innovative
interconnect design
 Better than vector-based supercomputers by virtue of adherence to Linux and
MPI standards
 Blue Gene is applicable to a wide range of Deep Computing workloads

 Programs are underway to ensure Blue Gene technology is accessible to a
broad range of researchers and technologists

 Based on PowerPC, Blue Gene leverages and advances core IBM
technology

 Blue Gene R&D continues so as to ensure the program stays vital
