A Wire-Delay Scalable Microprocessor Architecture for High Performance Systems
Stephen W. Keckler [1], Doug Burger [1], Charles R. Moore [1], Ramadass Nagarajan [1], Karthikeyan Sankaralingam [1], Vikas Agarwal [2], M.S. Hrishikesh [2], Nitya Ranganathan [1], and Premkishore Shivakumar [2]
Computer Architecture and Technology Laboratory
[1] Department of Computer Sciences
[2] Department of Electrical and Computer Engineering
The University of Texas at Austin
Outline
 Progress and Limitations of Conventional Superscalar Designs
 Grid Processor Architecture (GPA) Overview
 Block Compilation
 Block Execution Flow
 Results
 Extending the GPA with Polymorphism
 Conclusion and Future Work
Superscalar Core: Spot the ALU
[Die photo: a modern CPU core, with its two ALUs, two FPUs, and two LD/ST units highlighted]
Only 12% of Non-Cache, Non-TLB Core Area is Execution Units
Looking Back: Conventional Superscalar
 Enormous gains in frequency
   1998: 500 MHz -> 2002: 3000 MHz
   Equal contributions from pipelining and technology
 IPC basically unchanged
   1998: ~1 IPC -> 2002: ~1 IPC
   uArch innovations just overcome the losses due to deeper pipelining
   Issue width remains at 4 instructions
 Pushing the limits of complexity management
   uArch innovations
   Verification is the gate
   Hundreds of full-custom macros
   250-500 person design teams
   Execution units are a small % of processor area
Faster, Higher IPC Superscalar Processors?
Faster: deeper pipelines (8 FO4)
   Key latencies (in cycles) increase -> IPC decreases (pipeline bubbles)
   uArch innovations mitigate the losses, but...
    they increase complexity and performance anomalies
   After the one-time jump to 8 FO4, frequency growth is limited to technology scaling only

Higher IPC: wide issue (16) and a large window (512+)
   Complexity growth is quadratic, but the performance gain is logarithmic
    Results are broadcast to all pending instructions
    Studies indicate only incremental performance gains
   Wire delay limits the size of monolithic structures
    Large structures must be partitioned to meet cycle time
    Key latencies increase, reducing the IPC gain (again!)
    Additional logic and circuit complexity
Superscalar Cores: Key Circuit Elements

Structure    | Conventional 4-issue              | Hypothetical 16-issue
-------------|-----------------------------------|-----------------------------------
Execution    | 2 FP, 2 INT, 2 LD/ST              | 8 FP, 8 INT, 8 LD/ST
I-Cache      | 64KB, 1 port, 64B (x1)            | 128KB, 2 ports, 128B (x1)
Mapper       | 8-port x 72-entry CAM (x2)        | 32-port x 512-entry CAM (x2)
Issue Queue  | 4-port x 20-entry dual CAM (x3)   | 4-port x 40-entry dual CAM (x12)
RegFiles     | 72 entries, 4R/5W ports (x4)      | 512 entries, 4R/18W ports (x8)
D-Cache      | 32KB, 2R/1W ports (x1)            | 128KB, 8R/4W ports (x1)
... and pipeline these to use only 8 FO4 delays per cycle!
What is Going Wrong?
1. Superscalar microarchitecture: scalability is limited
    Relies on large, centralized structures that want to grow larger
    Partitioning is a slippery slope: complexity and IPC loss

2. Architecture: the conventional binary interface is outdated!
    A linear sequence of instructions, defined for simple, single-issue machines
    Not natural for the compiler...
     It internally builds and optimizes a 2D control flow graph (CFG)
     It is forced to map the CFG onto a 1D linear sequence
     Lots of useful information gets thrown away
    Not natural for instruction-parallel machines...
     Instruction relationships are scattered throughout the linear sequence
     They must be dynamically re-established by scanning the linear sequence
     An N^2 problem -> large, centralized structures
Grid Processor Overview
 Wire-delay constraints exposed at the architecture level
 Renegotiate the compiler / HW binary interface
[Diagram: GPA organization. A 2D grid of execution nodes is fed by a banked register file (banks 0-3, with move units M) and instruction cache banks 0-3; data cache banks 0-3 and load/store queues sit alongside the grid, with block termination logic at the edge. Each execution node holds instruction buffers (Instr), operand slots (Op A, Op B), an ALU, and a router with N/S/E/W links.]
GPA Execution Model
 Compiler structures the program into a sequence of hyperblocks
   Atomic unit of fetch / schedule / execute / commit
 Blocks specify explicit instruction placement in the GRID
   Critical path placed to minimize communication delays
   Less critical instructions placed in the remaining positions
 Instructions specify their consumers as explicit targets
   CFG cast directly into the instruction encoding
   Point-to-point result forwarding
   In-GRID storage expands the register space
   Only block outputs are written back to the RF
  -> no HW dependency analysis! no associative issue queues! no global bypass network! no register renaming! Fewer RF ports needed!
 Dynamic instruction issue (sketched below)
   The GRID forms a large distributed window with independent issue controls
   Instructions execute in original dataflow order
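The issue rule this implies is simple enough to sketch in a few lines. Below is a minimal, illustrative model of dataflow-order execution (the class and function names are hypothetical, not the hardware's actual logic): an instruction fires as soon as all of its operands have arrived, and its result is routed only to the explicitly named target slots.

```python
# Minimal sketch of GPA-style dataflow issue (illustrative only).
# An instruction fires once all operands arrive; results travel
# point-to-point to explicitly named consumers -- no associative
# wakeup, no broadcast bypass, no register renaming.

ALU = {"add": lambda a, b: a + b, "inc": lambda a: a + 1}

class Instr:
    def __init__(self, opcode, n_operands, targets=()):
        self.opcode = opcode
        self.needed = n_operands        # operands still outstanding
        self.operands = []
        self.targets = list(targets)    # consumer Instr slots

    def deliver(self, value, ready):
        """Called when a producer routes a value to this slot."""
        self.operands.append(value)
        self.needed -= 1
        if self.needed == 0:            # all inputs arrived -> ready to fire
            ready.append(self)

def run_block(block_inputs):
    """block_inputs: (value, consumers) pairs injected by 'move' ops."""
    ready = []
    for value, consumers in block_inputs:
        for c in consumers:
            c.deliver(value, ready)
    while ready:
        i = ready.pop(0)
        result = ALU[i.opcode](*i.operands)
        if not i.targets:               # no consumers -> block output
            print("block output:", result)
        for t in i.targets:             # explicit point-to-point forwarding
            t.deliver(result, ready)

# move r2 -> i1, i2 ; move r3 -> i1   (with r2 = 5, r3 = 7)
i2 = Instr("add", 2)                    # produces block output r7
i1 = Instr("add", 2, [i2])              # i1 feeds i2
run_block([(5, [i1, i2]), (7, [i1])])   # prints "block output: 17"
```

Note that nothing here scans a window or broadcasts a tag: each producer alone knows who consumes its value.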
Block Compilation
Intermediate Code -> Data Flow Graph -> Mapping -> GPA Code

Intermediate code:
  i1) add  r1, r2, r3
  i2) add  r7, r2, r1
  i3) ld   r4, (r1)
  i4) add  r5, r4, 1
  i5) beqz r5, 0xdeac
Inputs: (r2, r3)   Temporaries: (r1, r4, r5)   Outputs: (r7)

Data flow graph (after compiler transforms):
  move r2 -> i1, i2
  move r3 -> i1
[Diagram: DFG with edges i1 -> i2, i3; i3 -> i4; i4 -> i5; i2 produces block output r7]
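As a concrete illustration of this step, here is a small sketch (hypothetical, not the actual compiler) that recovers the data flow graph above from the block's linear sequence: def-use chains become explicit edges, registers read before being defined become block inputs, and live-out registers become block outputs.

```python
# Build a data flow graph from a block's linear instruction sequence
# (illustrative sketch). Each instruction is (name, dest, sources).
block = [
    ("i1", "r1", ["r2", "r3"]),   # add  r1, r2, r3
    ("i2", "r7", ["r2", "r1"]),   # add  r7, r2, r1
    ("i3", "r4", ["r1"]),         # ld   r4, (r1)
    ("i4", "r5", ["r4"]),         # add  r5, r4, 1
    ("i5", None, ["r5"]),         # beqz r5, 0xdeac
]
live_out = {"r7"}                  # assumed known from liveness analysis

producer = {}                      # register -> defining instruction
edges, inputs, outputs = [], set(), set()
for name, dst, srcs in block:
    for src in srcs:
        if src in producer:
            edges.append((producer[src], name))    # def-use edge
        else:
            inputs.add(src)                        # block input (move)
            edges.append((f"move {src}", name))
    if dst is not None:
        producer[dst] = name
        if dst in live_out:
            outputs.add(dst)                       # written back at commit

print("inputs:", inputs)     # {'r2', 'r3'}
print("outputs:", outputs)   # {'r7'}
print("edges:", edges)       # matches the DFG on the slide
```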
Block Compilation (cont)
Intermediate Code -> Data Flow Graph -> Mapping -> GPA Code

Data flow graph:
  move r2 -> i1, i2
  move r3 -> i1

Mapping onto the GPA (scheduler):
  move r2, (1,3), (2,2)
  move r3, (1,3)
[Diagram: the scheduler maps the DFG onto grid coordinates, e.g. i1 at (1,3) and i2 at (2,2), with i3, i4, and i5 placed further along and block output r7 exiting the grid]
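The mapping step can be pictured as a placement problem. The greedy sketch below is purely illustrative (the cost function and the downward-flow penalty are assumptions, not the real scheduler): each instruction, taken in topological order, is placed at the free grid slot closest to its already-placed producers.

```python
# Greedy placement sketch (hypothetical): put each instruction at the
# free grid slot nearest its producers, processed in topological order.
ROWS, COLS = 4, 4

def schedule(order, preds):
    """order: instructions in topological order;
    preds: instr -> list of producer instrs."""
    placement = {}
    free = {(r, c) for r in range(ROWS) for c in range(COLS)}
    for instr in order:
        sources = [placement[p] for p in preds.get(instr, [])]
        def cost(slot):
            r, c = slot
            if not sources:
                return r                 # block inputs enter near the top
            # Manhattan distance approximates routing delay; penalize
            # slots that are not below a producer (results flow onward).
            return sum(abs(r - pr) + abs(c - pc) + (4 if r <= pr else 0)
                       for pr, pc in sources)
        best = min(free, key=cost)
        placement[instr] = best
        free.remove(best)
    return placement

preds = {"i2": ["i1"], "i3": ["i1"], "i4": ["i3"], "i5": ["i4"]}
print(schedule(["i1", "i2", "i3", "i4", "i5"], preds))
```

The critical path (i1 -> i3 -> i4 -> i5 through the load) ends up in adjacent slots, which is exactly the "minimize communication delays" goal stated earlier.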
Block Compilation (cont)
Intermediate Code -> Data Flow Graph -> Mapping -> GPA Code

Mapping onto the GPA:
  move r2, (1,3), (2,2)
  move r3, (1,3)
[Diagram: the DFG placed on the grid, with i1 at (1,3) feeding i2 and i3]

GPA code (code generation):
  I1)  (1,3):  add  (2,2) (2,3)
       fields: instruction location, opcode, targets
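A sketch of what such an encoding might look like as packed fields; the opcode numbering and field widths below are hypothetical, sized only for the 4x4 grid in this example.

```python
# Hypothetical GPA instruction word for a 4x4 grid (widths illustrative):
# each location or target is a (row, col) coordinate, 2 bits apiece.
OPCODES = {"add": 0x1, "ld": 0x2, "beqz": 0x3, "move": 0x4}

def encode(opcode, loc, targets):
    """loc and targets are (row, col) pairs on a 4x4 grid."""
    word = OPCODES[opcode]
    word = (word << 4) | (loc[0] << 2) | loc[1]   # placement field
    for (r, c) in targets:                        # explicit consumer fields
        word = (word << 4) | (r << 2) | c
    return word

# I1) (1,3): add (2,2) (2,3)  -- i1 feeds i2 and i3
print(hex(encode("add", (1, 3), [(2, 2), (2, 3)])))
```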
Block Execution
 Four phases: instruction distribution, input register fetch, block execution, output register writeback
[Diagram (one panel per phase): the block's instructions (add, add, load, add, beqz) are distributed from ICache banks 0-3 into the grid; input registers r2 and r3 are fetched from register file banks 0-3 and injected by moves; values then flow point-to-point through the grid, with the load passing through the load/store queues to DCache banks 0-3; finally, output r7 is written back and the block termination logic commits the block.]
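The commit rule suggested by the writeback phase can be sketched in a few lines. This is an assumption drawn from these slides, not a real implementation: the block termination logic counts the register outputs and stores the block was encoded to produce, and commits once all of them have arrived.

```python
# Sketch of block termination / commit detection (hypothetical).
# The block header is assumed to name how many register outputs
# and stores the block produces.
class BlockTracker:
    def __init__(self, n_reg_outputs, n_stores):
        self.regs_left = n_reg_outputs
        self.stores_left = n_stores

    def register_writeback(self, reg):
        self.regs_left -= 1
        return self.done()

    def store_complete(self):
        self.stores_left -= 1
        return self.done()

    def done(self):
        # Commit (or, if speculative, become committable) once every
        # architected output has been produced.
        return self.regs_left == 0 and self.stores_left == 0

t = BlockTracker(n_reg_outputs=1, n_stores=0)  # the example block writes r7
print(t.register_writeback("r7"))              # True -> block commits
```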
Instruction Buffers - frames
 Instruction buffers add depth and define frames
   2D GRID of execution units -> 3D scheduling of instructions
   Allows very large blocks to be mapped onto the GRID
   Result addresses are explicitly specified in 3 dimensions: (x, y, z)
[Diagram: an execution node whose instruction buffer holds a stack of slots, each with control bits, an opcode, and two source-value fields, feeding the ALU and router; the buffered instructions (add, add, load, add, beqz) occupy different depths.]
 The instruction buffers form a logical z-dimension in each node
 4 logical frames, each with 16 instruction slots
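With depth added, a target address gains a third coordinate. A tiny sketch for the configuration shown here (4x4 grid, 4 frames); the bit layout is hypothetical:

```python
# A target names a node (x, y) and a buffer slot z within that node.
# For a 4x4 grid with 4 frames of depth: 2 bits each -> 6 bits per target.
def pack_target(x, y, z):
    return (x << 4) | (y << 2) | z

def unpack_target(t):
    return (t >> 4) & 0x3, (t >> 2) & 0x3, t & 0x3

assert unpack_target(pack_target(2, 3, 1)) == (2, 3, 1)
```

A real encoding would presumably also select which operand slot of the consumer (left or right source value) the routed result fills.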
Using frames for Speculation and ILP
 16 total frames (4 sets of 4)
 Map A onto the GRID; start executing A
 Predict C is the next block; speculatively execute C
 Predict D follows C; speculatively execute D
 Predict E follows D; speculatively execute E
[Diagram: a CFG from start to end with blocks A, B, C, D, and E; the frame sets hold A plus speculative C, D, and E]
Result:
  An enormous effective instruction window for extracting ILP
  Increased utilization of execution units (prediction accuracy counts!)
  Latency tolerance for GRID delays and load instructions
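The frame-management loop this slide implies can be sketched as follows (the structure is entirely hypothetical): keep the oldest block non-speculative, map predicted successors into free frame sets, and squash all younger blocks on a misprediction.

```python
# Illustrative frame manager for block-level speculation (hypothetical).
from collections import deque

FRAME_SETS = 4    # 16 frames as 4 sets of 4 -> up to 4 blocks in flight

def run(predict_next, actual_next, start):
    """predict_next: block predictor; actual_next: resolved successor,
    known when a block commits. Both return None at program exit."""
    in_flight, pc = deque(), start
    while pc is not None or in_flight:
        # Fill free frame sets with the predicted chain of blocks.
        while pc is not None and len(in_flight) < FRAME_SETS:
            in_flight.append(pc)           # map block onto the GRID
            pc = predict_next(pc)          # speculate on its successor
        done = in_flight.popleft()         # oldest block commits first
        print("commit", done)
        actual = actual_next(done)
        if not in_flight or in_flight[0] != actual:
            in_flight.clear()              # mispredict: squash younger blocks
            pc = actual                    # refetch down the correct path

# Perfect prediction over the slide's chain A -> C -> D -> E:
cfg = {"A": "C", "C": "D", "D": "E", "E": None}
run(cfg.get, cfg.get, "A")                 # commits A, C, D, E
```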
Results: GPA Instructions per Cycle
[Bar chart: IPC (y-axis, 0 to 7) of an 8x8 GPA and a 4x4 GPA versus a 4-issue superscalar, across the benchmarks ammp, art, bzip2, compress, gzip, m88ksim, mcf, parser, twolf, and vortex, plus the mean.]
Using frames for Thread-Level Parallelism
[Diagram: the frame space divided between Thread 1 and Thread 2, each holding a non-speculative block A and a speculative block B]
 Divide the frame space among threads
   Each thread's share can be further divided to enable some degree of speculation
   Shown: 2 threads, each with 1 speculative block
   An alternate configuration might provide 4 threads
Result:
  Simultaneous Multithreading (SMT) for Grid Processors
  Polymorphism: use the same resources in different ways for different workloads (T-morph)
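The T-morph trade-off is essentially an allocation policy over the same 16 frames. A hypothetical sketch: the frame sets are divided evenly among threads, and whatever remains of each thread's share beyond one non-speculative frame set is available for speculation.

```python
# Hypothetical T-morph configurations over 16 frames (4 sets of 4):
# trade per-thread speculation depth for thread count.
def partition_frames(n_threads, total_sets=4):
    assert total_sets % n_threads == 0
    per_thread = total_sets // n_threads
    # Each thread keeps 1 non-speculative frame set; the rest of its
    # share holds speculative successor blocks.
    return {t: {"frame_sets": per_thread, "spec_depth": per_thread - 1}
            for t in range(n_threads)}

print(partition_frames(1))   # 1 thread, speculating 3 blocks deep
print(partition_frames(2))   # as on the slide: 2 threads, 1 spec block each
print(partition_frames(4))   # 4 threads, no speculation
```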
Using frames for Data-Level Parallelism
Streaming kernel:
  read an input stream element
  process the element
  write an output stream element
[Diagram: a loop of N iterations over the kernel is unrolled 8x into one large kernel, with copies (1), (2), (3), ..., (8) fused into a single block that loops N/8 times]
 Map very large blocks (kernels): fetch once, use many times
 Not shown: streaming data channels
Result:
  The instruction buffers act as a distributed I-Cache
  Ability to absorb and process large amounts of streaming data
  Another type of polymorphism (S-morph)
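At its core the S-morph transformation is loop unrolling, as the toy sketch below shows (names hypothetical): eight copies of the kernel are fused into one large block that is fetched into the instruction buffers once and reused for all N/8 outer iterations.

```python
# Illustrative S-morph execution: unroll a streaming kernel 8x so the
# fused block fills the instruction buffers and is fetched only once.
UNROLL = 8

def run_stream(kernel, instream, outstream):
    assert len(instream) % UNROLL == 0
    # 'Fetch once, use many times': the unrolled block stays resident
    # in the instruction buffers across all N/8 outer iterations.
    for base in range(0, len(instream), UNROLL):
        for k in range(UNROLL):                 # 8 fused kernel copies
            outstream.append(kernel(instream[base + k]))

out = []
run_stream(lambda x: 2 * x + 1, list(range(16)), out)
print(out)
```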
Conclusions
 Technology and architecture trends:
   Good news: lots of transistors, faster transistors; pipeline depth near optimal
   Bad news: global wire delays are growing; superscalar is pushing the limits of complexity
 The GPA represents a promising technology direction
   Wire-delay constraints are exposed to both the microarchitecture and the architecture
   Eliminates the difficult centralized structures that dominate today's designs
   Architectural partitioning encourages regularity and re-use
   Enhanced information flow between compiler and hardware
   Polymorphic features deliver performance on a wide range of workloads
Future Work
 Architectural refinement
   Block-oriented predictors
   Selective re-execution
 Enhance compilation and scheduling tools
   Hyperblock formation
   3D instruction scheduling algorithms
 Compatibility bridge to existing architectures
 Hardware prototype (currently in the planning stage)
   Four 4x4 GPA cores + NUCA L2 cache on chip
   0.10um, ~350mm^2, 1000+ signal I/O, 300MHz
   4Q 2004 tape-out