A Wire-Delay Scalable Microprocessor Architecture For High Performance Systems

This document proposes a Grid Processor Architecture (GPA) as a scalable microprocessor design that addresses limitations of conventional superscalar processors. The GPA partitions a program into blocks that are explicitly mapped to a 2D grid of execution nodes. Instructions specify targets rather than dependencies. Blocks can execute out-of-order across the grid in dataflow order. Instruction buffers add a third dimension and enable speculation across blocks. The GPA aims to eliminate centralized structures, encourage regularity, and enhance compiler-hardware cooperation to improve performance scalability. Future work includes architectural refinements, improved compilation tools, and a planned hardware prototype.

A Wire-Delay Scalable Microprocessor Architecture for High Performance Systems

Stephen W. Keckler¹, Doug Burger¹, Charles R. Moore¹, Ramadass Nagarajan¹, Karthikeyan Sankaralingam¹, Vikas Agarwal², M.S. Hrishikesh², Nitya Ranganathan¹, and Premkishore Shivakumar²

Computer Architecture and Technology Laboratory
¹ Department of Computer Sciences
² Department of Electrical and Computer Engineering
The University of Texas at Austin

Outline
Progress and Limitations of Conventional Superscalar Designs Grid Processor Architecture (GPA) Overview
Block Compilation Block Execution Flow Results

Extending the GPA with Polymorphism Conclusion and Future Work

Superscalar Core: Spot the ALU

[Figure: annotated die photo of a CPU core; the execution units — two ALUs, two FPUs, and two LD/ST units — are highlighted within the much larger core.]

- Only 12% of non-cache, non-TLB core area is execution units

3

Looking Back: Conventional Superscalar

- Enormous gains in frequency
  - 1998: 500 MHz; 2002: 3000 MHz
  - Equal contributions from pipelining and technology
- IPC basically unchanged
  - 1998: ~1 IPC; 2002: ~1 IPC
  - uArch innovations just overcome losses due to pipelining
  - Issue width remains at 4 instructions
- Pushing the limits of complexity management
  - uArch innovations
  - Verification is the gate
  - Hundreds of full-custom macros
  - 250-500 person design teams
  - Execution units are a small % of processor area

4

Faster, Higher-IPC Superscalar Processors?

- Faster: deeper pipelines (8 FO4)
  - Key latencies increase; IPC decreases (pipeline bubbles)
  - uArch innovations mitigate losses, but increase complexity and performance anomalies
  - After the 8 FO4 jump, frequency growth is limited to technology only
- Higher IPC: wide issue (16) and large window (512+)
  - Growth is quadratic but gain is logarithmic
  - Broadcast of results to all pending instructions
  - Studies indicate only incremental performance gains
  - Wire delay limits the size of monolithic structures
  - Large structures must be partitioned to meet cycle time
  - Key latencies increase, reducing the IPC gain (again!)
  - Additional logical and circuit complexity

5

Superscalar Cores: Key Circuit Elements

  Structure     Conventional 4-Issue              Hypothetical 16-Issue
  -----------   -------------------------------   --------------------------------
  Execution     2 FP, 2 INT, 2 LD/ST              8 FP, 8 INT, 8 LD/ST
  I-Cache       64KB, 1 port, 64B (1 instance)    128KB, 2 ports, 128B (1)
  Mapper        8-port x 72-entry CAM (2)         32-port x 512-entry CAM (2)
  Issue Queue   4P x 20-entry dual CAM (3)        4P x 40-entry dual CAM (12)
  RegFiles      72-entry, 4R/5W ports (4)         512-entry, 4R/18W ports (8)
  D-Cache       32KB, 2R/1W ports (1)             128KB, 8R/4W ports (1)

... and pipeline these to use only 8 FO4 delays / cycle!

6

What is Going Wrong?

1. Superscalar microarchitecture: scalability is limited
   - Relies on large, centralized structures that want to grow larger
   - Partitioning is a slippery slope: complexity, IPC loss

2. Architecture: the conventional binary interface is outdated!
   - A linear sequence of instructions, defined for simple, single-issue machines
   - Not natural for the compiler: it internally builds and optimizes a 2D control flow graph (CFG), then is forced to map that CFG into a 1D linear sequence, throwing away lots of useful information
   - Not natural for instruction-parallel machines: instruction relationships are scattered throughout the linear sequence and must be dynamically re-established by scanning it
   - That scan is an N² problem, which drives the large, centralized structures (a sketch below illustrates it)

7
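To see where the N² comes from, here is a minimal sketch in Python (a hypothetical illustration, not from the talk): every newly dispatched instruction must be compared against the destinations of everything already in the window, which is roughly what the wide CAMs on the previous slide implement in hardware. Renaming details (last-writer-wins on redefinition) are omitted for brevity.

```python
# Hypothetical sketch of the N^2 dependence scan a linear ISA implies:
# each dispatched instruction is compared against the destination of
# every instruction already in the window.

def find_producers(window, new_inst):
    srcs = set(new_inst["srcs"])
    deps = []
    for idx, inst in enumerate(window):   # O(N) work per instruction...
        if inst["dst"] in srcs:
            deps.append(idx)
    return deps                           # ...so O(N^2) to fill the window

window = [{"dst": "r1", "srcs": ["r2", "r3"]},
          {"dst": "r4", "srcs": ["r1"]}]
print(find_producers(window, {"dst": "r5", "srcs": ["r1", "r4"]}))  # [0, 1]
```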

Grid Processor Overview

- Wire-delay constraints exposed at the architecture level
- Renegotiate the compiler / hardware binary interface

[Figure: GPA block diagram — a banked register file (banks 0-3, with move units) feeds a 2D grid of execution nodes; instruction cache banks 0-3 and data cache banks 0-3 with load/store queues sit along the edges, with block termination logic below. Each execution node contains instruction buffers, operand slots Op A and Op B, an ALU, and a router with N/S/E/W links.]

8

GPA Execution Model

- Compiler structures the program into a sequence of hyperblocks
  - Atomic unit of fetch / schedule / execute / commit
- Blocks specify explicit instruction placement in the grid
  - Critical path placed to minimize communication delays
  - Less critical instructions placed in remaining positions
- Instructions specify consumers as explicit targets
  - CFG cast directly into the instruction encoding
  - Point-to-point result forwarding; in-grid storage expands the register space
  - Only block outputs are written back to the register file
  - No HW dependency analysis! No associative issue queues! No global bypass network! No register renaming! Fewer RF ports needed!
- Dynamic instruction issue
  - Grid forms a large distributed window with independent issue controls
  - Instructions execute in original dataflow order (see the sketch below)

9
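As a concrete illustration of that issue rule, here is a minimal simulator-style sketch (hypothetical Python; the coordinates and instructions follow the compilation example on the next slides): an instruction fires as soon as its operand slots fill, and its result is delivered only to the compiler-named target slots, so no associative wakeup or global bypass is required.

```python
# Minimal sketch of dataflow-order issue with explicit targets
# (hypothetical simulator code; i1 at (1,3) feeds i2 at (2,2)).

OPS = {"add": lambda a, b: a + b}

class Instruction:
    def __init__(self, opcode, targets, needed=2):
        self.opcode = opcode      # e.g. "add"
        self.targets = targets    # [(node, slot), ...] named by the compiler
        self.needed = needed      # number of operands to wait for
        self.operands = {}        # filled in as producers fire

grid = {
    (1, 3): Instruction("add", [((2, 2), "B")]),  # i1: r1 = r2 + r3
    (2, 2): Instruction("add", []),               # i2: r7 = r2 + r1 (output)
}

def deliver(node, slot, value, ready):
    inst = grid[node]
    inst.operands[slot] = value
    if len(inst.operands) == inst.needed:
        ready.append(node)        # wakes up exactly one instruction

ready = []
deliver((1, 3), "A", 2, ready)    # move injects block input r2 = 2
deliver((2, 2), "A", 2, ready)    # ...r2 also targets i2
deliver((1, 3), "B", 3, ready)    # move injects block input r3 = 3
while ready:                      # execute in dataflow order
    node = ready.pop(0)
    inst = grid[node]
    result = OPS[inst.opcode](inst.operands["A"], inst.operands["B"])
    for tgt_node, tgt_slot in inst.targets:
        deliver(tgt_node, tgt_slot, result, ready)
    print(node, inst.opcode, "->", result)   # (1,3) -> 5, then (2,2) -> 7
```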

Block Compilation

Intermediate Code -> Data Flow Graph -> Mapping -> GPA Code

Intermediate code:

  i1) add r1, r2, r3
  i2) add r7, r2, r1
  i3) ld r4, (r1)
  i4) add r5, r4, 1
  i5) beqz r5, 0xdeac

  Inputs: (r2, r3)   Temporaries: (r1, r4, r5)   Outputs: (r7)

The compiler transforms this into a data flow graph:

  move r2 -> i1, i2
  move r3 -> i1

[Figure: data flow graph — r2 and r3 feed i1; i1 feeds i2 and i3; i3 feeds i4; i4 feeds i5; i2 produces block output r7.]

10

Block Compilation (cont.)

Intermediate Code -> Data Flow Graph -> Mapping -> GPA Code

The scheduler maps the data flow graph onto the GPA, replacing abstract instruction names with grid coordinates:

  move r2 -> (1,3), (2,2)
  move r3 -> (1,3)

[Figure: the data flow graph placed on the grid — i1 at (1,3), with i2, i3, i4, and i5 at nearby nodes; block output r7 exits to the register file.]

11

Block Compilation (cont.)

Intermediate Code -> Data Flow Graph -> Mapping -> GPA Code

Code generation emits each instruction with its grid location, its opcode, and its targets (the locations of its consumers):

  I1) [(1,3)] add (2,2) (2,3)
        |      |     |
        |      |     targets
        |      opcode
        instruction location

A sketch of this code-generation step follows below.

12
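Here is a minimal sketch of that code-generation step in Python (hypothetical; only I1's placement appears on the slide, so the coordinates assigned to i2-i5 are assumptions for illustration). For each instruction it finds the consumers of the value it defines and emits their grid locations as targets; an empty target list marks a block output written back to the register file.

```python
# Hypothetical sketch of GPA code generation for the example block.
# Only I1, "(1,3) add (2,2) (2,3)", appears on the slide; the
# coordinates for i2-i5 below are assumed for illustration.

block = [
    ("i1", "add",  "r1", ["r2", "r3"]),
    ("i2", "add",  "r7", ["r2", "r1"]),
    ("i3", "ld",   "r4", ["r1"]),
    ("i4", "add",  "r5", ["r4", "#1"]),
    ("i5", "beqz", None, ["r5"]),
]
placement = {"i1": (1, 3), "i2": (2, 2), "i3": (2, 3),
             "i4": (3, 3), "i5": (4, 3)}

def consumers(block, reg, after):
    """Instructions that read `reg` before it is redefined."""
    found = []
    for name, _, dst, srcs in block[after + 1:]:
        if reg in srcs:
            found.append(name)
        if dst == reg:
            break
    return found

for idx, (name, op, dst, srcs) in enumerate(block):
    tgts = [placement[c] for c in consumers(block, dst, idx)] if dst else []
    # An empty target list means the value is a block output (or a
    # branch) rather than an in-grid temporary.
    print(f"{name}) {placement[name]} {op} -> {tgts}")
# i1) (1, 3) add -> [(2, 2), (2, 3)]   (matches I1 on the slide)
```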

Block Execution

1. Instruction distribution
2. Input register fetch
3. Block execution
4. Output register writeback

[Figure: the example block in flight — ICache banks 0-3 distribute the block's add, add, load, and beqz instructions to their grid nodes; register banks inject inputs r2 and r3; the load accesses a DCache bank through the load/store queues; block termination logic writes output r7 back to the register file.]

13

Instruction Buffers: Frames

- Instruction buffers add depth and define frames
  - 2D grid of execution units; 3D scheduling of instructions
  - Allows very large blocks to be mapped onto the grid
  - Result addresses are explicitly specified in 3 dimensions: (x, y, z)
- Instruction buffers form a logical z-dimension in each node
  - 4 logical frames, each with 16 instruction slots (see the sketch below)

[Figure: execution node detail — a stack of instruction buffer entries (control, opcode, src/val fields) in front of the ALU and router; the example block's add, add, load, add, and beqz instructions occupy slots of one frame.]

14
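A minimal sketch of the resulting address space, in hypothetical Python: with 4-deep buffers on a 4x4 grid, a target names a column, a row, and a frame, and mapping a block means filling one frame's slots.

```python
# Hypothetical sketch of 3-D instruction addressing: (x, y) picks the
# grid node, z picks the frame within that node's instruction buffers.

GRID_X, GRID_Y, DEPTH = 4, 4, 4      # 4x4 grid, 4 frames of 16 slots

class Node:
    def __init__(self):
        self.slots = [None] * DEPTH  # one instruction slot per frame

grid = [[Node() for _ in range(GRID_Y)] for _ in range(GRID_X)]

def map_block(block, frame):
    """Place one block's instructions into frame `frame`."""
    assert len(block) <= GRID_X * GRID_Y, "block too large for one frame"
    for i, inst in enumerate(block):
        x, y = divmod(i, GRID_Y)     # simple row-major placement
        grid[x][y].slots[frame] = inst

map_block(["add", "add", "load", "add", "beqz"], frame=0)
print(grid[0][2].slots[0])           # -> "load": target address (0, 2, 0)
```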

Using Frames for Speculation and ILP

- 16 total frames (4 sets of 4)
- Map A onto the grid; start executing A
- Predict C is the next block; speculatively execute C
- Predict D is after C; speculatively execute D
- Predict E is after D; speculatively execute E
- Result: an enormous effective instruction window for extracting ILP
  - Increased utilization of execution units (accuracy counts!)
  - Latency tolerance for grid delays and load instructions
  - (A sketch of the frame bookkeeping follows below.)

[Figure: control flow graph from start to end with blocks A, B, C, D, and E; the frames hold A plus speculative copies of C, D, and E.]

15
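Here is a minimal sketch of that frame bookkeeping in Python (hypothetical; the block names and 4-frame count follow the slide): the oldest mapped block is architectural, predicted successors occupy free frames, and a misprediction returns every speculative frame to the free pool.

```python
# Hypothetical sketch of frame bookkeeping for block speculation.

from collections import deque

class FrameManager:
    def __init__(self, num_frames=4):
        self.free = deque(range(num_frames))
        self.inflight = deque()            # (frame, block), oldest first

    def map_block(self, block):
        frame = self.free.popleft()        # stalls here if no frame is free
        self.inflight.append((frame, block))
        return frame

    def commit_oldest(self):
        frame, _ = self.inflight.popleft() # oldest block retires
        self.free.append(frame)

    def squash_speculative(self):
        while len(self.inflight) > 1:      # keep only the oldest block
            frame, _ = self.inflight.pop()
            self.free.append(frame)

fm = FrameManager()
fm.map_block("A")                 # architectural block
for nxt in ("C", "D", "E"):       # predicted successors, as on the slide
    fm.map_block(nxt)             # speculative
fm.commit_oldest()                # A completes; C becomes architectural
fm.squash_speculative()           # if C's exit was mispredicted: D, E freed
```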

Results: GPA Instructions per Cycle

[Figure: bar chart of IPC (y-axis, 0-7) for three configurations — 8x8 GPA, 4x4 GPA, and a 4-issue superscalar — across benchmarks including ammp, art, bzip2, compress, gzip, m88ksim, mcf, parser, twolf, and vortex, plus their MEAN.]

16

Using Frames for Thread-Level Parallelism

- Divide the frame space among threads
  - Each thread's share can be further divided to enable some degree of speculation
  - Shown: 2 threads, each with 1 speculative block
  - An alternate configuration might provide 4 threads
- Result: simultaneous multithreading (SMT) for grid processors
- Polymorphism: use the same resources in different ways for different workloads (T-morph); see the sketch below

[Figure: the frame space split between thread 1 and thread 2, each holding a non-speculative block A and a speculative block B.]

17
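A minimal sketch of the frame split, in hypothetical Python: the same pool of frames yields two threads with one speculative block each, or four threads with none.

```python
# Hypothetical sketch of T-morph frame partitioning.

def partition_frames(num_frames, num_threads):
    """Give each thread an equal, contiguous share of the frame space."""
    per_thread = num_frames // num_threads
    return {t: list(range(t * per_thread, (t + 1) * per_thread))
            for t in range(num_threads)}

print(partition_frames(4, 2))  # {0: [0, 1], 1: [2, 3]} -> 1 spec block each
print(partition_frames(4, 4))  # one frame per thread, no speculation
```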

Using Frames for Data-Level Parallelism

- Streaming kernel:
  (1) read input stream element
  (2) process element
  (3) write output stream element
- Unroll the kernel 8x: a loop of N iterations becomes N/8 iterations of one very large block
- Map very large blocks (kernels); fetch once, use many times
- Not shown: streaming data channels
- Result: the instruction buffers act as a distributed I-cache
  - Ability to absorb and process large amounts of streaming data
  - Another type of polymorphism (S-morph); see the sketch below

[Figure: a small three-instruction loop (loop N times) unrolled 8x into one large resident kernel (loop N/8 times) that fills the frame space.]

18
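A minimal sketch of the fetch-once/use-many behavior, in hypothetical Python: the kernel is mapped into the frames a single time, then re-executed for every group of stream elements.

```python
# Hypothetical sketch of S-morph reuse: the 8x-unrolled kernel is
# fetched and mapped into the frames once, then re-executed N/8 times,
# so the instruction buffers behave like a distributed I-cache.

def run_stream(kernel, stream, unroll=8):
    fetches = 0
    resident = None                        # kernel currently in the frames
    output = []
    for base in range(0, len(stream), unroll):
        if resident is None:
            resident = kernel              # fetch + map happens once
            fetches += 1
        for elem in stream[base:base + unroll]:
            output.append(resident(elem))  # reuse the mapped kernel
    return output, fetches

out, fetches = run_stream(lambda x: 2 * x, list(range(32)))
assert fetches == 1                        # fetched once, run N/8 = 4 times
```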

Conclusions

Technology and architecture trends:
- Good news: lots of transistors, faster transistors
- Bad news: global wire delays are growing; pipeline depth is near optimal; superscalar is pushing the limits of complexity

The GPA represents a promising technology direction:
- Wire-delay constraints addressed in both the microarchitecture and the architecture
- Eliminates the difficult centralized structures dominating today's designs
- Architectural partitioning encourages regularity and re-use
- Enhanced information flow between compiler and hardware
- Polymorphic features: performance on a wide range of workloads

19

Future Work

- Architectural refinement
  - Block-oriented predictors
  - Selective re-execution
- Enhanced compilation and scheduling tools
  - Hyperblock formation
  - 3D instruction scheduling algorithms
- Compatibility bridge to existing architectures
- Hardware prototype (currently in planning stage)
  - Four 4x4 GPA cores + NUCA L2 cache on chip
  - 0.10um, ~350mm², 1000+ signal I/O, 300MHz
  - 4Q 2004 tape-out

20
