A Wire-Delay Scalable Microprocessor Architecture for High Performance Systems
Stephen W. Keckler [1], Doug Burger [1], Charles R. Moore [1], Ramadass Nagarajan [1], Karthikeyan Sankaralingam [1], Vikas Agarwal [2], M.S. Hrishikesh [2], Nitya Ranganathan [1], and Premkishore Shivakumar [2]
Computer Architecture and Technology Laboratory
[1] Department of Computer Sciences
[2] Department of Electrical and Computer Engineering
The University of Texas at Austin
Outline
 Progress and Limitations of Conventional Superscalar Designs
 Grid Processor Architecture (GPA) Overview
 Block Compilation
 Block Execution Flow
 Results
 Extending the GPA with Polymorphism
 Conclusion and Future Work
Superscalar Core: Spot the ALU
[Die photo: a modern CPU core, with its two ALUs, two FPUs, and two LD/ST units highlighted]
Only 12% of Non-Cache, Non-TLB Core Area is Execution Units
Looking Back: Conventional Superscalar
 Enormous gains in frequency
   1998: 500 MHz -> 2002: 3000 MHz
   Equal contributions from pipelining and technology
 IPC basically unchanged
   1998: ~1 IPC -> 2002: ~1 IPC
   uArch innovations just overcome the losses due to deeper pipelining
   Issue width remains at 4 instructions
 Pushing the limits of complexity management
   uArch innovations
   Verification is the gate
   Hundreds of full-custom macros
   250-500 person design teams
   Execution units are a small % of processor area
Faster, Higher IPC Superscalar Processors?
Faster: deeper pipelines (8 FO4)
   Key latencies (in cycles) increase -> IPC decreases (pipeline bubbles)
   uArch innovations mitigate the losses, but...
    they increase complexity and performance anomalies
   After the one-time jump to 8 FO4, frequency growth is limited to technology scaling only

Higher IPC: wide issue (16) and a large window (512+)
   Complexity growth is quadratic, but the performance gain is logarithmic
    Results are broadcast to all pending instructions
    Studies indicate only incremental performance gains
   Wire delay limits the size of monolithic structures
    Large structures must be partitioned to meet cycle time
    Key latencies increase, reducing the IPC gain (again!)
    Additional logic and circuit complexity
Superscalar Cores: Key Circuit Elements

Structure    | Conventional 4-issue              | Hypothetical 16-issue
-------------|-----------------------------------|-----------------------------------
Execution    | 2 FP, 2 INT, 2 LD/ST              | 8 FP, 8 INT, 8 LD/ST
I-Cache      | 64KB, 1 port, 64B (x1)            | 128KB, 2 ports, 128B (x1)
Mapper       | 8-port x 72-entry CAM (x2)        | 32-port x 512-entry CAM (x2)
Issue Queue  | 4-port x 20-entry dual CAM (x3)   | 4-port x 40-entry dual CAM (x12)
RegFiles     | 72 entries, 4R/5W ports (x4)      | 512 entries, 4R/18W ports (x8)
D-Cache      | 32KB, 2R/1W ports (x1)            | 128KB, 8R/4W ports (x1)
... and pipeline these to use only 8 FO4 delays per cycle!
What is Going Wrong?
1. Superscalar microarchitecture: scalability is limited
    Relies on large, centralized structures that want to grow larger
    Partitioning is a slippery slope: complexity and IPC loss

2. Architecture: the conventional binary interface is outdated!
    A linear sequence of instructions, defined for simple, single-issue machines
    Not natural for the compiler...
     It internally builds and optimizes a 2D control flow graph (CFG)
     It is forced to map the CFG onto a 1D linear sequence
     Lots of useful information gets thrown away
    Not natural for instruction-parallel machines...
     Instruction relationships are scattered throughout the linear sequence
     They must be dynamically re-established by scanning the linear sequence
     An N^2 problem -> large, centralized structures
Grid Processor Overview
 Wire-delay constraints exposed at the architecture level
 Renegotiate the compiler / HW binary interface
[Diagram: GPA organization. A 2D grid of execution nodes is fed by a banked register file (banks 0-3, with move units M) and instruction cache banks 0-3; data cache banks 0-3 and load/store queues sit alongside the grid, with block termination logic at the edge. Each execution node holds instruction buffers (Instr), operand slots (Op A, Op B), an ALU, and a router with N/S/E/W links.]
GPA Execution Model
 Compiler structures the program into a sequence of hyperblocks
   Atomic unit of fetch / schedule / execute / commit
 Blocks specify explicit instruction placement in the GRID
   Critical path placed to minimize communication delays
   Less critical instructions placed in the remaining positions
 Instructions specify their consumers as explicit targets
   CFG cast directly into the instruction encoding
   Point-to-point result forwarding
   In-GRID storage expands the register space
   Only block outputs are written back to the RF
  -> no HW dependency analysis! no associative issue queues! no global bypass network! no register renaming! Fewer RF ports needed!
 Dynamic instruction issue (sketched below)
   The GRID forms a large distributed window with independent issue controls
   Instructions execute in original dataflow order
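The issue rule this implies is simple enough to sketch in a few lines. Below is a minimal, illustrative model of dataflow-order execution (the class and function names are hypothetical, not the hardware's actual logic): an instruction fires as soon as all of its operands have arrived, and its result is routed only to the explicitly named target slots.

```python
# Minimal sketch of GPA-style dataflow issue (illustrative only).
# An instruction fires once all operands arrive; results travel
# point-to-point to explicitly named consumers -- no associative
# wakeup, no broadcast bypass, no register renaming.

ALU = {"add": lambda a, b: a + b, "inc": lambda a: a + 1}

class Instr:
    def __init__(self, opcode, n_operands, targets=()):
        self.opcode = opcode
        self.needed = n_operands        # operands still outstanding
        self.operands = []
        self.targets = list(targets)    # consumer Instr slots

    def deliver(self, value, ready):
        """Called when a producer routes a value to this slot."""
        self.operands.append(value)
        self.needed -= 1
        if self.needed == 0:            # all inputs arrived -> ready to fire
            ready.append(self)

def run_block(block_inputs):
    """block_inputs: (value, consumers) pairs injected by 'move' ops."""
    ready = []
    for value, consumers in block_inputs:
        for c in consumers:
            c.deliver(value, ready)
    while ready:
        i = ready.pop(0)
        result = ALU[i.opcode](*i.operands)
        if not i.targets:               # no consumers -> block output
            print("block output:", result)
        for t in i.targets:             # explicit point-to-point forwarding
            t.deliver(result, ready)

# move r2 -> i1, i2 ; move r3 -> i1   (with r2 = 5, r3 = 7)
i2 = Instr("add", 2)                    # produces block output r7
i1 = Instr("add", 2, [i2])              # i1 feeds i2
run_block([(5, [i1, i2]), (7, [i1])])   # prints "block output: 17"
```

Note that nothing here scans a window or broadcasts a tag: each producer alone knows who consumes its value.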
Block Compilation
Intermediate Code -> Data Flow Graph -> Mapping -> GPA Code

Intermediate code:
  i1) add  r1, r2, r3
  i2) add  r7, r2, r1
  i3) ld   r4, (r1)
  i4) add  r5, r4, 1
  i5) beqz r5, 0xdeac
Inputs: (r2, r3)   Temporaries: (r1, r4, r5)   Outputs: (r7)

Data flow graph (after compiler transforms):
  move r2 -> i1, i2
  move r3 -> i1
[Diagram: DFG with edges i1 -> i2, i3; i3 -> i4; i4 -> i5; i2 produces block output r7]
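As a concrete illustration of this step, here is a small sketch (hypothetical, not the actual compiler) that recovers the data flow graph above from the block's linear sequence: def-use chains become explicit edges, registers read before being defined become block inputs, and live-out registers become block outputs.

```python
# Build a data flow graph from a block's linear instruction sequence
# (illustrative sketch). Each instruction is (name, dest, sources).
block = [
    ("i1", "r1", ["r2", "r3"]),   # add  r1, r2, r3
    ("i2", "r7", ["r2", "r1"]),   # add  r7, r2, r1
    ("i3", "r4", ["r1"]),         # ld   r4, (r1)
    ("i4", "r5", ["r4"]),         # add  r5, r4, 1
    ("i5", None, ["r5"]),         # beqz r5, 0xdeac
]
live_out = {"r7"}                  # assumed known from liveness analysis

producer = {}                      # register -> defining instruction
edges, inputs, outputs = [], set(), set()
for name, dst, srcs in block:
    for src in srcs:
        if src in producer:
            edges.append((producer[src], name))    # def-use edge
        else:
            inputs.add(src)                        # block input (move)
            edges.append((f"move {src}", name))
    if dst is not None:
        producer[dst] = name
        if dst in live_out:
            outputs.add(dst)                       # written back at commit

print("inputs:", inputs)     # {'r2', 'r3'}
print("outputs:", outputs)   # {'r7'}
print("edges:", edges)       # matches the DFG on the slide
```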
Block Compilation (cont)
Intermediate Code -> Data Flow Graph -> Mapping -> GPA Code

Data flow graph:
  move r2 -> i1, i2
  move r3 -> i1

Mapping onto the GPA (scheduler):
  move r2, (1,3), (2,2)
  move r3, (1,3)
[Diagram: the scheduler maps the DFG onto grid coordinates, e.g. i1 at (1,3) and i2 at (2,2), with i3, i4, and i5 placed further along and block output r7 exiting the grid]
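The mapping step can be pictured as a placement problem. The greedy sketch below is purely illustrative (the cost function and the downward-flow penalty are assumptions, not the real scheduler): each instruction, taken in topological order, is placed at the free grid slot closest to its already-placed producers.

```python
# Greedy placement sketch (hypothetical): put each instruction at the
# free grid slot nearest its producers, processed in topological order.
ROWS, COLS = 4, 4

def schedule(order, preds):
    """order: instructions in topological order;
    preds: instr -> list of producer instrs."""
    placement = {}
    free = {(r, c) for r in range(ROWS) for c in range(COLS)}
    for instr in order:
        sources = [placement[p] for p in preds.get(instr, [])]
        def cost(slot):
            r, c = slot
            if not sources:
                return r                 # block inputs enter near the top
            # Manhattan distance approximates routing delay; penalize
            # slots that are not below a producer (results flow onward).
            return sum(abs(r - pr) + abs(c - pc) + (4 if r <= pr else 0)
                       for pr, pc in sources)
        best = min(free, key=cost)
        placement[instr] = best
        free.remove(best)
    return placement

preds = {"i2": ["i1"], "i3": ["i1"], "i4": ["i3"], "i5": ["i4"]}
print(schedule(["i1", "i2", "i3", "i4", "i5"], preds))
```

The critical path (i1 -> i3 -> i4 -> i5 through the load) ends up in adjacent slots, which is exactly the "minimize communication delays" goal stated earlier.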
Block Compilation (cont)
Intermediate Code -> Data Flow Graph -> Mapping -> GPA Code

Mapping onto the GPA:
  move r2, (1,3), (2,2)
  move r3, (1,3)
[Diagram: the DFG placed on the grid, with i1 at (1,3) feeding i2 and i3]

GPA code (code generation):
  I1)  (1,3):  add  (2,2) (2,3)
       fields: instruction location, opcode, targets
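A sketch of what such an encoding might look like as packed fields; the opcode numbering and field widths below are hypothetical, sized only for the 4x4 grid in this example.

```python
# Hypothetical GPA instruction word for a 4x4 grid (widths illustrative):
# each location or target is a (row, col) coordinate, 2 bits apiece.
OPCODES = {"add": 0x1, "ld": 0x2, "beqz": 0x3, "move": 0x4}

def encode(opcode, loc, targets):
    """loc and targets are (row, col) pairs on a 4x4 grid."""
    word = OPCODES[opcode]
    word = (word << 4) | (loc[0] << 2) | loc[1]   # placement field
    for (r, c) in targets:                        # explicit consumer fields
        word = (word << 4) | (r << 2) | c
    return word

# I1) (1,3): add (2,2) (2,3)  -- i1 feeds i2 and i3
print(hex(encode("add", (1, 3), [(2, 2), (2, 3)])))
```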
Block Execution
 Four phases: instruction distribution, input register fetch, block execution, output register writeback
[Diagram (one panel per phase): the block's instructions (add, add, load, add, beqz) are distributed from ICache banks 0-3 into the grid; input registers r2 and r3 are fetched from register file banks 0-3 and injected by moves; values then flow point-to-point through the grid, with the load passing through the load/store queues to DCache banks 0-3; finally, output r7 is written back and the block termination logic commits the block.]
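The commit rule suggested by the writeback phase can be sketched in a few lines. This is an assumption drawn from these slides, not a real implementation: the block termination logic counts the register outputs and stores the block was encoded to produce, and commits once all of them have arrived.

```python
# Sketch of block termination / commit detection (hypothetical).
# The block header is assumed to name how many register outputs
# and stores the block produces.
class BlockTracker:
    def __init__(self, n_reg_outputs, n_stores):
        self.regs_left = n_reg_outputs
        self.stores_left = n_stores

    def register_writeback(self, reg):
        self.regs_left -= 1
        return self.done()

    def store_complete(self):
        self.stores_left -= 1
        return self.done()

    def done(self):
        # Commit (or, if speculative, become committable) once every
        # architected output has been produced.
        return self.regs_left == 0 and self.stores_left == 0

t = BlockTracker(n_reg_outputs=1, n_stores=0)  # the example block writes r7
print(t.register_writeback("r7"))              # True -> block commits
```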
Instruction Buffers - frames
 Instruction buffers add depth and define frames
   2D GRID of execution units -> 3D scheduling of instructions
   Allows very large blocks to be mapped onto the GRID
   Result addresses are explicitly specified in 3 dimensions: (x, y, z)
[Diagram: an execution node whose instruction buffer holds a stack of slots, each with control bits, an opcode, and two source-value fields, feeding the ALU and router; the buffered instructions (add, add, load, add, beqz) occupy different depths.]
 The instruction buffers form a logical z-dimension in each node
 4 logical frames, each with 16 instruction slots
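With depth added, a target address gains a third coordinate. A tiny sketch for the configuration shown here (4x4 grid, 4 frames); the bit layout is hypothetical:

```python
# A target names a node (x, y) and a buffer slot z within that node.
# For a 4x4 grid with 4 frames of depth: 2 bits each -> 6 bits per target.
def pack_target(x, y, z):
    return (x << 4) | (y << 2) | z

def unpack_target(t):
    return (t >> 4) & 0x3, (t >> 2) & 0x3, t & 0x3

assert unpack_target(pack_target(2, 3, 1)) == (2, 3, 1)
```

A real encoding would presumably also select which operand slot of the consumer (left or right source value) the routed result fills.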
Using frames for Speculation and ILP
 16 total frames (4 sets of 4)
 Map A onto the GRID; start executing A
 Predict C is the next block; speculatively execute C
 Predict D follows C; speculatively execute D
 Predict E follows D; speculatively execute E
[Diagram: a CFG from start to end with blocks A, B, C, D, and E; the frame sets hold A plus speculative C, D, and E]
Result:
  An enormous effective instruction window for extracting ILP
  Increased utilization of execution units (prediction accuracy counts!)
  Latency tolerance for GRID delays and load instructions
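The frame-management loop this slide implies can be sketched as follows (the structure is entirely hypothetical): keep the oldest block non-speculative, map predicted successors into free frame sets, and squash all younger blocks on a misprediction.

```python
# Illustrative frame manager for block-level speculation (hypothetical).
from collections import deque

FRAME_SETS = 4    # 16 frames as 4 sets of 4 -> up to 4 blocks in flight

def run(predict_next, actual_next, start):
    """predict_next: block predictor; actual_next: resolved successor,
    known when a block commits. Both return None at program exit."""
    in_flight, pc = deque(), start
    while pc is not None or in_flight:
        # Fill free frame sets with the predicted chain of blocks.
        while pc is not None and len(in_flight) < FRAME_SETS:
            in_flight.append(pc)           # map block onto the GRID
            pc = predict_next(pc)          # speculate on its successor
        done = in_flight.popleft()         # oldest block commits first
        print("commit", done)
        actual = actual_next(done)
        if not in_flight or in_flight[0] != actual:
            in_flight.clear()              # mispredict: squash younger blocks
            pc = actual                    # refetch down the correct path

# Perfect prediction over the slide's chain A -> C -> D -> E:
cfg = {"A": "C", "C": "D", "D": "E", "E": None}
run(cfg.get, cfg.get, "A")                 # commits A, C, D, E
```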
Results: GPA Instructions per Cycle
[Bar chart: IPC (y-axis, 0 to 7) of an 8x8 GPA and a 4x4 GPA versus a 4-issue superscalar, across the benchmarks ammp, art, bzip2, compress, gzip, m88ksim, mcf, parser, twolf, and vortex, plus the mean.]
Using frames for Thread-Level Parallelism
[Diagram: the frame space divided between Thread 1 and Thread 2, each holding a non-speculative block A and a speculative block B]
 Divide the frame space among threads
   Each thread's share can be further divided to enable some degree of speculation
   Shown: 2 threads, each with 1 speculative block
   An alternate configuration might provide 4 threads
Result:
  Simultaneous Multithreading (SMT) for Grid Processors
  Polymorphism: use the same resources in different ways for different workloads (T-morph)
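The T-morph trade-off is essentially an allocation policy over the same 16 frames. A hypothetical sketch: the frame sets are divided evenly among threads, and whatever remains of each thread's share beyond one non-speculative frame set is available for speculation.

```python
# Hypothetical T-morph configurations over 16 frames (4 sets of 4):
# trade per-thread speculation depth for thread count.
def partition_frames(n_threads, total_sets=4):
    assert total_sets % n_threads == 0
    per_thread = total_sets // n_threads
    # Each thread keeps 1 non-speculative frame set; the rest of its
    # share holds speculative successor blocks.
    return {t: {"frame_sets": per_thread, "spec_depth": per_thread - 1}
            for t in range(n_threads)}

print(partition_frames(1))   # 1 thread, speculating 3 blocks deep
print(partition_frames(2))   # as on the slide: 2 threads, 1 spec block each
print(partition_frames(4))   # 4 threads, no speculation
```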
Using frames for Data-Level Parallelism
Streaming kernel:
  read an input stream element
  process the element
  write an output stream element
[Diagram: a loop of N iterations over the kernel is unrolled 8x into one large kernel, with copies (1), (2), (3), ..., (8) fused into a single block that loops N/8 times]
 Map very large blocks (kernels): fetch once, use many times
 Not shown: streaming data channels
Result:
  The instruction buffers act as a distributed I-Cache
  Ability to absorb and process large amounts of streaming data
  Another type of polymorphism (S-morph)
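At its core the S-morph transformation is loop unrolling, as the toy sketch below shows (names hypothetical): eight copies of the kernel are fused into one large block that is fetched into the instruction buffers once and reused for all N/8 outer iterations.

```python
# Illustrative S-morph execution: unroll a streaming kernel 8x so the
# fused block fills the instruction buffers and is fetched only once.
UNROLL = 8

def run_stream(kernel, instream, outstream):
    assert len(instream) % UNROLL == 0
    # 'Fetch once, use many times': the unrolled block stays resident
    # in the instruction buffers across all N/8 outer iterations.
    for base in range(0, len(instream), UNROLL):
        for k in range(UNROLL):                 # 8 fused kernel copies
            outstream.append(kernel(instream[base + k]))

out = []
run_stream(lambda x: 2 * x + 1, list(range(16)), out)
print(out)
```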
Conclusions
 Technology and architecture trends:
   Good news: lots of transistors, faster transistors; pipeline depth near optimal
   Bad news: global wire delays are growing; superscalar is pushing the limits of complexity
 The GPA represents a promising technology direction
   Wire-delay constraints are exposed to both the microarchitecture and the architecture
   Eliminates the difficult centralized structures that dominate today's designs
   Architectural partitioning encourages regularity and re-use
   Enhanced information flow between compiler and hardware
   Polymorphic features deliver performance on a wide range of workloads
Future Work
 Architectural refinement
   Block-oriented predictors
   Selective re-execution
 Enhance compilation and scheduling tools
   Hyperblock formation
   3D instruction scheduling algorithms
 Compatibility bridge to existing architectures
 Hardware prototype (currently in the planning stage)
   Four 4x4 GPA cores + NUCA L2 cache on chip
   0.10um, ~350mm^2, 1000+ signal I/O, 300MHz
   4Q 2004 tape-out