Lecture: Systolic Arrays I
• Topics: sorting and matrix algorithms
Dense Computation
• Distribute memory across multiple chips; sufficient on-chip
wiring to feed computational units
• How do we design the compute units?
• GPU (too general-purpose)
• DaDianNao’s NFU (custom SIMD)
• Eyeriss’ spatial architecture (basic tile, operand network)
• ISAAC (analog)
• Systolic arrays: dense compute units; data flows through these units with low read/write costs; loose connection to the brain; effective for image processing, pattern recognition, etc.
Sorting on a Linear Array
• Each processor has bidirectional links to its neighbors
• All processors share a single clock (asynchronous designs
will require minor modifications)
• At each clock, processors receive inputs from neighbors,
perform computations, generate output for neighbors, and
update local storage
(figure: linear array of processors; input enters and output exits at the left end)
Control at Each Processor
• Each processor stores the minimum number it has seen
• Initial value in storage and on network is “*”, which is
bigger than any input and also means “no signal”
• On receiving number Y from its left neighbor, the processor keeps the smaller of Y and its current storage Z, and passes the larger to its right neighbor (sketched in code below)
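A minimal cycle-accurate sketch of this insertion phase in Python (my own model, not from the lecture: “*” is represented as +infinity, and the 2N-clock settling window is an assumption):

```python
def systolic_sort_insert(values):
    """Insertion phase of sorting on a linear array: one value enters at
    the left per clock; each processor keeps the smaller of the arriving
    value Y and its storage Z, and passes the larger to the right."""
    n = len(values)
    STAR = float("inf")                    # "*": no signal, bigger than any input
    store = [STAR] * n                     # local storage, initially "*"
    arriving = [STAR] * n                  # arriving[p]: value reaching processor p
    for t in range(2 * n):                 # 2N clocks let every value settle
        arriving[0] = values[t] if t < n else STAR
        outgoing = [STAR] * n
        for p in range(n):                 # all processors act on the same clock
            y, z = arriving[p], store[p]
            store[p], outgoing[p] = min(y, z), max(y, z)
        arriving = [STAR] + outgoing[:-1]  # values move one hop right per clock
    return store                           # processor i holds the i-th smallest

print(systolic_sort_insert([5, 2, 7, 1, 4]))  # [1, 2, 4, 5, 7]
```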
Sorting Example
Result Output
• The output process begins when a processor receives a non-* value followed by a “*”
• Each processor forwards its stored value to its left neighbor, and then forwards the subsequent data it receives from its right neighbor
• How many steps does it take to sort N numbers?
• What is the speedup and efficiency? (a rough answer is sketched below)
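A rough, hedged answer to these questions (my own back-of-envelope, not the lecture's official numbers): the N inputs stream in over N clocks, and once output begins, one value exits the left end per clock, so the whole sort finishes in Θ(N) clocks – roughly 2N. Against a Θ(N log N) sequential sort on one processor, that is a speedup of Θ(log N) using N processors, i.e. efficiency Θ(log N / N). A sketch of the output phase, reusing systolic_sort_insert from the earlier sketch:

```python
def systolic_sort_drain(store):
    """Output phase: each processor sends its stored value to its left
    neighbor, then relays whatever arrives from its right; the sorted
    sequence exits the left end, smallest first, one value per clock."""
    out = []
    for _ in range(len(store)):
        out.append(store[0])                 # left end emits a value
        store = store[1:] + [float("inf")]   # everything shifts one hop left
    return out

print(systolic_sort_drain(systolic_sort_insert([5, 2, 7, 1, 4])))
# [1, 2, 4, 5, 7]
```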
Output Example
Bit Model
• The bit model affords a more precise measure of complexity – we will now assume that each processor can operate on only one bit at a time
• To compare N k-bit words, you may now need an N × k 2-D array of bit processors
Pipelined Comparison
Input numbers: 3 4 2 (fed bit-serially, MSB first: 3 = 011, 4 = 100, 2 = 010)
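A small sketch of the bit-model compare-exchange (framing and names are mine): the two numbers arrive MSB-first, one bit per clock, and the keep/swap decision latches on the first clock where the bit streams differ.

```python
def bit_serial_compare(a_bits, b_bits):
    """Compare two k-bit numbers fed MSB-first, one bit per clock, as a
    column of bit processors would: latch the decision the first time
    the two bit streams differ."""
    a_smaller = None                   # undecided until the bits differ
    for a, b in zip(a_bits, b_bits):
        if a_smaller is None and a != b:
            a_smaller = a < b          # MSB-first: first difference decides
    if a_smaller is None:              # equal numbers: order is arbitrary
        a_smaller = True
    return (a_bits, b_bits) if a_smaller else (b_bits, a_bits)

# 3 = 011 and 4 = 100, as in the example above
print(bit_serial_compare([0, 1, 1], [1, 0, 0]))  # ([0, 1, 1], [1, 0, 0])
```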
Comparison Strategies
• Strategy 1: Bits travel horizontally, keep/swap signals
travel vertically; if inputs arrive from the left, the array is
sorted in 2N + k steps
• Strategy 2: Use a tree to communicate information on which number is greater – can set up a pipeline so the sorting happens in 2N + log k steps (arithmetic check below)
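A hedged arithmetic check on the two step counts just quoted (the N and k values are my own example):

```python
import math

N, k = 1024, 32
print(2 * N + k)                        # strategy 1: 2080 steps
print(2 * N + math.ceil(math.log2(k)))  # strategy 2: 2053 steps
```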
Lower Bounds
• Input/Output bandwidth: Nk bits are being input/output with k pins – requires Ω(N) time
• Diameter: the comparison at processor (1,1) influences the value of the bit stored at processor (N,k) – for example, when N−1 numbers are 011…1 and the last number is either 00…0 or 10…0 – so it takes at least N + k − 2 steps for information to travel across the diameter
• Bisection width: if processors in one half require results computed by the other half, the bisection bandwidth imposes a minimum completion time (a numeric illustration of these bounds follows below)
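A hedged numeric illustration of these bounds (the parameter values are my own example, assuming k pins):

```python
N, k = 1024, 32
io_bound = (N * k) // k       # Nk bits through k pins: Omega(N) -> 1024 steps
diameter_bound = N + k - 2    # crossing the N x k array -> 1054 steps
print(max(io_bound, diameter_bound))  # the tighter of the two: 1054
```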
Counterexample
• N 1-bit numbers that need to be sorted with a binary tree
• Since bisection bandwidth is 2 and each number may be
in the wrong half, will any algorithm take at least N/2 steps?
Counting Algorithm
• It takes O(log N) time for each intermediate node to add the contents of its subtree and forward the result to its parent, one bit at a time
• After the root has computed the number of 1’s, this
number is communicated to the leaves – the leaves
accordingly set their output to 0 or 1
• Each half only needs to know the number of 1’s in the other half (log N − 1 bits) – therefore, the bisection argument only implies an Ω(log N) bound, and the algorithm indeed runs in Θ(log N) time (sketch below)
• Careful when estimating lower bounds!
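A compact functional sketch of this counting algorithm (my own condensation: the tree's bit-serial up-sweep and down-sweep are collapsed into a sum and a broadcast):

```python
def tree_sort_one_bit(bits):
    """Sort N one-bit numbers by counting: the root learns the number of
    1's (summed up the tree in O(log N) steps), broadcasts it back down,
    and each leaf sets its output to 0 or 1 accordingly."""
    ones = sum(bits)                    # what the root computes
    zeros = len(bits) - ones
    return [0] * zeros + [1] * ones     # leaves set their outputs

print(tree_sort_one_bit([1, 0, 1, 1, 0]))  # [0, 0, 1, 1, 1]
```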
Matrix Algorithms
• Consider matrix-vector multiplication: y_i = Σ_j a_ij x_j
• The sequential algorithm takes 2N² – N operations
• With an N-cell linear array, can we implement
matrix-vector multiplication in O(N) time?
Matrix-Vector Multiplication
Number of steps = 2N – 1
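A sketch of one standard linear-array dataflow that matches this count (the indexing is mine; the lecture's figure may skew the operands differently): processor j holds x[j], and the partial sum for y[i] marches rightward, meeting a[i][j] at processor j on clock i + j.

```python
def systolic_matvec(A, x):
    """y = A x on an N-cell linear array: the partial sum for row i
    visits processor j at clock t = i + j and accumulates A[i][j] * x[j];
    the last sum completes at clock 2N - 2, i.e. after 2N - 1 clocks."""
    N = len(x)
    y = [0] * N
    for t in range(2 * N - 1):           # clock ticks
        for j in range(N):               # processors act in parallel
            i = t - j                    # row whose sum visits P_j this clock
            if 0 <= i < N:
                y[i] += A[i][j] * x[j]   # multiply-accumulate, pass right
    return y

print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```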
Matrix-Matrix Multiplication
Number of time steps = 3N – 2
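A hedged sketch of the classic 2-D systolic dataflow behind this count (my rendering; the lecture's figure may differ in detail): A streams in from the left and B from the top, each skewed by one clock per row/column, and cell (i, j) accumulates c[i][j].

```python
def systolic_matmul(A, B):
    """C = A B on an N x N array: cell (i, j) sees a[i][k] and b[k][j]
    at clock t = i + j + k, so the final update lands at clock 3(N - 1)
    and the whole multiply takes 3N - 2 clocks."""
    N = len(A)
    C = [[0] * N for _ in range(N)]
    for t in range(3 * N - 2):                   # clock ticks
        for i in range(N):
            for j in range(N):
                k = t - i - j                    # operand pair arriving now
                if 0 <= k < N:
                    C[i][j] += A[i][k] * B[k][j]
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```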
Complexity
• The algorithm implementations on the linear arrays have
speedups that are linear in the number of processors – an
efficiency of O(1)
• It is possible to improve these algorithms by a constant factor, for example, by inputting values directly to each processor in the first step and providing wraparound edges (N time steps; a ring version is sketched below)
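A sketch of that ring version for matrix-vector multiplication (my own construction under the slide's assumptions: x preloaded in the first step, wraparound edges closing the array into a ring):

```python
def ring_matvec(A, x):
    """y = A x on a ring of N processors: x[j] is preloaded into
    processor j, and the partial sum for row i visits processor j at
    clock t = (i - j) mod N, so all N sums finish in N clocks."""
    N = len(x)
    y = [0] * N
    for t in range(N):                   # N clocks total
        for j in range(N):
            i = (j + t) % N              # row whose sum visits P_j this clock
            y[i] += A[i][j] * x[j]
    return y

print(ring_matvec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```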
Dataflow for Convolution
For a 3×3 kernel with a stride of 1, every input pixel is involved in 9 ops
(figure: pixel streams 7 4 1, 8 5 2, and 9 6 3 flow through the array in a staggered pattern; this will produce partial results that have to be consumed later)
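A plain reference implementation of the claim above (standard direct convolution as used in CNNs, not the systolic dataflow itself): with a 3×3 kernel and stride 1, each interior pixel feeds nine multiply-adds, each contributing a partial sum to a different output.

```python
def conv2d_3x3(image, kernel):
    """Direct 2-D convolution: 3x3 kernel, stride 1, no padding; every
    interior input pixel is read by 9 different output positions."""
    H, W = len(image), len(image[0])
    out = [[0] * (W - 2) for _ in range(H - 2)]
    for i in range(H - 2):
        for j in range(W - 2):
            for di in range(3):
                for dj in range(3):
                    out[i][j] += image[i + di][j + dj] * kernel[di][dj]
    return out

img = [[4 * r + c for c in range(4)] for r in range(4)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(conv2d_3x3(img, identity))  # [[5, 6], [9, 10]]
```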
Comparison with Eyeriss Convolution
References
• F. T. Leighton, “Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes,” Morgan Kaufmann, 1992
• Figure credits: Mitsu Ogihara