Lecture: Systolic Arrays I: Topics: Sorting and Matrix Algorithms

The lecture covers systolic arrays and their application in sorting and matrix algorithms, emphasizing dense computation and data flow through compute units. It details sorting on a linear array, control mechanisms at each processor, and various comparison strategies, including lower bounds for performance. Additionally, it discusses matrix-vector and matrix-matrix multiplication, highlighting the efficiency and complexity of these algorithms in parallel processing environments.


Lecture: Systolic Arrays I

• Topics: sorting and matrix algorithms

Dense Computation

• Distribute memory across multiple chips; sufficient on-chip wiring to feed computational units

• How do we design the compute units?


• GPU (too general-purpose)
• DaDianNao’s NFU (custom SIMD)
• Eyeriss’ spatial architecture (basic tile, operand network)
• ISAAC (analog)

• Systolic arrays: dense compute units; data flows through these units with low rd/wr costs; loose connection with the brain; effective for image processing, pattern recognition, etc.
Sorting on a Linear Array

• Each processor has bidirectional links to its neighbors

• All processors share a single clock (asynchronous designs will require minor modifications)

• At each clock, processors receive inputs from neighbors, perform computations, generate output for neighbors, and update local storage

(Figure: linear array of processors with input and output links at the ends)
Control at Each Processor

• Each processor stores the minimum number it has seen

• Initial value in storage and on network is “*”, which is bigger than any input and also means “no signal”

• On receiving number Y from its left neighbor, the processor keeps the smaller of Y and current storage Z, and passes the larger to the right neighbor
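The control rule above can be simulated in a few lines of Python. This is only a sketch of the input phase, not the lecture's hardware description; the function name is made up, and `math.inf` stands in for the “*” value.

```python
import math

def systolic_sort(nums):
    """Simulate the input phase of sorting on an N-cell linear array.

    INF plays the role of "*": bigger than any input and also "no
    signal". On each clock, every processor compares the value
    arriving on its left link with its stored value, keeps the
    smaller, and puts the larger on its right link for the next clock.
    """
    n = len(nums)
    INF = math.inf
    store = [INF] * n              # local storage, initially "*"
    feed = list(nums) + [INF] * n  # input stream, followed by "*"
    links = [INF] * (n + 1)        # values in flight on the links
    for t in range(2 * n):         # 2N clocks suffice for N inputs
        links[0] = feed[t]
        nxt = [INF] * (n + 1)
        for i in range(n):
            y, z = links[i], store[i]
            store[i] = min(y, z)   # keep the smaller
            nxt[i + 1] = max(y, z) # pass the larger right
        links = nxt
    return store                   # processor i holds the i-th smallest
```

After 2N clocks, processor i (left to right) holds the i-th smallest value, which matches the claim that each processor stores the minimum it has seen.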

Sorting Example

Result Output

• The output process begins when a processor receives a non-*, followed by a “*”

• Each processor forwards its storage to its left neighbor, followed by subsequent data it receives from its right neighbor

• How many steps does it take to sort N numbers?

• What is the speedup and efficiency?

Output Example

Bit Model

• The bit model affords a more precise measure of complexity – we will now assume that each processor can only operate on a bit at a time

• To compare N k-bit words, you may now need an N × k 2-d array of bit processors

Pipelined Comparison

Input numbers: 3 4 2
(each column below is one number, fed in bit-serially with the most significant bit on top: 3 = 011, 4 = 100, 2 = 010)
0 1 0
1 0 1
1 0 0

Comparison Strategies

• Strategy 1: Bits travel horizontally, keep/swap signals travel vertically; if inputs arrive from the left, the array is sorted in 2N + k steps

• Strategy 2: Use a tree to communicate information on which number is greater – can set up a pipeline so the sorting happens in 2N + log k steps
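The keep/swap decision at the heart of both strategies can be sketched in software. The function name and the list-of-bits representation are illustrative assumptions; the point is that a bit processor latches the decision at the first clock where the incoming bits differ.

```python
def bit_serial_compare_swap(a_bits, b_bits):
    """Compare two k-bit numbers one bit per clock, MSB first.

    The keep/swap decision is latched at the first clock where the
    two incoming bits differ; later bits stream through without
    changing it. Returns the pair in (smaller, larger) order.
    """
    swap = False
    decided = False
    for a, b in zip(a_bits, b_bits):  # MSB first
        if not decided and a != b:
            swap = a > b              # the larger word moves right
            decided = True
    return (b_bits, a_bits) if swap else (a_bits, b_bits)
```

Because only one bit of state (plus a "decided" flag) is kept per comparator, the decision signal can be pipelined past k bits without buffering the whole words.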

Lower Bounds

• Input/Output bandwidth: Nk bits are being input/output with k pins – requires Ω(N) time

• Diameter: the comparison at processor (1,1) influences the value of the bit stored at processor (N,k) – for example, N-1 numbers are 011..1 and the last number is either 00…0 or 10…0 – it takes at least N+k-2 steps for information to travel across the diameter

• Bisection width: if processors in one half require the results computed by the other half, the bisection bandwidth imposes a minimum completion time
Counter Example

• N 1-bit numbers that need to be sorted with a binary tree

• Since bisection bandwidth is 2 and each number may be in the wrong half, will any algorithm take at least N/2 steps?

Counting Algorithm

• It takes O(log N) time for each intermediate node to add the contents in the subtree and forward the result to the parent, one bit at a time

• After the root has computed the number of 1’s, this number is communicated to the leaves – the leaves accordingly set their output to 0 or 1

• Each half only needs to know the number of 1’s in the other half (log N – 1 bits) – therefore, the algorithm takes Ω(log N) time

• Careful when estimating lower bounds!
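The counting trick can be sketched directly; here the tree reduction is modeled by a plain `sum()`, and the function name is made up for illustration.

```python
def sort_one_bit(bits):
    """Sort N 1-bit numbers via counting on a binary tree.

    The root only ever needs the total number of 1's, which is about
    log2(N) bits, so the whole exchange takes Omega(log N) steps
    rather than the N/2 suggested by the bisection argument. The
    tree reduction is modeled here by sum().
    """
    ones = sum(bits)  # what the root computes, one bit at a time
    return [0] * (len(bits) - ones) + [1] * ones
```

The bisection argument fails because the halves do not need to exchange the numbers themselves, only a count, which is exponentially smaller.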


Matrix Algorithms

• Consider matrix-vector multiplication:

y_i = Σ_j a_ij x_j

• The sequential algorithm takes 2N² – N operations

• With an N-cell linear array, can we implement matrix-vector multiplication in O(N) time?

Matrix Vector Multiplication

Number of steps = 2N – 1
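The skewed dataflow can be simulated in software to check the step count. The function name and the timing convention (x_j reaches cell i at clock i + j) are assumptions chosen to match the 2N – 1 figure, not a transcription of the lecture's diagram.

```python
def systolic_matvec(A, x):
    """Simulate y = A x on an N-cell linear array.

    x_j enters cell 0 at clock j and marches one cell to the right
    per clock, so cell i sees x_j at clock i + j; the matching a_ij
    is fed to cell i at the same clock. The last useful clock is
    (N-1) + (N-1), so the whole product takes 2N - 1 steps.
    """
    n = len(A)
    y = [0] * n                  # per-cell accumulators for y_i
    for t in range(2 * n - 1):   # 2N - 1 clocks in total
        for i in range(n):
            j = t - i            # index of the x value at cell i now
            if 0 <= j < n:
                y[i] += A[i][j] * x[j]
    return y
```

Each cell does one multiply-accumulate per clock, so N cells finish the 2N² – N sequential operations in O(N) time, as the slide asks.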
Matrix-Matrix Multiplication

Number of time steps = 3N – 2
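The same skewing idea extends to a 2-d grid. This sketch checks the 3N – 2 step count; the function name and the indexing convention (the k-th operands meet at cell (i, j) at clock i + j + k) are illustrative assumptions.

```python
def systolic_matmul(A, B):
    """Simulate C = A B on an N x N systolic grid.

    Row i of A enters the grid i clocks late and flows rightward;
    column j of B enters j clocks late and flows downward. Cell
    (i, j) multiplies whichever pair meets there and accumulates
    c_ij. The last pair meets at clock (N-1)+(N-1)+(N-1), so the
    product finishes after 3N - 2 steps.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):       # 3N - 2 clocks in total
        for i in range(n):
            for j in range(n):
                k = t - i - j        # operand index meeting at (i, j)
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```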


Complexity

• The algorithm implementations on the linear arrays have speedups that are linear in the number of processors – an efficiency of O(1)

• It is possible to improve these algorithms by a constant factor, for example, by inputting values directly to each processor in the first step and providing wraparound edges (N time steps)

Dataflow for Convolution

For a 3x3 kernel with a stride of 1, every input pixel is involved in 9 ops

(Figure: rows of pixels (7 4 1 / 8 5 2 / 9 6 3) stream through the array; this will produce partial results that have to be consumed later.)
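The "9 ops per pixel" claim can be checked by counting how many 3x3 stride-1 windows each input pixel falls into. The function below is a hypothetical helper (valid convolution, no padding), not part of the lecture.

```python
def pixel_use_counts(h, w, k=3):
    """Count how many k x k stride-1 windows (valid convolution,
    no padding) each pixel of an h x w input falls into. Interior
    pixels are used k*k times, i.e. 9 times for a 3x3 kernel."""
    counts = [[0] * w for _ in range(h)]
    for r in range(h - k + 1):          # window top-left corners
        for c in range(w - k + 1):
            for dr in range(k):         # pixels covered by window
                for dc in range(k):
                    counts[r + dr][c + dc] += 1
    return counts
```

Edge and corner pixels participate in fewer windows (a corner pixel in only one), which is part of why the systolic dataflow produces partial results that must be held and consumed later.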
Comparison with Eyeriss Convolution

References

• “Introduction to Parallel Algorithms and Architectures,” Leighton
• Figure credits: Mitsu Ogihara

