RCC 15
RCC 15
• Scheduling
• ASAP
• ALAP
• List scheduling
• Force Directed Scheduling
2
BITS Pilani, Pilani Campus
Today’s Lecture
• Online Placement
• Routing
3
BITS Pilani, Pilani BITS
Campus
Online Placement
4
BITS Pilani, Pilani Campus
Partial Reconfigurable System
System Model: Operating system incl.
• Scheduler: Assigns resources to incoming tasks.
• Module Database: Contains the physical implementation
of the tasks.
• Placer: Assigns space on the reconfigurable device to run
tasks sent by the scheduler.
7
BITS Pilani, Pilani Campus
Kamer Algorithm
In order to avoid the quadratic space requirement and increased running time, we can keep only linear number of
empty rectangles in terms of the number of modules. The strategy consist of keeping only the non-overlapping
empty rectangles, thus reducing the number of empty rectangles to be managed. This allows a linear run-time at
the cost of the quality. The problem with this approach is that several non-overlapping rectangles representation
of a configuration may exists leading to multiple possible choices in the representation .
• whenever a new component v1 is placed in a nonoverlapping rectangle, two possibilities exist to split
that rectangle. A horizontal split that is done using segment Sa and a vertical split using the segment
Sb. Choosing to select one of the two split directions may have a negative impact on the placement of
the next components. Assuming for example that the algorithm keeps selecting the horizontal split,
after the placement of module v1, the free rectangles left are (A,D) and (C,E). Assume now that a new
module, whose width is the same as (C,E) but with a height slightly bigger than that of (C,E) is chosen.
The algorithm will not be able to place the component, although enough free space is available on the
chip to accommodate the new module.
15
BITS Pilani, Pilani Campus
Keeping Non-Overlapping ERs
o KAMER always finds a rectangle to place a new component, if one exists.
o But position to place the component within the rectangle must be selected from a set of points
• whose number is the area of the rectangle in worst case
In [Bazargan]:
o Bottom-left position
• Note: No communication is assumed between the rectangle to be placed and those
already placed.
If communication exists:
o If a position in the middle of the selected rectangle is better adapted for optimizing the goal
seeked, then the component should be placed there, no mater if the number of empty rectangle
increases.
o An optimal algorithm should consider any single point where the placement is possible and
develop a strategy to choose the best position.
o Instead of managing the free space, the strategy consists of managing the occupied space on the
device and use the set of running components for computing the best position to place the
incoming component.
17
BITS Pilani, Pilani Campus
Routing Challenges
How to handle
communication between
increasing number of
components on chip?
Switches: switches or routers are used for routing data coming from one PE to another PE based
on the destination information in the data
Links: Used to interconnect switches and PEs. Follow some particular protocol when they
transmit data between these switches.
flit packet
35
BITS Pilani, Pilani Campus
Campus
• Packets delivered by routers are partitioned in a flit-by-flit basis.
• Each flit of a packet arrives at a router and stored in a memory buffer
until it can traverse to the next hop of the route.
• The first flit in the buffer memory will be processed by the control
logic to determine whether it is allowed to be forwarded and which
output direction it should proceed to.
• The decision made by the control unit is based on the computation
result of routing, arbitration, and the downstream buffer space.
• After the control setup is done, the flit passes through the crossbar
switch to its desired output direction.
• Since all the decisions are settled by the control logic at the input side,
flits will never be stalled at the output ports.
When 2 packets trying to use the same link at the same time is called as
contention
33
BITS Pilani, Pilani Campus
Routing Algorithms
34
BITS Pilani, Pilani Campus
Dimension Ordered Routing
• Always generate the same single routing path for given pair of source
and destination addresses, typically a shortest one.
• When source routing is used, the source node implements pure
routing function returning a unique path without considering any
information about the traffic.
• For distributed routing, routers make unique decisions in every
intermediate node.
• Deterministic distributed routing is used in most commercially
available parallel machines.
• It is simple, fast, and performs well under uniform traffic assumption.
• Used in mesh-based machines,
• It is known to be deadlock-free in meshes and is easy to implement in
HW.
• Nonuniform traffic requires some degree of adaptivity.
35
BITS Pilani, Pilani Campus
Oblivious Routing
37
BITS Pilani, Pilani Campus
38
BITS Pilani, Pilani Campus
Minimal and Non-minimal Routing
39
BITS Pilani, Pilani Campus
Deadlock Avoidance
40
BITS Pilani, Pilani Campus
Flow Control
– Circuit Switching
– Store-and-Forward
– Virtual Cut-Through
– Wormhole Routing
1. The packet is copied entirely into the network router before moving to the next node.
2. the routing information is examined to determine which output channel to direct the packet.
3. the packet is sent to the neighbor.
4. Flow control unit is the entire packet.
Latency:
Nr * tr
– Nr: number of routers through which the packet must travel
– tr: time to transfer the packet between the routers
A B
– Leads to high per-packet latency
– Requires buffering for entire packet in each node.
• The flit can depart the current node as soon as there is sufficient buffering for this flit.
• While wormhole flowcontrol uses buffers effectively, it makes inefficient use of link bandwidth.
• Throughput suffers because other packets queued behind the blocked packet are unable to use the idle physical
links.
• Wormhole flow control reduces packet latency by allowing a flit to leave the router as soon as a downstream
buffer is available (in the absence of contention, the latency is the same as virtual cut through).
• Additionally, wormhole flow control requires much smaller buffers than packet-based techniques.
• Due to the tight area and power constraints of on-chip networks, wormhole flow control is the predominant
technique adopted thus far.
51
BITS Pilani, Pilani Campus
Q1—scheduling algorithm
• For the edge weighted sequence
graph shown in fig, weights on
edges indicate execution delay of
nodes in terms of clock cycles.
Schedule operations as per
– ASAP
– ALAP (latency constraint of 6 cycles)
– Find mobility
– List schedule under resource
constraint of 2=a*, 2=a+/-, 1=a< with
ALAP schedule times as priority for
nodes.
– List schedule under latency
constraint of 6
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
ASAP AND ALAP
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
mobili
ty
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
List schedule-resource constraint
t = 0,
– U0,* = {a,b,c,d}, U0,+/- = {e}, U0,< = ∅
– T0,* = ∅, T0,+/- = ∅, T0,< = ∅
– For +/-, easy to select Y = {e}
– For *, we have a choice. ALAP times for a,b,c,d
t=1
are 0,0,1,3, respectively So most urgent are Y = {a,b}
– U1,* = {c,d}, U1,+/- = ∅, U1,< = {i}
– For <, there is nothing to schedule Y = ∅
– T1,* = {a,b}, T1,+/- = ∅, T1,< = ∅ – S(a) = 0, S(b) = 0, S(e) = 0
– For +/-, Y = ∅
– For *, Y = ∅ (all resources busy) t=3 t=4
– For <, Y = {i} -- U3, * = {d}, U3,+/- = ∅, U3,<= ∅ -- U4, * = {d,g}, U4,+/- = (j}, U4,<= ∅
– S(i) = 1 -- T3, * ={c,f}, T3,+/-= ∅, T3,<= ∅ -- T4, * ={∅}, T4,+/-= ∅, T3,<= ∅
t=2 For +/-, Y = ∅ For +/-, Y = {j}
– U2,* = {c,d,f}, U2,+/- = ∅, U2,< = ∅ For *, Y= ∅ (all resources busy) For *, Y= {d,g}
– T2,* = ∅, T2,+/- = ∅, T2,< = ∅ For <. Y= ∅ For <. Y= ∅
– For +/-, Y = ∅ S(0) = 3 S(d) = 4. S(g) =4, S(j)=4
– For *, ALAP times for c,d,f are 1,3,2
Similarly,
respectively.Y = {c,f}
T=5, S(0)
– For <, Y = ∅
T=6, S(k)
– S(c) = 2, S(f) = 2
T=7, sink node
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
Latency = 7
we take once cycle longer than ASAP (but can
use half the number of multipliers)
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
List schedule-latency constraint of 6 cycles
B
I
T
S
P
i
l
a Slack values from ALAP schedule
n
i
Assume one resource of each type to begin
, with.
P
i t=0
t=1
l – U0,* = {a,b,c,d}, U0,+/- = {e}, U0,< = ∅
a – U1,* = {c,d}, U1,+/- = ∅, U1,< = {i}
n – T0,* = ∅, T0,+/- = ∅, T0,< = ∅
– T1,* = {a,b}, T1,+/- = ∅, T1,< = ∅
i – sa = 0, sb = 0, sc = 1, sd = 3, se = 4
C – sc = 0, sd = 2, si = 3
a – For *, k1 = {a,b};
m – For *, k1 = {c};
– a* = 2;
p – a* = 3;
u – For +/-, k2 = {e};
s – For +/-, k2 = ∅; for <, k3 = {i}
For <, k3= ∅
– S(c) = 1, S(i) = 1
– S(a) = 0, S(b) = 0, S(e) = 0
t=2
Contd…………
– U2,* = {f,d}, U2,+/- = ∅, U2,< = ∅
..
– T2,* = {c}, T2,+/- = ∅, T2,< = ∅
– sf = 0, sd = 1
– For *, k1 = {f} schedule has the same
– all resource constraints unchanged latency as ASAP, but
requires 3 rather than 4
– For +/-, k2 = ∅; for <, k3 = ∅ multipliers and 2 ALUs and
– S(f) = 2, one comparator
Q2
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
sol
Compute cofactors of f with respect to a
(first variable in ordering)
f = abc + ab’c + a’bc’ + a’b’c’
fa’ = bc’ + b’c’
fa = bc + b’c
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
f(a,b,c,d) = (a+c+d).(ab+bc) .(bc+ac)
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
Variable ordering matters
f(x1,x2,x3,x4,x5,x6) = (x1+x2).(x3+x4).(x5+x6)
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
Q3
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
Sol 3
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
• Using the 1Kx32 memories, each
data word is stored in one of the 32
blocks. 32:1 decoder 32x1 MUX
• For reads, part of the read address
lines (0 to 9) chooses the one word
out of 1k words inside the memory
and the rest of the read address
lines (10 to 14) chooses between
the data from the blocks.
• For writes the data-In goes to a
block whose write-enable is
asserted, determined by a decoder.
• So, the 1Kx32 implementation
requires an extra 32:1 decoder and
a 32-bit wide 32:1 mux; this is a lot
of extra hardware and performance
is also worse because of the extra
delay through the decoder on
writes and mux on reads.
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
Q5
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
Sol
5
BITS Pilani,
Pilani, Pilani
Pilani Campus
Campus
Chortle -crf
Apply Chortle-crf LUT-mapping algorithm for the expression below with K=4. Use two-level
decomposition followed by multi-level decomposition using bin packing algorithm to map the
design on the available LUTs.