FOLDING
The objective of folding is to reduce hardware by sharing. The idea is to
fold the same operation in an algorithm to a functional unit. Of course,
the performance will suffer. Therefore, folding is a method to trade speed
for space.
e.g. y(n) = a(n) + b(n) + c(n)
Folded design by sharing the two addition
operations with one addition unit
Direct Implementation Folding factor, N=2
Note: Nl+n is an instance at which the switch is connected
A delay/storage element is automatically required l – lth iteration
N – folding order
1
Folding Transform Basic Idea
!
Given D(U→V) = w(e) and Hu is pipelined by Pu
After folding, at l-th iteration, node U and V are scheduled to
Nl+u and Nl+v respectively
Folding equation is:
!
DF(U→V) = [N(l+w(e))+v] – [Nl+Pu+u]
!
DF(U→V) = Nw(e) – Pu + v – u
Folding set is an ordered set of operations executed by the same functional unit.
Each folding set contains N entries, some of which may be null operations.
Each functional unit has its own folding set but the set may have more than one
possibility.
2
BIQUAD Filter as Example
Note: the folding sets
are typically obtained
from a scheduling and
allocation algorithm
Addition operations and multiplication operations are folded to
one adder unit and one multiplier unit, N=4. The folding order is
also shown given by folding sets:
adder multiplier
S1 = {4,2,3,1} S2 = {5, 8, 6, 7}
Addition takes 1u.t. Multiplication takes 2u.t.
i.e. PA = 1 PM = 2
3
Folding Equation for Each Edge
Folded Biquad Filter:
!
Note: DF(U→V) ≥ 0 must hold for all of the edges
This solution is in fact obtained after re-timing 4
Original Biquad Filter DFG
5
Negative Delay Is Not Valid
Nwr(e) – Pu + v – u ≥ 0 wr(e) = w(e) + r(V) – r(U)
Combined
N(w(e) + r(V) – r(U)) – Pu + v – u ≥ 0
Substituting DF(U→V) for Nw(e)-Pu+v-u and
solving for r(U)-r(V)
"
!!(#→%)
r(U) – r(V) ≤
'
Since retiming values of the nodes
are restricted to be integers,
"
!!(#→%)
r(U) – r(V) ≤
'
6
Constraint Graph for the Set of Inequalities
A solution is r(1)=1, r(2)=0, r(3)=-1, r(4)=0,
r(5)=-1, r(6)=-1, r(7)=-2 and r(8)=-1
Note: the same solution can be found by cutset retiming
7
Further Register Minimization – Lifetime Analysis
Lifetime analysis is a procedure used to compute the
minimum number of registers required to implement a
DSP algorithm in hardware. In lifetime analysis, the
number of ‘live’ variables at each time unit is computed,
and the maximum number of live variables at any time
unit is determined. This is the minimum number of
registers required to implement the DSP.
‘live’ variable – a variable is live from the time it is produced
through the time it is consumed.
8
Lifetime Chart
A DSP has 3 variables, a, b and c, and has a periodicity of 6 cycles
A variable is not live during the clock cycle in which it is produced,
and the variable is live during the clock cycle in which it is
consumed
9
Minimum Number of Registers
Lifetime chart with 3 consecutive iterations
The lifetime of a variable from one iteration may overlap
with the lifetime of variables from other iteration. When
the periodic nature is taken into account, the minimum
number of register is 3.
This can also be found by drawing the lifetime of variables
for the 0-th iteration and letting the number of live
variables at the time partitions n≥N be the sum of the
number of live variables due to the 0-th iteration at cycles
n-kN for all integers k≥0.
10
Data Allocation Using Forward-Backward
Register Minimization
1. Determine the minimum number of registers using lifetime analysis.
2. Input each variable at the time step corresponding to the beginning of
its lifetime. If multiple variables are input in a given cycle, these are
allocated to the initial register and the other variables are allocated to
consecutive registers in decreasing order of lifetime.
3. Each variable is allocated in a forward manner until it is dead or it
reaches the last register. In forward allocation, if the register i holds the
variable in the current cycle, then the register i+1 holds the same
variable in the next cycle. If the register i+1 is not available, then the
variable is allocated to the first available forward register.
4. Since the allocation is periodic, the allocation of the current iteration
also repeats itself in subsequent iterations. Thus, if Rj is occupied with
a variable in cycle l, then Rj would occupy the corresponding variable in
cycle l+N, where N denotes the periodicity of the allocation. Therefore,
we “hash” the position for Rj at the time unit l+N for each j and l.
11
continuing
5. For variables that reach the last register and are not yet dead,
the remaining life period is calculated, and these variables are
allocated to a register in a backward manner on a first-come
first served basis. If multiple registers are available for
backward allocation first try to choose a register such that
backward allocation has already been performed from the last
register to this register. In the case where more than one
register qualifies for backward allocation, choose the register
with the minimum number of forward registers among all
candidate registers that have a sufficient number of forward
registers to complete the allocation of the variable. After a
variable has been allocated backward, allocate it forward until it
is dead or it again reaches the last register.
6. Repeat steps 4 and 5 as required until the allocation is
complete.
12
Example
Transpose operation of a 3x3 matrix
𝑎 𝑏 𝑐
𝑑 𝑒 𝑓
𝑔 ℎ 𝑖
period is 9.
13
Allocation Table
14
Full Procedure to Minimize Register
in Folded Architecture
1. Perform retiming for folding
2. Write the folding equations
3. Use the folding equations to construct a lifetime table
4. Draw the lifetime chart and determine the required
number of registers
5. Perform forward-backward register allocation
6. Draw the folded architecture that uses the minimum
number of registers
15
Biquad Filter Example
Original DFG: Retimed DFG:
Folding equations: Folded architecture without register minimization:
16
Minimize Registers
Lifetime table: Allocation:
Lifetime chart indicting only 2 registers
Implementation:
are required:
17
18