
Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of RMK Group of Educational Institutions. If you have received this document through email in error, please notify the system manager. This document contains proprietary information and is intended only for the respective group/learning community. If you are not the addressee, you should not disseminate, distribute, or copy it through e-mail. Please notify the sender immediately by e-mail if you have received this document by mistake and delete it from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing, or taking any action in reliance on the contents of this information is strictly prohibited.

Department : ECE

Batch/Year : 2022-2026 / III

Subject Code/Name : 22EC914 LOW POWER VLSI DESIGN

UNIT 5

Created by,

Dr.D.Rukmani Devi, Professor/ECE, RMDEC

Ms.K.Jeevitha, Associate Professor/ECE, RMKEC

Dr.V.Kannagi, Professor/ECE, RMKCET


CONTENTS

1. Course Objectives
2. Pre-Requisites
3. Syllabus
4. Course Outcomes
5. CO-PO/PSO Mapping
6. Lecture Plan
7. Activity Based Learning
8. Lecture Notes
9. Assignments
10. Part A Q & A
11. Part B Qs
12. Supportive Online Certification Courses
13. Real Time Applications in Day-to-Day Life and Industry
14. Assessment Schedule
15. Prescribed Text Books & Reference Books
16. Mini Project Suggestions
1. Course Objectives
❖ To identify sources of power in an IC.
❖ To identify power reduction techniques based on technology-independent and technology-dependent methods.
❖ To identify suitable techniques to reduce power dissipation.
❖ To estimate the power dissipation of various MOS logic circuits.
❖ To develop algorithms for low power dissipation.
2. Pre-Requisites

❖ 21EC503 VLSI Design (Semester V)
❖ 21EC202 Electronic Devices (Semester II)
❖ 21PH102 Physics for Electronics Engineering (Semester I)
3. SYLLABUS

UNIT I POWER DISSIPATION IN CMOS 9
Hierarchy of Limits of Power – Sources of Power Consumption – Physics of Power Dissipation in CMOS FET Devices – Basic Principle of Low Power Design.

UNIT II POWER OPTIMIZATION 9
Logic Level Power Optimization – Circuit Level Low Power Design – Gate Level Low Power Design – Architecture Level Low Power Design – VLSI Subsystem Design of Adders, Multipliers, PLL, Low Power Design.

UNIT III DESIGN OF LOW POWER CMOS CIRCUITS 9
Computer Arithmetic Techniques for Low Power System – Reducing Power Consumption in Combinational Logic, Sequential Logic, Memories – Low Power Clock – Advanced Techniques – Special Techniques, Adiabatic Techniques – Physical Design, Floor Planning, Placement and Routing.

UNIT IV POWER ESTIMATION 9
Power Estimation Techniques – Circuit Level, Gate Level, Architecture Level, Behavioral Level – Logic Power Estimation – Simulation Power Analysis – Probabilistic Power Analysis.

UNIT V SYNTHESIS AND SOFTWARE DESIGN FOR LOW POWER CMOS CIRCUITS 9
Synthesis for Low Power – Behavioral Level Transform – Algorithms for Low Power – Software Design for Low Power.

TOTAL: 45 PERIODS
4. Course Outcomes

After successful completion of the course, the students will be able to (highest cognitive level in parentheses):

CO1 (K3): To know the sources of power consumption in CMOS circuits
CO2 (K3): To design and analyze various MOS logic circuits
CO3 (K3): To apply low power techniques for low power dissipation
CO4 (K3): To estimate the power dissipation of ICs
CO5 (K3): To develop algorithms to reduce power dissipation by software
CO6 (K3): To learn the design concepts of low power circuits
5. CO-PO/PSO Mapping

CO (K Level) | PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 | PSO1 PSO2 PSO3
CO1 (K3)     |  3   2   2   2   2   1   1   1   1   2    1    1   |  1    1    1
CO2 (K3)     |  3   2   2   1   2   1   1   1   1   1    1    1   |  1    2    1
CO3 (K3)     |  3   2   2   2   2   2   1   1   2   1    1    1   |  1    1    1
CO4 (K3)     |  3   2   2   1   2   1   1   1   2   1    1    1   |  1    1    1
CO5 (K3)     |  3   2   2   1   2   1   2   1   1   1    1    1   |  1    1    1
CO6 (K3)     |  3   2   2   2   3   1   1   1   1   1    1    1   |  1    2    1
6. Lecture Plan

UNIT V SYNTHESIS AND SOFTWARE DESIGN FOR LOW POWER CMOS CIRCUITS 9
Synthesis for Low Power – Behavioral Level Transform – Algorithms for Low Power – Software Design for Low Power.

No. | Topics to be covered | CO | Taxonomy Level | Mode of Delivery
1. Synthesis for Low Power – Behavioral Level Transforms | CO5 | K3 | PPT
2. Algorithm Level Transforms for Low Power Circuits | CO5 | K3 | PPT
3. Power-Constrained Least-Squares Optimization for Adaptive and Nonadaptive Filters | CO5 | K3 | PPT
4. Circuit Activity Driven Architectural Transformations | CO5 | K3 | PPT
5. Architecture-Driven Voltage Scaling | CO5 | K3 | PPT
6. Power Optimization Using Operation Reduction | CO5 | K3 | PPT
7. Power Optimization Using Operation Substitution; Precomputation-Based Optimization for Low Power | CO5 | K3 | PPT
8. Software Design for Low Power – Software Power Estimation | CO5 & CO6 | K3 | PPT
9. Software Power Optimizations | CO5 & CO6 | K3 | PPT
7. Activity Based Learning

CASE STUDY

Watch these videos carefully and write 500 words about what you heard in them:
https://www.xilinx.com/video/application/vivado-design-suite-ultrascale-architecture.html
https://www.xilinx.com/video/soc/embedded-vision-and-control-solutions-with-the-zynq-mpsoc.html
8. Lecture Notes
UNIT V SYNTHESIS AND SOFTWARE DESIGN FOR LOW POWER CMOS CIRCUITS

SYNTHESIS FOR LOW POWER

In order to meet functionality, performance, cost-efficiency, and other requirements, automatic synthesis tools for IC design have become indispensable. Designs are described at the register transfer or a higher level. A technology-independent logic/architectural level realization is generated next. The logic level realization is then mapped to a specific technology library. Finally, the technology-mapped circuit is optimized to ensure that all requirements have been met. The process has multiple steps because the problem of realizing the specified functionality with gates in a given library, while meeting all the requirements, is too complex to be solved in one step.

Tools have been developed to carry out each of these steps automatically. In the beginning, synthesis tools attempted to reduce area alone and did not consider delay. Next came improved algorithms that, in addition to reducing area, ensured that the maximum delay through the circuit did not increase, or at least remained within a specified maximum bound. With the exponential growth in the number of gates that can be accommodated on a chip continuing unabated, area has become less of a concern, with performance and power dissipation increasingly taking its place. As we will see later, a large fraction of the total research effort to reduce power dissipation has been justifiably devoted to lowering the supply voltage, in conjunction with techniques to compensate for the accompanying loss of performance. However, independent of and in addition to supply voltage scaling, it is possible to make logic and circuit level transformations to achieve considerable improvement in power dissipation. These transformations try to minimize dynamic power dissipation based on the switching activity of the signals at circuit nodes.

Switching activity at circuit nodes is related to input signal probabilities and activities. The input signal probabilities and activities can be obtained by system level simulation of the function with real-life inputs. Hence, circuits having the same functionality but operating in different environments, with different input signal probabilities and activities, are synthesized differently to improve power dissipation. In this unit we focus on the reduction of dynamic power dissipation.

The recent trend has been to consider power dissipation at all phases of the
design cycle. The design space can be very efficiently explored starting at any
behavioral and/or algorithmic level. Architectural transforms and tradeoffs can be
conveniently applied at that level to optimize power dissipation. However, efficient
means of accurately estimating power dissipation at the behavioral level are
required so that meaningful transforms can be applied.

5.1 BEHAVIORAL LEVEL TRANSFORMS

Because of the many degrees of freedom available, the behavioral level has the potential of producing a large improvement in power dissipation. At this level, since operations have not yet been assigned execution times and hardware allocation has not been performed, a design point with a given power dissipation, delay, and area can be found if one exists anywhere in the design space. Hence, there is a great need to systematically explore the design space. Traditionally, behavioral synthesis has targeted optimization of the amount of hardware resources required and of the average number of clock cycles per task needed to perform a given set of tasks.

5.1.1 Algorithm Level Transforms for Low Power

It has been observed that large improvements in power dissipation are possible at higher levels of design abstraction, while pure combinational logic synthesis can produce only a moderate level of improvement. This is mainly because the flexibility of making changes to random logic is rather limited. However, technology, combined with innovative circuit design techniques, can also produce a large improvement in power dissipation.

Let us consider two algorithm level techniques for improving power dissipation, targeted at digital filters; both techniques try to reduce computation to achieve low power. The first technique uses a differential coefficient representation to reduce the dynamic range of the computation, while the other optimizes the number of 1's in the coefficient representation to reduce the number of additions (or the switching activity). Other techniques try to use multiplierless implementations for low power and high performance. Since digital signal processing techniques are very well represented mathematically, algorithm level techniques can be easily applied.

5.1.1.1 Differential Coefficients for Finite Impulse Response Filters

One of the most basic operations performed in DSP applications is the finite impulse response (FIR) computation. As is well known, the output of a linear time-invariant (LTI) system can be obtained by convolving in the time domain the input to the system and the transfer function of the system. For discrete-time LTI-FIR systems, this can be expressed mathematically as

$Y_j = \sum_{k=0}^{N-1} C_k X_{j-k}$ ...... (5.1)

The parameters X and Y are the discrete-time input and output signals, respectively, represented as sequences. The sequence C represents the transfer function of the system. For FIR filters, C also corresponds to the set of filter coefficients, and the length N of the sequence is called the number of taps or the length of the filter.
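As a point of reference for the techniques that follow, here is a minimal Python sketch of the direct-form computation of Eq. (5.1); function and variable names are illustrative, not from the text.

```python
# Direct-form FIR computation of Eq. (5.1): N multiplies and N-1 adds
# per output sample. X[n] is treated as zero for n < 0.

def fir_direct(C, X, j):
    """Y_j = sum_k C[k] * X[j-k] over the valid range of k."""
    return sum(C[k] * X[j - k] for k in range(len(C)) if j - k >= 0)

C = [2, 5, 6, 3]                 # a 4-tap filter
X = [1, 2, 3, 4, 5]
Y = [fir_direct(C, X, j) for j in range(len(X))]
print(Y)
```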

The differential coefficients method (DCM) is an algorithm level technique for realizing low-power FIR filters with a large number of taps (N of the order of hundreds). The DCM relies on reducing computation to reduce power. Computing the convolution in the canonical form for the FIR filter output as given by Eq. (5.1), using the multiply-and-accumulate sequence defined earlier (computing the product of each coefficient with the appropriate input data and accumulating the products), will be termed direct-form computation. The algorithms of the DCM use various orders of differences between the coefficients (the various orders of differences will be precisely defined later on) in conjunction with stored precomputed results, rather than the coefficients themselves, to compute the canonical form convolution. These algorithms result in fewer computations per convolution as compared to direct-form computation. However, they require more storage and storage accesses and hence more energy for storage operations. The net energy savings using the DCM depends on various parameters, such as the order of differences used, the energy dissipated in a storage access, and the word widths used for the digitized input data and coefficients.

The DCM can also lead to a reduction in the time needed for computing each convolution, and thus one may obtain the added advantage of a higher speed of computation. Analogous to the savings in energy, the speed enhancement obtained depends on the order of differences used and various other parameters.

5.1.1.2 Algorithm Using First-Order Differences

The first-order differences algorithm is based on the following observation. The notation $P_k\big|_{t=j}$ will be used to denote the kth product term of the FIR output at the jth discrete-time instant. For convenience we will also use the shorter notation "time t = j" to refer to the jth discrete-time instant. Expanding Eq. (5.1) for any three consecutive samples of the output Y at times t = j, j+1, and j+2, we get

$Y_j = C_0 X_j + C_1 X_{j-1} + C_2 X_{j-2} + \cdots + C_{N-1} X_{j-N+1}$ ...... (5.2)
$Y_{j+1} = C_0 X_{j+1} + C_1 X_j + C_2 X_{j-1} + \cdots + C_{N-1} X_{j-N+2}$ ...... (5.3)
$Y_{j+2} = C_0 X_{j+2} + C_1 X_{j+1} + C_2 X_j + \cdots + C_{N-1} X_{j-N+3}$ ...... (5.4)

Notice that each input datum, multiplied by every coefficient exactly once, appears in turn as a product term in the sums for successive outputs. Therefore, excepting the first, each product term in the sum for $Y_{j+1}$ can be written as the following identity:

$C_k X_{j-k+1} = C_{k-1} X_{j-k+1} + (C_k - C_{k-1}) X_{j-k+1}$ ...... (5.5)

Since the $C_{k-1} X_{j-k+1}$ terms in identity (5.5) have already been computed for the previous output $Y_j$, one needs only to compute the $(C_k - C_{k-1}) X_{j-k+1}$ terms and add them to the already computed $C_{k-1} X_{j-k+1}$ terms. The first product term in the sum for $Y_{j+1}$, which is $C_0 X_{j+1}$, has to be computed without recourse to this scheme. Hence, all the N product terms are now available for generating $Y_{j+1}$.

Summarizing the above, one can say that, excepting $C_0$, each and every coefficient can be expressed as the sum of the preceding coefficient and the difference between it and the preceding coefficient. Therefore, each coefficient can now be expressed through the recurrence relation

$C_k = C_{k-1} + \delta^1_{k-1/k}$ for k = 1, 2, ..., N-1 ...... (5.6)

where $\delta^1_{k-1/k} = C_k - C_{k-1}$ is termed the first-order difference between coefficients $C_k$ and $C_{k-1}$, as shown in Figure 5.1.

Fig 5.1 Different Orders of Differences


Equation (5.6) is used to define the first-order difference $\delta^1_{k-1/k}$ between two consecutive coefficients for k in the specified range. Substituting in Eq. (5.1) the expression for $C_k$ from Eq. (5.6), we get all the product terms $P_k\big|_{t=j+1}$, excepting the first, for computing the FIR output $Y_{j+1}$. Therefore, $P_k\big|_{t=j+1}$ can be written as

$P_k\big|_{t=j+1} = C_k X_{j-k+1} = C_{k-1} X_{j-k+1} + \delta^1_{k-1/k} X_{j-k+1}$ for k = 1, 2, ..., N-1 ...... (5.7)

The left-hand side of the above equation is called a product term in the convolution given by Eq. (5.1), whereas the terminology partial product will be used for the term $\delta^1_{k-1/k} X_{j-k+1}$ in the above equation. Both the partial product and the term $C_{k-1} X_{j-k+1}$ in the above equation will be called intermediate results used in the computation of the product term $C_k X_{j-k+1}$.

The range of the subscripts of the first intermediate result in Eq. (5.7) can be changed to give an equivalent set as follows:

$\{C_{k-1} X_{j-k+1},\ k = 1, \ldots, N-1\} = \{C_k X_{j-k},\ k = 0, \ldots, N-2\}$ ...... (5.8)

These are therefore identical to the first N-1 product terms of the FIR output at the immediately preceding instant of time, that is, at time t = j, which are

$P_k\big|_{t=j} = C_k X_{j-k}$ for k = 0, ..., N-2 ...... (5.9)

Therefore, if they are stored for reuse, one can compute all the product terms $P_k\big|_{t=j+1}$, excepting the first, for the FIR output at time t = j+1 by only computing the partial products $\delta^1_{k-1/k} X_{j-k+1}$ in Eq. (5.7) and adding them to the appropriate stored product terms $P_k\big|_{t=j}$ computed for the FIR output at time t = j.

Thus we see that only one intermediate result storage variable and one extra addition per product term are needed. Furthermore, since the first coefficient $C_0$ along with the N-1 first-order differences $\delta^1_{k-1/k}$ are used for computing the FIR output, instead of storing all the coefficients one needs to store only $C_0$ and the $\delta^1_{k-1/k}$'s. The above algorithm is called the first-order differences algorithm for generating the FIR filter output. The additional storage accesses and additions incurred using this algorithm will be termed overheads.

If the differences $\delta^1_{k-1/k}$ are small compared to the coefficients $C_k$, then in the multiplication for computing a product term $P_k\big|_{t=j+1}$ using this algorithm, one is trading a long multiplier (with $X_{j-k}$ as the multiplicand) for a short one plus overheads. The advantage of using this algorithm lies in the fact that if the computational cost (in terms of energy or delay) of the former is greater than the net computational cost of the latter, then we make a net saving in computation as compared to using the direct form.
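The bookkeeping above is compact enough to capture in a few lines. Below is a sketch of the first-order differences algorithm in Python, assuming the product terms of the previous output are buffered; names are illustrative, and the final assertion checks the result against direct-form computation.

```python
# First-order DCM sketch: store C[0] and the N-1 first-order
# differences, reuse the product terms of the previous output, and do
# one short multiply plus one extra add per product term.

def fir_first_order(C, X):
    N = len(C)
    d1 = [C[k] - C[k - 1] for k in range(1, N)]   # delta^1_{k-1/k}
    P = [0] * N                   # product terms of the previous output
    Y = []
    for n in range(len(X)):
        newP = [0] * N
        newP[0] = C[0] * X[n]     # the only full multiplication per output
        for k in range(1, N):
            x = X[n - k] if n - k >= 0 else 0
            # C[k]*x = (stored C[k-1]*x from time n-1) + d1[k-1]*x
            newP[k] = P[k - 1] + d1[k - 1] * x
        P = newP
        Y.append(sum(P))
    return Y

C, X = [2, 5, 6, 3], [1, 2, 3, 4]
direct = [sum(C[k] * X[n - k] for k in range(len(C)) if n - k >= 0)
          for n in range(len(X))]
assert fir_first_order(C, X) == direct
```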

5.1.1.3 Algorithm Using Second-Order Differences

Higher orders of differences can also be used for expressing the coefficients as a function of differences. Let us define the second-order difference $\delta^2_{k-2/k}$ (Figure 5.1) between two consecutive first-order differences using the relation

$\delta^2_{k-2/k} = \delta^1_{k-1/k} - \delta^1_{k-2/k-1}$ for k = 2, ..., N-1 ...... (5.10)

Using first- and second-order differences only, we can express the coefficients as follows. Hence, except $C_0$ and $C_1$, all other coefficients can be expressed compactly as

$C_k = C_{k-1} + \delta^1_{k-2/k-1} + \delta^2_{k-2/k}$ for k = 2, ..., N-1 ...... (5.11)

which follows from substituting in Eq. (5.6) the expression for $\delta^1_{k-1/k}$ from Eq. (5.10).

Multiplying both sides of Eq. (5.11) above by $X_{j-k+1}$, we obtain the product terms $P_k\big|_{t=j+1}$ for computing $Y_{j+1}$ as

$P_k\big|_{t=j+1} = C_k X_{j-k+1} = C_{k-1} X_{j-k+1} + \delta^1_{k-2/k-1} X_{j-k+1} + \delta^2_{k-2/k} X_{j-k+1}$ ...... (5.12)

Using terminology analogous to that used for the first-order differences algorithm, let us call the last two terms on the right-hand side of the above equation partial products, and all three terms on the right-hand side intermediate results in the computation of the product term $P_k\big|_{t=j+1}$. The computation of the FIR output $Y_{j+1}$ by using the relationship in Eq. (5.12) will be termed the second-order differences algorithm.

One can take advantage of the above recurrence relation to compute any product term in the convolution incrementally, using the minimally required storage for intermediate results, as follows. We also show below that one needs just two extra storage variables and two extra additions per product term for computing the FIR output using the second-order differences algorithm.

Let D[k] and P[k] be the two storage variables used for the intermediate results for computing the kth product term of the FIR output at time t = j. Since there are N product terms in the output, we will use two array variables for storage, both of array size N (D[k] and P[k], with k = 0, ..., N-1). Here, D[k] and P[k] will be used for storing the partial product $\delta^1_{k-2/k-1} X_{j-k+1}$ and the intermediate result $C_{k-1} X_{j-k+1}$, respectively, as computed using Eq. (5.12), both of which will be intermediate results for the FIR output at the next time step, that is, at time t = j+2.

Of course, in addition to D[k] and P[k], one has to store $C_0$, $\delta^1_{0/1}$, and the N-2 second-order differences $\delta^2_{k-2/k}$, k = 2, ..., N-1, instead of the N coefficients which would have had to be stored for direct-form computation.

Let us begin at time t = j, with the variables D[0] and P[0] both initialized to zero. The first product term $P_1\big|_{t=j} = C_0 X_j$ of the FIR output at time t = j has to be computed directly and is stored in P[0], with the contents of D[0] remaining unchanged at zero:

$P[0] \leftarrow C_0 X_j, \qquad D[0] \leftarrow 0$

At the next time step, at time t = j+1, D[0] and P[0] would be used for computing the second product term $P_2\big|_{t=j+1}$ of the FIR output at time t = j+1, which is $C_1 X_j$, as follows:

$D[0] \leftarrow D[0] + \delta^1_{0/1} X_j, \qquad P[0] \leftarrow P[0] + D[0]$

Thus at the end of this computation D[0] contains $\delta^1_{0/1} X_j$ and P[0] contains $C_1 X_j$. At the following time step, at time t = j+2, D[0] and P[0] would be used for computing the third product term $P_3\big|_{t=j+2}$ of the FIR output at time t = j+2, which is $C_2 X_j$, as follows:

$D[0] \leftarrow D[0] + \delta^2_{0/2} X_j, \qquad P[0] \leftarrow P[0] + D[0]$

Thus at the end of this computation D[0] contains $\delta^1_{1/2} X_j$ and P[0] contains $C_2 X_j$. This process would be continued, and D[0] and P[0] would accumulate results for N time steps, after which they would be reset to zero and the process would start all over again.

The remaining N-1 pairs of D and P variables go through the same process, except that the initialization times of the N pairs are all distinct. At any instant of time one and only one pair is initialized, and at the next instant of time the next sequential (modulo N) pair is initialized.

Thus we see that using this technique only two additional variables per product term are required to store the intermediate results. Since we have N product terms, we need a total of 2N extra storage variables as compared to direct-form computation. Two more additions per product term (as compared to the direct form), one each to update D[k] and P[k], are also needed.
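The walkthrough above for D[0] and P[0] can be checked with a short sketch. The snippet below reproduces, for a single data sample x, the incremental generation of the product terms $C_k x$ using one (D, P) pair and the stored values $C_0$, $\delta^1_{0/1}$, and the second-order differences; the full filter would rotate N such pairs, initializing one per time step. Names are illustrative.

```python
# Second-order DCM update for one (D, P) pair over N time steps:
# two additions per product term, no multiplications by the C_k.

def product_terms_second_order(C, x):
    N = len(C)
    d1_01 = C[1] - C[0]                       # delta^1_{0/1}
    d2 = [(C[k] - C[k - 1]) - (C[k - 1] - C[k - 2]) for k in range(2, N)]
    D, P = 0, C[0] * x                        # t = j: P holds C0*x
    terms = [P]
    for k in range(1, N):
        D += (d1_01 * x) if k == 1 else (d2[k - 2] * x)
        P += D                                # P now holds C_k * x
        terms.append(P)
    return terms

C = [1, 3, 6, 10, 15]
assert product_terms_second_order(C, 2) == [c * 2 for c in C]
```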

5.1.1.4 Algorithm Using Generalized mth-Order Differences

We can now generalize the algorithm to use differences up to the mth order, with the mth-order difference defined as

$\delta^m_{k-m/k} = \delta^{m-1}_{k-m+1/k} - \delta^{m-1}_{k-m/k-1}$ for k = m, ..., N-1 ...... (5.13)

We can obtain the recurrence relationship between the coefficients (Figure 5.1) using mth-order differences as

$C_k = C_{k-1} + \delta^1_{k-2/k-1} + \delta^2_{k-3/k-1} + \cdots + \delta^{m-1}_{k-m/k-1} + \delta^m_{k-m/k}$ ...... (5.14)

which follows by substituting in Eq. (5.6) the definition for the first-order difference in terms of the second-order difference as given by Eq. (5.10), then substituting for the second-order difference in terms of the third-order difference as defined by Eq. (5.13) with m = 3, and so on, recursively, up to the mth-order difference. We can again compute each product term incrementally, as shown before for the second-order differences algorithm. However, we now need to store m intermediate results for each product term, stored using the same technique as for the second-order differences algorithm in the m array variables D1[k], D2[k], D3[k], ..., Dm-1[k], P[k]. Each array will be of size N. Therefore we need a total of mN storage variables in addition to the storage requirements of the direct form. We also need m additions per product term to update the intermediate result storage variables, for a total of mN more additions per convolution as compared to the direct form.

5.1.1.5 Negative Differences

The differences between coefficients can be positive or negative. When the differences are negative, the differential coefficients method can still be used without any major modification. As an example, consider the first-order differences method for computing $P_k\big|_{t=j+1}$. We are computing

$P_k\big|_{t=j+1} = P_{k-1}\big|_{t=j} + \delta^1_{k-1/k} X_{j-k+1}$

for each product term in the sum for $Y_{j+1}$, where $\delta^1_{k-1/k}$ may be positive or negative. Irrespective of the sign of $\delta^1_{k-1/k}$, we can get the absolute value of the partial product by computing the product $|\delta^1_{k-1/k}|\,|X_{j-k+1}|$. We can then add it to or subtract it from the term $P_{k-1}\big|_{t=j}$, depending on the product of the signs of $\delta^1_{k-1/k}$ and $X_{j-k+1}$, to obtain $P_k\big|_{t=j+1}$. We have no control over the sign of the partial product anyway, irrespective of the sign of the difference $\delta^1_{k-1/k}$, since it depends on the sign of $X_{j-k+1}$. This technique can also be used for algorithms using greater orders of differences.

5.1.1.6 Sorted Recursive Differences

One limitation of DCM is that it can be applied only to systems where the envelope
generated by the coefficient sequence (and various orders of differences) is a
smoothly varying continuous function; thus it was beneficial largely for low-pass FIR
filters. We present an improved version of DCM, called the sorted recursive
differences (SRD) method, which uses recursive sorting of coefficients and various
orders of differences to maximize the computational reduction. This recursive sorting
is made possible by the transposed direct form of FIR output computation. Thus
there are no restrictions on the coefficient sequence to which it is applicable (or the
sequences of various orders of differences). The effective word length reduction
using the DCM was not the same for each coefficient. Instead of pessimistically
using the worst case reduction as mentioned earlier, one can use a simple statistical
estimate for the effective reduction in the number of 1’s in the coefficients.

5.1.1.7 Transposed Direct-Form Computation

In transposed direct form (TDF) FIR computation, all N product terms involving a particular data sample are computed before any product term with the next sequential data sample is computed. The product terms are accumulated in N different registers since they belong to N sequential outputs. The effective throughput remains the same as for direct-form computation. Signal flow graphs for direct-form and TDF computation are shown in Figures 5.2 and 5.3, respectively.

Figure 5.2 Signal flow graph for direct form realization of an even-length FIR system

Figure 5.3 Signal flow graph for transposed direct form realization of the same FIR
system as in Figure 5.2.

One advantage of using TDF computation lies in the fact that it does not matter in which order we compute the product terms involving a particular data sample, so long as we accumulate them in the right registers. Therefore we can sort the coefficients in non-decreasing (or non-increasing) order before taking first-order differences. It can be shown that this ordering minimizes the sum of the absolute values of the first-order differences.

Sorting can also be applied to differences. Thus we can generate the second-order
differences from the sorted set of first-order differences. This could be recursively
done up to any order of differences. The various permutations of the sets of
different orders of differences needed for a correct restoration could be hardwired in
the control unit. This would enable it to accumulate partial products in appropriate
locations so that the correct output is produced. Thus if the DCM is applied in a TDF
realization with recursive sorting, we can eliminate the restrictions that were earlier
imposed on the coefficients and various orders of differences for the DCM to be
viable.

Whereas sorting guarantees that differences of all orders will be nonnegative, the
coefficients can be positive or negative. When two consecutive coefficients in the
sorted sequence are of opposite signs, the magnitude of the algebraic difference
between them is larger than the magnitude of either one. To decrease the range of
the coefficients (hence smaller differences), one can use absolute values to
compute differences and then manipulate the sign.
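A sketch of the combined idea, under the assumption of integer data and with illustrative names: coefficients are visited in sorted order, so each step needs only a short multiply by a first-order difference of the sorted sequence, and the TDF permutation routes each running value to the correct output accumulator.

```python
# Transposed direct form with sorted coefficients (SRD idea, first
# order only): sorting minimizes the sum of |first-order differences|.

def tdf_sorted_outputs(C, X):
    N = len(C)
    order = sorted(range(N), key=lambda k: C[k])  # hardwired permutation
    acc = [0] * (len(X) + N - 1)                  # N output accumulators
    for n, x in enumerate(X):
        running, prev_c = 0, 0
        for k in order:
            running += (C[k] - prev_c) * x        # short multiply only
            prev_c = C[k]
            acc[n + k] += running                 # running == C[k] * x
    return acc[:len(X)]

C, X = [5, -2, 7, 1], [3, 1, 4, 1, 5]
direct = [sum(C[k] * X[n - k] for k in range(len(C)) if n - k >= 0)
          for n in range(len(X))]
assert tdf_sorted_outputs(C, X) == direct
```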

5.1.2 Power-Constrained Least-Squares Optimization for Adaptive and Nonadaptive Filters

The power-constrained least-squares (PCLS) technique can be applied to both adaptive and nonadaptive digital filters. The basic idea of this technique is to reduce the number of 1's in the representation of the coefficients so that the number of additions in a multiplier can be reduced, thus achieving low power dissipation. However, due to changes in the coefficient representation, the performance of the filter can change. Hence, changes in the filter coefficients can be allowed within a range such that the change in performance is within the tolerance limit. Complexity reduction is one of the oldest topics in the classical signal processing literature. However, the goal pursued historically was to make a proposed filter implementation possible for practical state-of-the-art applications. Another motivation has been to reduce the cost of implementation.

Higher speed translates to lower power via a voltage scaling approach. Further, lower complexity in terms of the number of operations directly improves power. A classical measure of complexity has been the total number of add and multiply operations. Power consumption is related to such a measure, however, only indirectly. A direct estimate considers the number of switching events, and a more complex algorithm may consume lower power if it causes less switching activity. Hence, a low-power algorithm design approach is one that attempts to reduce the overall switching activity.

Recall that PCLS (power-constrained least-squares) coefficients attempt to reduce the switching activity by reducing the number of 1 bits in the coefficients. Since each 1 bit corresponds to a shift-and-add (SAA) operation, it consumes energy. Hence, by constraining the number of 1 bits, we can constrain the dissipated power.

5.1.2.1 Constrained Least-Squares Technique for Nonadaptive Filters

In this section we present the constrained least-squares (CLS) approach used to compute the modified coefficient vector $k = [k_0, k_1, \ldots, k_{M-1}]^T$. The original coefficient vector is represented by $c = [c_0, c_1, \ldots, c_{M-1}]^T$. We define the code class of a number a as the number of 1 bits in its binary representation. The maximum code class allowable per coefficient will be represented by κ. The vector k obtained using the minimization technique replaces c in the actual implementation.

To formulate a CLS problem, we first need to decompose the coefficient vectors into constituent bits. For simplicity we will ignore the sign bits of the coefficient values and concentrate only on the magnitude. Further, we will assume sign-magnitude representation of numbers. Now, our goal is to find a vector k that is closest to a given coefficient vector c in a least-squares sense, constrained by a maximum number of allowable add operations. Clearly, the solution to this problem will not invert the sign bit of a component of k relative to the corresponding component of c, as we can always find another solution yielding a smaller or equal least-squares estimator (LSE) without having the sign bit inverted.
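The notion of code class is easy to make concrete. The sketch below counts 1 bits and greedily keeps only the κ most significant 1 bits of a coefficient magnitude. This illustrates the constraint being imposed, not the actual least-squares minimization, which would search for the closest such vector.

```python
# Code class of a coefficient and a greedy way to cap it at kappa ones.

def code_class(a: int) -> int:
    """Number of 1 bits in the binary magnitude of a."""
    return bin(abs(a)).count("1")

def trim_to_code_class(a: int, kappa: int) -> int:
    sign = -1 if a < 0 else 1
    mag = abs(a)
    ones = [i for i in range(mag.bit_length()) if (mag >> i) & 1]
    kept = 0
    for i in sorted(ones, reverse=True)[:kappa]:  # keep MSB-side 1 bits
        kept |= 1 << i
    return sign * kept

c = 0b101101101                      # code class 6
k = trim_to_code_class(c, 3)         # 0b101100000, code class 3
print(code_class(c), code_class(k))
```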

5.1.3 Circuit Activity Driven Architectural Transformations

Digital filters, a very important class of DSP circuits, can be represented by state equations on which architectural transforms can be applied. Here we consider heuristic transforms based on the properties of associativity and commutativity of linear operations in linear time-invariant digital circuits.

Figure 5.4 Data flow graph of IIR filter.

Figure 5.4 shows the data flow graph of an infinite impulse response (IIR) filter. The computation tree for $s_1(t+1)$, for example, is described by the equation $s_1(t+1) = c_1 s_1(t) + c_3 s_2(t) + k\,u(t)$.

The filter can be implemented using either word-parallel or bit-serial arithmetic. In the word-parallel case, each signal (arc) of the data flow graph of Figure 5.4 represents W bits of data comprising a data word. The W bits $b_W, b_{W-1}, \ldots, b_1$ are fed in parallel to the respective adders and multipliers. The delays are designed to hold W bits in parallel. At time t+1, let z out of the W bits have different logic values than at time t. The signal activity is defined as the ratio of z over W and is given by β(t) = z/W. The variable β(t) is a random variable for different values of t and represents a stochastic process. The average activity θ(t, t+N) of a signal over N consecutive time frames is defined as

$\theta(t, t+N) = \frac{1}{N} \sum_{i=t}^{t+N-1} \beta(i)$ ...... (5.15)

In the case of bit-serial arithmetic, the bit values of the data word are transmitted serially over a single data line over consecutive time steps. Thus it is not inter-word differences in bit values, but intra-word bit differences, that cause node activity. Experiments over large values of N show that the average activity θ(0, N) remains constant, indicating that the stochastic process is strict-sense stationary. Average power dissipation in this case is proportional to

$\sum_{i=1}^{\gamma} \theta_i C_i$ ...... (5.16)

where $\theta_i$ is the average activity at node i (out of γ such nodes) and $C_i$ is the capacitive load on the ith node.
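These activity measures are straightforward to compute from a simulated sequence of word values. A sketch with illustrative names, following Eqs. (5.15) and (5.16):

```python
# Signal activity beta = z/W for a W-bit word, its average theta over
# consecutive frames, and the relative power measure of Eq. (5.16).

def beta(prev_word: int, word: int, W: int) -> float:
    z = bin(prev_word ^ word).count("1")    # bits that changed value
    return z / W

def average_activity(samples, W):
    acts = [beta(a, b, W) for a, b in zip(samples, samples[1:])]
    return sum(acts) / len(acts)

def relative_power(thetas, caps):
    return sum(t * c for t, c in zip(thetas, caps))

print(average_activity([0b0000, 0b1111, 0b1010], W=4))  # (1.0 + 0.5)/2
```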

The architectural transforms on the DSP filters are based on the following observations, obtained through extensive simulation. Consider a word-parallel computation tree with I inputs $i_1, i_2, \ldots, i_I$ and output $y = \sum_{j=1}^{I} a_j i_j$, the $a_j$ being rational constants. For simplicity, let $I = 2^l$, where l is the number of levels in a perfectly balanced adder tree, as shown in Figure 5.5.

Figure 5.5 Balanced computation tree for l = 2.

If the input values of the tree are mutually independent, then the minimum average value of $\theta_i$ over all nodes of a balanced adder tree with I inputs is obtained when (a) $a_1 \ge a_2 \ge a_3 \ge \cdots \ge a_I$ or (b) $a_1 \le a_2 \le a_3 \le \cdots \le a_I$. For the case of a linear array of adders, as shown in Figure 5.6:

Figure 5.6 Linear array of adders.

The minimum $\theta_i$ over all the nodes of a computation tree is achieved when $a_1 \le a_2 \le a_3 \le \cdots \le a_I$. The observations above are for mutually independent inputs. Due to reconvergent fanout, signals can become correlated at the internal nodes. However, the transformations seem to apply reasonably well even with such correlated signals. The synthesis algorithm is based on the above observations and on simulation. The given circuit is simulated at the functional level with random, mutually independent circuit input values. The activities at the inputs of all adders are noted. By applying the above two hypotheses, the adder trees are restructured and the average activities recomputed. This analysis is repeated until there are no further improvements.

30
The procedure forces additions with high activity to be moved closer to the root of
a computation tree and vice versa. Note that no assumptions were made
regarding the implementation details of the adders or the multipliers. Assuming
that the capacitances at the internal nodes are all equal, improvement of up to
23% in power dissipation can be achieved.
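A minimal sketch of the restructuring rule for a linear adder chain, under the simplifying assumption that coefficient magnitude is a proxy for node activity: small terms are scheduled deep in the chain and large ones near the root. This illustrates the heuristic only; the actual synthesis loop re-simulates and iterates rather than sorting once.

```python
# Order the terms a_j * x_j of a linear adder chain so that
# a_1 <= a_2 <= ... <= a_I, per the observation above.

def linear_chain_order(terms):
    """terms: list of (coeff, signal). Returns the evaluation order."""
    return sorted(terms, key=lambda t: abs(t[0]))

for coeff, name in linear_chain_order([(8, "x1"), (1, "x2"), (3, "x3")]):
    print(f"add {coeff}*{name}")   # chain order: x2, x3, x1
```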

5.1.4 Architecture-Driven Voltage Scaling

A large improvement in power dissipation can be obtained if the supply voltage is scaled down, as $V_{dd}$ appears as a square term in the expression for average power dissipation. However, one immediate side effect is the increase in circuit delay due to the voltage reduction. The increase in circuit delay can possibly be compensated for by scaling down the device sizes. However, in the submicrometer range the interconnect capacitances do not scale proportionately and can become dominant. Hence, it is worth looking at architectural transformations that compensate for the delay, so that lower power dissipation can be achieved by scaling down the supply voltage. One simple way to reduce the supply voltage is to utilize a parallel or pipelined architecture.

As an example, consider the simple data path with an adder and a comparator shown in Figure 5.7. Let the power dissipation for the data path be given by $P = CV^2 f$, where C is the total effective capacitance being switched per clock cycle. Assume that scaling the supply voltage down by about 40% (to a factor of 0.58 of its original value) makes the data path unit work at half the speed. To maintain the throughput using the parallel configuration shown in Figure 5.8, the effective capacitance is almost doubled. In fact, the effective capacitance is more than doubled due to the extra routing required. Assuming that the capacitance is increased by a factor of 2.15, the power dissipation is given by $P_{par} = (2.15C)(0.58V)^2(0.5f) = 0.36P$. This method of using parallelism to reduce power has the overhead of more than twice the area and is not suitable for area-constrained designs.

Figure 5.7 Data path operator

Figure 5.8 Parallel Implementation

Figure 5.7 Data path operator

Figure 5.8 Parallel Implementation

Figure 5.9 shows another approach, pipelining, which has the advantage of a smaller area overhead. With the additional pipeline latch, the critical path becomes $\max[T_{adder}, T_{comparator}]$, allowing the adder and the comparator to operate at a slower speed. If one assumes the two delays to be equal, the supply voltage can again be reduced from 5 to 2.9 V, the voltage at which the delay doubles, with no loss in throughput. Due to the addition of the extra latch, if the effective capacitance increases by a factor of 1.15, then $P_{pipe} = (1.15C)(0.58V)^2(f) = 0.39P$. As an added bonus of pipelining, increasing the levels of pipelining has the effect of reducing logic depth and hence the power contributed by hazards. An obvious extension is to use a combination of pipelining and parallelism to obtain an area- and power-constrained design.

Figure 5.9 Pipelined implementation
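The two power figures quoted above follow directly from $P = CV^2 f$; a two-line check:

```python
# Relative power after scaling capacitance, voltage, and frequency.
def scaled_power(cap_scale, volt_scale, freq_scale):
    return cap_scale * volt_scale**2 * freq_scale

print(round(scaled_power(2.15, 0.58, 0.5), 2))  # parallel:  0.36 P
print(round(scaled_power(1.15, 0.58, 1.0), 2))  # pipelined: 0.39 P
```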

5.1.5 Power Optimization Using Operation Reduction

The easiest way to reduce the weighted capacitance being switched is to reduce the number of operations performed in the data flow graph. Reducing the operation count reduces the total capacitance associated with the system and, hence, can reduce the power dissipation. However, reduction in the number of operations can have an adverse effect on critical paths. Let us consider the implementation of the function $X^2 + AX + B$. A straightforward implementation is shown in Figure 5.10a. However, the implementation of Figure 5.10b has one less multiplier.

The critical path lengths of both implementations are the same. Hence, the second design is preferable from both the power and area points of view. However, operation reduction sometimes increases the critical path length, as shown in Figure 5.11. The figure computes the polynomial given by $X^3 + AX^2 + BX + C$. Figure 5.11a shows an implementation with four multipliers and three adders, while the implementation of Figure 5.11b has two multipliers and three adders. The latter implementation is preferable for area and power; however, its critical path is longer than that of the former.

Figure 5.10 Reducing operations maintaining throughput

Figure 5.11 Reducing operations with less throughput
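Both transformations are instances of factoring, which is easy to verify numerically. A sketch comparing operation counts for the two polynomials above:

```python
# Operation reduction by factoring; both forms compute the same value.

def quad_direct(X, A, B):
    return X * X + A * X + B       # 2 multiplies, 2 adds (Fig. 5.10a)

def quad_factored(X, A, B):
    return X * (X + A) + B         # 1 multiply, 2 adds (Fig. 5.10b)

def cubic_horner(X, A, B, C):
    # X^3 + A*X^2 + B*X + C with 2 multiplies and 3 adds, but a longer
    # critical path than the 4-multiplier form of Fig. 5.11a.
    return ((X + A) * X + B) * X + C

assert quad_direct(3, 4, 5) == quad_factored(3, 4, 5)
assert cubic_horner(2, 3, 4, 5) == 2**3 + 3 * 2**2 + 4 * 2 + 5
```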

5.1.6 Power Optimization Using Operation Substitution

It is well known that certain operations require more computational energy than others. In DSP circuits, multiplication and addition are the two most important operations performed. Multiplication consumes more energy per computation than addition. Hence, during high-level synthesis, if multiplication can be replaced by addition, one can not only save area but also achieve improvement in power dissipation. However, such transformations are usually associated with an increase in the critical path of the design.

Let us consider the example shown in Figure 5.12. Using the concept of distributivity and common subexpression utilization, the circuit of Figure 5.12a can be transformed into the circuit of Figure 5.12b. The critical path of the transformed circuit is longer than the critical path of circuit (a); however, it is possible to substitute an addition operation for a multiplication operation, thereby reducing the energy of computation. Other useful transformations can be noted. Multiplication by constants is widely used in discrete cosine transforms (DCT), filters, and so on. If the multiplication is replaced by shift-and-add operations, then considerable savings in power can be achieved. Experiments on an 11-tap FIR filter show that the power consumed by the execution unit is lowered to one-eighth of the original power, and the power consumed by the registers is also reduced. A small penalty is paid in controller power. An overall improvement of 62% in power dissipation was seen for the FIR filter.

Figure 5.12 Substituting addition for multiplication
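A sketch of constant multiplication decomposed into shift-and-add operations; each 1 bit of the constant costs one SAA step, which is why coefficients with few 1 bits (as targeted by PCLS) are cheap.

```python
# Multiply x by a non-negative integer constant using only shifts/adds.
def const_multiply(x: int, c: int) -> int:
    acc = 0
    for i in range(c.bit_length()):
        if (c >> i) & 1:          # one shift-and-add per 1 bit of c
            acc += x << i
    return acc

assert const_multiply(37, 10) == 370   # 10 = 0b1010: two SAA steps
```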

5.1.7 Precomputation-Based Optimization for Low Power

Precomputation logic optimization is a method to trade area for power in a synchronous digital circuit. The principle of precomputation logic is to identify logical conditions at some inputs of a combinational logic block under which the output is invariant. Since those input values do not affect the output, the input transitions can be disabled to reduce switching activity.

Basics of Precomputation Logic: One variant of the precomputation architecture is shown in Figure 5.13. R1 and R2 are registers with a common clock feeding a combinational logic circuit with a known Boolean function f(X). Due to the nature of the function f(X), there may be some conditions under which the output of f(X) is independent of the logic value of R2. Under such a condition, we can disable the register loading of R2 to avoid causing unnecessary switching activity, thus conserving power. The Boolean function f(X) is still correctly computed because it receives all required values from R1. To generate the load-disable signal to R2, a precomputation Boolean function g(X) is required to detect the condition under which f(X) is independent of R2. We will discuss how to identify the independence condition given an arbitrary Boolean function f(X). By definition, g(X) depends on the input signals of R1 only, because the load-disable condition must be independent of R2; otherwise, f(X) would depend on the inputs of R2 while the load-disable signal is active.

Figure 5.13 Precomputation architecture

In the general form of this architecture, two precomputation functions f1 and f2 are used, where f1 = 1 implies f(X) = 1 and f2 = 1 implies f(X) = 0, so that R2 can be load-disabled whenever f1 + f2 = 1. However, depending on the complexity of the logic functions f1 and f2, the switching activity at the internal nodes of f1 or f2 can be significant. Hence, for the precomputation scheme to work effectively, the set of inputs fed to register R2 should be large, while the complexity of the logic blocks f1 and f2 should be small. One would also like the signal probability of f1 + f2, that is, P(f1) + P(f2) - P(f1)P(f2), to be large for the scheme to work effectively on average. It has been noted that for some common functions considerable savings in power can be achieved by properly selecting the sets of inputs for registers R1 and R2.

A remarkable example of precomputation logic is the binary comparator function f(A, B) that computes A > B, as shown in Figure 5.14. Let A1, ..., An and B1, ..., Bn be the bits of two n-bit signals that represent binary values. The output f(A, B) is logic 1 if the binary value of A is greater than that of B. We choose R1 to hold (An, Bn) and R2 the other signals. The signals An and Bn are the most significant bits of the input values. An obvious precomputation function is g(X) = An ⊕ Bn. When An ⊕ Bn = 1, meaning the two bits differ, the output of f(A, B) can be determined without the input bits of R2. If An = 1, Bn = 0, the output is 1 independent of the least significant bits of A or B. Again, if An = 0, Bn = 1, the output is 0 regardless of the inputs of R2. When An ⊕ Bn = 0, R2 cannot be disabled because its signals are required to compute the output f(A, B).

A simple statistical analysis may help to convince us that the precomputation logic implementation of Figure 5.14 is very attractive. Assume uncorrelated input bits with uniform random probabilities, where every bit has an equal probability of being zero or one. There is a 50% probability that An ⊕ Bn = 1, so the register R2 is disabled in 50% of the clock cycles. Therefore, with only one additional 2-input XOR gate, we have reduced the signal switching activity of the 2n-2 least significant bits at R2 to half of its original expected switching frequency. Also, when the load-disable signal is asserted, the combinational logic of the comparator has less switching activity because the outputs of R2 are not switched. The extra power required to compute An ⊕ Bn is negligible compared to the power saving, even for moderate sizes of n.

From the above discussion, it is obvious that a designer needs some knowledge of the input signal statistics to apply the precomputation logic technique. In Figure 5.14, if the probability of An ⊕ Bn is close to zero, the precomputation logic circuit may be inferior, in power and area, to a direct implementation. Experimental results have shown up to 75% power reduction, with an average of 3% area overhead and 1 to 5 additional gate delays in the worst-case delay path.

Figure 5.14 Binary comparator function using precomputation logic
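A behavioral sketch of this comparator, simulating the load-disable statistics for random inputs; the names and the trial count are illustrative.

```python
# Precomputed comparator: g = An XOR Bn disables the low-order register
# whenever the most significant bits already decide A > B.
import random

def compare_with_precompute(a: int, b: int, n: int):
    an, bn = (a >> (n - 1)) & 1, (b >> (n - 1)) & 1
    if an ^ bn:                   # g(X) = 1: R2 load-disabled
        return an == 1, True      # output decided by the MSBs alone
    return a > b, False           # full comparison needed

n, trials, disabled = 8, 10_000, 0
for _ in range(trials):
    a, b = random.getrandbits(n), random.getrandbits(n)
    out, skipped = compare_with_precompute(a, b, n)
    assert out == (a > b)
    disabled += skipped
print(disabled / trials)          # close to 0.5 for random inputs
```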

5.2 SOFTWARE DESIGN FOR LOW POWER

Most efforts in controlling power dissipation of digital systems have been and
continue to be focused on hardware design. There is good reason for this since
hardware is the physical means by which power is converted into useful
computation. However, it would be unwise to ignore the influence of software on
power dissipation. In systems based on digital processors or controllers, it is
software that directs much of the activity of the hardware.

Consequently, the manner in which software uses the hardware can have a
substantial impact on the power dissipation of a system. An analogy drawn from
automobiles can help explain this further. The manner in which one drives can have
a significant effect on total fuel consumption. Until recently, there were no efficient
and accurate tools to estimate the overall effect of a software design on power
dissipation. Without a power estimator there was no way to reliably optimize
software to minimize power.

Some software design techniques are already known to reduce certain
components of power dissipation, but the global effect is more difficult to
quantify. For example, it is often advantageous to optimize software to minimize
memory accesses, but there may be energy-consuming side effects, such as an
increase in number of instructions. Not all low-power software design problems
have been solved, but progress has been made.

5.2.1 Sources of Software Power Dissipation

There are several contributors to CPU power dissipation that can be influenced
significantly by software. The memory system takes a substantial fraction of the
power budget (on the order of one-tenth to one-fourth) for portable computers
and it can be the dominant source of power dissipation in some memory-intensive
DSP applications such as video processing.

System buses (address, instruction, and data) are a high-capacitance component for which switching activity is largely determined by the software. Data paths in integer arithmetic logic units (ALUs) and floating-point units (FPUs) are demanding of power, especially if all operational units (e.g., adder, multiplier) or pipeline stages are activated at all times. The power overheads of control logic and clock distribution are relevant since their contribution to a program's energy dissipation is proportional to the number of execution cycles of the program. Power management should further increase the influence of software on power in at least two ways. As the power of idle components is reduced, there should be an increase in the proportion of power devoted to carrying out the work actually ordered by the software. In addition, some processors allow software direct control of some power-saving modes.

Memory accesses are expensive for several reasons. Reading or writing a memory
location involves switching on highly capacitive data and address lines going to
the memory, row and column decode logic, and word and data lines within the
memory that have a high fanout.

The mapping of data structures into multiple memory banks can influence the degree to which parallel loading of multiple words is possible. Parallel loads not only improve performance but are also more energy efficient.

The memory access patterns of a program can greatly affect the cache performance
of a system. Unfavorable access patterns (or a cache that is too small) will lead to
cache misses and a lot of costly memory accesses. In multidimensional signal
processing algorithms, the order and nesting of loops can alter memory size and
bandwidth requirements by orders of magnitude. Compact machine code decreases
memory activity by reducing the number of instructions to be fetched and reducing
the probability of cache misses. Cache accesses are more energy efficient than main
memory accesses. The cache is closer to the CPU than is main memory, resulting in
shorter and less capacitive address and data lines. The cache is also much smaller
than main memory, leading to smaller internal capacitance on word and data lines.

Buses in an instruction processing system typically have high load capacitances due to
the number of modules connected to each bus and the length of the bus routes. The
switching activity on these buses is determined to a large degree by software [3].
Switching on an instruction bus is determined by the sequence of instruction op-codes
to be executed. Similarly, switching on an address bus is determined by the sequence
of data and instruction accesses. Both effects can often be accounted for at compile
time. Data related switching is much harder to predict at compile time since most
input data to a program are not provided until execution time.

Data paths for ALUs and FPUs make up a large portion of the logic power dissipation in a CPU. Even if the exact data input sequences to the ALU and FPU are hard to predict, the sequence of operations and data dependencies are determined during software design and compilation. The energy to evaluate an arithmetic expression might vary considerably with the choice of instructions. A simple example is the common compiler optimization technique of reduction in strength, where an integer multiplication by 2 can be replaced by a cheaper shift-left operation. Poor scheduling of operations might result in unnecessary, energy-wasting pipeline stalls.

Some sources of power dissipation such as clock distribution and control logic
overhead might not seem to have any bearing on software design decisions.
However, each execution cycle of a program incurs an energy cost from such
overhead. The longer a program requires to execute, the greater will be this
energy cost. In fact, it is found that the shortest code sequence was invariably the
lowest energy code sequence for a variety of microprocessor and DSP devices. In
no case was the lower average power dissipation of a slightly longer code
sequence enough to overcome the overhead energy costs associated with the
extra execution cycles. However, this situation may change as power management
techniques are more commonly used. In particular, power management commonly
takes the form of removing the clock from idle components. This reduces the
clock load as well as prevents unwanted computations.

5.2.2 Software Power Estimation

The first step toward optimizing software for low power is to be able to estimate
the power dissipation of a piece of code. This has been accomplished at two basic
levels of abstraction. The lower level is to use existing gate level simulation and
power estimation tools on a gate level description of an instruction processing
system. A higher level approach is to estimate power based on the frequency of
execution of each type of instruction or instruction sequence (i.e., the execution
profile). The execution profile can be used in a variety of ways. Architectural power
estimation determines which major components of a processor will be active
during each execution cycle of a program. Power estimates for each active
component are then taken from a look-up table and added into the power
estimate for the program. Another approach is based on the premise that the
switching activity on buses (address, instruction, and data) is representative of
switching activity (and power dissipation) in the entire processor. Bus switching
activity can be estimated based on the sequence of instruction op-codes and
memory addresses.

The final approach we will consider is referred to as instruction level power
analysis. This approach requires that power costs associated with individual
instructions and certain instruction sequences be characterized empirically for the
target processor. These costs can be applied to the instruction execution
sequence of a program to obtain an overall power estimate.

5.2.2.1 Gate Level Power Estimation

Gate level power estimation of a processor running a program is the most accurate
method available, assuming that a detailed gate level description including layout
parasitics is available for the target processor. Such a detailed processor description
is most likely not available to a software developer, especially if the processor is not
an in-house design. Even if the details are available, this approach will be too slow
for low-power optimization of a program. Gate level power estimates are important
in evaluating the power dissipation behavior of a processor design and in
characterizing the processor for the more efficient instruction level power estimation
approaches.

5.2.2.2 Architecture Level Power Estimation

Architecture level power estimation is less precise but much faster than gate level
estimation. This approach requires a model of the processor at the major
component level (ALU, register file, etc.) along with power dissipation estimates for
each component. It also requires a model of the specific components which will be
active as a function of the instructions being executed. The architecture level
approach is implemented in a power estimation simulator called ESP (Early design
Stage Power and performance simulator). ESP simulates the execution of a program,
determining which system components are active in each execution cycle and
adding the power contribution of each component.

5.2.2.3 Bus Switching Activity

Bus switching activity is another indicator of software power, based on a simplified model of processor power dissipation. In this simplified model, bus activity is assumed to be representative of the overall switching activity in a processor. Modeling bus switching activity requires knowledge of the bus architecture of the processor, the op-codes of the instruction set, a representative set (or statistics) of input data to a program, and the manner in which code and data are mapped into the address space. By simulating the execution of a program, one can determine the sequences of op-codes, addresses, and data values appearing on the various buses. Switching statistics can be computed directly from these sequences of binary values.
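Computing those statistics amounts to summing Hamming distances between consecutive words on each bus. A sketch with an illustrative address trace:

```python
# Total bit toggles on a bus = sum of Hamming distances between
# consecutive bus values.
def bus_toggles(values):
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))

addr_trace = [0x1000, 0x1004, 0x1008, 0x100C, 0x2000]
print(bus_toggles(addr_trace))
```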

5.2.2.4 Instruction Level Power Analysis

Instruction level power analysis (ILPA) defines an empirical method for


characterizing the power dissipation of short instruction sequences and for using
these results to estimate the power (or energy) dissipation of a program. A detailed
description of the internal design of the target processor is not required. However,
some understanding of the processor architecture is important in order to
appropriately choose the types of instruction sequences to be characterized. This
understanding becomes even more important when looking at ways to optimize a
program based on the power estimate.

The first requirement for ILPA is to characterize the average power dissipation
associated with each instruction or instruction sequence of interest. Three
approaches have been documented for accomplishing this characterization. The
most straightforward method, if the hardware configuration permits, is to directly
measure the current drawn by the processor in the target system as it executes
various instruction sequences. If this is not practical, another method is to first use a
hardware description language model (such as Verilog or VHDL) of the processor to
simulate execution of the instruction sequences. An actual processor can then be
placed in an automated IC tester and exercised using test vectors obtained from the
simulation. An ammeter is used to measure the current draw of the processor in the
test system. A third approach is to use gate level power simulation of the processor
to obtain instruction level estimates.
The choice of instruction sequences for characterization is critical to the success of
this method. As a minimum, it is necessary to determine the base cost of individual
instructions. Base cost refers to the portion of the power dissipation of an instruction
that is independent of the prior state of the processor. Base cost excludes the effect
of such things as pipeline stalls, cache misses, and bus switching due to the
difference in op-codes for consecutive instructions. The base cost of an instruction
can be determined by putting several instances of that instruction into an infinite
loop. Average power supply current is measured while the loop executes. The loop
should be made as long as possible so as to minimize estimation error due to loop
overhead (the jump statement at the end of the loop). However, the loop must not
be made so large as to cause cache misses. Power and energy estimates for the
instruction are calculated from the average current draw, the supply voltage, and the
number of execution cycles per instruction.
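As a numeric illustration of that last step, the sketch below converts a hypothetical ammeter reading into per-instruction power and energy; the supply voltage, current, clock rate, and cycle count are assumed values, not measurements from any particular processor.

#include <stdio.h>

int main(void)
{
    /* Assumed measurement conditions while a loop of the form
       L: ADD; ADD; ... ; ADD; JMP L  executes on the target. */
    double vdd    = 3.3;        /* supply voltage (V)                   */
    double i_avg  = 0.180;      /* measured average supply current (A)  */
    double t_clk  = 1.0 / 40e6; /* clock period at 40 MHz (s)           */
    int    cycles = 1;          /* execution cycles per ADD instruction */

    double p_base = vdd * i_avg;              /* base power (W)         */
    double e_base = p_base * cycles * t_clk;  /* base energy/instr. (J) */

    printf("base power  = %.3f W\n", p_base);
    printf("base energy = %.3g J per instruction\n", e_base);
    return 0;
}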

Base costs for each instruction are not always adequate for a precise software power
estimate. Additional instruction sequences are needed in order to take into account
the effect of prior processor state on the power dissipation of each instruction.
Pipeline stalls, buffer stalls, and cache misses are obvious energy-consuming events
whose occurrence depends on the prior state of the processor. Instruction sequences
can be created that induce each of these events so that current measurements can
be made. However, stalls and cache misses are effects that require a large scale
analysis or simulation of program execution in order to appropriately factor them into
a program’s energy estimate. There are other energy costs that can be directly
attributed to localized processor state changes resulting from the execution of a pair
of instructions. These costs are referred to as circuit state effects.

Consider the following code sequence as an example:
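(The original listing was lost in extraction; the fragment below is a plausible reconstruction consistent with the discussion that follows: an addition immediately followed by a multiplication, written in C with purely illustrative variable names.)

int example(int a, int b, int d, int e)
{
    int c = a + b;   /* compiles to an ADD instruction */
    int f = d * e;   /* compiles to a MUL instruction  */
    return c + f;
}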

The foremost circuit state effect is probably the energy cost associated with the
change in the state of the instruction bus as the op-code switches from that of an
addition operation to that of a multiplication operation.

Other circuit state effects for this example could include switching of control lines
to disable addition and enable multiplication, mode changes within the ALU, and
switching of data lines to reroute signals between the ALU and register file.
Although this example examines a pair of adjacent instructions, it is also possible
for circuit state effects to span an arbitrary number of execution cycles. This could
happen if consecutive activations of a processor component are triggered by
widely separated instructions. In such cases, the state of a component may have
been determined by the last instruction to use that component.

The circuit state cost associated with each possible pair of consecutive
instructions is characterized for instruction level power analysis by measuring
power supply current while executing alternating sequences of the two
instructions in an infinite loop. Unfortunately, it is not possible to separate the cost
of an A → B transition from a B → A transition, since the current measurement is
an average over many execution cycles.

Equation (5.17) concisely describes how the instruction level power measurements can be used to estimate the energy dissipation of a program:

$E_P = \sum_i (B_i \times N_i) + \sum_{i,j} (O_{i,j} \times N_{i,j}) + \sum_k E_k$ ---------- (5.17)

where $E_P$ is the overall energy cost of a program, decomposed into base costs, circuit state overhead, and stalls and cache misses. The first summation represents base costs, where $B_i$ is the base cost of an instruction of type $i$ and $N_i$ is the number of type $i$ instructions in the execution profile of a program. The second summation represents circuit state effects, where $O_{i,j}$ is the cost incurred when an instruction of type $i$ is followed by an instruction of type $j$. Because of the way $O_{i,j}$ is measured, we have $O_{i,j} = O_{j,i}$. Here, $N_{i,j}$ is the number of occurrences where instruction type $i$ is immediately followed by instruction type $j$. The last sum accounts for other effects, such as stalls and cache misses. Each $E_k$ represents the cost of one such effect found in the program execution profile.
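A direct implementation of Eq. (5.17) is straightforward once the characterization tables exist. The sketch below assumes three illustrative instruction types with made-up base and pair costs; the stall and cache-miss terms (the $E_k$ sum) are passed in as a single lump for brevity.

#include <stdio.h>

#define N_TYPES 3   /* illustrative types: 0 = ADD, 1 = MUL, 2 = LOAD */

/* Hypothetical characterization data in arbitrary energy units. */
static const double B[N_TYPES] = { 1.0, 2.5, 3.0 };   /* base costs B_i   */
static const double O[N_TYPES][N_TYPES] = {           /* pair costs O_i,j */
    { 0.0, 0.6, 0.4 },
    { 0.6, 0.0, 0.5 },   /* symmetric, since O_i,j = O_j,i */
    { 0.4, 0.5, 0.0 },
};

/* Energy of a program per Eq. (5.17), given its instruction-type trace.
   e_other lumps together the stall and cache-miss costs. */
static double program_energy(const int *trace, int n, double e_other)
{
    double e = 0.0;
    for (int k = 0; k < n; k++) {
        e += B[trace[k]];                     /* base cost term     */
        if (k > 0)
            e += O[trace[k - 1]][trace[k]];   /* circuit state term */
    }
    return e + e_other;
}

int main(void)
{
    int trace[] = { 2, 0, 1, 0 };             /* LOAD, ADD, MUL, ADD */
    printf("E_P = %.2f (arbitrary units)\n", program_energy(trace, 4, 0.0));
    return 0;
}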

5.2.3 Software Power Optimizations

Software optimizations for minimum power or energy tend to fall into one or more
of the following categories: selection of the least expensive instructions or
instruction sequences, minimizing the frequency or cost of memory accesses, and
exploiting power minimization features of hardware.

A prerequisite to optimizing a program for low power must always be to design an algorithm that maps well to available hardware and is efficient for the problem at
hand in terms of both time and storage complexity. Given such an algorithm, the
next requirement is to write code that maximizes performance, exploiting the
performance features and parallel processing capacity of the hardware as much as
possible. In doing so, we will have gone a long way to minimize the energy
consumption of the program. Maximizing performance also gives increased
latitude to apply other power optimization techniques such as reduced supply
voltages, reduced clock rates, and shutting off idle components.

It should be clarified that the energy optimization objectives differ with respect to
the intended application. In battery-powered systems, the total energy dissipation
of a processing task is what determines how quickly the battery is spent. In
systems where the power constraint is determined by heat dissipation and
reliability considerations, instantaneous or average power dissipation will be an
important optimization objective or constraint. An instantaneous power dissipation
constraint could lead to situations where one would need to deliberately degrade
system performance through software or hardware techniques. A hardware
technique is preferable since hardware performance degradation can be achieved
through energy-saving techniques (lower voltage, lower clock rate). Software
performance degradation is likely to increase energy consumption by increasing
the number of execution cycles of a program.

5.2.3.1 Algorithm Transformations to Match Computational Resources

The general problem of efficient algorithm design is a large area of study that
goes well beyond the scope of this chapter. However, many algorithm design
approaches have significant implications for software-related power dissipation.
One impact of algorithm design is on memory requirements, but we will defer that
problem to the next section. Another impact is on the efficient use of
computational resources. This problem has been addressed extensively in parallel
processor applications and in low-power DSP synthesis.

In parallel processor applications, a typical problem is to structure software in a way that maximizes the available parallelism. Parallel computing resources can
then be used to speed up program execution. In low-power DSP synthesis, a
typical problem is to design an algorithm to allow a circuit implementation that
minimizes power dissipation given throughput and area constraints. Often a low-
power DSP design will also exploit parallelism in an algorithm, but the objective is
to shorten critical paths so that supply voltages can be lowered while maintaining
overall performance. A similar trade-off is possible in multiple-processor or
superscalar systems. If a piece of software makes sufficient use of parallel
resources, it may become possible to lower the supply voltage and clock rate of
the processor to achieve power savings. This is already an option if one is
designing an application-specific processor. Researchers are working to make such
a trade-off possible in general-purpose processors through dynamic scaling of
supply voltage and clock rate under control of the operating system.

Reduction operations are an example of a common class of operations that lend themselves to a trade-off between resource usage and execution time. A simple
example is the summation of a sequence. Figures 5.15 – 5.17 illustrate what can
be done as the number of adder units is changed.

If only one adder is available, then Figure 5.15 is a sensible approach. Parallelizing
the summation would only force us to use additional registers to store
intermediate sums. If two adders are available, then the algorithm illustrated in
Figure 5.16 makes sense because it permits two additions to be performed
simultaneously. Similarly, Figure 5.17 fits a system with four adders by virtue of
making four different independent operations available at each time step.

In the general case, one cannot manipulate the parallelism of an algorithm quite
so conveniently. However, the principle is still applicable. The basic principle is to
try to match the degree of parallelism in an algorithm to the number of parallel
resources available.

Figure 5.15 Summation with single adder.

Figure 5.16 Summation with two adders.

Figure 5.17 Summation with four adders.
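In code, the three figures correspond to how the partial sums are organized. A minimal C sketch of the single-adder and two-adder patterns follows (the four-adder case extends the same idea with four partial sums):

/* Figure 5.15 pattern: one dependence chain, matches a single adder. */
double sum_serial(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];               /* each add waits on the previous one */
    return s;
}

/* Figure 5.16 pattern: two independent partial sums that two adders
   can advance in the same time step. */
double sum_two_way(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i];              /* chain 0: independent of chain 1 */
        s1 += x[i + 1];          /* chain 1 */
    }
    if (i < n)
        s0 += x[i];              /* odd leftover element */
    return s0 + s1;              /* final combining addition */
}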

5.2.3.2 Minimizing Memory Access Costs

Since memory often represents a large fraction of a system’s power budget, there
is a clear motivation to minimize the memory power cost. Happily, software design
techniques that have been used to improve memory system efficiency can also be
helpful in reducing the power or energy dissipation of a program. This is largely
due to memory being both a power and a performance bottleneck. If memory
accesses are both slow and power hungry, then software techniques that
minimize memory accesses tend to have both a performance and a power benefit.
Lower memory requirements also help by allowing total RAM to be smaller
(leading to lower capacitance) and improving the percentage of operations that
only need to access registers or cache.

Power minimization techniques related to memory concentrate on one or more of the following objectives:

➢ Minimize the number of memory accesses required by an algorithm.

➢ Minimize the total memory required by an algorithm.

➢ Put memory accesses as close as possible to the processor. Choose registers first,
cache next, and external RAM last.

➢ Make the most efficient use of the available memory bandwidth; for example, use
multiple-word parallel loads instead of single-word loads as much as possible.

For multidimensional signal processing applications, the nesting of loops controlling array operations and the order of operations can substantially influence the number of memory transfers and the total storage requirements.

Figure 5.18 Data path example for dual memory load

In general-purpose systems it may not be practical to craft the software and hardware so finely together, but there are techniques that are more readily
applicable to a wide variety of problems. One of these techniques is to maximize
parallel loads of multiple words from memory. Dual loads are beneficial for energy
reduction but not necessarily for power reduction because the instantaneous power
dissipation of a dual load is somewhat higher than for two single loads. A technique
for energy savings of as much as 47% by maximizing dual loads is demonstrated.
Achieving this savings required two phases. The first phase optimized the memory
allocation of a program to maximize opportunities for using dual loads. Memory
accesses were then combined as much as possible. Suppose that the data path
depicted in Figure 5.18 is required to evaluate the expression (X x Y) + Z. Suppose
also that there are two memory banks, all operations must be performed on register
values, and the processor supports dual loads and packing together of ALU and
memory operations. Depending on the memory allocation and objective chosen, the
optimal instruction sequence choice can vary.
Several practices that minimize memory bandwidth requirements are listed below:

➢ Register allocation to minimize external memory references.

➢ Cache blocking, that is, transformations on array computations so that blocks of array elements only have to be read into cache once (a C sketch follows this list).

➢ Register blocking, similar to cache blocking, except that redundant register loading of array elements is eliminated.

➢ Recurrence detection and optimization, that is, use of registers for values that are
carried over from one level of recursion to the next.

➢ Compaction of multiple memory references into a single reference.
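As a concrete illustration of cache blocking from the list above, the sketch below contrasts a naive array traversal with a tiled one; the matrix size and block size are assumptions chosen so that one tile fits comfortably in cache.

#define N   512
#define BLK 32    /* tile edge; chosen so two BLK x BLK tiles fit in cache */

/* Naive transpose: the column-order writes to b stride across memory and
   evict cache lines before they are reused. */
void transpose_naive(double a[N][N], double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[j][i] = a[i][j];
}

/* Blocked transpose: each BLK x BLK tile is read and written once while
   cache-resident, so array elements enter the cache only once. */
void transpose_blocked(double a[N][N], double b[N][N])
{
    for (int ii = 0; ii < N; ii += BLK)
        for (int jj = 0; jj < N; jj += BLK)
            for (int i = ii; i < ii + BLK; i++)
                for (int j = jj; j < jj + BLK; j++)
                    b[j][i] = a[i][j];
}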

One approach evaluates the benefit of loop unrolling on overall performance and on the degree to which memory reference compaction (also called memory access coalescing) opportunities are uncovered. Loop unrolling transforms several iterations of a loop into a block of straight-line code. The straight-line code is then iterated fewer times to accomplish the same work as the original loop. The purpose of the transformation is to make it easier to analyze data dependencies when determining where to combine memory references. The performance benefit of loop unrolling and memory access coalescing was found to vary a great deal depending on the processor.
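A minimal sketch of the unrolling transformation and the coalescing opportunity it exposes; the code is illustrative, and whether the two adjacent loads can actually be fused into one double-word load depends on the processor.

/* Original loop: one single-word load per iteration. */
long sum_words(const int *p, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Unrolled by two: the adjacent accesses p[i] and p[i+1] now sit in one
   block of straight-line code, making it easy to see that they can be
   coalesced into a single double-word load. n is assumed even here. */
long sum_words_unrolled(const int *p, int n)
{
    long s = 0;
    for (int i = 0; i < n; i += 2) {
        int a = p[i];        /* these two loads touch adjacent words, */
        int b = p[i + 1];    /* so they are candidates for one fused  */
        s += a + b;          /* double-word (dual) load               */
    }
    return s;
}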

5.2.3.3 Instruction Selection and Ordering

The specific techniques that have been evaluated for energy savings are instruction packing, minimizing circuit state effects, and operand swapping.

Given the benefits of high-performance code with respect to energy savings,
optimizations that improve speed should be especially helpful. Code size
minimization may be necessary, especially in embedded systems, but it is not
quite as directly linked to energy savings. Regarding performance and energy
minimization, cache performance is a greater concern. Large code and a small
cache can lead to frequent cache misses with a high power penalty. Code size and
cache performance are concerns that motivate the effort to maximize code
density for embedded processors.

Instruction packing is a feature of a number of processors that permits an ALU
operation and a memory data transfer to be packed into a single instruction. Much
of the cost of each individual instruction is overhead that is not duplicated when
operations are executed in parallel. Consequently, there can be nearly a 50%
reduction in energy associated with the packing of two instructions. Similar
benefits can be achieved by parallel loading of multiple words from different
banks of memory. Optimization of memory accesses is a broad topic that
illustrates the benefit of parallel loads and instruction packing. Concurrent
execution of integer and floating-point operations represents still another
opportunity to minimize the total energy of a group of operations. All of these
techniques are really special cases of the types of coding that can be done in
VLIW (very long instruction word) and superscalar architectures where
opportunities for instruction packing will be much more extensive.

Instruction ordering for low power attempts to minimize the energy associated
with the circuit state effect. Circuit state effect is the energy dissipated as a result
of the processor switching from execution of one type of instruction to another.
On some types of processors, especially DSP, the magnitude of the circuit state
effect can vary substantially depending on the pair of instructions involved. The
techniques for measuring or estimating the circuit state effect for different pairs
of consecutive instructions are described. Given a table of circuit state costs, one
can tabulate the total circuit state cost for a sequence of instructions.

Instruction ordering then involves reordering instructions to minimize the total
circuit state cost without altering program behavior. Researchers have found that
in no case do circuit state effects outweigh the benefit of minimizing program
execution cycles. Circuit state effects are found to be much more significant for
DSPs than for general-purpose architectures.

Accumulator spilling and mode switching are two DSP behaviors that are sensitive
to instruction ordering and are likely to have a power impact. In some DSP
architectures, the accumulator is the only general-purpose register. With a single
accumulator, any time an operation writes to a new program variable, the
previous accumulator value will need to be spilled to memory, incurring the
energy cost of a memory write. In some DSPs, operations are modal. Depending
on the mode setting (such as the sign extension mode), data path operations will
be handled differently.

Operand swapping attempts to swap operands to ALU or floating-point operations
in a way that minimizes the switching activity associated with the operation. One
way operand swapping could help is to minimize switching related to the
sequence of input values. If latches are located at the inputs of a functional unit
to keep the inputs from switching during idle periods, then there could be an
advantage to swapping operands to minimize switching at the inputs to the
functional unit. For example, if two consecutive additions have one operand that
stays the same (say x + 7 and y + 7), then switching would be minimized if the
common operand was applied to the same adder input both times.

Ordering of operands can have an even greater effect if the operands to a
commutative operation are not treated symmetrically. For example, the number of
additions and subtractions performed in a Booth multiplier depends on the bit
pattern of the second operand. The number of additions and subtractions
required is referred to as the recoding weight of the second operand. Given a pair
of values to be multiplied, power should be reduced if we put the value with the
lower recoding weight into the second input of the Booth multiplier.
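A small sketch of this idea, assuming a radix-2 Booth multiplier and unsigned operands: a nonzero Booth digit is generated wherever adjacent bits of the operand differ, so the recoding weight can be computed by counting bit transitions.

#include <stdint.h>

static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { n += (int)(x & 1u); x >>= 1; }
    return n;
}

/* Radix-2 Booth recoding weight: number of additions/subtractions the
   multiplier performs for this operand. Adjacent differing bits (with an
   implicit 0 below the LSB) each generate one nonzero Booth digit. */
static int recoding_weight(uint32_t x)
{
    uint64_t v = x;
    return popcount64(v ^ (v << 1));
}

/* Order operands so the one with the lower recoding weight is fed to the
   recoded (second) multiplier input. */
void order_for_booth(uint32_t *first, uint32_t *second)
{
    if (recoding_weight(*first) < recoding_weight(*second)) {
        uint32_t t = *first;
        *first = *second;
        *second = t;
    }
}

For example, 7 (0111 in binary) has weight 2, since Booth recodes it as +8 and -1, whereas 5 (0101) has weight 4; given the product 5 × 7, the sketch would route 7 to the recoded input.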

5.2.3.4 Power Management

Most of the low-power software optimizations that we have discussed already
exploit specific hardware features to achieve maximum benefit. Many of these
features are not unique to low-power designs. However, considerable hardware
design effort has been applied to designing processors that avoid dissipation of
power in circuitry that is idle.

The degree to which software has control over power management varies from
one processor to another. It is possible for power management to be entirely
controlled by hardware, but typically software has the ability to at least initiate a
power-down mode. For event-driven applications such as user interfaces, system
activity typically comes in bursts. If the time a system is idle exceeds some
threshold, it is likely that the system will continue to be idle for an extended time.
Based on this phenomenon, a common approach is to initiate a shutdown when
the system is idle for a certain length of time. Unfortunately, the system continues
to draw power while waiting for the threshold to be exceeded. Predictive
techniques take into account the recent computation history to estimate the
expected idle time. This permits the power down to be initiated at the beginning
of the period of idle time.
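The two policies can be contrasted in a few lines. The sketch below is schematic; the thresholds and the single-number "recent history" are assumptions, and real predictive schemes use richer statistics of past busy and idle intervals.

#include <stdbool.h>

#define IDLE_TIMEOUT_MS 2000  /* fixed-timeout policy threshold (assumed) */
#define SHORT_BURST_MS    50  /* predictive policy threshold (assumed)    */

/* Timeout policy: shut down only after IDLE_TIMEOUT_MS of idleness,
   during which the system keeps drawing full power. */
bool shutdown_by_timeout(long idle_so_far_ms)
{
    return idle_so_far_ms >= IDLE_TIMEOUT_MS;
}

/* Predictive policy: a short burst of activity suggests a long idle
   period will follow, so power down at the start of the idle period. */
bool shutdown_by_prediction(long last_busy_burst_ms)
{
    return last_busy_burst_ms < SHORT_BURST_MS;
}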

Several processors support varying levels of software control over power
management. Some examples include the Fujitsu SPARClite MB86934, Hitachi
SH3, Intel486SL, and the PowerPC 603 and 604. The SPARClite has a power-down
register that masks or enables clock distribution to the SDRAM interface, DMA
module, FPU, and floating-point queues. The Hitachi SH3 provides two power
reduction modes, standby and sleep, along with control over the CPU and
peripheral clock frequencies. In the sleep mode, all operations except the real-
time clock stop. In the standby mode, the CPU core is stopped, but the peripheral
controller, bus controller, and memory refresh operations are maintained.

The Intel486SL provides a System Management Mode (SMM), a distinct operating
mode that is transparent to the operating system and user applications. The SMM
is entered by means of an asynchronous interrupt. Software for the SMM is able
to enable, disable, and switch between fast and slow clocks for the CPU and ISA
bus. The PowerPC 603 and 604 offer both dynamic and static power management
modes. Dynamic power management removes clocking from execution units if it
is not needed to support currently executing instructions. This allows for
significant savings even while the processor is fully up and running. Savings on
the order of 8-16% have been observed for some common benchmarks. No
software intervention is needed, except to put the processor into the power
management mode. The PowerPC has three static power management modes:
“doze,” “nap,” and “sleep”. They are listed here in order of increasing power
savings and increasing delay to restart the processor. The doze mode shuts off
most functional units but keeps bus snooping enabled in order to maintain data
cache coherency. The nap mode shuts off bus snooping and can set a timer to
automatically wake up the processor. It keeps a phase-locked loop running to
permit quick restart of clocking. The sleep mode shuts off the phase-locked loop.
Nap and sleep modes can each be initiated by software, but a hardware
handshake is required to protect data cache coherency.

A trade-off must be made regarding the merits of software control versus
hardware control of power management. Software-based power management has
the advantage of additional information upon which to base power management
decisions. Purely hardware-based power management has to make its decisions
by monitoring processor activity. An incorrect decision to power down can hurt
performance and power dissipation due to the cost of restoring power and system
state. Processors with integrated power management provide ways to efficiently
save and restore the processor state in the event of a power-down. Program
execution cycles committed to power management represent a downside to
software-based power management.

5.2.4 Automated Low-power Code Generation

A variety of optimizations and software design approaches that can minimize energy
consumption are described. To be used extensively, those techniques need to be
incorporated into automated software development tools. There are two levels of
tools: tools that generate code from an algorithm specification and compiler level
optimizations. We have not located any comprehensive high-level tools intended for
low-power code development, but several technologies appear to be available to
build such tools, especially for DSP applications. Graphical and textual languages
have been available for some time that enable one to specify a DSP algorithm in a
way that does not obscure the natural parallelism and data flow. HYPER-LP is a DSP
data path synthesis system that incorporates several algorithm level transformations
to uncover parallelism and minimize critical paths so that the data path supply
voltage can be minimized. Even though HYPER-LP is targeted for data path circuit
synthesis, the same types of transformations should be useful in adapting an
algorithm to exploit parallel resources in a DSP processor or multiprocessor system
(in fact HYPER has been used to evaluate algorithm choices prior to choice of
implementation platform). MASAI is a tool that reorganizes loops in order to
minimize memory transfers and size.

Compiler technology for low power appears to be further along than the high-level
tools, in part because well-understood performance optimizations are adaptable to
energy minimization. This is true more for general-purpose processors than for DSP
processors. Digital signal processing compiler technology faces difficulties in dealing
with a small register set (possibly just an accumulator), irregular data paths, and
making full use of parallel resources. The rest of this section describes two examples
of power reduction techniques that have been incorporated into compilers.

Cold scheduling is an instruction scheduling algorithm that reduces bus switching activity related to the change in state when execution moves from one instruction type to another. The algorithm is a list scheduler that prioritizes the selection of each instruction based on the power cost (bus switching activity) of placing that instruction next into the schedule. The following ordering of compilation phases has been proposed in order to incorporate cold scheduling:

1. Allocate registers.

2. Pre-assemble: Calculate target addresses, index symbol table, and transform instructions to binary.

3. Schedule instructions using cold scheduling algorithm.

4. Post-assemble: Complete the assembly process.

The assembly process was split into two phases to accommodate the cold
scheduler. Cold scheduling could not be performed prior to preassembly because
the binary codes for instructions are needed to determine the effect of an
instruction on switching activity. Scheduling could not be the last phase since
completion of the assembly process requires an ordering of instructions.

One limitation of this approach is that it becomes difficult to schedule instructions
across basic block boundaries because of the early determination of target
addresses. Cold scheduling was found to obtain a 20-30% switching activity
reduction with a performance loss of 2-4%. It is easy to imagine that the
instruction level power model could also be used to prioritize instruction selection
during cold scheduling.
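A minimal sketch of the selection step of such a list scheduler, assuming instruction encodings are already available (which is exactly why cold scheduling must follow pre-assembly): among the ready instructions, pick the one whose binary code toggles the fewest bus lines relative to the previously emitted instruction.

#include <stdint.h>

static int hamming32(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    int n = 0;
    while (x) { n += (int)(x & 1u); x >>= 1; }
    return n;
}

/* Return the index (into ready[]) of the ready instruction with the
   lowest bus switching cost against prev_code. Requires n_ready >= 1;
   a full scheduler would break ties using the critical path. */
int cold_pick(const uint32_t *ready, int n_ready, uint32_t prev_code)
{
    int best = 0;
    int best_cost = hamming32(ready[0], prev_code);
    for (int i = 1; i < n_ready; i++) {
        int cost = hamming32(ready[i], prev_code);
        if (cost < best_cost) {
            best_cost = cost;
            best = i;
        }
    }
    return best;
}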

A code generation and optimization methodology has been proposed that encompasses several of the techniques discussed earlier. The proposed sequence of phases is as follows:

1. Allocate registers and select instructions.

2. Build a data flow graph (DFG) for each basic block.

3. Optimize memory bank assignments by simulated annealing.

4. Perform as soon as possible (ASAP) packing of instructions.

5. Perform list scheduling of instructions (similar to cold scheduling).

6. Swap instruction operands where beneficial.

The optimization phases appear to be sequenced in order from greatest to least
incremental benefit. Overall energy savings on four benchmarks ranged from 26
to 73% compared to results that did not incorporate instruction packing or
optimize memory bank assignments.

5.2.5 Codesign For Low Power

The emphasis of this unit has been on software design techniques for low power.
However, it is probably obvious by now that many of these techniques are
dependent in some way upon the hardware in order for a power savings to be
realized. Hardware/software codesign for low power is a more formal term for this
problem of crafting a mixture of hardware and software that provides required
functionality, minimizes power consumption, and satisfies objectives that could
include latency, throughput, and area. Instruction set design and implementation
seems to be one of the most well defined of the codesign problems. In this section,
we will first look at instruction set design techniques and their relationship to low
power design. Another variation of the codesign problem is to use a processor for
which some portion of the interconnect and logic is reconfigurable. Reconfigurable
processors allow much of the codesign problem to be delayed until the software
design and compilation phase.

For a nonreconfigurable processor, there is much less opportunity to optimize the
hardware once the processor design is fixed. Finally, we will survey the
hardware/software trade-offs that have come up in our consideration of power
optimization of software.

5.2.5.1 Instruction Set Design

The PEAS-I system (Practical Environment for ASIP development-version I) is a
system that synthesizes an HDL (hardware description language) description of a
CPU core along with a C compiler, assembler, and simulator for that CPU. Inputs
to PEAS-I are a set of design constraints (including chip area and power
consumption), a hardware module database, and a sample application program
and program dataset. PEAS-I optimizes the instruction set and instruction
implementations for the application program and dataset that was given.

PEAS-I defines an extended set of instructions by starting with a core instruction
set that would be needed to implement any C program. The core set is
augmented with instructions corresponding to any C operators not already
represented by a single instruction. The set could be further expanded to include
any predefined or user-defined functions. For each possible instruction, alternative
implementations are defined, including hardware, microprogram, and software
options. Optimal selection of implementation methods for each instruction is
defined as an integer program and solved by a branch-and-bound algorithm. The
power and area contributions of each instruction implementation are estimated and summed for inclusion in the power and area constraints. The number of
execution cycles of each instruction is multiplied by the frequency of occurrence in
the sample application and added together to form the objective function for the
optimization.

Huang and Despain’s approach is similar to PEAS-I in that it optimizes the
instruction set for sample applications. The most significant difference is that this
approach groups micro-operations (MOPS) together to form higher level
instructions. Each benchmark is expressed as a set of MOPS with dependencies to
be satisfied. MOPS are merged together as a byproduct of the scheduling process.
MOPS that are scheduled to the same clock cycle are combined. For any candidate
schedule and instruction set, instruction width (bits), instruction set size, and
hardware resource requirements are constrained. The scheduling problem is
solved by simulated annealing with an objective of minimizing execution cycles
and instruction set size. Either approach relies on instruction level application
profiles to prioritize decisions regarding the instruction set. The challenging task
would be determining appropriate instruction level power estimates and circuit
state effects for an unsynthesized CPU. Op-code selection is another design
decision that could be optimized for low power, considering the impact of the op-
codes on circuit state effects.

5.2.5.2 Reconfigurable Computing

“Reconfigurable computing” is a growing area of research in processor design that
could motivate radical new requirements for software design tools. It also
presents new opportunities and challenges for low-power systems design. In
reconfigurable processors, some portion of the interconnect and logic can be
modified at run time. This should allow the processor to be closely optimized to
each of a wide variety of applications. Reconfiguration can be implemented at any
of several levels of granularity, ranging from the gate level up to architectural
level changes. The reconfigurable components are commonly implemented using
field programmable gate arrays. The portion of the processor that is made reconfigurable can also vary widely.

The reconfigurable portion can be limited to a coprocessor or it can encompass a
large portion of the processor architecture.

Software and compiler design for reconfigurable architectures is an open area of research. Software design methodologies will vary depending on the architecture. Nevertheless, many of the hardware/software trade-offs we have
discussed should still be relevant. The biggest change is that the software
designer will have a much greater opportunity to tailor the software and processor
to fit each other even after the hardware design is fixed.

9. ASSIGNMENTS

Assignments (for higher level learning and evaluation; examples: case study, comprehensive design, etc.)

1. The link below describes software power estimation techniques. Listen to this video lecture and submit what you understand from it in about 1000 words. (CO6, K3)
https://www.youtube.com/watch?v=dqcfYTePRxQ

2. The link below discusses FPGA power optimization. Write about 1000 words based on this link. (CO6, K3)
https://www.youtube.com/watch?v=FGfjIvbyU40

10. Part A Q & A (with K level and CO)

1. What are two algorithm level techniques used for improvement in power dissipation? (CO5, K2)
The two algorithm level techniques try to reduce computation to achieve low power. The first technique uses differential coefficient representation to reduce the dynamic range of computation, while the other technique optimizes the number of 1's in the coefficient representation to reduce the number of additions (or switching activity).

2. What is meant by the differential coefficients method? (CO5, K2)
The differential coefficients method (DCM) is an algorithm level technique for the realization of low-power FIR filters with a large number of taps (N of the order of hundreds). The DCM relies on reducing computations to reduce power. The algorithms for the DCM use various orders of differences between the coefficients in conjunction with stored precomputed results, rather than the coefficients themselves, to compute the canonical form convolution.

3. What is the limitation of the differential coefficients method? (CO5, K2)
One limitation of DCM is that it can be applied only to systems where the envelope generated by the coefficient sequence (and various orders of differences) is a smoothly varying continuous function; thus it was beneficial largely for low-pass FIR filters.

4. What is the sorted recursive differences (SRD) method?
The sorted recursive differences (SRD) method is an improved version of DCM which uses recursive sorting of coefficients and various orders of differences to maximize the computational reduction. This recursive sorting is made possible by the transposed direct form of FIR output computation. Thus there are no restrictions on the coefficient sequence (or the sequences of various orders of differences) to which it is applicable.


5. What is the idea used in the Power-Constrained Least-Squares Optimization technique for Adaptive and Non-adaptive Filters? (CO5, K2)
The power-constrained least-squares (PCLS) technique can be applied to both adaptive and non-adaptive digital filters. The basic idea in this technique is to reduce the number of 1's in the representation of the coefficients such that the number of additions in a multiplier can be reduced, thus achieving low-power dissipation. However, due to changes in the coefficient representation, the performance of the filter can change. Hence, changes in the filter coefficients can be allowed within a range such that the change in performance is within the tolerance limit.

6. List the practices used to minimize memory bandwidth requirements. (CO6, K2)
➢ Register allocation to minimize external memory references.
➢ Cache blocking, that is, transformations on array computations so that blocks of array elements only have to be read into cache once.
➢ Register blocking, similar to cache blocking, except that redundant register loading of array elements is eliminated.
➢ Recurrence detection and optimization, that is, use of registers for values that are carried over from one level of recursion to the next.
➢ Compaction of multiple memory references into a single reference.

7. What are the objectives of power minimization techniques related to memory? (CO6, K2)
➢ Minimize the number of memory accesses required by an algorithm.
➢ Minimize the total memory required by an algorithm.
➢ Put memory accesses as close as possible to the processor. Choose registers first, cache next, and external RAM last.
➢ Make the most efficient use of the available memory bandwidth; for example, use multiple-word parallel loads instead of single-word loads as much as possible.


8. How are software power optimization techniques categorized? (CO6, K2)
Software optimizations for minimum power or energy tend to fall into one or more of the following categories: selection of the least expensive instructions or instruction sequences, minimizing the frequency or cost of memory accesses, and exploiting power minimization features of hardware.

9. What are the techniques used for energy savings under software optimization? (CO6, K2)
The specific techniques that have been evaluated for energy savings are instruction packing, minimizing circuit state effects, and operand swapping.

10. What is power optimization using operation reduction, and what is its impact? (CO5, K2)
The easiest way to reduce the weighted capacitance being switched is to reduce the number of operations performed in the data flow graph. Reducing the operation count reduces the total capacitance associated with the system and, hence, can reduce the power dissipation. However, reduction in the number of operations can have an adverse effect on critical paths.

11. A 32-bit off-chip bus operating at 5 V and a 66 MHz clock rate is driving a capacitance of 25 pF/bit. Each bit is estimated to have a toggling probability of 0.25 at each clock cycle. What is the power dissipation in operating the bus? (CO5, K2)
Solution:
C = 32 × 25 pF = 800 pF
V = 5.0 V
f = 0.25 × 66 MHz = 16.5 MHz
P = C × V² × f = 800 pF × (5.0 V)² × 16.5 MHz = 330 mW


12. What are the architecture level design decisions for low power operation? (CO5, K2)
Typical architecture level design decisions involve the selection and organization of the functional macros of the design: the choice of clocking strategy, parallelism, pipelining, component organization, etc.

13. What is instruction packing? (CO6, K2)
Instruction packing is a feature of a number of processors that permits an ALU operation and a memory data transfer to be packed into a single instruction. Much of the cost of each individual instruction is overhead that is not duplicated when operations are executed in parallel. Consequently, there can be nearly a 50% reduction in energy associated with the packing of two instructions.

14. What is meant by operand swapping? (CO6, K2)
Operand swapping attempts to swap operands to ALU or floating-point operations in a way that minimizes the switching activity associated with the operation. One way operand swapping could help is to minimize switching related to the sequence of input values. If latches are located at the inputs of a functional unit to keep the inputs from switching during idle periods, then there could be an advantage to swapping operands to minimize switching at the inputs to the functional unit.

15. What is reconfigurable computing? (CO6, K2)
“Reconfigurable computing” is a growing area of research in processor design that could motivate radical new requirements for software design tools. It also presents new opportunities and challenges for low-power systems design. In reconfigurable processors, some portion of the interconnect and logic can be modified at run time. This should allow the processor to be closely optimized to each of a wide variety of applications. Reconfiguration can be implemented at any of several levels of granularity, ranging from the gate level up to architectural level changes. The reconfigurable components are commonly implemented using field programmable gate arrays.

74
11. PART B Questions

1. Explain about algorithm level techniques for improvement in power dissipation. (CO5, K3)

2. Elaborate on the use of pipelining and parallelism for low power. (CO5, K2)

3. Explain the various methods used to minimize power based on software power optimization. (CO6, K2)

4. Discuss the principle of pre-computation logic for reducing power with a suitable example. (CO5, K2)

5. Describe the concept of Automated Low-power Code Generation in detail. (CO6, K2)

6. Explain about power management techniques proposed for software power optimizations. (CO6, K2)

7. Discuss about Software Power Estimation techniques used in optimizing software for low power. (CO6, K2)

8. Explain about Architecture driven voltage scaling. (CO5, K2)

9. Describe how Power-Constrained Least-Squares Optimization techniques are applied for Adaptive and Non-adaptive Filters to reduce power. (CO5, K2)

10. Write short notes on: i) Sorted Recursive Differences ii) Transposed direct form computation. (CO3, K2)

12. Supportive online Certification courses (NPTEL,
Swayam, Coursera, Udemy, etc.,)

Design and Analysis of VLSI Subsystems - NPTEL


https://onlinecourses.nptel.ac.in/noc23_ee40/preview

VLSI Physical Design - NPTEL


https://onlinecourses.nptel.ac.in/noc21_cs12/preview

13. Real time Applications in day to day life and to
Industry

Real time Application Videos:

https://www.cadence.com/en_US/home/solutions/low-
power-solution.html

https://www.synopsys.com/implementation-and-
signoff/signal-and-power-integrity.html

14. Assessment Schedule

ASSESSMENT    PROPOSED DATE    ACTUAL DATE
UNIT TEST 1
FIAT
SIAT
MODEL

15. Prescribed Text Books & Reference Books

TEXT BOOKS:

1. Kaushik Roy and S.C. Prasad, Low Power CMOS VLSI Circuit Design, John Wiley & Sons, 2013.

2. Dimitrios Soudris, Christian Piguet, Costas Goutis, Designing CMOS Circuits for Low Power, Springer, 2011.

REFERENCES:

1. A.P. Chandrakasan and R.W. Brodersen, Low Power Digital CMOS Design, Springer US, 2012.

2. Gary Yeap, Practical Low Power Digital VLSI Design, Springer US, 2012.

3. Abdellatif Bellaouar, Mohamed I. Elmasry, Low Power Digital VLSI Design: Circuits and Systems, Springer Verlag, 2012.

4. James B. Kuo, Shih-Chia Lin, Low Voltage SOI CMOS VLSI Devices and Circuits, John Wiley & Sons, 2011.

5. Steven M. Rubin, Computer Aids for VLSI Design, 3rd edition, R.L. Ranch Press, 2012.

NPTEL LINK:

https://archive.nptel.ac.in/courses/106/105/106105034/#

16. Mini Project Suggestions

Design and implement the following using low power tools:
1. Ripple Carry Adder
2. Carry Select Adder
3. Multiplier
4. Arithmetic Logic Unit
5. High Speed Adders

Thank you
