Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
Department: ECE
UNIT 5

Contents

1. Course Objectives
2. Pre-Requisites
3. Syllabus
4. Course Outcomes
5. CO-PO/PSO Mapping
6. Lecture Plan
7. Activity Based Learning
8. Lecture Notes
9. Assignments
10. Part A Q & A
11. Part B Questions
1. Course Objectives
❖ To identify sources of power in an IC.
❖ To identify the power reduction techniques based on technology
independent and technology dependent methods
❖ To identify suitable techniques to reduce the power dissipation
❖ To estimate power dissipation of various MOS logic circuits
❖ To develop algorithms for low power dissipation
2. Pre-Requisites

21EC503 VLSI Design (Semester V)
21EC202 Electronic Devices (Semester II)
3. SYLLABUS

UNIT I POWER DISSIPATION IN CMOS (9 periods)
TOTAL: 45 PERIODS
4. Course Outcomes

Course Outcome | Highest Cognitive Level
CO1: To know the sources of power consumption in CMOS circuits | K3
5. CO-PO/PSO Mapping

Course Outcome (K Level) | PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 | PSO1 PSO2 PSO3
CO1 (K3) | 3 2 2 2 2 1 1 1 1 2 1 1 | 1 1 1
CO2 (K3) | 3 2 2 1 2 1 1 1 1 1 1 1 | 1 2 1
CO3 (K3) | 3 2 2 2 2 2 1 1 2 1 1 1 | 1 1 1
CO4 (K3) | 3 2 2 1 2 1 1 1 2 1 1 1 | 1 1 1
CO5 (K3) | 3 2 2 1 2 1 2 1 1 1 1 1 | 1 1 1
CO6 (K3) | 3 2 2 2 3 1 1 1 1 1 1 1 | 1 2 1
6. Lecture Plan

S.No. | Topics to be covered | Proposed Date | Actual Lecture Date | CO | Taxonomy Level | Mode of Delivery
1 | Synthesis for Low Power: Behavioral Level Transforms | - | - | CO5 | K3 | PPT
2 | Algorithm Level Transforms for Low Power Circuits | - | - | CO5 | K3 | PPT
7. Activity Based Learning

CASE STUDY
Watch these videos carefully, then write 500 words on what you learned from them.
https://www.xilinx.com/video/application/vivado-design-suite-ultrascale-architecture.html
https://www.xilinx.com/video/soc/embedded-vision-and-control-solutions-with-the-zynq-mpsoc.html
8. Lecture Notes
UNIT V SYNTHESIS AND SOFTWARE DESIGN FOR LOW-POWER CMOS CIRCUITS
SYNTHESIS FOR LOW POWER
Tools have been developed to carry out each step of the synthesis process automatically. In the beginning, synthesis tools attempted to reduce area alone and did not consider delay. Next came improved algorithms that, in addition to reducing area, ensured that the maximum delay through the circuit did not increase, or at least remained within a specified maximum bound. With the exponential growth in the number of gates that can be accommodated on a chip continuing unabated, area has become less of a concern; performance and power dissipation have increasingly taken its place as the primary optimization targets. As we will see later, a large fraction of the total research effort to reduce power dissipation has been justifiably devoted to lowering the supply voltage, in conjunction with techniques to compensate for the accompanying loss of performance. However, independent of and in addition to supply voltage scaling, it is possible to make logic and circuit level transformations that achieve considerable improvement in power dissipation. These transformations try to minimize dynamic power dissipation based on the switching activity of the signals at circuit nodes.
Switching activity at circuit nodes is related to input signal probabilities and activities, which can be obtained by system level simulation of the function with real-life inputs. Hence, circuits having the same functionality but operating in different environments, and therefore having different input signal probabilities and activities, are synthesized differently to improve power dissipation. In this unit we focus on the reduction of dynamic power dissipation.
The recent trend has been to consider power dissipation at all phases of the design cycle. The design space can be explored very efficiently starting at the behavioral and/or algorithmic level. Architectural transforms and tradeoffs can be conveniently applied at that level to optimize power dissipation. However, efficient means of accurately estimating power dissipation at the behavioral level are required so that meaningful transforms can be applied.
Because of the many degrees of freedom available, the behavioral level has the potential of producing a large improvement in power dissipation. At this level, since operations have not yet been assigned execution times and hardware allocation has not yet been performed, a design point with given power dissipation, delay, and area can be found if one exists in the design space at all. Hence, there is a great need to explore the design space systematically. Traditionally, behavioral synthesis has targeted optimization of the amount of hardware resources required and of the average number of clock cycles per task needed to perform a given set of tasks.
It has been observed that large improvements in power dissipation are possible at
higher levels of design abstraction while pure combinational logic synthesis can
only produce a moderate level of improvement. This is mainly due to the fact that
the flexibility of making changes to random logic is rather limited. However,
technology combined with innovative circuit design techniques can also produce a
large improvement in power dissipation.
Let us consider two algorithm level techniques for improving the power dissipation of digital filters; both try to reduce computation to achieve low power. The first technique uses a differential coefficient representation to reduce the dynamic range of the computation, while the other optimizes the number of 1's in the coefficient representation to reduce the number of additions (and hence the switching activity). Other techniques use multiplier-less implementations for low power and high performance. Since digital signal processing techniques are very well characterized mathematically, algorithm level techniques can be applied readily.
One of the most basic operations performed in DSP applications is the finite impulse response (FIR) computation. As is well known, the output of a linear time-invariant (LTI) system can be obtained by convolving, in the time domain, the input to the system with the transfer function of the system. For discrete-time LTI-FIR systems, this can be expressed mathematically as

$Y_j = \sum_{k=0}^{N-1} C_k X_{j-k}$ ------ (5.1)

The parameters X and Y are the discrete-time input and output signals, respectively, represented as sequences. The sequence C represents the transfer function of the system. For FIR filters, C also corresponds to the set of filter coefficients, and the length N of the sequence is called the number of taps or the length of the filter.
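As a concrete reference point for the algorithms discussed below, here is a minimal Python sketch (not part of the original text) of the direct-form computation of Eq. (5.1); the sample values are invented.

```python
def fir_direct(x, c):
    """Direct-form FIR: y[j] = sum_k c[k] * x[j-k], Eq. (5.1).
    x: input sample sequence; c: the N filter coefficients.
    Samples before the start of x are taken as zero."""
    N = len(c)
    y = []
    for j in range(len(x)):
        acc = 0
        for k in range(N):
            if j - k >= 0:              # x[j-k] = 0 for j - k < 0
                acc += c[k] * x[j - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y

print(fir_direct([1, 2, 3, 4], [0.5, 0.25]))  # [0.5, 1.25, 2.0, 2.75]
```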
The differential coefficients method (DCM) is an algorithm level technique for realizing low-power FIR filters with a large number of taps (N of the order of hundreds). The DCM relies on reducing computation to reduce power. The computation of the convolution in the canonical form for the FIR filter output as given by Eq. (5.1), using the multiply-and-accumulate sequence defined earlier (computing the product of each coefficient with the appropriate input data and accumulating the products), will be termed direct-form computation. The algorithms for the DCM use various orders of differences between the coefficients (precisely defined later on), in conjunction with stored precomputed results, rather than the coefficients themselves, to compute the canonical form convolution. These algorithms result in fewer computations per convolution as compared to direct-form computation. However, they require more storage and storage accesses and hence more energy for storage operations. The net energy saving using the DCM depends on various parameters, such as the order of differences used, the energy dissipated in a storage access, and the word widths used for the digitized input data and coefficients. The DCM can also lead to a reduction in the time needed to compute each convolution, and thus one may obtain the added advantage of higher speed of computation. Analogous to the savings in energy, the speed enhancement obtained depends on the order of differences used and various other parameters.
$Y_j = C_0 X_j + C_1 X_{j-1} + C_2 X_{j-2} + \cdots + C_{N-1} X_{j-N+1}$ ------ (5.2)
$Y_{j+1} = C_0 X_{j+1} + C_1 X_j + C_2 X_{j-1} + \cdots + C_{N-1} X_{j-N+2}$ ------ (5.3)
$Y_{j+2} = C_0 X_{j+2} + C_1 X_{j+1} + C_2 X_j + \cdots + C_{N-1} X_{j-N+3}$ ------ (5.4)
Notice that each input sample is multiplied by every coefficient exactly once, appearing in turn as a product term in the sums for successive outputs. Therefore, excepting the first, each product term in the sum for $Y_{j+1}$ can be written as the following identity:

$C_k X_{j-k+1} = C_{k-1} X_{j-k+1} + (C_k - C_{k-1}) X_{j-k+1}, \quad k = 1, \dots, N-1$ ------ (5.5)

Since the $C_{k-1} X_{j-k+1}$ terms in identity (5.5) above have already been computed for the previous output $Y_j$, one needs to compute only the $(C_k - C_{k-1}) X_{j-k+1}$ terms and add them to the already computed $C_{k-1} X_{j-k+1}$ terms. The first product term in the sum for $Y_{j+1}$, which is $C_0 X_{j+1}$, has to be computed without recourse to this scheme. Hence, all the N product terms are now available for generating $Y_{j+1}$.
Summarizing the above, one can say that, excepting $C_0$, each coefficient can be expressed as the sum of the preceding coefficient and the difference between it and the preceding coefficient. Therefore, each coefficient can now be expressed via the recurrence relation

$C_k = C_{k-1} + \delta^1_{k-1/k}, \quad k = 1, \dots, N-1$ ------ (5.6)

where $\delta^1_{k-1/k} = C_k - C_{k-1}$ is termed the first-order difference between the consecutive coefficients $C_{k-1}$ and $C_k$ for k in the specified range. Substituting in Eq. (5.1) the expression for $C_k$ from Eq. (5.6), we get all the product terms $P_k\big|_{t=j+1}$, excepting the first, for computing the FIR output $Y_{j+1}$. Therefore the $P_k\big|_{t=j+1}$ can be written as

$P_k\big|_{t=j+1} = C_{k-1} X_{j-k+1} + \delta^1_{k-1/k} X_{j-k+1}, \quad k = 1, \dots, N-1$ ------ (5.7)
The left-hand side of the above equation is called a product term in the convolution given by Eq. (5.1), whereas the terminology partial product will be used for the term $\delta^1_{k-1/k} X_{j-k+1}$ in the above equation. Both the partial product and the term $C_{k-1} X_{j-k+1}$ in the above equation will be called intermediate results used in the computation of the product term. The range of the subscripts of the first intermediate result in Eq. (5.7) can be changed to give an equivalent range as follows:

$\{C_{k-1} X_{j-k+1},\ k = 1, \dots, N-1\} = \{C_k X_{j-k},\ k = 0, \dots, N-2\}$ ------ (5.8)

These terms are therefore identical to the first $N-1$ product terms of the FIR output at the immediately preceding instant of time, that is, at time $t = j$, which are $P_k\big|_{t=j} = C_k X_{j-k}$ for $k = 0, \dots, N-2$. Therefore, if they are stored for reuse, one can compute all the product terms $P_k\big|_{t=j+1}$, excepting the first, for the FIR output at time $t = j+1$ by only computing the partial products $\delta^1_{k-1/k} X_{j-k+1}$ in Eq. (5.7) above and adding them to the appropriate stored product terms $P_k\big|_{t=j}$ computed for the FIR output at time $t = j$.
Thus we see that only one intermediate result storage variable and one extra addition per product term are needed. Furthermore, since the first coefficient $C_0$ along with the $N-1$ first-order differences $\delta^1_{k-1/k}$ are used for computing the FIR output, instead of storing all the coefficients one needs to store only $C_0$ and the $\delta^1_{k-1/k}$'s. The above algorithm is called the first-order differences algorithm for generating the FIR filter output. The additional storage accesses and additions incurred using this algorithm will be termed overheads.
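A minimal Python sketch of the first-order differences algorithm (illustrative only; variable names are ours). P[] holds the product terms of the previous output, so each new term needs one multiply by a difference plus one extra add, per Eq. (5.7):

```python
def fir_first_order_dcm(x, c):
    """First-order differential coefficients method (DCM), a sketch.

    P[k] holds the k-th product term of the previous output, so each new
    product term needs one multiply by a first-order difference plus one
    extra add (Eq. 5.7), instead of a full-width multiply.
    """
    N = len(c)
    d1 = [c[k] - c[k - 1] for k in range(1, N)]  # first-order differences
    P = [0] * N
    y = []
    for j in range(len(x)):
        newP = [0] * N
        newP[0] = c[0] * x[j]          # first term: no recourse to the scheme
        for k in range(1, N):
            xv = x[j - k] if j - k >= 0 else 0
            # P[k-1] already equals c[k-1] * x[j-k]; add the delta product
            newP[k] = P[k - 1] + d1[k - 1] * xv
        P = newP
        y.append(sum(P))
    return y

print(fir_first_order_dcm([1, 2, 3, 4], [0.5, 0.25]))  # [0.5, 1.25, 2.0, 2.75]
```

For the same inputs this reproduces the direct-form output, while every multiplication except the first uses a (typically narrower) difference as one operand.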
If the differences $\delta^1_{k-1/k}$ are small compared to the coefficients $C_k$, then in the multiplication for computing a product term $P_k\big|_{t=j+1}$ using this algorithm, one is trading a multiplication by a wide operand (with $X_{j-k}$ as the multiplicand) for a multiplication by a narrow operand plus the overheads. The advantage of using this algorithm lies in the fact that if the computational cost (in terms of energy or delay) of the former is greater than the net computational cost of the latter, then we achieve a net saving in computation as compared to using the direct form.
Higher orders of differences can also be used for expressing the coefficients as a recurrence relation. Using first- and second-order differences only, we can express the coefficients as follows. The second-order difference is defined as

$\delta^2_{k-2/k} = \delta^1_{k-1/k} - \delta^1_{k-2/k-1}, \quad k = 2, \dots, N-1$ ------ (5.9)

so that

$\delta^1_{k-1/k} = \delta^1_{k-2/k-1} + \delta^2_{k-2/k}$ ------ (5.10)
Hence, except $C_0$ and $C_1$, all other coefficients can be expressed compactly as

$C_k = C_{k-1} + \delta^1_{k-2/k-1} + \delta^2_{k-2/k}, \quad k = 2, \dots, N-1$ ------ (5.11)

which follows from substituting in Eq. (5.6) the expression for $\delta^1_{k-1/k}$ from Eq. (5.10).
Multiplying both sides of Eq. (5.11) above by $X_{j-k+1}$, we obtain the product terms

$P_k\big|_{t=j+1} = C_{k-1} X_{j-k+1} + \delta^1_{k-2/k-1} X_{j-k+1} + \delta^2_{k-2/k} X_{j-k+1}$ ------ (5.12)

Following the terminology used for the first-order algorithm, let us call the last two terms on the right-hand side of the above equation partial products, and all three terms on the right-hand side intermediate results in the computation of the product term $P_k\big|_{t=j+1}$. The computation of the FIR output $Y_{j+1}$ by using the relationship in Eq. (5.12) will be termed the second-order differences algorithm.
One can take advantage of the above recurrence relation to compute any product term in the convolution incrementally, using the minimum required storage for intermediate results, as follows. We show below that one needs just two extra storage variables and two extra additions per product term for computing the FIR output using the second-order differences algorithm.

Let D[k] and P[k] be the two storage variables used for intermediate results for computing the kth product term of the FIR output at time $t = j$. Since there are N product terms in the output, we will use two array variables for storage, both of array size N (D[k] and P[k], with $k = 0, \dots, N-1$). Here, D[k] and P[k] will be used for storing the partial product $\delta^1_{k-2/k-1} X_{j-k+1}$ and the intermediate result $C_{k-1} X_{j-k+1}$, respectively, as computed using Eq. (5.12), both of which will be intermediate results for the FIR output at the next time step, that is, at time $t = j+2$.
Of course, in addition to D[k] and P[k], one has to store $C_0$, $\delta^1_{0/1}$, and the $N-2$ second-order differences.

Let us begin at time $t = j$, with the variables D[0] and P[0] both initialized to zero. The first product term $P_1\big|_{t=j} = C_0 X_j$ of the FIR output at time $t = j$ has to be computed directly and is stored in P[0], with the contents of D[0] remaining unchanged at zero:

$P[0] \leftarrow C_0 X_j, \qquad D[0] \leftarrow 0$

At the next time step, at time $t = j+1$, D[0] and P[0] are used to compute the second product term $P_2\big|_{t=j+1}$ of the FIR output at time $t = j+1$, which is $C_1 X_j$, as follows:

$D[0] \leftarrow D[0] + \delta^1_{0/1} X_j, \qquad P[0] \leftarrow P[0] + D[0]$

Thus at the end of this computation D[0] contains $\delta^1_{0/1} X_j$ and P[0] contains $C_1 X_j$. At the following time step, at time $t = j+2$, D[0] and P[0] are used to compute the third product term $P_3\big|_{t=j+2}$ of the FIR output at time $t = j+2$, which is $C_2 X_j$, as follows:

$D[0] \leftarrow D[0] + \delta^2_{0/2} X_j, \qquad P[0] \leftarrow P[0] + D[0]$

Thus at the end of this computation D[0] contains $\delta^1_{1/2} X_j$ and P[0] contains $C_2 X_j$.
This process would be continued and D[0] and P[0] would accumulate results for N
time steps, after which they would be reset to zero, and the process would start all
over again.
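The walk-through above can be checked numerically. A small sketch (not from the original text) with invented coefficient values, tracing one (D, P) pair, which mirrors D[0] and P[0], through the three time steps described:

```python
# One (D, P) pair traced through three time steps; coefficient values invented.
c  = [3.0, 5.0, 6.0]                  # C0, C1, C2
d1 = [c[1] - c[0], c[2] - c[1]]       # delta1_{0/1}, delta1_{1/2}
d2 = [d1[1] - d1[0]]                  # delta2_{0/2}
xj = 2.0                              # the input sample X_j

D, P = 0.0, 0.0
P = c[0] * xj                         # t = j   : first term computed directly
D += d1[0] * xj; P += D               # t = j+1 : P now holds C1 * Xj
assert P == c[1] * xj
D += d2[0] * xj; P += D               # t = j+2 : P now holds C2 * Xj
assert P == c[2] * xj and D == d1[1] * xj
```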
The remaining $N-1$ pairs of D and P variables go through the same process, except that the initialization times of the N pairs are all distinct. At any instant of time one and only one pair is initialized, and at the next instant of time the next sequential (modulo N) pair is initialized.
Thus we see that using this technique only two additional variables per product term
are required to store the intermediate results. Since we have N product terms, we need
a total of 2N extra storage variables, as compared to direct-form computation. Two
more additions per product term (as compared to the direct form), one each to update
D[k] and P[k] , are also needed.
We can now generalize the algorithm to use differences of up to mth order, with the mth-order difference defined as

$\delta^m_{k-m/k} = \delta^{m-1}_{k-m+1/k} - \delta^{m-1}_{k-m/k-1}, \quad k = m, \dots, N-1$ ------ (5.13)

We can obtain the recurrence relationship between the coefficients (Figure 5.1) using mth-order differences as

$C_k = C_{k-1} + \delta^1_{k-2/k-1} + \delta^2_{k-3/k-1} + \cdots + \delta^{m-1}_{k-m/k-1} + \delta^m_{k-m/k}$ ------ (5.14)

which follows by substituting in Eq. (5.6) the definition of the first-order difference in terms of the second-order difference as given by Eq. (5.10), then substituting for the second-order difference in terms of the third-order difference as defined by Eq. (5.13) above with m = 3, and so on, recursively, up to the mth-order difference. We can again compute each product term incrementally as shown before for the second-order differences algorithm. However, we now need to store m intermediate results for each product term, stored using the same technique as for the second-order differences algorithm in the m array variables $D_1[k], D_2[k], D_3[k], \dots, D_{m-1}[k], P[k]$. Each array is of size N. Therefore we need a total of mN storage variables in addition to the storage requirements of the direct form. We also need m additions per product term to update the intermediate result storage variables. Therefore a total of mN more additions per convolution are needed as compared to the direct form.
5.1.1.5 Negative Differences

Since $P_k\big|_{t=j+1} = P_{k-1}\big|_{t=j} + \delta^1_{k-1/k} X_{j-k+1}$ for each product term in the sum for $Y_{j+1}$, we can get the absolute value of the partial product by computing the product $|\delta^1_{k-1/k}| \, |X_{j-k+1}|$. We can then add it to or subtract it from the term $P_{k-1}\big|_{t=j}$, according to the sign of the difference, to obtain $P_k\big|_{t=j+1}$. We have no control over the sign of the partial product anyway (irrespective of the sign of the difference $\delta^1_{k-1/k}$), since it depends on the sign of $X_{j-k+1}$. This technique can also be used for algorithms using higher orders of differences.
One limitation of DCM is that it can be applied only to systems where the envelope
generated by the coefficient sequence (and various orders of differences) is a
smoothly varying continuous function; thus it was beneficial largely for low-pass FIR
filters. We present an improved version of DCM, called the sorted recursive
differences (SRD) method, which uses recursive sorting of coefficients and various
orders of differences to maximize the computational reduction. This recursive sorting
is made possible by the transposed direct form of FIR output computation. Thus
there are no restrictions on the coefficient sequence to which it is applicable (or the
sequences of various orders of differences). The effective word length reduction
using the DCM was not the same for each coefficient. Instead of pessimistically
using the worst case reduction as mentioned earlier, one can use a simple statistical
estimate for the effective reduction in the number of 1’s in the coefficients.
5.1.1.7 Transposed Direct-Form Computation
In transposed direct form (TDF) FIR computation, all N product terms involving a particular input sample are computed before any product term involving the next sequential sample is computed. The product terms are accumulated in N different registers, since they belong to N sequential outputs. The effective throughput remains the same as in direct-form computation. Signal flow graphs for direct-form and TDF computation are shown in Figures 5.2 and 5.3, respectively.
Figure 5.2 Signal flow graph for direct form realization of an even-length FIR system
Figure 5.3 Signal flow graph for transposed direct form realization of the same FIR
system as in Figure 5.2.
One advantage of TDF computation lies in the fact that it does not matter in which order we compute the product terms involving a particular input sample, so long as we accumulate them in the right registers. Therefore we can sort the coefficients in non-decreasing (or non-increasing) order before taking first-order differences. It can be shown that this ordering minimizes the sum of the absolute values of the first-order differences.
Sorting can also be applied to differences. Thus we can generate the second-order
differences from the sorted set of first-order differences. This could be recursively
done up to any order of differences. The various permutations of the sets of
different orders of differences needed for a correct restoration could be hardwired in
the control unit. This would enable it to accumulate partial products in appropriate
locations so that the correct output is produced. Thus if the DCM is applied in a TDF
realization with recursive sorting, we can eliminate the restrictions that were earlier
imposed on the coefficients and various orders of differences for the DCM to be
viable.
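A small numeric sketch (not from the original text; coefficient values invented) of why sorting helps: the sum of absolute first-order differences collapses to max minus min once the sequence is sorted, and the sorting can be applied recursively to the differences themselves:

```python
coeffs = [7, -3, 12, 2, -8, 5]        # invented coefficient values

def abs_diff_sum(seq):
    return sum(abs(b - a) for a, b in zip(seq, seq[1:]))

print(abs_diff_sum(coeffs))           # 58 : unsorted first-order differences
print(abs_diff_sum(sorted(coeffs)))   # 20 : sorted, i.e. just max - min

# Recursive sorting: sort the first-order differences of the sorted
# coefficients before taking second-order differences, and so on.
s  = sorted(coeffs)
d1 = sorted(b - a for a, b in zip(s, s[1:]))
d2 = [b - a for a, b in zip(d1, d1[1:])]
print(d1, d2)                         # [2, 3, 5, 5, 5] [1, 2, 0, 0]
```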
Whereas sorting guarantees that differences of all orders will be nonnegative, the
coefficients can be positive or negative. When two consecutive coefficients in the
sorted sequence are of opposite signs, the magnitude of the algebraic difference
between them is larger than the magnitude of either one. To decrease the range of
the coefficients (hence smaller differences), one can use absolute values to
compute differences and then manipulate the sign.
The power constrained least-squares (PCLS) technique can be applied to both adaptive and non-adaptive digital filters. The basic idea of this technique is to reduce the number of 1's in the representation of the coefficients so that the number of additions in a multiplier can be reduced, thus achieving low power dissipation. However, due to changes in the coefficient representation, the performance of the filter can change. Hence, changes in the filter coefficients can be allowed only within a range such that the change in performance is within the tolerance limit. Complexity reduction is one of the oldest topics in the classical signal processing literature. Historically, however, the goal was to make proposed filter implementations feasible for practical state-of-the-art applications. Another motivation has been to reduce the cost of implementation.
Higher speed translates to lower power through a voltage scaling approach. Further, lower complexity in terms of the number of operations directly improves power. A classical measure of complexity has been the total number of add and multiply operations. Power consumption is related to such a measure, however, only indirectly. A direct estimate considers the number of switching events, and a more complex algorithm may consume lower power if it causes less switching activity. Hence, a low-power algorithm design approach is one that attempts to reduce the overall switching activity.
In this section, we will present the constrained least-squares (CLS) approach used to compute the modified coefficient vector $\mathbf{k} = [k_0, k_1, \dots, k_{M-1}]^T$. The original coefficient vector is represented by $\mathbf{c} = [c_0, c_1, \dots, c_{M-1}]^T$. We define the code class of a number a as the number of 1 bits in its binary representation; the maximum code class allowable per coefficient is then a constraint of the minimization. The vector $\mathbf{k}$ obtained using the minimization technique replaces $\mathbf{c}$ in the actual implementation.
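A sketch of the code class measure (the helper name and the 16-bit word width are our assumptions, not from the original text):

```python
def code_class(a, bits=16):
    """Number of 1 bits in the binary representation of coefficient a.
    The 16-bit two's-complement word width is an assumption."""
    return bin(a & ((1 << bits) - 1)).count("1")

# Fewer 1s means fewer partial-product additions in a shift-and-add multiplier:
print(code_class(0b0101_1011))   # 5 ones -> 5 partial products
print(code_class(0b0100_0000))   # 1 one  -> a single shifted partial product
```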
5.1.3 Circuit Activity Driven Architectural Transformations
Digital filters, a very important class of DSP circuits, can be represented by state equations on which architectural transforms can be applied. Here we consider heuristic transforms based on the associativity and commutativity properties of linear operations in linear time-invariant digital circuits.
Figure 5.4 shows the data flow graph of an infinite impulse response (IIR) filter. The computation tree for $s_1(t+1)$, for example, is described by $s_1(t+1) = c_1 s_1(t) + c_3 s_2(t) + k\,u(t)$. Each signal in the graph represents W bits of data comprising a data word. The W bits $b_W, b_{W-1}, \dots, b_1$ are fed in parallel to the respective adders and multipliers. The delays are designed to hold W bits in parallel. At time t + 1, let z out of the W bits have logic values different from those at time t. The signal activity is defined as the ratio of z to W and is given by $\beta(t) = z/W$. The variable $\beta(t)$ is a random variable for different values of t and represents a stochastic process. The average activity $\theta(t, t+N)$ of a signal over N consecutive time frames is defined as

$\theta(t, t+N) = \frac{1}{N} \sum_{n=t}^{t+N-1} \beta(n)$ ------ (5.15)
In case of bit-serial arithmetic, the bit values of the data word are transmitted
serially over a single data line over consecutive time steps. Thus it is not inter-
word differences in bit values, but intra-word bit differences that cause node
activity. Experiments over large values of N show that the average activity θ(0, N )
remains constant, showing that the stochastic process is strict sense stationary.
Average power dissipation in this case is proportional to

$\sum_{i=1}^{\gamma} \theta_i C_i$ ------ (5.16)

where $\theta_i$ is the average activity at node i (out of $\gamma$ such nodes) and $C_i$ is the capacitive load on the ith node.
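The definitions of beta(t) and theta translate directly into code. A sketch (not from the original text) for one word-parallel signal, with invented 8-bit samples:

```python
def beta(prev, curr, W=8):
    """beta(t): fraction of the W bits that changed between time t and t+1."""
    changed = bin((prev ^ curr) & ((1 << W) - 1)).count("1")
    return changed / W

samples = [0b0000_1111, 0b0000_1110, 0b1000_1110, 0b0111_0001]
betas = [beta(a, b) for a, b in zip(samples, samples[1:])]
theta = sum(betas) / len(betas)        # Eq. (5.15): average over N frames
print(betas, theta)                    # [0.125, 0.125, 1.0] 0.4166...
```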
The architectural transforms on DSP filters are based on the following observations, obtained through extensive simulation. Consider a word-parallel computation tree with I inputs $i_1, i_2, \dots, i_I$ and output $y = \sum_{j=1}^{I} a_j i_j$, $a_j$ being constant coefficients. If the input values of the tree are mutually independent, then the minimum average value of $\theta$ over all nodes of a balanced adder tree with I inputs is obtained when (a) $a_1 \ge a_2 \ge \cdots \ge a_I$ or (b) $a_1 \le a_2 \le \cdots \le a_I$.
For the case of a linear array of adders as shown in Figure 5.6, the minimum $\theta_i$ over all the nodes of the computation tree is achieved when $a_1 \le a_2 \le \cdots \le a_I$. The observations above are for mutually independent inputs.
Due to reconvergent fanout, signals can become correlated at the internal nodes.
However, the transformations seem to apply reasonably well even with such
correlated signals. The synthesis algorithm is based on the above observations
and on simulation. The given circuit is simulated at the functional level with
random, mutually independent circuit input values. The activities at the inputs to
all adders are noted. By applying the above two hypotheses, the adder trees are
restructured and the average activities recomputed. The above analysis is carried
out until there are no further improvements.
The procedure forces additions with high activity to be moved closer to the root of
a computation tree and vice versa. Note that no assumptions were made
regarding the implementation details of the adders or the multipliers. Assuming
that the capacitances at the internal nodes are all equal, improvement of up to
23% in power dissipation can be achieved.
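A toy sketch of the resulting heuristic for a linear adder chain (activity values invented; not from the original text): order the inputs by ascending activity so the most active signal passes through only the final adder, nearest the root:

```python
# Order a linear adder chain by ascending input activity so that the most
# active signal passes through only the final adder. Activities invented.
activity = {"i1": 0.9, "i2": 0.1, "i3": 0.4, "i4": 0.2}
chain = sorted(activity, key=activity.get)
print(chain)   # ['i2', 'i4', 'i3', 'i1'] -> i1 enters nearest the root
```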
Let us consider an example: a simple data path with an adder and a comparator, shown in Figure 5.7. Let the power dissipation for the data path be given by $P = CV^2 f$, where C is the total effective capacitance switched per clock cycle. Assume that the supply voltage is scaled down by about 40% (to 0.58 of its original value), so that the data path unit works at half the speed. To maintain throughput, if one uses the parallel configuration shown in Figure 5.8, the effective capacitance is almost doubled; in fact, it is more than doubled due to the extra routing required. Assuming that the capacitance is increased by a factor of 2.15, the power dissipation is given by $P_{par} = (2.15C)(0.58V)^2(0.5f) \approx 0.36P$. This method of using parallelism to reduce power has the overhead of more than twice the area and is not suitable for area-constrained designs.
Pipelining, shown in Figure 5.9, is another approach, with the advantage of a smaller area overhead. With the additional pipeline latch, the critical path becomes max[$T_{adder}$, $T_{comparator}$], allowing the adder and the comparator to operate at a slower speed. If one assumes the two delays to be equal, the supply voltage can again be reduced from 5 to 2.9 V, the voltage at which the delay doubles, with no loss in throughput. If, due to the addition of the extra latch, the effective capacitance is increased by a factor of 1.15, the power dissipation is $P_{pipe} = (1.15C)(0.58V)^2(f) \approx 0.39P$. As a bonus in pipelining, increasing the levels of pipelining has the effect of reducing logic depth and hence the power contributed by hazards. An obvious extension is to use a combination of pipelining and parallelism to obtain an area- and power-constrained design.
The easiest way to reduce the weighted capacitance being switched is to reduce the number of operations performed in the data flow graph. Reducing the operation count reduces the total capacitance associated with the system and hence can reduce the power dissipation. However, a reduction in the number of operations can have an adverse effect on critical paths. Let us consider the implementation of the function $X^2 + AX + B$. A straightforward implementation is shown in Figure 5.10a. However, the implementation of Figure 5.10b has one less multiplier.
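The transformation in question is, in our reading of Figures 5.10a/b, the factoring $X^2 + AX + B = X(X + A) + B$; a minimal sketch:

```python
def poly_direct(X, A, B):
    return X * X + A * X + B    # two multiplications (Figure 5.10a style)

def poly_factored(X, A, B):
    return X * (X + A) + B      # one multiplication, same critical path length

assert poly_direct(3, 4, 5) == poly_factored(3, 4, 5) == 26
```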
The critical path lengths of both implementations are the same. Hence, the second design is preferable from both the power and the area points of view. However, operation reduction sometimes increases the critical path length, as shown in Figure 5.11. The implementation of Figure 5.11a has four multipliers and three adders, while the implementation of Figure 5.11b has two multipliers and three adders. The latter implementation is better for area and power; however, its critical path is longer than that of the former.
5.1.6 Power Optimization Using Operation Substitution

It is well known that certain operations require more computational energy than others. In DSP circuits, multiplication and addition are the two most important operations performed. Multiplication consumes more energy per computation than addition. Hence, during high-level synthesis, if a multiplication can be replaced by additions, one can not only save area but also achieve an improvement in power dissipation. However, such transformations are usually associated with an increase in the critical path of the design.
Let us consider the example shown in Figure 5.12. Using the concept of
distributivity and common subexpression utilization, the circuit of Figure 5.12a can
be transformed into the circuit of Figure 5.12b. The critical path length of the
transformed circuit is longer than the critical path of circuit (a); however, it is
possible to substitute a multiplication operation by an addition operation, thereby
reducing the energy of computation. Other useful transformations can be noted.
Multiplication by constants is widely used in discrete cosine transforms (DCT), filters, and so on. If the multiplication is replaced by shift-and-add operations, considerable savings in power can be achieved. Experiments on an 11-tap FIR filter show that the power consumed by the execution unit is lowered to roughly one-eighth of the original power, and the power consumed by the registers is also reduced. A small penalty is paid in controller power. An overall improvement of 62% in power dissipation was seen for the FIR filter.
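A sketch (not from the original text) of constant multiplication by shift-and-add, needing one addition per 1 bit in the constant, which is exactly why the coefficient optimization above pays off:

```python
def mul_const_shift_add(x, c):
    """Multiply x by a known non-negative constant c using only shifts and
    adds: one addition per 1 bit in c's binary representation."""
    acc, shift = 0, 0
    while c:
        if c & 1:
            acc += x << shift   # add the shifted partial product
        c >>= 1
        shift += 1
    return acc

assert mul_const_shift_add(7, 10) == 70   # 10 = 0b1010 -> only two adds
```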
5.1.7 Precomputation-Based Optimization for Low Power:
However, depending on the complexity of the logic functions f1 and f2, the switching activity at the internal nodes of f1 or f2 can be significant. Hence, for the precomputation scheme to work effectively, the set of inputs fed to register R2 should be large, while the complexity of the logic blocks f1 and f2 should be small. One would also like the signal probability of f1 + f2, that is, P(f1) + P(f2) - P(f1)P(f2), to be large for the scheme to work effectively on average. It has been noted that for some common functions considerable savings in power can be achieved by properly selecting the sets of inputs for registers R1 and R2.
A simple statistical analysis may help convince us that the precomputation logic implementation of Figure 5.14 is very attractive. Assume uncorrelated input bits with uniform random probabilities, where every bit has an equal probability of being zero or one. Then there is a 50% probability that $A_n \oplus B_n = 1$, so register R2 is disabled in 50% of the clock cycles. Therefore, with only one additional 2-input XOR gate, we have reduced the switching activity of the 2n - 2 least significant bits at R2 to half of its original expected switching frequency. Also, when the load-disable signal is asserted, the combinational logic of the comparator has less switching activity because the outputs of R2 are not switched. The extra power required to compute $A_n \oplus B_n$ is negligible compared to the power saving, even for moderate sizes of n.
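The 50% figure is easy to check by simulation. A Monte-Carlo sketch (not from the original text; bit width and trial count are arbitrary):

```python
import random

random.seed(1)
n, trials = 8, 100_000
disabled = 0
for _ in range(trials):
    A, B = random.getrandbits(n), random.getrandbits(n)
    An, Bn = (A >> (n - 1)) & 1, (B >> (n - 1)) & 1
    if An ^ Bn:          # MSBs differ: comparison decided by the MSBs alone
        disabled += 1    # load-disable asserted, R2 keeps its old value
print(disabled / trials) # ~0.5
```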
From the above discussion, it is obvious that a designer needs some knowledge of the input signal statistics to apply the precomputation logic technique. In Figure 5.14, if the probability of $A_n \oplus B_n$ is close to zero, the precomputation logic circuit may be inferior, in power and area, to a direct implementation. Experimental results have shown up to 75% power reduction, with an average of 3% area overhead and 1 to 5 additional gate delays on the worst-case delay path.
Most efforts in controlling power dissipation of digital systems have been and
continue to be focused on hardware design. There is good reason for this since
hardware is the physical means by which power is converted into useful
computation. However, it would be unwise to ignore the influence of software on
power dissipation. In systems based on digital processors or controllers, it is
software that directs much of the activity of the hardware.
Consequently, the manner in which software uses the hardware can have a
substantial impact on the power dissipation of a system. An analogy drawn from
automobiles can help explain this further. The manner in which one drives can have
a significant effect on total fuel consumption. Until recently, there were no efficient
and accurate tools to estimate the overall effect of a software design on power
dissipation. Without a power estimator there was no way to reliably optimize
software to minimize power.
Some software design techniques are already known to reduce certain
components of power dissipation, but the global effect is more difficult to
quantify. For example, it is often advantageous to optimize software to minimize
memory accesses, but there may be energy-consuming side effects, such as an
increase in number of instructions. Not all low-power software design problems
have been solved, but progress has been made.
There are several contributors to CPU power dissipation that can be influenced
significantly by software. The memory system takes a substantial fraction of the
power budget (on the order of one-tenth to one-fourth) for portable computers
and it can be the dominant source of power dissipation in some memory-intensive
DSP applications such as video processing.
Memory accesses are expensive for several reasons. Reading or writing a memory
location involves switching on highly capacitive data and address lines going to
the memory, row and column decode logic, and word and data lines within the
memory that have a high fanout.
The mapping of data structures into multiple memory banks can influence the degree to which parallel loading of multiple words is possible. Parallel loads not only improve performance but are also more energy efficient.
The memory access patterns of a program can greatly affect the cache performance
of a system. Unfavorable access patterns (or a cache that is too small) will lead to
cache misses and a lot of costly memory accesses. In multidimensional signal
processing algorithms, the order and nesting of loops can alter memory size and
bandwidth requirements by orders of magnitude. Compact machine code decreases
memory activity by reducing the number of instructions to be fetched and reducing
the probability of cache misses. Cache accesses are more energy efficient than main
memory accesses. The cache is closer to the CPU than is main memory, resulting in
shorter and less capacitive address and data lines. The cache is also much smaller
than main memory, leading to smaller internal capacitance on word and data lines.
Buses in an instruction processing system typically have high load capacitances due to
the number of modules connected to each bus and the length of the bus routes. The
switching activity on these buses is determined to a large degree by software [3].
Switching on an instruction bus is determined by the sequence of instruction op-codes
to be executed. Similarly, switching on an address bus is determined by the sequence
of data and instruction accesses. Both effects can often be accounted for at compile
time. Data related switching is much harder to predict at compile time since most
input data to a program are not provided until execution time.
Data paths such as ALUs and FPUs make up a large portion of the logic power dissipation in a CPU. Even if the exact data input sequences to the ALU and FPU are hard to predict, the sequence of operations and the data dependencies are determined during software design and compilation. The energy to evaluate an arithmetic expression might vary considerably with the choice of instructions. A simple example is the common compiler optimization of reduction in strength, where an integer multiplication by 2 can be replaced by a cheaper shift-left operation. Poor scheduling of operations might result in unnecessary, energy-wasting pipeline stalls.
Some sources of power dissipation such as clock distribution and control logic
overhead might not seem to have any bearing on software design decisions.
However, each execution cycle of a program incurs an energy cost from such
overhead. The longer a program requires to execute, the greater will be this
energy cost. In fact, it is found that the shortest code sequence was invariably the
lowest energy code sequence for a variety of microprocessor and DSP devices. In
no case was the lower average power dissipation of a slightly longer code
sequence enough to overcome the overhead energy costs associated with the
extra execution cycles. However, this situation may change as power management
techniques are more commonly used. In particular, power management commonly
takes the form of removing the clock from idle components. This reduces the
clock load as well as prevents unwanted computations.
The first step toward optimizing software for low power is to be able to estimate
the power dissipation of a piece of code. This has been accomplished at two basic
levels of abstraction. The lower level is to use existing gate level simulation and
power estimation tools on a gate level description of an instruction processing
system. A higher level approach is to estimate power based on the frequency of
execution of each type of instruction or instruction sequence (i.e., the execution
profile). The execution profile can be used in a variety of ways. Architectural power
estimation determines which major components of a processor will be active
during each execution cycle of a program. Power estimates for each active
component are then taken from a look-up table and added into the power
estimate for the program. Another approach is based on the premise that the
switching activity on buses (address, instruction, and data) is representative of
switching activity (and power dissipation) in the entire processor. Bus switching
activity can be estimated based on the sequence of instruction op-codes and
memory addresses.
The final approach we will consider is referred to as instruction level power
analysis. This approach requires that power costs associated with individual
instructions and certain instruction sequences be characterized empirically for the
target processor. These costs can be applied to the instruction execution
sequence of a program to obtain an overall power estimate.
Gate level power estimation of a processor running a program is the most accurate
method available, assuming that a detailed gate level description including layout
parasitics is available for the target processor. Such a detailed processor description
is most likely not available to a software developer, especially if the processor is not
an in-house design. Even if the details are available, this approach will be too slow
for low-power optimization of a program. Gate level power estimates are important
in evaluating the power dissipation behavior of a processor design and in
characterizing the processor for the more efficient instruction level power estimation
approaches.
Architecture level power estimation is less precise but much faster than gate level
estimation. This approach requires a model of the processor at the major
component level (ALU, register file, etc.) along with power dissipation estimates for
each component. It also requires a model of the specific components which will be
active as a function of the instructions being executed. The architecture level
approach is implemented in a power estimation simulator called ESP (Early design
Stage Power and performance simulator). ESP simulates the execution of a program,
determining which system components are active in each execution cycle and
adding the power contribution of each component.
5.2.2.3 Instruction-Level Power Analysis
The first requirement for ILPA is to characterize the average power dissipation
associated with each instruction or instruction sequence of interest. Three
approaches have been documented for accomplishing this characterization. The
most straightforward method, if the hardware configuration permits, is to directly
measure the current drawn by the processor in the target system as it executes
various instruction sequences. If this is not practical, another method is to first use a
hardware description language model (such as Verilog or VHDL) of the processor to
simulate execution of the instruction sequences. An actual processor can then be
placed in an automated IC tester and exercised using test vectors obtained from the
simulation. An ammeter is used to measure the current draw of the processor in the
test system. A third approach is to use gate level power simulation of the processor
to obtain instruction level estimates.
The choice of instruction sequences for characterization is critical to the success of
this method. As a minimum, it is necessary to determine the base cost of individual
instructions. Base cost refers to the portion of the power dissipation of an instruction
that is independent of the prior state of the processor. Base cost excludes the effect
of such things as pipeline stalls, cache misses, and bus switching due to the
difference in op-codes for consecutive instructions. The base cost of an instruction
can be determined by putting several instances of that instruction into an infinite
loop. Average power supply current is measured while the loop executes. The loop
should be made as long as possible so as to minimize estimation error due to loop
overhead (the jump statement at the end of the loop). However, the loop must not
be made so large as to cause cache misses. Power and energy estimates for the
instruction are calculated from the average current draw, the supply voltage, and the
number of execution cycles per instruction.
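That final calculation is straightforward; a sketch (the measurement values are invented):

```python
def instruction_energy(i_avg, vdd, cycles, t_cycle):
    """Energy for one instruction: average supply current * supply voltage
    gives power; multiplied by the instruction's execution time."""
    return i_avg * vdd * cycles * t_cycle

# e.g. 0.2 A at 3.3 V, a 2-cycle instruction, 10 ns clock period:
print(instruction_energy(0.2, 3.3, 2, 10e-9))   # 1.32e-08 J, i.e. 13.2 nJ
```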
Base costs for each instruction are not always adequate for a precise software power
estimate. Additional instruction sequences are needed in order to take into account
the effect of prior processor state on the power dissipation of each instruction.
Pipeline stalls, buffer stalls, and cache misses are obvious energy-consuming events
whose occurrence depends on the prior state of the processor. Instruction sequences
can be created that induce each of these events so that current measurements can
be made. However, stalls and cache misses are effects that require a large scale
analysis or simulation of program execution in order to appropriately factor them into
a program’s energy estimate. There are other energy costs that can be directly
attributed to localized processor state changes resulting from the execution of a pair
of instructions. These costs are referred to as circuit state effects.
The foremost circuit state effect is probably the energy cost associated with the
change in the state of the instruction bus as the op-code switches from that of an
addition operation to that of a multiplication operation.
Other circuit state effects for this example could include switching of control lines
to disable addition and enable multiplication, mode changes within the ALU, and
switching of data lines to reroute signals between the ALU and register file.
Although this example examines a pair of adjacent instructions, it is also possible
for circuit state effects to span an arbitrary number of execution cycles. This could
happen if consecutive activations of a processor component are triggered by
widely separated instructions. In such cases, the state of a component may have
been determined by the last instruction to use that component.
The circuit state cost associated with each possible pair of consecutive
instructions is characterized for instruction level power analysis by measuring
power supply current while executing alternating sequences of the two
instructions in an infinite loop. Unfortunately, it is not possible to separate the cost
of an A → B transition from a B → A transition, since the current measurement is
an average over many execution cycles.
The overall energy cost of a program can be written as

$E_P = \sum_i B_i N_i + \sum_{i,j} O_{i,j} N_{i,j} + \sum_k E_k$

where $E_P$ is decomposed into base costs, circuit state overhead, and stalls and cache misses. The first summation represents base costs: $B_i$ is the base cost of an instruction of type i and $N_i$ is the number of type-i instructions in the execution profile of the program. The second summation represents circuit state effects: $O_{i,j}$ is the cost incurred when an instruction of type i is followed by an instruction of type j. Because of the way $O_{i,j}$ is measured, we have $O_{i,j} = O_{j,i}$. Here, $N_{i,j}$ is the number of occurrences where instruction type i is immediately followed by instruction type j. The last sum accounts for other effects, such as stalls and cache misses; each $E_k$ represents the cost of one such effect found in the program execution profile.
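A sketch of evaluating this estimate from an execution profile (all cost and count values below are invented):

```python
B  = {"add": 1.0, "mul": 3.0, "load": 2.5}       # base cost per instruction
O  = {("add", "mul"): 0.4, ("add", "load"): 0.2,
      ("load", "mul"): 0.3}                      # circuit state overheads
N  = {"add": 120, "mul": 40, "load": 60}         # instruction counts N_i
Np = {("add", "mul"): 30, ("add", "load"): 50}   # adjacent-pair counts N_ij
E_other = [5.0, 2.0]                             # stall / cache-miss costs E_k

def pair_cost(i, j):
    return O.get((i, j), O.get((j, i), 0.0))     # O is symmetric: O_ij = O_ji

EP = (sum(B[i] * N[i] for i in N)
      + sum(pair_cost(i, j) * n for (i, j), n in Np.items())
      + sum(E_other))
print(EP)   # 390 + 22 + 7 = 419.0
```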
5.2.3 Software Power Optimizations
Software optimizations for minimum power or energy tend to fall into one or more
of the following categories: selection of the least expensive instructions or
instruction sequences, minimizing the frequency or cost of memory accesses, and
exploiting power minimization features of hardware.
It should be clarified that the energy optimization objectives differ with respect to
the intended application. In battery-powered systems, the total energy dissipation
of a processing task is what determines how quickly the battery is spent. In
systems where the power constraint is determined by heat dissipation and
reliability considerations, instantaneous or average power dissipation will be an
important optimization objective or constraint. An instantaneous power dissipation
constraint could lead to situations where one would need to deliberately degrade
system performance through software or hardware techniques. A hardware
technique is preferable since hardware performance degradation can be achieved
through energy-saving techniques (lower voltage, lower clock rate). Software
performance degradation is likely to increase energy consumption by increasing
the number of execution cycles of a program.
5.2.3.1 Algorithm Transformations to Match Computational Resources
The general problem of efficient algorithm design is a large area of study that
goes well beyond the scope of this chapter. However, many algorithm design
approaches have significant implications for software-related power dissipation.
One impact of algorithm design is on memory requirements, but we will defer that
problem to the next section. Another impact is on the efficient use of
computational resources. This problem has been addressed extensively in parallel
processor applications and in lower-power DSP synthesis.
If only one adder is available, then the serial summation of Figure 5.15 is a sensible approach; parallelizing the summation would only force us to use additional registers to store intermediate sums. If two adders are available, then the algorithm illustrated in Figure 5.16 makes sense because it permits two additions to be performed simultaneously. Similarly, Figure 5.17 fits a system with four adders by virtue of making four independent operations available at each time step. In the general case, one cannot manipulate the parallelism of an algorithm quite so conveniently. However, the principle is still applicable: try to match the degree of parallelism in an algorithm to the number of parallel resources available.
Figure 5.17 Summation with four adders
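A sketch of the serial-versus-tree trade-off (not from the original text; data values arbitrary): the balanced tree finishes in about log2(n) steps when enough adders are available, versus n - 1 steps for the single-adder chain:

```python
data = [1, 2, 3, 4, 5, 6, 7, 8]

def tree_sum(xs):
    """Balanced-tree summation: each 'step' is one round in which all
    available adders fire simultaneously on independent pairs."""
    steps = 0
    while len(xs) > 1:
        pairs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:
            pairs.append(xs[-1])     # odd element rides through unchanged
        xs, steps = pairs, steps + 1
    return xs[0], steps

print(sum(data), tree_sum(data))     # 36 (36, 3): 3 steps vs. 7 serial adds
```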
Since memory often represents a large fraction of a system’s power budget, there
is a clear motivation to minimize the memory power cost. Happily, software design
techniques that have been used to improve memory system efficiency can also be
helpful in reducing the power or energy dissipation of a program. This is largely
due to memory being both a power and a performance bottleneck. If memory
accesses are both slow and power hungry, then software techniques that
minimize memory accesses tend to have both a performance and a power benefit.
Lower memory requirements also help by allowing total RAM to be smaller
(leading to lower capacitance) and improving the percentage of operations that
only need to access registers or cache.
➢ Put memory accesses as close as possible to the processor. Choose registers first,
cache next, and external RAM last.
➢ Make the most efficient use of the available memory bandwidth; for example, use
multiple-word parallel loads instead of single-word loads as much as possible.
➢ Recurrence detection and optimization, that is, use of registers for values that are
carried over from one level of recursion to the next.
One approach evaluates the benefit of loop unrolling on overall performance and on the degree to which memory reference compaction (also called memory access coalescing) opportunities are uncovered. Loop unrolling transforms several iterations of a loop into a block of straight-line code; the straight-line code is then iterated fewer times to accomplish the same work as the original loop. The purpose of the transformation is to make it easier to analyze data dependencies when determining where to combine memory references. The performance benefit of loop unrolling and memory access coalescing was found to vary a great deal depending on the processor.
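A sketch of the transformation (written in Python for readability; in practice the compiler performs it on array loads):

```python
def rolled(a):
    s = 0
    for i in range(len(a)):
        s += a[i]            # one memory reference per iteration
    return s

def unrolled(a):             # assumes len(a) is even
    s = 0
    for i in range(0, len(a), 2):
        s += a[i]            # a[i] and a[i+1] are adjacent in memory:
        s += a[i + 1]        # candidates for one coalesced double-word load
    return s

assert rolled([1, 2, 3, 4]) == unrolled([1, 2, 3, 4]) == 10
```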
The specific techniques that have been evaluated for energy savings are instruction packing, minimizing circuit state effects, and operand swapping.
Given the benefits of high-performance code with respect to energy savings,
optimizations that improve speed should be especially helpful. Code size
minimization may be necessary, especially in embedded systems, but it is not
quite as directly linked to energy savings. Regarding performance and energy
minimization, cache performance is a greater concern. Large code and a small
cache can lead to frequent cache misses with a high power penalty. Code size and
cache performance are concerns that motivate the effort to maximize code
density for embedded processors.
Instruction ordering for low power attempts to minimize the energy associated
with the circuit state effect. Circuit state effect is the energy dissipated as a result
of the processor switching from execution of one type of instruction to another.
On some types of processors, especially DSP, the magnitude of the circuit state
effect can vary substantially depending on the pair of instructions involved. The techniques for measuring or estimating the circuit state effect for different pairs of consecutive instructions were described earlier. Given a table of circuit state costs, one can tabulate the total circuit state cost for a sequence of instructions.
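A sketch of such a tabulation (the cost table is invented; real values would come from the measurements described above):

```python
cost = {("add", "mul"): 0.4, ("mul", "load"): 0.3, ("add", "load"): 0.1}

def circuit_state_cost(seq):
    total = 0.0
    for i, j in zip(seq, seq[1:]):                   # each adjacent pair
        total += cost.get((i, j), cost.get((j, i), 0.0))
    return total

print(circuit_state_cost(["add", "mul", "load", "add"]))  # 0.8
print(circuit_state_cost(["add", "add", "mul", "load"]))  # 0.7
```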
Instruction ordering then involves reordering instructions to minimize the total
circuit state cost without altering program behavior. Researchers have found that
in no case do circuit state effects outweigh the benefit of minimizing program
execution cycles. Circuit state effects are found to be much more significant for
DSPs than for general-purpose architectures.
Accumulator spilling and mode switching are two DSP behaviors that are sensitive
to instruction ordering and are likely to have a power impact. In some DSP
architectures, the accumulator is the only general-purpose register. With a single
accumulator, any time an operation writes to a new program variable, the
previous accumulator value will need to be spilled to memory, incurring the
energy cost of a memory write. In some DSPs, operations are modal. Depending
on the mode setting (such as the sign extension mode), data path operations will
be handled differently.
5.2.3.4 Power Management
The degree to which software has control over power management varies from
one processor to another. It is possible for power management to be entirely
controlled by hardware, but typically software has the ability to at least initiate a
power-down mode. For event-driven applications such as user interfaces, system
activity typically comes in bursts. If the time a system is idle exceeds some
threshold, it is likely that the system will continue to be idle for an extended time.
Based on this phenomenon, a common approach is to initiate a shutdown when
the system is idle for a certain length of time. Unfortunately, the system continues
to draw power while waiting for the threshold to be exceeded. Predictive
techniques take into account the recent computation history to estimate the
expected idle time. This permits the power down to be initiated at the beginning
of the period of idle time.
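A toy sketch contrasting the two policies (not from the original text; thresholds and history values invented):

```python
def timeout_policy(idle_so_far_ms, threshold_ms=50):
    """Shut down only after the system has already idled past a threshold."""
    return idle_so_far_ms >= threshold_ms

def predictive_policy(recent_idle_ms, expect_long_ms=100):
    """Shut down immediately if recent history predicts a long idle period."""
    avg = sum(recent_idle_ms) / len(recent_idle_ms)
    return avg > expect_long_ms

print(timeout_policy(80), predictive_policy([120, 200, 90]))   # True True
```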
The Intel486SL provides a System Management Mode (SMM), a distinct operating
mode that is transparent to the operating system and user applications. The SMM
is entered by means of an asynchronous interrupt. Software for the SMM is able
to enable, disable, and switch between fast and slow clocks for the CPU and ISA
bus. The PowerPC 603 and 604 offer both dynamic and static power management
modes. Dynamic power management removes clocking from execution units if it
is not needed to support currently executing instructions. This allows for
significant savings even while the processor is fully up and running. Savings on
the order of 8-16% have been observed for some common benchmarks. No
software intervention is needed, except to put the processor into the power
management mode. The PowerPC has three static power management modes:
“doze,” “nap,” and “sleep”. They are listed here in order of increasing power
savings and increasing delay to restart the processor. The doze mode shuts off
most functional units but keeps bus snooping enabled in order to maintain data
cache coherency. The nap mode shuts off bus snooping and can set a timer to
automatically wake up the processor. It keeps a phase-locked loop running to
permit quick restart of clocking. The sleep mode shuts off the phase-locked loop.
Nap and sleep modes can each be initiated by software, but a hardware
handshake is required to protect data cache coherency.
5.2.4 Automated Low-power Code Generation
A variety of optimizations and software design approaches that can minimize energy consumption have been described. To be used extensively, those techniques need to be
incorporated into automated software development tools. There are two levels of
tools: tools that generate code from an algorithm specification and compiler level
optimizations. We have not located any comprehensive high-level tools intended for
low-power code development, but several technologies appear to be available to
build such tools, especially for DSP applications. Graphical and textual languages
have been available for some time that enable one to specify a DSP algorithm in a
way that does not obscure the natural parallelism and data flow. HYPER-LP is a DSP
data path synthesis system that incorporates several algorithm level transformations
to uncover parallelism and minimize critical paths so that the data path supply
voltage can be minimized. Even though HYPER-LP is targeted for data path circuit
synthesis, the same types of transformations should be useful in adapting an
algorithm to exploit parallel resources in a DSP processor or multiprocessor system
(in fact HYPER has been used to evaluate algorithm choices prior to choice of
implementation platform). MASAI is a tool that reorganizes loops in order to
minimize memory transfers and size.
Compiler technology for low power appears to be further along than the high-level
tools, in part because well-understood performance optimizations are adaptable to
energy minimization. This is true more for general-purpose processors than for DSP
processors. Digital signal processing compiler technology faces difficulties in dealing
with a small register set (possibly just an accumulator), irregular data paths, and
making full use of parallel resources. The rest of this section describes two examples
of power reduction techniques that have been incorporated into compilers.
Cold scheduling is an instruction scheduling algorithm that reduces the bus switching activity related to the change in state when execution moves from one instruction type to another. The algorithm is a list scheduler that prioritizes the selection of each instruction based on the power cost (bus switching activity) of placing that instruction next into the schedule. The following ordering of compilation phases has been proposed in order to incorporate cold scheduling:

1. Allocate registers.
2. Preassemble the code (generate the binary instruction encodings).
3. Perform cold scheduling.
4. Complete the assembly process.

The assembly process was split into two phases to accommodate the cold scheduler. Cold scheduling could not be performed prior to preassembly because the binary codes for instructions are needed to determine the effect of an instruction on switching activity. Scheduling could not be the last phase, since completion of the assembly process requires an ordering of instructions.
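A greedy sketch of the core selection step (op-code encodings invented; dependency tracking omitted, so every instruction is treated as ready):

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

def cold_schedule(instructions, opcode):
    """Greedy cold scheduling: repeatedly issue the ready instruction whose
    op-code encoding is nearest (in Hamming distance) to the previous one."""
    remaining, schedule, prev = list(instructions), [], 0
    while remaining:
        nxt = min(remaining, key=lambda ins: hamming(opcode[ins], prev))
        schedule.append(nxt)
        prev = opcode[nxt]
        remaining.remove(nxt)
    return schedule

ops = {"add": 0b0001, "sub": 0b0011, "mul": 0b1100, "div": 0b1110}
print(cold_schedule(ops, ops))   # ['add', 'sub', 'div', 'mul']
```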
A code generation and optimization methodology is proposed that encompasses
several of the techniques discussed earlier. Following is the sequence of phases
that they proposed:
The emphasis of this unit has been on software design techniques for low power.
However, it is probably obvious by now that many of these techniques are
dependent in some way upon the hardware in order for a power savings to be
realized. Hardware/software codesign for low power is a more formal term for this
problem of crafting a mixture of hardware and software that provides required
functionality, minimizes power consumption, and satisfies objectives that could
include latency, throughput, and area. Instruction set design and implementation
seems to be one of the most well defined of the codesign problems. In this section,
we will first look at instruction set design techniques and their relationship to low
power design. Another variation of the codesign problem is to use a processor for
which some portion of the interconnect and logic is reconfigurable. Reconfigurable
processors allow much of the codesign problem to be delayed until the software
design and compilation phase.
For a nonreconfigurable processor, there is much less opportunity to optimize the
hardware once the processor design is fixed. Finally, we will survey the
hardware/software trade-offs that have come up in our consideration of power
optimization of software.
Huang and Despain's approach, similar to PEAS-I, optimizes the instruction set for sample applications. The most significant difference is that this approach groups micro-operations (MOPs) together to form higher level instructions. Each benchmark is expressed as a set of MOPs with dependencies to be satisfied. MOPs are merged together as a byproduct of the scheduling process: MOPs that are scheduled to the same clock cycle are combined. For any candidate schedule and instruction set, the instruction width (bits), instruction set size, and hardware resource requirements are constrained. The scheduling problem is solved by simulated annealing with the objective of minimizing execution cycles and instruction set size. Both approaches rely on instruction level application profiles to prioritize decisions regarding the instruction set. The challenging task would be determining appropriate instruction level power estimates and circuit state effects for an unsynthesized CPU. Op-code selection is another design decision that could be optimized for low power, considering the impact of the op-codes on circuit state effects.
The reconfigurable portion can be limited to a coprocessor or it can encompass a
large portion of the processor architecture.
9. ASSIGNMENTS

Q. No. | Questions | CO | BT Level
1 | The link below describes software power estimation techniques. Watch this video lecture and submit about 1000 words on what you understood from it. https://www.youtube.com/watch?v=dqcfYTePRxQ | CO6 | K3
2 | The link below discusses FPGA power optimization. Watch the video and write 1000 words based on it. https://www.youtube.com/watch?v=FGfjIvbyU40 | CO6 | K3
10. Part A Q & A (with K level and CO)
11. Part B Questions

Q. No. | Questions | CO | BT Level
2 | Elaborate on the use of pipelining and parallelism for low power. | CO5 | K2
12. Supportive Online Certification Courses (NPTEL, Swayam, Coursera, Udemy, etc.)
13. Real Time Applications in Day-to-Day Life and to Industry

https://www.cadence.com/en_US/home/solutions/low-power-solution.html
https://www.synopsys.com/implementation-and-signoff/signal-and-power-integrity.html
14. Assessment Schedule
SIAT
MODEL
15. Prescribed Text Books
TEXT BOOKS:
REFERENCES:
NPTEL LINK:
https://archive.nptel.ac.in/courses/106/105/106105034/#
16. Mini Project Suggestions
Thank you