2018 IEEE Asia Pacific Conference on Circuits and Systems
A High-Speed and Low-Complexity Architecture
       for Softmax Function in Deep Learning
                                         Meiqi Wang, Siyuan Lu, Danyang Zhu,
                               Jun Lin, Member, IEEE, and Zhongfeng Wang, Fellow, IEEE
                  School of Electronic Science and Engineering, Nanjing University, Nanjing, China
      Email: mqwang99@gmail.com, sylu@smail.nju.edu.cn, zhudanyang10@foxmail.com, {jlin, zfwang}@nju.edu.cn
Abstract—Recently, significant improvement has been achieved for hardware architecture design of deep neural networks (DNNs). However, the hardware implementation of the widely used softmax function in DNNs, which involves expensive division and exponentiation units, has not been much investigated. This paper presents an efficient hardware implementation of the softmax function. Mathematical transformations and linear fitting are used to simplify this function. Multiple algorithmic strength reduction strategies and fast addition methods are employed to optimize the architecture. By using these techniques, complicated logic units like multipliers are eliminated and the memory consumption is largely reduced, while the accuracy loss is negligible. The proposed design is coded in a hardware description language (HDL) and synthesized under the TSMC 28-nm CMOS technology. Synthesis results show that the architecture achieves a throughput of 6.976 G/s for 8-bit input data. A power efficiency of 463.04 Gb/(mm² · mW) is achieved, and the design costs only 0.015 mm² of area. To the best of our knowledge, this is the first work on efficient hardware implementation for softmax in the open literature.

Index Terms—softmax, hardware implementation, efficient hardware architecture, deep neural network, VLSI

I. INTRODUCTION

Deep neural networks (DNNs) have achieved great success in the big data area since their resurgence in 2006 [1]. Efficient hardware architectures for DNNs are one of the hot topics in both academia and industry [2]-[6]. The softmax layer is widely used in different DNNs [7]-[9]. However, most previous works focus on optimizing matrix multiplications in DNNs, and efficient hardware implementation of the softmax function has not been much investigated. Efficient hardware architectures for the softmax layer are desired for high-speed deep learning systems. The exponentiation and division calculations in the softmax function are quite expensive, especially in embedded systems. A hardware architecture for softmax has been proposed in [10]. However, no detailed discussion of the implementation of the exponential and logarithmic units, which have very high complexity, was presented in [10]. In this paper, we perform an efficient hardware implementation of the softmax function. Our main contributions are as follows:
• Mathematical transformations are used to transform the exponential and logarithmic (exp-log) functions into exp-log operations with a limited input range, together with some other simple operations including constant multiplication, shift and addition. Linear fitting is used to calculate the exp-log function within a specific range. After the optimization, the main calculations in the exponential unit require only a shift unit, two adders and a constant multiplier. Calculations in the logarithmic unit require only a shift unit, a leading one detector (LOD) unit and three adders.
• The constant multiplier is optimized using the algorithmic strength reduction strategy and is replaced by multiple adders. The adders are further optimized using the carry-save method [11], [12].

The rest of this paper is organized as follows. Section II gives a brief review of softmax and the domain transformation of this function. Sections III and IV propose designs for the exponential unit and the logarithmic unit, respectively. The proposed architecture of the softmax function is evaluated in Section V. Section VI concludes this paper.

II. BACKGROUND

The softmax function can be expressed as follows:

    f(x_i) = e^{x_i} / Σ_{j=1}^{N} e^{x_j},  (i = 1, 2, ..., N),      (1)

where x_1, x_2, ..., x_N are the input values of the softmax layer and the output value f(x_i) represents the probability that the sample belongs to the i-th category.

A mathematical transformation of softmax involving exponential and logarithmic operations is adopted for the hardware implementation, as shown in (2):

    f(x_i) = exp(ln(e^{x_i} / Σ_{j=1}^{N} e^{x_j}))
           = exp(x_i − ln(Σ_{j=1}^{N} e^{x_j})),  (i = 1, 2, ..., N).      (2)

III. THE ARCHITECTURE OF THE EXPONENTIAL UNIT

The overall architecture of the softmax function is shown in Fig. 1. It consists of multiple exponential units (EXPUs), one adder tree, one natural logarithmic unit (LNU) and some other simple units. The most expensive calculations are in the EXPU and the LNU. In Sections III and IV, efficient architectures for the two units are introduced.
978-1-5386-8240-1/18/$31.00 ©2018 IEEE
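The transformation in (2) can be illustrated with a short floating-point sketch (the function names are ours; this models the arithmetic only, not the fixed-point hardware datapath):

```python
import math

def softmax_direct(x):
    # Direct form (1): one exponential per input and one division per output.
    s = sum(math.exp(v) for v in x)
    return [math.exp(v) / s for v in x]

def softmax_transformed(x):
    # Transformed form (2): the division is folded into the exponent as a
    # subtraction of ln(sum of exponentials), which is the quantity the
    # adder tree and LNU produce for the second pass of EXPUs.
    ln_f = math.log(sum(math.exp(v) for v in x))
    return [math.exp(v - ln_f) for v in x]
```

Both functions produce the same probabilities; the second form avoids per-output division, which is the property the proposed architecture exploits.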
Fig. 1. The overall architecture for the softmax function.

A. Mathematical Transformation of Exponential Calculation

Exponential operations are composed of a number of multiplications. Multiple multiplications consume plenty of resources and cause long datapath delays in hardware implementation. Moreover, if the input cannot be limited to a specific range, the value of the exponent is difficult to calculate. Considering these difficulties, a mathematical transformation is applied to the exponential function.

First, the base number e is replaced by the base number 2, and the exponential function becomes:

    e^{y_i} = 2^{y_i · log₂e}.      (3)

Then the value y_i · log₂e is split into an integer part u_i and a fractional part v_i. Note that u_i + v_i = y_i · log₂e and u_i = ⌊y_i · log₂e⌋. In the hardware implementation, the value y_i · log₂e is a fixed-point number, which makes it easy to split the integer and fractional parts. The calculation can then be simplified as follows:

    e^{y_i} = 2^{u_i + v_i} = 2^{u_i} · 2^{v_i} = { 2^{v_i} << u_i,     u_i > 0
                                                  { 2^{v_i} >> (−u_i),  u_i ≤ 0.      (4)

Through this transformation, we replace the original operation "f(y_i) = e^{y_i}, −∞ < y_i < +∞" with the operation "f(v_i) = 2^{v_i}, 0 < v_i ≤ 1" and a simple shift operation. The operation "f(v_i) = 2^{v_i}, 0 < v_i ≤ 1" can be implemented with a lookup table or a linear fitting function, which greatly reduces the hardware complexity.

B. Proposed Design for the EXPU

The hardware architecture of the EXPU is shown in Fig. 2. It consists of two adders, a constant multiplier and a shift operation. This architecture is designed to support both exponential operations in (2), and the two operations are slightly different. In the second exponential calculation, lnF (the output of the LNU in Fig. 1) needs to be subtracted from the input value. Moreover, the results of the first exponential calculation are accumulated and stored, while those of the second one are taken as the final outputs.

Fig. 2. The proposed design for the EXP Unit.

The range of v_i in f(v_i) = 2^{v_i} is limited to (0, 1], so the function f(v_i) = 2^{v_i} can be approximated by the linear functions f(v_i) = v_i + d₁ or f(v_i) = v_i + d₂ within this specific range. The two bias values d₁ and d₂ correspond to the first and second exponential operations, respectively. In this way, no multiplier or lookup table is needed in these approximated functions, and the calculation is simplified with negligible accuracy loss.

C. Improved Architecture for the Constant Multiplier

The constant multiplication by log₂e appears many times in the proposed architecture, and it lies on the critical path of the system before optimization. In this subsection, optimizations of this calculation at both the algorithmic level and the architectural level are developed.

First, we use the algorithmic strength reduction strategy to reduce the total number of operations in the constant multiplication. The value log₂e is approximated by 1.0111 in binary, which can be expressed as 1.0111 = 1 + 0.1 − 0.0001, called the "ADD-ADD-SUB" form. In this way, the three additions and three shifts in the constant multiplier are simplified into one addition, one subtraction and two shifts.

In the hardware implementation, we first transform the "ADD-ADD-SUB" operation into an "ALL-ADD" operation, and then adopt the carry-save technique to simplify these additions. The overall architecture of the constant multiplier is shown in Fig. 3.

Suppose E = A × 1.0111. In our implementation, A is an unsigned number, so we have B = A >> 1, D = A >> 4 and E = A + B − D. The subtraction of D is equivalent to adding its two's complement. Defining F = ∼D, we have E = A + B + F + 1.

In our architecture, both A and E have Q bits. The critical path of the traditional method is the operation time of 3Q FAs (full adders). With the carry-save method, the critical path is reduced to the operation time of Q + 1 FAs, and the total number of FAs is reduced from 3Q to 2Q.

Compared with the results without optimization, not only is the critical path shortened, but the area and power consumption are also significantly reduced due to the simplification.
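The "ALL-ADD" constant multiplication can be sketched as a behavioral model (this is software illustration only, not the carry-save hardware; the word length Q = 16 is an assumption):

```python
def const_mul_log2e(a, q=16):
    # Multiply an unsigned value a by log2(e) ~ 1.0111 (binary) using the
    # "ADD-ADD-SUB" decomposition 1.0111 = 1 + 0.1 - 0.0001, with the
    # subtraction replaced by adding the two's complement of D,
    # i.e. E = A + B + F + 1 with F = ~D, on a q-bit datapath.
    mask = (1 << q) - 1
    b = a >> 1            # B = A * 0.1    (binary)
    d = a >> 4            # D = A * 0.0001 (binary)
    f = (~d) & mask       # F = ~D, q-bit one's complement
    return (a + b + f + 1) & mask   # E = A + B - D  (mod 2^q)
```

For example, const_mul_log2e(100) yields 144, close to 100 · log₂e ≈ 144.27; the result is exact with respect to the shift-truncated "ADD-ADD-SUB" value A + (A >> 1) − (A >> 4).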
Carry-select methods [13], [14] could be used to replace the Q-bit carry-ripple adder in our design and reduce the critical path further. Considering the balance between area and speed, this is not adopted.

Fig. 3. The architecture for the constant multiplier.

IV. THE ARCHITECTURE OF THE NATURAL LOGARITHMIC UNIT

The LNU has a positive input F = Σ_{j=1}^{N} e^{x_j}, and its output is lnF. By limiting the input range of the logarithmic function, the calculation in the LNU can be simplified.

A. Mathematical Transformation of Logarithmic Calculation

We define k ∈ [1, 2) and let w be an integer. The real number pair (k, w) satisfies the following condition:

    ∀F > 0, ∃!(k, w) : F = 2^w · k.      (5)

A leading one detector (LOD) is required to compute k and w in the proposed architecture. In the LNU, the input of the LOD is F, and the output is the position of the highest bit "1" of F. Given the decimal point and the position of the highest one, the values of k and w are easy to calculate. An example of the relationship between F and the LOD output (k and w) is shown in TABLE I, assuming that the width of F is 8 and the decimal point lies between the fifth and sixth bits.

TABLE I
AN EXAMPLE OF THE RELATIONSHIP BETWEEN F AND (k, w)

    F (Binary)    Highest 1 in F    k (Binary)    w
    11011.111           0           1.1011111      4
    01010.101           1           1.0101010      3
    00001.101           4           1.1010000      0
    00000.011           6           1.1000000     -2

    lnF = ln2 · log₂F = ln2 · (w + log₂k).      (6)

According to (5) and (6), the input range of the logarithmic operation is limited to [1, 2). To compute log₂k, a linear fitting is used:

    log₂k ≈ k − 1,  k ∈ [1, 2).      (7)

Finally, the calculation of lnF can be expressed as:

    lnF = ln2 · (k − 1 + w).      (8)

B. Proposed Design for the LNU

According to (8), once k and w are known, the value of lnF is easy to obtain. Notice that the value of k is limited to [1, 2), so the calculation of (k − 1 + w) does not require any adder: the fractional part of (k − 1 + w) is equal to that of k, and w is the integer part of (k − 1 + w).

The hardware architecture of the LNU is described in Fig. 4. Note that the output of the LOD is w, and k = F >> w. The binary number 0.1011 is adopted as the approximation of ln2. The constant multiplier for 0.1011 involves "ADD-ADD-ADD" operations. It is implemented with the same carry-save method as in the EXPU.

Fig. 4. The proposed design for the LNU.

V. EXPERIMENTAL RESULTS

A. Simulation Results and Discussion on Accuracy

We choose four sets of random values with different ranges, denoted rand0.1, rand1, rand5 and rand10, as the input data for the experiment. Each set has 4096 numbers within the corresponding range. For instance, the numbers in set rand10 are random values lying in the range [-10, +10]. To meet the accuracy needs of different application scenarios, our architecture is designed to support inputs and outputs of both 8 bits and 16 bits. We take 128 bits of input data at a time and use 16 EXPUs to process the data concurrently. These units can be reconfigured to support sixteen 8-bit inputs or eight 16-bit inputs.

The values x_i − x_max, (i = 1, 2, ..., N), are used as the input of the hardware architecture, which brings two advantages. On the one hand, all input data are non-positive, ensuring that the results of the exponential operation are not greater than 1 and thus avoiding numerical overflow. On the other hand, the calculations related to the input values do not involve any sign bit, which saves storage space and simplifies the calculation.

In the experiment, we preprocess the input data in software to transform the floating-point numbers into fixed-point numbers. The software also gives the value Nin, which indicates where the decimal point of these fixed-point numbers is. The hardware architecture uses the input data together with this value Nin to obtain the output of the softmax function.
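The software-side preprocessing can be sketched as follows (the fractional bit count n_in = 4 and the rounding scheme are illustrative assumptions; the paper only specifies that the decimal-point position Nin accompanies the data):

```python
def preprocess(x, n_in=4):
    # Subtract the maximum so every input becomes non-positive: the
    # exponential results then stay <= 1 (no overflow) and no sign bit
    # is needed for the magnitudes.
    m = max(x)
    shifted = [v - m for v in x]              # all values <= 0
    # Scale by 2^n_in to obtain fixed-point integers; n_in is reported
    # to the hardware alongside the data.
    return [round(v * (1 << n_in)) for v in shifted], n_in
```

Since softmax is invariant under a common shift of its inputs, subtracting the maximum changes none of the output probabilities.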
Similarly, the hardware part gives the number Nout to help the software calculate the final results. With this method, the calculation accuracy can be greatly improved with a small increase in computational complexity and without affecting the speed.

The calculation accuracy under an error threshold of 8.00e-5 is shown in Fig. 5. The accuracy is defined as follows:

    accuracy = n_correct_results / n_input_numbers,      (9)

where n_correct_results is the number of correct results and n_input_numbers is the total number of input data. Here, correctness is determined by comparing the results calculated by the proposed architecture using fixed-point numbers with those calculated by C code using floating-point values. If the difference between the two corresponding results is less than the error threshold, the result is counted as correct. As shown in the figure, our architecture achieves very high accuracy at a low error threshold.

Fig. 5. The calculation accuracy under the error threshold 8.00e-5 for four sets of data.

B. Synthesis Results

The architecture is evaluated on two platforms: a Xilinx Zynq-7000 ZC706 development board and the TSMC 28-nm CMOS technology. The synthesis results are presented in TABLE II. The throughput (T) in the table is defined as the number of outputs per second:

    T = (128 bit · f_clk) / data width.      (10)

The efficiency (E) is defined as:

    E = T / (area · power).      (11)

As shown in the table, the data processing speed is sufficient for real-time applications.

TABLE II
KEY FEATURES OF THE PROPOSED HARDWARE ARCHITECTURE FOR SOFTMAX

    Characteristic     FPGA                        TSMC 28nm
    Clock Frequency    436 MHz¹                    2.8 GHz
    Area               2564 LUTs, 2794 Registers   0.015 mm²
    Power              0.751 W                     51.60 mW
    Throughput         6.976 G/s (8-bit)           44.8 G/s (8-bit)
                       3.488 G/s (16-bit)          22.4 G/s (16-bit)
    Efficiency         -                           57.88 G/(mm² · mW) (8-bit)
                                                   28.94 G/(mm² · mW) (16-bit)

    ¹ This frequency indicates how fast the architecture can run. The frequency in practical applications will be limited by the speed of data transmission and the clock speed of the development board.

VI. CONCLUSION

In this paper, an efficient hardware implementation of the softmax function is presented for deep neural networks. Algorithm-hardware co-optimizations, including multiple algorithmic strength reduction methods, linear fitting and component reuse, are performed to reduce hardware complexity and datapath delay. In addition, regular multipliers and lookup tables are eliminated in this design. With all these optimizations, the proposed architecture achieves a very high efficiency of 463.04 Gb/(mm² · mW), demonstrating great potential for many real-time applications on embedded systems.

REFERENCES

[1] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[2] D. Kim, J. Ahn, and S. Yoo, "A novel zero weight/activation-aware hardware architecture of convolutional neural network," in Design, Automation & Test in Europe Conference & Exhibition, 2017, pp. 1466-1471.
[3] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and B. Dally, "Deep compression and EIE: Efficient inference engine on compressed deep neural network," in Hot Chips 28 Symposium, 2017, pp. 1-6.
[4] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, and Y. Wang, "ESE: Efficient speech recognition engine with sparse LSTM on FPGA," 2017.
[5] Z. Wang, J. Lin, and Z. Wang, "Accelerating recurrent neural networks: A memory-efficient approach," IEEE Transactions on Very Large Scale Integration Systems, vol. PP, no. 99, pp. 1-13, 2017.
[6] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang, "C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2018, pp. 11-20.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, Nov. 1998.
[8] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764-1772.
[9] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4945-4949.
[10] B. Yuan, "Efficient hardware architecture of softmax layer in deep neural network," in System-On-Chip Conference, 2017, pp. 323-326.
[11] T. G. Noll, "Carry-save architectures for high-speed digital signal processing," Journal of VLSI Signal Processing Systems for Signal, Image & Video Technology, vol. 3, no. 1-2, pp. 121-140, 1991.
[12] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-time-power tradeoffs in parallel adders," IEEE Transactions on Circuits & Systems II: Analog & Digital Signal Processing, vol. 43, no. 10, pp. 689-702, 2002.
[13] O. J. Bedrij, "Carry-select adder," IRE Transactions on Electronic Computers, vol. EC-11, no. 3, pp. 340-346, 1962.
[14] B. Ramkumar and H. M. Kittur, "Low-power and area-efficient carry select adder," IEEE Transactions on Very Large Scale Integration Systems, vol. 20, no. 2, pp. 371-375, 2012.