2018 IEEE Asia Pacific Conference on Circuits and Systems
A High-Speed and Low-Complexity Architecture
       for Softmax Function in Deep Learning
                                         Meiqi Wang, Siyuan Lu, Danyang Zhu,
                               Jun Lin, Member, IEEE, and Zhongfeng Wang, Fellow, IEEE
                  School of Electronic Science and Engineering, Nanjing University, Nanjing, China
      Email: mqwang99@gmail.com, sylu@smail.nju.edu.cn, zhudanyang10@foxmail.com, {jlin, zfwang}@nju.edu.cn
Abstract—Recently, significant improvement has been achieved for hardware architecture design of deep neural networks (DNNs). However, the hardware implementation of the widely used softmax function in DNNs, which involves expensive division and exponentiation units, has not been much investigated. This paper presents an efficient hardware implementation of the softmax function. Mathematical transformations and linear fitting are used to simplify this function. Multiple algorithmic strength reduction strategies and fast addition methods are employed to optimize the architecture. By using these techniques, complicated logic units like multipliers are eliminated and the memory consumption is largely reduced, while the accuracy loss is negligible. The proposed design is coded in a hardware description language (HDL) and synthesized under the TSMC 28-nm CMOS technology. Synthesis results show that the architecture achieves a throughput of 6.976 G/s for 8-bit input data. A power efficiency of 463.04 Gb/(mm² · mW) is achieved, and the design costs only 0.015 mm² of area. To the best of our knowledge, this is the first work on efficient hardware implementation for softmax in the open literature.

Index Terms—softmax, hardware implementation, efficient hardware architecture, deep neural network, VLSI

I. INTRODUCTION

Deep neural networks (DNNs) have achieved great success in the big data area since their resurgence in 2006 [1]. Efficient hardware architectures for DNNs are one of the hot topics in both academia and industry [2]-[6]. The softmax layer is widely used in different DNNs [7]-[9]. However, most previous works focus on optimizing matrix multiplications in DNNs, and efficient hardware implementation of the softmax function has not been much investigated. Efficient hardware architectures for the softmax layer are desired for high-speed deep learning systems. The exponentiation and division calculations in the softmax function are quite expensive, especially in embedded systems. A hardware architecture for softmax has been proposed in [10]. However, no detailed discussion of the implementation of the exponential and logarithmic units, which have very high complexity, was presented in [10]. In this paper, we perform an efficient hardware implementation of the softmax function. Our main contributions are as follows:
• Mathematical transformations are used to transform the exponential and logarithmic (exp-log) functions into exp-log operations with a limited input range, together with some other simple operations including constant multiplication, shift and addition. Linear fitting is used to calculate the exp-log function within a specific range. After the optimization, the main calculations in the exponential unit require only a shift unit, two adders and a constant multiplier. Calculations in the logarithmic unit require only a shift unit, a leading one detector (LOD) unit and three adders.
• The constant multiplier is optimized using the algorithmic strength reduction strategy and is replaced by multiple adders. The adders are further optimized using the carry-save method [11], [12].

The rest of this paper is organized as follows. Section II gives a brief review of softmax and the domain transformation of this function. Sections III and IV propose designs for the exponential unit and the logarithmic unit, respectively. The proposed architecture of the softmax function is evaluated in Section V. Section VI concludes this paper.

II. BACKGROUND

The softmax function can be expressed as follows:

    f(x_i) = e^{x_i} / Σ_{j=1}^{N} e^{x_j},  (i = 1, 2, ..., N),      (1)

where x_1, x_2, ..., x_N are the input values of the softmax layer and the output value f(x_i) represents the probability that the sample belongs to the i-th category.

A mathematical transformation of softmax involving exponential and logarithmic operations is adopted for the hardware implementation, as shown in (2):

    f(x_i) = exp(ln(e^{x_i} / Σ_{j=1}^{N} e^{x_j}))
           = exp(x_i − ln(Σ_{j=1}^{N} e^{x_j})),  (i = 1, 2, ..., N).      (2)

III. THE ARCHITECTURE OF THE EXPONENTIAL UNIT

The overall architecture of the softmax function is shown in Fig. 1. It consists of multiple exponential units (EXPUs), one adder tree, one natural logarithmic unit (LNU) and some other simple units. The most expensive calculations are in the EXPU and the LNU. In Sections III and IV, efficient architectures for the two units are introduced.
978-1-5386-8240-1/18/$31.00 ©2018 IEEE
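The transformation in (2) can be illustrated with a short floating-point sketch (the function names are ours; this models the arithmetic only, not the fixed-point hardware datapath):

```python
import math

def softmax_direct(x):
    # Direct form (1): one exponential per input and one division per output.
    s = sum(math.exp(v) for v in x)
    return [math.exp(v) / s for v in x]

def softmax_transformed(x):
    # Transformed form (2): the division is folded into the exponent as a
    # subtraction of ln(sum of exponentials), which is the quantity the
    # adder tree and LNU produce for the second pass of EXPUs.
    ln_f = math.log(sum(math.exp(v) for v in x))
    return [math.exp(v - ln_f) for v in x]
```

Both functions produce the same probabilities; the second form avoids per-output division, which is the property the proposed architecture exploits.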
Fig. 1. The overall architecture for the softmax function.

A. Mathematical Transformation of Exponential Calculation

Exponential operations are composed of a number of multiplications. Multiple multiplications consume plenty of resources and cause long datapath delays in hardware implementation. Moreover, if the input cannot be limited to a specific range, the value of the exponent is difficult to calculate. Considering these difficulties, a mathematical transformation is applied to the exponential function.

First, the base number e is replaced by the base number 2, and the exponential function becomes:

    e^{y_i} = 2^{y_i · log₂e}.      (3)

Then the value y_i · log₂e is split into an integer part u_i and a fractional part v_i. Note that u_i + v_i = y_i · log₂e and u_i = ⌊y_i · log₂e⌋. In the hardware implementation, the value y_i · log₂e is a fixed-point number, which makes it easy to split the integer and fractional parts. The calculation can then be simplified as follows:

    e^{y_i} = 2^{u_i + v_i} = 2^{u_i} · 2^{v_i} = { 2^{v_i} << u_i,     u_i > 0
                                                  { 2^{v_i} >> (−u_i),  u_i ≤ 0.      (4)

Through this transformation, we replace the original operation "f(y_i) = e^{y_i}, −∞ < y_i < +∞" with the operation "f(v_i) = 2^{v_i}, 0 < v_i ≤ 1" and a simple shift operation. The operation "f(v_i) = 2^{v_i}, 0 < v_i ≤ 1" can be implemented with a lookup table or a linear fitting function, which greatly reduces the hardware complexity.

B. Proposed Design for the EXPU

The hardware architecture of the EXPU is shown in Fig. 2. It consists of two adders, a constant multiplier and a shift operation. This architecture is designed to support both exponential operations in (2), and the two operations are slightly different. In the second exponential calculation, lnF (the output of the LNU in Fig. 1) needs to be subtracted from the input value. Moreover, the results of the first exponential calculation are accumulated and stored, while those of the second one are taken as the final outputs.

Fig. 2. The proposed design for the EXP Unit.

The range of v_i in f(v_i) = 2^{v_i} is limited to (0, 1], so the function f(v_i) = 2^{v_i} can be approximated by the linear functions f(v_i) = v_i + d₁ or f(v_i) = v_i + d₂ within this specific range. The two bias values d₁ and d₂ correspond to the first and second exponential operations, respectively. In this way, no multiplier or lookup table is needed in these approximated functions, and the calculation is simplified with negligible accuracy loss.

C. Improved Architecture for the Constant Multiplier

The constant multiplication by log₂e appears many times in the proposed architecture, and it lies on the critical path of the system before optimization. In this subsection, optimizations of this calculation at both the algorithmic level and the architectural level are developed.

First, we use the algorithmic strength reduction strategy to reduce the total number of operations in the constant multiplication. The value log₂e is approximated by 1.0111 in binary, which can be expressed as 1.0111 = 1 + 0.1 − 0.0001, called the "ADD-ADD-SUB" form. In this way, the three additions and three shifts in the constant multiplier are simplified into one addition, one subtraction and two shifts.

In the hardware implementation, we first transform the "ADD-ADD-SUB" operation into an "ALL-ADD" operation, and then adopt the carry-save technique to simplify these additions. The overall architecture of the constant multiplier is shown in Fig. 3.

Suppose E = A × 1.0111. In our implementation, A is an unsigned number, so we have B = A >> 1, D = A >> 4 and E = A + B − D. The subtraction of D is equivalent to adding its two's complement. Defining F = ∼D, we have E = A + B + F + 1.

In our architecture, both A and E have Q bits. The critical path of the traditional method is the operation time of 3Q FAs (full adders). With the carry-save method, the critical path is reduced to the operation time of Q + 1 FAs, and the total number of FAs is reduced from 3Q to 2Q.

Compared with the results without optimization, not only is the critical path shortened, but the area and power consumption are also significantly reduced due to the simplification.
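The "ALL-ADD" constant multiplication can be sketched as a behavioral model (this is software illustration only, not the carry-save hardware; the word length Q = 16 is an assumption):

```python
def const_mul_log2e(a, q=16):
    # Multiply an unsigned value a by log2(e) ~ 1.0111 (binary) using the
    # "ADD-ADD-SUB" decomposition 1.0111 = 1 + 0.1 - 0.0001, with the
    # subtraction replaced by adding the two's complement of D,
    # i.e. E = A + B + F + 1 with F = ~D, on a q-bit datapath.
    mask = (1 << q) - 1
    b = a >> 1            # B = A * 0.1    (binary)
    d = a >> 4            # D = A * 0.0001 (binary)
    f = (~d) & mask       # F = ~D, q-bit one's complement
    return (a + b + f + 1) & mask   # E = A + B - D  (mod 2^q)
```

For example, const_mul_log2e(100) yields 144, close to 100 · log₂e ≈ 144.27; the result is exact with respect to the shift-truncated "ADD-ADD-SUB" value A + (A >> 1) − (A >> 4).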
Carry-select methods [13], [14] could be used to replace the Q-bit carry-ripple adder in our design and reduce the critical path further. Considering the balance between area and speed, this is not adopted.

Fig. 3. The architecture for the constant multiplier.

IV. THE ARCHITECTURE OF THE NATURAL LOGARITHMIC UNIT

The LNU has a positive input F = Σ_{j=1}^{N} e^{x_j}, and its output is lnF. By limiting the input range of the logarithmic function, the calculation in the LNU can be simplified.

A. Mathematical Transformation of Logarithmic Calculation

We define k ∈ [1, 2) and let w be an integer. The real number pair (k, w) satisfies the following condition:

    ∀F > 0, ∃!(k, w) : F = 2^w · k.      (5)

A leading one detector (LOD) is required to compute k and w in the proposed architecture. In the LNU, the input of the LOD is F, and the output is the position of the highest bit "1" of F. Given the decimal point and the position of the highest one, the values of k and w are easy to calculate. An example of the relationship between F and the LOD output (k and w) is shown in TABLE I, assuming that the width of F is 8 and the decimal point lies between the fifth and sixth bits.

TABLE I
AN EXAMPLE OF THE RELATIONSHIP BETWEEN F AND (k, w)

    F (Binary)    Highest 1 in F    k (Binary)    w
    11011.111           0           1.1011111      4
    01010.101           1           1.0101010      3
    00001.101           4           1.1010000      0
    00000.011           6           1.1000000     -2

    lnF = ln2 · log₂F = ln2 · (w + log₂k).      (6)

According to (5) and (6), the input range of the logarithmic operation is limited to [1, 2). To compute log₂k, a linear fitting is used:

    log₂k ≈ k − 1,  k ∈ [1, 2).      (7)

Finally, the calculation of lnF can be expressed as:

    lnF = ln2 · (k − 1 + w).      (8)

B. Proposed Design for the LNU

According to (8), once k and w are known, the value of lnF is easy to obtain. Notice that the value of k is limited to [1, 2), so the calculation of (k − 1 + w) does not require any adder: the fractional part of (k − 1 + w) is equal to that of k, and w is the integer part of (k − 1 + w).

The hardware architecture of the LNU is described in Fig. 4. Note that the output of the LOD is w, and k = F >> w. The binary number 0.1011 is adopted as the approximation of ln2. The constant multiplier for 0.1011 involves "ADD-ADD-ADD" operations. It is implemented with the same carry-save method as in the EXPU.

Fig. 4. The proposed design for the LNU.

V. EXPERIMENTAL RESULTS

A. Simulation Results and Discussion on Accuracy

We choose four sets of random values with different ranges, denoted rand0.1, rand1, rand5 and rand10, as the input data for the experiment. Each set has 4096 numbers within the corresponding range. For instance, the numbers in set rand10 are random values lying in the range [-10, +10]. To meet the accuracy needs of different application scenarios, our architecture is designed to support inputs and outputs of both 8 bits and 16 bits. We take 128 bits of input data at a time and use 16 EXPUs to process the data concurrently. These units can be reconfigured to support sixteen 8-bit inputs or eight 16-bit inputs.

The values x_i − x_max, (i = 1, 2, ..., N), are used as the input of the hardware architecture, which brings two advantages. On the one hand, all input data are non-positive, ensuring that the results of the exponential operation are not greater than 1 and thus avoiding numerical overflow. On the other hand, the calculations related to the input values do not involve any sign bit, which saves storage space and simplifies the calculation.

In the experiment, we preprocess the input data in software to transform the floating-point numbers into fixed-point numbers. The software also gives the value Nin, which indicates where the decimal point of these fixed-point numbers is. The hardware architecture uses the input data together with this value Nin to obtain the output of the softmax function.
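The software-side preprocessing can be sketched as follows (the fractional bit count n_in = 4 and the rounding scheme are illustrative assumptions; the paper only specifies that the decimal-point position Nin accompanies the data):

```python
def preprocess(x, n_in=4):
    # Subtract the maximum so every input becomes non-positive: the
    # exponential results then stay <= 1 (no overflow) and no sign bit
    # is needed for the magnitudes.
    m = max(x)
    shifted = [v - m for v in x]              # all values <= 0
    # Scale by 2^n_in to obtain fixed-point integers; n_in is reported
    # to the hardware alongside the data.
    return [round(v * (1 << n_in)) for v in shifted], n_in
```

Since softmax is invariant under a common shift of its inputs, subtracting the maximum changes none of the output probabilities.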
Similarly, the hardware part gives the number Nout to help the software calculate the final results. With this method, the calculation accuracy can be greatly improved with a small increase in computational complexity and without affecting the speed.

The calculation accuracy under an error threshold of 8.00e-5 is shown in Fig. 5. The accuracy is defined as follows:

    accuracy = n_correct_results / n_input_numbers,      (9)

where n_correct_results is the number of correct results and n_input_numbers is the total number of input data. Here, correctness is determined by comparing the results calculated by the proposed architecture using fixed-point numbers with those calculated by C code using floating-point values. If the difference between the two corresponding results is less than the error threshold, the result is counted as correct. As shown in the figure, our architecture achieves very high accuracy at a low error threshold.

Fig. 5. The calculation accuracy under the error threshold 8.00e-5 for four sets of data.

B. Synthesis Results

The architecture is evaluated on two platforms: a Xilinx Zynq-7000 ZC706 development board and the TSMC 28-nm CMOS technology. The synthesis results are presented in TABLE II. The throughput (T) in the table is defined as the number of outputs per second:

    T = (128 bit · f_clk) / data width.      (10)

The efficiency (E) is defined as:

    E = T / (area · power).      (11)

As shown in the table, the data processing speed is sufficient for real-time applications.

TABLE II
KEY FEATURES OF THE PROPOSED HARDWARE ARCHITECTURE FOR SOFTMAX

    Characteristic     FPGA                        TSMC 28nm
    Clock Frequency    436 MHz¹                    2.8 GHz
    Area               2564 LUTs, 2794 Registers   0.015 mm²
    Power              0.751 W                     51.60 mW
    Throughput         6.976 G/s (8-bit)           44.8 G/s (8-bit)
                       3.488 G/s (16-bit)          22.4 G/s (16-bit)
    Efficiency         -                           57.88 G/(mm² · mW) (8-bit)
                                                   28.94 G/(mm² · mW) (16-bit)

    ¹ This frequency indicates how fast the architecture can run. The frequency in practical applications will be limited by the speed of data transmission and the clock speed of the development board.

VI. CONCLUSION

In this paper, an efficient hardware implementation of the softmax function is presented for deep neural networks. Algorithm-hardware co-optimizations, including multiple algorithmic strength reduction methods, linear fitting and component reuse, are performed to reduce hardware complexity and datapath delay. In addition, regular multipliers and lookup tables are eliminated in this design. With all these optimizations, the proposed architecture achieves a very high efficiency of 463.04 Gb/(mm² · mW), demonstrating great potential for many real-time applications on embedded systems.

REFERENCES

[1] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[2] D. Kim, J. Ahn, and S. Yoo, "A novel zero weight/activation-aware hardware architecture of convolutional neural network," in Design, Automation & Test in Europe Conference & Exhibition, 2017, pp. 1466-1471.
[3] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and B. Dally, "Deep compression and EIE: Efficient inference engine on compressed deep neural network," in Hot Chips 28 Symposium, 2017, pp. 1-6.
[4] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, and Y. Wang, "ESE: Efficient speech recognition engine with sparse LSTM on FPGA," 2017.
[5] Z. Wang, J. Lin, and Z. Wang, "Accelerating recurrent neural networks: A memory-efficient approach," IEEE Transactions on Very Large Scale Integration Systems, vol. PP, no. 99, pp. 1-13, 2017.
[6] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang, "C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2018, pp. 11-20.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, Nov. 1998.
[8] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764-1772.
[9] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4945-4949.
[10] B. Yuan, "Efficient hardware architecture of softmax layer in deep neural network," in System-On-Chip Conference, 2017, pp. 323-326.
[11] T. G. Noll, "Carry-save architectures for high-speed digital signal processing," Journal of VLSI Signal Processing Systems for Signal, Image & Video Technology, vol. 3, no. 1-2, pp. 121-140, 1991.
[12] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-time-power tradeoffs in parallel adders," IEEE Transactions on Circuits & Systems II: Analog & Digital Signal Processing, vol. 43, no. 10, pp. 689-702, 2002.
[13] O. J. Bedrij, "Carry-select adder," IRE Transactions on Electronic Computers, vol. EC-11, no. 3, pp. 340-346, 1962.
[14] B. Ramkumar and H. M. Kittur, "Low-power and area-efficient carry select adder," IEEE Transactions on Very Large Scale Integration Systems, vol. 20, no. 2, pp. 371-375, 2012.