2011
2012
                       Second
                          International
                               International
                                        Conference
                                             Conference
                                                    on Intelligent
                                                        on Intelligent
                                                                   Systems
                                                                       System
                                                                           Design
                                                                              Design
                                                                                  andand
                                                                                      Engineering
                                                                                         Engineering
                                                                                                  Application
                                                                                                     Application
   A Novel Design of 1024-point Pipelined FFT Processor Based on Cordic Algorithm
                                          Shi Jiangyi, Tian Yinghui, Wang Mingxing ,Yang Zhe
                                      Dept Microelectronics Xidian University, Xian, Shannxi, 710071, China
                                    School of Technical Physics Xidian University, Xian, Shaanxi, 710071 China
                                                                nmgtyh@163.com
  Abstract—A novel ASIC design of a 1024-point pipelined FFT                                                            nk
                                                                                                X(k)= X (k) +W X (k)
                                                                                                            1           N     2
                                                                                                                                                (2)
  Processor is introduced in this paper. This is a new FFT
                                                                                                                               nk
  architecture based on the radix-2 algorithm and cordic                                        X (k+N/2) = X (k) -W1X (k)     N      2
                                                                                                                                                (3)
  algorithm. The solution adopted in the design is used to
  achieve a high-frequency FFT processor of which the
                                                                                      Where k refers to 0, 1, 2…, N/2-1.
  frequency could reach as high as 330MHz. It also shows an                           N-point DFT of x(n) could be ascertained from both
  advantage in saving up to fifty percent hardware resources                       equation (2) and (3) with the values of X1(k) and X2 (k)
  over traditional FFT processor.                                                  could be identified by the way mentioned above.
      Keywords-FFT; pepeline; cordic
                               I.   INTRODUCTION
       DFT plays an important role in digital signal processing,
                                                                                              Figure 1. The basic element of butterfly algorithm
  it is a basic operational unit in signal transformation from
  time domain to frequency domain. FFT, a fast algorithm, is
                                                                                      Equation (2) and (3), also known as butterfly algorithm,
  widely used in wireless communication, voice and image
                                                                                   are shown in figure-1.Figure-2 illustrates an 8-point FFT
  processing, spectrum analysis and so on. With the number of
                                                                                   algorithm which is composed of basic butterfly algorithm.
  point increasing, the need of hardware resources multiples
  while the computational speed slows down. The common
  solution used to enhance the performance of FFT processor
  is to increase the number of pipeline-stage, but this will lead
  to the increase in hardware resource and power, and the
  speed will be limited as well. Therefore, we need to reduce
  the scale of hardware resources on the basis of ensuring
  high-performance. Cordic algorithm and a new structure are
  adopted in the design in this paper to reduce the hardware
  resource and achieve a high frequency and better
  performance.                                                                                         Figure 2. 8-point FFT algorithm
       An N-point DFT is defined as:
                        N −1
                        ¦ x ( n )W
                                         nk                                                              II.    ARCHITECTURE
            X (k) =                      N
                                              k=0, 1, 2..., N-1 (1)
                        n=0                                                            The 1024-point FFT pipelined processor presented in this
                         refers to exp (-j 2π nk ) and x (n) is the
                    nk                                                             paper is based on radix-2 algorithm. The system consists of a
      Where    W    N
                                           N                                       serial-to-parallel module, ten FFT computational modules, a
  input signal.                                                                    parallel-to-serial module and a sequence reversion module.
      FFT, a fast algorithm shown in equation (1), could be                        Figure-3 shows the architecture of the FFT processor.
  divided into DIT and DIF in terms of decimation, and could
  also be separated into radix-2, radix-4 etc. with respect to
  radix. X (n) could be divided into odd part and even part
  using radix-2 DIT in equation (1). N/2-point DFT is
  computed in both part to identify the value of X1(k) and.
   X (k)
     2
      Based on the periodicity and symmetry of W nk , the
                                                                   N
  following equations are obtained:                                                        Figure 3. New architecture of 1024-point pipelined FFT
978-0-7695-4608-7/12 $26.00 © 2012
978-0-7695-4608-7/11          2011 IEEE                                       80
DOI 10.1109/ISdea.2012.503
    10.1109/ISdea.2011.126
Authorized licensed use limited to: Indian Institute of Technology Hyderabad. Downloaded on April 03,2024 at 05:05:32 UTC from IEEE Xplore. Restrictions apply.
 A. Serial-to-parallel and stage one module                                        while the result of the subtracter goes into SRAM2 and
     In this paper, the serial to parallel module and state one                    SRAM3. When the number of data coming from the adder
 module are connected together. They work as a whole.                              equals to 128, the result of the adder will go into the
 Because of there is only one butterfly factor in stage one,                       butterfly module directly, simultaneously, the data stored in
 one adder could achieve its function. In serial-to-parallel                       SRAM1 beforehand will go into butterfly module as well,
 module, the frequency of clock controlling read is half of                        completing the butterfly computation. When the butterfly
 that of write, ensuring that the number of input equals to that                   computation is running, the data from subtracter go into
 of the output. Controlled by control module, the data from                        SRAM3, until the butterfly computation of the adder part is
 serial-to-parallel to stage one module (just one adder) are the                   over. Meanwhile, data in SRAM2 and SRAM3 go into the
 data that could be access in, thus the two result are going                       butterfly module synchronously, and data from the adder go
 direct to the adder to accomplish the computation of state                        into SRAM1 consecutively. A control module is
 one. The structure as shown in figure-4 comprises 4 SRAMs,                        incorporated to control the processing above, ensuring the
 using ping-pong operation to make sure the continuous                             two states above work alternately.
 output of data.
    Figure 4. The architecture of Serial-to-parallel and stage one module
 B. Parallel-to-serial module
     This module is similar to the serial-to-parallel module
 except that it is a 2-data-input and 1-data-output module,
 leading to the save of a lot of resources with only four
 registers are needed to implement its function. Every two
 registers work together. Four registers combined comprise
 the ping-pong operation, ensuring the continuous output of
                                                                                     Figure 5. The architecture of Serial-to-parallel and stage one module
 data.
 C. FFT compute module
     The first and the second stage are simplified from the
 third level. In stage one, there is only one butterfly factor,
 thereforeˈwe can use an adder to accomplish the butterfly
 computation. This way of design could spare the use of
 cordic multiplier and rom used to store the butterfly factor.
 The principle of stage two is the same as stage one. Figure-5
 shows the structure of stage three of which the depth of the
 memory is 128. When the stage increases by 1, the depth of
 the memory will decrease by fifty percent accordingly. So
 this structure could save a lot of resource compared to other                                          Figure 6. Compute processing
 designs. Figure-6 and figure-7 illustrates the computing
 processing and the state diagram respectively.
     In stage three, the depth of SRAM1, SRAM2 and
 SRAM3 (SRAM1 and SRAM3 are both single-port rams
 while SRAM2 is dual-port ram used for synchronous read
 and write) are all 128. The output of the previous stage is
 made of two parts, the result of the adder and the result of
 the subtracter. The result of the adder goes into SRAM1
                                                                              81
Authorized licensed use limited to: Indian Institute of Technology Hyderabad. Downloaded on April 03,2024 at 05:05:32 UTC from IEEE Xplore. Restrictions apply.
                                                                                     δ    0   δ    1     δ   2     δ     3   δ   4    δ     5   δ   6        δ   7
                                                                                      0        1         0          -1       -1         0       -1           0
                                                                                     δ   8    δ    9     δ   10    δ   11    δ   12   δ   13    δ   14       δ   15
                                                                                      0        -1        1           1       0          0       1            0
                          Figure 7. State diagram
    In figure-7, s0, s1, s2 and s3 stand for the state while the
 number 1,2,3,4 and 5 stand for the state shift conditions
 which are described in detail in Table1.
     TABLE1 Descriptions of The Shift Conditions
  Number                        Descriptiion
     1                            ready=1
     2                      address=127, ready=1
     3                     address==127, ready=1
     4                      address=127, ready=1
     5              address==127, ready=0, add_rom=3
                                                                                                              Figure 8. State diagram
 D. Cordic module                                                                  E. Inversion sequence module
     The multiplier which is widely used in complex                                   In this paper, a 1024-point FFT processor is designed, so
 multiplication module costs four multipliers and two adders                       a 10-bit-width address bus is needed. To accomplish the
 in each butterfly computation .However, instead, it only                          sequence reversion, we have to reverse the address bus, as is
 costs one cordic module in each butterfly computation, thus,                      shown in figure-9.
 it will save a lot of resource. In this paper, a cordic module is
 used to complete complex multiplication. The structure of
 the whole module is shown in figure-8.
     The cordic-pre module, cordic module, cordic-last
 module has been introduced in many papers, but the cordic
 adjust module may not be adopted frequently. Using this
 module, the accuracy could be improved greatly. The cordic                                                                                                                                                         
 is implemented by iteration which consists of shift and
                                                                                                         Figure 9. Inversion sequence
 addition operations. In the cordic module of this proposed
 design, 18 times iteration is adopted to obtain a higher
 precision. After each operation, the modulus will increase.                                      III.   PERFORMANCE ANALYSIS
 According to the theoretical calculations, the value is a
 constant generated by multiplying 0.60725 to the value                                1024-point FFT pipelined processor is the core element
 output from the cordic-last module. The iterative formula (4)                     on radar signal processing chip. The system has been
 and (5) are used to achieve this goal.                                            verified at RTL level in Modelsim6.5 SE simulation
                                                                                   environment and has passed the verification of DC
                         x = x -δ x 2
                                                      −i
                           i +1     i     i   i
                                                           (4)                     verification platform. From the DC report we know that the
                                                                                   frequency could reach as high as 330MHz and the area is
                         y = y -δ y 2
                            i +1    i     i       i
                                                      −i
                                                           (5)                     less than traditional designs. From the comparison by the
                                                                                   matlab, we know that the precision is 0.5%.
     Where δ refers to 0, 1 or -1.                                                     The input data start from 1 to 1024. Figure-10 shows the
     After the simulation by Matlab, the δ is chosen from                          real part and the imaginary part of the output. The axis-x
 Table2.                                                                           stands for the sequence of the output while the axis-y stands
     The accuracy of the cordic is related to the numbers of                       for its value. Figure-11 shows the output from Matlab. It is
 iteration and the width of the bits. Five protection-bits used                    quite clear that both graphs have the same trend from figure-
 in the design in this paper could enhance the precision                           11, indicating that the result is right. Due to the fact that a
 greatly.                                                                          complex multiplication module uses either a general
                                                                                   multiplier or a cordic multiplier, errors are inevitable.
             TABLE2 Description of The Shift Conditions                            Figure-12 shows the relative error of our proposed FFT
                                                                              82
Authorized licensed use limited to: Indian Institute of Technology Hyderabad. Downloaded on April 03,2024 at 05:05:32 UTC from IEEE Xplore. Restrictions apply.
 processor while figure-13 shows the relative error of other                          The modelsim simulation diagram is shown in figure-14.
 FFT processors that use general multiplier. For both figures,                     From the figure, it is quite clear that this proposed design is
 axis-x stands for the sequence of the output and axis-y stands                    pipelined.
 for the relative error. From figure-12, we can find that the
 maximum value is less than 0.6%. However, from figure13,
 we can see that the maximum value is less than 2.5%. So the
 precision is improved a lot in our proposed design.
                                                                                                                                                            
                                                                                                 Figure 14. The modelsim simulation diagram
                                                                                                          IV.    CONCLUSION
                   Figure 10. The output of FFT processor                              This paper shows the flow of our design, a 1024-point
                                                                                   pipelined processor. By using the new structure, the SRAM
                                                                                   is saved while the function of pipeline is secured. By using
                                                                                   cordic algorithm, the ROM is saved while the frequency and
                                                                                   the precision are enhanced greatly.
                                                                                                           ACKNOWLEDGMENT
                                                                                      Support for this work was provided by Natural Science
                                                                                   Basic Research Plan in Shaanxi Province of China
                                                                                   (2010JM8015).
                                                                                                                REFERENCES
                       Figure 11. The output of Matlab                             [1]  LENORE R. MULLIN and SHARON G. SMALL, "Four Easy Ways
                                                                                        to a Faster FFT”,Journal of Mathematical Modelling and
                                                                                        Algorithmsl:193-21 4; 2002.
                                                                                   [2] DAISUKE           TAKAHASHI,YASUMASA                 KANADA,"High-
                                                                                        Performance adix-2,3 and 5 Parallel1-D Complex FFT Algorithms
                                                                                        for Distributed-Memory Parallel Computers”, Journal of
                                                                                        Supercomputing, 15, 207228(2000)
                                                                                   [3] Volder J E. The CORDIC trigonometric computing technique[J]. IRE
                                                                                        Transactions on Electronic Computers,1959,8(3):330-334
                                                                                   [4] Wang S,Piuri V, Wartzlander E. Hybrid CORDIC algorithms [J].
                                                                                        IEEE
                                                                                         Trans. on Computers, 1997, 46 (11):1202-1207
                                                                                   [5] Baas B M. A Low-Power, High-performance 1024-Point
                                                                                        FFTProcessor[J]. IEEE Journal of Solid-State Circuits 1999, 34 (3 ):
                                                                                        380-387
       Figure 12. The comparison between FFT processor and Matlab                  [6] Cooley J W, Tukey J. Proc IEEE ASSP. IEEE Signal Processing
                                                                                        Society, 1992.
                                                                                   [7] Cooley J W IEEE Signal Processing Magazine, 1992, 9(1)
                                                                                   [8] J. Giacomantone, H. Villagarcia, O. Bria. A fast CORDIC Co-
                                                                                        Processor Architecture for Digital Signal Processing Applications [J].
                                                                                        VI Congreso Argentino de Ciencias de la Computacion (VI CACIC).
                                                                                        October 2000.
                                                                                   [9] Maharatna K, Grass E, Jagdhold U. A 64-Point Fourier Transform
                                                                                        Chip for High-Speed Wireless LAN Application Using OFDM. IEEE
                                                                                        Journal of Solid-State Circuit. 2004, Vol. 39 (3):484~493
                                                                                   [10] Sansaloni T, Pascual A P, Tomes V, et al. FFT Spectrum Analyzer
                                                                                        Project for Teaching Digital Signal Processing With FPGA Devices.
                                                                                       IEEE Transactions on Education, 2007, Vol. 50 (3):229 ϔ 235
       Figure 13. The comparison between FFT processor and Matlab
                                                                              83
Authorized licensed use limited to: Indian Institute of Technology Hyderabad. Downloaded on April 03,2024 at 05:05:32 UTC from IEEE Xplore. Restrictions apply.