0% found this document useful (0 votes)
102 views6 pages

Radix-4 Feed Back Design Using Four Parallel Adders With Efficient Reordering Module For Fast Fourier Transform

The document proposes a new radix-4 feedback design for fast Fourier transforms (FFTs) using four parallel adders and efficient reordering modules. The design uses eight parallel adders to compute four outputs simultaneously, along with a multiplier unit for twiddle factors and memory. The novel data flow uses parallel-in parallel-out buffers to properly reorder outputs at each stage. Compared to previous standard methods, the proposed method generates FFTs using reduced storage, limited complex adders and multipliers.

Uploaded by

Pushpa Kotipalli
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
102 views6 pages

Radix-4 Feed Back Design Using Four Parallel Adders With Efficient Reordering Module For Fast Fourier Transform

The document proposes a new radix-4 feedback design for fast Fourier transforms (FFTs) using four parallel adders and efficient reordering modules. The design uses eight parallel adders to compute four outputs simultaneously, along with a multiplier unit for twiddle factors and memory. The novel data flow uses parallel-in parallel-out buffers to properly reorder outputs at each stage. Compared to previous standard methods, the proposed method generates FFTs using reduced storage, limited complex adders and multipliers.

Uploaded by

Pushpa Kotipalli
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

Radix-4 Feed Back Design using Four Parallel adders with efficient

Reordering module for Fast Fourier Transform


1Prasanna Kumar Godi, 1Krishna T Battula and 2Pushpa Kotipalli

1ECE, JNTUK University, Kakinada, India

2
Shri Vishnu Engineering College for women, Bhimavarm,India

Abstract:

A new delay and feedback design for Fast Fourier Transform design with efficient reordering modules
for outputs is proposed in this paper, it is designed based on radix-4 four parallel complex adders. The proposed
design has eight parallel adders for computing four outputs simultaneously, multiplier unit for twiddle factor
multiplication and memory unit to store the running data. The proposed design has novel data flow, proper
reordering of outputs at each stage is achieved by using Parallel in Parallel Out (PIPO) buffers. This is the
simple radix-4 architecture using one four parallel processing unit. The flow of data is controlled by control unit.
The comparison of proposed method with previous standard methods shows that the proposed method generates
FFT by using reduced storage, limited complex adder and multipliers.

Key words: Delay, Feedback, FFT, twiddle factor, Four Parallel, PIPO.

1. Introduction

The computations in Discrete Fourier Transform (DFT) is a big problem in practical implementation for
many applications, FFT is the efficient solution for this problem. There many pipelined and parallel pipelined
designs are implemented over the past decade, many of the designs are follows the radix-2 and radix-2m splitting,
in this paper we implemented radix-4 and radix-4m splitting, in which number of stages are reduced. The basis
DFT equation is (1), radix-4 DIF-FFT decomposition is in equation (2) to (5).
𝑋[𝑘] = ∑ 𝑥[𝑛]𝑤 (1)

The applications of FFT [1],[2],[3],[4] in several wide fields is the motive to design of new FFT
architectures, research on FFT’s Hardware architecture has been going on from 1965 , starting from invention
of FFT [5]. The FPGA designs[6],[7] for FFT’s is one of the major developments in the hardware implementation
of FFT. The pipelined designs and parallel designs are very familiar designs [8],[9],[10],[11] these designs are
pipelined designs, in which data flows from left to right , no feedback is required and each stage has separate
processing unit generates the required outputs . The delay feedback designs uses only one path [12], most of the
FFT designs for practical implementation follows the Radix-2, Radix-22 decomposition[13] and recently split
radix[14],[15]. The analyzation of several existing designs gives us a basis for designing a new FFT architecture
of radix-4, generally radix-4 designs are not that much usage because of complexity and requires reordering of
inputs for every stage. The proposed method solves the problem of reordering of inputs by using simplified four
PIPO buffers.
Radix-4 FFT’s architectures are designed by the decomposition of the equation in four constituent sub
components, each subcomponent is summation of four sub parts, equations from [2] to [5] shows the
decomposition of X(4k), X(4k+1), X(4k+2) and X(4k+3).

𝑋[4𝑘] = ∑ {𝑥[𝑛] + 𝑥[𝑛 + 𝑁⁄4] + 𝑥[𝑛 + 2𝑁⁄4] + 𝑥[𝑛 + 3𝑁⁄4]}𝑤 ⁄ (2)

𝑋[4𝑘 + 1] = ∑ {𝑥[𝑛] − 𝑗𝑥[𝑛 + 𝑁⁄4] − 𝑥[𝑛 + 2𝑁⁄4] + 𝑗𝑥[𝑛 + 3𝑁⁄4]}𝑤 𝑤 ⁄ (3)

𝑋[4𝑘 + 2] = ∑ {𝑥[𝑛] − 𝑥[𝑛 + 𝑁⁄4] + 𝑥[𝑛 + 2𝑁⁄4] − 𝑥[𝑛 + 3𝑁⁄4]}𝑤 𝑤 ⁄ (4)

𝑋[4𝑘 + 3] = ∑ {𝑥[𝑛] + 𝑗𝑥[𝑛 + 𝑁⁄4] − 𝑥[𝑛 + 2𝑁⁄4] − 𝑗𝑥[𝑛 + 3𝑁⁄4]}𝑤 𝑤 ⁄ (5)

The butterfly dataflow diagram for 16 inputs is shown in Figure 1, it is a two stage design, there are four
dependable 4-point butterflies in first stage, those are {x(0), x(4),x(8) , x(12)}, {x(1), x(5),x(9) , x(13)}, {x(2),
x(6),x(10) , x(14)} and {x(3), x(7),x(11) , x(15)}. The second stage has four butterflies of 4-input. The multipliers
for twiddle factor multiplications are required after first stage.

x(0) X(0)

x(1) X(4)

x(2) X(8)

x(3) X(12)

x(4) 0 X(1)
1 X(5)
x(5)
2 X(9)
x(6)
3
x(7) X(13)

x(8) 0 X(2)

x(9) 2 X(6)

4 X(10)
x(10)
6 X(14)
x(11)
0 X(3)
x(12)
3 X(7)
x(13)

x(14) 6 X(11)

9 X(15)
x(15)

Figure 1: radix-4 FFT dataflow with 16 input values.

2. Proposed radix-4 Four parallel feedback design:

The proposed design for implementation of FFT is a radix -4 design, in which the main modules are Four
parallel complex adder, Parallel in Serial Out (PISO), demultiplexer and Parallel in Parallel Out (PIPO) registers.
The detailed description of the each sub unit is given in subsequent sections. The proposed four complex adder
design for 16 input values is shown in Figure 2, input data values are given parallelly to four parallel complex
adders through four multiplexers. The four parallel complex adder takes four complex inputs and gives four
complex outputs, it generates the basic four input radix-4 design. The four parallel output data is multiplied by
twiddle factor multipliers, in the design first output does not require multiplication, so three multipliers are
required.
16

16

x(0)
0 0 X0
16
64
1 1
00 PIPO
x(4)
0 Four Parallel P 64
16 0 X1
PIPO
1 Complex I 64
01 1

x(8) Adder S 10
64
PIPO
16 0 X2
0
O 64
1

1 PIPO 16 X3
11 0
x(12) 0 1

16

16

Figure 2: proposed design using four parallel complex adders


The four outputs of the complex adder with corresponding inputs x (0), x(4),x(8) and x(12) are processed
through Parallel in serial Out (PISO) block ,which converts four 16 bit inputs into one 64 bit output. The serial
output from PISO is connected to demultiplexer, it generates four outputs of length 64 bit, each output carries the
four outputs of first stage butterfly. The four outputs of demultiplex is stored in four Buffers of Parallel in Parallel
Out (PIPO), detailed data flow in PIPO is shown in Figure 3.

x1(12) x1(8) x1(4) x1(0)

x1(13) x1(9) x1(5) x1(1)

x1(14) x1(10) x1(6) x1(2)

x1(15) x1(11) x1(7) x1(3)

Figure 3: Data arrangement in PIPO.

The outputs at corresponding positions of 0,4,8 and 12 are stored in PIPO buffer, remaining outputs
stored in other three different PIPO’s, by storing all the 16 values in PIPO buffers complete the stage-1 outputs.
Stage-2 of the radix-4 FFT requires the reordering of outputs, this can be done by reding the buffer data in First
in Fist Out fashion, so that first 16 values of buffers are x1(0), x1(1), x1(2) and x1(3) is feed back to the input of
four parallel complex adder through four demultiplexers. The generated outputs are again stored in PIPO buffers
in the similar fashion as first stage, and the generated outputs are 0,4,8 and 12 , to reorder these outputs again
waiting is required to generate all the outputs , after completion read first 16 values of all the PIPO’s those are
0,1,2 and 3 ,similarly in next cycle four outputs 4,5,6 and 7 are read out, to read out all the 16 outputs the selection
line of demultiplexers have to change its mode.
The generation of 16-input takes over all 16 cycles, complex four parallel adder takes two clock cycles to generate
four outputs, for generating stage 1 outputs it takes overall 4*2=8 cycles. Another 8 cycles to generate stage -2
outputs.

3.Sub Components in the proposed design:

3.1 Four Parallel complex Adder:

The four parallel complex adder unit is basic 4-point radix-4 FFT generation unit, it generates four
outputs parallelly, the detailed design is shown in Figure 4, the basic design of 4-point FFT is shown in figure

0 0

-J 4 4
-1
J

8 8
-1
j

j
-J 12 12

Figure 4: simple 4-point radi-4 design using 8 complex adders.

The four inputs of the parallel adder are 0,4,8 and 12, it consists of total 8 full adders that is 4 adders and
4 subtractors, the data is arranged in the format shown in Figure 4. The first four parallel adders generate necessary
4 outputs, inside there is multiplication with j is required, which is generated by interchanging real part with
imaginary part and vice versa, generates necessary four outputs, and the next four parallel adders takes inputs
from first adders generates necessary outputs.
3.2 Data index reordering unit:

The data index reordering unit is required in the proposed method, because of the output of stage 1 is
obtained in 0,4, 8 and 12 in this way, this problem can be solved by using data buffers and demultiplexer with
four outputs is shown in Figure 5.

64
00 12 8 4 0

64
01 13 9 5 1
64
12 8 4 0 64
10 14 10 6 2

64
15 11 7 3
11

s0
s1

Figure 5: PIPO sub block with demultiplexers.

The input to the demultiplexer is 64-bit output of PISO unit, it makes four 64-bit outputs of overall
sixteen 16-bit outputs, these four outputs are generated by changing the selection line of the demultiplexers
depending on input data, Table 1 shows the selection input with corresponding input.

Table 1:demultiplexers selection lines based on inputs

Input Selection (s0, s1)


{12,8,4,0} 00
{13,9,5,1} 01
{14,10,6,2} 10
{15,11,7,3} 11

4. Proposed Design for Higher order FFT:

The proposed design is very effective for higher order FFT’s, that is for example FFT for 64 input, it also
similar structure as shown in Figure 62. The major modification in the design, PISO and increase in number of
storage elements in the PIPO buffers, PISO has to wait 32 cycles for the generation of all the 64 outputs in the
first stage , each four parallel output write in four index delay manner , that is first four outputs are write in [0 4
8 12] locations and next four are in [1 5 9 13], the detailed data arrangement is displayed in Table 2.

Table 2: PISO data arrangement for 64-point FFT

Location 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1st Four 48 32 16 0
2nd Four 52 48 36 32 20 16 4 0
3rd Four 56 52 48 40 36 32 24 20 16 8 4 0
4th Four 60 56 52 48 44 40 36 32 28 24 20 16 12 8 4 0

The architecture for 64 data inputs is displayed in Figure 6, here the PISO stores total 16 outputs, as a
result overall size increases to 256 bits, the data length of outputs in demultiplexer are also changed to 256.
256
60 56 52 48 44 40 36 32 28 24 20 16 12 8 4 0
00

256
61 57 53 49 45 41 37 33 29 24 21 17 13 9 5 1
01
256
60 56 52 48 44 40 36 32 28 24 20 16 12 8 4 0
256
62 58 54 50 46 42 38 34 30 25 22 18 14 10 6 2
10

256
63 59 55 51 47 43 39 35 31 26 23 19 15 11 7 3
11

s0
s1

Figure 6: proposed design for 64 input values.

5.Results and discussions:

The proposed design of four parallel complex adder is shown in figure, in which the requirement of
hardware components is described from the design. The number of complex adders is 8, that is four parallel
complex adder unit in figure designed by a total of 8 complex adders, number of multipliers are 3. The latency of
the design is log 𝑁, number of storage elements are N, that is in PIPO buffer has N/4 storage capacity, and
there are total four PIPO’s, so this design has N storage elements.
The performance analysis of proposed design in perspective of hardware and latency is displayed in Table 3,
proposed design takes less storage elements with that of previous designs, required complex adders are smaller
than that of existing designs expect one design, multipliers count is also not far away from previous designs. The
proposed four parallel complex adder design requires less storage, limited adders and multipliers and latency is
also lower than previous designs.

Table 3: Hardware utilization of the proposed method with that of standard methods

Type of Design Storage Complex Adders Multipliers Latency LUT’s


R2SDF 2N-1 16 4 31 21824
R2SDC 3N-2 31 8 256 32701
R2SD2F (7N-4)/3 4 4 36 640
R23 SDF 2N-1 16 2 31 18688
𝐍
R-4 Proposed N 8 3 𝐥𝐨𝐠 𝟒 𝐍 482
𝟐

The proposed design using four parallel complex adders is synthesized in FPGA VIRTEX6 -XC6VLX75T,
LUT’s of the proposed design lower than several other designs, and the dynamic power consumption at 100Mhz
is 268uw and at 500Mhz is 128 uw.

6. Conclusions:

Four parallel complex adders-based feedback design for implementation of radix-4 FFT with efficient
data reordering module using PIPO buffers is designed in this paper, proposed method has a major advantage of
lower complex adders that is 8 only, limited multipliers of 3 and the buffer size which is very lower than several
other existing designs. The proposed method uses multiplexers and demultiplexers for effective generation of
outputs, and an effective data reordering scheme is also proposed using PIPO buffers for each stage, this is very
effective for higher order FFT’s. The proposed method finally achieves difficulties of input reordering and
computational complexity in the radix-4 FFT’s in an effective way.

References:
[1] Á. Ordóñez, F. Argüello, and D. B. Heras, “GPU accelerated FFT-based registration of hyperspectral
scenes,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 10, no. 11, pp. 4869–4878, 2017.
[2] R. F. Yazicioglu, P. Merken, R. Puers, and C. Van Hoof, “Low-power low-noise 8-channel EEG front-
end ASIC for ambulatory acquisition systems,” in 2006 Proceedings of the 32nd European Solid-State
Circuits Conference, 2006, pp. 247–250.
[3] S. He and M. Torkelson, “Designing pipeline FFT processor for OFDM (de) modulation,” in 1998 URSI
International Symposium on Signals, Systems, and Electronics. Conference Proceedings (Cat. No.
98EX167), 1998, pp. 257–262.
[4] T. Sansaloni, A. Perez-Pascual, V. Torres, and J. Valls, “Efficient pipeline FFT processors for WLAN
MIMO-OFDM systems,” Electron. Lett., vol. 41, no. 19, pp. 1043–1044, 2005.
[5] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,”
Math. Comput., vol. 19, no. 90, pp. 297–301, 1965.
[6] S. Langemeyer, P. Pirsch, and H. Blume, “A FPGA architecture for real-time processing of variable-
length FFTs,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2011, pp. 1705–1708.
[7] J. Nash, “Distributed-Memory-Based FFT Architecture and FPGA Implementations,” Electronics, vol. 7,
no. 7, p. 116, 2018.
[8] E. H. Wold and A. M. Despain, “Pipeline and parallel-pipeline FFT processors for VLSI
implementations,” IEEE Trans. Comput., no. 5, pp. 414–426, 1984.
[9] M. Ayinala, M. Brown, and K. K. Parhi, “Pipelined parallel FFT architectures via folding transformation,”
IEEE Trans. Very Large Scale Integr. Syst., vol. 20, no. 6, pp. 1068–1081, 2012.
[10] R. Storn, “A novel radix-2 pipeline architecture for the computation of the DFT,” in 1988., IEEE
International Symposium on Circuits and Systems, 1988, pp. 1899–1902.
[11] T.-Y. Sung, “Memory-efficient and high-speed split-radix FFT/IFFT processor based on pipelined
CORDIC rotations,” IEE Proceedings-Vision, Image Signal Process., vol. 153, no. 4, pp. 405–410, 2006.
[12] T. Adiono, M. S. Irsyadi, Y. S. Hidayat, and A. Irawan, “64-point fast efficient FFT architecture using
radix-23 single path delay feedback,” in 2009 International Conference on Electrical Engineering and
Informatics, 2009, vol. 2, pp. 654–658.
[13] A. X. Glittas, M. Sellathurai, and G. Lakshminarayanan, “A normal I/O order radix-2 FFT architecture to
process twin data streams for MIMO,” IEEE Trans. Very Large Scale Integr. Syst., vol. 24, no. 6, pp.
2402–2406, 2016.
[14] P. Duhamel and H. Hollmann, “Split radix’FFT algorithm,” Electron. Lett., vol. 20, no. 1, pp. 14–16,
1984.
[15] W.-C. Yeh and C.-W. Jen, “High-speed and low-power split-radix FFT,” IEEE Trans. Signal Process.,
vol. 51, no. 3, pp. 864–874, 2003.

AUTHORS PROFILE

Prasanna Kumar G Research scholar in University College of Engineering JNTUK, Kakinada and Associate
professor in Vishnu Institute of Technology Bhimavaram, The current research area is signal processing and presently
research on low complexity and power optimized FFT designs.

Krishna B.T gotten Ph.D. degree in Electronics and Communication Engineering from Andhra University,
Vishakhapatnam, Andhra Pradesh, India in 2010. He has been with the college school of Engineering, JNTUK,
Kakinada since 2012. His exploration advantages incorporate signal processing.

Pushpa. K gotten Ph.D. degree in Electronics and Communication Engineering from Osmania University,
Hyderabad, Andhra Pradesh in 2010. She filled in as Engineer in Astra Microwave Products Limited, Hyderabad
during 1992 to 1997. From 1997 onwards, she served in various Engineering Colleges, wound up Associate Professor
in 2000 and Professor in 2007. She is currently a Professor in ECE Department of Shri Vishnu Engineering College
for Women, Bhimavaram, Andhra Pradesh, India.

You might also like