
Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines

Nabeel Shirazi, Al Walters, and Peter Athanas

Virginia Polytechnic Institute and State University
Department of Electrical Engineering
Blacksburg, Virginia 24061-0111
shirazi@pequod.ee.vt.edu

Abstract

Many algorithms rely on floating point arithmetic for the dynamic range of representations and require millions of calculations per second. Such computationally intensive algorithms are candidates for acceleration using custom computing machines (CCMs) tailored to the application. Unfortunately, floating point operators require excessive area (or time) in conventional implementations. Instead, custom formats, derived for individual applications, are feasible on CCMs, and can be implemented in a fraction of a single FPGA. Using higher-level languages, like VHDL, facilitates the development of custom operators without significantly impacting operator performance or area. Properties of working arithmetic operator units used in real-time applications, including area consumption and speed, are discussed.

Presented at the IEEE Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, April 1995.

1.0 Introduction

Until recently, any meaningful floating point arithmetic has been virtually impossible to implement on FPGA based systems due to the limited density and speed of older FPGAs. In addition, mapping difficulties occurred due to the inherent complexity of floating point arithmetic. With the introduction of high level languages such as VHDL, rapid prototyping of floating point units has become possible. Elaborate simulation and synthesis tools at a higher design level aid the designer in producing a more controllable and maintainable product. Although low level design specifications were also possible, the strategy used in the work presented here was to specify every aspect of the design in VHDL and rely on automated synthesis to generate the FPGA mapping.

Image and digital signal processing applications typically require high calculation throughput [2,6]. The arithmetic operators presented here were implemented for real-time signal processing on the Splash-2 CCM; the applications include a 2-D fast Fourier transform (FFT) and a systolic array implementation of a FIR filter. Such signal processing techniques necessitate a large dynamic range of numbers. The use of floating point helps to alleviate the underflow and overflow problems often seen in fixed point formats. An advantage of using a CCM for floating point implementation is the ability to customize the format and the algorithm data flow to suit the application's needs.

This paper examines the implementations of various arithmetic operators using two floating point formats similar to the IEEE 754 standard [5]. Eighteen and sixteen bit floating point adders/subtracters, multipliers, and dividers have been synthesized for Xilinx 4010 FPGAs [8]. The floating point formats used are discussed in Section 2. Sections 3, 4, and 5 present the algorithms, implementations, and optimizations used for the different operators. Finally, a summary, in terms of size and speed, of the different floating point units is given in Section 6.

2.0 Floating Point Format Representation

The format which was used is similar to the IEEE 754 standard for storing floating point numbers. For comparison purposes, single precision floating point uses the 32 bit IEEE 754 format shown in Figure 1.

[Figure 1: 32 Bit Floating Point Format. Fields: s (bit 31), e (bits 30-23), f (bits 22-0).]
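The field boundaries in Figure 1 can be checked against a host machine's native single precision encoding. The following is an editorial Python sketch, not part of the original paper; the helper name `fields32` is illustrative only:

```python
import struct

def fields32(x: float):
    """Split a float's IEEE 754 single precision encoding into (s, e, f)."""
    (w,) = struct.unpack('>I', struct.pack('>f', x))  # raw 32-bit pattern
    return (w >> 31) & 1, (w >> 23) & 0xFF, w & 0x7FFFFF
```

For example, `fields32(-3.625)` yields sign 1, biased exponent 128, and fraction 0x680000, matching the worked example in this section.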
The floating point value (v) is computed by:

v = (-1)^s × 2^(e - 127) × (1.f)

In Figure 1, the sign field, s, is bit 31 and is used to specify the sign of the number. Bits 30 down to 23 are the exponent field. This 8 bit quantity is a signed number represented by using a bias of 127. Bits 22 down to 0 are used to store the mantissa, f, of the floating point number. The leading one in the mantissa, 1.f, does not appear in the representation; the leading one is implicit. For example, -3.625 (decimal) or -11.101 (binary) is stored in the following way:

v = (-1)^1 × 2^(128 - 127) × 1.1101

where s = 1, e = 128 (dec) = 80 (hex), and f = 680000 (hex). Therefore -3.625 is stored as C0680000 (hex).

The 18-bit floating point format was developed, in the same manner, for the 2-D FFT application [6]. The format was chosen to accommodate two specific requirements: (1) the dynamic range of the format needed to be quite large in order to represent very large and small, positive and negative real numbers accurately, and (2) the data path into one of the Xilinx 4010 processors of Splash-2 is 36 bits wide, and two operands needed to be input on every clock cycle. Based on these requirements the format in Figure 2 was used.

[Figure 2: 18 Bit Floating Point Format. Fields: s (bit 17), e (bits 16-10), f (bits 9-0).]

The 18 bit floating point value (v) is computed by:

v = (-1)^s × 2^(e - 63) × (1.f)

The range of real numbers that this format can represent is ±3.6875 x 10^19 to ±1.626 x 10^-19.

The second floating point format investigated was a 16-bit representation used by the FIR filter application [7]. Like the FFT application, since multiple arithmetic operations needed to be done on a single chip, we chose a 16-bit format for two reasons: (1) local, 16-bit wide memories were used in pipelined calculations, allowing single read cycles only, and (2) more logic was necessary to implement the FIR taps in addition to the two arithmetic units which do complex number operations. The format was designed as a compromise between data width and a large enough dynamic number range. The 16-bit format is shown in Figure 3.

[Figure 3: 16 Bit Floating Point Format. Fields: s (bit 15), e (bits 14-9), f (bits 8-0).]

The 16 bit floating point value (v) is computed by:

v = (-1)^s × 2^(e - 31) × (1.f)

The range of real numbers that this 16 bit format can represent is ±8.5815 x 10^9 to ±6.985 x 10^-10.

3.0 Floating-Point Addition and Subtraction

The aim in developing a floating point adder/subtracter routine was to pipeline the unit in order to produce a result every clock cycle. Pipelining the adder increased the speed; however, the area increased as well. Different coding structures were tried in the VHDL code used to program the Xilinx chips in order to minimize size.

3.1 Algorithm

The floating-point addition and subtraction algorithm studied here is similar to what is done in most traditional processors; however, the computation is performed in three stages, presented in this section. The notation si, ei and fi is used to represent the sign, exponent and mantissa fields of the floating point number vi. A block diagram of the three-stage adder is shown in Figure 4. The computations required for each stage are as follows:

Stage 1:

• If the absolute value of v1 is less than the absolute value of v2, swap v1 and v2. The absolute value is checked by comparing the exponent and mantissa of each value.

• Subtract e2 from e1 in order to calculate the number of positions to shift f2 to the right so that the binary points are aligned before addition or subtraction in Stage 2.
[Figure 4: Three stage 18-bit Floating Point Adder. Stage 1 swaps v1 and v2 if necessary (so that v1 > v2) and computes the shift value; Stage 2 shifts f2 right and adds or subtracts the mantissas; Stage 3 normalizes the result and adjusts the exponent. Rx = x-bit pipeline register.]

Stage 2:

• Shift 1.f2 to the right the (e1 - e2) places calculated in the previous stage.

• Add 1.f1 to 1.f2 if s1 equals s2, else subtract 1.f2 from 1.f1.

• Set the sign and the exponent of the final result, v3, to the sign and the exponent of the greater value, v1.

Stage 3:

• Normalization of f3 is done by shifting it to the left until the high order bit is a one.

• Adjusting the exponent of the result, e3, is done by subtracting from it the number of positions that f3 was shifted left.

3.2 Eighteen Bit Floating Point Addition Example

To demonstrate an 18 bit, 3 stage floating point adder, we add v1 + v2 = v3, where v1 = 24.046875 and v2 = -25.40625. Therefore v3 should equal -1.359375.

         Decimal      Binary               18 Bit Format
  v1     24.046875    1.1000000011 x 2^4   0 1000011 1000000011
+ v2    -25.40625     1.1001011010 x 2^4   1 1000011 1001011010

Therefore: s1 = 0, e1 = 1000011, 1.f1 = 1.1000000011
           s2 = 1, e2 = 1000011, 1.f2 = 1.1001011010

Stage 1:

• Swap v1 and v2, since e1 = e2 and f2 > f1.
  Now: s1 = 1, e1 = 1000011, 1.f1 = 1.1001011010
       s2 = 0, e2 = 1000011, 1.f2 = 1.1000000011

• Since e1 - e2 = 0, 1.f2 does not need to be shifted in the next stage.

Stage 2:

• Since s1 does not equal s2, 1.f3 = 1.f1 - 1.f2.

• Also, s3 = s1 and e3 = e1, since they are the sign and exponent of the greater value.
  After stage 2: s3 = 1, e3 = 1000011, 1.f3 = 0.0001010111

Stage 3:

• Normalize f3 by shifting it 4 places to the left.

• Adjust the exponent, e3, by subtracting 4 from it.
  After the final stage: s3 = 1, e3 = 0111111, 1.f3 = 1.0101110000

The result, v3, after addition is shown as follows:

         Decimal      Binary            18 Bit Format
  v3    -1.359375     1.010111 x 2^0    1 0111111 0101110000

3.3 Optimization

The circuits produced by contemporary VHDL synthesis tools are, unfortunately, highly sensitive to the manner in which the original behavioral or structural description is expressed. When designing the floating point adder/subtracter, using different VHDL constructs to describe the same behavior resulted in a faster and smaller design.
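As a cross-check of the format and the three-stage algorithm above, here is a small software model in Python. This is an editorial sketch, not the authors' VHDL: the names `encode18`, `decode18`, and `fp_add18` are illustrative, and rounding of shifted-out mantissa bits is not modeled.

```python
import math

# 18-bit format of Figure 2: sign (1 bit), exponent (7 bits, bias 63),
# mantissa fraction (10 bits) with an implicit leading one.
BIAS, FBITS = 63, 10

def encode18(x: float) -> int:
    """Pack a representable nonzero float into the 18-bit word."""
    s = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))            # abs(x) = m * 2**e, 0.5 <= m < 1
    m, e = 2 * m, e - 1                  # renormalize to 1.0 <= m < 2
    f = round((m - 1.0) * (1 << FBITS))  # keep 10 fraction bits
    return (s << 17) | ((e + BIAS) << FBITS) | f

def decode18(w: int) -> float:
    s, e, f = (w >> 17) & 1, ((w >> FBITS) & 0x7F) - BIAS, w & 0x3FF
    return (-1.0) ** s * (1.0 + f / (1 << FBITS)) * 2.0 ** e

def fp_add18(w1: int, w2: int) -> int:
    """Three-stage add/subtract on packed words, mirroring Figure 4."""
    s1, e1, f1 = (w1 >> 17) & 1, (w1 >> FBITS) & 0x7F, w1 & 0x3FF
    s2, e2, f2 = (w2 >> 17) & 1, (w2 >> FBITS) & 0x7F, w2 & 0x3FF
    # Stage 1: swap so |v1| >= |v2| (compare exponent, then mantissa);
    # the shift count for alignment is e1 - e2.
    if (e1, f1) < (e2, f2):
        (s1, e1, f1), (s2, e2, f2) = (s2, e2, f2), (s1, e1, f1)
    m1 = (1 << FBITS) | f1               # restore the hidden ones
    m2 = ((1 << FBITS) | f2) >> (e1 - e2)
    # Stage 2: add/subtract mantissas; sign and exponent of larger value.
    m3 = m1 + m2 if s1 == s2 else m1 - m2
    s3, e3 = s1, e1
    if m3 == 0:
        return 0
    # Stage 3: normalize (hidden one into bit 10), adjusting the exponent.
    while m3 < (1 << FBITS):
        m3, e3 = m3 << 1, e3 - 1
    while m3 >= (1 << (FBITS + 1)):      # carry-out of the mantissa add
        m3, e3 = m3 >> 1, e3 + 1
    return (s3 << 17) | (e3 << FBITS) | (m3 & 0x3FF)
```

Running the Section 3.2 example, `fp_add18(encode18(24.046875), encode18(-25.40625))` decodes to exactly -1.359375.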
The parts of the adder which caused the bottleneck were the exponent subtracter, the mantissa adder/subtracter, and the normalization unit. An 8-bit and a 16-bit Xilinx hard-macro adder/subtracter [8] were used in place of VHDL code written for the exponent and mantissa computation. This increased the overall speed of the design even though a smaller 12-bit adder/subtracter was replaced with a 16-bit adder/subtracter hard macro. The first cut at the normalization unit resulted in a very slow and large design. VHDL for loops were used for the shift left and for the code that finds the most significant bit during normalization. In order to decrease the size and increase the speed of the design, the for loops were unrolled and if statements used instead.

The first method used for shifting the mantissa of the second operand, f2, a variable number of places was originally coded in VHDL the following way:

  -- Shift f2 right ediff places
  e_diff_var := e_diff;
  f2_var := f2(10 downto 0);

  for i in 1 to 11 loop
    if (e_diff_var > zero_8) then
      f2_var(9 downto 0) := f2_var(10 downto 1);
      f2_var(10) := '0';
      e_diff_var := e_diff_var - 1;
    end if;
  end loop;

  f2_result(10 downto 0) <= f2_var;

The second method used if statements to check each individual bit of the shift value and shift f2 accordingly:

  -- Shift f2 right ediff places
  if ((e_diff(7) = '1') or (e_diff(6) = '1') or
      (e_diff(5) = '1') or (e_diff(4) = '1')) then
    e_diff_var(3 downto 0) := "1111";
  else
    e_diff_var(3 downto 0) := e_diff(3 downto 0);
  end if;

  -- Sequential code for shifting f2_var
  f2_var := f2(10 downto 0);
  if (e_diff_var(0) = '1') then
    f2_var(9 downto 0) := f2_var(10 downto 1);
    f2_var(10) := '0';
  end if;
  if (e_diff_var(1) = '1') then
    f2_var(8 downto 0) := f2_var(10 downto 2);
    f2_var(10 downto 9) := "00";
  end if;
  if (e_diff_var(2) = '1') then
    f2_var(6 downto 0) := f2_var(10 downto 4);
    f2_var(10 downto 7) := "0000";
  end if;
  if (e_diff_var(3) = '1') then
    f2_var(2 downto 0) := f2_var(10 downto 8);
    f2_var(10 downto 3) := "00000000";
  end if;

  f2_result(10 downto 0) <= f2_var;

The result of using the second method is shown in Table 1. The variable 11-bit shifter became two times smaller and three times faster.

                          Method 1        Method 2       Advantage
  FG Function Generators  85/800 (10%)    44/800 (5%)    2x smaller
  (used/available)
  Flip Flops              6%              6%             same
  Speed                   6.5 MHz         19.0 MHz       2.9x faster

TABLE 1. Optimizations for Variable 11-Bit Barrel Shifter.

The code used to normalize the result after addition or subtraction of f1 and f2 was also initially written using for loops in VHDL:

  -- Shift f_result left until msb = 1
  msb := f(10);
  f_result_var := f;
  e_result_var := e;

  for i in 1 to 11 loop
    if (msb = '0') then
      f_result_var(10 downto 1) := f_result_var(9 downto 0);
      f_result_var(0) := '0';
      e_result_var := e_result_var - 1;
      msb := f_result_var(10);
    end if;
  end loop;

  f_result <= f_result_var(9 downto 0);
  e_result <= e_result_var;

The second method calculates the number of places to shift the mantissa to the left in order to position the most significant bit (msb) in the high order location. A series of if statements is used to check all possible bit position locations for the msb in order to calculate the shift value. After the shift value is calculated, a procedure similar to the second method for performing a variable shift to the right is used to shift the un-normalized value the correct number of positions to the left in order to normalize it. Due to its size, the VHDL source code for this method is not listed here.

By using the second method, the normalization unit became 2.9 times smaller and 2.6 times faster. A summary of the results of optimizing the normalization unit is shown in Table 2.
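The unrolled second method generalizes: a variable shift by k can be built from log2(N) fixed-distance stages, each enabled by one bit of k, plus an msb search for normalization. A Python sketch of both ideas (editorial illustration, not the authors' VHDL; function names are assumptions):

```python
def barrel_shift_right(value: int, shift: int, width: int = 11) -> int:
    """Shift an 11-bit value right, one fixed stage per bit of `shift`.

    Mirrors the second VHDL method: if any of bits 7..4 of the 8-bit
    shift amount are set, saturate to a full flush; otherwise apply the
    1-, 2-, 4- and 8-place stages selected by the low four bits.
    """
    if shift > 15:                  # any high bit set -> all zeros
        shift = 15
    for bit, distance in enumerate((1, 2, 4, 8)):
        if (shift >> bit) & 1:      # each stage is a fixed wiring pattern
            value >>= distance
    return value & ((1 << width) - 1)

def normalize(f: int, e: int, width: int = 11):
    """Find the msb position, then shift left so bit `width-1` is a one.

    Assumes f != 0 (a zero result is handled separately in the text).
    """
    shift = next(i for i in range(width) if ((f << i) >> (width - 1)) & 1)
    return (f << shift) & ((1 << width) - 1), e - shift
```

On the Section 3.2 example, `normalize(0b00001010111, 67)` shifts 4 places and returns the mantissa 0b10101110000 with exponent 63.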


                          Method 1         Method 2       Advantage
  FG Function Generators  167/800 (20%)    58/800 (7%)    2.9x smaller
  Flip Flops              6%               6%             same
  Speed                   5.1 MHz          13.4 MHz       2.6x faster

TABLE 2. Optimizations for Normalization Unit.

The overall size and speed of the 16 and 18-bit floating point adders are given in Section 6.

4.0 Floating Point Multiplication

Floating point multiplication is much like integer multiplication. Because floating-point numbers are stored in sign-magnitude form, the multiplier needs only to deal with unsigned integer numbers and normalization. Like the architecture of the floating point adder, the floating point multiplier unit is a three stage pipeline that produces a result on every clock cycle. The bottleneck of this design was the integer multiplier. Four different methods were used to optimize the integer multiplier in order to meet speed and size requirements.

[Figure 5: Three stage 18 bit Floating Point Multiplier. Stage 1 adds the exponents and appends the implied ones to f1 and f2; Stage 2 multiplies the 11-bit mantissas, keeping the top 12 bits, and adjusts the exponent; Stage 3 normalizes the result. Rx = x-bit pipeline register.]

4.1 Algorithm

The block diagram for the three stage 18 bit floating point multiplier is shown in Figure 5. The algorithm for each stage is as follows:

Stage 1:

• The exponents, e1 and e2, are added, and the result along with the carry bit is stored in an 8-bit register. If the addition of two negative exponents results in a value smaller than the minimum exponent that can be represented, i.e. -63, underflow occurs; in this case the floating point number is set to zero. If overflow occurs, the result is set to the maximum number the format can represent.

• If the floating point number is not zero, the implied one is concatenated to the left side of the f1 and f2 terms.

• The sign bits are only registered in this stage.

Stage 2:

• Integer multiplication of the two 11-bit quantities, 1.f1 and 1.f2, is performed. The top 12 bits of the 22-bit result are stored in a register.

• The exponent is adjusted depending on the high order bit of the multiplication result.

• The sign bits of the two operands are compared. If they are equal, the result is assigned a positive sign; if they differ, the result is negative.

Stage 3:

• Normalization of the resulting mantissa is performed.

• The resulting sign, exponent and mantissa fields are placed into an 18-bit floating point word.

4.2 Optimization

Four different methods were used to optimize the 11-bit integer multiplier. The first method used the integer multiply available in the Synopsys 3.0a VHDL compiler. The second method was a simple array multiplier composed of ten 11-bit carry-save adders [3]. The last two methods involved pipelining the multiplier in order to increase the speed of the design. The multiplication of the two 11-bit quantities was broken up and multiplied in the following way:
        X6 X5
      * Y6 Y5
      -------
        X5Y5
       X6Y5
       X5Y6
    + X6Y6
    -------------
    22 Bit Result

In the third method, the first stage of the multiplier was the multiplication of the four terms X5Y5, X6Y5, X5Y6, and X6Y6. The second stage involved adding the results of the four multiplications. In method 4, two stages were used to sum the multiplication terms.

The results of the four methods are summarized in Table 3. The advantage in terms of the number of times faster and the number of times larger than Method 1 is shown.

                          Method 1   Method 2   Method 3   Method 4
  FG Function Generators  35%        31%        45%        47%
  Stages                  1          1          2          3
  Speed                   4.9 MHz    3.7 MHz    6.2 MHz    9.4 MHz
  Area Advantage          1.0        0.90       1.29       1.34
  Speed Advantage         1.0        0.75       1.24       1.92

TABLE 3. Results from Four Methods Used to Optimize an Integer 11-Bit Multiplier.

Method 1 was used in the floating point multiplier unit, since with methods 3 or 4 the size of the unit was too large to allow an additional floating point unit in the same chip. The overall size and speed of the 16 and 18-bit floating point multipliers are given in Section 6, Summary and Conclusions.

5.0 Floating Point Division

A floating-point division technique is presented here which utilizes the pipelined multiplier discussed earlier. Division can be done by using the reciprocal of the divisor value, so that the equation for division becomes a multiplication: A x (1/B) = Q. Independent operations on the different floating point fields enable the design to be pipelined easily.

5.1 Algorithm

The reciprocal of a floating point value can be accomplished in two steps: (1) reciprocate the mantissa value, and (2) negate the power of the base value. Since the floating point representation already has its fields segregated, the task becomes trivial for a processing element which is complemented by a memory bank of at least 2^n x n bits, where n is the size of the mantissa's normalized binary representation. Local memories attached to the processing elements store the reciprocal of each bit combination of the mantissa.

In order to pipeline the design, three steps prior to the multiplication are necessary: (1) extract the mantissa from the input as the memory address and negate the exponent, (2) provide a delay until the memory data is valid, and (3) insert the new mantissa. The data word created during Stage 3 is passed to the multiplier. The second stage of the pipeline depends on the system being used, which could result in longer delays before the data is made available from memory. In this case, a Splash-2 implementation was used, which is shown in Figure 6. Memory reads require a single cycle after the address is presented before the data can be acquired from the memory data buffer.

The k1 value negates the exponent, which still retains the embedded excess value. Note that since the reciprocal of the given mantissa value will be less than or equal to 1.0, normalization of the mantissa has to be done, although a special case for 1.0 has to be made. The normalization process is done automatically with k1. Once the addition is done, the result becomes the new exponent passed on to Stage 2. The mantissa in Stage 1 goes directly to the memory address buffer to obtain the new mantissa, but the old mantissa continues into Stage 2 and is replaced in Stage 3. Stage 2 of the pipeline waits for the data to become available from the memory; this occurs at Stage 3. The new mantissa is inserted into the final operand to be passed to the multiplier. Although three pipeline stages are shown here, additional stages occur due to the pipelined multiplier, making a total of five stages.

[Figure 6: Three stage 18 bit Floating Point Divider. Stage 1 adds k1 to the negated exponent e1 and presents f1 as the memory address; Stage 2 delays while the memory read completes; Stage 3 inserts the new mantissa 1/f1 from the memory data into the operand, which is multiplied by v2 (Stages 3-5) to produce Q.]

                          Adder/Subtracter   Multiplier   Divider
  FG Function Generators  28%                44%          46%
  Flip Flops              14%                14%          34%
  Stages                  3                  3            5
  Speed                   8.6 MHz            4.9 MHz      4.7 MHz
  Tested Speed            10 MHz             10 MHz       10 MHz

TABLE 5. Summary of 18 bit Floating Point Units.
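The multiplier and the table-based reciprocal divider can be modeled in software. The Python sketch below is an editorial illustration of the scheme under stated assumptions, not the authors' VHDL: `RECIP_ROM` stands in for the local memory, filled with an assumed ideal rounding of the mantissa reciprocal, and all names are hypothetical.

```python
FBITS, BIAS = 10, 63   # 18-bit format: 10 fraction bits, exponent bias 63

def unpack(w):
    """(sign, biased exponent, 11-bit mantissa with hidden one)."""
    return (w >> 17) & 1, (w >> FBITS) & 0x7F, (1 << FBITS) | (w & 0x3FF)

def pack(s, e, m):
    return (s << 17) | (e << FBITS) | (m & 0x3FF)

def value(w):
    """Decode a packed word to a float (for checking only)."""
    s, e, m = unpack(w)
    return (-1.0) ** s * m / (1 << FBITS) * 2.0 ** (e - BIAS)

def fp_mul18(w1, w2):
    """Stage 1: add exponents, compare signs.  Stage 2: 11x11-bit
    integer multiply, keep the top 12 of 22 bits.  Stage 3: normalize."""
    s1, e1, m1 = unpack(w1)
    s2, e2, m2 = unpack(w2)
    s3, e3 = s1 ^ s2, e1 + e2 - BIAS   # remove the doubled-up bias
    top12 = (m1 * m2) >> 10            # top 12 bits of the 22-bit product
    if top12 >> (FBITS + 1):           # product in [2, 4): shift down one
        top12, e3 = top12 >> 1, e3 + 1
    return pack(s3, e3, top12)

# Reciprocal lookup: a 2^10-entry ROM indexed by the mantissa fraction,
# holding the fraction bits of 1/(1.f) renormalized into [1, 2).
RECIP_ROM = [round((1 << (2 * FBITS + 1)) / ((1 << FBITS) | f)) & 0x3FF
             for f in range(1 << FBITS)]

def fp_div18(wa, wb):
    """A / B computed as A x (1/B): negate B's exponent (the k1 constant
    re-biases it) and swap in the ROM mantissa, then multiply."""
    s, e, m = unpack(wb)
    f = m & 0x3FF
    # 1/(1.f) lies in (0.5, 1], so the normalized reciprocal exponent is
    # -(e - BIAS) - 1, except for the special case f = 0 (1/1.0 = 1.0).
    k1 = 2 * BIAS if f == 0 else 2 * BIAS - 1
    recip = pack(s, k1 - e, RECIP_ROM[f])
    return fp_mul18(wa, recip)
```

With only 10 fraction bits in the table, quotients are accurate to roughly one unit in the last place: for example `fp_div18` of 3.0 by 1.5 decodes to about 1.999 rather than exactly 2.0.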
6.0 Summary and Conclusions

The aim in designing the floating point units was to pipeline each unit a sufficient number of times in order to maximize speed and to minimize area. It is important to note that once the pipeline is full, a result is output every clock cycle. A summary of the resulting size and speed of the 16 bit and 18 bit floating point units is given in Tables 4 and 5, respectively.

The Synopsys Version 3.0a VHDL compiler was used along with the Xilinx 5.0 tools to compile the VHDL descriptions of the floating point arithmetic units. The Xilinx timing tool, xdelay, was used to estimate the speed of the designs.

                          Adder/Subtracter   Multiplier   Divider
  FG Function Generators  26%                36%          38%
  Flip Flops              13%                13%          32%
  Stages                  3                  3            5
  Speed                   9.3 MHz            6.0 MHz      5.9 MHz

TABLE 4. Summary of 16 bit Floating Point Units.

To implement single precision floating point arithmetic units on the Splash-2 architecture, the size of the floating point arithmetic units would increase between 2 to 4 times over the 18 bit format. A multiply unit would require two Xilinx 4010 chips and an adder/subtracter unit, broken up into four 12-bit multipliers, allocating two per chip [4]. We found that a 16x16 bit multiplier was the largest parallel integer multiplier that could fit into a Xilinx 4010 chip; when synthesized, this multiplier used 75% of the chip area.

[Figure 7: A single Splash-2 PE design for an FIR tap to accomplish complex multiplication, built from a 16x16 partial convolution sum, pipeline registers (P2, P3, S0-S3, Re, Im) and state-controlled multiplexors. The architecture can achieve two floating-point calculations per clock cycle.]
[Figure 8: A block diagram of a four PE Splash-2 design for a complex floating point multiplier used in a FFT butterfly operation. The four PEs form the products f(x).re*WN.re, f(x).im*WN.im, f(x).im*WN.re and f(x).re*WN.im, with the real and imaginary twiddle factors WN.re and WN.im supplied by 16-bit lookup tables, and combine them with a floating point subtract and add to produce result.re and result.im. Six floating point operations are calculated every clock cycle at 10 MHz.]
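The butterfly's complex product in Figure 8 is the standard four-multiplication decomposition. As a quick arithmetic check (in plain Python floats rather than the 18-bit format; the function name is illustrative):

```python
def complex_multiply(f_re, f_im, w_re, w_im):
    """(f_re + j*f_im) * (w_re + j*w_im) using four multiplies, one
    subtract and one add -- the six floating point operations the
    four-PE pipeline performs each clock cycle."""
    return (f_re * w_re - f_im * w_im,   # real part
            f_re * w_im + f_im * w_re)   # imaginary part
```

For example, (1 + 2j)(3 + 4j) gives (-5, 10).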

Each of the floating point arithmetic units has been incorporated into two applications: a 2-D FFT [6] and a FIR filter [7]. The FFT application operates at 10 MHz, and the results of the transform are stored in memory on the Splash-2 array board. These results were checked by performing the same transform on a SPARC workstation. An FIR tap design using a floating point adder and multiplier unit is shown in Figure 7. The complex floating point multiplier used in the 2-D FFT butterfly calculation is shown in Figure 8.

Acknowledgments

We wish to express our gratitude to Dr. J. T. McHenry and Dr. D. Buell. We would also like to thank Professor J. A. DeGroat of The Ohio State University for technical advice. This research has been supported in part by the National Science Foundation (NSF) under grant MIP-9308390.

References

[1] J.M. Arnold, D.A. Buell and E.G. Davis, "Splash 2," Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 316-322, June 1992.

[2] J.A. Eldon and C. Robertson, "A Floating Point Format for Signal Processing," Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 717-720, 1982.

[3] K. Eshraghian and N.H.E. Weste, Principles of CMOS VLSI Design: A Systems Perspective, 2nd Edition, Addison-Wesley, 1993.

[4] B. Fagin and C. Renard, "Field Programmable Gate Arrays and Floating Point Arithmetic," IEEE Transactions on VLSI, Vol. 2, No. 3, pp. 365-367, September 1994.

[5] IEEE Task P754, "A Proposed Standard for Binary Floating-Point Arithmetic," IEEE Computer, Vol. 14, No. 12, pp. 51-62, March 1981.

[6] N. Shirazi, Implementation of a 2-D Fast Fourier Transform on an FPGA Based Computing Platform, VPI&SU Masters Thesis, in progress.

[7] A. Walters, An Indoor Wireless Communications Channel Model Implementation on a Custom Computing Platform, VPI&SU Masters Thesis, in progress.

[8] Xilinx, Inc., The Programmable Logic Data Book, San Jose, California, 1993.
