DESIGNING WITH FIELD PROGRAMMABLE GATE ARRAYS
Implementing functions in FPGAs
Typically behavioral, RTL, or structural models of designs are created in a language such as VHDL
or Verilog, and automatic CAD software is used to synthesize, map, partition, place, and route the
design into an FPGA. Let us assume that we want to design a 4-to-1 multiplexer using an FPGA
whose logic block is represented by the following figure.
Example Building Blocks for an FPGA (a) With Look up Tables and flip-flops (b) With
Multiplexers
This building block contains two 4-variable function generators, X and Y, and two flip-flops. The
X function generator can generate any functions of X1, X2, X3, and X4. Similarly, the Y function
generator can create any function of Y1, Y2, Y3, and Y4. Latched or unlatched forms of the
generated functions can be brought to the output of the logic block.
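Assume that the multiplexer inputs are I0, I1, I2, and I3 and that the multiplexer selects are S1 and
S0. The output equation for the multiplexer can then be written as follows:
$M = S_1' S_0' I_0 + S_1' S_0 I_1 + S_1 S_0' I_2 + S_1 S_0 I_3$
A 4-to-1 multiplexer can be decomposed into three 2-to-1 multiplexers:
$M_1 = S_0' I_0 + S_0 I_1$
$M_2 = S_0' I_2 + S_0 I_3$
$M_3 = S_1' M_1 + S_1 M_2$
Thus each 2-to-1 multiplexer fits in one 4-variable function generator, with the following LUT contents: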
LUT-M1 – 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1
LUT-M2 – 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1
LUT-M3 – 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1
Here consider M3 as M.
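The decomposition can be sketched in Verilog as follows. This is a minimal behavioral sketch, not vendor code: the lut4 module, its INIT parameter, and the address ordering {X1, X2, X3, X4} with X1 as the most significant bit are assumptions made for illustration. Under that assumed ordering, the LUT contents listed above (address 0 first) correspond to 16'hCACA and realize F = X2'X4 + X2X3, with X1 unused.

```verilog
// Hypothetical behavioral model of a 4-input LUT (not a vendor primitive).
// INIT[k] is the LUT output for address k = {x1, x2, x3, x4} (assumed ordering).
module lut4 #(parameter [15:0] INIT = 16'h0000)
             (input x1, input x2, input x3, input x4, output f);
  wire [15:0] contents = INIT;
  wire [3:0]  addr     = {x1, x2, x3, x4};
  assign f = contents[addr];
endmodule

// 4-to-1 multiplexer built from three identical LUTs (M1, M2, M3).
// Contents 0,1,0,1,0,0,1,1,0,1,0,1,0,0,1,1 (address 0 first) = 16'hCACA,
// which is f = x2 ? x3 : x4 with x1 unused.
module mux4_from_luts (input i0, input i1, input i2, input i3,
                       input s1, input s0, output m);
  wire m1, m2;
  // M1 = S0'I0 + S0 I1
  lut4 #(.INIT(16'hCACA)) lut_m1 (.x1(1'b0), .x2(s0), .x3(i1), .x4(i0), .f(m1));
  // M2 = S0'I2 + S0 I3
  lut4 #(.INIT(16'hCACA)) lut_m2 (.x1(1'b0), .x2(s0), .x3(i3), .x4(i2), .f(m2));
  // M3 = S1'M1 + S1 M2 (the final output M)
  lut4 #(.INIT(16'hCACA)) lut_m3 (.x1(1'b0), .x2(s1), .x3(m2), .x4(m1), .f(m));
endmodule
```

Tying X1 to 0 is one possible choice; because the upper and lower halves of the LUT contents are identical, the value on X1 does not affect the output.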
Shift register implementation in LUT-based FPGA:
Implementing functions using Shannon’s decomposition
Shannon’s expansion theorem can be used to decompose functions of large numbers of variables
into functions of fewer variables. We decompose a 4-to-1 multiplexer into 2-to-1 multiplexers in
order to implement it in a logic block with 4-variable function generators.
Let us illustrate Shannon’s decomposition for realizing any 6-variable function Z(a, b, c, d, e, f ).
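First, expand the function about a as follows:
$Z(a, b, c, d, e, f) = a' \cdot Z(0, b, c, d, e, f) + a \cdot Z(1, b, c, d, e, f) = a' Z_0 + a Z_1$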
Fig. Realization of Functions by Decomposition; (a) 6-Variable Function Using 5-Variable
Function Generators; (b) 6-Variable Function Using 4-Variable Function Generators
As an example, consider the following function:
$Z = abcd'ef' + a'b'c'def' + b'cde'f$
Setting $a = 0$ gives
$Z_0 = 0 \cdot bcd'ef' + 1 \cdot b'c'def' + b'cde'f = b'c'def' + b'cde'f$
And setting $a = 1$ gives
$Z_1 = 1 \cdot bcd'ef' + 0 \cdot b'c'def' + b'cde'f = bcd'ef' + b'cde'f$
Since Z0 and Z1 are 5-variable functions, each of them needs a 5-input LUT. If only 4-input LUTs
are available, the 5-variable functions should be further decomposed into 4-variable functions.
This can be done by applying Shannon’s expansion theorem twice, first expanding about a and
then expanding about b.
$Z(a, b, c, d, e, f) = a'b' \cdot Z(0, 0, c, d, e, f) + a'b \cdot Z(0, 1, c, d, e, f) + ab' \cdot Z(1, 0, c, d, e, f) + ab \cdot Z(1, 1, c, d, e, f)$
$= a'b' \cdot Y_0 + a'b \cdot Y_1 + ab' \cdot Y_2 + ab \cdot Y_3$
Substituting a = b = 0 gives $Y_0 = c'def' + cde'f$
Substituting a = 0, b = 1 gives $Y_1 = 0$
Substituting a = 1, b = 0 gives $Y_2 = cde'f$
Substituting a = b = 1 gives $Y_3 = cd'ef'$
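The two-level expansion can be checked with a short Verilog sketch. The module names z_decomposed, z_direct, and z_tb are hypothetical and introduced only for this illustration. Each Y function depends only on c, d, e, and f, so it would fit in one 4-input LUT, and the final expression plays the role of the 4-to-1 multiplexer selected by a and b.

```verilog
// Z built from four 4-variable functions plus a 4-to-1 selection on a, b:
// Z = a'b'Y0 + a'bY1 + ab'Y2 + abY3
module z_decomposed (input a, b, c, d, e, f, output z);
  wire y0 = (~c & d & e & ~f) | (c & d & ~e & f); // a = 0, b = 0
  wire y1 = 1'b0;                                  // a = 0, b = 1
  wire y2 = c & d & ~e & f;                        // a = 1, b = 0
  wire y3 = c & ~d & e & ~f;                       // a = 1, b = 1
  assign z = (~a & ~b & y0) | (~a & b & y1) | (a & ~b & y2) | (a & b & y3);
endmodule

// Reference: the original sum-of-products form of Z.
module z_direct (input a, b, c, d, e, f, output z);
  assign z = ( a &  b &  c & ~d & e & ~f) |
             (~a & ~b & ~c &  d & e & ~f) |
             (~b &  c &  d & ~e & f);
endmodule

// Exhaustive comparison over all 64 input combinations.
module z_tb;
  reg  [5:0] v;
  wire       z1, z2;
  integer    i, errors;

  z_decomposed u1 (.a(v[5]), .b(v[4]), .c(v[3]), .d(v[2]), .e(v[1]), .f(v[0]), .z(z1));
  z_direct     u2 (.a(v[5]), .b(v[4]), .c(v[3]), .d(v[2]), .e(v[1]), .f(v[0]), .z(z2));

  initial begin
    errors = 0;
    for (i = 0; i < 64; i = i + 1) begin
      v = i;          // low six bits of the loop counter
      #1;             // let the combinational outputs settle
      if (z1 !== z2) errors = errors + 1;
    end
    $display("mismatches: %0d", errors);
    $finish;
  end
endmodule
```

If the decomposition is correct, the testbench should report no mismatches between the two forms.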
Function implementation using 4-variable function generators
Eg. 1. Implement a 7-variable function using 4-input LUTs and 2-to-1 multiplexers.
Solution:
Shannon’s expansion can be used to obtain the following decompositions:
7-variable function generator = two 6-variable function generators + a 2-to-1 mux … (i)
6-variable function generator = two 5-variable function generators + a 2-to-1 mux … (ii)
5-variable function generator = two 4-variable function generators + a 2-to-1 mux … (iii)
Substituting (iii) into (ii), we obtain
6-variable function generator = four 4-variable function generators + three 2-to-1 muxes … (iv)
Substituting (iv) into (i), we obtain
7-variable function generator = eight 4-variable function generators + seven 2-to-1 muxes
Thus a 7-variable function can be implemented as in the following figure
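The counting in this example follows a simple recurrence. As a sketch, let N(n) be the number of 4-variable function generators and M(n) the number of 2-to-1 muxes needed for an n-variable function under this straightforward decomposition (N and M are names introduced here, not taken from the text):

```latex
\begin{aligned}
N(n) &= 2\,N(n-1), & N(4) &= 1 &&\Rightarrow\; N(n) = 2^{\,n-4},\\
M(n) &= 2\,M(n-1) + 1, & M(4) &= 0 &&\Rightarrow\; M(n) = 2^{\,n-4} - 1,\\
N(7) &= 2^{3} = 8, & M(7) &= 2^{3} - 1 = 7.
\end{aligned}
```

These are upper bounds for an arbitrary function; a specific function may need fewer blocks when some of the cofactors are constant or identical.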
Carry Chains in FPGAs
The most naïve method to create an adder with FPGAs would be to use FPGA logic blocks to
generate the sum and carry for each bit. A 4-variable look-up table (which is currently the standard
building block) can generate the sum, and another LUT4 will typically be required to realize the
carry equation. The carry output from each bit has to be forwarded to the next bit using interconnect
resources.
Carry chains for fast addition based on Xilinx
Each LUT generates the sum bit of the corresponding input bits (a, b, and Carry-in). The carry
chain generates the carry in parallel and feeds it using the dedicated interconnect to the LUT,
implementing the sum of the next bit. Without such a carry chain, an n-bit adder typically will take
2n logic blocks (if a logic block is a LUT4), whereas with the carry chain, n logic blocks (albeit
with additional dedicated circuitry) are sufficient.
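The per-bit sum and carry equations behind this count can be sketched as follows. The module names full_adder_bit and ripple_adder and the parameter N are hypothetical; the sketch describes the logic only and does not model any vendor's carry-chain primitive.

```verilog
// One adder bit without a carry chain: two separate 3-input functions,
// so typically two LUT4s per bit.
module full_adder_bit (input a, input b, input cin, output sum, output cout);
  assign sum  = a ^ b ^ cin;                // sum function
  assign cout = (a & b) | (cin & (a ^ b));  // carry function
endmodule

// N-bit ripple-carry adder built from the bit slice above.
module ripple_adder #(parameter N = 8)
                     (input  [N-1:0] a, input [N-1:0] b, input cin,
                      output [N-1:0] sum, output cout);
  wire [N:0] c;          // c[i] is the carry into bit i
  assign c[0] = cin;

  genvar i;
  generate
    for (i = 0; i < N; i = i + 1) begin : bit_slice
      full_adder_bit fa (.a(a[i]), .b(b[i]), .cin(c[i]),
                         .sum(sum[i]), .cout(c[i+1]));
    end
  endgenerate

  assign cout = c[N];
endmodule
```

In practice, writing the adder with the + operator usually lets the synthesis tool map the carry path onto the dedicated carry chain when the target device has one.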
Cascade Chains in FPGAs
Some FPGAs contain support for cascading outputs from FPGA blocks in series. The common
types of cascading are the AND configuration and the OR configuration. These are extremely
useful while creating wide-AND and wide-OR gates. Instead of using separate function generators
to perform AND or OR functions of logic block outputs, the output from one logic block can be
directly fed to the cascade circuitry to create AND or OR functions of the logic block outputs.
An FPGA that uses 4-input LUTs for function generation is shown. If an OR operation of 32
variables is desired, one can accomplish it using eight logic blocks. Each logic block will generate
a 4-variable OR and the cascading OR gate can be used to OR the output from the previous logic
block.
Example Cascade Chains (a) AND Cascade Chain (b) OR Cascade Chain
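The 32-variable OR described above can be sketched as eight 4-input ORs joined by a chain of 2-input ORs. The module name wide_or32 and the signal names are hypothetical, and the sketch shows only the logical structure, not how a particular tool maps it onto cascade hardware.

```verilog
// 32-input OR: eight logic blocks each OR four inputs, and a cascade
// OR gate combines each block's result with the previous block's output.
module wide_or32 (input [31:0] x, output y);
  wire [7:0] block_or;  // 4-variable OR produced in each logic block
  wire [8:0] cascade;   // cascade[k] = OR of the first k block outputs

  assign cascade[0] = 1'b0;   // nothing feeds the first cascade input

  genvar k;
  generate
    for (k = 0; k < 8; k = k + 1) begin : blk
      assign block_or[k]  = |x[4*k +: 4];               // LUT function
      assign cascade[k+1] = cascade[k] | block_or[k];   // cascade OR gate
    end
  endgenerate

  assign y = cascade[8];
endmodule
```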
Register Chains in FPGAs
In many FPGAs, the only input to the flip-flops in the logic blocks is through the LUTs or logic
elements. In such FPGAs, a shift register must configure each LUT as a simple wire that passes the
input through to the flip-flop. As a result, the logic block in the following figure cannot implement
any other circuitry while it serves as a shift-register stage. Register chains address this by
providing a direct flip-flop-to-flip-flop path that bypasses the LUT.
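A shift register is a useful illustration because no logic sits between its stages. In the minimal sketch below (the module name shift_reg4 and the 4-stage length are arbitrary choices), each flip-flop input is simply the previous flip-flop output, so any LUT placed in that path would act only as a wire.

```verilog
// 4-stage serial-in, serial-out shift register.
module shift_reg4 (input clk, input d_in, output d_out);
  reg [3:0] q;

  always @(posedge clk)
    q <= {q[2:0], d_in};   // each stage takes the value of the previous stage

  assign d_out = q[3];
endmodule
```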
Examples of Logic Blocks in Commercial FPGAs
We provide three examples of commercial FPGA logic blocks. They are from Xilinx, Altera, and
Microsemi. The Xilinx and Altera architectures both use 6-variable look-up tables as their basic
building block. Microsemi has an architecture that uses a 4-variable look-up table and another that
uses multiplexers and gates.
The Xilinx Kintex Configurable Logic Block
The Xilinx Kintex FPGA uses four copies of a basic block called a slice, illustrated in the following
figure, to form a Configurable Logic Block (CLB). CLB is the Xilinx terminology for the
programmable logic block in its FPGAs.
Simplified View of the Xilinx Kintex “Slice” (1/4 of a CLB). Based on Xilinx
The Altera Stratix IV Logic module
Altera’s name for its basic logic block is the logic module (LM). The following figure illustrates a
simplified view of the logic block of the Altera Stratix IV FPGA. Each LM contains two 6-variable
look-up tables (LUTs) and two flip-flops. Each LUT6 has two independent inputs and four shared
inputs.
Simplified View of the Altera Stratix IV Logic Module. Based on Altera.
Dedicated Memory in FPGAs
Designers typically interfaced the FPGAs to external memory chips when memory was desired.
As chip densities have increased, FPGA designers started to incorporate dedicated memory on
FPGA chips, eliminating the need to interface them with external memory chips. Modern FPGAs
include 16K to 10M bits of dedicated memory.
As an example, the Xilinx Virtex-5 contains 1M to 10M bits of dedicated memory. Similarly, the
Altera Stratix II contains 409K to 9M bits of memory. The Microsemi Fusion contains 27K to 270K
bits of memory. The dedicated memory is typically implemented using a few (4–1000) large blocks
of dedicated SRAM located in the FPGA.
Embedded RAMs in FPGAs
Memory created from LUT cells is called Distributed Memory (in Xilinx terminology). As the
term indicates, this memory is distributed throughout the chip inside the logic blocks. In many
FPGAs, the SRAM blocks are of one size (e.g., 18Kb in Xilinx Virtex). In some FPGAs, there are
blocks of different sizes. For example, the Altera FPGAs have embedded memory built from sizes
such as 4Kb, 9Kb, 20Kb, 144Kb, and 512Kb blocks. Many FPGAs have multiple types of memory
building blocks in the same chip; for instance Altera Stratix IV contains 9Kb and 144Kb memory
blocks. A key feature of the dedicated RAM on modern FPGAs is the ability to adjust the width of
the RAM. The same blocks can be configured to achieve different aspect ratios. Let us assume
that there are 32K bits of SRAM provided as blocks of RAM. This RAM can be used as 32K × 1,
16K × 2, 8K × 4, or 4K × 8.
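The aspect-ratio choice can be expressed with two parameters. The following is a minimal sketch, assuming a single-port synchronous RAM; the module name ram_32kbit and the parameters ADDR_WIDTH and DATA_WIDTH are hypothetical. Whether a tool maps this description onto dedicated block RAM depends on the tool and the device.

```verilog
// Single-port synchronous RAM holding 32K bits in total.
// Choose the parameters so that (2**ADDR_WIDTH) * DATA_WIDTH = 32768:
//   15/1 -> 32K x 1,  14/2 -> 16K x 2,  13/4 -> 8K x 4,  12/8 -> 4K x 8
module ram_32kbit #(parameter ADDR_WIDTH = 12, parameter DATA_WIDTH = 8)
                   (input                       clk,
                    input                       we,
                    input      [ADDR_WIDTH-1:0] addr,
                    input      [DATA_WIDTH-1:0] din,
                    output reg [DATA_WIDTH-1:0] dout);
  reg [DATA_WIDTH-1:0] mem [0:(1 << ADDR_WIDTH) - 1];

  always @(posedge clk) begin
    if (we)
      mem[addr] <= din;    // synchronous write
    dout <= mem[addr];     // registered read (returns the old data on a write)
  end
endmodule
```

For example, instantiating it with #(.ADDR_WIDTH(15), .DATA_WIDTH(1)) gives the 32K × 1 organization using the same number of storage bits.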
Creating Memory from LUTs. Based on Xilinx
A 4-variable LUT contains 16 bits of storage. One can create small amounts of memory by
combining the storage cells from the LUTs. Two 4-input LUTs can be used to create a 32 × 1
memory or a 16 × 2 memory. When used as a 32 × 1 memory, there must be five address lines and
one data line.
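The 32 × 1 organization can be sketched by letting the fifth address line choose between two 16 × 1 memories. The module names lut_ram16x1 and lut_ram32x1 are hypothetical, and the behavioral model below only approximates LUT-based memory; it is not a vendor primitive.

```verilog
// 16 x 1 memory modeled on the 16 storage bits of one 4-input LUT.
module lut_ram16x1 (input clk, input we, input [3:0] addr,
                    input din, output dout);
  reg [15:0] cells;

  always @(posedge clk)
    if (we) cells[addr] <= din;   // synchronous write

  assign dout = cells[addr];      // asynchronous read, like a LUT lookup
endmodule

// 32 x 1 memory: addr[4] selects which of the two LUTs is written and read.
module lut_ram32x1 (input clk, input we, input [4:0] addr,
                    input din, output dout);
  wire dout_lo, dout_hi;

  lut_ram16x1 lo (.clk(clk), .we(we & ~addr[4]), .addr(addr[3:0]),
                  .din(din), .dout(dout_lo));
  lut_ram16x1 hi (.clk(clk), .we(we &  addr[4]), .addr(addr[3:0]),
                  .din(din), .dout(dout_hi));

  assign dout = addr[4] ? dout_hi : dout_lo;
endmodule
```

A 16 × 2 memory would instead share the same four address lines between the two LUTs and use one data line per LUT.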
Dedicated Multipliers in FPGAs
Suppose that a designer wants a 16 × 16 multiplier. If dedicated multipliers are not provided,
several programmable logic blocks will be used to create the 16 × 16 multiplier. Such a multiplier
will be expensive in terms of the number of blocks and interconnect resources used; it will also be
slow because of the switches involved in interconnecting the parts of the multiplier. Dedicated
multipliers will be more area efficient and will be faster than multipliers realized using logic
blocks.
When multiplication of numbers larger than 18 bits is required, several of the dedicated built-in
multipliers can be put together.
If A and B are 32 bits, and C, D, E, and F are the 16-bit components of A and B such that:
$A = C \cdot 2^{16} + D$
$B = E \cdot 2^{16} + F$
Then
$AB = CE \cdot 2^{32} + (DE + CF) \cdot 2^{16} + DF$
This means that four multipliers are required to generate the partial products CE, DE, CF, and DF,
and several adders are required to add the partial products.
Verilog code for dedicated multipliers:
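module multiplier (A, B, C);
  input  [31:0] A;   // 32-bit operand
  input  [31:0] B;   // 32-bit operand
  output [63:0] C;   // 64-bit product
  wire   [63:0] C;
  // The * operator is typically mapped onto dedicated multipliers when the device provides them.
  assign C = A * B;
endmodule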
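The partial-product equation can also be written out explicitly. The sketch below is illustrative only: the module name mult32_from_16 and the output name P are hypothetical, and a synthesis tool would normally perform this splitting itself when given A * B.

```verilog
// 32 x 32 multiplication assembled from four 16 x 16 partial products:
// AB = CE * 2^32 + (DE + CF) * 2^16 + DF
module mult32_from_16 (input [31:0] A, input [31:0] B, output [63:0] P);
  wire [15:0] C = A[31:16], D = A[15:0];   // A = C * 2^16 + D
  wire [15:0] E = B[31:16], F = B[15:0];   // B = E * 2^16 + F

  // Each 16 x 16 product is 32 bits wide and could use one dedicated multiplier.
  wire [31:0] CE = C * E;
  wire [31:0] DE = D * E;
  wire [31:0] CF = C * F;
  wire [31:0] DF = D * F;

  // Extend to 64 bits before shifting and adding so no carries are lost.
  assign P = ({32'b0, CE} << 32)
           + (({32'b0, DE} + {32'b0, CF}) << 16)
           +  {32'b0, DF};
endmodule
```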
FPGA Capacity: Maximum vs Usable gates
The number of raw gates that have gone into building an FPGA is not an interesting or useful
metric to an FPGA user. What is useful to the user is a count of the circuitry that can fit into a
particular FPGA. This is called the equivalent gate count. Gate counts are estimated in many
different ways. For instance, a 2-to-1 multiplexer is considered to be four gates, and a 3-input XOR
is considered to be six gates. A 4-input XOR is 9 gates and a flip-flop with clear is considered to
be 6-7 gates. An equivalent gate count can be obtained for a programmable logic block in an FPGA
in this fashion, and the total gate count can be estimated by multiplying it with the number of logic
blocks in the FPGA.
The Programmable Electronics Performance Company (PREP) benchmark suite was an early
attempt to facilitate standard benchmark circuits for ASIC and FPGA benchmarking. Assume that
a particular circuitry typically takes 2000 gates in ASIC, and if an FPGA device can fit 20 copies
of that circuitry, an FPGA vendor may estimate the maximum gate count of their FPGA as 40K.
Since the circuit is simply replicated and no actual interconnection exists between the copies, this
count is also likely to be higher than the gate count of practical circuitry that can be realized in the
FPGA.
PREP Benchmarks
The Programmable Electronics Performance Company (PREP) was a nonprofit organization
that gathered and distributed a series of benchmarks for programmable logic chips in the early
days of FPGAs. The nine PREP benchmark circuits in the PREP 1.3 suite were as follows:
i. An 8-bit data path consisting of a 4:1 MUX, a register, and a shift-register
ii. An 8-bit timer-counter consisting of two registers, a 4:1 MUX, a counter, and a
comparator
iii. A small state machine (8 states, 8 inputs, and 8 outputs)
iv. A larger state machine (16 states, 8 inputs, and 8 outputs)
v. An ALU consisting of a 4 × 4 multiplier, an 8-bit adder, and an 8-bit register
vi. A 16-bit accumulator
vii. A 16-bit counter with synchronous load and enable
viii. A 16-bit prescaled counter with load and enable
ix. A 16-bit address decoder
Design Translation
The term synthesis refers to the translation of an abstract high-level design to a circuit description,
typically in the form of a logic schematic. The output from the synthesis tools may be a logic
schematic together with an associated wirelist, which implements the digital system as an
interconnection of gates, flip-flops, registers, counters, multiplexers, adders, and other basic logic
blocks. This representation is called a netlist.
CAD Design Flow
Even if Verilog/VHDL code compiles and simulates correctly, it may not necessarily synthesize
correctly. And even if the Verilog/VHDL code does synthesize correctly, the resulting
implementation may not be very efficient. In general, synthesis tools will accept only a subset of
Verilog as input. In Verilog, a signal may represent the output of a flip-flop or register, or it may
represent the output of a combinational logic block. The synthesis tool will attempt to determine
what is intended from the context. For example, the continuous assignment
assign A = B & C;
implies that A should be implemented using combinational logic. On the other hand, if the same
assignment appears in an always statement that is triggered by a clock edge,
always @(posedge CLK)
begin
  A <= B & C;
end
this implies that A represents a register (or flip-flop) that changes state on the rising edge of
the clock.
Synthesis of a case statement
Most modern synthesizers will also perform optimizations to reduce the logic that is generated.
Because the MUX inputs are constants, elimination of the MUX and several gates is possible by
inspection of the truth table. The optimized output equations are:
$b_1 = a_1' a_0 = (a_1 + a_0')'$ and $b_0 = (a_1 a_0')'$
Verilog code for case example:
module case_example (a, b);
  input  [1:0] a;
  output [1:0] b;
  reg    [1:0] b;

  // Each value of a selects a constant for b (a MUX with constant inputs).
  always @(a)
  begin
    case (a)
      0: b <= 1;
      1: b <= 3;
      2: b <= 0;
      3: b <= 1;
    endcase
  end
endmodule
Synthesized circuit before optimization
Logic optimization
a1  a0  |  b1  b0
0   0   |  0   1
0   1   |  1   1
1   0   |  0   0
1   1   |  0   1
$b_1 = a_1' \cdot a_0 = (a_1 + a_0')'$
$b_0 = (a_1 \cdot a_0')'$
Synthesized circuit after optimization
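The optimized equations can be checked against the case statement with a short simulation. The module names case_optimized and case_optimized_tb are hypothetical; the expected values in the testbench are copied from the case example.

```verilog
// Optimized form: no multiplexer, only the two reduced equations.
module case_optimized (input [1:0] a, output [1:0] b);
  assign b[1] = ~a[1] & a[0];      // b1 = a1' a0
  assign b[0] = ~(a[1] & ~a[0]);   // b0 = (a1 a0')'
endmodule

// Compare the optimized outputs with the values given in the case statement.
module case_optimized_tb;
  reg  [1:0] a;
  wire [1:0] b;
  reg  [1:0] expected;
  integer    i;

  case_optimized dut (.a(a), .b(b));

  initial begin
    for (i = 0; i < 4; i = i + 1) begin
      a = i;
      case (a)
        2'd0:    expected = 2'd1;
        2'd1:    expected = 2'd3;
        2'd2:    expected = 2'd0;
        default: expected = 2'd1;
      endcase
      #1;   // let the combinational outputs settle
      if (b !== expected)
        $display("mismatch at a = %0d", a);
    end
    $finish;
  end
endmodule
```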
Unintended latch creation
In general, when a Verilog signal is assigned a value, it will hold that value until it is assigned a
new value. Because of this property, some Verilog synthesizers will infer a latch when none is
intended by the designer.
module latch_example (a, b);
  input [1:0] a;
  output b;
  reg b;

  always @(a)
  begin
    case (a)
      0:
        b <= 1'b1;
      1:
        b <= 1'b0;
      2:
        b <= 1'b1;
      // no assignment for a = 3, so b must hold its previous value
    endcase
  end
endmodule
The case statement results in a 4-to-1 multiplexer whose data inputs are set to the values in each
case; the select lines are controlled by the value of a. Since the value of b is not specified if a
is not equal to 0, 1, or 2, the synthesizer assumes that the value of b should be held in a latch
if a = 3. When a = 3, the previous value of b should be used as the output. This necessitates a
latch whose D input = a0'. In order to hold the value in the latch, the latch gate control signal G
should be 0 when a = 3, that is, G = (a1 · a0)'.
a1  a0  |  b
0   0   |  1
0   1   |  0
1   0   |  1
1   1   |  previous b
Optimized circuit for the table shown
a1  a0  |  b
0   0   |  1
0   1   |  0
1   0   |  1
1   1   |  0
$b = a_0'$
The latch can be eliminated by adding the Verilog code b <= 1'b0 for a = 3. If this change is made,
most synthesizers will generate only a multiplexer and no latch.
module latch_example (a, b);
  input [1:0] a;
  output b;
  reg b;

  always @(a)
  begin
    case (a)
      0:
        begin
          b <= 1'b1;
        end
      1:
        begin
          b <= 1'b0;
        end
      2:
        begin
          b <= 1'b1;
        end
      3:
        begin
          b <= 1'b0;   // every value of a now assigns b, so no latch is needed
        end
    endcase
  end
endmodule
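Two other common ways to cover every case are a default assignment at the top of the block and a default branch in the case statement. The sketch below shows both together; this is an alternative formulation rather than the code from the text, the module name latch_free_example is hypothetical, and it uses the Verilog-2001 always @(*) form.

```verilog
// b is assigned on every path through the block, so no latch is inferred.
module latch_free_example (input [1:0] a, output reg b);
  always @(*) begin
    b = 1'b0;              // default value, applies unless overridden below
    case (a)
      2'd0:    b = 1'b1;
      2'd2:    b = 1'b1;
      default: b = 1'b0;   // explicit branch for a = 1 and a = 3
    endcase
  end
endmodule
```

Either technique alone is sufficient; using both is a defensive habit some designers prefer.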
Synthesis of if statement
When if statements are used, every branch should assign a value to each signal. For example, a
designer who writes the following code may intend for Nextstate to retain its previous value if
A ≠ 1, and the code will simulate correctly:
if (A == 1'b1)
begin
  Nextstate <= 3;
  Z <= 1;
end
However, the synthesizer might interpret this code to mean if A ≠ 1, then Nextstate is unknown
(‘X’), and the result of the synthesis may be incorrect. Also, it will result in latches for Z. For this
reason, it is always best to include an else clause in every if statement. For example,
if (A == 1'b1)
begin
  Nextstate <= 3;
  Z <= 1;
end
else
begin
  Nextstate <= 2;
  Z <= 0;
end
Verilog code for a 4-to-1 Mux using if statement:
module if_example (A, B, C, D, E, Z);
  input        A;
  input        B;
  input  [2:0] C;
  input  [2:0] D;
  input  [2:0] E;
  output [2:0] Z;
  reg    [2:0] Z;

  // All signals read in the block are listed so that simulation matches the synthesized MUX.
  always @(A or B or C or D or E)
  begin
    if (A == 1'b1)
      begin
        Z <= C;
      end
    else if (B == 1'b0)
      begin
        Z <= D;
      end
    else
      begin
        Z <= E;
      end
  end
endmodule
Equivalent truth table:
A  B  |  Z
0  0  |  D
0  1  |  E
1  0  |  C
1  1  |  C
Synthesized hardware for the above code: