Synthesis
Part 1
/amradelm
Introduction
• Logic synthesis is the process of converting the HDL code into logic gates.
• The output of synthesis is called a netlist. A netlist is a textual representation of all the nets and cells in the design and their connections.
• The aim of this document is to explain the steps within the synthesis flow and show the possible optimizations that can be done.
[Figure: Behavioral Verilog Code; Synthesized Logic Schematic; Synthesized Logic Netlist]
Analyze
• The first step in synthesis is to read and analyze the HDL code and the other inputs to ensure nothing is missing or corrupted.
• The analyze step will check that:
o The HDL code contains no syntax issues (is the syntax correct?).
o All modules/include files/functions/etc. referenced inside the code are available and nothing is missing (is this file available?).
[Figure: Example HDL1, annotated with the two questions above]
Elaborate
• The second step is called elaborate, and is sometimes called translate. It is the process of converting the HDL code into actual logic gates.
• The elaboration step outputs are:
o A technology-independent (generic) netlist without any optimizations done. The cells referenced in the netlist carry only functional information; no timing, power, or physical information exists yet.
o Reports about linting and design issues in the codes such as missing
modules, multi-driven nets, width mismatch, etc.
o Reports about the cells, their types, and counts.
Map and Optimize
(Compile)
Compile
• Compile is the most interesting and important step. Here we map the generic netlist into a technology-dependent netlist and then do several optimizations
to meet the given constraints.
• The constraints are supplied by the user and fall into 3 categories:
o Timing constraints: The clock frequency, the input/output delays, false paths, multi-cycle paths, max transitions, etc.
o Power constraints: Max dynamic power consumption, max leakage power consumption, etc.
o Area constraints.
• Along with the constraints, the user enables or disables application settings that control the tool behavior and optimizations.
• The compile step outputs are:
o A technology-dependent and optimized netlist
o Reports about the design timing results, power consumption, area utilization, etc.
Compile – Map
• In the case of ASIC, the generic cells are mapped to the cells defined inside the standard cell libraries [1].
• In the case of FPGAs, the cells are mapped into FFs, LUTs, DSP blocks, and memory blocks that are implemented within the FPGA fabric.
o The LUT (look-up table) is a small memory that can be configured to implement any basic logic gate. In the table shown, if we consider the inputs a and b as the address, we can implement any function by storing the truth table inside the memory.
a  b  |  AND  OR  NAND  NOR  XOR
0  0  |   0   0    1     1    0
0  1  |   0   1    1     0    1
1  0  |   0   1    1     0    1
1  1  |   1   1    0     0    0
o The DSP blocks are special and fast blocks that are used to implement arithmetic operations.
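The idea of a LUT as a small memory addressed by its inputs can be sketched in Python. This is only a behavioral illustration, not how a tool represents LUTs; the names `TRUTH_TABLES` and `lut2` are made up for this example, and the stored values are the columns of the truth table in this slide.

```python
# Each list is the memory content of a 2-input LUT, indexed by the
# address formed from the inputs (a, b): address = 2*a + b.
TRUTH_TABLES = {
    "AND":  [0, 0, 0, 1],
    "OR":   [0, 1, 1, 1],
    "NAND": [1, 1, 1, 0],
    "NOR":  [1, 0, 0, 0],
    "XOR":  [0, 1, 1, 0],
}


def lut2(contents, a, b):
    """Read a 2-input LUT: the inputs act as the memory address."""
    address = (a << 1) | b
    return contents[address]


# Print every gate's output for all four input combinations.
for name, contents in TRUTH_TABLES.items():
    outputs = [lut2(contents, a, b) for a in (0, 1) for b in (0, 1)]
    print(name, outputs)
```

Changing the stored list changes the implemented gate without changing any wiring, which is the property the FPGA fabric relies on.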
[1] : You can learn more about standard cell libraries here : Click Me
Compile – Optimize
• After mapping, the synthesis tool will do many optimizations and tradeoffs to fulfill the design constraints. We will go through the important ones.
Constant Propagation
• This optimization propagates constant values to remove redundant logic. For example, a 2-input AND where one of its inputs is a constant “0” will always produce “0”
regardless of the other input value.
• Consider the example below:
o The 1st XOR gate with both inputs having the same logic value will always produce “0”. Therefore we can remove the first XOR and propagate zero to the next XOR.
o The 2nd XOR has A in one input and “0” in the other. An XOR where one of its inputs is a “0” will pass the value of the other input as it is. Therefore we can remove the 2nd
XOR and propagate A to the next XOR.
o The 3rd XOR has A on both its inputs and so will always produce “0”. The entire circuit can be optimized away and a logic “0” propagated to its output.
[Figure: XOR chain annotated with the propagated values 0, A, 0]
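The three steps above can be checked exhaustively with a small Python sketch. The figure is not reproduced here, so the name `b` for the signal feeding both inputs of the first XOR is an assumption; only the structure described in the bullets is modeled.

```python
from itertools import product


def original_circuit(a, b):
    """XOR chain as described in the bullets (signal name b is assumed)."""
    x1 = b ^ b    # 1st XOR: both inputs carry the same value, so x1 is 0
    x2 = a ^ x1   # 2nd XOR: A xor 0 passes A through
    x3 = a ^ x2   # 3rd XOR: A xor A is 0
    return x3


def optimized_circuit(a, b):
    """Result of constant propagation: the output is tied to logic 0."""
    return 0


# Compare both versions for every input combination.
equivalent = all(
    original_circuit(a, b) == optimized_circuit(a, b)
    for a, b in product((0, 1), repeat=2)
)
print("equivalent:", equivalent)
```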
Cross-Boundary Optimization
• The RTL code consists of several modules that are connected to each other.
• By default, the synthesis tools will synthesize each module separately and then connect them at the top module.
• In the example below, the inverter pair can be removed. But since the inverters exist in different modules the synthesis tool won’t remove them.
• One solution to this is to enable cross-boundary optimization. This allows the tool to perform the optimizations and also propagate constants across the modules. The
drawback is an increase in runtime.
Hierarchy Flattening
• The other solution is to remove the module boundaries altogether. This is called flattening. It generally produces better results but increases runtime and makes post-synthesis simulations and the PNR flow more difficult [1].
[1] : Because it makes tracing signals and referencing cells more difficult.
NAND/NOR vs AND/OR
• In CMOS logic, NANDs and NORs have smaller area/power and shorter delay than ANDs and ORs. This is because an AND is implemented by connecting a NAND with an inverter, and likewise an OR is a NOR followed by an inverter.
• Because of this, ASIC [1] synthesis tools will try to optimize the design to use NANDs and NORs when possible.
• The example on the right shows two circuits that perform the same function. However, the bottom one is better in all aspects (timing, area, power); the other one is marked BAD.
[1] : This is not the case in FPGAs because all gates are implemented with LUTs.
Area vs Timing
• Synthesis tools have switches to control the tradeoff between area vs timing.
• Consider the example below. Both circuits perform the same function, but:
o The circuit on the left has less area (115 μm²) but a longer critical path (1050 ps), so it is better for area.
o The circuit on the right has more area (180 μm²) but a shorter critical path (800 ps), so it is better for timing.
o The synthesis tool will choose which circuit to use based on the user settings.
Cell    Tcomb (ps)   Area (μm²)
MUX     100          10
Adder   700          75
Logic   250          20
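The totals quoted above can be reproduced from the per-cell numbers. The sketch below is a worked check under an assumption: the figure is not available, so the cell counts and critical paths of the two circuits (`LEFT`, `RIGHT`) are inferred from the totals and may not match the drawing exactly.

```python
# Per-cell delay and area taken from the slide.
CELLS = {
    "mux":   {"delay_ps": 100, "area_um2": 10},
    "adder": {"delay_ps": 700, "area_um2": 75},
    "logic": {"delay_ps": 250, "area_um2": 20},
}

# Assumed structures (inferred from the quoted totals, not from the figure):
# left shares one adder behind two muxes, right duplicates the adder.
LEFT = {
    "cells": {"logic": 1, "mux": 2, "adder": 1},
    "critical_path": ["logic", "mux", "adder"],
}
RIGHT = {
    "cells": {"logic": 1, "mux": 1, "adder": 2},
    "critical_path": ["adder", "mux"],
}


def area(cell_counts):
    """Total area is the sum of the areas of all instantiated cells."""
    return sum(CELLS[name]["area_um2"] * n for name, n in cell_counts.items())


def path_delay(path):
    """Critical path delay is the sum of the cell delays along the path."""
    return sum(CELLS[name]["delay_ps"] for name in path)


for label, circuit in (("left", LEFT), ("right", RIGHT)):
    print(label, area(circuit["cells"]), "um^2",
          path_delay(circuit["critical_path"]), "ps")
```

Under these assumptions the left circuit trades 250 ps of extra delay for 65 μm² less area, which is the tradeoff the tool switches control.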
Power vs Timing
• Similarly, synthesis tools have switches to control the tradeoff between power vs timing.
• For example, a timing-driven synthesis may use large cells that have smaller delays but higher power and area.
Example
Example – Memory
• To learn how synthesis works we will manually synthesize the code on the right.
• The code shows a memory of 16 locations, where each location is 8 bits wide. The memory is positive-edge triggered.
• Let’s start with just a single-bit element (1 flip-flop) and then build up the rest of the memory:
o The always block is positive edge triggered so we need a positive edge FF.
o We don’t have a reset. So we don’t need a FF with a reset pin.
o When the address corresponding to this FF is selected, the FF reads and stores the data, otherwise it
maintains the current stored value.
o To implement this we need a mux in front of the FF. When the MUX select signal is “1” it will read the
data, otherwise, it will keep the current value from the FF.
o FPGAs and some ASIC standard cell libraries have FFs with this MUX implemented inside the FF as a
single cell.
o Now we need to implement the circuit that will read the address and generate the enable signal
o To generate the enable signal from the address we need a decoder that outputs 1 to the corresponding FF and 0 to the others.
o This is implemented using AND gates and inverters (the bubbles). The example below shows an AND gate that will produce “1” if the address is “0000” and
“0” otherwise.
o Now we will expand the FF into 8 FFs to store the 8-bit data. The entire row has the same enable/address signal.
[Figure: memory rows labeled with their addresses, 0000 through 1111]
o We have 16 rows/locations in the memory, so we repeat the last structure 16 times. Each row has its own unique address generator.
• We have now finished the elaboration step. We converted the behavioral HDL into logic gates, but we haven’t done mapping and optimization yet.
• The initial cell count is:
o 16 × 8 FFs = 128 FFs
o 16 4-input ANDs
o 32 inverters
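The cell count can be re-derived with a short Python sketch: one inverter is needed for every 0 bit in each row's address, and one 4-input AND per row. The function `next_state` is a hypothetical behavioral model of one clock edge (decoder, then mux plus FF per row); the original HDL is not reproduced here.

```python
ADDR_BITS = 4
DATA_BITS = 8
ROWS = 2 ** ADDR_BITS

ff_count = ROWS * DATA_BITS
and4_count = ROWS  # one 4-input AND per row decoder
# A row's AND needs an inverter (bubble) on every address bit that must be 0.
inverter_count = sum(format(row, "04b").count("0") for row in range(ROWS))


def next_state(rows, address, data):
    """One rising clock edge: the selected row loads data, the rest hold."""
    enables = [int(address == row) for row in range(len(rows))]  # decoder
    return [data if en else q for en, q in zip(enables, rows)]   # mux + FF


memory = [0] * ROWS
memory = next_state(memory, address=3, data=0xAB)
print(ff_count, and4_count, inverter_count, hex(memory[3]))
```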
• We will now do the mapping. Assume we don’t have 4-input AND cells. We will have to break each one into 2-input ANDs as shown.
• We have shown that NANDs and NORs are preferred over ANDs and ORs. We will use Boolean manipulations to convert the ANDs to NANDs and NORs as shown below.
• We have another optimization to reduce the cell count:
o Consider the last 4 addresses. The orange gates are all the same, so we can remove the duplicates and keep only one cell, which saves area.
• If we do the same for the entire address generator, we end up with 8 NAND gates instead of 32.
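The 32-to-8 reduction can be counted in Python. The sketch assumes each 4-bit address decoder is split into one first-level gate for the upper address pair and one for the lower pair, as in the mapping step; the pair labels `"a3a2"` and `"a1a0"` are names chosen for this example.

```python
first_level_gates = []
for address in range(16):
    bits = format(address, "04b")  # bit string ordered a3 a2 a1 a0
    # A gate is identified by which address pair it decodes and which
    # value of that pair it detects; identical gates can be shared.
    first_level_gates.append(("a3a2", bits[0:2]))
    first_level_gates.append(("a1a0", bits[2:4]))

total_gates = len(first_level_gates)        # before sharing
shared_gates = len(set(first_level_gates))  # after removing duplicates
print(total_gates, "->", shared_gates)
```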
• Consider the first row in the schematic below:
o We have a NAND with 2 bubbles/inverters at its inputs. That is a total of 3 gates. We can replace them with one OR gate that performs the same function.
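This replacement is De Morgan's law, and it can be verified for all inputs with a few lines of Python. The helper names `inv` and `nand` are just for this check.

```python
from itertools import product


def inv(a):
    """Inverter on a single bit."""
    return 1 - a


def nand(a, b):
    """2-input NAND on single bits."""
    return 1 - (a & b)


# A NAND with both inputs inverted behaves exactly like an OR.
de_morgan_holds = all(
    nand(inv(a), inv(b)) == (a | b) for a, b in product((0, 1), repeat=2)
)
print("NAND(~a, ~b) == a OR b:", de_morgan_holds)
```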
• Our circuit now looks like this.
• Let’s address the timing aspects of this circuit:
o Each NOR gate drives 8 FFs. This load might be too large for the gate. We can solve this by upsizing the NOR gates, to x4 for example.
o Each NAND drives 4 NOR gates. We might need to upsize it to x2.
o We replaced 2 NANDs with ORs (blue arrows). An OR is slower than a NAND, so we might need to convert the OR gates to low-VT cells.
o The red arrows point to the longest and most critical paths. Each path goes through an inverter, a NAND, a NOR, then a FF. If the cells on these paths cause setup violations, they need to be faster: we can upsize them or change their VT flavor.
• The final optimization: The circuit has no output, hence
we don’t need the entire circuit and we can remove it.
This way the area is reduced to 0 ☺.
• Such issues will get detected by post-synthesis
simulation, but you can save your co-workers lots of time
by doing sanity checks on the synthesis logs and
reviewing the modules and cells that got optimized
away.
• There are cases [1] where we want to instruct the synthesis tool to avoid optimizing certain cells even if they appear useless to it. This is done using the “set_dont_touch” command.
[1] : For example, if you have physical-only or analog cells inside your RTL.
Logic Synthesis
Part 2a – FPGA Fabric
Introduction
• In the previous part we discussed the main steps of logic synthesis and saw some of the optimizations that can be done on the design to meet the constraints.
• In this part we will discuss the synthesis flow on FPGAs and the possible optimizations and settings.
• In order to fully understand how FPGA synthesis works we need to learn what’s inside an FPGA and how it operates.
• FPGAs differ from one vendor to another, so the following slides focus on the Xilinx/AMD 7 series FPGAs.
Configurable Logic Block (CLB)
• An FPGA consists of a matrix of configurable blocks interconnected via a grid of programmable wires
• The most common and main block is the CLB which is used to create logic elements as simple as an inverter or as complex as a multiplier.
• Each CLB consists of 2 subblocks called Slices. The slices can communicate with each other using special vertical routes or using the general routes
(more on this later).
[Figure: CLB showing the special routes and the general routes]
Slice
• Each CLB consists of 2 subblocks called Slices. The slices can communicate with each other
using special and fast vertical routes or using general routes.
• The slice consists of:
o 4 Lookup tables (LUT) : Used to create any logic function or small memory elements
o MUX7, MUX8: Used to merge the outputs of multiple LUTs to create larger LUTs
o Carry Chain: Used to implement arithmetic functions
o Storage Elements (FFs/Latches)
[Figure: slice showing the LUTs, MUXes, and registers]
Lookup Table (LUT)
• The LUT is the main block inside the slice. It is a small memory that can be configured to perform any logic function.
• Inputs to the LUT serve as addresses to its memory, enabling the implementation of any logic function by pre-storing corresponding output values.
a  b  |  AND  OR  NAND  NOR  XOR
0  0  |   0   0    1     1    0
0  1  |   0   1    1     0    1
1  0  |   0   1    1     0    1
1  1  |   1   1    0     0    0
• The smallest LUT in the 7 series Xilinx FPGAs is a 6-input LUT. However, we can implement a smaller function with, for example, 3 inputs, by using the inputs we need and tying the rest to logic 0. In that case we won’t fully utilize the LUT [1].
[1] : This fact will be important when we discuss area optimizations in FPGA synthesis.
MUXES
(F7MUX, F8MUX)
LUT7 – LUT8
• A larger LUT can be constructed by combining the outputs of multiple smaller LUTs using multiplexers. The example below shows how two 2-input LUTs can be made into a 3-input LUT using a multiplexer.
• The multiplexers inside the 7 series FPGAs are:
o MUX7: combines two LUT_6 to form a LUT_7.
o MUX8: combines two LUT_7 to form a LUT_8.
• The largest combinatorial logic that can be implemented within a single slice is an 8-input logic. If we want to implement a larger logic we need to combine multiple slices.
First 2-input LUT (selected when x2 = 0):
x1  x0  |  Out
0   0   |   0
0   1   |   0
1   0   |   0
1   1   |   1
Second 2-input LUT (selected when x2 = 1):
x1  x0  |  Out
0   0   |   0
0   1   |   1
1   0   |   1
1   1   |   1
Resulting 3-input LUT:
x2  x1  x0  |  Out
0   0   0   |   0
0   0   1   |   0
0   1   0   |   0
0   1   1   |   1
1   0   0   |   0
1   0   1   |   1
1   1   0   |   1
1   1   1   |   1
[Figure: slice with LUT_6 pairs combined by MUX7 into LUT_7, and two LUT_7 combined by MUX8 into LUT_8]
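The 3-input example can be modeled in Python: two 2-input LUTs share the inputs x1 and x0, and a 2-to-1 multiplexer driven by x2 picks one of their outputs. The names `LUT_LOW`, `LUT_HIGH`, `mux2`, and `lut3` are made up for this sketch; the stored values are the ones from the example tables in this slide.

```python
LUT_LOW = [0, 0, 0, 1]   # contents of the LUT selected when x2 = 0
LUT_HIGH = [0, 1, 1, 1]  # contents of the LUT selected when x2 = 1


def lut2(contents, x1, x0):
    """2-input LUT read: (x1, x0) form the address."""
    return contents[(x1 << 1) | x0]


def mux2(select, d0, d1):
    """2-to-1 multiplexer."""
    return d1 if select else d0


def lut3(x2, x1, x0):
    """3-input LUT built from two 2-input LUTs and one mux."""
    return mux2(x2, lut2(LUT_LOW, x1, x0), lut2(LUT_HIGH, x1, x0))


# Walk through all 8 input combinations in counting order.
table = [lut3(x2, x1, x0)
         for x2 in (0, 1) for x1 in (0, 1) for x0 in (0, 1)]
print(table)
```

MUX7 and MUX8 apply the same idea one and two levels up, with 6-input LUTs in place of the 2-input ones.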
LUT Combining
• A LUT6 cell is internally constructed using two LUT5 elements and a multiplexer (MUX).
• If the input (I5) is connected to a dynamic input then LUT6 functions as a 6-input logic gate.
• If the input (I5) is tied to “logic 1” then LUT6 functions as two 5-input gates. However, both gates should have the same inputs.
Muxes
• Each LUT6 can be configured as a 4-to-1 Multiplexer (4 Data ports, 2 Selection Ports).
• If we combine 2 LUT6 with a MUX7 we get an 8-to-1 MUX. And if we combine all 4 LUT6 we get a 16-to-1 MUX.
• The largest MUX that can be implemented within one slice is a 16-to-1 MUX. If we
need to implement a larger MUX, we need to combine multiple slices.
LUT As Memory (Distributed RAM)
• The normal LUT is basically a memory so it can be used as a ROM (Read Only Memory). But can
it be used as a RAM that can be modified dynamically?
• The 7 series FPGAs contain special slices called SLICEM that contain special LUTs with a clock,
write_enable, and write_address pins. These pins allow the LUT to be used as a RAM
• The pins of each LUT are:
o DI: The data bit that will be written inside the LUT
o A[6:1] : Read address.
o WA[6:1] : Write address
o WE: Write enable
o O: The memory output
[Figure: SLICEM]
Sync vs Async Operation
• By default, write operations to the LUTRAM are synchronous, occurring on the clock edge, while read operations are asynchronous.
• However, synchronous read operations can be enabled by utilizing a pipeline register at the LUT output, which introduces a one-cycle latency but improves
timing performance by reducing combinational delay.
Asynchronous Read
(Data changes with the address after a combinational delay)
Synchronous Read
(Data changes with the clock edge)
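The difference between the two read styles can be sketched with a small behavioral model in Python. The class `LutRam64x1` and its method names are hypothetical; it models a 64x1 LUT RAM with a synchronous write, an asynchronous read, and an optional output register for a synchronous read.

```python
class LutRam64x1:
    """Behavioral sketch of a 64x1 LUT RAM (not a vendor primitive model)."""

    def __init__(self):
        self.mem = [0] * 64
        self.read_reg = 0  # optional pipeline register at the LUT output

    def read_async(self, address):
        """Asynchronous read: the output follows the address immediately."""
        return self.mem[address]

    def clock_edge(self, we, write_address, di, read_address):
        """Rising clock edge: capture read data, then perform the write."""
        # The register samples the value present before the edge.
        self.read_reg = self.mem[read_address]
        if we:
            self.mem[write_address] = di

    def read_sync(self):
        """Synchronous read: data changes only with the clock edge."""
        return self.read_reg


ram = LutRam64x1()
ram.clock_edge(we=1, write_address=5, di=1, read_address=5)
print(ram.read_async(5), ram.read_sync())
ram.clock_edge(we=0, write_address=0, di=0, read_address=5)
print(ram.read_async(5), ram.read_sync())
```

The registered value lags the asynchronous one by a cycle, which is the one-cycle latency mentioned above.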
Multi Ports
• FPGAs allow for the creation of memories with multiple read and write ports. This is particularly useful in applications requiring simultaneous access to different memory locations.
• The diagram below shows how to have a dual port RAM:
o The write data and address are passed to 2 LUTs in parallel. Any write operation will write to both LUTs
simultaneously.
o The upper LUT has the read address “A[6:1]” and write address “WA[6:1]” connected to each other [1]. This means you can either read from or write to this LUT at a time, but not both simultaneously.
o The lower LUT: has separate read and write address pins. This means you can read and write
simultaneously
o Based on the above, the operations we can do simultaneously are:
▪ Write (Both) and Read (Lower)
▪ Read (Upper) and Read (Lower)
• The slice has 4 LUTs, which means it can be configured as up to a 4-port (quad-port) RAM.
o Similarly, the write data and address go to all LUTs in parallel.
o Under this configuration the operations that can be done simultaneously are:
▪ Write and 3 Reads
▪ 4 Reads
[Figure: Dual Port RAM and Quad Port RAM]
[1] : The first LUT always has the write and read address connected. So under any configuration, it can either do a write or a read operation at a time but not both. The user guide doesn’t mention why the LUT is made so, but it could be that there are no routing resources to accommodate 2 separate ports for read and write addresses.
Deeper Memories
• We can have smaller memories with, for example, 16 locations by tying the unused most significant address bits to logic “0”. But this way we don’t fully utilize the LUT.
• The question now is, can we have deeper memories with
more than 64 locations in a single slice?
• We can combine 2 LUT6 (64x1 RAM) to construct a LUT7
(128x1 RAM) using MUX7 as shown in the figure below.
• If we combine all 4 LUT6 we can construct a LUT8 (256x1
RAM). This is the deepest memory that can be implemented
in a single slice. If we need a deeper memory we have to use
multiple slices.
• We can create a dual port 128x1 RAM by using two LUT7
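The 128x1 construction can be illustrated with a read-path sketch in Python. It assumes the extra (most significant) address bit drives the MUX7 select while the lower 6 bits address both LUTs; the function name `ram128x1_read` and that bit assignment are choices made for this example.

```python
def ram128x1_read(lut_low, lut_high, address):
    """Read a 128x1 memory built from two 64x1 LUT memories and a MUX7."""
    mux7_select = (address >> 6) & 1  # assumed: top address bit picks the LUT
    lut_address = address & 0x3F      # lower 6 bits go to both LUTs
    selected = lut_high if mux7_select else lut_low
    return selected[lut_address]


lut_low = [0] * 64
lut_high = [0] * 64
lut_high[3] = 1  # overall location 64 + 3 = 67

print(ram128x1_read(lut_low, lut_high, 67))
print(ram128x1_read(lut_low, lut_high, 3))
```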
Multi-Bit RAMs
• So far all the memory configurations we saw are 1-bit wide. We will now learn how to construct wider (multi-bit) memories.
• To construct wider memories we use multiple memories in parallel. The address and write enable are passed to all RAMs and each RAM has dedicated data in
and data out ports to store one bit as shown in the diagram.
• Since LUT6 is built internally with 2 LUT5, each LUT6 can be used as a 2-bit wide, 32-deep RAM (32x2). We can use 2, 3, or 4 LUT6 to build 32-deep by 4, 6, or 8-bit wide RAMs respectively.
• Other configurations include 64x2, 64x3, 128x2 RAMs etc.
Shift Register Logic
(SRL)
LUT As Shift Register Logic (SRL)
• The LUT can be configured as a shift register without using the FFs inside the slice.
• Each LUT6 can be configured to implement a chain of 32 shift registers [1].
• The output of LUT6 (MC31) can be passed to the LUT next to it to form a longer
chain (64, 96, or 128).
• The address ports can be used to dynamically read out bits from the chain of
registers.
[Figure: 32-Bit Shift Register; Reading Bits From the Shift Reg; 96-Bit Shift Register]
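The SRL behavior described above can be sketched in Python. The class `Srl32` is a hypothetical behavioral model: data shifts in at position 0, the address selects which stage to read, and the last stage (MC31 in the text) feeds the next LUT when chaining.

```python
class Srl32:
    """Behavioral sketch of a LUT used as a 32-stage shift register."""

    def __init__(self):
        self.bits = [0] * 32  # bits[0] is the newest, bits[31] the oldest

    def shift(self, d):
        """One clock edge: push d in, drop the oldest bit."""
        self.bits = [d] + self.bits[:-1]

    def read(self, address):
        """Dynamic read: the address selects one stage of the chain."""
        return self.bits[address]

    @property
    def mc31(self):
        """Cascade output: the last stage, used to chain to the next LUT."""
        return self.bits[31]


def shift_chain(chain, d):
    """Shift a chain of SRLs: each one takes the previous MC31 as input."""
    carries = [d] + [srl.mc31 for srl in chain[:-1]]  # sample before shifting
    for srl, carry in zip(chain, carries):
        srl.shift(carry)


chain = [Srl32(), Srl32(), Srl32()]  # 96-bit shift register
shift_chain(chain, 1)
for _ in range(95):
    shift_chain(chain, 0)
print(chain[2].mc31)
```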
[1] : LUT6 can’t implement 64 shift registers because it consists internally of 2 LUT5 and there is no way to connect the
Carry Chain
• The slice contains primitive cells to implement efficient carry look-ahead adders.
• The carry lookahead algorithm calculates the sum as follows:
o Propagate term: $P_i = A_i \oplus B_i$
o Generate term: $G_i = A_i \cdot B_i$
o Carry term: $C_{i+1} = G_i + P_i \cdot C_i$
o Sum term: $S_i = P_i \oplus C_i$
• The main issue lies with the carry term because each bit ($C_{i+1}$) depends on the previous one ($C_i$). This creates a long chain of logic.
o $C_1 = G_0 + P_0 \cdot C_0$
o $C_2 = G_1 + P_1 \cdot C_1 = G_1 + P_1 \cdot (G_0 + P_0 \cdot C_0)$
o $C_3 = G_2 + P_2 \cdot C_2 = G_2 + P_2 \cdot (G_1 + P_1 \cdot (G_0 + P_0 \cdot C_0))$
[Figure: carry chain stages with inputs P1 to P3 and G1 to G3, carries C2 to C4, and sum outputs S1 to S3]
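The four terms can be evaluated bit by bit in Python to confirm that they produce a correct sum. This sketch follows the carry recurrence directly; the function name `cla_add` and the 4-bit default width are choices made for the example.

```python
def cla_add(a, b, width=4, carry_in=0):
    """Add two unsigned numbers using propagate/generate terms."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]

    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # propagate: P_i = A_i xor B_i
    g = [x & y for x, y in zip(a_bits, b_bits)]  # generate:  G_i = A_i and B_i

    c = [carry_in]                               # carry: C_{i+1} = G_i + P_i*C_i
    for i in range(width):
        c.append(g[i] | (p[i] & c[i]))

    s = sum((p[i] ^ c[i]) << i for i in range(width))  # sum: S_i = P_i xor C_i
    return s, c[width]                           # (sum bits, carry out)


print(cla_add(9, 7))
```

Each carry in the loop waits for the previous one, which is the long logic chain the dedicated carry primitives are built to speed up.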
Registers
• There are eight [1] storage elements per slice. Four of them can be configured as either edge-triggered D-type flip-flops or level-sensitive latches.
• There are four additional storage elements that can only be configured as edge-triggered D-type flip-flops.
• Each storage element has the following pins:
o D: The input to the FF/Latch. It comes from a LUT or from the bypass signal Ax that comes from outside the slice and doesn’t go through a LUT.
o Q: The output of the FF/Latch.
o CE: Clock enable signal (acts as the write enable of the storage element).
o SR: Set/Reset pin.
• Each storage element can be configured during synthesis as follows:
o Is it a latch or FF?
o Does it contain 0 or 1 upon FPGA power-up? (INIT0/INIT1)
o Does the SR pin function as a set pin (1) or reset pin (0)?
o Is the Set/Reset synchronous or asynchronous? This option affects all storage elements in the slice and can’t be set individually.
[1] : There are 8 storage elements although there are just 4 LUTs because if we enable LUT combining, each LUT can provide 2 separate outputs (O5 and O6), so under this configuration we can have 8 different outputs from the slice.
Digital Signal Processing
(DSP48E1)
DSP48E1
• The 7 series FPGAs contain lots of DSP slices which are used to implement fast, low-power arithmetic blocks.
• The DSP slice consists of:
o 25-bit pre-adder: (A ± D)
o 25 x 18 multiplier: (A ± D) × B
o Control MUXs
o ALU: ((A ± D) × B) + C
o Pattern detector
o Optional pipeline registers: to improve timing and synchronize operands
Pre-adder
• The pre-adder adds A (the 25 least significant bits) to D (25-bit) and then passes the result to the multiplier (A MULT), or bypasses A concatenated to B (A:B) to the ALU (X MUX), or passes A to another DSP slice (ACOUT).
[Figure: pre-adder with static bypass muxes and dynamic bypass muxes]
• The pipeline registers inside the pre-adder can be static or dynamic:
o Static means they are always used and can’t be bypassed during runtime.
o Dynamic means they can be controlled during runtime by control signals to enable the
pipelines or bypass them.
o The dynamic behavior is controlled with the INMODE[4:0] port. The port allows more
dynamic operations other than bypassing the pipeline registers. The table below shows
all the different configurations
Control MUXs
• The inputs to the Adder/ALU pass through 3 MUXs. The selections of these MUXs can be controlled dynamically
• MUX X inputs:
OPMODE[1:0]   MUX X Output
00            0
01            M (Output from the multiplier) [1]
10            P (Final output from the same DSP slice to implement an accumulator)
11            A:B (Concatenated A and B)
• MUX Y inputs:
OPMODE[3:2]   MUX Y Output
00            0
01            M (Output from the multiplier) [1]
10            48'hFFFF_FFFF_FFFF
11            C (Input port)
• MUX Z inputs:
OPMODE[6:4]   MUX Z Output
000           0
001           PCIN (Final output from the adjacent DSP slice)
010           P (Final output from the same DSP slice to implement an accumulator)
011           C (Input port)
100           P
101           17-bit shift PCIN (To enable wider multiplier implementation) [2]
110           17-bit shift P (To enable wider multiplier implementation) [2]
111           Illegal selection
[Figure: Inputs to the ALU]
[1] : The multiplier output appears at both MUX X and MUX Y because the multiplier produces 2 partial products that should be summed together to get the final result of the multiplier. So each MUX gets one of the partial products.
[2] : Discussed in the next slide.
Wider Multipliers – Partial Products
• We can build wider multipliers using smaller multipliers, shifters, and adders.
• Consider the example on the right:
o We want to multiply two 2-digit numbers but only have a single-digit multiplier.
o We can decompose the 2-digit numbers into smaller single-digit numbers like so:
▪ 35 = 30 + 5
▪ 24 = 20 + 4
▪ 35 × 24 = (30 × 20) + (5 × 20) + (30 × 4) + (5 × 4)
▪ We will ignore the right-hand zeros but we should shift the results by the number
of zeros we ignored
❑ 3 × 2 = 6 then shift by 2 = 600
❑ 5 × 2 = 10 then shift by 1 = 100
❑ 3 × 4 = 12 then shift by 1 = 120
❑ 5 × 4 = 20
▪ Now sum all the terms and get the result = 840
• We can do the same thing in binary (35 = 100011, 24 = 011000) as shown in the example on the right.
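The decomposition can be reproduced with a small Python sketch that works in any base. The function name `multiply_by_parts` is made up for this example. Splitting the 6-bit binary operands into two 3-bit halves (base 8) is an assumption, since the binary figure is not reproduced here.

```python
def multiply_by_parts(x, y, base=10):
    """Multiply two 2-digit numbers using only 1-digit multiplications."""
    x_high, x_low = divmod(x, base)
    y_high, y_low = divmod(y, base)
    # Each partial product is shifted by the number of digits ignored.
    terms = [
        x_high * y_high * base ** 2,  # high x high, shifted by 2 digits
        x_low * y_high * base,        # low x high, shifted by 1 digit
        x_high * y_low * base,        # high x low, shifted by 1 digit
        x_low * y_low,                # low x low, no shift
    ]
    return terms, sum(terms)


print(multiply_by_parts(35, 24))          # decimal digits
print(multiply_by_parts(35, 24, base=8))  # 3-bit halves of the binary values
```

In hardware the multiplications by a power of the base are plain shifts, which is why the DSP slice offers the 17-bit shift options on MUX Z.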
ALU
• The ALU can be configured dynamically using OPMODE[3:2] and ALUMODE[3:0] to function as:
o Logic unit: to implement all logic functions (AND, NAND, XOR, etc.). When this mode is used,
the multiplier should not be used.
o Adder/Subtractor
• Single Instruction, Multiple Data (SIMD) Mode:
o The ALU can be split into smaller ALUs to do multiple calculations at once.
o The ALU can be configured as two 24-bit ALUs or four 12-bit ALUs. However, the ALUs
should do the same operation (Single Instruction)
o This configuration is static
DSP48E1 – Pattern Detector
• The DSP contains a specialized comparator to detect patterns in the output of the last stage.
• The pattern detector works as follows:
1. Select bits to consider (mask): a mask is applied on the output to select which bits we want to check and which we want to ignore. When a MASK bit is set to 1, the corresponding pattern bit is ignored. When it is 0, the pattern bit is compared.
2. Select comparing pattern: the pattern we want to compare our output to. It can be set statically as a constant or dynamically using the C port [1].
3. Select operation on detection: if the pattern is matched, we choose one of many operations, such as raising a flag upon pattern detection. Other operations include resetting the P register when a match occurs or resetting it if a match didn’t occur.
• The example on the right shows a 16-bit wide fixed-point number. We want to check whether
the result is both odd (least significant integer bit = 1) and an integer (all fractions = 0):
1. The mask: We only consider the least significant bit of the integer part plus all the fraction bits.
Therefore we apply a mask as shown.
2. Comparing pattern: We want to check that the least significant bit = 1 and all the fraction bits =
0 . Therefore we apply a pattern as shown.
3. Operation on detection: We want to raise a flag when this pattern is detected. By default, the
port “PATDET” is set to “1” when a match occurs so no more work is needed.
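The mask and pattern comparison can be sketched in Python. The slide's figure is not reproduced here, so the split of the 16-bit number into 8 integer and 8 fraction bits (`FRACTION_BITS = 8`) is an assumption, and `pattern_detect` is a hypothetical helper that only models the compare, not the DSP slice itself.

```python
WIDTH = 16
FRACTION_BITS = 8  # assumption: the slide's fixed-point format is not shown

# Bits we care about: all fraction bits plus the integer LSB.
CARE_BITS = (1 << (FRACTION_BITS + 1)) - 1
MASK = ((1 << WIDTH) - 1) & ~CARE_BITS  # MASK bit = 1 means "ignore this bit"
PATTERN = 1 << FRACTION_BITS            # integer LSB = 1, all fraction bits = 0


def pattern_detect(p, pattern=PATTERN, mask=MASK, width=WIDTH):
    """Return 1 when every unmasked bit of p equals the pattern bit."""
    compared = ~mask & ((1 << width) - 1)
    return int(((p ^ pattern) & compared) == 0)


def to_fixed(value):
    """Convert a non-negative number to the assumed fixed-point encoding."""
    return int(value * (1 << FRACTION_BITS))


for value in (3.0, 2.0, 3.5):
    print(value, pattern_detect(to_fixed(value)))
```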
[1] : Dynamic mask can only be used when the DSP is configured as a multiplier.
Block RAMs
(BRAM)
Block RAM (BRAM)
• The block RAM in Xilinx 7 series FPGAs stores up to 36 Kbits of data and can be configured as either two independent 18 Kb RAMs or one 36 Kb RAM.
• The block RAM can be configured into different depths/widths.
• Since the depth can only be a power of 2, the possible configurations are [1]:
36 Kbit block:
Depth   Width (Bits)
32K     1
16K     2
8K      4
4K      9
2K      18
1K      36
512     72 [2]
18 Kbit block:
Depth   Width (Bits)
16K     1
8K      2
4K      4
2K      9
1K      18
512     36 [2]
• The RAM sizes (36Kbits and 18Kbits) are not powers of 2 because the RAMs can include parity bits for each byte to allow error detection and
correction.
o 32 Bits (4 Bytes) + 4 Parity bits = 36
o 16 Bits (2 Bytes) + 2 Parity bits = 18
• Each block RAM can be configured as dual-port RAM to allow WRITE/WRITE, READ/READ, or WRITE/READ operations.
o In the case of collision, where both ports are trying to write to the same location, the memory location is written with non-deterministic data.
It’s up to the user to avoid or handle collisions.
[1] : The RAMs can be of any width or depth but won’t be fully utilized.
[2] : This data width is only allowed with SDP configuration. See next slide for more info.
True Dual Port (TDP) vs Simple Dual Port (SDP)
• The RAM can be:
o True dual port: A total of 2 ports, where each port can read or write. This allows (READ/READ), (WRITE/WRITE), or (READ/WRITE).
o Simple dual port: A total of 2 ports, where one port can only read and the other can only write. This allows (READ/WRITE) only.
▪ This configuration avoids the (WRITE/WRITE) collision and leaves routing resources for other blocks (fewer nets and pins).
▪ This configuration allows the write and read ports to have different widths [1]. For example, write can be 18-bit wide, and read can be 36-bit wide.
[1] : Either the Read or Write port is a fixed width of x32 or x36 for RAM18 and x64 or x72 for RAM36.
Read and Write Operations
• The write operation is synchronous to the clock edge (can be configured to be positive or negative edge).
• By default the read operation is asynchronous to the clock. This mode is called “Latch mode”
o An optional register can be added to the output port to enable synchronous operation. This is done to separate the long combinational 𝑇𝑐𝑞 delay from the
memory to the output, thus allowing increasing the operating frequency.
Asynchronous Read
(Data changes with the address after a combinational delay)
Synchronous Read
(Data changes with the clock edge)
Write Modes
• In true dual port mode each port can do a read or write operation. A question arises, what happens to the output read port (DO) if a write operation is done? Does
it change or remain unchanged?
• The BRAM can have one of 3 write modes:
o Write First: The data that is being written appears at the output read port.
o Read First: The old data that was stored in the same address that will get overwritten appears at the output read port.
o No Change: The output data from the last read operation remains at the port.
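The three write modes can be compared with a small behavioral model in Python. The class `BramPort` and the mode strings are labels chosen for this sketch; it models one port's output (DO) on each clock edge and nothing else about the block RAM.

```python
class BramPort:
    """Behavioral sketch of one BRAM port and its write mode."""

    def __init__(self, depth, write_mode):
        assert write_mode in ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE")
        self.mem = [0] * depth
        self.write_mode = write_mode
        self.do = 0  # value currently shown on the output read port

    def clock_edge(self, address, we=0, di=0):
        """One clock edge: a read when we=0, a write when we=1."""
        if we:
            old = self.mem[address]
            self.mem[address] = di
            if self.write_mode == "WRITE_FIRST":
                self.do = di   # new data appears at the output
            elif self.write_mode == "READ_FIRST":
                self.do = old  # previously stored data appears at the output
            # NO_CHANGE: the output keeps the last read value
        else:
            self.do = self.mem[address]
        return self.do


for mode in ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE"):
    port = BramPort(depth=16, write_mode=mode)
    port.clock_edge(2, we=1, di=5)           # store 5 at address 2
    port.clock_edge(0)                       # read address 0 (contains 0)
    print(mode, port.clock_edge(2, we=1, di=9))
```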