Synthesis

Logic synthesis is the process of converting HDL code into a netlist, which represents the design's logic gates and their connections. The document outlines the steps involved in synthesis, including analyzing HDL code, elaborating it into logic gates, and optimizing the design based on user-defined constraints such as timing, power, and area. Various optimization techniques, such as constant propagation and cross-boundary optimization, are discussed to improve the efficiency of the synthesized logic.

Uploaded by

Feroz Ahmed

Logic Synthesis

Part 1

Amr Adel Mohammady

/amradelm

Introduction
• Logic synthesis is the process of converting the HDL code into logic gates.
• The output of synthesis is called a netlist. A netlist is a textual representation of all the nets and cells in the design and their connections.
• The aim of this document is to explain the steps within the synthesis flow and show the possible optimizations that can be done.

[Figure: behavioral Verilog code, the synthesized logic schematic, and the synthesized logic netlist]
Analyze
• The first step in synthesis is to read and analyze the HDL code and the other inputs to ensure nothing is missing or corrupted.
• The analyze step checks that:
o The HDL code contains no syntax errors.
o All modules, include files, functions, etc. referenced in the code are available and nothing is missing.

[Figure: example HDL [1], annotated with the questions "Is the syntax correct?" and "Is this file available?"]

[1] : https://github.com/pulp-platform/pulpino

Elaborate
• The second step is called elaborate, and is sometimes called translate. It is the process of converting the HDL code into actual logic gates.
• The elaboration step outputs are:
o A technology-independent (generic) netlist without any optimizations. The cells referenced in the netlist carry only functional information; they have no timing, power, or physical information.
o Reports about linting and design issues in the code, such as missing modules, multi-driven nets, width mismatches, etc.
o Reports about the cells, their types, and their counts.

Example Cell Report from Design Compiler

Map and Optimize
(Compile)

Compile
• Compile is the most interesting and important step. Here we map the generic netlist into a technology-dependent netlist and then do several optimizations
to meet the given constraints.
• The constraints are supplied by the user and fall into 3 categories:
o Timing constraints: The clock frequency, the input/output delays, false paths, multi-cycle paths, max transitions, etc.
o Power constraints: Max dynamic power consumption, max leakage power consumption, etc.
o Area constraints.
• Along with the constraints, the user enables or disables application settings that control the tool's behavior and optimizations.
• The compile step outputs are:
o A technology-dependent and optimized netlist
o Reports about the design timing results, power consumption, area utilization, etc.

Compile – Map
• In the case of ASICs, the generic cells are mapped to the cells defined inside the standard cell libraries [1].
• In the case of FPGAs, the cells are mapped into FFs, LUTs, DSP blocks, and memory blocks that are implemented within the FPGA fabric.
o The LUT (look-up table) is a small memory that can be configured to implement any basic logic gate. In the table below, if we treat the inputs a and b as the address, we can implement any function by storing its truth table inside the memory.
o The DSP blocks are special, fast blocks that are used to implement arithmetic operations.

a  b  AND  OR  NAND  NOR  XOR
0  0  0    0   1     1    0
0  1  0    1   1     0    1
1  0  0    1   1     0    1
1  1  1    1   0     0    0
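
The LUT idea can be sketched in a few lines of Python. This is only an illustrative model, not vendor code: the helper name `make_lut2` is hypothetical, and the stored contents are simply the AND, OR, and XOR columns of the truth table on this slide, addressed by the inputs a and b.

```python
def make_lut2(truth_table):
    """Build a 2-input LUT: the inputs (a, b) form the address into a 4-entry memory."""
    if len(truth_table) != 4:
        raise ValueError("a 2-input LUT stores exactly 4 bits")
    contents = list(truth_table)

    def lut(a, b):
        # Address = {a, b}; the stored bit at that address is the gate output.
        return contents[(a << 1) | b]

    return lut


# Memory contents listed for ab = 00, 01, 10, 11 (columns of the truth table).
lut_and = make_lut2([0, 0, 0, 1])
lut_or = make_lut2([0, 1, 1, 1])
lut_xor = make_lut2([0, 1, 1, 0])
```

The same hardware implements a different gate just by changing the stored bits, which is why FPGA mapping does not prefer one gate type over another.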

[Figure: LUT as AND; LUT as OR; DSP48E1 from Xilinx/AMD]

[1] : You can learn more about standard cell libraries here : Click Me
Compile – Optimize
• After mapping, the synthesis tool will do many optimizations and tradeoffs to fulfill the design constraints. We will go through the important ones.
Constant Propagation
• This optimization propagates constant values to remove redundant logic. For example, a 2-input AND where one of its inputs is a constant “0” will always produce “0”
regardless of the other input value.
• Consider the example below:
o The 1st XOR gate with both inputs having the same logic value will always produce “0”. Therefore we can remove the first XOR and propagate zero to the next XOR.
o The 2nd XOR has A in one input and “0” in the other. An XOR where one of its inputs is a “0” will pass the value of the other input as it is. Therefore we can remove the 2nd
XOR and propagate A to the next XOR.
o The 3rd XOR has A in both its inputs and so, will always produce 0. The entire circuit can be optimized away and a logic “0” be propagated to its output.
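
The three steps above can be mimicked with a tiny symbolic simplifier. This is a minimal sketch, not how a synthesis tool is implemented: `simplify_xor` is a hypothetical helper, and the signal name "B" for the first XOR's shared input is an assumption, since the slide only says both inputs carry the same value.

```python
def simplify_xor(x, y):
    """Simplify x XOR y, where each operand is the constant 0 or 1, or a signal name."""
    if x == y:
        return 0          # identical inputs always give 0
    if x == 0:
        return y          # 0 XOR y passes y through
    if y == 0:
        return x          # x XOR 0 passes x through
    return ("xor", x, y)  # nothing to fold; keep the gate


n1 = simplify_xor("B", "B")   # 1st XOR: same value on both inputs -> 0
n2 = simplify_xor("A", n1)    # 2nd XOR: A XOR 0 -> A
out = simplify_xor(n2, "A")   # 3rd XOR: A XOR A -> 0
```

After the three calls, `out` is the constant 0, so no gate remains in the result.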

Compile – Optimize

Cross-Boundary Optimization
• The RTL code consists of several modules that are connected to each other.
• By default, the synthesis tools will synthesize each module separately and then connect them at the top module.
• In the example below, the inverter pair can be removed. But since the inverters exist in different modules the synthesis tool won’t remove them.
• One solution to this is to enable cross-boundary optimization. This allows the tool to perform the optimizations and also propagate constants across the modules. The
drawback is an increase in runtime.

Compile – Optimize

Hierarchy Flattening
• The other solution is to remove the module boundaries altogether. This is called flattening. It generally produces better results but increases runtime and makes post-synthesis simulation and the PNR flow more difficult [1].

[Figure: No Flattening; With Flattening]

[1] : Because it makes tracing signals and referencing cells more difficult.
Compile – Optimize

NAND/NOR vs AND/OR
• In CMOS logic, NANDs and NORs have smaller area, lower power, and shorter delay than ANDs and ORs. This is because an AND is implemented as a NAND followed by an inverter, and likewise an OR is a NOR followed by an inverter.
• Because of this, ASIC [1] synthesis tools will try to optimize the design to use NANDs and NORs when possible.
• The example on the right shows two circuits that perform the same function. However, the bottom one is better in all aspects (timing, area, power).

[Figure: CMOS NAND, CMOS inverter, and CMOS AND; two equivalent circuits labeled "Bad" and "Good"]

[1] : This is not the case in FPGAs because all gates are implemented with LUTs.
Compile – Optimize

Area vs Timing
• Synthesis tools have switches to control the tradeoff between area vs timing.
• Consider the example below, both circuits perform the same function but:
o The circuit on the left has less area (115 µm²) but a longer critical path (1050 ps), so it is better for area.
o The circuit on the right has more area (180 µm²) but a shorter critical path (800 ps), so it is better for timing.
o The synthesis tool will choose which circuit to use based on the user settings.

[Figure: two equivalent circuits built from the following blocks]
MUX: T_comb = 100 ps, Area = 10 µm²
Adder: T_comb = 700 ps, Area = 75 µm²
Logic: T_comb = 250 ps, Area = 20 µm²
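
The totals quoted on this slide can be reproduced from the per-block numbers. The sketch below assumes a particular structure for each circuit, since the figure itself is not available here: the left circuit is assumed to share one adder behind two operand MUXes that are selected by the logic block, and the right circuit is assumed to use two adders followed by one output MUX, with the logic block running in parallel with the adders. These structures are a guess that happens to match the stated area and delay figures.

```python
# Per-block figures taken from the slide.
MUX = {"delay_ps": 100, "area_um2": 10}
ADDER = {"delay_ps": 700, "area_um2": 75}
LOGIC = {"delay_ps": 250, "area_um2": 20}

# Assumed left circuit: logic -> 2 operand MUXes -> 1 shared adder (in series).
left_area = 2 * MUX["area_um2"] + ADDER["area_um2"] + LOGIC["area_um2"]
left_delay = LOGIC["delay_ps"] + MUX["delay_ps"] + ADDER["delay_ps"]

# Assumed right circuit: 2 adders in parallel with the logic -> 1 output MUX.
right_area = 2 * ADDER["area_um2"] + MUX["area_um2"] + LOGIC["area_um2"]
right_delay = max(ADDER["delay_ps"], LOGIC["delay_ps"]) + MUX["delay_ps"]

print(left_area, left_delay)    # area-oriented choice
print(right_area, right_delay)  # timing-oriented choice
```

Under these assumptions, duplicating the adder removes the logic block from the critical path at the cost of one extra adder of area.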

Compile – Optimize

Power vs Timing
• Similarly, synthesis tools have switches to control the tradeoff between power vs timing.
• For example, a timing-driven synthesis may use large cells that have smaller delays but higher power and area.

[Figure: 2-input AND cells of size x2 and size x1 in a timing library from the SkyWater 130 nm open-source PDK]

Example – Memory
• To learn how synthesis works, we will manually synthesize the code on the right.
• The code shows a memory of 16 locations, each 8 bits wide. The memory is positive-edge triggered.
• Let's start with just a single-bit element (1 flip-flop) and then build up the rest of the memory:
o The always block is positive-edge triggered, so we need a positive-edge FF.
o We don't have a reset, so we don't need a FF with a reset pin.
o When the address corresponding to this FF is selected, the FF reads and stores the data; otherwise it maintains the currently stored value.
o To implement this we need a MUX in front of the FF. When the MUX select signal is "1" the FF reads the data; otherwise it keeps its current value.
o FPGAs and some ASIC standard cell libraries have FFs with this MUX implemented inside the FF as a single cell.
o Now we need to implement the circuit that reads the address and generates the enable signal.

Example – Memory
o To generate the enable signal from the address we need a decoder that outputs 1 to the corresponding FF and 0 to the others.
o This is implemented using AND gates and inverters (the bubbles). The example below shows an AND gate that produces "1" if the address is "0000" and "0" otherwise.
o Now we will expand the FF into 8 FFs to store the 8-bit data. The entire row shares the same enable/address signal.

[Figure: decoder gates labeled with the addresses 0000 through 1111]
Example – Memory
o We have 16 rows/locations in the memory, so we repeat the last structure 16 times. Each row has its own address generator.
• We have now finished the elaboration step. We converted the behavioral HDL into logic gates, but we have not done mapping or optimization yet.
• The initial cell count is:
o 16 × 8 FFs = 128 FFs
o 16 4-input ANDs
o 32 inverters

Example – Memory
• We will now do the mapping. Assume we don't have 4-input AND cells. We will have to break each one into 2-input ANDs as shown.

• We have shown that NANDs and NORs are preferred over ANDs and ORs. We will use Boolean manipulations to convert the ANDs to NANDs and NORs as
shown below.

• The current cell count:

o 2 × 16 = 32 𝑁𝐴𝑁𝐷
o 16 𝑁𝑂𝑅

Example – Memory
• We have another optimization to reduce the cell count:
o Consider the last 4 addresses: the orange gates are all identical, so we can remove the duplicates and keep only one cell, which saves area.

[Figure: decoder rows for addresses 0000 through 1111 (columns addr[3], addr[2], addr[1], addr[0]), with the gates shared by the last 4 addresses highlighted in orange]
Example – Memory
• If we do the same for the entire address generator, we end up with 8 NAND gates instead of 32.
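
The gate sharing can be checked with a small Python model. This is a sketch under assumptions: each row is taken to be a NOR of two 2-input NANDs (one on addr[3:2], one on addr[1:0]), following the NAND/NOR mapping described earlier, and the helper names are hypothetical. Only 4 distinct NANDs exist per address pair, giving 8 NANDs shared by 16 NORs.

```python
from itertools import product


def nand(a, b):
    return 1 - (a & b)


def nor(a, b):
    return 1 - (a | b)


def literal(bit, want):
    """Return the bit itself if the row expects a 1 here, else its inverse."""
    return bit if want else 1 - bit


def decode(a3, a2, a1, a0):
    """One-hot decode a 4-bit address using 8 shared NANDs and 16 NORs."""
    # 4 NANDs for the upper address pair and 4 for the lower pair.
    upper = {(w3, w2): nand(literal(a3, w3), literal(a2, w2))
             for w3, w2 in product((0, 1), repeat=2)}
    lower = {(w1, w0): nand(literal(a1, w1), literal(a0, w0))
             for w1, w0 in product((0, 1), repeat=2)}
    # One NOR per row; row r reuses the NANDs that match its address bits.
    return [nor(upper[(r >> 3) & 1, (r >> 2) & 1],
                lower[(r >> 1) & 1, r & 1])
            for r in range(16)]
```

For every address, exactly one row output is 1, so the shared structure behaves like the original decoder with 32 NANDs.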

Example – Memory
• Consider the first row in the schematic below:
o We have a NAND with 2 bubbles/inverters at its inputs. That is a total of 3 gates. We can replace them with one OR gate that performs the same function.

Example – Memory
• Our circuit now looks like this.
• Let's address the timing aspects of this circuit:
o Each NOR gate drives 8 FFs. This load might be too large for the gate. We can solve this by upsizing the NOR gates to x4, for example.
o Each NAND drives 4 NOR gates. We might need to upsize it to x2.
o We replaced 2 NANDs with an OR (blue arrows). An OR is slower than a NAND, so we might need to convert the OR gates to low-VT cells.
o The red arrows point to the longest and most critical paths. Each path goes through an inverter, a NAND, a NOR, and then a FF. The cells on these paths need to be faster if they cause setup violations. We can upsize them or change their VT flavor.
Example – Memory
• The final optimization: The circuit has no output, hence we don't need the entire circuit and we can remove it. This way the area is reduced to 0 ☺.
• Such issues will get detected by post-synthesis simulation, but you can save your co-workers lots of time by doing sanity checks on the synthesis logs and reviewing the modules and cells that got optimized away.
• There are cases [1] where we want to instruct the synthesis tool to avoid optimizing certain cells even if they appear useless to it. This is done using the "set_dont_touch" command.

[1] : For example, if you have physical-only or analog cells inside your RTL.
Logic Synthesis
Part 2a – FPGA Fabric

Amr Adel Mohammady

Introduction
• In the previous part we discussed the main steps of logic synthesis and saw some of the optimizations that can be done on the design to meet the constraints.
• In this part we will discuss the synthesis flow on FPGAs and the possible optimizations and settings.
• In order to fully understand how FPGA synthesis works, we need to learn what's inside an FPGA and how it operates.
• FPGAs differ from one vendor to another, so the following slides focus on the Xilinx/AMD 7 series FPGAs.

Configurable Logic Block (CLB)
• An FPGA consists of a matrix of configurable blocks interconnected via a grid of programmable wires
• The most common and main block is the CLB which is used to create logic elements as simple as an inverter or as complex as a multiplier.
• Each CLB consists of 2 subblocks called Slices. The slices can communicate with each other using special vertical routes or using the general routes
(more on this later).

[Figure: a CLB with its two slices, showing the special routes and the general routes]
Slice
• Each CLB consists of 2 subblocks called Slices. The slices can communicate with each other
using special and fast vertical routes or using general routes.
• The slice consists of:
o 4 Lookup tables (LUT) : Used to create any logic function or small memory elements
o MUX7, MUX8: Used to merge the outputs of multiple LUTs to create larger LUTs
o Carry Chain: Used to implement arithmetic functions
o Storage Elements (FFs/Latches)
[Figure: a slice, showing the LUTs, MUXes, carry chain (adders), and registers]
Lookup Table (LUT)
• The LUT is the main block inside the slice. It is a small memory that can be configured to perform any logic function.
• Inputs to the LUT serve as addresses to its memory, enabling the implementation of any logic function by pre-storing the corresponding output values.
• The smallest LUT in the 7 series Xilinx FPGAs is a 6-input LUT. However, we can implement a smaller function with, for example, 3 inputs, by using the inputs we need and tying the rest to logic 0. In that case we won't fully utilize the LUT [1].

a  b  AND  OR  NAND  NOR  XOR
0  0  0    0   1     1    0
0  1  0    1   1     0    1
1  0  0    1   1     0    1
1  1  1    1   0     0    0

[Figure: LUT as AND; LUT as OR]

[1] : This fact will be important when we discuss area optimizations in FPGA synthesis.
MUXES
(F7MUX, F8MUX)

LUT7 – LUT8
• A larger LUT can be constructed by combining the outputs of multiple smaller LUTs using multiplexers. The example below shows how two 2-input LUTs can be made into a 3-input LUT using a multiplexer.
• The multiplexers inside the 7 series FPGAs are:
o MUX7: combines two LUT_6 to form a LUT_7.
o MUX8: combines two LUT_7 to form a LUT_8.
• The largest combinational logic function that can be implemented within a single slice is an 8-input function. If we want to implement larger logic we need to combine multiple slices.

First 2-input LUT (selected when x2 = 0):
x1  x0  Out
0   0   0
0   1   0
1   0   0
1   1   1

Second 2-input LUT (selected when x2 = 1):
x1  x0  Out
0   0   0
0   1   1
1   0   1
1   1   1

Resulting 3-input LUT:
x2  x1  x0  Out
0   0   0   0
0   0   1   0
0   1   0   0
0   1   1   1
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1

[Figure: a slice in which LUT_6 pairs are combined by MUX7 into LUT_7, and two LUT_7 are combined by MUX8 into a LUT_8]
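
The 3-input example on this slide can be written as a short Python model. It is only a sketch: the list names are hypothetical, and the contents are the two 2-input truth tables from the slide, with the extra input x2 driving the multiplexer select.

```python
# Contents of the two 2-input LUTs, listed for x1x0 = 00, 01, 10, 11.
LUT_LOW = [0, 0, 0, 1]    # used when x2 = 0
LUT_HIGH = [0, 1, 1, 1]   # used when x2 = 1


def lut3(x2, x1, x0):
    """3-input LUT built from two 2-input LUTs and a 2-to-1 MUX."""
    index = (x1 << 1) | x0
    low = LUT_LOW[index]      # both LUTs are read with the same address
    high = LUT_HIGH[index]
    return high if x2 else low  # the MUX picks one LUT output


table = [lut3((n >> 2) & 1, (n >> 1) & 1, n & 1) for n in range(8)]
```

MUX7 and MUX8 play the same role one level up: each adds one address bit by choosing between two smaller LUTs.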
LUT Combining
• A LUT6 cell is internally constructed using two LUT5 elements and a multiplexer (MUX).
• If the input (I5) is connected to a dynamic input then LUT6 functions as a 6-input logic gate.
• If the input (I5) is tied to “logic 1” then LUT6 functions as two 5-input gates. However, both gates should have the same inputs.

[Figure: LUT6 as a 6-input gate; LUT6 as two 5-input gates]
Muxes
• Each LUT6 can be configured as a 4-to-1 multiplexer (4 data ports, 2 selection ports).
• If we combine 2 LUT6 with a MUX7 we get an 8-to-1 MUX. If we combine all 4 LUT6 we get a 16-to-1 MUX.
• The largest MUX that can be implemented within one slice is a 16-to-1 MUX. If we need to implement a larger MUX, we need to combine multiple slices.

LUT As Memory (Distributed RAM)
• A normal LUT is basically a memory, so it can be used as a ROM (read-only memory). But can it be used as a RAM that can be modified dynamically?
• The 7 series FPGAs contain special slices called SLICEM that contain special LUTs with clock, write-enable, and write-address pins. These pins allow the LUT to be used as a RAM.
• The pins of each LUT are:
o DI: The data bit that will be written into the LUT
o A[6:1]: Read address
o WA[6:1]: Write address
o WE: Write enable
o O: The memory output

[Figure: SLICEM]
Sync vs Async Operation
• By default, write operations to the LUTRAM are synchronous, occurring on the clock edge, while read operations are asynchronous.
• However, synchronous read operations can be enabled by utilizing a pipeline register at the LUT output, which introduces a one-cycle latency but improves
timing performance by reducing combinational delay.

Asynchronous Read
(Data changes with the address after a combinational delay)

Synchronous Read
(Data changes with the clock edge)

Multi Ports
• FPGAs allow the creation of memories with multiple read and write ports. This is particularly useful in applications requiring simultaneous access to different memory locations.
• The diagram below shows how to build a dual-port RAM:
o The write data and address are passed to 2 LUTs in parallel. Any write operation writes to both LUTs simultaneously.
o The upper LUT has the read address "A[6:1]" and the write address "WA[6:1]" connected to each other [1]. This means you can either read from or write to this LUT at a time, but not both simultaneously.
o The lower LUT has separate read and write address pins. This means you can read and write simultaneously.
o Based on the above, the operations we can do simultaneously are:
▪ Write (Both) and Read (Lower)
▪ Read (Upper) and Read (Lower)
• The slice has 4 LUTs, which means it can be configured as up to a 4-port (quad-port) RAM.
o Similarly, the write data and address go to all LUTs in parallel.
o Under this configuration the operations that can be done simultaneously are:
▪ Write and 3 Reads
▪ 4 Reads

[Figure: Dual Port RAM; Quad Port RAM]

[1] : The first LUT always has the write and read addresses connected. So under any configuration, it can do either a write or a read operation at a time, but not both. The user guide doesn't mention why the LUT is made so, but it could be that there are no routing resources to accommodate 2 separate ports for read and write addresses.
Deeper Memories
• We can have smaller memories with, for example, 16 locations, by tying the unused most significant address bits to logic "0". But this way we don't fully utilize the LUT.
• The question now is: can we have deeper memories with more than 64 locations in a single slice?
• We can combine 2 LUT6 (64x1 RAM) to construct a LUT7 (128x1 RAM) using MUX7, as shown in the figure below.
• If we combine all 4 LUT6 we can construct a LUT8 (256x1 RAM). This is the deepest memory that can be implemented in a single slice. If we need a deeper memory we have to use multiple slices.
• We can create a dual-port 128x1 RAM by using two LUT7.

[Figure: 128x1 RAM; Dual Port 128x1 RAM; 256x1 RAM]
Multi-Bit RAMs
• So far all the memory configurations we saw are 1-bit wide. We will now learn how to construct wider (multi-bit) memories.
• To construct wider memories we use multiple memories in parallel. The address and write enable are passed to all RAMs, and each RAM has dedicated data-in and data-out ports to store one bit, as shown in the diagram.
• Since LUT6 is built internally from 2 LUT5, each LUT6 can be used as a 32x2 RAM (32 locations, 2 bits wide). We can use 2, 3, or 4 LUT6 to build 32-deep RAMs that are 4, 6, or 8 bits wide, respectively.
• Other configurations include 64x2, 64x3, and 128x2 RAMs, etc.

[Figure: 4-Bit Wide RAM; 32x2 RAM]
LUT As Shift Register Logic (SRL)
• The LUT can be configured as a shift register without using the FFs inside the slice.
• Each LUT6 can be configured to implement a chain of 32 shift registers [1].
• The output of LUT6 (MC31) can be passed to the LUT next to it to form a longer chain (64, 96, or 128).
• The address ports can be used to dynamically read out bits from the chain of registers.

[Figure: 32-Bit Shift Register; Reading Bits From the Shift Register; 96-Bit Shift Register]

[1] : LUT6 can't implement 64 shift registers because it consists internally of 2 LUT5 and there is no way to connect the output of one LUT5 to the next LUT5.

Static vs Dynamic Configuration
• You can implement shift registers whose length is not a multiple of 32.
• The diagram below shows an 8-bit shift register:
o The address is tied to "00111", which means the output Q always reads out bit 8 of the shift register.
o Since the address port is tied, we lose the ability to dynamically read bits from the shift register chain.
o Xilinx calls this a static shift register configuration.

[Figure: 8-Bit Static Shift Register; 72-Bit Static Shift Register]
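
A behavioral sketch of the LUT-based shift register may help here. The class below is a simplified Python model written for illustration, not the vendor primitive: the class and method names are hypothetical, and it only captures the shift on each clock, the addressable read-out, and the last-stage output used for cascading.

```python
class Srl32:
    """Simplified model of a 32-stage LUT shift register."""

    def __init__(self):
        # bits[0] holds the most recently shifted-in value.
        self.bits = [0] * 32

    def clock(self, d):
        """Shift the chain by one position and insert d at stage 0."""
        self.bits = [d] + self.bits[:-1]

    def read(self, addr):
        """Read the stage selected by the 5-bit address (0..31)."""
        return self.bits[addr]

    @property
    def mc31(self):
        """Last stage, which would feed the next LUT in a longer chain."""
        return self.bits[31]


# Static 8-bit configuration: address tied to 0b00111 (stage 8 of the chain).
srl = Srl32()
srl.clock(1)
for _ in range(7):
    srl.clock(0)
q = srl.read(0b00111)
```

With the address tied to a constant the model acts as a fixed 8-cycle delay; driving `addr` from logic instead gives the dynamic read-out.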
Carry Chain
• The slice contains primitive cells to implement efficient carry look-ahead adders.
• The carry look-ahead algorithm calculates the sum as follows:
o Propagate term: $P_i = A_i \oplus B_i$
o Generate term: $G_i = A_i \cdot B_i$
o Carry term: $C_{i+1} = G_i + P_i \cdot C_i$
o Sum term: $S_i = P_i \oplus C_i$
• The main issue lies with the carry term because each bit ($C_{i+1}$) depends on the previous one ($C_i$). This creates a long chain of logic.
o $C_1 = G_0 + P_0 \cdot C_0$
o $C_2 = G_1 + P_1 \cdot C_1 = G_1 + P_1 \cdot (G_0 + P_0 \cdot C_0)$
o $C_3 = G_2 + P_2 \cdot C_2 = G_2 + P_2 \cdot (G_1 + P_1 \cdot (G_0 + P_0 \cdot C_0))$
o $C_4 = G_3 + P_3 \cdot C_3 = G_3 + P_3 \cdot (G_2 + P_2 \cdot (G_1 + P_1 \cdot (G_0 + P_0 \cdot C_0)))$
• Xilinx 7 series aims to optimize the delay of this long chain using fast carry chain primitives.
• Each slice can implement up to a 4-bit adder. Larger adders will require multiple slices.

[Figure: carry chain with inputs $P_0$ to $P_3$ and $G_0$ to $G_3$, carries $C_1$ to $C_4$, and sums $S_0$ to $S_3$]
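
The propagate, generate, carry, and sum equations on this slide can be checked with a short Python sketch. The function name `cla_add4` is hypothetical, and the carries are evaluated one after another with the recurrence rather than as the flattened expressions, which gives the same values.

```python
def cla_add4(a, b, c0=0):
    """Add two 4-bit numbers using the propagate/generate equations.

    Returns (sum, carry_out), where sum is a 4-bit integer.
    """
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    p = [a_bits[i] ^ b_bits[i] for i in range(4)]   # P_i = A_i xor B_i
    g = [a_bits[i] & b_bits[i] for i in range(4)]   # G_i = A_i and B_i

    c = [c0]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))              # C_{i+1} = G_i + P_i * C_i

    s = sum((p[i] ^ c[i]) << i for i in range(4))   # S_i = P_i xor C_i
    return s, c[4]
```

For example, `cla_add4(9, 8)` gives a 4-bit sum of 1 with a carry out of 1, that is, 17.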
Registers
• There are eight [1] storage elements per slice. Four of them can be configured as either edge-triggered D-type flip-flops or level-sensitive latches.
• The four additional storage elements can only be configured as edge-triggered D-type flip-flops.
• Each storage element has the following pins:
o D: The input to the FF/latch. It comes from a LUT or from the bypass signal Ax, which comes from outside the slice and doesn't go through a LUT.
o Q: The output of the FF/latch.
o CE: Clock enable, which acts as the write enable signal.
o SR: Set/reset pin.
• Each storage element can be configured during synthesis as follows:
o Is it a latch or a FF?
o Does it contain 0 or 1 upon FPGA power-up? (INIT0/INIT1)
o Does the SR pin function as a set pin (1) or a reset pin (0)?
o Is the set/reset synchronous or asynchronous? This option affects all storage elements in the slice and can't be set individually.

[1] : There are 8 storage elements although there are just 4 LUTs because, if we enable LUT combining, each LUT can provide 2 separate outputs (O5 and O6), so under this configuration we can have 8 different outputs from the slice.
Digital Signal Processing
(DSP48E1)

DSP48e1
• The 7 series FPGAs contain many DSP slices, which are used to implement fast, low-power arithmetic blocks.
• The DSP slice consists of:
o 25-bit pre-adder: (A ± D)
o 25 x 18 multiplier: (A ± D) × B
o Control MUXs
o ALU: ((A ± D) × B) + C
o Pattern detector
o Optional pipeline registers: to improve timing and synchronize operands

[Figure: Simplified DSP48e1 Block Diagram]
Pre-adder
• The pre-adder adds A (the lower 25 bits) to D (25 bits) and then passes the result to the multiplier (A MULT), or bypasses A concatenated with B (A:B) to the ALU (X MUX), or passes A to another DSP slice (ACOUT).
• The pipeline registers inside the pre-adder can be static or dynamic (the figure labels these "Static Bypass Muxes" and "Dynamic Bypass Muxes"):
o Static means they are always used and can't be bypassed during runtime.
o Dynamic means they can be controlled during runtime by control signals to enable the pipelines or bypass them.
o The dynamic behavior is controlled with the INMODE[4:0] port. The port allows other dynamic operations besides bypassing the pipeline registers. The table below shows all the different configurations.

INMODE[3] INMODE[2] INMODE[1] INMODE[0] USE_DPORT Multiplier Input

0 0 0 0 FALSE A2
0 0 0 1 FALSE A1
0 0 1 0 FALSE Zero
0 0 1 1 FALSE Zero
0 1 0 0 TRUE A2
0 1 0 1 TRUE A1
0 1 1 0 TRUE Zero
0 1 1 1 TRUE Zero
1 0 0 0 TRUE D + A2
1 0 0 1 TRUE D + A1
1 0 1 0 TRUE D
1 0 1 1 TRUE D
1 1 0 0 TRUE -A2
1 1 0 1 TRUE -A1
1 1 1 0 TRUE Zero
1 1 1 1 TRUE Zero

Control MUXs
• The inputs to the adder/ALU pass through 3 MUXs. The selections of these MUXs can be controlled dynamically.
• MUX X inputs:
OPMODE[1:0]  MUX X Output
00           0
01           M (Output from the multiplier)
10           P (Final output from the same DSP slice to implement an accumulator)
11           A:B (Concatenated A and B)

• MUX Y inputs:
OPMODE[3:2]  MUX Y Output
00           0
01           M (Output from the multiplier) [1]
10           48'hFFFF_FFFF_FFFF
11           C (Input port)

• MUX Z inputs:
OPMODE[6:4]  MUX Z Output
000          0
001          PCIN (Final output from the adjacent DSP slice)
010          P (Final output from the same DSP slice to implement an accumulator)
011          C (Input port)
100          P
101          17-bit shift PCIN (To enable wider multiplier implementation) [2]
110          17-bit shift P (To enable wider multiplier implementation) [2]
111          Illegal selection

[Figure: Inputs to the ALU]

[1] : The multiplier output appears at both MUX X and MUX Y because the multiplier produces 2 partial products that should be summed together to get the final result of the multiplier. So each MUX gets one of the partial products.
[2] : Discussed in the next slide.
Wider Multipliers – Partial Products
• We can build wider multipliers using smaller multipliers, shifters, and adders.
• Consider the example on the right:
o We want to multiply two 2-digit numbers but only have a single-digit multiplier.
o We can decompose the 2-digit numbers into smaller single-digit numbers like so:
▪ 35 = 30 + 5
▪ 24 = 20 + 4
▪ 35 × 24 = (30 × 20) + (5 × 20) + (30 × 4) + (5 × 4)
▪ We will ignore the right-hand zeros, but we should shift the results by the number of zeros we ignored:
❑ 3 × 2 = 6, then shift by 2 = 600
❑ 5 × 2 = 10, then shift by 1 = 100
❑ 3 × 4 = 12, then shift by 1 = 120
❑ 5 × 4 = 20
▪ Now sum all the terms to get the result = 840
• We can do the same thing in binary (35 = 100011, 24 = 011000), as shown in the example on the right.
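
The decomposition above can be sketched in Python. The function name `multiply_by_parts` is hypothetical, and the `base` parameter is an illustrative generalization: base 10 reproduces the decimal example, while a power-of-two base corresponds to splitting binary operands into fields, where the "shift" is a multiplication by the base.

```python
def multiply_by_parts(x, y, base=10):
    """Multiply two 2-digit numbers using only single-digit products.

    Returns (partial_products, total).
    """
    x_hi, x_lo = divmod(x, base)
    y_hi, y_lo = divmod(y, base)
    partials = [
        (x_hi * y_hi) * base ** 2,  # shifted by two digits
        (x_lo * y_hi) * base,       # shifted by one digit
        (x_hi * y_lo) * base,       # shifted by one digit
        x_lo * y_lo,                # no shift
    ]
    return partials, sum(partials)


parts, total = multiply_by_parts(35, 24)
# Same numbers split into 3-bit fields: 35 = 0b100_011, 24 = 0b011_000.
_, total_binary = multiply_by_parts(0b100011, 0b011000, base=8)
```

Both calls return 840. In the decimal case the partial products are 600, 100, 120, and 20, matching the terms listed above.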
ALU
• The ALU unit can be configured dynamically using OPMODE[3:2] and ALUMODE[3:0] to
function as:
o Logic unit: to implement all logic functions (AND, NAND, XOR, etc.). When this mode is used,
the multiplier should not be used.
o Adder/Subtractor
• Single Instruction, Multiple Data (SIMD) Mode:
o The ALU can be split into smaller ALUs to do multiple calculations at once.
o The ALU can be configured as two 24-bit ALUs or four 12-bit ALUs. However, the ALUs
should do the same operation (Single Instruction)
o This configuration is static

Simplified DSP48e1 Block Diagram SIMD Config

DSP48e1 – Pattern Detector
• The DSP contains a specialized comparator to detect patterns in the output of the last stage.
• The pattern detector works as follows:
1. Select bits to consider (mask): a mask is applied on the output to select which bits we want to check and which we want to ignore. When a MASK bit is set to 1, the corresponding pattern bit is ignored. When it is 0, the pattern bit is compared.
2. Select comparing pattern: the pattern we want to compare our output to. It can be set statically as a constant or dynamically using the C port [1].
3. Select operation on detection: if the pattern is matched, we choose one of many operations, such as raising a flag upon pattern detection. Other operations include resetting the P register when a match occurs or resetting it if a match didn't occur.
• The example on the right shows a 16-bit wide fixed-point number. We want to check whether the result is both odd (least significant integer bit = 1) and an integer (all fraction bits = 0):
1. The mask: We only consider the least significant bit of the integer part plus all the fraction bits. Therefore we apply a mask as shown.
2. Comparing pattern: We want to check that the least significant integer bit = 1 and all the fraction bits = 0. Therefore we apply a pattern as shown.
3. Operation on detection: We want to raise a flag when this pattern is detected. By default, the port "PATDET" is set to "1" when a match occurs, so no more work is needed.

[1] : Dynamic mask can only be used when the DSP is configured as a multiplier.
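
The mask-and-compare step can be sketched in Python. This is an illustrative model only: the function name is hypothetical, and the split of the 16-bit number into 8 integer bits and 8 fraction bits is an assumption, because the slide's figure with the actual format is not reproduced here. As on the slide, a mask bit of 1 means "ignore this bit".

```python
def pattern_detect(p, pattern, mask, width=16):
    """Return 1 if p matches pattern on every bit whose mask bit is 0."""
    care = ~mask & ((1 << width) - 1)          # bits that are compared
    return int(((p ^ pattern) & care) == 0)


FRAC_BITS = 8  # assumed fixed-point format: 8 integer bits, 8 fraction bits

# Compare the fraction bits and the lowest integer bit; ignore the rest.
MASK = 0xFFFF & ~((1 << (FRAC_BITS + 1)) - 1)
# Lowest integer bit must be 1, all fraction bits must be 0.
PATTERN = 1 << FRAC_BITS

three = 3 << FRAC_BITS                          # 3.0, odd integer
two = 2 << FRAC_BITS                            # 2.0, even integer
three_and_half = three | (1 << (FRAC_BITS - 1)) # 3.5, not an integer
```

Under this assumed format, only the value 3.0 raises the flag; 2.0 fails the odd check and 3.5 fails the fraction check.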
Block RAM (BRAM)
• The block RAM in Xilinx 7 series FPGAs stores up to 36 Kbits of data and can be configured as either two independent 18 Kb RAMs or one 36 Kb RAM.
• The block RAM can be configured into different depths/widths.
• Since the depth can only be a power of 2, the possible configurations are [1]:

36 Kbit:
Depth  Width (Bits)
32K    1
16K    2
8K     4
4K     9
2K     18
1K     36
512    72 [2]

18 Kbit:
Depth  Width (Bits)
16K    1
8K     2
4K     4
2K     9
1K     18
512    36 [2]

• The RAM sizes (36 Kbits and 18 Kbits) are not powers of 2 because the RAMs can include parity bits for each byte to allow error detection and correction.
o 32 bits (4 bytes) + 4 parity bits = 36
o 16 bits (2 bytes) + 2 parity bits = 18
• Each block RAM can be configured as a dual-port RAM to allow WRITE/WRITE, READ/READ, or WRITE/READ operations.
o In the case of a collision, where both ports try to write to the same location, the memory location is written with non-deterministic data. It's up to the user to avoid or handle collisions.

[1] : The RAMs can be of any width or depth but won't be fully utilized.
[2] : This data width is only allowed with the SDP configuration. See the next slide for more info.
True Dual Port (TDP) vs Simple Dual Port (SDP)
• The RAM can be:
o True dual port: A total of 2 ports, each of which can read or write. This allows (READ/READ), (WRITE/WRITE), or (READ/WRITE).
o Simple dual port: A total of 2 ports, one of which can only read while the other can only write. This allows (READ/WRITE) only.
▪ This configuration avoids the (WRITE/WRITE) collision and leaves routing resources for other blocks (fewer nets and pins).
▪ This configuration allows the write and read ports to have different widths [1]. For example, the write port can be 18 bits wide and the read port can be 36 bits wide.

[Figure: Simple Dual Port; True Dual Port]

[1] : Either the read or the write port has a fixed width of x32 or x36 for RAM18 and x64 or x72 for RAM36.
Read and Write Operations
• The write operation is synchronous to the clock edge (can be configured to be positive or negative edge).
• By default the read operation is asynchronous to the clock. This mode is called “Latch mode”
o An optional register can be added to the output port to enable synchronous operation. This is done to separate the long combinational 𝑇𝑐𝑞 delay from the
memory to the output, thus allowing increasing the operating frequency.

Asynchronous Read
(Data changes with the address after a combinational delay)

Synchronous Read
(Data changes with the clock edge)
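
The two read behaviors, as the captions describe them, can be modeled with a small sketch. The class `ReadPort` and its methods are hypothetical; this is a behavioral illustration only and does not model the timing of any specific device.

```python
class ReadPort:
    """Toy model of a read port with an optional output register."""

    def __init__(self, contents, registered: bool):
        self.mem = list(contents)
        self.registered = registered
        self.addr = 0
        self.out_reg = None  # output register content, unknown until the first clock edge

    def set_addr(self, addr: int) -> None:
        self.addr = addr

    def clock_edge(self) -> None:
        # With the output register enabled, data is captured only on the clock edge.
        if self.registered:
            self.out_reg = self.mem[self.addr]

    @property
    def dout(self):
        # Without the register, the output follows the address directly.
        return self.out_reg if self.registered else self.mem[self.addr]


contents = [10, 20, 30]

async_port = ReadPort(contents, registered=False)
async_port.set_addr(2)
print(async_port.dout)  # follows the address without waiting for a clock edge

sync_port = ReadPort(contents, registered=True)
sync_port.set_addr(2)
print(sync_port.dout)   # still the old register content
sync_port.clock_edge()
print(sync_port.dout)   # updated at the clock edge
```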

Write Modes
• In true dual port mode, each port can perform a read or a write operation. This raises a question: what happens to the output read port (DO) when a write operation is performed? Does it change or remain unchanged?
• The BRAM can have one of 3 write modes:
o Write First: The data being written appears at the output read port.
o Read First: The old data stored at the address being overwritten appears at the output read port.
o No Change: The output data from the last read operation remains at the port.

WRITE_FIRST READ_FIRST NO_CHANGE
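
The three write modes can be compared with a minimal behavioral sketch. The function `write_port` and its arguments are hypothetical names used only for illustration; `dout_prev` stands for whatever the port was driving on DO before the write.

```python
def write_port(mem: list, addr: int, din, dout_prev, mode: str):
    """Perform one write and return the value seen on DO afterwards."""
    if mode not in ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE"):
        raise ValueError(f"unknown write mode: {mode}")
    old = mem[addr]   # value about to be overwritten
    mem[addr] = din   # the write itself happens in every mode
    if mode == "WRITE_FIRST":
        return din        # new data appears on DO
    if mode == "READ_FIRST":
        return old        # old data appears on DO
    return dout_prev      # NO_CHANGE: DO keeps the result of the last read


for mode in ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE"):
    mem = [0xAA, 0xBB]
    dout = write_port(mem, addr=1, din=0xCC, dout_prev=0x11, mode=mode)
    print(f"{mode}: DO = {dout:#04x}, mem[1] = {mem[1]:#04x}")
```

In all three modes the memory location ends up holding the new data; only the value presented on DO differs.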
