HSCD
Detailed Syllabus:
The Nature of Hardware and Software: Introducing Hardware/Software Co-design, The Quest
for Energy Efficiency, The Driving Factors in Hardware/Software Co-design, The Dualism of
Hardware Design and Software Design.
Data Flow Modeling and Transformation: Introducing Data Flow Graphs, Analyzing
Synchronous Data Flow Graphs, Control Flow Modeling and the Limitations of Data Flow,
Transformations.
Analysis of Control Flow and Data Flow: Data and Control Edges of a C Program, Implementing
Data and Control Edges, Construction of the Control Flow Graph and Data Flow Graph.
Finite State Machine with Datapath: Cycle-Based Bit-Parallel Hardware, Hardware Modules,
Finite State Machines with Datapath, FSMD Design Example: A Median Processor.
System on Chip: The System-on-Chip Concept, Four Design Principles in SoC Architecture, SoC
Modeling in GEZEL. Applications: Trivium Crypto-Coprocessor, CORDIC Co-Processor.
Learning Resources:
Text Books:
1. Patrick Schaumont, A Practical Introduction to Hardware/Software Co-design,
Springer, 2010.
2. Ralf Niemann, Hardware/Software Co-Design for Data flow Dominated
Embedded Systems, Springer, 1998.
Reference Books:
The Nature of Hardware and Software
• What is H/S Codesign (Prof. Schaumont’s definition):
Other definitions
• HW/SW Codesign is a design methodology supporting the concurrent
development of hardware and software (co-specification, co-development and co-
verification) in order to achieve shared functionality and performance goals for a
combined system
Hardware
We will model hardware by means of single-clock synchronous digital circuits
created using word-level combinational logic and flip-flops.
Hardware is realized by word-level combinational and sequential components,
such as registers, multiplexers (MUXs), adders, and multipliers.
(Figure: a word-level datapath example built from a flip-flop (d/q, clocked by clk), an adder, and a multiplexer, operating on 3- and 4-bit signals.)
Bear in mind that this is a very simplistic treatment of actual hardware. We ignore
advanced circuit styles, including asynchronous hardware, dynamic logic, multi-phase
clocked hardware, etc.
The cycle-based model is also limited because it does not capture glitches, race
conditions, or events that occur within a clock cycle.
Software
ARM assembly example (finding the maximum value in an array):
start:
    LDR  R0, =array         ; Load the address of the array into R0
    LDR  R1, =array_length  ; Load the length of the array into R1
    LDR  R2, [R0], #4       ; Load the first element of the array into R2
    SUBS R1, R1, #1         ; Decrement the element counter
loop:
    CMP  R1, #0             ; Check if we have processed all elements
    BEQ  done               ; If R1 is 0, we're done
    LDR  R3, [R0], #4       ; Load the next element of the array into R3
    CMP  R3, R2             ; Compare the next element with the current max (R2)
    BLE  continue_loop      ; If R3 <= R2, skip the update of max
    MOV  R2, R3             ; If R3 > R2, update the max value in R2
continue_loop:
    SUBS R1, R1, #1         ; Decrement the counter
    BNE  loop               ; If there are more elements, continue the loop
done:
    END                     ; R2 now holds the maximum value
(Listing: the 8051 C program for this design.)
(Listing: the GEZEL HDL program for this design.)
This very simple design can be addressed using hardware/software codesign.
The hardware model contains the 8051 processor, the coprocessor, and the connections
between them.
During execution, the 8051 processor executes a software program written in C.
The listings show the C program and the RTL hardware model for this design; the hardware model is written in the GEZEL language.
Each iteration, the program cycles the value on port P0 between the hello and idle
instructions, which are encoded as the values 1 and 0, respectively.
The hardware model includes both the microcontroller and the coprocessor.
This particular hardware model is a combination of a finite state machine (lines 10–18) and a
datapath (lines 1–8).
This FSMD is quite easy to understand.
Each clock cycle, the FSM controller selects which of the datapath instructions to execute.
When the value of insreg is 1 and the FSM controller's current state is s1, the
datapath executes the instructions hello and decode, and the FSM controller's next state is s2.
When the value of insreg is 0, the datapath executes only the instruction decode and
the FSM controller's next state is s1.
The overall coprocessor behavior is as follows: when the ins input changes from 0 to 1, the
value on the din input is printed in the next clock cycle.
The 8051 microcontroller is captured with three ipblocks (GEZEL library
modules), on lines 20–37.
The first ipblock is an i8051system.
It represents the 8051 microcontroller core, and it indicates the name
of the compiled C program that will execute on this core (driver.ihx on
line 22).
The other two ipblocks are 8051 output ports (i8051systemsource),
one modeling port P0 and the other modeling port P1.
Finally, the coprocessor and the 8051 ports are wired together in a
top-level module, shown in lines 39–49.
We can now simulate the entire model, including hardware and
software, as follows.
First, the 8051 C program is compiled to a binary executable.
Next, the GEZEL simulator will combine the hardware model and the
8051 binary executable in a co-simulation.
Defining Hardware/Software Codesign
A Digital-Signal Processor (DSP) is a processor with a specialized
instruction set, optimized for signal-processing applications.
Writing efficient programs for a DSP requires detailed knowledge of
these specialized instructions.
Very often, this means writing assembly code, or making use of a
specialized software library.
Hence, there is a strong connection between the efficiency of the
software and the capabilities of the hardware.
The Quest for Energy Efficiency
Choosing between implementing a design in hardware or
implementing it in software is very difficult.
Indeed, from a designer's point of view, the easiest approach is to
write software, for example in C.
Software is easy and flexible, software compilers are fast, there are
large amounts of source code available, and all you need to start
development is a personal computer.
Furthermore, why go through the effort of designing a hardware
architecture when there is already one available (namely, the RISC
processor)?
Relative Performance
The figure illustrates various cryptographic implementations in software
and hardware that have been proposed over the past few years.
These are all designs proposed for embedded applications, where the
trade-off between hardware and software is crucial.
As demonstrated by the graph, hardware crypto architectures have,
on average, a higher relative performance than embedded
processors.
However, relative performance may not be a sufficient argument to
motivate the use of a dedicated hardware implementation.
Consider for example a specialized Application-Specific Integrated
Circuit (ASIC) versus a high-end (workstation) processor.
The hardware inside of the ASIC can execute many operations in
parallel, but the processor runs at a much higher clock frequency.
Furthermore, modern processors are very effective in completing
multiple operations per clock cycle.
As a result, an optimized software program on top of a high-end
processor may outperform a quick-and-dirty hardware design job on
an ASIC.
Thus, the absolute performance of software may very well be higher
than the absolute performance of hardware.
In contrast to relative performance, the absolute performance needs
to take clock frequency into account.
Energy Efficiency
There is another metric which is independent from clock frequency, and which can
be applied to all architectures.
That metric is energy-efficiency: the amount of useful work done per unit of energy.
Take an example of a particular encryption application (AES) for
different target platforms.
The flexibility of these platforms varies from very high on the left to
very low on the right.
The platforms include: Java on top of a Java Virtual Machine on top of
an embedded processor;
C on top of an embedded processor;
optimized assembly code on top of a Pentium-III processor;
Verilog code on top of a Virtex-II FPGA; and
an ASIC implementation using 0.18-micron CMOS standard cells.
The Y-axis shows the number of gigabits that can be encrypted on each of
these platforms using a single Joule of energy.
This shows that battery-operated devices would greatly benefit from using less
flexible, dedicated hardware engines.
The Driving Factors in Hardware/Software Codesign
Specialized hardware architectures are usually also more efficient than software
from a relative performance perspective, i.e., in the amount of useful work done per clock
cycle.
Flexibility comes with a significant energy cost -- one which energy-optimized
applications cannot tolerate.
Therefore, you will never find a Pentium processor in a cell phone!
Power Densities:
Further increasing the clock speed of modern high-end processors as a performance
enhancer has run out of gas because of thermal limits.
This has driven a broad and fundamental shift toward increased parallelism within
processor architectures.
However, at this moment, no dominant parallel computer architecture
has been shown to cover all applications. Commercially available systems include:
Symmetric multiprocessors with shared memory
Traditional processors tightly coupled with FPGAs as accelerator engines
Multi-core and many-core architectures such as GPUs
Nor is there yet any universally adopted parallel programming language; code
must be crafted differently depending on the target parallel platform.
This forces programmers to be architecturally aware of the target platform.
Design Complexity:
Today, it is common to integrate multiple microprocessors together with all related
peripherals and hardware components on a single chip.
This approach is known as system-on-chip (SoC). Modern SoCs are extremely complex.
The conception of such a component is impossible without a detailed planning and design
phase.
Extensive simulations are required to test the design up front, before committing to a costly
implementation phase.
Since software bugs are easier to address than hardware bugs, there is a tendency to increase
the amount of software.
Design Cost:
New chips are very expensive to design. As a result, hardware designers make chips
programmable so that these chips can be reused over multiple products or product
generations.
The SoC is a good example of this trend.
However, ‘programmability’ can be found in many different forms other than embedded
processors: reconfigurable systems are based on the same idea of reuse-through-
reprogramming.
Deep-Submicron Effects:
Designing new hardware from-scratch in high-end silicon processes is
difficult due to second-order effects in the implementation.
For example, each new generation of silicon technology has an increased
variability and a decreased reliability.
Programmable, flexible technologies make the hardware design process
simpler, more straightforward, and easier to control.
In addition, programmable technologies can be created to take the effects
of variations into account.
Finding the correct balance, while weighing all these factors, is a complex
problem. Instead, we will focus on optimizing metrics related to design cost
and performance.
In particular, we will consider how adding hardware to a software
implementation increases performance, weighed against the increase in
design cost.
The Hardware–Software Codesign Space
The preceding discussion makes it apparent that there are a multitude of
alternatives available for mapping an application to an architecture.
For a given application, there are many different possible solutions.
The collection of all these implementations is called the hardware–software
codesign space.
The following figure gives a symbolic representation of this design space and
indicates the main design activities in this design space.
(Figure: example micrographs of target platforms: microprocessor, DSP, microcontroller, FPGA, and SoC.)
SoC Examples
Example System-on-Chip (SoC) with IP cores
(Figure: annotated photo of an example system with IP cores: a Motorola DragonBall processor (#MC68VZ328), Hynix 8 MB SDRAM (#HY57V641629), Fujitsu 4 MB flash (#MBM29D1323), a Philips USB interface (#PDIUBD12), a Motorola DSP and digital transceivers (#MC1376VF), an RF Micro power amplifier (#RF2173), an Analog Devices screen digitizer (#AD7873), a Maxim power-amplifier controller (#MAX4472), a TCXO and VCO (K001), an FPGA interface, a transreflective monochrome backlit display with drivers, and manual inputs.)
Codesign Examples
(Figure: an H.261 video codec codesign example: a camera and display connected over the MSQ and MCC buses to software processors (a microprocessor plus code) and dedicated hardware processors.)
Each of the above platforms presents a trade-off between flexibility and efficiency
Codesign involves the following three activities:
• Platform selection
• Application mapping
• Platform programming
Very often, a specification is just a piece of English text that leaves many
details of the application undefined.
Step 2: Application mapping
Examples include:
• RISC: Software is written in C while the hardware is a processor
• FPGAs: Software is written in a hardware description language (HDL)
FPGAs can be configured to implement a soft processor, in which case, software
also needs to be written in C
• DSP: A digital signal processor is programmed using a combination of C and
assembly, which is run on a specialized processor architecture
• ASIP: Programming an ASIP is a combination of C and an HDL description
• ASIC: The application is written in an HDL, which is then synthesized to a hardwired
netlist and implementation
Note: ASICs are typically non-programmable, i.e., the application and platform
are one and the same
However, many platforms are not just composed of simple components, but
rather require multiple pieces of software, possibly in different programming
languages.
For example, the platform may consist of a RISC processor and a specialized
hardware coprocessor
Here, the software consists of C (for the RISC) as well as dedicated
coprocessor instruction-sequences (for the coprocessor).
Another concept reflected in the wedge-figure is the domain-specific platform.
The first question is harder: seasoned designers choose based on their previous experience
with similar applications.
The second issue is also challenging, but can be addressed in a more systematic fashion
using a design methodology.
A design method is a systematic sequence of steps to convert a specification
into an implementation.
The Dualism of Hardware Design and Software Design
Designing requires the decomposition of a specification into low-level primitives such
as gates (HW) and instructions (SW).
Resource Cost:
Temporal vs. spatial decomposition
The dualism in decomposition methods leads to a similarly dual resource cost.
Decomposition in space, as used by a hardware designer, means that more
gates are required when a more complex design needs to be implemented.
Decomposition in time, as used by a software designer, implies that a more
complex design will take more instructions to complete.
Therefore, the resource cost for hardware is circuit area, while the resource cost for
software is execution time.
Design Constraints:
A hardware designer is constrained by the clock cycle period of a design.
A software designer, on the other hand, is limited by the capabilities of the
processor instruction set and the memory space available with the processor.
Thus, the design constraints for hardware are in terms of a time budget, while the
design constraints for software are fixed by the CPU.
So, a hardware designer invests circuit area to maintain control over execution
time, and a software designer invests execution time for an almost constant circuit
area.
Flexibility:
Software excels over hardware in the support of application flexibility.
Flexibility is the ease by which the application can be modified or
adapted after the target architecture for that application is
manufactured.
In software, flexibility is essentially free.
In hardware on the other hand, flexibility is not trivial.
Hardware flexibility requires that circuit elements can be easily reused
for different activities or functions in a design.
Parallelism:
A dual of flexibility can be found in the ease with which parallel
implementations can be created.
Parallelism is the most obvious approach to improving performance.
For hardware, parallelism comes for free as part of the design
paradigm.
For software, on the other hand, parallelism is a major challenge.
If only a single processor is available, software can only implement
concurrency, which requires the use of special programming
constructs such as threads.
When multiple processors are available, a truly parallel software
implementation can be made, but inter-processor communication and
synchronization become a challenge.
Modelling:
In software, modeling and implementation are very close.
Indeed, when a designer writes a C program, the compilation of that
program for the appropriate target processor will also result in the
implementation of the program.
In hardware, the model and the implementation of a design are
distinct concepts.
Initially, a hardware design is modeled using an HDL.
Such a hardware description can be simulated, but it is not an
implementation of the actual circuit.
Hardware designers use a hardware description language, and their
programs are models that are later transformed into an implementation.
Software designers use a software programming language, and their
programs are the implementation itself.
Reuse:
Finally, hardware and software are also quite different when it comes to
Intellectual Property Reuse or IP-reuse.
The idea of IP-reuse is that a component of a larger circuit or a program can
be packaged, and later reused in the context of a different design.
In software, IP-reuse has undergone dramatic changes in recent years due to
open source software and the proliferation of open platforms.
When designing a complex program these days, designers will start from a
set of standard libraries that are well-documented and available on a wide
range of platforms.
For hardware design, IP-reuse is still in its infancy.
Hardware Designers are only starting to define standard exchange
mechanisms.
IP-reuse of hardware has a long way to go compared to the state of reuse in
software.
Abstraction Levels
Discrete-event
Here, simulators abstract node behavior into discrete, possibly
irregularly spaced, time steps called events
Events represent the changes that occur to circuit nodes when the
inputs are changed in the test bench
The simulator is capable of modeling actual propagation delay of the
gates, similar to what would happen in a hardware instance of the
circuit
Discrete-event simulation is very popular for modeling hardware at
the lowest layer of abstraction in codesign
This level of abstraction is much less compute-intensive than
continuous time but accurate enough to capture details of circuit
behavior including glitches
Cycle-accurate
Single-clock synchronous hardware circuits have the important property that
all interesting things happen at regularly spaced intervals, namely at the clock
edge.
This abstraction is important enough to merit its own abstraction level, and it
is called cycle-accurate modeling.
A cycle-accurate model does not capture propagation delays or glitches.
All activities that fall ‘in between’ clock edges are concentrated at the clock
edge itself.
This level of abstraction is considered the golden reference in HW/SW
codesign
Instruction-accurate
RTL (cycle-accurate) models are useful but may be too slow for complex systems.
For example, your laptop has a processor that probably clocks above 1 GHz (one billion cycles per second).
Assuming that you could write a C function that expresses a single clock cycle of processing,
you would have to call that function one billion times to simulate just a single second of
operation.
Clearly, further abstraction can be useful to build leaner and faster models.
Instruction-accurate modeling expresses activities in steps of one microprocessor instruction
(not cycle count)
Each instruction lumps together several cycles of processing.
Instruction-accurate simulators are used to verify complex software systems, such as
complete operating systems.
Transaction-accurate
For very complex systems, even instruction-accurate models may be too slow
or require too much modeling effort
In transaction-accurate modeling, only the interactions (transactions) that
occur between components of a system are of interest
For example, suppose you want to model a system in which a user process is
performing hard disk operations, e.g., writing a file
The simulator simulates commands exchanged between the disk drive and
the user application
The sequence of instruction-level operations between two transactions can
number in the millions but the simulator instead simulates a single function
call
Transaction-accurate models are important in the exploratory phases of a
design, before effort is spent on developing detailed models
For this course, we are interested in instruction-accurate and cycle-accurate levels
Concurrency and Parallelism
Concurrency and parallelism are terms that often occur in the context of hardware-
software codesign.
They mean very different things.
Concurrency is the ability to execute simultaneous operations because these
operations are completely independent.
Parallelism is the ability to execute simultaneous operations because the operations
can run on different processors or circuit elements.
Thus, concurrency relates to an application model, while parallelism relates to the
implementation of that model.
Hardware is always parallel.
Software, on the other hand, can be sequential, concurrent, or parallel.
Sequential and concurrent software requires only a single processor;
parallel software requires multiple processors.
Software running on your laptop (e.g., Word, email, etc.) is concurrent.
Software running on a 65,536-processor IBM Blue Gene/L is parallel.
Concurrency and Parallelism Cont..
A key objective of HW/SW codesign is to allow designers to leverage the
benefits of true parallelism in cases where concurrency exists in the
application
There is a well-known computer architecture principle called Amdahl's law:
the maximum speedup of any application that contains q% sequential
code is 1 / (q/100).
For example, if your application spends 33% of its time running
sequentially, the maximum speedup is 3
This means that no matter how fast you make the parallel component
run, the maximum speedup you will ever be able to achieve is 3
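Written out more generally (the standard form of Amdahl's law; the symbols below are not from the slides): if a fraction q/100 of the work is strictly sequential and the remainder is spread over N processors, the speedup is
    S(N) = 1 / ( q/100 + (1 - q/100)/N )
and letting N grow without bound gives the maximum speedup S_max = 1/(q/100). With q = 33, S_max is approximately 3, matching the example above.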
Thus, you see that we don’t only need to have parallel platforms, we
also need a way to write parallel programs to run on those platforms.
Surprisingly, even algorithms that seem sequential at first can be executed (and
specified) in a parallel fashion.
Concurrency and Parallelism Cont..
The designers of the Connection Machine (CM), Hillis and Steele, showed that it is possible to express
algorithms in a concurrent fashion so that they map neatly onto a CM.
Consider the problem of summing an array of numbers.
To take the sum, distribute the array over the CM processors so that each
processor holds one number.
We can then take the sum over the entire array in log(n) steps (n being the
number of processors).
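A minimal C sketch of this log-step reduction (the array contents and size are illustrative; on the CM each element would sit on its own processor, whereas the inner loop here simply stands in for those processors acting at the same time):

#include <stdio.h>

#define N 8   /* number of values (and, conceptually, processors) */

int main(void) {
    int x[N] = {3, 1, 4, 1, 5, 9, 2, 6};

    /* Tree reduction: in each step, element i adds the element 'stride' away.
       After log2(N) steps, x[0] holds the sum of all N values.              */
    for (int stride = 1; stride < N; stride *= 2) {
        for (int i = 0; i + stride < N; i += 2 * stride) {
            x[i] += x[i + stride];   /* all of these additions are independent */
        }
    }
    printf("sum = %d\n", x[0]);      /* prints 31 */
    return 0;
}

Each outer step does all of its additions on independent pairs, which is what allows them to run in parallel on separate processors.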
Concurrency and Parallelism Cont..
Even though the parallel sum speeds up the computation significantly, there
remains a lot of wasted compute power.
On the other hand, if the application requires all partial sums, i.e., the sum of the
first two, three, four, etc. numbers, then the full power of the parallel machine is
used.
Data Flow Modeling and
Implementation
Introduction
• We will learn how to create data flow models, and how to implement those models in
hardware and software.
• Unlike C programs, data flow models are concurrent: they can express activities that
happen simultaneously.
• This property makes data flow well suited for a parallel hardware implementation as well
as a sequential software implementation.
• By nature, hardware is parallel and software is sequential.
• As a result, software models (C programs) are not very well suited to capture hardware
implementations, and vice versa, hardware models (RTL programs) are not a good
abstraction to describe software.
• However, designers frequently encounter situations in which they cannot predict whether the
best solution for a design problem is a hardware implementation or a software
implementation.
• Trying to do both is not an option; it requires the designers to work twice as hard.
Introduction
• Signal-processing domain experts are used to describing complex
systems, such as digital radios and radar processing units, using block
diagrams.
• A block diagram is a high‐level representation of the target system as
a collection of smaller functions.
• A block diagram does not specify if a component should be hardware
or software;
• We are specifically interested in digital signal processing systems.
• Such systems represent signals as streams of discrete samples rather
than continuous signal shapes.
Introduction
• The figure shows the block diagram for a simple digital signal processing system.
• It’s a pulse‐amplitude modulation (PAM) system, and it is used to transmit digital
information over bandwidth‐limited channels.
• A PAM signal is created from binary data in two steps.
• First, each word in the file needs to be mapped to PAM symbols, which are just pulses of
different heights.
Introduction
• An entire file of words will thus be converted to a stream of symbols or pulses.
• Next, the stream of pulses needs to be converted to a smooth shape using pulse‐shaping.
• Pulse‐shaping ensures that the bandwidth of the resulting PAM signal does not exceed
the PAM symbol rate.
• The output of the pulse‐shaping unit produces many samples for each input symbol pulse, but it
is still a stream of discrete samples.
• A convolution‐based function is used to ensure the curve goes through the symbols while
providing a memory effect, which allows symbols to influence other symbols
• The final module in the block diagram is the digital‐to‐analog module, which will convert the
stream of discrete samples into a continuous signal.
Introduction
Here’s a high-level simulation model for PAM-4
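The listing itself is elided from these notes; as a stand-in, here is a minimal C sketch of the two PAM-4 steps described above (symbol mapping, then pulse shaping). The filter taps, word size, and sample counts are illustrative assumptions, not the book's values:

#include <stdio.h>

#define OSR    4                      /* output samples per symbol           */
#define NTAPS  8                      /* length of the pulse-shaping filter  */
#define NSYM   8                      /* 2 bytes -> 8 two-bit symbols        */
#define NOUT   ((NSYM - 1) * OSR + NTAPS)

static const double taps[NTAPS] =     /* placeholder pulse shape; a real     */
    {0.1, 0.3, 0.7, 1.0, 1.0, 0.7, 0.3, 0.1};  /* system would use a raised cosine */

int main(void) {
    unsigned char data[2] = {0x1B, 0xE4};
    double sym[NSYM];
    double out[NOUT] = {0};

    /* Step 1: symbol mapping - each 2-bit field becomes one of -3, -1, +1, +3. */
    for (int s = 0; s < NSYM; s++) {
        int bits = (data[s / 4] >> (2 * (s % 4))) & 0x3;
        sym[s] = 2.0 * bits - 3.0;
    }

    /* Step 2: pulse shaping - upsample by OSR and convolve with the filter.
       Because NTAPS > OSR, neighbouring pulses overlap (the 'memory effect'). */
    for (int s = 0; s < NSYM; s++)
        for (int t = 0; t < NTAPS; t++)
            out[s * OSR + t] += sym[s] * taps[t];

    for (int n = 0; n < NOUT; n++)
        printf("%6.2f\n", out[n]);
    return 0;
}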
Introduction
• Although this is a good model for simulation, it is not a good model for an implementation: C implicitly assumes
sequential execution.
• If we observe the block diagram carefully, we can see that it does not require a
sequential execution of the symbol-mapping function and the pulse-shaping function.
• The block diagram only specifies the flow of data in the system, not the execution order of the
functions.
• The lines between blocks represent data dependencies, and therefore they force an ordering on
the sequence of operations.
• However, unlike in C, each block can execute simultaneously with the other blocks.
• Another example of the difference between C and block diagrams is shown by the ‘fanout’ in the
following figure:
• Here, Block2 and Block3 are clearly parallel, but C would execute them sequentially.
Data Flow Models
• Note that, in general, it is easier to create a sequential implementation from a parallel
model than it is to create a parallel implementation from a sequential model.
• This argues in favor of Data Flow
• The following is a Data Flow model of PAM‐4:
• The bubbles, called actors, represent the functions in the block diagram
• Actors are linked together using directional lines, called queues
• The numbers on the lines represent the relative rates of communications between
modules, e.g., Map converts a 32‐bit word into 16 2‐bit symbols.
• Note that each actor works independently, i.e., it checks its input queue for the proper
number of elements and executes immediately when satisfied
Data Flow Models vs. C Programs
• Data Flow is a concurrent model (this is a major driver of its popularity), which means it can
easily be mapped to hardware or software implementations.
• Data Flow models are distributed: there is no centralized controller; each actor operates
autonomously.
• Data Flow models are modular, allowing libraries of components to be constructed and utilized in
a plug‐and‐play fashion.
• Data Flow models can be analyzed,
• e.g., for deadlock conditions that can result in system lock‐up
• Deterministic, mathematical methods can be used to analyze Data Flow models, which is
generally not possible using C
• Data Flow models have been around since the early 1960s.
• The 1970s and 1980s were active periods of research and development of Data Flow programming
languages and even Data Flow architectures.
• NI's LabVIEW is a classic example of a Data Flow programming language.
Tokens, Actors and Queues
• Here, we define the elements that make up a Data Flow model, and discuss a special class of Data
Flow models called Synchronous Data Flow (SDF) graphs. SDFs allow for the application of formal
analysis techniques.
• A simple example:
• Therefore, this information, combined with a marking, makes it easy to decide whether an actor
can fire.
• Data Flow actors can also consume more than one token per firing. This is referred to as a multi‐
rate Data Flow graph.
Synchronous Data Flow Graphs
• Synchronous Data Flow (SDF) graphs refer to systems where the number of tokens consumed and
produced per actor firing is fixed and constant
• The term synchronous refers to the fixed consumption and production rate of tokens.
• Note that SDF cannot handle control‐flow constructs, such as if‐then‐else statements
in C, without adding special operators.
• Despite this significant limitation, SDFs are very powerful (and popular), and more importantly,
mathematical techniques can be used to verify certain properties
• The first of these properties is determinism
• The entire SDF is deterministic under the condition that all of its actors implement a deterministic
function
• Determinism guarantees that the same results will always be produced independent of the firing
order
Synchronous Data Flow Graphs
• Illustration of determinism:
Synchronous Data Flow Graphs
• As we start firing actors, tokens are transported through the graph.
• After the first firing, an interesting situation occurs: both the add actor as well as the
plus1 actor can fire.
• Going down at the left side, we assume that the plus1 actor fires first.
• Going down at the right side, we assume that the add actor fires first.
• However, regardless of this choice, the graph eventually converges to the situation
shown at the bottom.
• Why is this property so important?
• Assume for a moment that the add actor and the plus1 actor execute on two different
processors, a slow one and a fast one.
• This is the power of the determinism property of SDF: it does not matter which actor runs
on which processor; the results will always be the same.
• In other words, no matter what technology we use to implement the actors, the system
will work as specified as long as we implement the firing rules correctly.
Analyzing Synchronous Data Flow Graphs
• The second important property of SDF relates to an admissible schedule
• An admissible SDF is one that can run forever without deadlock (unbounded execution)
and without overflowing any of the communication queues (bounded buffers)
• Deadlock occurs when an SDF graph progresses to a marking that prevents any further firing
• Overflow occurs when tokens are produced faster than they are consumed
• There is a systematic method to determine whether an SDF graph is
admissible
• The method provides a closed-form solution, i.e., no simulation is required
Analyzing Synchronous Data Flow Graphs
Lee proposed a method called Periodic Admissible Schedules (PASS), defined as:
• A schedule is the order in which the actors must fire
• An admissible schedule is a firing order that is deadlock‐free with bounded
buffers
• A periodic admissible schedule is a schedule that supports unbounded execution,
i.e., is periodic in the sense that the same markings will recur
• We also consider a special case called a Periodic Admissible Sequential Schedule
(PASS), which supports a microprocessor implementation with one actor firing at a
time
• There are four steps to creating a PASS for an SDF graph:
I. Create the topology matrix G of the SDF graph
II. Verify the rank of the matrix to be one less than the number of nodes in the graph
III. Determine a firing vector
IV. Try firing each actor in a round robin fashion, until the firing count given by the firing
vector is reached
Analyzing Synchronous Data Flow Graphs
Consider the following example:
• Step 1: Create a topology matrix for this graph:
• The topology matrix has as many rows as there are edges (FIFO queues) and as many columns
as there are nodes
• The entry (i, j) will be positive if the node j produces tokens onto the edge i and negative if it
consumes tokens
Analyzing Synchronous Data Flow Graphs
Step 2: The condition for a PASS to exist is that the rank of G has to be one less than the
number of nodes in the graph
• The rank of the matrix is the number of independent equations in G
• For our graph, the rank is 2 ‐‐ verify by multiplying the first column by ‐2 and the
second column by ‐1, and adding them to produce the third column
• Given that there are three nodes in the graph and the rank of the matrix is 2, a PASS is
possible
• This step effectively verifies that tokens can NOT accumulate on any edge of the graph
• The actual number of tokens can be determined by choosing a firing vector and
carrying out a matrix multiplication
Analyzing Synchronous Data Flow Graphs
• For example, the tokens produced/consumed by firing A twice and B and C zero times
is given by:
• Note that since the rank is less than the number of nodes, there are an infinite number
of solutions to the matrix equation
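As a stand-in for the elided matrices (this graph is hypothetical, not the one in the figure), consider a two-actor graph in which A produces 2 tokens per firing onto a queue from which B consumes 3 per firing. The topology matrix is the single row
    G = [ 2  -3 ]
whose rank is 1 = (number of nodes - 1). The net token change for a firing vector q = (qA, qB) is G·q; the balance equation G·q = 0 then gives the smallest integer solution q_PASS = (3, 2), i.e., firing A three times and B twice returns every queue to its original token count.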
Analyzing Synchronous Data Flow Graphs
• Step 3: Determine a periodic firing vector (cont.)
• This is true because, intuitively, if the firing vector (a, b, c) is a PASS, then so
are the firing vectors (2a, 2b, 2c), (3a, 3b, 3c), etc.
• Our task is to find the simplest one ‐‐ for this example, it is:
• Note that the existence of a PASS firing vector does not guarantee
that a PASS will also exist
• Here, we reversed the (A,C) edge. We would find the same qPASS, but
the resulting graph is deadlocked ‐‐ all nodes are waiting for each
other
Analyzing Synchronous Data Flow Graphs
Step 4: Construct a valid PASS.
• Here, we fire each node up to the number of times specified in qPASS
• Each node that is able to fire, i.e., has an adequate number of tokens, will fire
• If we find that we can fire NO more nodes, and the firing count is less than the number in
qPASS, the resulting graph is deadlocked
• Trying this out on our graph, we fire A once, and then B and C
Analyzing Synchronous Data Flow Graphs
Step 4: Construct a valid PASS.
• Try this out on the deadlocked graph ‐‐ it aborts immediately on the first iteration
because no node is able to fire successfully
• Note that the determinate property allows any ordering to be tried freely, e.g., B, C
and then A
• In some graphs (not ours), this may lead to additional PASS solutions
SDF Graphs: PAM‐4 Example
• Consider the digital pulse‐amplitude modulation system (PAM‐4) discussed earlier
• The SDF for this system consists of 4 actors, and is a multi‐rate Data Flow system:
• In general, deriving the optimal schedule is a difficult problem for complex systems
Euclid’s Algorithm as an SDF Graph
• The graph evaluates the greatest common divisor of
two numbers a and b.
• The sort actor reads two numbers, sorts them, and
copies them to the output.
• The diff actor subtracts the smallest number from the
largest one, as long as they are different.
• For example, assume (a0, b0) = (16, 12);
• then we see the following sequence of token values:
(a1, b1) = (4, 12), (a2, b2) = (8, 4), (a3, b3) = (4, 4), after which the values remain at the GCD, 4.
sort: out1 = (a > b) ? a : b
      out2 = (a > b) ? b : a
diff: out1 = (a != b) ? a - b : a
      out2 = b
Limitations of Data Flow Models
• SDF systems are distributed, data‐driven systems.
• They execute whenever there is data to process, and remain idle when there is nothing to do.
• However, SDF has trouble modeling control‐related aspects.
• Control appears in many different forms in system design, for example:
Stopping and Restarting. As we saw in the Euclid example, an SDF model never terminates; it just
keeps running.
• Stopping and restarting is a control‐flow property that cannot be addressed well with SDF graphs.
Mode‐Switching. When a cell‐phone switches from one standard to the other, the processing
(which may be modeled as an SDF graph) needs to be reconfigured.
• However, the topology of an SDF graph is fixed and cannot be modified at runtime.
Exceptions. When catastrophic events happen, processing may suddenly need to be altered.
• SDF cannot model exceptions that affect the entire graph topology.
• For example, once a token enters a queue, the only way of removing it is to read the token out of
the queue.
• It is not possible to suddenly ‘empty’ the queue on a global, exceptional condition.
Limitations of Data Flow Models
• Run‐Time Conditions. A simple if‐then‐else statement (choice between two activities
depending on an external condition) is troublesome for SDF.
• An SDF node cannot simply ‘disappear’ or become inactive – it is always there.
• Moreover, we cannot generate conditional tokens, as this would violate SDF rules which
require fixed production/consumption rates.
• Thus, SDF cannot model conditional execution such as required for if‐then‐else
statements.
• There are two solutions to the problem of control-flow modeling in SDF.
• The first one is to try using SDF anyhow,
• emulating control flow at the cost of some modeling overhead.
Limitations of Data Flow Models
• There are two solutions to the problem of control flow modeling in SDFs
• Solution 1:
• emulate control flow on top of the SDF semantics. Consider the
• statement: if (c) then A else B
• The input sample rate is set by the time interval between two adjacent input samples of a data stream
• For example, a digital sound system generates 44,100 samples per second
• The input sample rate defines a design constraint for the real‐time performance of the
Data Flow system. Similar constraints usually exist for the output sample rate
Definitions
• We use two common metrics as measures of performance:
• Throughput: the number of samples processed per second (Note that input and output throughput may be
different)
• Latency: The time required to process a single token from input to output
DFG Performance Modelling and Transformations
The Data Flow Resource Model:
• The time stamp sequences on the left and right indicate when input
samples are read and when output samples are produced
• The time stamps for DFG (a) and (b) are different because of the position of
the delay element in the loop
• (a) requires the sum of execution times of A and B before producing a result
• (b) can produce a result at time stamp 3 because the delay element allows it to
execute immediately at system start time (we refer to this as transient behavior)
• In this case, the delay elements affect only the latency of the first sample
DFG Performance Modelling and Transformations
• In contrast, (c) shows that delay elements can be positioned to enable parallelism, and
affect both latency and throughput
• Both actors can execute in parallel in (c), resulting in better performance than (a) and (b)
• The throughput of (a) and (b) is 1 sample per 8 time units, while (c) is 1 sample per 5
time units
• Similar to a pipelined system, the throughput in (c) is ultimately limited to the speed of
the slowest actor (A in this case ‐‐ B is forced to wait)
DFG Performance Modelling and Transformations
Limits on Throughput
• As indicated, the distribution of the delay elements in the loops impacts performance
• As an aid in analyzing performance, let’s define
• Loop bound as the round‐trip delay of a loop, divided by the number of delays in the loop
• Iteration bound as the largest loop bound in any loop of a DFG
• Iteration bound defines an upper limit on the best throughput of a DFG
• The loop bounds in this example are given as LB(B,C) = 7 and LB(A,B,C) = 4
• The iteration bound is 7 ‐‐ therefore, we need at least 7 time units per iteration
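Written as formulas (restating the definitions above): for a loop L with total round-trip execution time T_L and D_L delay elements,
    loop bound LB(L) = T_L / D_L,    iteration bound = max over all loops L of LB(L).
With the numbers on this slide, max(7, 4) = 7, so at least 7 time units are needed per iteration.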
DFG Performance Modelling and Transformations
Limits on Throughput
• The DFG above (from an earlier slide) has an iteration bound (5 + 3)/2 = 4
time units, but the throughput is limited to the slowest actor at 1 sample
per 5 time units
• A nice way to think about actors and delays is to consider an actor as a
combinational circuit and a delay as a buffer or pipeline stage.
Transformations
Performance‐Enhancing Transformations
• Based on previous discussions, intuitively, it should be possible to ‘tune’ the DFG to
enhance performance, while maintaining the same functionality
• Enhancing performance either reduces latency or increases throughput, or both. The
following transformations will be considered:
• Multi‐rate Expansion: A transformation that converts a multi‐rate synchronous DFG into
a single‐rate synchronous DFG.
• Retiming: A transformation that redistributes the delay elements in the DFG. Retiming
changes the throughput but does not change the latency or the transient behavior of the
DFG.
• Pipelining: A transformation that introduces new delay elements into the DFG. Pipelining
changes both the throughput and the transient behavior of the DFG.
• Unfolding: A transformation designed to increase parallelism by duplicating actors.
Unfolding changes the throughput but not the transient behavior of the DFG.
Transformations
Multi‐rate Transformation
• The following DFG shows actor A produces three tokens per firing, and actor B consumes
two tokens per firing
• Here, the actors are duplicated according to their firing rates, and all multi‐rate I/O are
converted to single‐rate I/O
Transformations
Retiming Transformation
• Retiming redistributes delay elements in the DFG as a mechanism to increase
throughput
• Retiming does not introduce new delay elements
• Evaluation involves inspecting successive markings of the DFG and then selecting the one
with the best performance
• Retiming of the pipelined graph yields (c) after firing A twice and B once,
which improves both throughput to 10 and latency to 20
• Again we see the slowest pipeline stage determines the best achievable
throughput.
Pipelining Transformation: out[n] = c2·x[n−2] + c1·x[n−1] + c0·x[n]
Transformations
Unfolding Transformation
• Unfolding is very similar to the transformation carried out for multi‐rate expansion
• Here, actor A in the original DFG is replicated as needed, and interconnections and delay elements are
redistributed
• Note the original graph is single‐rate and goal is to increase sample consumption rate
• The text describes the sequence of steps that need to be applied to carry out unfolding
• (a) is unfolded two times in (b),
showing that the number of inputs
and outputs are doubled, allowing
twice as much data to be processed
per iteration
• Unfolding appears to slow the graph down, increasing
the size of the loop to include A0, B0,
A1, and B1 while still including only the single delay
element.
• Hence, the iteration bound of a v‐unfolded graph
increases v times.
Data Flow Implementation in
Software and Hardware
Software Implementation of Data Flow
Converting Queues and Actors into Software
There is a wide variety of approaches for mapping DFGs to software
• Before discussing these, let’s first look at C implementations of actors and queues
FIFO Queues:
• Although DFGs theoretically have infinite-length queues, in practice, queues are limited in size
• We discussed earlier that constructing a PASS allows the maximum queue size to be determined by
analyzing actor firing sequences
• A typical software interface of a FIFO queue has two parameters and three methods:
• the number of elements N that can be stored in the queue and the data type of the elements;
• methods that put elements into the queue, get elements from the queue, and query the current size of
the queue.
Software Implementation of Data Flow
DFG Elements
• Queues are well defined (standardized) data structures
• A circular queue consists of an array, a write‐pointer and a read‐pointer
• They use modulo addressing, e.g., the Ith element is at position (Rptr + I) mod Q.getsize()
Note that the queue size is fixed here at
compile time.
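A minimal C sketch of such a circular FIFO, using the put_fifo / get_fifo / fifo_size names that appear in the listings later in these notes (the struct layout and capacity are assumptions):

#include <assert.h>

#define Q 16                          /* queue capacity, fixed at compile time */

typedef struct {
    int data[Q];
    unsigned wptr;                    /* write pointer */
    unsigned rptr;                    /* read pointer  */
} fifo_t;

static void fifo_init(fifo_t *f)           { f->wptr = f->rptr = 0; }

static unsigned fifo_size(const fifo_t *f) {
    return (f->wptr - f->rptr + Q) % Q;    /* number of tokens in the queue */
}

static void put_fifo(fifo_t *f, int v) {
    assert(fifo_size(f) < Q - 1);          /* one slot is kept free so 'full' */
    f->data[f->wptr] = v;                  /* can be told apart from 'empty'  */
    f->wptr = (f->wptr + 1) % Q;
}

static int get_fifo(fifo_t *f) {
    assert(fifo_size(f) > 0);
    int v = f->data[f->rptr];
    f->rptr = (f->rptr + 1) % Q;
    return v;
}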
Finally, the actor_io and queue objects can be instantiated in the main program, and the actor
functions can be called using a system scheduler.
Sequential Targets with Dynamic Schedule
• A software implementation of SDF is obtained by combining several different actor
descriptions, by interconnecting those actors using FIFO queues, and by executing the
actors through a system schedule.
• In a dynamic system schedule, the firing rules of the actors will be tested at runtime; the
system scheduling code consists of the firing rules, as well as the order in which the firing
rules are tested.
Mapping DFGs to Single Processors: Dynamic Schedule
Single‐Thread Dynamic Schedules
• A single-thread dynamic schedule is implemented as a function that instantiates all actors and
queues, and then calls the actors in a round-robin fashion.
• First, note that it is impossible to call the actors in the ‘wrong’ order, because each of
them still has a firing rule that protects it from running when there is no data
available.
• The schedule on the right shows that snk in (a) is called as often as src. However, snk will
only fire on even-numbered invocations.
• Even though snk is called as often as src, the firing rule of snk will only allow
that actor to run when there is sufficient data available.
• (b) shows a problem that a plain round-robin schedule does not handle: in
this case it will eventually lead to queue overflow.
Mapping DFGs to Single Processors: Dynamic Schedule
• The underlying problem with (b) is that the implemented firing rate differs from the
firing rate for a PASS, which is given as (src, snk, snk)
• There are two solutions to this issue:
• Solution 1: We could adjust the system schedule to reflect the firing rate predicted by
the PASS.
• This solution is not very elegant, because it destroys the idea of having a dynamic
scheduler.
• If we have to obtain the PASS firing rate first, we may as well forget about using a
dynamic schedule.
Mapping DFGs to Single Processors: Dynamic Schedule
• Solution 2: We could adjust the code for the snk actor to continue execution as long as
there are tokens present. Thus, the code for the snk actor becomes (see the sketch below):
• This is a better solution than the previous one, because it keeps the advantages of a
dynamic system schedule.
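A hedged C sketch of this idea, reusing the fifo/actor-I/O naming of the listings that follow in these notes; src() and process() are placeholders for the actual actor bodies:

/* snk consumes two tokens per firing; the while loop keeps firing as long as
   tokens are present, so a plain round-robin call order cannot overflow the
   queue. actorio_t, fifo_size and get_fifo are as used in the later listings. */
void snk(actorio_t *g) {
    while (fifo_size(g->in[0]) >= 2) {
        int a = get_fifo(g->in[0]);
        int b = get_fifo(g->in[0]);
        process(a, b);                 /* placeholder for the actor's work */
    }
}

/* Single-thread dynamic schedule: instantiate queues and actor I/O elsewhere,
   then call each actor in round-robin order; the firing rules do the rest.   */
void scheduler(actorio_t *src_io, actorio_t *snk_io) {
    while (1) {
        src(src_io);
        snk(snk_io);
    }
}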
Mapping DFGs to Single Processors: Example Dynamic Schedule
• Let’s implement the 4‐point Fast Fourier Transform (FFT) shown above using a dynamic
schedule
• The array t stores 4 (time domain) samples
• The array f will be used to store the frequency domain representation of t
• The implementation uses butterfly operations, as defined on the right side of
the figure
• The twiddle factor W(k, N) is a complex number defined as e^(−j2πk/N), with W(0, 4) = 1 and
W(1, 4) = −j
Mapping DFGs to Single Processors: Example Dynamic Schedule
• reorder: Reads 4 tokens and shuffles them to match the flow diagram
• The t[0] and t[2] are processed by the top butterfly and t[1] and t[3] are processed by
the bottom butterfly
• fft2: Calculates the butterflies for the left half of the flow diagram
• fft4mag calculates the butterflies for the right half and produces the magnitude
component of the frequency domain representation
Mapping DFGs to Single Processors: Example Dynamic Schedule
The implementation first requires a valid schedule to be computed
• The firing rates are easily determined to be [q_reorder, q_fft2, q_fft4mag] = [1, 2, 1]
void reorder(actorio_t *g)
{
    int v0, v1, v2, v3;
    while ( fifo_size(g->in[0]) >= 4 )
    {
        v0 = get_fifo(g->in[0]);
        v1 = get_fifo(g->in[0]);
        v2 = get_fifo(g->in[0]);
        v3 = get_fifo(g->in[0]);
        put_fifo(g->out[0], v0);
        put_fifo(g->out[0], v2);
        put_fifo(g->out[0], v1);
        put_fifo(g->out[0], v3);
    }
}
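For comparison, a sketch of the fft2 actor in the same style (treating the samples as integers and using the twiddle factor W(0,2) = 1, so each firing turns a pair (a, b) into (a + b, a − b); the exact token layout is an assumption, not the book's listing):

/* fft2 fires whenever two reordered samples are available and emits the two
   butterfly results a+b and a-b (twiddle factor 1 for this stage).          */
void fft2(actorio_t *g)
{
    while ( fifo_size(g->in[0]) >= 2 )
    {
        int a = get_fifo(g->in[0]);
        int b = get_fifo(g->in[0]);
        put_fifo(g->out[0], a + b);
        put_fifo(g->out[0], a - b);
    }
}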
Mapping DFGs to Single Processors: Example Dynamic Schedule
The deterministic property of SDFs and the while loops inside the actors allow the
call order shown above to be re-arranged while preserving the functional behavior
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• In multi‐threaded programming, each actor (implemented as a function) lives in a
separate thread
• The threads are time‐interleaved by a scheduler in single processor environments
• Systems in which threads voluntarily relinquish control back to the scheduler are referred
to as cooperative multithreading
Such a system can be implemented using two functions, create() and yield(), as shown in the above figure
The scheduler can apply different strategies to schedule threads; the simplest one, shown above, is a round-
robin schedule
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• Quickthreads is a cooperative multithreading library
• The quickthreads (qt) API (Application Programmer's Interface) consists of 4 functions
• stp_init(): initializes the threading system
• stp_create(stp_userf_t *F, void *G): creates a thread that will start execution with user function F, and
will be passed a single argument G
• stp_yield(): releases control over the thread to the scheduler
• stp_abort(): terminates a thread (prevents it from being scheduled)
• Here’s an example
#include "../qt/stp.h"
#include <stdio.h>
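The rest of the example is elided; a hedged continuation sketch follows. Only stp_init, stp_create, stp_yield, and stp_abort are documented above; stp_run() here is a hypothetical call standing in for whatever the library uses to hand control to the scheduler:

/* Two cooperating threads that print and then yield back to the scheduler. */
void ping(void *arg) {
    (void)arg;
    for (int i = 0; i < 3; i++) {
        printf("ping %d\n", i);
        stp_yield();               /* give the other thread a turn */
    }
    stp_abort();                   /* done: remove this thread     */
}

void pong(void *arg) {
    (void)arg;
    for (int i = 0; i < 3; i++) {
        printf("pong %d\n", i);
        stp_yield();
    }
    stp_abort();
}

int main(void) {
    stp_init();                    /* initialize the threading system        */
    stp_create(ping, 0);           /* create the two cooperating threads     */
    stp_create(pong, 0);
    stp_run();                     /* hypothetical: start the thread scheduler */
    return 0;
}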
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• Here is the example of a sort actor, implemented using the cooperative threading model.
• The system scheduler now will call threads rather than actors:
Mapping DFGs to Single Processors: Static Schedule
• From the PASS analysis of an SDF graph, we know at least one
solution for a feasible sequential schedule
• This solution can be used to optimize the implementation in several
ways
• We can remove the firing rules since we know the exact sequential schedule
• This yields only a small performance benefit
• We can also determine an optimal interleaving of the actors to minimize the
storage requirements for the queues
• Finally, we can create a fully inlined version of the SDF graph which eliminates
the queues altogether
Mapping DFGs to Single Processors: Static Schedule
• From the above SDF topology, we know that the relative firing rates of A, B, and C must be 4, 2, and 1 to
yield a PASS.
• The right side of the code shows an example implementation of this PASS.
• The A, B, C actors are called in accordance with their PASS rate.
• Due to this particular interleaving, it is easy to see that in a steady state condition, the queue AB will carry
four tokens maximum, while the queue BC will contain two tokens maximum.
• This is not the optimal interleaving.
• By calling the actors in the sequence (A,A,B,A,A,B,C), the maximum amount of tokens on any queue is
reduced to two.
Mapping DFGs to Single Processors: Static Schedule
• Implementing a truly static schedule means that we will no longer
test firing rules when calling actors.
• In fact, when we call an actor, we will have to guarantee that the
required input tokens are available.
• In a system with a static schedule, all SDF-related operations get a
fixed execution order: the actor firings and the sequences of put and
get operations on the FIFO queues.
• This provides the opportunity to optimize the resulting SDF system.
• We will discuss optimization of single‐thread SDF systems with a
static schedule using an example we discussed before – the GCD.
• From our earlier analysis, we know that a valid PASS fires each node
a single time. Listing 2.4 is a description for each of the actors (sort,
diff).
sort: out1 = (a > b) ? a : b
      out2 = (a > b) ? b : a
diff: out1 = (a != b) ? a - b : a
      out2 = b
Mapping DFGs to Single Processors: Static Schedule
// initial tokens
put_fifo(&F1, 16);
put_fifo(&F2, 12);
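A sketch of what the fully inlined static schedule (mentioned earlier) looks like for this example; the loop count is illustrative, and the queues collapse into plain variables:

#include <stdio.h>

/* Fully inlined static schedule for Euclid's GCD: the PASS fires sort once and
   diff once per iteration, so the two actor bodies are placed back-to-back.   */
int main(void) {
    int a = 16, b = 12;                 /* the two initial tokens          */

    for (int i = 0; i < 16; i++) {      /* fixed iteration count (a sketch) */
        /* sort actor: out1 = larger value, out2 = smaller value            */
        int big   = (a > b) ? a : b;
        int small = (a > b) ? b : a;
        /* diff actor: out1 = (a != b) ? a - b : a ; out2 = b               */
        a = (big != small) ? big - small : big;
        b = small;
    }
    printf("gcd(16,12) = %d\n", a);     /* prints 4 */
    return 0;
}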
Hardware Implementation of Data Flow
• SDF graphs are particularly well‐suited for mapping directly into hardware
• For SDFs with single‐rate schedules, all actors can execute in parallel using
the same clock
• The rules for mapping single‐rate SDFs to hardware are straightforward
• Each actor maps to a single combinational hardware circuit
• Each FIFO queue maps to wires (without storage)
• Each initial token in a queue is replaced with a register
• This means that two actors with no token on the queue between them will
effectively be implemented as two back-to-back combinational circuits
• Timing requirements regarding the delay of the combinational logic must
be met,
• i.e., all computation within the combinational logic must complete in 1 clock cycle
• Therefore, in practice, the length of the actor sequence will be limited
Hardware Implementation of Data Flow
Hardware implementation of Euclid’s algorithm
sort: out1 = (a > b) ? a : b
      out2 = (a > b) ? b : a
diff: out1 = (a != b) ? a - b : a
      out2 = b
The sort and diff actors are implemented using a comparator, a subtractor, and a few multiplexers
Note that the delay through the circuits of both actors defines the maximum achievable clock frequency
If the critical path delays are 5 and 15 ns, then the maximum operating frequency is 1 / (5 ns + 15 ns) = 50 MHz
Hardware Implementation of Data Flow
• Does this circuit work?
• Yes, it does. The circuit evaluates a single iteration through a PASS per clock cycle.
• Is this translation procedure general so that it would work for any SDF graph?
• No, it is not. The translation procedure is restricted to the following SDF graphs.
• We need a single‐rate SDF graph, which has a PASS firing vector with all ones in
it.
• All actors need to be implemented using combinational logic.
• In addition, the above method may result in circuits with a very long
combinational path.
• In the circuit above for example, the maximal operating clock frequency is
determined by the combined delay of the sort circuit and the diff circuit.
Hardware Implementation of Data Flow
Pipelining
• Pipelining of SDF graphs helps to break long combinational paths that may exist in circuits.
• Consider the example shown below.
• This is a data flow specification of a digital filter. It evaluates a weighted sum of samples of an
input stream, with the sum defined as out[n] = c0·x[n] + c1·x[n−1] + c2·x[n−2].
The critical path of this graph is associated with one of the multiplier actors, c0 or c1, in series with
two addition actors
Hardware Implementation of Data Flow
• Pipelining allows additional tokens to be added (at the expense of increased latency). Here, one is
introduced by the in actor
• Retiming is then used to redistribute the tokens, as shown by the above marking, which is
produced after firing c0, c1, and c2
• Retiming creates additional registers and an additional pipeline stage
• The critical path is now reduced to two back‐to‐back adder actors
Hardware Implementation of Data Flow
• Note that it is not possible to introduce arbitrary initial tokens in a graph without
following the actor’s firing rules
• Doing so would likely change the behavior of the system
Analysis of Control Flow and Data Flow
Data and Control Edges of a C Program
• Fundamental to the DFG model is the decomposition of a system into individual nodes
(actors), which communicate through unidirectional, point‐to‐point channels (queues).
• The resulting system model is represented as a graph.
• Such a data flow system is able to express concurrent computations that map easily into
hardware as well as into software.
• So, the data flow model of computation illustrates how we can build system models that
are equally well suited for hardware implementation as well as for software
implementation.
• Our objective in this chapter is to think of a C program in a similar target independent
fashion.
• For a software designer, a C program is always software.
• For a hardware–software codesigner however, a C program may be hardware or
software, depending on the requirements and needs of the application.
Data and Control Edges of a C Program
• Obviously, one cannot make a direct conversion of C into hardware; a major roadblock
is that hardware is parallel by nature, while C is sequential.
• Instead, we decompose the C program into target‐independent building blocks. These building
blocks are the operations of the C program and the relations between them.
Dashed components, entry, exit and body, are other CFGs of the C
program which have single-entry and single-exit points
The do‐while loop and the while‐do loop are similar iterative structures
Data and Control Edges of a C Program
Construction of the Control Flow Graph
Consider the CFG for the GCD algorithm
• A control path is defined as a sequence of control edges that traverse the CFG
• For example, each non-terminating iteration of the while loop will follow the path 2->3->4->2 or else
2->3->5->2
• Note that the CFG and the DFG will contain the same set of nodes ‐‐ only the edges will
be different
• While it is possible to derive the DFG directly from a C program, it is easier to create the
CFG first and use it to derive the DFG
• The method involves tracing control paths in the CFG while simultaneously identifying
corresponding read and write operations of the variables
• Our analysis focuses on C programs that do NOT have arrays or pointers
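For reference, a plausible form of the GCD program discussed here is sketched below; the statement numbering in the comments is chosen to match the CFG nodes used in this chapter and is therefore an assumption.

int gcd(int a, int b) {       /* 1: a and b receive their input values   */
    while (a != b) {          /* 2: loop condition (a != b)              */
        if (a > b)            /* 3: if condition (a > b)                 */
            a = a - b;        /* 4: writes a; reads a and b              */
        else
            b = b - a;        /* 5: writes b; reads a and b              */
    }
    return a;                 /* result leaves through out_a             */
}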
Data and Control Edges of a C Program
Construction of the Data Flow Graph
• The procedure to recover the data edges related to assignment statements is as follows:
1. In the CFG, select a node where a variable is used as an operand in an expression.
• Mark that node as a read‐node.
2. Find the CFG nodes that assign that variable.
• Mark those nodes as write‐nodes.
3. If there exists a direct control path from a write‐node into a read‐node that does not pass
through another write‐node, then you have identified a data edge.
• The data edge originates at the write‐node and ends at the read‐node.
4. Repeat the previous steps for all variable‐read nodes.
• This procedure identifies all data edges related to assignment statements, but not those
originating from conditional expressions in control flow statements.
Data and Control Edges of a C Program
Construction of the Data Flow Graph
• Let’s derive the DFG of the GCD program
• We first pick a node where a variable is read
Consider statement 5:
• There are two variable-reads in this statement, one for a and one for b. Consider b first
• Find all nodes that reference b by tracing backwards through the predecessors of node 5 in the CFG -- this
produces the ordered sequence 3, 2, 1, 4, and 5
• Both nodes 1 and 5 write b, and there is a direct path from 1 to 5 (e.g. 1, 2, 3, 5) and from 5 to 5 (e.g. 5, 2, 3, 5)
• Therefore, we need to add data edges for b from 1 to 5 and from 5 to 5
Data and Control Edges of a C Program
Construction of the Data Flow Graph
Incoming data edges for node 5
• A similar process can be carried out for the variable‐read of a in node 5 of the GCD program
• Nodes 1 and 4 write into a and there is a direct control path
from 1 to 5 and from 4 to 5
• Hence, data edges are added for a from 1 to 5 and from 4 to
5
• To complete the set of data edges into node 5, we also need
to identify all conditional expressions that affect the
outcome of node 5
• From the CFG, node 5 depends on the condition evaluated
in node 3 (a > b) AND the condition evaluated in node 2 (a
!= b)
Data and Control Edges of a C Program
Construction of the Data Flow Graph
The final DFG for all nodes and all variable-reads for GCD is shown below
Note: this DFG leaves out the data edges originating from conditional expressions
Translating C to Hardware
• As mentioned, control and data flow analysis can be helpful in translating C into
hardware
• Translating data structures and pointers from C to hardware can get tricky
• For the purpose of this course, we restrict our analysis as follows
• Only scalar C code is used (no pointers, arrays or other data structures)
• We assume each C statement executes in a single clock cycle
• We first create the CFG and DFG for the C program
• The control edges translate to signals that control datapath operations
• The data edges define the interconnection of the datapath components
Translating C to Hardware
Data Path Design:
With the CFG and DFG available, the following rules will define the implementation of the hardware
datapath.
1. Find the C expression embedded in each node of the CFG, and create an equivalent
combinational circuit to implement the expression.
• For example, if a node in the CFG corresponds to the C statement a = b ‐ a;, then the C expression embedded in
that statement is b ‐ a.
• The combinational circuit required to implement this expression is a subtractor.
• Conditional expressions generate datapath elements, too.
• The outputs of these expressions become the flags used by the hardware controller of this datapath.
2. Each variable in the C program is translated into a register with a multiplexer in front of it.
• The multiplexer is needed when multiple sources may update the register.
• By default, the register will update itself.
• The selection signals of the multiplexer will be driven by the controller.
3. The datapath circuit and the register variables are connected based on the nodes and data
edges in the DFG.
• Each assignment operation connects a combinational circuit with a register.
• Each data edge connects a register with the input of a combinational circuit.
• Finally, we also connect the system‐inputs and system outputs to inputs of datapath circuits and register outputs,
respectively.
Translating C to Hardware
Data Path Design:
• The datapath and the registers are connected consistent with the DFG
• Assignments connect combinational circuit outputs to register inputs, while the
• data edges connect register outputs to combinational circuit inputs.
• System I/O is connected to datapath inputs and register outputs resp.
• Let’s convert the GCD program to a hardware implementation
• The variables a and b are assigned to registers
• The conditional expressions for the if and while stmts require an equality‐ and greater‐than
comparator circuit
• The subtractions b ‐ a and a ‐ b are implemented using subtractors
• The connectivity of the components is defined by the data edges of the DFG
• The resulting datapath has two data inputs (in_a and in_b), and one data output (out_a)
• The circuit needs two control variables, upd_a and upd_b (outputs of controller)
• and it produces two flags, flag_while and flag_if (inputs to controller)
Translating C to Hardware
Translating C to Hardware
• The directed edges in the DFG correspond to the connections in the schematic
• Schematic representations of a circuit are low‐level representations with lots of detail
(similar to assembly code in programming languages)
• We will learn how to create HLS and behavioral VHDL descriptions that synthesize to
schematics similar to the one shown here
Translating C to Hardware: Control Path Design
• Controller Design: The design of the controller can
be derived directly from the CFG and translated
into a finite state machine (FSM)
• How do we create the controller for this datapath
such that it implements the GCD algorithm?
• This control information is present in the C
program and is captured in the CFG.
• In fact, we can translate the CFG almost straight
into hardware by considering it to be a finite state
machine (FSM) specification.
• Each of the transitions in this FSM takes 1 clock
cycle to complete.
• The activities of the FSM are expressed as
condition/command tuples.
Translating C to Hardware: Control Path Design
• For example, _ /run1 means that during this clock cycle, the value of the condition flags is a don’t‐
care, while the command for the datapath is the symbol run1.
• Similarly, flag_while/_ means that this transition is conditional on flag_while being true, and that the
command for the datapath is a hold operation.
• A hold operation is one which does not change the state of the datapath, including registers.
• The command set for this FSM includes (_, run1, run4, run5).
• Each of these symbols represents the execution of a particular node of the CFG.
• The datapath control signals can be created by additional decoding of these command signals.
Translating C to Hardware: Control Path Design
• Here, a2 and b2 are mapped into registers while the other variables are replaced with wires
• This type of manipulation allows multiple C statements to be executed per clock
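A sketch of such a manipulation is shown below: rewriting the loop body in single-assignment form lets the condition and both subtractions be evaluated in the same clock cycle, so only the loop-carried values a2 and b2 need registers while a1 and b1 become wires. The exact variable names and the rewrite itself are assumptions consistent with the a2/b2 mentioned above.

while (a2 != b2) {
    int a1 = a2;                          /* a1, b1: intermediate values -> wires      */
    int b1 = b2;
    a2 = (a1 > b1) ? a1 - b1 : a1;        /* a2, b2: loop-carried values -> registers  */
    b2 = (a1 > b1) ? b1 : b1 - a1;
}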
Finite State Machine with Datapath
Cycle‐Based Bit‐Parallel Hardware
• In this chapter, we develop a model to systematically describe custom hardware
consisting of a controller and a datapath.
• Deal with how to create datapath modules, a control model, and the combination of
control and datapath into a Finite State Machine with Datapath (FSMD).
• We will develop cycle‐based hardware models.
• In such models, the behavior of a circuit is expressed in steps of a single clock cycle.
• This abstraction level is very common in digital design, and it is captured with the term
synchronous design.
Cycle‐Based Bit‐Parallel Hardware
Hardware Modules
• A hardware module defines a level of hierarchy for a hardware netlist. In order to
communicate across the levels of hierarchy, hardware modules define ports.
• Both Mealy and Moore outputs can be present in an FSM, and usually are. Moore outputs
depend only on the current state
• VHDL descriptions of the FFs and combinational logic (next‐state logic and output logic)
are separated into two process blocks using the two‐process (two‐segment) style
Mealy State Machine
• The output of the state machine depends on both the present state and the
current input.
• When the input changes, the output of the state machine is updated
without waiting for a change in the clock input.
Moore State Machine
• The output of the state machine depends only on the present state. The outputs of the state machine are
only updated at the clock edge
VHDL code for Sequence detector (101) using moore state machine
VHDL code for Sequence detector (101) using mealy state machine
Algorithmic State Machine
• An Algorithmic State Machine (ASM) is a representation of a Finite State Machine suitable for FSMs
with a larger number of inputs and outputs than can conveniently be expressed using state
diagrams and state tables.
• Algorithmic state machines can model both Mealy and Moore Finite State
Machines; they can also model machines of the mixed type
Moore Example
• A Moore machine with input w, output z, and states A, B, and C; z = 1 only in state C (i.e., after w has been 1 for two consecutive clock cycles).

State table:

Present state | Next state (w = 0) | Next state (w = 1) | Output z
A             | A                  | B                  | 0
B             | A                  | C                  | 0
C             | A                  | C                  | 1

[The state diagram and the equivalent ASM chart appear as figures in the original slides.]
VHDL code for ASM
LIBRARY ieee ;
USE ieee.std_logic_1164.all ;

ENTITY simple IS
   PORT ( Clock : IN  STD_LOGIC ;
          Reset : IN  STD_LOGIC ;
          w     : IN  STD_LOGIC ;
          z     : OUT STD_LOGIC ) ;
END simple ;

ARCHITECTURE Behavior OF simple IS
   TYPE State_type IS (A, B, C) ;            -- state encoding (reconstructed)
   SIGNAL y : State_type ;
BEGIN
   PROCESS ( Reset, Clock )
   BEGIN
      IF Reset = '1' THEN                    -- assuming an active-high reset
         y <= A ;
      ELSIF rising_edge(Clock) THEN
         CASE y IS
            WHEN A =>
               IF w = '1' THEN y <= B ;
               ELSE            y <= A ;
               END IF ;
            WHEN B =>
               IF w = '1' THEN y <= C ;
               ELSE            y <= A ;
               END IF ;
            WHEN C =>
               IF w = '1' THEN y <= C ;
               ELSE            y <= A ;
               END IF ;
         END CASE ;
      END IF ;
   END PROCESS ;

   z <= '1' WHEN y = C ELSE '0' ;
END Behavior ;
ASM for mealy and moore combined machine
• Let us consider an example of a state graph of a sequential network shown below.
• This state graph has both Mealy and Moore outputs. The outputs Y1 and Y2 are the Mealy
outputs and so should be conditional outputs.
• The Ya, Yb, and Yc are the Moore outputs so they should be part of state box.
• Input X can either be “0” or “1” and hence it should be part of the decision box.
Verilog model for ASM
Design of Secure Memory Access Controller
• Let’s consider a control algorithm that is designed to provide a secure
access control mechanism to an on‐chip memory
• create a project with GPIO and BRAM IP blocks
• The general purpose I/O is configured to implement two memory‐
mapped 32‐bit registers located between the PS and PL sides of the
Zynq SoC
• The BRAM is configured as a stand‐alone memory and is embedded
within the PL side
• Our memory controller’s primary function is to allow C programs
running on the ARM microprocessor to load and unload the BRAM in
a restricted manner
Design of Secure Memory Access Controller
• Let’s start at a high level of abstraction with the FSM running in the hardware, e.g.,
• One of the first issues we’ll need to deal with is setting up a communication mechanism
between the C program and the FSM
• The GPIOs are visible to both the C program and the VHDL, and will be used for this
purpose
Secure Memory Access Controller
• We will define the bit‐fields within the two 32‐bit GPIO registers as follows:
• Some of the high‐order 16 bits are designated for control, while all of the lower‐order
16 bits are designated for data transfers
• Note that the GPIOs cross clock domains, with the PL side running at 50 or 100 MHz and
the PS side running at more than 600 MHz
• Therefore, we need a reliable protocol to allow transfers between PL and PS
Secure Memory Access Controller
• Synchronization between our VHDL controller (PL) and the C program (PS) is done using a 2‐
way handshake, which makes use of two signals
• LM_ULM_stopped and LM_ULM_continue
• The following defines the protocol for data transfers from the C program to BRAM:
• Data transfer from BRAM to C program is similar except for direction of data flow
• This protocol ensures a reliable communication channel between the PS and PL side
Secure Memory Access Controller
Low Level Algorithm for Secure Memory Access Controller
• We are now ready to describe the FSM at a lower level of detail
• The C program starts the secure memory access controller (SMAC)
FSM, but other usage scenarios may have other state machines start SMAC
• In this case, the C program may need to inform SMAC that it has no read or write
requests for the memory at that point in time
• The C program operations in the following are designed to provide context, and
are not part of the FSM
Secure Memory Access Controller
Algorithm for Secure Memory Access Controller
1. C program checks ’ready’, sets ’load_unload’ flag and issues ’start’ signal
2. C program waits for ’stopped’
3. idle: SMAC waits in idle for ’start’
if ( start = ’1’ )
Store ’base address’ and ’upper limit’ registers
Check ’load_unload’, if 0, Goto load_mem
Check ’load_unload’, if 1, Goto unload_mem
4. load_mem: Process write requests from C program; Assert ’stopped’ signal
If ’done’ is 0
Check ’continue’, if asserted, update BRAM
Assert ’PNL_BRAM_we’
Assign ’PNL_BRAM_din’ GPIO data
Goto wait_load_unload
else
Goto wait_done
Secure Memory Access Controller
5. unload_mem: Process read requests from C program
Drive GPIO register with ’PNL_BRAM_dout’ data
Assert ’stopped’
If ’done’ is 0
Check ’continue’, if asserted, C program has data
Goto wait_load_unload
else
Goto wait_done
6. wait_load_unload: Finish handshake and update addr
Check ’continue’, if 0
If ’done’ is 1
Goto wait_done
Elsif ’base addr’ = ’upper_limit’
Goto idle
Else
Inc ’base addr’
Goto load_mem if ’load_unload’ is 0
Goto unload_mem if ’load_unload’ is 1
7. wait_done: Wait for C program to deassert ’done’
Check ’done’, if 0
Goto idle
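A minimal C sketch of the PS‐side load sequence implied by steps 1-7 is given below. The two GPIO register addresses, the bit positions of the ready/start/load_unload/continue/done/stopped fields, and the exact ordering of the final handshake are assumptions for illustration; the real values come from the Vivado address map and the VHDL constants of the design.

#include <stdint.h>

/* assumed control/status bit positions in the upper 16 bits of the GPIOs */
#define CTRL_START        (1u << 16)
#define CTRL_LOAD_UNLOAD  (1u << 17)   /* 0 = load (write to BRAM)            */
#define CTRL_CONTINUE     (1u << 18)
#define CTRL_DONE         (1u << 19)
#define STAT_READY        (1u << 16)
#define STAT_STOPPED      (1u << 17)
#define DATA_MASK         0xFFFFu

/* assumed addresses of the two memory-mapped 32-bit GPIO registers */
static volatile uint32_t * const GPIO_OUT = (uint32_t *)0x41200000;  /* PS -> PL */
static volatile uint32_t * const GPIO_IN  = (uint32_t *)0x41200008;  /* PL -> PS */

/* Load 'words' 16-bit values into the BRAM through the SMAC FSM. */
void smac_load(const uint16_t *buf, int words)
{
    /* steps 1 and 2: check 'ready', select load mode, issue 'start' */
    while (!(*GPIO_IN & STAT_READY)) ;
    *GPIO_OUT = CTRL_START;                        /* 'load_unload' left at 0     */

    for (int i = 0; i < words; i++) {
        while (!(*GPIO_IN & STAT_STOPPED)) ;       /* SMAC stopped, waiting       */
        *GPIO_OUT = CTRL_START | CTRL_CONTINUE | (buf[i] & DATA_MASK);
        while (*GPIO_IN & STAT_STOPPED) ;          /* SMAC latched the word       */
        *GPIO_OUT = CTRL_START;                    /* complete the 2-way handshake */
    }

    /* step 7: signal the end of the transfer and finish the final handshake */
    *GPIO_OUT = CTRL_START | CTRL_DONE;
    while (!(*GPIO_IN & STAT_STOPPED)) ;
    *GPIO_OUT = 0;                                 /* deassert 'done' and 'start' */
}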
ASM For LoadUnloadMem
System‐on‐Chip architecture
Introduction
• Technological Advances
• today’s chips can contain more than a hundred million transistors.
• transistor gate lengths are now measured in nanometers.
• approximately every 18 months the number of transistors on a chip doubles
(Moore’s law).
• The Consequences
• components connected on a Printed Circuit Board can now be integrated onto a
single chip.
• hence the development of System‐on‐Chip design.
What is SoC
• Advances in VLSI manufacturing technology have made it possible to put
millions of transistors on a single die.
System on Chip
What is SoC ?
• An IC that integrates multiple components of a system onto a single chip.
• An SoC is not only a chip, but more about the “system”.
• SoC = Chip + Software + Integration
• The SoC chip includes:
• Embedded processor
• ASIC Logics and analog circuitry
• Embedded memory
• The SoC Software includes:
• OS, compiler, simulator, firmware, driver, protocol stack
• Integrated development environment (debugger, linker, ICE)
• Application interface (C/C++, assembly)
Evolution: Boards to SoC
• Evolution:
• IP based design
• Platform‐based design
• Some Challenges
• HW/SW Co‐design
• Integration of analog (RF) IPs
• Mixed Design
• Productivity
• Emerging new technologies
• Greater complexity
• Increased performance
• Higher density
• Lower power dissipation
Migration from ASICs to SoCs
Three forms of SoC design
The scenario for SoC design is characterized by three forms:
1. ASIC vendor design: This refers to the design in which all the
components in the chip are designed as well as fabricated by
an ASIC vendor.
2. Integrated design: This refers to the design by an ASIC vendor
in which all components are not designed by that vendor. It
implies the use of cores obtained from some other source
such as a core/IP vendor or a foundry.
3. Desktop design: This refers to the design by a fabless
company that uses cores which for the most part have been
obtained from other sources such as IP companies, EDA
companies, design services companies, or a foundry.
A common set of problems facing everyone who is
designing complex chips
Reusing macros (called “cores”, or IP) that have already been designed and
verified helps to address all of the problems above.
Design for Reuse
To overcome the design gap, design reuse, i.e., the use of pre-designed
and pre-verified cores or the reuse of existing designs, becomes a
vital concept in design methodology.
An effective block-based design methodology requires an extensive
library of reusable blocks, or macros, and it is based on the following
principles:
The macro must be extremely easy to integrate into the overall
chip design.
The macro must be so robust that the integrator has to perform
essentially no functional verification of internals of the macro.
The challenge for designers is not whether to adopt reuse, but how to employ
it effectively.
Intellectual Property
Utilizing predesigned modules enables designers:
• to avoid reinventing the wheel for every new product,
• to accelerate the development of new products,
• to assemble various blocks of a large ASIC/SoC quite rapidly,
• to reduce the possibility of failure based on design and verification of a block for the first time.
[Figure: Resources vs. Number of Uses]
Intellectual Property Categories
IP cores are classified into three distinct categories:
Hard IP cores consist of hard layouts using particular physical design libraries and are
delivered as mask-level design blocks (GDSII format). The integration of hard IP cores
is quite simple, but hard cores are technology dependent and provide minimum flexibility
and portability in reconfiguration and integration.
Soft IP cores are delivered as RTL VHDL/Verilog code to provide functional descriptions of
IPs. These cores offer maximum flexibility and reconfigurability to match the requirements
of a specific design application, but they must be synthesized, optimized, and verified by
their user before integration into designs.
Firm IP cores bring the best of both worlds and balance the high performance and
optimization properties of hard IPs with the flexibility of soft IPs. These cores are delivered
in the form of targeted netlists for specific physical libraries after going through synthesis without
performing the physical layout.
Comparison of Different IP Formats
Typical SoC
SoC Structure
Multi‐Core (Processor) System‐on‐Chip
• Inter‐node communication between CPU/cores can be performed by
message passing or shared memory.
• Number of processors in the same chip‐die increases at each node (CMP
and MPSoC).
• Memory sharing will require: Shared Bus
• Large Multiplexers
• Cache coherence
• Not Scalable
• Message Passing: NOC: Network‐on‐Chip
• Scalable
• Require data transfer transactions
• Overhead of extra communication
Buses to Networks
• Architectural paradigm shift: Replace wire spaghetti by network
• Usage paradigm shift: Pack everything in packets
• Organizational paradigm shift
• Confiscate communications from logic designers
• Create a new discipline, a new infrastructure responsibility
System‐on‐Chip(SoC)
Traditional SoC
• Variety of dedicated interfaces
• Design and verification complexity
• Unpredictable performance
• Many underutilized wires
NoC: A paradigm Shift in VLSI
[Figure: an example NoC in which CPU, coprocessor, DSP, MPEG, accelerator, DMA, DRAM, and Ethernet modules attach to on-chip switches through network interfaces (NI)]
NoC Operation Example
[Figure: a CPU and an I/O device, each attached to a network switch through a network interface]
A transaction traverses the NoC in the following steps:
1. CPU request
2. Packetization and transmission
3. Routing
4. Receipt and unpacketization (AHB, OCP, ... pinout)
5. Device response
6. Packetization and transmission
7. Routing
8. Receipt and unpacketization
Network‐on‐chip (NoC)
What are NoCs?
Why did NoCs emerge?
On‐chip Interconnection Types
[Figure: shared bus: processors P1–P7, memories M1–M4, and I/O interfaces 1–3 all attached to a single bus]
On‐chip Interconnection Types
[Figure: hierarchical bus: the processors, memories, and I/O interfaces are partitioned over several bus segments connected by a bridge; several masters must wait for bus access]
On‐chip Interconnection Types
[Figure: bus matrix: a crossbar of buses interconnecting the processors, memories, and I/O; fewer masters must wait for access]
On‐chip Interconnection Types
[Figure: Network-on-Chip: processing elements attach to routers through network interfaces; routers contain input buffers and are connected by unidirectional links]
Why do today's SoCs need a NoC interconnect fabric?
Benefits of NoCs :
• Reduce Wire Routing Congestion
• Ease Timing Closure
• Higher Operating Frequencies
• Change IP Easily
Network‐on‐chip (NoC)
Network‐on‐chip (NoC) is a reliable and scalable communication paradigm deemed
as an alternative to classic bus systems in modern systems‐on‐chip designs
a move from computation‐centric to communication‐centric design
SoC Concepts
• A system‐on‐chip architecture combines one or more microprocessors, an on‐chip bus
system, several dedicated coprocessors, and on‐chip memory, all on a single chip.
• An SoC architecture provides general‐purpose computing capabilities along with a few
highly specialized functions, adapted to a particular design domain.
SoC Concepts
• A processor with a specific configuration of peripherals is also called a platform.
• Just like a personal computer is a platform for general‐purpose computing, a system‐on‐chip is a
platform for domain specialized computing.
• Examples of application domains are mobile telephony, video processing, or high‐speed
networking.
• The set of applications in the video‐processing domain for example could include image
transcoding, image compression and decompression, image color transformations, and so forth.
• The specialization of the platform ensures that its processing efficiency is higher compared to that
of general‐purpose solutions.
• Entire families of microcontrollers are defined using a single type of microprocessor integrated
with different kinds of peripherals (to serve one market segment).
• This type of SoC is called a derivative platform.
• For example, an SoC for automotive applications may contain different peripherals than an SoC
for cellphones, even though both may be based on ARM.
SoC Concepts
• SoC architecture can be analyzed along 4 orthogonal dimensions:
• Control, Communication, Computation, and Storage.
• The role of central controller is given to the microprocessor, which is responsible for issuing
control signals to, and collecting status signals from, the various components in the
system.
• The microprocessor may or may not have a local instruction memory.
• In case it does not have a local instruction memory, caches may be utilized to improve
instruction memory bandwidth.
• The SoC implements communication using system‐wide busses.
• Each bus is a bundle of signals including address, data, control, and synchronization
signals.
SoC Concepts
• The data transfers on a bus are expressed as read‐ and write‐operations with a particular
memory address.
• The bus control lines indicate the nature of the transfer (read/write, size, source,
destination), while the synchronization signals ensure that the sender and receiver on the
bus are aligned in time during a data transfer.
• The ensemble of components can thus be represented in an address map, a list of all
system‐bus addresses relevant in the system‐on‐chip.
SoC Concepts
• It is common to split SoC busses into segments.
• Each segment connects a limited number of components, grouped according to their
communication needs.
• In the example, a high‐speed communication bus is used to interconnect the microprocessor,
a high‐speed memory interface, and a Direct Memory Access (DMA) controller.
• A DMA is a device specialized in performing block‐transfers on the bus, for example to copy
one memory region to another.
• Next to a high‐speed communication bus, you may also find a peripheral bus, intended for
lower‐speed components such as a timer and input–output peripherals.
• Segmented busses are interconnected with a bus bridge, a component that translates bus
transfers from one segment to another segment.
• A bus bridge will only selectively translate transfers from one bus segment to the other.
• This selection is done based on the address map.
SoC Concepts
• Therefore, bus segmentation increases the available communication parallelism
• The bus control lines of each bus segment are under the command of a bus master.
• The microprocessor and the ‘DMA controller’ are examples of bus masters.
• Other components, the bus slaves, will follow the directions of the bus master.
• Each bus segment can contain one or more bus masters.
• In case there are multiple masters, the identity of the bus master can be rotated among
bus‐master components at run time.
• In that case, a bus arbiter will be needed to decide which component can become a bus
master for a given bus transfer.
SoC Interfaces for Custom Hardware
• The shaded areas in the SoC block diagram correspond to places where a designer could
integrate custom hardware.
• “custom hardware module” means a dedicated digital machine described as an FSMD or
as a microprogrammed machine.
• Eventually, all custom hardware will be under control of the central processor in the SoC.
• The SoC architecture offers several possible hardware software interfaces to attach
custom hardware modules.
• Three approaches can be distinguished in the SoC block diagram .
SoC Interfaces for Custom Hardware
• The most general approach is to integrate a custom hardware module as a standard peripheral on
a system bus.
• The microprocessor communicates with the custom hardware module by means of read/write
memory accesses.
• The memory addresses occupied by the custom hardware module cannot be used for other
purposes (i.e., as addressable memory).
• For the memory addresses occupied by the custom hardware module, the microprocessor’s cache
has no meaning, and the caching effect is unwanted (these addresses must be treated as non‐cacheable).
• Microcontroller chips with many different peripherals typically use this memory‐mapped strategy
to attach peripherals.
• The strong point of this approach is that a universal communication mechanism (memory
read/write operations) can be used for a wide range of custom hardware modules.
• The corresponding disadvantage, of course, is that such a bus‐based approach to integrate
hardware is not very scalable in terms of performance
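For instance, a memory-mapped custom hardware module is driven from C with plain loads and stores to its addresses. The sketch below is illustrative only: the base address 0x60000000 and the register offsets are assumptions; in a real SoC they come from the system address map, and the registers are accessed through volatile pointers so the compiler does not cache or reorder the accesses.

#include <stdint.h>

#define COPRO_BASE    0x60000000u      /* assumed location in the address map   */
#define COPRO_DATA    (*(volatile uint32_t *)(COPRO_BASE + 0x0))
#define COPRO_CTRL    (*(volatile uint32_t *)(COPRO_BASE + 0x4))
#define COPRO_STATUS  (*(volatile uint32_t *)(COPRO_BASE + 0x8))

uint32_t copro_compute(uint32_t operand)
{
    COPRO_DATA = operand;              /* bus write: operand register           */
    COPRO_CTRL = 1;                    /* bus write: 'start' command            */
    while ((COPRO_STATUS & 1) == 0)    /* poll the 'done' flag over the bus     */
        ;
    return COPRO_DATA;                 /* bus read: result register             */
}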
SoC Interfaces for Custom Hardware
• A second mechanism is to attach custom hardware through a local bus system or
coprocessor interface provided by the microprocessor.
• In this case, the communication between the hardware module and the microprocessor
will follow a dedicated protocol, defined by the local bus system or coprocessor
interface.
• In comparison to system‐bus interfaces, coprocessor interfaces have a high bandwidth
and a low latency.
• The microprocessor may also provide a dedicated set of instructions to communicate
over this interface.
• Typical coprocessor interfaces do not involve memory addresses.
• This type of coprocessor obviously requires a microprocessor with a coprocessor‐ or
local‐bus interface.
SoC Interfaces for Custom Hardware
• Microprocessors may also provide a means to integrate a custom‐hardware datapath
inside of the microprocessor.
• The instruction set of the microprocessor is then extended with additional, new
instructions to drive this custom hardware.
• The communication channel between the custom datapath and the processor is typically
through the processor register file, resulting in a very high communication bandwidth.
• However, the very tight integration of custom hardware with a microprocessor also
means that the traditional bottlenecks of the microprocessor are also a bottleneck for
the custom‐hardware modules.
• If the microprocessor is stalled because of external events (such as memory‐access
bandwidth), the custom datapath is stalled as well.
Four Design Principles in SoC Architecture
• A SoC is very specific to an application domain.
• Are there any guiding design principles that are relevant to the design of any SoC?
• The following are the four design principles that govern the majority of modern SoC architectures.
I. Heterogeneous and distributed communications
II. Heterogeneous and distributed data processing
III. Heterogeneous and distributed storage
IV. Hierarchical control.
Heterogeneous and Distributed Data Processing
• A first prominent characteristic of an SoC architecture is heterogeneous and distributed
data processing.
• For example, a digital signal processing chip in a camera may contain specialized units to
perform image‐processing.
• The Intel Core 2 processor contains 291 million transistors in 65 nm CMOS technology.
• Assuming a core clock frequency of 2.1GHz, we thus find that the silicon used to create a
Core 2 can theoretically implement 682,000 Giga‐operations per second.
E_intrinsic = (291 × 10^6) / (28 × 32) × 2.1 GHz ≈ 682,000 Gops
(i.e., the transistor budget divided by roughly 28 × 32 transistors for a 32-bit adder, switching at the clock rate)
Heterogeneous and Distributed Data Processing
• The actual Core 2 architecture handles around 9.24 instructions per clock cycle, in a
single core and in the most optimal case.
• The actual efficiency of the 2.1 GHz Core 2 is therefore 9.24 × 2.1 GHz ≈ 19.4 Giga‐operations per second.
• We make the (strong) approximation that these 9.24 instructions each correspond to a
32‐bit addition, and call the resulting throughput the actual Core2 efficiency.
• The ratio of the intrinsic Core2 efficiency over the actual Core2 efficiency illustrates the
efficiency of silicon technology compared to the efficiency of a processor core
architecture.
Efficiency ratio = E_intrinsic / E_actual = 682,000 / 19.4 ≈ 35,150
• Therefore, bare silicon can implement computations roughly 35,000 times more efficiently than a
Core 2!
• This demonstrates why specialization of silicon using multiple, independent computational
units is so attractive.
Heterogeneous and Distributed Communications
• The central bus in a system‐on‐chip is a critical resource.
• One approach to prevent this resource from becoming a bottleneck is to split the bus
into multiple bus segments using bus bridges.
• The on‐chip communication requirements typically show large variations over an SoC.
• There may be shared busses, point‐to‐point connections, serial connections, and parallel
connections.
Heterogeneous and Distributed Communications
• Heterogeneous and distributed SoC communications enable a designer to exploit the on‐
chip communication bandwidth.
• In modern technology, this bandwidth is extremely high.
• Example:
• In a 90nm 6‐layer metal processor, we can reasonably assume that metal layers will be
used as follows:
• Two metal layers are used for power and ground, respectively,
• Two metal layers are used to route wires in the X direction,
• Two metal layers are used to route wires in the Y direction.
• The density of wires in a 90 nm process is 4 wires per micron, and the bit frequency is 500 MHz.
• Consequently, in a chip of 10 millimeters on a side, we will have 40,000 wires per metal layer in each direction.
• Such a chip can transport 80,000 bits in any direction at a frequency of 500 MHz.
• This corresponds to 80,000 × 500 Mbit/s = 40 terabits per second
Heterogeneous and Distributed Communications
• On‐chip bandwidth is thus cheap. The challenge is to efficiently organize it.
• Note that the same is not true for off‐chip bandwidth, i.e., off‐chip bandwidth is very
expensive.
• Paul Franzon describes an example of a chip that would produce 8 Tbps off‐chip bandwidth.
• The best links nowadays produce 20 Gbps by means of differential signaling.
• Here, each signal is produced in direct as well as complementary form, and thus consumes
two chip pins.
• At 20 Gbps, we would need 400 pairs to produce 8 Tbps, which requires 800 pins.
• In addition, you would need to add 800 pins to provide ground and power, just to provide
adequate drive capability and signal integrity.
• The 8 Tbps chip thus requires over 1600 pins, which indicates the package limits its
practicality.
• In addition, assuming about 30 mW per pair, the chip would consume 12 W for I/O
operations alone.
Heterogeneous and Distributed Storage
• A System‐on‐Chip contains multiple types of memories distributed across multiple
locations of a chip.
• For example,
• Microprocessors/FSMD/etc. contain one or more registers
• Microprocessors contain instruction‐caches and data‐caches (static RAM)
• Microprocessors and microprogrammed engines contain local instruction memories
• Dedicated buffer memories may be present to support a specific application, For
example, a video buffer in a camera, a scratchpad memory next to a processor
• Distributed storage can significantly complicate the concept of a centralized memory
address space which is common in classic computing architectures.
• Often, local memories are just local buffer storage, invisible for the other components of
the system.
Heterogeneous and Distributed Storage
• Therefore, local storage helps to increase parallelism in the overall system.
• However, very often, distributed memories need to maintain or communicate common
data sets.
• Consider the system above, that consists of a CPU with a vector multiplier coprocessor
(implemented as FSMD).
• The coprocessor can calculate very quickly the inner product of two vectors stored in a
local data buffer.
Heterogeneous and Distributed Storage
• Thus, for two arrays u and v stored in the data buffer, the vector multiplier evaluates the inner product Σᵢ u[i]·v[i].
• Now consider how this system operates when the CPU performs a matrix multiplication
on matrices which are stored in the data memory.
• A matrix multiplication in software can be written as three nested loops (see the reference sketch after this list):
• In order to use the vector multiplier, we need to copy a row of a and a column of b to u
and v in the data buffer attached to the multiplier.
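The reference (software-only) version referred to above could look as follows; the matrix size N and the int element type are assumptions.

#define N 16                            /* assumed matrix dimension */

void matmul_sw(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];   /* inner product of row i and column j */
            c[i][j] = sum;
        }
}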
Heterogeneous and Distributed Storage
• We then perform the vector‐multiplication, and transmit the result to be stored in c[i][j].
• The C program on the CPU might look like:
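A sketch of that CPU-side loop is given below. The helpers copy_to_u(), copy_to_v(), and vmult_run() are hypothetical names for the operations described in the text: copying a row of a and a column of b into the coprocessor's local data buffer and then starting the FSMD to obtain the inner product. N is the same assumed matrix dimension as in the software version above.

/* hypothetical coprocessor interface: fill the local buffers, run, read result */
void copy_to_u(int index, int value);
void copy_to_v(int index, int value);
int  vmult_run(int length);                /* returns the sum of u[k] * v[k]     */

void matmul_hw(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                copy_to_u(k, a[i][k]);     /* row i of a    -> buffer u           */
                copy_to_v(k, b[k][j]);     /* column j of b -> buffer v           */
            }
            c[i][j] = vmult_run(N);        /* coprocessor computes the inner product */
        }
}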
Portable Multimedia System
• This system is intended to be used for still images, video, and audio in portable (battery‐operated)
devices.
Imaging/Video Subsystem
• The video subsystem also contains a video encoder, capable of merging two video
streams on screen, and providing picture‐in‐picture functionality.
• The video encoder also contains menu subsystem functionality.
• The output of the video encoder goes to an attached LCD or a TV.
• The video encoder provides approximately 100 operations per pixel, while the power
budget of the entire video subsystem is less than 100 mW.
• These numbers are clearly out of range for a software‐driven processor.
Portable Multimedia System
DSP Subsystem
• The DSP subsystem is created on top of a 72 MHz C54x processor with 128 Kbytes of RAM.
• The DSP processor performs the main processing and incorporates the control logic for the wide
range of signal processing algorithms.
• Signal processing algorithms include MPEG‐1, MPEG‐2, MPEG‐4, WMV, H.263, H.264, JPEG,
JPEG2K, M‐JPEG, MP3, AAC, WMA.
• A coprocessor subsystem delivers additional computing power for the cases where the DSP falls
short.
• There is a DMA engine that moves data back and forth between the memory attached to the
DSP and the coprocessors.
Portable Multimedia System
DSP Subsystem