
Hardware-Software Co-Design

Detailed Syllabus:
The Nature of Hardware and Software: Introducing Hardware/Software Co-design, The Quest for Energy Efficiency, The Driving Factors in Hardware/Software Co-design, The Dualism of Hardware Design and Software Design.

Data Flow Modeling and Transformation: Introducing Data Flow Graphs, Analyzing Synchronous Data Flow Graphs, Control Flow Modeling and the Limitations of Data Flow, Transformations.

Data Flow Implementation in Software and Hardware: Software Implementation of Data Flow, Hardware Implementation of Data Flow, Hardware/Software Implementation of Data Flow.

Analysis of Control Flow and Data Flow: Data and Control Edges of a C Program, Implementing
Data and Control Edges, Construction of the Control Flow Graph and Data Flow Graph.

Finite State Machine with Datapath: Cycle-Based Bit-Parallel Hardware, Hardware Modules,
Finite State Machines with Datapath, FSMD Design Example: A Median Processor.

System on Chip: The System-on-Chip Concept, Four Design Principles in SoC Architecture, SoC
Modeling in GEZEL. Applications: Trivium Crypto-Coprocessor, CORDIC Co-Processor.


Learning Resources:

Text Books:
1. Patrick Schaumont, A Practical Introduction to Hardware/Software Co-design, Springer, 2010.
2. Ralf Niemann, Hardware/Software Co-Design for Data Flow Dominated Embedded Systems, Springer, 1998.

Reference Books:
1. Soonhoi Ha and Jürgen Teich (eds.), Handbook of Hardware/Software Codesign (Springer Reference), 2017, ISBN: 978-94-017-7267-9.
2. Jørgen Staunstrup and Wayne Wolf, Hardware/Software Co-Design: Principles and Practice, 1997.

The Nature of Hardware and Software
• What is HW/SW codesign (Prof. Schaumont's definition):

Hardware/Software codesign is the partitioning and design of an application in terms of fixed and flexible components.

Other definitions:
• HW/SW codesign is a design methodology supporting the concurrent development of hardware and software (co-specification, co-development, and co-verification) in order to achieve shared functionality and performance goals for a combined system.

• HW/SW codesign means meeting system-level objectives by exploiting the cooperation of hardware and software through their concurrent design.

Giovanni De Micheli and Rajesh Gupta, "Hardware/Software Co-design", Proceedings of the IEEE, vol. 85, no. 3, March 1997, pp. 349-365.

• Codesign is the concurrent development of hardware and software.


Hardware
 We will model hardware by means of single-clock synchronous digital circuits created using word-level combinational logic and flip-flops.
 Hardware is realized by word-level combinational and sequential components, such as registers, MUXs, adders, and multipliers.

[Figure: an example word-level circuit; a 3-bit register (d, q, clk) and a 2-to-1 multiplexer feeding a 3-bit adder.]

 Cycle-based hardware modeling is often called register-transfer-level (RTL) modeling, because the behavior of a circuit can be thought of as a sequence of transfers between registers, with logical and arithmetic operations performed on the signals during the transfers.
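To make the cycle-based view concrete, here is a minimal C sketch of an RTL-style simulation; the specific circuit (a 3-bit register whose value an adder increments every clock cycle) is an assumption chosen to echo the figure above. The combinational logic is evaluated first, and the register is updated at the simulated clock edge:

#include <stdio.h>

/* Cycle-based (RTL-style) simulation sketch: a 3-bit register feeding an
   adder. The circuit itself is assumed for illustration. */
int main(void) {
    unsigned q = 0;                        /* register contents (3 bits) */
    for (int cycle = 0; cycle < 8; cycle++) {
        unsigned d = (q + 1) & 0x7;        /* combinational logic: 3-bit adder */
        printf("cycle %d: q = %u\n", cycle, q);
        q = d;                             /* register transfer at the clock edge */
    }
    return 0;
}

Anything that happens between two clock edges, such as a glitch at the adder output, is invisible in this model; only the value captured at the edge matters.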

Hardware

 Bear in mind that this is a very simplistic treatment of actual hardware. We ignore advanced circuit styles, including asynchronous hardware, dynamic logic, multi-phase clocked hardware, etc.

 The cycle-based model is limited because it does not model glitches, race conditions, or events that occur within a clock cycle.

 However, it provides a convenient abstraction for a designer who is mapping a behavior, e.g., an algorithm, into a set of discrete steps.
Software:
 Hardware/software codesign deals with hardware/software interfaces.
 The low-level construction details of software are important, because they can
directly affect the performance and implementation cost of the interface.
 Hence, the course will consider important software implementation aspects, such as the various sections of memory (global, stack, heap) and the different kinds of memory (registers, caches, RAM, and ROM).
 We will model software as single-thread sequential programs, written in C or
assembly.
 Most of the discussions in this topic will be processor-independent, and they will
assume 32-bit architectures (ARM, Microblaze) as well as 8-bit architectures (8051,
Picoblaze).
 The choice of single-thread sequential C is simply because it matches the actual execution model of a typical microprocessor so well.
 For example, the sequential execution of C programs corresponds to the sequential
instruction fetch-and-execute cycle of microprocessors.

Software: We assume software refers to a single-thread sequential program, written in C or assembly.

Programs will be shown as in the following C example:
int max;

void findmax(int a[10])
{
    unsigned i;
    max = a[0];
    for (i = 1; i < 10; i++)
        if (a[i] > max) max = a[i];
}


Software
ARM assembly example

start
    LDR  R0, =array         ; load the address of the array into R0
    LDR  R1, =array_length  ; load the length of the array into R1
    LDR  R2, [R0], #4       ; load the first element into R2 (current max)
    SUBS R1, R1, #1         ; first element consumed: decrement the count
loop
    CMP  R1, #0             ; check if we have processed all elements
    BEQ  done               ; if R1 is 0, we're done
    LDR  R3, [R0], #4       ; load the next element of the array into R3
    CMP  R3, R2             ; compare the next with the current max (R2)
    BLE  continue_loop      ; if R3 <= R2, skip the update of max
    MOV  R2, R3             ; if R3 > R2, update the max value in R2
continue_loop
    SUBS R1, R1, #1         ; decrement the counter
    BNE  loop               ; if there are more elements, continue the loop
done
    END


 The variables of C are stored in a single, shared memory space, corresponding to the memory attached to the microprocessor.
 There is a close connection between the storage concepts of a microprocessor (registers, stack) and the storage types supported in C (register int, local variables).
 Furthermore, common datatypes in C (char, int) map directly onto microprocessor variables (byte, word).
 A detailed understanding of C execution is closely related to a detailed understanding of microprocessor activity at a lower abstraction level.
 Of course, there are many forms of software that do not fit the model of a single-thread sequential C program.
 Multi-threaded software, for example, creates the illusion of concurrency and lets users execute multiple programs at once.
 Other forms of software, such as object-oriented software and functional programming, substitute the simple machine model of the microprocessor with a more sophisticated one.


 Let's compare a C program and an assembly program.
 An ideal starting point when matching a C program to an assembly program is to look for similar structures:
 loops in C will be reflected as branches in assembly;
 if-then-else statements in C will be reflected as conditional branches in assembly.
 Even if you're unfamiliar with the assembly language of a given microprocessor, you can often derive such structures easily.


Hardware and Software
 The objective of this course is to discuss the combination of hardware design and software design in all its forms.
 Hardware and software can be modeled using RTL programs and C programs, respectively.
 The term model merely indicates that these are not the actual implementation, but only a representation of it.
 An RTL program is a model of a network of logic gates; a C program is a model of a binary image of microprocessor instructions.
 Models are an essential part of the design process.
 They are a formal representation of a designer's intent, and they are used as input for simulation tools and implementation tools.
 In hardware/software codesign, we work with models that are partly written as C programs and partly as RTL programs.


 Figure shows an 8051 microcontroller and an attached coprocessor.
 The coprocessor is attached to the 8051 microcontroller through two 8-bit
ports P0 and P1.
 A C program executes on the 8051 microcontroller, and this program contains
instructions to write data to these two ports.
 When a given, predefined value appears on port P0, the coprocessor will
make a copy of the value present on port P1 into an internal register.

[Listing: 8051 C driver program, not reproduced here.]

[Listing: GEZEL hardware model of the 8051 platform and coprocessor, not reproduced here; the line numbers cited below refer to this listing.]
 This very simple design can be addressed using hardware/software codesign;

 it includes the design of a hardware model and the design of a C program.

 The hardware model contains the 8051 processor, the coprocessor, and the connections
between them.

 During execution, the 8051 processor will execute a software program written in C.

 The C program and the RTL hardware model for this design are shown in the listings above; the hardware model is written in the GEZEL language.

 The C driver sends three values to port P1.

 Each time, it also cycles the value on port P0 between ins hello and ins idle, which are encoded as the values 1 and 0, respectively.

 The hardware model includes both the microcontroller and the coprocessor.

 The coprocessor is on lines 1–18.

 This particular hardware model is a combination of a finite state machine (lines 10–18) and a
datapath (lines 1–8).

 This FSMD is quite easy to understand.

 The datapath contains several instructions: decode and hello.

 The FSM controller selects, each clock cycle, which of those instructions to execute.

 For example, lines 14–15 contain a control statement with the following meaning: when the value of insreg is 1 and the current state of the FSM controller is s1, the datapath will execute the instructions hello and decode, and the next state of the FSM controller is s2.

 When the value of insreg is 0, the datapath will execute only the instruction decode, and the next state of the FSM controller is s1.

 The overall coprocessor behavior is like this: when the ins input changes from 0 to 1, then the
din input will be printed in the next clock cycle.

 The 8051 microcontroller is captured with three ipblock (GEZEL library
modules), on lines 20–37.
 The first ipblock is an i8051system.
 It represents the 8051 microcontroller core, and it indicates the name
of the compiled C program that will execute on this core (driver.ihx on
line 22).
 The other two ipblock are two 8051 output ports (i8051systemsource),
one to model port P0, and the other to model port P1.
 Finally, the coprocessor and the 8051 ports are wired together in a
top-level module, shown in lines 39–49.
 We can now simulate the entire model, including hardware and
software, as follows.
 First, the 8051 C program is compiled to a binary executable.
 Next, the GEZEL simulator will combine the hardware model and the
8051 binary executable in a co-simulation.

 The output of the simulation model is shown below:

> sdcc driver.c
> /opt/gezel/bin/gplatform hello.fdl
i8051system: loading executable [driver.ihx]
9662 Hello! You gave me 3/3
9806 Hello! You gave me 2/2
9950 Hello! You gave me 1/1
Total Cycles: 10044

 Notice that the model produces output on cycles 9662, 9806, and 9950, while the complete C program executes in 10044 cycles.

Defining Hardware/Software Codesign

 Hardware/Software codesign is the design of cooperating hardware components and software components in a single design effort.

 For example, if you were to design the architecture of a processor and at the same time develop a program that could run on that processor, you would be using hardware/software codesign.
 However, this definition does not tell precisely what software and hardware mean.
 In the previous example, the software was a C program, and the hardware was an 8051 microcontroller with a coprocessor.
 In reality, there are many forms of hardware and software, and the distinction between them can easily become blurred.


 A Field-Programmable Gate Array (FPGA) is a hardware circuit that can be reconfigured to a user-specified netlist of digital gates.
 The program for an FPGA is a 'bitstream', and it is used to configure the netlist topology.
 Writing 'software' for an FPGA really looks like hardware development, even though it is software.
 A soft-core is a processor implemented in the bitstream of an FPGA.
 However, the soft-core itself can execute a C program as well. Thus,
software can execute on top of other ‘software’.

 A Digital-Signal Processor (DSP) is a processor with a specialized
instruction set, optimized for signal-processing applications.
 Writing efficient programs for a DSP requires detailed knowledge of
these specialized instructions.
 Very often, this means writing assembly code, or making use of a
specialized software library.
 Hence, there is a strong connection between the efficiency of the
software and the capabilities of the hardware.

 An Application-Specific Instruction-set Processor (ASIP) is a processor with a customizable instruction set.
 The hardware of such a processor can be extended, and these
hardware extensions can be encapsulated as new instructions for the
processor.
 Thus, an ASIP designer will develop a hardware implementation for
these custom instructions and subsequently write software that uses
those instructions.

 These examples illustrate a few of the many forms of hardware and software that designers use today.
 A common characteristic of all these examples is that creating the 'software' requires intimate familiarity with the 'hardware'.
 In addition, hardware includes much more than RTL models: it also includes specialized processor instructions, the FPGA fabric, multicore architectures, and more.
 Let us define the application as the overall function of a design, including hardware as well as software.
 This allows us to define hardware/software codesign as follows:

Hardware/Software codesign is the partitioning and design of an application in terms of fixed and flexible components.

The Quest for Energy Efficiency
 Choosing between implementing a design in hardware and implementing it in software is very difficult.
 Indeed, from a designer's point of view, the easiest approach is to write software, for example in C.
 Software is easy and flexible, software compilers are fast, there are large amounts of source code available, and all you need to start development is a personal computer.
 Furthermore, why go through the effort of designing a hardware architecture when there is already one available (namely, the RISC processor)?

Relative Performance

 Proponents of hardware implementation will argue that performance is a big plus of hardware over software.
 Specialized hardware architectures have a larger amount of parallelism than software architectures.
 We can measure this as follows: relative performance is the amount of useful work done per clock cycle.
 Under this metric, highly parallel implementations are at an advantage, because they do many things at the same time.

 Figure illustrates various cryptographic implementations in software
and hardware that have been proposed over the past few years.
 These are all designs proposed for embedded applications, where the
trade-off between hardware and software is crucial.
 As demonstrated by the graph, hardware crypto architectures have, on average, a higher relative performance than embedded processors.

 However, relative performance may not be a sufficient argument to
motivate the use of a dedicated hardware implementation.
 Consider for example a specialized Application-Specific Integrated
Circuit (ASIC) versus a high-end (workstation) processor.
 The hardware inside of the ASIC can execute many operations in
parallel, but the processor runs at a much higher clock frequency.
 Furthermore, modern processors are very effective in completing
multiple operations per clock cycle.
 As a result, an optimized software program on top of a high-end
processor may outperform a quick-and-dirty hardware design job on
an ASIC.
 Thus, the absolute performance of software may very well be higher
than the absolute performance of hardware.
 In contrast to relative performance, the absolute performance needs
to take clock frequency into account.

Energy Efficiency

 There is another metric which is independent from clock frequency, and which can
be applied to all architectures.

 That metric is energy-efficiency: the amount of useful work done per unit of energy.
[Figure: energy efficiency versus flexibility for several implementations of the same application.]
 Take an example of a particular encryption application (AES) for
different target platforms.
 The flexibility of these platforms varies from very high on the left to
very low on the right.
 The platforms include: Java on top of a Java Virtual machine on top of
an embedded processor;
 C on top of an embedded processor;
 optimized assembly-code on top of a Pentium-III processor;
 Verilog code on top of a Virtex-II FPGA; and
 an ASIC implementation using 0.18 micron CMOS standard cells.
 The Y-axis shows the number of gigabits that can be encrypted on each of these platforms using a single Joule of energy.
 This shows that battery-operated devices would greatly benefit from using less flexible, dedicated hardware engines.
The Driving Factors in Hardware/Software Codesign

 Energy efficiency and relative performance are important factors that favor a (fixed, parallel) hardware implementation over a (flexible, sequential) software implementation.
 In the design of modern electronic systems, many tradeoffs have to be
made, often between conflicting objectives.
 Some factors argue for more software while other factors argue for more
hardware.


 There is a large overhead associated with executing software instructions on a microprocessor implementation:

 instruction and operand fetches from memory;

 a complex state machine for control of the datapath, etc.

 Also, specialized hardware architectures are usually more efficient than software from a relative-performance perspective, i.e., the amount of useful work done per clock cycle.

 Flexibility comes with a significant energy cost, one which energy-optimized applications cannot tolerate.
 Therefore, you will never find a Pentium processor in a cell phone!


Arguments in favor of increasing the amount of hardware (HW):

 Performance: The classic argument in favor of dedicated hardware design has been increased performance: more work done per clock cycle.
 Increased performance is obtained by reducing the flexibility of an application.
 Energy Efficiency: Almost every electronic consumer product today carries a battery (iPod, PDA, mobile phone, Bluetooth device, ...).
 This makes these products energy-constrained.
 At the same time, such consumer appliances are used for similar
applications as traditional high-performance personal computers.
 In order to become sufficiently energy-efficient, consumer devices are
implemented using a combination of embedded software and dedicated
hardware components.
 Thus, a well-known use of hardware-software codesign is to trade flexibility for energy efficiency by moving (part of) the flexible software of a design into fixed hardware.

Power Densities:
 Further increasing the clock speed of modern high-end processors as a performance enhancer has run out of gas because of thermal limits.
 This has driven a broad and fundamental shift toward increased parallelism within processor architectures.
 However, at this moment, there is no dominant parallel computer architecture that has been shown to cover all applications. Commercially available systems include:
 symmetric multiprocessors with shared memory;
 traditional processors tightly coupled with FPGAs as accelerator engines;
 multi-core and many-core architectures such as GPUs.

 Nor is there yet any universally adopted parallel programming language; i.e., code must be crafted differently depending on the target parallel platform.
 This forces programmers to be architecturally aware of the target platform.

Design Complexity:
 Today, it is common to integrate multiple microprocessors together with all related peripherals and hardware components on a single chip.
 This approach has been dubbed system-on-chip (SoC). Modern SoCs are extremely complex.
 The conception of such a component is impossible without a detailed planning and design phase.
 Extensive simulations are required to test the design up front, before committing to a costly implementation phase.
 Since software bugs are easier to address than hardware bugs, there is a tendency to increase the amount of software.
Design Cost:
 New chips are very expensive to design. As a result, hardware designers make chips programmable so that they can be reused over multiple products or product generations.
 The SoC is a good example of this trend.
 However, 'programmability' can be found in many forms other than embedded processors: reconfigurable systems are based on the same idea of reuse-through-reprogramming.


Shrinking Design Schedules:


 Each new generation of technology tends to replace the older one more
quickly.
 In addition, each of these new technologies is exponentially more complex
than the previous generation.
 For a design engineer, this means that each new product generation brings
more work that needs to be completed in a shorter period of time.
 Shrinking design schedules require engineering teams to work on multiple
tasks at once: hardware and software are developed concurrently.
 A software development team will start software development as soon as
the characteristics of the hardware platform are established, even before an
actual hardware prototype is available.


Deep-Submicron Effects:
 Designing new hardware from-scratch in high-end silicon processes is
difficult due to second-order effects in the implementation.
 For example, each new generation of silicon technology has an increased
variability and a decreased reliability.
 Programmable, flexible technologies make the hardware design process
simpler, more straightforward, and easier to control.
 In addition, programmable technologies can be created to take the effects
of variations into account.

 Finding the correct balance while weighing all these factors is a complex problem. Instead, we will focus on optimizing metrics related to design cost and performance.
 In particular, we will consider how adding hardware to a software implementation increases performance, while weighing the increase in design cost.
The Hardware–Software Codesign Space
 The preceding discussion makes it apparent that there is a multitude of alternatives available for mapping an application to an architecture.
 For a given application, there are many different possible solutions.
 The collection of all these implementations is called the hardware–software
codesign space.
 The following figure gives a symbolic representation of this design space and
indicates the main design activities in this design space.


 On top is the application, and a designer will map this application onto a platform.
 A platform is a collection of programmable components.
 Mapping an application onto a platform means writing software for
that platform, and if needed, customizing the hardware of the
platform.
 The format of the software varies according to the components of the
platform.
 For example, a RISC processor may be programmed in C, while an FPGA could be programmed starting from an HDL program.

[Figure: example micrographs of target platforms: microprocessor, FPGA, SoC, DSP, and microcontroller.]
SoC Examples
[Figure: an example System-on-Chip with IP cores in a handheld device: Motorola DragonBall processor (#MC68VZ328), Motorola DSP with digital transceivers (#MC1376VF), Hynix 8 MB SDRAM (#HY57V641629), Fujitsu 4 MB flash (#MBM29D1323), Philips USB interface (#PDIUBD12), Xilinx CPLD as FPGA interface (#XCR3064), Maxim transceivers (#MAX3386), Maxim power-amplifier control (#MAX4472), RF Micro power amplifier (#RF2173), Analog Devices screen digitizer (#AD7873), a transreflective monochrome backlit display with drivers, a TCXO and VCO, an MMC-format memory card slot, a universal connector, and manual inputs.]
Codesign Examples
Video Codec (H.261)

[Figure: block diagram of an H.261 video codec: camera and display, grabber, VLD, MSQ, IDCT, MCC, DCT, filter, motion estimation (M.E.), VLC, framer/unframer, prediction, and an ISDN line codec; the blocks are partitioned between SW processors (uP + code) and HW processors.]

 A specification is a description of the desired application.


 A new application could be for example a novel way of encoding audio
in a more economical format than current encoding methods.
 Often, applications start out as informal ideas, uncoupled from the
actual implementation.
 Designers then write C programs to render their ideas in detail.
 The objective of the design process is to implement the application on
a target platform.
 In hardware–software codesign, we are interested in using programmable components: a RISC microprocessor, an FPGA, a DSP, an ASIP, and finally an ASIC.


Each of the above platforms presents a trade-off between flexibility and efficiency.

The wedge shape of the diagram expresses this idea: moving toward the top of the wedge, flexibility increases; moving toward the bottom, energy efficiency increases.

 Flexibility refers to the versatility of the platform for implementing different application requirements, and to how easy it is to update the design and fix bugs.

 Efficiency refers to performance (i.e. time-efficiency) or to energy efficiency

Codesign involves the following three activities:
• Platform selection
• Application mapping
• Platform programming

We start with a specification:


 For example, a new application can be a novel way of encoding audio in a
more economical format than current encoding methods

 Designers can optionally write C programs to implement a prototype

 Very often, a specification is just a piece of English text that leaves many details of the application undefined.

Step 1: Select a target platform

This involves choosing one or more programmable components as discussed previously, e.g., a RISC microprocessor, an FPGA, etc.

Step 2: Application mapping

The process of mapping an application onto a target platform involves writing C code and/or VHDL/Verilog.

Examples include:
• RISC: Software is written in C, while the hardware is a processor.
• FPGA: Software is written in a hardware description language (HDL).
An FPGA can also be configured to implement a soft processor, in which case software also needs to be written in C.
• DSP: A digital signal processor is programmed using a combination of C and assembly, which runs on a specialized processor architecture.
• ASIP: Programming an ASIP is a combination of C and an HDL description.
• ASIC: The application is written in an HDL, which is then synthesized to a hardwired netlist and implementation.
Note: ASICs are typically non-programmable, i.e., the application and the platform are one and the same.


Step 3: Platform programming is the task of mapping SW onto HW

This can be done automatically, e.g., using a C compiler or an HDL synthesis tool.

 However, many platforms are not just composed of simple components, but rather require multiple pieces of software, possibly in different programming languages.

 For example, the platform may consist of a RISC processor and a specialized hardware coprocessor.
 Here, the software consists of C (for the RISC) as well as dedicated coprocessor instruction sequences (for the coprocessor).

 Therefore, the reality of platform programming is more complicated, and automated tools and compilers are NOT always available.

 Another concept reflected in the wedge figure is the domain-specific platform.

 General-purpose platforms, such as the RISC processor and the FPGA, are able to support a broad range of applications.

 Application-specific platforms, such as the ASIC, are optimized to execute a single application.

 In the middle is a class called domain-specific platforms, which are optimized to execute a range of applications in a particular application domain.
 Signal processing, cryptography, and networking are examples of domains.

 Domains can have sub-domains, e.g., voice-signal processing vs. video-signal processing.
 Optimized platforms can be designed for each of these cases.

 DSPs and ASIPs are two examples of domain-specific platforms.


Difficult questions:
• How does one select a platform for a given specification? (the harder problem of the two)
• How does one map an application onto a selected platform?

The first question is harder: seasoned designers choose based on their previous experience with similar applications.

The second issue is also challenging, but can be addressed in a more systematic fashion using a design methodology.
A design method is a systematic sequence of steps to convert a specification into an implementation.

Design methods cover many aspects of application mapping:
• Optimization of memory usage
• Design performance
• Resource usage
• Precision and resolution of data types, etc.
A design method is a canned sequence of design steps. You can learn it in the context of one design, and then apply this design knowledge in the context of another design.
The Dualism of Hardware Design and Software Design

 A key challenge in hardware–software codesign is that a designer needs to combine two radically different design paradigms.
 In fact, hardware and software are each other's dual in many respects.

 Design Paradigm: Parallel vs. sequential operation

 Hardware supports parallel execution of operations, while software supports sequential execution of operations.

 The natural parallelism available in hardware enables more work to be accomplished by adding more elements.

 In contrast, adding more operations in software increases its execution time.

 Design requires the decomposition of a specification into low-level primitives, such as gates (HW) and instructions (SW).

 Hardware designers develop solutions using spatial decomposition, while software designers use temporal decomposition.
 Resource Cost: Temporal vs. spatial decomposition
 The dualism in decomposition methods leads to a similar dualism in resource cost.
 Decomposition in space, as used by a hardware designer, means that more gates are required when a more complex design needs to be implemented.
 Decomposition in time, as used by a software designer, implies that a more complex design will take more instructions to complete.
 Therefore, the resource cost for hardware is circuit area, while the resource cost for software is execution time.
 Design Constraints:
 A hardware designer is constrained by the clock cycle period of a design.
 A software designer, on the other hand, is limited by the capabilities of the
processor instruction set and the memory space available with the processor.
 Thus, the design constraints for hardware are in terms of a time budget, while the
design constraints for software are fixed by the CPU.
 So, a hardware designer invests circuit area to maintain control over execution
time, and a software designer invests execution time for an almost constant circuit
area.

Flexibility:
 Software excels over hardware in the support of application flexibility.
 Flexibility is the ease by which the application can be modified or
adapted after the target architecture for that application is
manufactured.
 In software, flexibility is essentially free.
 In hardware on the other hand, flexibility is not trivial.
 Hardware flexibility requires that circuit elements can be easily reused
for different activities or functions in a design.

Parallelism:
 A dual of flexibility can be found in the ease with which parallel
implementations can be created.
 Parallelism is the most obvious approach to improving performance.
 For hardware, parallelism comes for free as part of the design
paradigm.
 For software, on the other hand, parallelism is a major challenge.
 If only a single processor is available, software can only implement
concurrency, which requires the use of special programming
constructs such as threads.
 When multiple processors are available, a truly parallel software
implementation can be made, but inter-processor communication and
synchronization become a challenge.

Modelling:
 In software, modeling and implementation are very close.
 Indeed, when a designer writes a C program, the compilation of that
program for the appropriate target processor will also result in the
implementation of the program.
 In hardware, the model and the implementation of a design are
distinct concepts.
 Initially, a hardware design is modeled using a HDL.
 Such a hardware description can be simulated, but it is not an
implementation of the actual circuit.
 Hardware designers use a hardware description language, and their programs are models that are later transformed into an implementation.
 Software designers use a software programming language, and their programs are an implementation in themselves.
Reuse:
 Finally, hardware and software are also quite different when it comes to
Intellectual Property Reuse or IP-reuse.
 The idea of IP-reuse is that a component of a larger circuit or a program can
be packaged, and later reused in the context of a different design.
 In software, IP-reuse has undergone dramatic changes in recent years due to open-source software and the proliferation of open platforms.
 When designing a complex program these days, designers will start from a set of standard libraries that are well documented and available on a wide range of platforms.
 For hardware design, IP-reuse is still in its infancy.
 Hardware designers are only starting to define standard exchange mechanisms.
 IP-reuse in hardware has a long way to go compared to the state of reuse in software.
Abstraction Levels

 Abstraction refers to the level of detail that is available in a model


 Lower levels have more detail, but are often much more complex and
difficult to manage
 Abstraction is heavily used to design hardware systems, and the
representations at different levels are very different
 The concept of abstraction is well exemplified by the time granularity used in simulations.
 There are five abstraction levels commonly used by computer
engineers for the design of electronic hardware–software systems.
I. Continuous time (lowest level):
II. Discrete-event
III. Cycle-accurate
IV. Instruction-accurate
V. Transaction-accurate
Continuous Time

 At the lowest abstraction level, we describe operations as continuous actions.
 For example, electric networks can be described as systems of interacting
differential equations.
 Solving these equations leads to an estimate for voltages and currents in
these electric networks.
 This is a very detailed level, useful to analyze analog effects.
 However, this level of abstraction is not used to describe typical hardware–
software systems.

Discrete-event
 Here, simulators abstract node behavior into discrete, possibly
irregularly spaced, time steps called events
 Events represent the changes that occur to circuit nodes when the
inputs are changed in the test bench
 The simulator is capable of modeling actual propagation delay of the
gates, similar to what would happen in a hardware instance of the
circuit
 Discrete-event simulation is very popular for modeling hardware at
the lowest layer of abstraction in codesign
 This level of abstraction is much less compute-intensive than
continuous time but accurate enough to capture details of circuit
behavior including glitches

Cycle-accurate
 Single-clock synchronous hardware circuits have the important property that
all interesting things happen at regularly spaced intervals, namely at the clock
edge.
 This abstraction is important enough to merit its own abstraction level, and it
is called cycle-accurate modeling.
 A cycle-accurate model does not capture propagation delays or glitches.
 All activities that fall ‘in between’ clock edges are concentrated at the clock
edge itself.
 This level of abstraction is considered the golden reference in HW/SW
codesign

Instruction-accurate
 RTL models are great but may be too slow for complex systems.
 For example, your laptop has a processor that probably clocks over 1 GHz (one billion cycles per second).
 Assuming that you could write a C function that expresses a single clock cycle of processing,
you would have to call that function one billion times to simulate just a single second of
processing.
 Clearly, further abstraction can be useful to build leaner and faster models.
 Instruction-accurate modeling expresses activities in steps of one microprocessor instruction
(not cycle count)
 Each instruction lumps together several cycles of processing.
 Instruction-accurate simulators are used to verify complex software systems, such as
complete operating systems.
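As an illustration, here is a toy instruction-accurate simulator loop in C; the three-opcode instruction set and its encoding are invented for this sketch. One loop iteration models one instruction, not one clock cycle:

#include <stdio.h>

/* Toy instruction-accurate simulator: one loop iteration per instruction.
   The opcode set and encoding are hypothetical. */
enum { OP_ADD, OP_SUB, OP_HALT };

int main(void) {
    int program[][2] = {                   /* {opcode, operand} toy encoding */
        {OP_ADD, 5}, {OP_ADD, 7}, {OP_SUB, 2}, {OP_HALT, 0}
    };
    int pc = 0, reg = 0, running = 1;
    long icount = 0;
    while (running) {                      /* fetch-and-execute loop */
        int *insn = program[pc++];
        switch (insn[0]) {
        case OP_ADD:  reg += insn[1]; break;
        case OP_SUB:  reg -= insn[1]; break;
        case OP_HALT: running = 0;    break;
        }
        icount++;                          /* counts instructions, not cycles */
    }
    printf("executed %ld instructions, reg = %d\n", icount, reg);
    return 0;
}

A real instruction-set simulator works the same way in principle; lumping the several cycles of each instruction into one step is exactly what makes it faster than a cycle-accurate model.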

Transaction-accurate
 For very complex systems, even instruction-accurate models may be too slow
or require too much modeling effort
 In transaction-accurate modeling, only the interactions (transactions) that
occur between components of a system are of interest
 For example, suppose you want to model a system in which a user process is
performing hard disk operations, e.g., writing a file
 The simulator simulates commands exchanged between the disk drive and
the user application
 The sequence of instruction-level operations between two transactions can
number in the millions but the simulator instead simulates a single function
call
 Transaction-accurate models are important in the exploratory phases of a
design, before effort is spent on developing detailed models
For this course, we are interested in instruction-accurate and cycle-accurate levels

Concurrency and Parallelism

 Concurrency and parallelism are terms that often occur in the context of hardware-
software codesign.
 They mean very different things.
 Concurrency is the ability to execute simultaneous operations because these
operations are completely independent.
 Parallelism is the ability to execute simultaneous operations because the operations
can run on different processors or circuit elements.
 Thus, concurrency relates to an application model, while parallelism relates to the
implementation of that model.
 Hardware is always parallel.
 Software on the other hand can be sequential, concurrent, or parallel.
 Sequential and concurrent software requires a single processor
 Parallel software requires multiple processors.
 Software running on your laptop, e.g., WORD, email, etc. is concurrent
 Software running on a 65536-processor IBM Blue Gene/L is parallel
Concurrency and Parallelism Cont..
 A key objective of HW/SW codesign is to allow designers to leverage the
benefits of true parallelism in cases where concurrency exists in the
application
 There is a well-known computer architecture principle called Amdahl's law.
 The maximum speedup of any application that contains q% sequential
code is: 1 / (q/100).
 For example, if your application spends 33% of its time running
sequentially, the maximum speedup is 3
 This means that no matter how fast you make the parallel component
run, the maximum speedup you will ever be able to achieve is 3
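Written out, with the sequential fraction s = q/100 and N processors available for the parallel part, Amdahl's law reads:

\[
S(N) = \frac{1}{\,s + \frac{1-s}{N}\,}, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{s} = \frac{100}{q}
\]

For q = 33, the limit is 100/33, approximately 3, no matter how many processors are added.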
 Thus, you see that we don’t only need to have parallel platforms, we
also need a way to write parallel programs to run on those platforms.
 Surprisingly, even algorithms that seem sequential at first can be executed (and
specified) in a parallel fashion.

Concurrency and Parallelism Cont..

 Consider an application that performs addition, and assume it is implemented on a Connection Machine (from the '80s).
 The Connection Machine (CM) is a massively parallel processor, with a network of processors, each with its own local memory.
 The original machine contained 65,536 processors, each with 4 Kbits of local memory.
 How hard is it to write programs for this machine?
 It’s possible to write individual C programs for each node, but this is really not practical
with 64K nodes!

Concurrency and Parallelism Cont..
 The designers of the CM, Hillis and Steele, showed that it is possible to express algorithms in a concurrent fashion so that they map neatly onto a CM.
 Consider the problem of summing an array of numbers.
 The array can be distributed across the CM by assigning one number to each processor.
 We can then take the sum over the entire array in log(n) steps (n being the number of processors), as sketched below.
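The following C sketch simulates the logarithmic sum sequentially; the array values and the 8-processor size are illustrative. In step d, processor i adds in the value held by processor i + d, and all additions within one step could run in parallel on the CM:

#include <stdio.h>

#define N 8   /* number of CM processors, one array element each (assumed) */

int main(void) {
    int v[N] = {3, 1, 4, 1, 5, 9, 2, 6};
    for (int d = 1; d < N; d *= 2)         /* log2(N) = 3 steps */
        for (int i = 0; i + d < N; i += 2 * d)
            v[i] += v[i + d];              /* these additions are independent */
    printf("sum = %d\n", v[0]);            /* prints 31 */
    return 0;
}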

Concurrency and Parallelism Cont..
 Even though the parallel sum speeds up the computation significantly, there remains a lot of wasted compute power.

 The compute power of a small 8-node CM over 3 time steps is 3 * 8 = 24 computation time-steps, of which only 7 are actually used.

 On the other hand, if the application requires all partial sums, i.e., the sum of the
first two, three, four, etc. numbers, then the full power of the parallel machine is
used

Data Flow Modeling and Implementation
Introduction
• We will learn how to create data flow models, and how to implement those models in
hardware and software.
• Unlike C programs, data flow models are concurrent: they can express activities that
happen simultaneously.
• This property makes data flow well suited for a parallel hardware implementation as well
as a sequential software implementation.
• By nature, hardware is parallel and software is sequential.
• As a result, software models (C programs) are not very well suited to capture hardware
implementations, and vice versa, hardware models (RTL programs) are not a good
abstraction to describe software.
• However, designers frequently encounter situations in which they cannot predict if the
best solution for a design problem is a hardware implementation or a software
implementation.
• Trying to do both is not an option; it requires the designers to work twice as hard.
Introduction
• Signal processing domain experts are used to describe complex
systems, such as digital radios and radar processing units, using block
diagrams.
• A block diagram is a high‐level representation of the target system as
a collection of smaller functions.
• A block diagram does not specify if a component should be hardware
or software;
• We are specifically interested in digital signal processing systems.
• Such systems represent signals as streams of discrete samples rather
than continuous signal shapes.
Introduction
• Figure shows the block diagram for a simple digital signal processing system.
• It’s a pulse‐amplitude modulation (PAM) system, and it is used to transmit digital
information over bandwidth‐limited channels.
• A PAM signal is created from binary data in two steps.
• First, each word in the file needs to be mapped to PAM symbols, which are just pulses of
different heights.
Introduction
• An entire file of words will thus be converted to a stream of symbols or pulses.
• Next, the stream of pulses needs to be converted to a smooth shape using pulse-shaping.
• Pulse-shaping ensures that the bandwidth of the resulting PAM signal does not exceed the PAM symbol rate.
• The output of the pulse‐shaping unit produces many samples for each input symbol pulse, but it
is still a stream of discrete samples.
• A convolution‐based function is used to ensure the curve goes through the symbols while
providing a memory effect, which allows symbols to influence other symbols
• The final module in the block diagram is the digital‐to‐analog module, which will convert the
stream of discrete samples into a continuous signal.
Introduction
Here’s a high-level simulation model for PAM-4
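The listing itself is not reproduced in these notes. The following is a hypothetical C sketch of such a sequential simulation; the input word, the level mapping, and the zero-order-hold stand-in for the real convolution-based pulse-shaping filter are all assumptions:

#include <stdio.h>

#define SAMPLES_PER_SYMBOL 128

/* PAM-4 pulse height for each 2-bit symbol (an assumed mapping) */
static const int pam4_level[4] = {-3, -1, 1, 3};

int main(void) {
    unsigned word = 0xB1E2C3D4;            /* one 32-bit word, standing in for a file */
    for (int s = 0; s < 16; s++) {         /* map: 32-bit word -> 16 2-bit symbols */
        int symbol = (word >> (2 * s)) & 0x3;
        for (int k = 0; k < SAMPLES_PER_SYMBOL; k++) {
            /* pulse-shape: hold the pulse height for 128 samples, as a
               stand-in for the real shaping filter */
            double sample = pam4_level[symbol];
            printf("%f\n", sample);        /* D/A: emit one discrete sample */
        }
    }
    return 0;
}

Note how the C code fixes one particular execution order: all 128 samples of a symbol are produced before the next symbol is mapped.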
Introduction
• Although this is a good model for simulation, it is not a good model for implementation: C implicitly assumes sequential execution.
• If we observe the block diagram carefully, we can see that it does not require a sequential execution of the symbol-mapping function and the pulse-shaping function.
• The block diagram only specifies the flow of data in the system but not the execution order of the
functions.

• The lines between blocks represent data dependencies, and therefore, they force an ordering to
the sequence of operations
• However, unlike C, each block can execute simultaneously with other blocks.
• Another example of the difference between C and block diagrams is shown by the ‘fanout’ in the
following:
• Here, Block2 and Block3 are clearly parallel but C would execute them sequentially
Data Flow Models
• Note that, in general, it is easier to create a sequential implementation from a parallel
model than it is to create a parallel implementation from a sequential model.
• This argues in favor of Data Flow diagrams for modelling.
• The following is a Data Flow model of PAM-4:

• The bubbles, called actors, represent the functions in the block diagram
• Actors are linked together using directional lines, called queues
• The numbers on the lines represent the relative rates of communications between
modules, e.g., Map converts a 32‐bit word into 16 2‐bit symbols.
• Note that each actor works independently, i.e., it checks its input queue for the proper
number of elements and executes immediately when satisfied
Data Flow Models vs. C Programs
• Data Flow is a concurrent model (this is a major driver of its popularity), which means Data Flow models can easily be mapped to hardware or software implementations.
• Data Flow models are distributed: there is no centralized controller, and each actor operates autonomously.
• Data Flow models are modular, allowing libraries of components to be constructed and utilized in a plug-and-play fashion.
• Data Flow models can be analyzed,
• e.g., for deadlock conditions that can result in system lock‐up
• Deterministic, mathematical methods can be used to analyze Data Flow models, which is
generally not possible using C
• Data Flow models have been around since the early 1960s
• The 70’s and 80’s were active periods of research and development of Data Flow programming
languages and even Data Flow architectures
• NI’s Labview is a classic example of a Data Flow programming language
Tokens, Actors and Queues
• Here, we define the elements that make up a Data Flow model, and discuss a special class of Data Flow models called Synchronous Data Flow (SDF) graphs. SDFs allow for the application of formal analysis techniques.
• A simple example:

A Data Flow model is made up of three elements:
• Actors: Contain the actual operations
• Actors have a precise beginning and end, i.e., they have bounded behavior, and they iterate that
behavior continuously
• Each iteration is called a firing, e.g., an addition is performed on each firing
Tokens, Actors and Queues
• Tokens: Carry information from one actor to another.
• A token has a value, such as '1' and '4' as shown below.
• Queues: Unidirectional communication links that transport tokens between actors
• We assume Data Flow queues have an infinite amount of storage
• Data Flow queues are first‐in, first‐out (FIFO)
• In above example, token ’1’ is entered after token ’4’ so token ’4’ is processed first
• When a Data Flow model executes, actors read tokens from their input queues, apply an
operation and then write values to the output queue
• The execution of a Data Flow model is expressed as a sequence of concurrent actor firings
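A minimal C sketch of these three elements, assuming bounded queues and a simple add actor (the names and sizes are illustrative, not part of any standard Data Flow runtime):

#include <stdio.h>

#define QSIZE 64

/* a FIFO queue transporting integer tokens */
typedef struct { int data[QSIZE]; int head, tail; } queue_t;

static void put(queue_t *q, int v) { q->data[q->tail++ % QSIZE] = v; }
static int  get(queue_t *q)        { return q->data[q->head++ % QSIZE]; }
static int  count(queue_t *q)      { return q->tail - q->head; }

/* firing rule: the add actor fires only when each input holds a token */
static void fire_add(queue_t *in1, queue_t *in2, queue_t *out) {
    if (count(in1) >= 1 && count(in2) >= 1)
        put(out, get(in1) + get(in2));
}

int main(void) {
    queue_t a = {{0}}, b = {{0}}, c = {{0}};
    put(&a, 4); put(&a, 1);                /* token '4' entered before token '1' */
    put(&b, 2); put(&b, 3);
    fire_add(&a, &b, &c);                  /* consumes 4 and 2, produces 6 */
    fire_add(&a, &b, &c);                  /* consumes 1 and 3, produces 4 */
    while (count(&c)) printf("%d\n", get(&c));
    return 0;
}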
Tokens, Actors and Queues
• Data Flow models are untimed
• The firing of an actor happens instantaneously and therefore time is irrelevant
• Firings actually take non‐zero time in an actual implementation
• The execution of Data Flow models is guided only by the presence of data, i.e., an actor
can not fire until data becomes available on its inputs
• A Data Flow graph with tokens distributed across its queues is called a marking of a Data Flow model.
• A Data Flow graph goes through a series of markings when it is executed. Each marking corresponds to a different state of the system.
• The distribution of tokens in the queues (marking) are the ONLY observable state in the
system (no state is maintained inside the actors)
Firing Rates, Firing Rules and Schedules

• A firing rule defines the conditions that enable an actor to fire


• In the above example, the firing rule checks that the actor's input queues each contain at least one token.
• Therefore, actors must be able to check the number of tokens in each of their queues.
Firing Rates, Firing Rules and Schedules
• The required number of tokens consumed and produced can be annotated on the actor's inputs and outputs, respectively.

• Therefore, this information, combined with a marking, makes it easy to decide whether an actor can fire.

• Data Flow actors can also consume more than one token per firing. This is referred to as a multi-rate Data Flow graph.
Synchronous Data Flow Graphs
• Synchronous Data Flow (SDF) graphs refer to systems where the number of tokens consumed and
produced per actor firing is fixed and constant
• The term synchronous refers to the fixed consumption and production rate of tokens.
• Note that SDF will not be able to handle control-flow constructs, such as if-then-else statements in C, without adding special operators.
• Despite this significant limitation, SDFs are very powerful (and popular), and more importantly,
mathematical techniques can be used to verify certain properties
• The first of these properties is determinism.
• An entire SDF graph is deterministic under the condition that all of its actors implement deterministic functions.
• Determinism guarantees that the same results will always be produced, independent of the firing order.
Synchronous Data Flow Graphs
• Illustration of determinism:
Synchronous Data Flow Graphs
• As we start firing actors, tokens are transported through the graph.
• After the first firing, an interesting situation occurs: both the add actor as well as the
plus1 actor can fire.
• Going down at the left side, we assume that the plus1 actor fires first.
• Going down at the right side, we assume that the add actor fires first.
• However, regardless of this choice, the graph eventually converges to the situation
shown at the bottom.
• Why is this property so important?
• Assume for a moment that the add actor and the plus1 actor execute on two different processors, a slow one and a fast one.
• This is the power of the determinism property of SDF: it doesn't matter which processor runs which actor; the results will always be the same.
• In other words, no matter what technology we use to implement the actors, the system will work as specified, as long as we implement the firing rules correctly.
Analyzing Synchronous Data Flow Graphs
• The second important property of SDF relates to an admissible schedule.
• An admissible SDF is one that can run forever without deadlock (unbounded execution) and without overflowing any of the communication queues (bounded buffers).
• Deadlock occurs when an SDF graph progresses to a marking that prevents any further firings.
• Overflow occurs when tokens are produced faster than they are consumed.
• There is also a systematic method to determine whether an SDF graph is admissible.
• The method provides a closed-form solution, i.e., no simulation is required.
Analyzing Synchronous Data Flow Graphs
Lee proposed a method called Periodic Admissible Schedules (PASS), defined as:
• A schedule is the order in which the actors must fire
• An admissible schedule is a firing order that is deadlock‐free with bounded
buffers
• A periodic admissible schedule is a schedule that supports unbounded execution,
i.e., is periodic in the sense that the same markings will recur
• We also consider a special case called Periodic Admissible Sequential Schedules
(PASSs) that supports a microprocessor implementation with one actor firing at a
time
• There are four steps to creating a PASS for an SDF graph:
I. Create the topology matrix G of the SDF graph
II. Verify the rank of the matrix to be one less than the number of nodes in the graph
III. Determine a firing vector
IV. Try firing each actor in a round robin fashion, until the firing count given by the firing
vector is reached
Analyzing Synchronous Data Flow Graphs
Consider the following example:
• Step 1: Create a topology matrix for this graph:
• The topology matrix has as many rows as there are edges (FIFO queues) and as many columns
as there are nodes
• The entry (i, j) will be positive if node j produces tokens onto edge i, and negative if it consumes tokens from it.
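In general (this is the standard formulation due to Lee and Messerschmitt), for an edge i running from a producing actor j to a consuming actor k, with p tokens produced and c tokens consumed per firing:

\[
G_{ij} = +p, \qquad G_{ik} = -c, \qquad \text{all other entries } 0
\]

A firing vector q, holding one firing count per actor, leaves every queue unchanged exactly when \( G\,q = \mathbf{0} \).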
Analyzing Synchronous Data Flow Graphs
Step 2: The condition for a PASS to exist is that the rank of G has to be one less than the
number of nodes in the graph
• The rank of the matrix is the number of independent equations in G
• For our graph, the rank is 2: verify by multiplying the first column by -2 and the second column by -1, and adding them to produce the third column.

• Given that there are three nodes in the graph and the rank of the matrix is 2, a PASS is
possible
• This step effectively verifies that tokens can NOT accumulate on any edge of the graph
• The actual number of tokens can be determined by choosing a firing vector and
carrying out a matrix multiplication
Analyzing Synchronous Data Flow Graphs
• For example, the tokens produced/consumed by firing A twice and B and C zero times
is given by:

• This vector produces 4 tokens on edge(A,B) and 2 tokens on edge(A,C)


• Step 3: Determine a periodic firing vector
• The firing vector given above is not a good choice to obtain a PASS because it leaves
tokens in the system
• We are instead interested in a firing vector that leaves no tokens:

• Note that since the rank is less than the number of nodes, there are an infinite number
of solutions to the matrix equation
Analyzing Synchronous Data Flow Graphs
• Step 3: Determine a periodic firing vector (cont.)
• This is true because, intuitively, if the firing vector (a, b, c) is a PASS, then so should be the firing vectors (2a, 2b, 2c), (3a, 3b, 3c), etc.
• Our task is to find the simplest one; for this example, it is:

• Note that the existence of a PASS firing vector does not guarantee that a PASS will also exist.

• Here, we reversed the (A,C) edge. We would find the same qPASS, but the resulting graph is deadlocked: all nodes are waiting for each other.
Analyzing Synchronous Data Flow Graphs
Step 4: Construct a valid PASS.
• Here, we fire each node up to the number of times specified in qPASS
• Each node that is able to fire, i.e., has an adequate number of tokens, will fire
• If we find that we can fire NO more nodes, and the firing count is less than the number in
qPASS, the resulting graph is deadlocked
• Trying this out on our graph, we fire A once, and then B and C
Analyzing Synchronous Data Flow Graphs
Step 4: Construct a valid PASS.
• Try this out on the deadlocked graph ‐‐ it aborts immediately on the first iteration
because no node is able to fire successfully
• Note that the determinate property allows any ordering to be tried freely, e.g., B, C
and then A
• In some graphs (not ours), this may lead to additional PASS solutions
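The four-step recipe is easy to mechanize. The following is a minimal C sketch of step 4, assuming a hypothetical three-actor chain A → B → C (A produces 1 token per firing, B consumes 2 and produces 1, C consumes 1, so qPASS = (2, 1, 1)); it fires actors round-robin and reports either a valid PASS firing sequence or deadlock.

#include <stdio.h>

#define EDGES 2
#define NODES 3

/* Hypothetical topology matrix: G[e][n] = tokens that node n
   produces (+) or consumes (-) on edge e per firing. */
int G[EDGES][NODES] = { { 1, -2,  0 },     /* edge (A,B) */
                        { 0,  1, -1 } };   /* edge (B,C) */
int qpass[NODES] = { 2, 1, 1 };            /* PASS firing vector */

int can_fire(int tokens[], int n)
{
    for (int e = 0; e < EDGES; e++)        /* enough tokens on every input? */
        if (G[e][n] < 0 && tokens[e] < -G[e][n]) return 0;
    return 1;
}

int main(void)
{
    int tokens[EDGES] = { 0, 0 };          /* initial marking: empty queues */
    int fired[NODES]  = { 0, 0, 0 };
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int n = 0; n < NODES; n++) {  /* round-robin over the actors */
            if (fired[n] < qpass[n] && can_fire(tokens, n)) {
                for (int e = 0; e < EDGES; e++) tokens[e] += G[e][n];
                fired[n]++; progress = 1;
                printf("fire %c\n", 'A' + n);
            }
        }
    }
    for (int n = 0; n < NODES; n++)
        if (fired[n] < qpass[n]) { printf("deadlock\n"); return 1; }
    printf("valid PASS found\n");
    return 0;
}

Running this prints the firing sequence A, A, B, C, which matches the chosen qPASS.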
SDF Graphs: PAM‐4 Example
• Consider the digital pulse‐amplitude modulation system (PAM‐4) discussed earlier
• The SDF for this system consists of 4 actors, and is a multi‐rate Data Flow system:
• The first step is to construct the topology matrix G
• The queues correspond to the 3 rows and actors to the 4 columns
• The second step is to verify the rank is the number of actors minus 1
• It is easy to show that the 3 rows are independent, i.e., are not linear combinations of any other
rows
• This confirms that a PASS is possible
SDF Graphs: PAM‐4 Example
• The third step is to derive a feasible firing for the system
• The firing vector, qPASS, must yield a zero‐vector when multiplied by the topology matrix
• The fourth step is to derive a schedule ‐‐ there are two possibilities
• The first one is trivial: fire each actor in succession, from left to right. Note that the
queue (FIFO) sizes are then 16 and 2048
• Alternatively, we can fire FileSource and Map once and then repeat the following
sequence: fire PulseShape once and then fire DA 128 times
• The benefit here is the reduced queue sizes, i.e., the PulseShape input queue reduces
from 16 to 1 while the DA input queue reduces from 2048 to 128
• In general, deriving the optimal schedule is a difficult problem for complex systems
Euclid’s Algorithm as an SDF Graph
• The graph evaluates the greatest common divisor of
two numbers a and b.
• The sort actor reads two numbers, sorts them, and
copies them to the output.
• The diff actor subtracts the smallest number from the
largest one, as long as they are different.
• For example, assume (a0, b0) = (16, 12);
• then we see the following sequence of token values:
(a1, b1) = (4, 12), (a2, b2) = (8, 4), (a3, b3) = (4, 4),
after which (4, 4) recurs forever: the GCD is 4.
sort: out1 = (a > b) ? a : b;
      out2 = (a > b) ? b : a;
diff: out1 = (a != b) ? a - b : a;
      out2 = b;
Euclid’s Algorithm as an SDF Graph
Limitations of Data Flow Models
• SDF systems are distributed, data‐driven systems.
• They execute whenever there is data to process, and remain idle when there is nothing to do.
• However, SDF seems to have trouble to model control‐related aspects.
• Control appears in many different forms in system design, for example:
Stopping and Restarting. As we saw in the Euclid example, an SDF model never terminates; it just
keeps running.
• Stopping and restarting is a control‐flow property that cannot be addressed well with SDF graphs.
Mode‐Switching. When a cell‐phone switches from one standard to the other, the processing
(which may be modeled as an SDF graph) needs to be reconfigured.
• However, the topology of an SDF graph is fixed and cannot be modified at runtime.
Exceptions. When catastrophic events happen, processing may suddenly need to be altered.
• SDF cannot model exceptions that affect the entire graph topology.
• For example, once a token enters a queue, the only way of removing it is to read the token out of
the queue.
• It is not possible to suddenly ‘empty’ the queue on a global, exceptional condition.
Limitations of Data Flow Models
• Run‐Time Conditions. A simple if‐then‐else statement (choice between two activities
depending on an external condition) is troublesome for SDF.
• An SDF node cannot simply ‘disappear’ or become inactive – it is always there.
• Moreover, we cannot generate conditional tokens, as this would violate SDF rules which
require fixed production/consumption rates.
• Thus, SDF cannot model conditional execution such as required for if‐then‐else
statements.
• There are two solutions to the problem of control flow modeling in SDF.
• The first is to emulate control flow on top of plain SDF, at the cost of some modelling overhead.
• The second is to extend the SDF semantics themselves (Boolean Data Flow, below).
Limitations of Data Flow Models
• There are two solutions to the problem of control flow modeling in SDFs
• Solution 1: emulate control flow on top of the SDF semantics
• Consider the statement: if (c) then A else B
• The selector actor routes either A's result or B's result to the output
• Note that this is not an exact match to the if‐then‐else in C because BOTH the if branch (A) and
the else (B) must execute and produce tokens
• However, it is a good match to hardware, which uses a multiplexer to select among one of several
input results
Limitations of Data Flow Models
• Solution 2: extend the SDF semantics using Boolean
Data Flow (BDF)
• BDF ‘tunes’ the production and consumption rates of an
actor according to the value of an external control token
• The condition token is distributed to two BDF
conditional fork and merge nodes, Fc and Sc
• Here, the conditional fork will fire when there is an input
token AND a condition token
• A token is produced on EITHER the upper or lower edge,
dependent on the condition token
• This is indicated by a dynamic variable p, which signifies
a conditional production rate
• The conditional merge works similarly, i.e., it fires when
there is a condition token and will consume a token on
EITHER the upper or lower edge.
DFG Performance Modelling and Transformations
• We indicated earlier that Data Flow graphs (DFGs) are untimed, i.e., our analysis did not model the
amount of time needed to complete a computation
• In this section, we describe how to use DFGs for performance analysis
• Performance estimation will be accomplished by modeling only two components: actors and queues
• Once our new modeling constructs are introduced, we then turn our attention to transformations
designed to enhance performance
• The input sample rate determines the time interval between two adjacent input samples from a data stream
• For example, a digital sound system generates 44,100 samples per second
• The input sample rate defines a design constraint on the real-time performance of the
data flow system; similar constraints usually exist for the output sample rate
Definitions
• We use two common metrics as measures of performance:
• Throughput: the number of samples processed per second (Note that input and output throughput may be
different)
• Latency: The time required to process a single token from input to output
DFG Performance Modelling and Transformations
The Data Flow Resource Model:
• We used the symbols on the left earlier to model DFGs
• For performance modelling, we:
• Include a number within the actor symbol to model execution latency
• Replace FIFO queues with a communication channel, which includes delays
• Note that the number included in an actor represents the amount of time it takes (in clock cycles,
nanoseconds, etc.) after it fires; time spent waiting for input data is not counted
• Also note that the delay element (which replaces FIFO queues) can hold exactly one token
DFG Performance Modelling and Transformations
• Think of delay elements as buffers with 1 unit of delay
• We can use a performance annotated DFG to evaluate its execution time
• In (a), (b) and (c) above, actor A introduces 5 units of latency while B introduces 3
units
DFG Performance Modelling and Transformations
• The time stamp sequences on the left and right indicate when input
samples are read and when output samples are produced
• The time stamps for DFG (a) and (b) are different because of the position of
the delay element in the loop
• (a) requires the sum of execution times of A and B before producing a result
• (b) can produce a result at time stamp 3 because the delay element allows it to
execute immediately at system start time (we refer to this as transient behavior)
• In this case, the delay elements affect only the latency of the first sample
DFG Performance Modelling and Transformations
• In contrast, (c) shows that delay elements can be positioned to enable parallelism, and
affect both latency and throughput
• Both actors can execute in parallel in (c), resulting in better performance than (a) and (b)
• The throughput of (a) and (b) is 1 sample per 8 time units, while (c) is 1 sample per 5
time units
• Similar to a pipelined system, the throughput in (c) is ultimately limited to the speed of
the slowest actor (A in this case ‐‐ B is forced to wait)
DFG Performance Modelling and Transformations
Limits on Throughput
• As indicated, the distribution of the delay elements in the loops impacts performance
• As an aid in analyzing performance, let’s define
• Loop bound as the round‐trip delay of a loop, divided by the number of delays in the loop
• Iteration bound as the largest loop bound in any loop of a DFG
• Iteration bound defines an upper limit on the best throughput of a DFG
• The loop bounds in this example are given as LB(BC) = 7 and LB(ABC) = 4
• The iteration bound is 7 ‐‐ therefore, we need at least 7 time units per iteration
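In symbols, for a loop ℓ with actor execution times t_a and D(ℓ) delay elements:

   LB(ℓ) = ( Σ_{a ∈ ℓ} t_a ) / D(ℓ),    IB = max over all loops ℓ of LB(ℓ)

so no schedule can sustain more than one graph iteration per IB time units.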
DFG Performance Modelling and Transformations
Limits on Throughput
• From the graph, it is clear that loop BC is the bottleneck
• Note that actors A and C have delay elements on their inputs so they can operate
in parallel
• On the other hand, actor B needs to wait for the result from C before it can fire
• The missing delay element forces actors B and C to run sequentially
• Note that linear graphs have implicit feedback loops that must be considered
DFG Performance Modelling and Transformations
Limits on Throughput
• Also note that the iteration bound is an upper limit on throughput, and in
reality, the DFG may not be able to achieve this throughput
• The DFG above (from an earlier slide) has an iteration bound (5 + 3)/2 = 4
time units, but the throughput is limited to the slowest actor at 1 sample
per 5 time units
• A nice way to think about actors and delays is to consider an actor as a
combinational circuit and a delay as a buffer or pipeline stage.
Transformations
Performance‐Enhancing Transformations
• Based on previous discussions, intuitively, it should be possible to ‘tune’ the DFG to
enhance performance, while maintaining the same functionality
• Enhancing performance either reduces latency, increases throughput, or both. The
following transformations will be considered:
• Multi-rate Expansion: a transformation which converts a multi-rate synchronous DFG to
a single-rate synchronous DFG.
• Retiming: a transformation that redistributes the delay elements in the DFG. Retiming
changes the throughput but does not change the latency or the transient behavior of the
DFG.
• Pipelining: a transformation that introduces new delay elements in the DFG. Pipelining
changes both the throughput and the transient behavior of the DFG.
• Unfolding: a transformation designed to increase parallelism by duplicating actors.
Unfolding changes the throughput but not the transient behavior of the DFG.
Transformations
Multi‐rate Transformation
• The following DFG shows actor A producing three tokens per firing, while actor B consumes
two tokens per firing
• After carrying out the multi-rate expansion steps, we obtain the single-rate DFG shown next
• Here, the actors are duplicated according to their firing rates, and all multi-rate I/O are
converted to single-rate I/O
Transformations
Retiming Transformation
• Retiming redistributes delay elements in the DFG as a mechanism to increase
throughput
• Retiming does not introduce new delay elements
• Evaluation involves inspecting successive markings of the DFG and then selecting the one
with the best performance
• (a) has an iteration bound of 8, but produces data on intervals of 16 because of the
sequential execution of actors A, B and C
Transformations
Retiming Transformation
• The next marking (b) is obtained by firing actor A, which consumes the delay elements
on its inputs, and produces a delay element at its output.
• This functionally equivalent configuration improves throughput to 1 sample every 11
time units by allowing actor A to run in parallel with B and C
• Firing B produces the next marking in (c), which achieves an iteration bound of 8 and
represents the best that can be obtained. The last marking, which fires C, creates a
configuration nearly equivalent to (a).
Transformations
Pipelining Transformation
• Pipelining increases the throughput at the cost of increased latency
• Pipelining augments retiming by adding new delay elements
• (a) is extended with two pipeline delays in (b)
• Adding delay elements at the input increases the latency of (a) from 20 to 60
• The throughput is 1 sample every 20 time units
Transformations
Pipelining Transformation
• Retiming of the pipelined graph yields (c) after firing A twice and B once,
which improves both throughput to 10 and latency to 20
• Again we see the slowest pipeline stage determines the best achievable
throughput.
Pipelining Transformation example: a digital filter computing out[n] = c2·x[n−2] + c1·x[n−1] + c0·x[n]
Transformations
Unfolding Transformation
• Unfolding is very similar to the transformation carried out for multi‐rate expansion
• Here, actor A in the original DFG is replicated as needed, and interconnections and delay elements are
redistributed
• Note the original graph is single‐rate and goal is to increase sample consumption rate
• The text describes the sequence of steps that need to be applied to carry out unfolding
• (a) is unfolded two times in (b),
showing that the number of inputs
and outputs are doubled, allowing
twice as much data to be processed
per iteration
• Unfolding appears to slow the graph down, increasing the size of the loop to include
A0, B0, A1 and B1 while including only the single delay element
• Hence, the iteration bound of a v-unfolded graph increases v times
Data Flow Implementation in
Software and Hardware
Software Implementation of Data Flow
Converting Queues and Actors into Software
There are a wide variety of approaches to mapping DFGs onto software
• Sequential implementations can make use of static or dynamic schedules
• Parallel, multi-processor mappings require more effort due to:
• Load balancing: mapping actors such that the activity on each processor is about the same
• Minimizing inter-processor communication: mapping actors such that communication overhead is minimized
Software Implementation of Data Flow
Mapping DFGs to Software
• We will focus first on single‐processor systems, and in particular, on finding efficient versions of
sequential schedules
• As noted on the previous slide, there are two options for implementing the schedule:
Dynamic schedule
• Here, software decides the order in which actors execute at runtime by testing firing
rules to determine which actor can run
• Dynamic scheduling can be done in a single‐threaded or multi‐threaded execution
environment
Static schedule
• In this case, the firing order is determined at design time and fixed in the
implementation
• The fixed order allows for a design time optimization in which the firing of multiple
actors can be treated as a single firing
• This in turn allows for ‘inlined’ implementations
Software Implementation of Data Flow
DFG Elements
• Before discussing these, let’s first look at C implementations of actors and queues
FIFO Queues:
• Although DFGs theoretically have infinite length queues, in practice, queues are limited in size
• We discussed earlier that constructing a PASS allows the maximum queue size to be determined by
analyzing actor firing sequences
• A typical software interface of a FIFO queue has two parameters and three methods
• The # of elements N that can be stored in the queue and the data type of the elements.
• Methods that put elements into the queue, get elements from the queue, and query the current size of
the queue.
Software Implementation of Data Flow
DFG Elements
• Queues are well defined (standardized) data structures
• A circular queue consists of an array, a write‐pointer and a read‐pointer
• They use modulo addressing, e.g., the Ith element is at position (Rptr + I) mod Q.getsize()
Note that the queue size is fixed here at compile time; alternatively, the queue size can be
changed dynamically at runtime using malloc().
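The queue listing itself is not reproduced here; the following is a minimal C sketch of such a circular FIFO, using the put_fifo/get_fifo/fifo_size interface that the later listings assume (fixed, compile-time size; one array slot is sacrificed to distinguish full from empty):

#include <assert.h>

#define Q_SIZE 16                 /* fixed at compile time */

typedef struct {
    int data[Q_SIZE];
    unsigned wptr;                /* write pointer */
    unsigned rptr;                /* read pointer  */
} fifo_t;

void init_fifo(fifo_t *q)     { q->wptr = q->rptr = 0; }

unsigned fifo_size(fifo_t *q) { return (q->wptr - q->rptr + Q_SIZE) % Q_SIZE; }

void put_fifo(fifo_t *q, int v)
{
    assert(fifo_size(q) < Q_SIZE - 1);   /* queue must not overflow   */
    q->data[q->wptr] = v;
    q->wptr = (q->wptr + 1) % Q_SIZE;    /* modulo addressing         */
}

int get_fifo(fifo_t *q)
{
    int v;
    assert(fifo_size(q) > 0);            /* queue must not underflow  */
    v = q->data[q->rptr];
    q->rptr = (q->rptr + 1) % Q_SIZE;
    return v;
}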
Software Implementation of Data Flow
Software implementation of the data flow actor
Software Implementation of Data Flow
ACTOR Implementation
• A data flow actor can be captured as a function, with some additional support to interface
with the FIFO queues.
• Designers will often differentiate between the internal activities of an actor and the
input–output behavior.
• The behavior corresponding to actor firing can be implemented as a simple C function.
• The actor function incorporates a finite state machine (FSM), which checks the firing rules
to determine whether to execute the actor code.
• The local controller (FSM) of an actor has two states.
• wait state: start state which checks the firing rules immediately after being invoked by a
scheduler.
• work state: wait transitions to work when firing rules are satisfied, the actor then reads
tokens, performs calculation and writes output tokens.
Software Implementation of Data Flow
ACTOR Implementation
Finally, the actor_io and queue objects can be instantiated in the main program, and the actor
functions can be called using a system scheduler.
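Since the actor listing is not reproduced here, the following is a minimal C sketch of such an actor, reusing the fifo_t helpers sketched earlier; the actorio_t layout and the example "add two tokens" behavior are assumptions:

typedef struct {
    fifo_t *in[2];    /* input queues  */
    fifo_t *out[2];   /* output queues */
} actorio_t;

typedef enum { WAIT, WORK } actor_state_t;

void add_actor(actorio_t *g)
{
    static actor_state_t state = WAIT;      /* local two-state controller */
    if (state == WAIT && fifo_size(g->in[0]) >= 2)
        state = WORK;                       /* firing rule satisfied      */
    if (state == WORK) {
        int a = get_fifo(g->in[0]);         /* read tokens                */
        int b = get_fifo(g->in[0]);
        put_fifo(g->out[0], a + b);         /* compute, write output      */
        state = WAIT;                       /* back to checking the rule  */
    }
}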
Sequential Targets with Dynamic Schedule
• A software implementation of SDF is obtained by combining several different actor
descriptions, by interconnecting those actors using FIFO queues, and by executing the
actors through a system schedule.
• In a dynamic system schedule, the firing rules of the actors will be tested at runtime; the
system scheduling code consists of the firing rules, as well as the order in which the firing
rules are tested.
Mapping DFGs to Single Processors: Dynamic Schedule
Single‐Thread Dynamic Schedules
• Implementation of a system schedule as a function that instantiates all actors and
queues, and next calls the actors in a round-robin fashion.
• First, note that it is impossible to call the actors in the ‘wrong’ order, because each of
them still has a firing rule that protects them from running when there is no data
available.
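A minimal sketch of such a single-thread round-robin schedule, assuming the fifo_t/actorio_t types above; the src/snk actors here are hypothetical placeholders:

/* src: unconditionally produce one token per invocation */
void src_actor(actorio_t *g) { static int n = 0; put_fifo(g->out[0], n++); }

/* snk: fire only when the firing rule (one token) is satisfied */
void snk_actor(actorio_t *g)
{
    if (fifo_size(g->in[0]) > 0)
        get_fifo(g->in[0]);               /* consume the token */
}

int main(void)
{
    fifo_t q;
    actorio_t src_io = { .out = { &q } }; /* src produces into q */
    actorio_t snk_io = { .in  = { &q } }; /* snk consumes from q */
    init_fifo(&q);
    while (1) {                   /* round-robin: call every actor in turn */
        src_actor(&src_io);
        snk_actor(&snk_io);
    }
}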
• The schedule on the right shows that snk in (a) is called as often as src; however, snk will
only fire on even-numbered invocations.
• Even though snk is called as often as src, the firing rule of snk will only allow
that actor to run when there is sufficient data available.
• (b) shows a problem that simple round-robin scheduling does not handle: in this case it
will eventually lead to queue overflow
Mapping DFGs to Single Processors: Dynamic Schedule
• The underlying problem with (b) is that the implemented firing rate differs from the
firing rate for a PASS, which is given as (src, snk, snk)
• There are two solutions to this issue:
• Solution 1: We could adjust the system schedule to reflect the firing rate predicted by
the PASS.
• This solution is not very elegant, because it destroys the idea of having a dynamic
scheduler.
• If we have to obtain the PASS firing rate first, we may as well forget about using a
dynamic schedule.
Mapping DFGs to Single Processors: Dynamic Schedule
• Solution 2: We could adjust the code for the snk actor to continue execution as long as
there are tokens present.
• This is a better solution than the previous one, because it keeps the advantages of a
dynamic system schedule.
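A sketch of the adjusted snk actor (same conventions as the earlier sketches):

#include <stdio.h>

void snk_actor(actorio_t *g)
{
    /* fire for as long as tokens are present, so snk can catch up
       with a faster producer under round-robin scheduling */
    while (fifo_size(g->in[0]) > 0) {
        int v = get_fifo(g->in[0]);
        printf("%d\n", v);   /* placeholder for the real processing */
    }
}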
Mapping DFGs to Single Processors: Example Dynamic Schedule
• Let’s implement the 4‐point Fast Fourier Transform (FFT) shown above using a dynamic
schedule
• The array t stores 4 (time domain) samples
• The array f will be used to store the frequency domain representation of t
• The FFT utilizes butterfly operations to implement the FFT, as defined on the right side in
the figure
• The twiddle factor W(k, N) is a complex number defined as e^(−j·2πk/N), with W(0, 4) = 1 and
W(1, 4) = −j
Mapping DFGs to Single Processors: Example Dynamic Schedule
• reorder: Reads 4 tokens and shuffles them to match the flow diagram
• The t[0] and t[2] are processed by the top butterfly and t[1] and t[3] are processed by
the bottom butterfly
• fft2: Calculates the butterflies for the left half of the flow diagram
• fft4mag calculates the butterflies for the right half and produces the magnitude
component of the frequency domain representation
Mapping DFGs to Single Processors: Example Dynamic Schedule
The implementation first requires a valid schedule to be computed
• The firing rate is easily determined to be [qreorder, qfft2, qfft4mag] = [1, 2, 1]
void reorder(actorio_t *g)
{
    int v0, v1, v2, v3;
    while (fifo_size(g->in[0]) >= 4)
    {
        v0 = get_fifo(g->in[0]);
        v1 = get_fifo(g->in[0]);
        v2 = get_fifo(g->in[0]);
        v3 = get_fifo(g->in[0]);
        put_fifo(g->out[0], v0);
        put_fifo(g->out[0], v2);
        put_fifo(g->out[0], v1);
        put_fifo(g->out[0], v3);
    }
}
Mapping DFGs to Single Processors: Example Dynamic Schedule
The determinate property of SDF and the while loops inside the actors allow the
call order shown above to be rearranged while preserving the functional behavior
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• In multi‐threaded programming, each actor (implemented as a function) lives in a
separate thread
• The threads are time‐interleaved by a scheduler in single processor environments
• Systems in which threads voluntarily relinquish control back to the scheduler are referred
to as cooperative multithreading
• Such a system can be implemented using two functions, create() and yield()
• The scheduler can apply different strategies to schedule threads, the simplest being a
round-robin schedule
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• Quickthreads is a cooperative multithreading library
• The stp API (Application Programmers Interface, a thin wrapper over quickthreads; note
the stp.h header below) consists of 4 functions:
• stp_init(): initializes the threading system
• stp_create(stp_userf_t *F, void *G): creates a thread that will start execution with user function F, and
will be passed a single argument G
• stp_yield(): releases control over the thread to the scheduler
• stp_abort(): terminates a thread (prevents it from being scheduled)
• Here’s an example; the original listing is incomplete, so a hedged reconstruction follows below.
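A minimal sketch of such an example, using only the four stp_ calls above; how control first enters the scheduler from main() is an assumption of this sketch:

#include "../qt/stp.h"
#include <stdio.h>

/* each thread prints its name a few times, yielding between prints */
void hello(void *arg)
{
    int i;
    for (i = 0; i < 3; i++) {
        printf("%s: iteration %d\n", (char *) arg, i);
        stp_yield();          /* cooperatively hand control back */
    }
    stp_abort();              /* done: remove this thread        */
}

int main(void)
{
    stp_init();               /* initialize the threading system */
    stp_create(hello, "thread-a");
    stp_create(hello, "thread-b");
    stp_yield();              /* assumption: yielding from main runs
                                 the round-robin scheduler        */
    return 0;
}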
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• Here is an example of a sort actor, implemented using the cooperative threading model (a sketch follows below).
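Since the listing is not reproduced here, a minimal sketch reusing the actorio_t/FIFO helpers from the earlier sketches (the two-input wiring is an assumption):

/* sort actor from the Euclid example, wrapped in a cooperative thread */
void sort_thread(void *arg)
{
    actorio_t *g = (actorio_t *) arg;
    while (1) {
        /* firing rule: one token available on each input */
        if (fifo_size(g->in[0]) > 0 && fifo_size(g->in[1]) > 0) {
            int a = get_fifo(g->in[0]);
            int b = get_fifo(g->in[1]);
            put_fifo(g->out[0], (a > b) ? a : b);   /* max */
            put_fifo(g->out[1], (a > b) ? b : a);   /* min */
        }
        stp_yield();        /* always return control to the scheduler */
    }
}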
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• The system scheduler will now call threads rather than actors.
Mapping DFGs to Single Processors: Static Schedule
• From the PASS analysis of an SDF graph, we know at least one
solution for a feasible sequential schedule
• This solution can be used to optimize the implementation in several
ways
• We can remove the firing rules since we know the exact sequential schedule
• This yields only a small performance benefit
• We can also determine an optimal interleaving of the actors to minimize the
storage requirements for the queues
• Finally, we can create a fully inlined version of the SDF graph which eliminates
the queues altogether
Mapping DFGs to Single Processors: Static Schedule
• From the above SDF topology, we know that the relative firing rates of A, B, and C must be 4, 2, and 1 to
yield a PASS.
• The right side of the code shows an example implementation of this PASS.
• The A, B, C actors are called in accordance with their PASS rate.
• Due to this particular interleaving, it is easy to see that in a steady state condition, the queue AB will carry
four tokens maximum, while the queue BC will contain two tokens maximum.
• This is not the optimal interleaving, however.
• By calling the actors in the sequence (A,A,B,A,A,B,C), the maximum number of tokens on any queue is
reduced to two (see the sketch below).
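A sketch of this improved interleaving as a flat static schedule (the actor function names are placeholders):

int main(void)
{
    while (1) {
        /* PASS interleaving (A,A,B,A,A,B,C): at most two tokens
           ever accumulate on queue AB or on queue BC */
        actorA(); actorA(); actorB();
        actorA(); actorA(); actorB();
        actorC();
    }
}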
Mapping DFGs to Single Processors: Static Schedule
• Implementing a truly static schedule means that we will no longer
test firing rules when calling actors.
• In fact, when we call an actor, we will have to guarantee that the
required input tokens are available.
• In a system with a static schedule, all SDF-related operations get a
fixed execution order: the actor firings and the sequences of put and
get operations on the FIFO queues.
• This provides the opportunity to optimize the resulting SDF system.
• We will discuss optimization of single-thread SDF systems with a
static schedule using an example we discussed before: the GCD.
• From our earlier analysis, we know that a valid PASS fires each node
a single time. Listing 2.4 describes each of the actors (sort, diff).
sort: out1 = (a > b) ? a : b;
      out2 = (a > b) ? b : a;
diff: out1 = (a != b) ? a - b : a;
      out2 = b;
Mapping DFGs to Single Processors: Static Schedule
// initial tokens
put_fifo(&F1, 16);
put_fifo(&F2, 12);
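The remainder of the listing is not reproduced here; the following is a hedged sketch of the fully inlined static schedule, reusing the fifo_t helpers from before (the iteration count and printout are arbitrary); one loop iteration corresponds to one PASS iteration, firing sort and then diff:

#include <stdio.h>

int main(void)
{
    fifo_t F1, F2;
    init_fifo(&F1);
    init_fifo(&F2);

    /* initial tokens */
    put_fifo(&F1, 16);
    put_fifo(&F2, 12);

    for (int i = 0; i < 10; i++) {
        int a  = get_fifo(&F1);
        int b  = get_fifo(&F2);
        int mx = (a > b) ? a : b;                  /* sort, inlined */
        int mn = (a > b) ? b : a;
        put_fifo(&F1, (mx != mn) ? mx - mn : mx);  /* diff, inlined */
        put_fifo(&F2, mn);
        printf("iteration %d: a = %d, b = %d\n", i, a, b);
    }
    return 0;
}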
Hardware Implementation of Data Flow
• SDF graphs are particularly well‐suited for mapping directly into hardware
• For SDFs with single‐rate schedules, all actors can execute in parallel using
the same clock
• The rules for mapping single-rate SDFs to hardware are straightforward
• Each actor maps to a single combinational hardware circuit
• Each FIFO queue maps to wires (without storage)
• Each initial token in a queue is replaced with a register
• This means that two actors with no token on the queue between them will
be effectively represented as two back‐to‐back combinational circuits
• Timing requirements regarding the delay of the combinational logic must
be met,
• i.e., all computation within the combinational logic must complete in 1 clock cycle
• Therefore, in practice, the length of the actor sequence will be limited
Hardware Implementation of Data Flow
Hardware implementation of Euclid’s algorithm
sort: out1 = (a > b) ? a : b;
      out2 = (a > b) ? b : a;
diff: out1 = (a != b) ? a - b : a;
      out2 = b;
• The sort and diff actors are implemented using a comparator, a subtractor, plus a few multiplexers
• Note that the delay through the circuits of both actors defines the maximum achievable clock frequency
• If the critical path delays are 5 and 15 ns, then the maximum operating frequency is 1/(20 ns) = 50 MHz
Hardware Implementation of Data Flow
• Does this circuit work?
• Yes, it does. The circuit evaluates a single iteration through a PASS per clock cycle.
• Is this translation procedure general so that it would work for any SDF graph?
• No, it is not. The translation procedure is restricted to the following SDF graphs.
• We need a single‐rate SDF graph, which has a PASS firing vector with all ones in
it.
• All actors need to be implemented using combinational logic.
• In addition, the above method may result in circuits with a very long
combinational path.
• In the circuit above for example, the maximal operating clock frequency is
determined by the combined delay of the sort circuit and the diff circuit.
Hardware Implementation of Data Flow
Pipelining
• Pipelining of SDF graphs helps to break long combinational paths that may exist in circuits.
• Consider the example shown below.
• This is a data flow specification of a digital filter. It evaluates a weighted sum of samples of an
input stream, with the sum defined as out[n] = c2·x[n−2] + c1·x[n−1] + c0·x[n]
• The critical path of this graph is associated with one of the multiplier actors, c0 or c1, in series with
two addition actors
Hardware Implementation of Data Flow
• Pipelining allows additional tokens to be added (at the expense of increased latency); here, one is
introduced by the in actor
• Retiming is then used to redistribute the tokens, as shown by the above marking, which is
produced after firing c0, c1 and c2
• Retiming creates additional registers and an additional pipeline stage
• The critical path is now reduced to two back-to-back adder actors
Hardware Implementation of Data Flow
• Note that it is not possible to introduce arbitrary initial tokens in a graph without
following the actor’s firing rules
• Doing so would likely change the behavior of the system
Analysis of Control Flow and Data Flow
Data and Control Edges of a C Program
• Fundamental to the DFG model is the decomposition of a system into individual nodes
(actors), which communicate through unidirectional, point-to-point channels (queues).
• The resulting system model is represented as a graph.
• Such a data flow system is able to express concurrent computations that map easily into
hardware as well as into software.
• So, the data flow model of computation illustrates how we can build system models that
are equally well suited for hardware implementation as well as for software
implementation.
• Our objective in this chapter is to think of a C program in a similar target independent
fashion.
• For a software designer, a C program is always software.
• For a hardware–software codesigner however, a C program may be hardware or
software, depending on the requirements and needs of the application.
Data and Control Edges of a C Program
• Obviously, one cannot make a direct conversion of C into hardware – a major roadblock
is that hardware is parallel by nature, while C is sequential.
• But we can look at a C program as a high-level description of the behavior of an
implementation, without deciding on the exact nature of the implementation.
• At that point, we become interested in analyzing the C program in terms of its
fundamental building blocks.
• These building blocks are the operations of the C program and the relations between
them.
• We define two types of relationships between the operations of a C program:
data edges and control edges.
Data and Control Edges of a C Program
• Data edge: is a relationship between operations where data produced by one operation is
consumed by another
• Control edge: is a relationship between operations that relates to the order in which the
operations are performed
• Consider for example the following C function, which finds the maximum of two variables (sketched below).
• This function contains two assignment statements and an if‐then‐else branch.
• For the purpose of this analysis, we will equate statements in C with ‘operations’.
• In addition, we define the entry and exit points of the function as two additional operations.
• Therefore, the max function contains five operations:
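Since the listing itself is not reproduced here, the following is a sketch consistent with the description (the operation numbering matches the text):

/* operation 1: function entry, defines the values of a and b */
int max(int a, int b)
{
    int r;
    if (a > b)       /* operation 2: evaluate the condition */
        r = a;       /* operation 3: then-branch assignment */
    else
        r = b;       /* operation 4: else-branch assignment */
    return r;        /* operation 5: function exit          */
}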
Data and Control Edges of a C Program
• To find the control edges in this function, we need to find what
chains of operations can execute, based on the usual C semantics.
• For example, the operation 2 will always execute after operation
1. Therefore, there is a control edge from operation 1 to operation
2.
• If a > b is true, then operation 3 will follow operation 2, otherwise
operation 4 will follow operation 2.
• Therefore, there is a control edge from operation 2 to each of
operations 3 and 4.
• Finally, operation 5 will follow either execution of operation 3 or
4.
• There is a control edge from each of operations 3 and 4 to
operation 5.
Data and Control Edges of a C Program
• To find the data edges in this function, we examine the data production/ consumption patterns of
each operation.
• The data edges are defined between operations of corresponding production/consumption.
• For example, operation 1 defines the value of a and b.
• Several operations will make use of those values.
• The value of a is used by operations 2 and 3.
• Therefore, there is a data edge from operation 1 to operation 2,
as well as a data edge from operation 1 to operation 3.
Data and Control Edges of a C Program
Implementation Issues
• CFGs and DFGs capture the behavior of the C program graphically
• This leads naturally to the following question:
• What are the important parts of a C program that MUST be preserved in any
implementation of that program?
• Data edges reflect requirements on the flow of information
• Important note: If you change the flow of data, you change the meaning of the algorithm
• Control edges, on the other hand, provide a nice mechanism to break down the algorithm
into a sequence of operations (a recipe)
• They are not fundamental to preserving correct functional behavior in an implementation
• It follows then that data edges MUST be preserved while control edges can be removed
and/or manipulated
Data and Control Edges of a C Program
Implementation Issues
• Parallelism in the underlying architecture can be leveraged to remove control edges, e.g., superscalar
processors can execute instructions out‐of‐order
• On the other hand, parallel architectures MUST always preserve data dependencies otherwise, the
results will be erroneous
• A fully parallel hardware implementation of this program can in fact carry out both
additions in a single clock cycle
• The sequential order specified by the CFG is eliminated in the hardware implementation
Data and Control Edges of a C Program
Construction of the Control Flow Graph
• Let’s define a systematic method to convert a C program to a CFG assuming:
• Each node in the graph represents a single operation (or C statement)
• Each edge of the graph represents an execution order for the two operations connected by that edge
• Since C executes sequentially, this conversion is straightforward in most cases
• The only exception occurs when multiple control edges originate from a single operation
A for statement, for example, includes four distinct parts:
• loop initialization
• loop condition
• loop-counter increment operation
• body of the loop
Data and Control Edges of a C Program
Construction of the Control Flow Graph
The for loop introduces three nodes to the CFG
Dashed components (entry, exit and body) are other CFGs of the C
program which have single-entry and single-exit points
The do‐while loop and the while‐do loop are similar iterative structures
Data and Control Edges of a C Program
Construction of the Control Flow Graph
Consider the CFG for the GCD algorithm
• A control path is defined as a sequence of control edges that traverses the CFG
• For example, each non-terminating iteration of the while loop will follow the path 2->3->4->2 or else
2->3->5->2
• Control paths are useful in constructing the DFG
Data and Control Edges of a C Program
Construction of the Data Flow Graph
• Let’s also define a systematic method to convert a C program to a DFG assuming
• Each node in the graph represents a single operation (or C statement)
• Each edge of the graph represents a data dependency
• Note that the CFG and the DFG will contain the same set of nodes ‐‐ only the edges will
be different
• While it is possible to derive the DFG directly from a C program, it is easier to create the
CFG first and use it to derive the DFG
• The method involves tracing control paths in the CFG while simultaneously identifying
corresponding read and write operations of the variables
• Our analysis focuses on C programs that do NOT have arrays or pointers
Data and Control Edges of a C Program
Construction of the Data Flow Graph
• The procedure to recover the data edges related to assignment statements is as follows:
1. In the CFG, select a node where a variable is used as an operand in an expression.
• Mark that node as a read‐node.
2. Find the CFG nodes that assign that variable.
• Mark those nodes as write‐nodes.
3. If there exists a direct control path from a write‐node into a read‐node that does not pass
through another write‐node, then you have identified a data edge.
• The data edge originates at the write‐node and ends at the read‐node.
4. Repeat the previous steps for all variable‐read nodes.
• This procedure identifies all data edges related to assignment statements, but not those
originating from conditional expressions in control flow statements.
Data and Control Edges of a C Program
Construction of the Data Flow Graph
• Let’s derive the DFG of the GCD program
• We first pick a node where a variable is read
Consider statement 5:
• There are two variable-reads in this statement, one for a and one for b; consider b first
• Find all nodes that reference b by tracing backwards through predecessors of node 5 in the CFG ‐‐ this
produces the ordered sequence 3, 2, 1, 4, and 5
• Both nodes 1 and 5 write b, and there is a direct path from 1 to 5 (e.g. 1, 2, 3, 5) and from 5 to 5 (e.g. 5, 2, 3, 5)
• Therefore, we need to add data edges for b from 1 to 5 and from 5 to 5
Data and Control Edges of a C Program
Construction of the Data Flow Graph
• A similar process can be carried out for the variable-read of a in node 5 of the GCD
program (the figure shows the incoming data edges for node 5)
• Nodes 1 and 4 write into a, and there is a direct control path from 1 to 5 and from 4 to 5
• Hence, data edges are added for a from 1 to 5 and from 4 to 5
• To complete the set of data edges into node 5, we also need
to identify all conditional expressions that affect the
outcome of node 5
• From the CFG, node 5 depends on the condition evaluated
in node 3 (a > b) AND the condition evaluated in node 2 (a
!= b)
Data and Control Edges of a C Program
Construction of the Data Flow Graph
The final DFG for all nodes and all variable-reads for GCD is shown below
Note: this DFG leaves out the data edges originating from conditional expressions
Translating C to Hardware
• As mentioned, control and data flow analysis can be helpful in translating C into
hardware
• Translating data structures and pointers from C to hardware can get tricky
• For the purpose of this course, we restrict our analysis as follows
• Only scalar C code is used (no pointers, arrays or other data structures)
• We assume each C statement executes in a single clock cycle
• We first create the CFG and DFG for the C program
• The control edges translate to signals that control datapath operations
• The data edges define the interconnection of the datapath components
Translating C to Hardware
Data Path Design:
With the CFG and DFG available, the following rules will define the implementation of the hardware
datapath.
1. Find the C expression embedded in each node of the CFG, and create an equivalent
combinational circuit to implement the expression.
• For example, if a node in the CFG corresponds to the C statement a = b ‐ a;, then the C expression embedded in
that statement is b ‐ a.
• The combinational circuit required to implement this expression is a subtractor.
• Conditional expressions generate datapath elements, too.
• The outputs of these expressions become the flags used by the hardware controller of this datapath.
2. Each variable in the C program is translated into a register with a multiplexer in front of it.
• The multiplexer is needed when multiple sources may update the register.
• By default, the register will update itself.
• The selection signals of the multiplexer will be driven by the controller.
3. The datapath circuit and the register variables are connected based on the nodes and data
edges in the DFG.
• Each assignment operation connects a combinational circuit with a register.
• Each data edge connects a register with the input of a combinational circuit.
• Finally, we also connect the system‐inputs and system outputs to inputs of datapath circuits and register outputs,
respectively.
Translating C to Hardware
Data Path Design:
• The datapath and the registers are connected consistent with the DFG
• Assignments connect combinational circuit outputs to register inputs, while the
• data edges connect register outputs to combinational circuit inputs.
• System I/O is connected to datapath inputs and register outputs resp.
• Let’s convert the GCD program to a hardware implementation
• The variables a and b are assigned to registers
• The conditional expressions for the if and while stmts require an equality‐ and greater‐than
comparator circuit
• The subtractions b ‐ a and a ‐ b are implemented using subtractors
• The connectivity of the components is defined by the data edges of the DFG
• The resulting datapath has two data inputs (in_a and in_b), and one data output (out_a)
• The circuit needs two control variables, upd_a and upd_b (outputs of controller)
• and it produces two flags, flag_while and flag_if (inputs to controller)
Translating C to Hardware
• The directed edges in the DFG correspond to the connections in the schematic
• Schematic representations of a circuit are low‐level representations with lots of detail
(similar to assembly code in programming languages)
• We will learn how to create HLS and behavioral VHDL descriptions that synthesize to
schematics similar to the one shown here
Translating C to Hardware: Control Path Design
• Controller Design: The design of the controller can
be derived directly from the CFG and translated
into a finite state machines (FSM)
• How do we create the controller for this datapath
such that it implements the GCD algorithm?
• This control information is present in the C
program and is captured in the CFG.
• In fact, we can translate the CFG almost straight
into hardware by considering it to be a finite state
machine (FSM) specification.
• Each of the transitions in this FSM takes 1 clock
cycle to complete.
• The activities of the FSM are expressed as
condition/command tuples.
Translating C to Hardware: Control Path Design
• Unlike CFGs, the edges in FSMs are labeled with condition/command tuples
• The “_” in _ /run1 means don’t care, i.e., the transition is unconditional
• For example, _ /run1 means that during this clock cycle, the value of the condition flags is a don’t‐
care, while the command for the datapath is the symbol run1.
• Similarly, flag_while/_ means that this transition is conditional on flag while being true, and that the
command for the datapath is a hold operation.
• A hold operation is one which does not change the state of the datapath, including registers.
• The command set for this FSM includes ( _, run1, run4,run5).
• Each of these symbols represents the execution of a particular node of the CFG.
• The datapath control signals can be created by additional decoding of these command signals.
Translating C to Hardware: Control Path Design
• Each clock cycle, the controller generates a new command based on the current state
and the value of flag_while and flag_if.
• The commands run1, run4, and run5
are decoded into upd_a and upd_b.
• The combination of the data path
and controller is referred to as a
finite state machine with datapath
(FSMD)
Translating C to Hardware: Control Path Design
The table shows an example execution, where each row of the table corresponds to one clock cycle:
Note that this solution is sub-optimal; in particular:
• The resulting implementation limits parallelism
• it executes a single C statement per clock cycle and does not share datapath operators.
• For example, only one subtractor is needed in the implementation because only one is ever used
in any given clock cycle
Single‐Assignment Programs
• Converting into hardware with one C‐statement/clock is not very efficient
• This one cycle‐per‐statement is similar to what microprocessors do when they
execute a program
• A more lofty goal is to devise a translation strategy that allows the execution of
multiple C statement/clock
• But our original variable‐to‐register mapping strategy creates a performance
bottleneck
• This is true because only one storage location exists for each variable and
therefore, sequential updates to it will each require one clock cycle
• We fix this problem by converting the C code to a single‐assignment program
• This is done by creating new variables for each sequential assignment statement
Single‐Assignment Programs
• Consider a simple example:
a = a + 1;
a = a * 3;
a = a ‐ 2;
• Our previous strategy requires 3 clock cycles to execute these
statements
Let’s re‐write this as:
a2 = a1 + 1;
a3 = a2 * 3;
a4 = a3 ‐ 2;
• This code allows a2 and a3 to be mapped to wires and a4 to a
register, reducing the clock cycle count to 1
Single‐Assignment Programs
• Note: care must be taken that all assignments are taken into account,
which might be difficult to determine
a = 0;
for (i = 1; i < 6; i++)
a = a + i;
• After conversion to single‐assignment, it remains unclear what
version of a should be read inside of the loop
a1 = 0;
for (i = 1; i < 6; i++)
a2 = a + i; // which version of a to read
• The answer is that both a1 and a2 are needed and it depends on the
iteration, i.e., a1 is needed on the first iteration and a2 on
subsequent iterations
Single‐Assignment Programs
• The solution is to introduce a new merge variable that selects between
the two versions that are available
a1 = 0;
for (i = 1; i < 6; i++)
{
    a3 = merge(a1, a2); // a1 on the first iteration, a2 afterwards
    a2 = a3 + i;
}
• In a hardware implementation, the merge operation is mapped into a
multiplexer, with the selection signal derived from the test for the first iteration (here, i == 1)
• Using these transformations, we can reformulate any program into
single assignment form
Single‐Assignment Programs
• Consider the GCD program:

int gcd (int a, int b)
{
    while (a != b)
    {
        if (a > b)
            a = a - b;
        else
            b = b - a;
    }
    return a;
}

• The equivalent single-assignment form:

int gcd (int a1, int b1)
{
    while ((a3 = merge(a1, a2)) != (b3 = merge(b1, b2)))
    {
        if (a3 > b3)
            a2 = a3 - b3;
        else
            b2 = b3 - a3;
    }
    return a2;
}
Single‐Assignment Programs
The implementation of this single-assignment version might look as follows:
• Here, a2 and b2 are mapped into registers while the other variables are replaced with wires
• This type of manipulation allows multiple C statements to be executed per clock
Finite State Machine with
Datapath
Cycle‐Based Bit‐Parallel Hardware
• In this chapter, we develop a model to systematically describe custom hardware
consisting of a controller and a datapath.
• We deal with how to create datapath modules, a control model, and the combination of
control and datapath into a Finite State Machine with Datapath (FSMD).
• We will develop cycle-based hardware models.
• In such models, the behavior of a circuit is expressed in steps of a single clock cycle.
• This abstraction level is very common in digital design, and it is captured with the term
synchronous design.
Cycle‐Based Bit‐Parallel Hardware
Hardware Modules
• A hardware module defines a level of hierarchy for a hardware netlist. In order to
communicate across the levels of hierarchy, hardware modules define ports.
• The figure shows the 3-bit counter, encapsulated as a module.
• There is a single input port, clr, which allows the register to be synchronously cleared.
• There is also a 3-bit output port c that holds the current count value.
Hardware Modules
• The always block is included in a dp (datapath), which defines a list of in and out ports.
• There can be as many input and output ports as needed, and they can be created in any
order.
• Registers and signals are local to a single module and invisible outside of the module
boundary.
• Input ports and output ports are equivalent to wires and therefore behave identical to
signals.
• After hardware is encapsulated inside of a module, the module itself can be used as a
component in another hardware design.
• This principle is called structural hierarchy.
Hardware Modules
• Once a module has been included inside of another one
by means of the use statement, it cannot be included
again: each module can be used only once.
• However, it is easy to create a duplicate of an existing
module by means of a cloning statement.
• Program shows how to create 3‐bit counters, count0,
count1 and count2.
Finite State Machines
• The expressions that are part of an always block are evaluated at each clock cycle, and it
is not possible to conditionally evaluate an expression.
• Even the selection operator (c ? expr1 : expr2) will evaluate the true‐part as well as the
false part regardless of the condition value c.
• Assume that expr1 and expr2 would contain an expensive operator, then we would need
two copies of that operator to implement c ? expr1 : expr2.
• A control model, on the other hand, allows us to indicate what expressions should
execute during each clock cycle.
• Very simple control models will only select the sequence of expressions to execute over
multiple clock cycles.
• A common control model for hardware description is the Finite State Machine (FSM).
Finite State Machines
• We begin looking at the options for mapping algorithms into hardware
• We begin at the behavioral level of abstraction with a register‐transfer‐level (RTL)
description of hardware using VHDL
• At the heart of RTL descriptions is the finite state machine with datapath (FSMD)
• The finite state machine component is constructed (usually by the synthesis tools) to
realize the control flow components of the algorithm
• The datapath is synthesized into a set of mathematical hardware operations, e.g., add
and multiply, that process the data
• We begin with algorithms that are largely characterized as control-only algorithms
• We consider full-blown FSMDs in future lectures
• Our goal is to become proficient at translating software algorithms (and other high-level
specifications) into hardware implementations
Finite State Machines
• An FSM is designed to control the execution of a sequence of operations over multiple
clock cycles, which makes it ideal for emulating software execution.
• It usually incorporates decision constructs, including if‐else and case statements, which
are used extensively in software execution for implementing control flow.
• Advanced FSMs can be designed to implement pipelines, recursion, out‐of‐order
execution and exception handling
• An FSM is a sequential digital machine, which is defined using:
• A set of states
• A set of inputs and outputs
• A state transition function
• An output function
Finite State Machines
• An FSM has a current state, equal to an element from the set of
states.
• Each clock cycle, the state transition function selects the next value
for the current state, and the output function selects the value on the
output of the FSM.
• The state transition function and the output function are commonly
described in terms of a graph.
• In that case, the set of states becomes the set of nodes of the graph,
and the state transitions become edges in the graph.
Finite State Machines
• Both Mealy and Moore outputs can be present in an FSM, and usually are; Moore outputs
depend only on the current state
• VHDL descriptions of the FFs and combinational logic (next-state logic and output logic)
are separated into two process blocks using the two-segment style
Mealy State Machine
• The output of the state machine depends on both the present state and the
current input.
• When the input changes, the output of the state machine is updated
without waiting for a change in the clock input.
Moore State Machine
• The output of the state machine depends only on the present state. The outputs of the state
machine are only updated at the clock edge.
VHDL code for a sequence detector (101) using a Moore state machine
VHDL code for a sequence detector (101) using a Mealy state machine
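The VHDL listings referenced above are not reproduced here; as an illustration of the Moore behavior in software, here is a small C sketch of an overlapping 101 detector (the state names and test vector are hypothetical):

#include <stdio.h>

/* Moore-style 101 detector: the output z depends only on the current
   state; one loop iteration models one clock edge */
typedef enum { S0, S1, S10, S101 } state_t;

int main(void)
{
    int bits[] = { 1, 0, 1, 1, 0, 1 };
    state_t s = S0;
    for (unsigned i = 0; i < sizeof bits / sizeof bits[0]; i++) {
        int w = bits[i];
        switch (s) {                       /* next-state function */
            case S0:   s = w ? S1   : S0;  break;
            case S1:   s = w ? S1   : S10; break;
            case S10:  s = w ? S101 : S0;  break;
            case S101: s = w ? S1   : S10; break;
        }
        /* Moore output: a function of the state alone */
        printf("in=%d z=%d\n", w, s == S101);
    }
    return 0;
}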
Algorithmic State Machine
• Algorithmic State Machine is a representation of a Finite State Machine suitable for FSMs
with a larger number of inputs and outputs compared to FSMs expressed using state
diagrams and state tables.
• Algorithmic state machines can model both Mealy and Moore Finite State
Machines; they can also model machines of a mixed type
Moore Example
• The machine (reset state A) asserts z = 1 when w has been 1 for two consecutive clock cycles.
• State diagram: A (z = 0), B (z = 0), C (z = 1); w = 1 moves A -> B -> C and holds C, while w = 0 returns to A from every state.
• State table:

Present state | Next state (w = 0) | Next state (w = 1) | Output z
A             | A                  | B                  | 0
B             | A                  | C                  | 0
C             | A                  | C                  | 1

• The corresponding ASM chart uses one state box per state, with a decision box on w.
VHDL code for ASM
LIBRARY ieee ;
USE ieee.std_logic_1164.all ;

ENTITY simple IS
    PORT ( Clock : IN  STD_LOGIC ;
           Reset : IN  STD_LOGIC ;
           w     : IN  STD_LOGIC ;
           z     : OUT STD_LOGIC ) ;
END simple ;

ARCHITECTURE Behavior OF simple IS
    TYPE State_type IS (A, B, C) ;
    SIGNAL y : State_type ;
BEGIN
    PROCESS ( Reset, Clock )
    BEGIN
        IF Reset = '1' THEN
            y <= A ;
        ELSIF rising_edge(Clock) THEN
            CASE y IS
                WHEN A =>
                    IF w = '1' THEN y <= B ; ELSE y <= A ; END IF ;
                WHEN B =>
                    IF w = '1' THEN y <= C ; ELSE y <= A ; END IF ;
                WHEN C =>
                    IF w = '1' THEN y <= C ; ELSE y <= A ; END IF ;
            END CASE ;
        END IF ;
    END PROCESS ;

    z <= '1' WHEN y = C ELSE '0' ;

END Behavior ;
ASM for mealy and moore combined machine
• Let us consider an example of a state graph of a sequential network shown below.
• This state graph has both Mealy and Moore outputs. The outputs Y1 and Y2 are the Mealy
outputs and so should be conditional outputs.
• The Ya, Yb, and Yc are the Moore outputs so they should be part of state box.
• Input X can either be “0” or “1” and hence it should be part of the decision box.
Verilog model for ASM
Design of Secure Memory Access Controller
• Let’s consider a control algorithm that is designed to provide a secure
access control mechanism to an on‐chip memory
• create a project with GPIO and BRAM IP blocks
• The general purpose I/O is configured to implement two memory‐
mapped 32‐bit registers located between the PS and PL sides of the
Zynq SoC
• The BRAM is configured as a stand‐alone memory and is embedded
within the PL side
• Our memory controller’s primary function is to allow C programs
running on the ARM microprocessor to load and unload the BRAM in
a restricted manner
Design of Secure Memory Access Controller
• Let’s start with at level of abstraction of the FSM running in the hardware, e.g.,
• One of the first issues we’ll need to deal with is setting up a communication mechanism
between the C program and the FSM
• The GPIOs are visible to both the C program and the VHDL, and will be used for this
purpose
Secure Memory Access Controller
• We will define the bit‐fields within the two 32‐bit GPIO registers as follows:
• Some of the high-order 16 bits are designated for control, while all of the low-order
16 bits are designated for data transfers
• Note that the GPIOs cross clock domains, with the PL side running at 50 or 100 MHz and
the PS side running at more than 600 MHz
• Therefore, we need a reliable protocol to allow transfers between PL and PS
Secure Memory Access Controller
• Synchronization between our VHDL controller (PL) and the C program (PS) is done using a 2‐
way handshake, which makes use of two signals
• LM_ULM_stopped and LM_ULM_continue
• The following defines the protocol for data transfers from the C program to BRAM:
• Data transfer from BRAM to C program is similar except for direction of data flow
• This protocol ensures a reliable communication channel between the PS and PL side
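As an illustration, here is a hedged C sketch of the PS-side loop for loading the BRAM; the GPIO addresses, bit positions, and even the exact ordering of the handshake steps are assumptions based on the description above, not the actual register map:

#include <stdint.h>

/* hypothetical memory-mapped GPIO registers (addresses are placeholders) */
#define GPIO_OUT (*(volatile uint32_t *)0x41200000) /* PS -> PL: ctrl + data */
#define GPIO_IN  (*(volatile uint32_t *)0x41210000) /* PL -> PS: status      */

#define CTRL_START    (1u << 16)   /* hypothetical bit assignments */
#define CTRL_CONTINUE (1u << 17)
#define STAT_STOPPED  (1u << 16)

void load_bram(const uint16_t *buf, int n)
{
    GPIO_OUT = CTRL_START;                  /* start SMAC in load mode (load_unload = 0) */
    for (int i = 0; i < n; i++) {
        while (!(GPIO_IN & STAT_STOPPED)) ; /* wait for PL to assert 'stopped'  */
        GPIO_OUT = CTRL_CONTINUE | buf[i];  /* data in low 16 bits + 'continue' */
        while (GPIO_IN & STAT_STOPPED) ;    /* wait for 'stopped' to drop       */
        GPIO_OUT = buf[i];                  /* deassert 'continue'              */
    }
}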
Secure Memory Access Controller
Low Level Algorithm for Secure Memory Access Controller
• We are now ready to describe the FSM at a lower level of detail
• The C program starts the secure memory access controller (SMAC) FSM, but other
usage scenarios may have other state machines start SMAC
• In this case, the C program may need to inform SMAC that it has no read or write
requests for the memory at that point in time
• The C program operations in the following are designed to provide context, and
are not part of the FSM
Secure Memory Access Controller
Algorithm for Secure Memory Access Controller
1. C program checks ’ready’, sets ’load_unload’ flag and issues ’start’ signal
2. C program waits for ’stopped’
3. idle: SMAC waits in idle for ’start’
if ( start = ’1’ )
Store ’base address’ and ’upper limit’ registers
Check ’load_unload’, if 0, Goto load_mem
Check ’load_unload’, if 1, Goto unload_mem
4. load_mem: Process write requests from the C program; assert ’stopped’ signal
If ’done’ is 0
Check ’continue’, if asserted, update BRAM
Assert ’PNL_BRAM_we’
Assign ’PNL_BRAM_din’ GPIO data
Goto wait_load_unload
else
Goto wait_done
Secure Memory Access Controller
5. unload_mem: Process read requests from C program
Drive GPIO register with ’PNL_BRAM_dout’ data
Assert ’stopped’
If ’done’ is 0
Check ’continue’, if asserted, C program has data
Goto wait_load_unload
else
Goto wait_done
6. wait_load_unload: Finish handshake and update addr
Check ’continue’, if 0
If ’done’ is 1
Goto wait_done
Elsif ’base addr’ = ’upper_limit’
Goto idle
Else
Inc ’base addr’
Goto load_mem if ’load_unload’ is 0
Goto unload_mem if ’load_unload’ is 1
7. wait_done: Wait for C program to deassert ’done’
Check ’done’, if 0
Goto idle
ASM For LoadUnloadMem
System‐on‐Chip architecture
Introduction
• Technological Advances
• today’s chips can contain more than a hundred million transistors.
• transistor gate lengths are now measured in nanometers.
• approximately every 18 months the number of transistors on a chip doubles (Moore’s law).
• The Consequences
• components connected on a Printed Circuit Board can now be integrated onto a single chip.
• hence the development of System-on-Chip design.
What is SoC
• The VLSI manufacturing technology advances have made it possible to put
millions of transistors on a single die.
• This enables designers to build systems‐on‐a‐chip that eventually move
everything from the board onto the chip.
• An SoC includes a high‐performance microprocessor, since we can program and
give instructions to the uP to do whatever we want it to do.
• An SoC is the integration of heterogeneous (different) types of silicon IPs onto
the same chip, such as memory, uP, random logic, and analog circuitry.
System on Chip
What is SoC ?
• An IC that integrates multiple components of a system onto a single chip.
• SoC not only chip, but more on “system”.
• SoC = Chip + Software + Integration
• The SoC chip includes:
• Embedded processor
• ASIC Logics and analog circuitry
• Embedded memory
• The SoC Software includes:
• OS, compiler, simulator, firmware, driver, protocol stack
• Integrated development environment (debugger, linker, ICE)
• Application interface (C/C++, assembly)
Evolution: Boards to SoC
• Evolution:
• IP based design
• Platform‐based design
• Some Challenges
• HW/SW Co‐design
• Integration of analog (RF) IPs
• Mixed Design
• Productivity
• Emerging new technologies
• Greater complexity
• Increased performance
• Higher density
• Lower power dissipation
Migration from ASICs to SoCs
In the mid-1990s, ASIC technology evolved from a chip-set philosophy to an
embedded-cores-based system-on-a-chip concept.
• An SoC is an IC designed by stitching together
multiple stand-alone VLSI designs to provide full
functionality for an application.
• An SoC is composed of predesigned modules of
complex functions known as cores (also called
intellectual property blocks, virtual
components, or macros) that serve a variety
of applications.
Three forms of SoC design
The scenario for SoC design is characterized by three forms:
1. ASIC vendor design: This refers to the design in which all the
components in the chip are designed as well as fabricated by
an ASIC vendor.
2. Integrated design: This refers to the design by an ASIC vendor
in which all components are not designed by that vendor. It
implies the use of cores obtained from some other source
such as a core/IP vendor or a foundry.
3. Desktop design: This refers to the design by a fabless
company that uses cores which for the most part have been
obtained from other sources such as IP companies, EDA
companies, design services companies, or a foundry.
A common set of problems facing everyone who is
designing complex chips
• Time-to-market pressures demand rapid development.
• Quality of results (performance, area, power) is key to market success.
• Increasing chip complexity makes verification more difficult.
• Deep submicron issues make timing closure more difficult.
• The development team has different levels and areas of expertise, and is often scattered
throughout the world.
• Design team members may have worked on similar designs in the past, but cannot reuse
these designs because the design flow, tools, and guidelines have changed.
• SoC designs include embedded processor cores, and thus a significant software
component, which leads to additional methodology, process, and organizational challenges.
Reusing macros (called “cores”, or IP) that have already been designed and
verified helps to address all of the problems above.
Design for Reuse
To overcome the design gap, design reuse - the use of pre-designed
and pre-verified cores, or reuse of the existing designs becomes a
vital concept in design methodology.
An effective block-based design methodology requires an extensive
library of reusable blocks, or macros, and it is based on the following
principles:
 The macro must be extremely easy to integrate into the overall
chip design.
 The macro must be so robust that the integrator has to perform
essentially no functional verification of internals of the macro.
The challenge for designers is not whether to adopt reuse, but how to employ
it effectively.
Intellectual Property
Utilizing predesigned modules enables:
(figure: Resources vs. Number of Uses)
 to avoid reinventing the wheel for
every new product,
 to accelerate the development of
new products,
 to assemble various blocks of a
large ASIC/SoC quite rapidly,
 to reduce the possibility of failure
based on design and verification of
a block for the first time.
These predesigned modules are commonly called
Intellectual Property (IP) cores or Virtual Components (VC).
Intellectual Property Categories
IP cores are classified into three distinct categories:
Hard IP cores consist of hard layouts using particular physical design libraries and are
delivered as mask-level designed blocks (GDSII format). The integration of hard IP cores
is quite simple, but hard cores are technology dependent and provide minimum flexibility
and portability in reconfiguration and integration.
Soft IP cores are delivered as RTL VHDL/Verilog code to provide functional descriptions of
IPs. These cores offer maximum flexibility and reconfigurability to match the requirements
of a specific design application, but they must be synthesized, optimized, and verified by
their user before integration into designs.
Firm IP cores bring the best of both worlds and balance the high performance and
optimization properties of hard IPs with the flexibility of soft IPs. These cores are delivered
in the form of netlists targeted to specific physical libraries, after going through synthesis but without
performing the physical layout.
Comparison of Different IP Formats
IP Format   Representation   Optimization   Technology               Reusability
Hard        GDSII            Very High      Technology Dependent     Low
Soft        RTL              Low            Technology Independent   Very High
Firm        Target Netlist   High           Technology Generic       High
Typical SoC
SoC Structure
Multi‐Core (Processor) System‐on‐Chip
• Inter‐node communication between CPU/cores can be performed by
message passing or shared memory.
• Number of processors in the same chip‐die increases at each node (CMP
and MPSoC).
• Memory sharing will require:
• Shared bus
• Large multiplexers
• Cache coherence
• Not scalable
• Message Passing: NOC: Network‐on‐Chip
• Scalable
• Requires data transfer transactions
• Overhead of extra communication
Buses to Networks
• Architectural paradigm shift: Replace wire spaghetti by network
• Usage paradigm shift: Pack everything in packets
• Organizational paradigm shift
• Confiscate communications from logic designers
• Create a new discipline, a new infrastructure responsibility
System‐on‐Chip (SoC)
Traditional SoC
• Variety of dedicated interfaces
• Design and verification complexity
• Unpredictable performance
• Many underutilized wires
(figure: a traditional SoC with DMA, CPU, and DSP on a CPU bus, a bridge to a peripheral
bus with I/O blocks, and dedicated control signals between components)
NoC: A paradigm Shift in VLSI
From: dedicated point‐to‐point signal wires between computing modules
To: a shared network of switches
(figure: point‐to‐point links between modules replaced by a network of switches)
NoC: A paradigm Shift in VLSI
(figure: an NoC in which CPU, DSP, coprocessor, DMA, MPEG, DRAM, Ethernet, I/O, and
accelerator blocks each attach through a network interface (NI) to a fabric of switches)
NoC Operation Example
(figure: a CPU and an I/O device, each attached through a network interface to network switches)
1. CPU request
2. Packetization and transmission
3. Routing
4. Receipt and unpacketization (AHB, OCP, ... pinout)
5. Device response
6. Packetization and transmission
7. Routing
8. Receipt and unpacketization
Network‐on‐chip (NoC)
What are NoC’s?
 “Network‐on‐a‐chip (NoC)” is a new paradigm for System‐on‐Chip (SoC) design.
 Addresses global communication in SoC, involving
 (i) a move from computation‐centric to communication‐centric design, and
 (ii) the implementation of scalable communication structures.
 The NoC solution brings a networking method to on‐chip communications and claims
roughly a threefold performance increase over conventional bus systems.
Why NoC’s emerged ???
• Growing chip density
• Demand for multiprocessor (many‐core) SoCs
• Demand for a scalable, high‐performance, and robust infrastructure for on‐chip
communication
On‐chip Interconnection Types
(figure: processors P1–P7, memories M1–M4, and I/O blocks all attached to a single bus;
most components spend time waiting for bus access)
Shared bus
On‐chip Interconnection Types
(figure: two bus segments joined by a bridge, each carrying a subset of the processors,
memories, and I/O blocks; waiting is reduced but not eliminated)
Hierarchical bus
On‐chip Interconnection Types
(figure: a bus matrix (crossbar) interconnecting the processors, memories, and I/O blocks;
fewer components wait, at the cost of additional multiplexing and wiring)
Bus matrix
On‐chip Interconnection Types
(figure: a network‐on‐chip in which each processing element attaches through a network
interface with input buffers to a router; routers are connected by unidirectional links)
Network-on-Chip
Why do today’s SoCs need a NoC interconnect fabric?
Benefits of NoCs :
• Reduce Wire Routing Congestion
• Ease Timing Closure
• Higher Operating Frequencies
• Change IP Easily
Network‐on‐chip (NoC)
 Network‐on‐chip (NoC) is a reliable and scalable communication paradigm deemed
as an alternative to classic bus systems in modern systems‐on‐chip designs
 a move from computation‐centric to communication‐centric design
SoC Concepts
• A system‐on‐chip architecture combines one or more microprocessors, an on‐chip bus
system, several dedicated coprocessors, and on‐chip memory, all on a single chip.
• An SoC architecture provides general‐purpose computing capabilities along with a few
highly specialized functions, adapted to a particular design domain.
SoC Concepts
• A processor with a specific configuration of peripherals is also called a platform.
• Just like a personal computer is a platform for general‐purpose computing, a system‐on‐chip is a
platform for domain specialized computing.
• Examples of application domains are mobile telephony, video processing, or high‐speed
networking.
• The set of applications in the video‐processing domain for example could include image
transcoding, image compression and decompression, image color transformations, and so forth.
• The specialization of the platform ensures that its processing efficiency is higher compared to that
of general‐purpose solutions.
• Entire families of microcontrollers are defined using a single type of microprocessor integrated
with different kinds of peripherals (to serve one market segment).
• This type of SoC is called a derivative platform.
• For example, an SoC for automotive applications may contain different peripherals than an SoC
for cellphones, even though both may be based on ARM.
SoC Concepts
• SoC architecture can be analyzed along 4 orthogonal dimensions:
• Control, Communication, Computation, and Storage.
• The role of central controller is given to the microprocessor, which is responsible for issuing
control signals to, and collecting status signals from, the various components in the
system.
• The microprocessor may or may not have a local instruction memory.
• In case it does not have a local instruction memory, caches may be utilized to improve
instruction memory bandwidth.
• The SoC implements communication using system‐wide busses.
• Each bus is a bundle of signals including address, data, control, and synchronization
signals.
SoC Concepts
• The data transfers on a bus are expressed as read‐ and write‐operations with a particular
memory address.
• The bus control lines indicate the nature of the transfer (read/write, size, source,
destination), while the synchronization signals ensure that the sender and receiver on the
bus are aligned in time during a data transfer.
• Each component connected to a bus will respond to a particular range of memory
addresses.
• The ensemble of components can thus be represented in an address map, a list of all
system‐bus addresses relevant in the system‐on‐chip.
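In software, such an address map is often captured as a set of base‐address constants. A
minimal C sketch (the addresses below are hypothetical, chosen only to illustrate the idea):

/* Hypothetical address map of a segmented SoC bus. */
#define DDR_BASE       0x00000000u  /* high-speed memory interface   */
#define DMA_BASE       0x60000000u  /* DMA controller (bus master)   */
#define TIMER_BASE     0x80000000u  /* peripheral bus: timer         */
#define UART_BASE      0x80001000u  /* peripheral bus: serial I/O    */
#define CUSTOM_HW_BASE 0x80002000u  /* memory-mapped custom hardware */

Each component responds only to reads and writes that fall inside its own address range.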
SoC Concepts
• It is common to split SoC busses into segments.
• Each segment connects a limited number of components, grouped according to their
communication needs.
• In the example, a high‐speed communication bus is used to interconnect the microprocessor,
a high‐speed memory interface, and a Direct Memory Access (DMA) controller.
• A DMA is a device specialized in performing block‐transfers on the bus, for example to copy
one memory region to another.
• Next to a high‐speed communication bus, you may also find a peripheral bus, intended for
lower‐speed components such as a timer and input–output peripherals.
• Segmented busses are interconnected with a bus bridge, a component that translates bus
transfers from one segment to another segment.
• A bus bridge will only selectively translate transfers from one bus segment to the other.
• This selection is done based on the address map.
SoC Concepts
• Therefore, bus segmentation increases the available communication parallelism.
• The bus control lines of each bus segment are under command of a bus master.
• The microprocessor and the ‘DMA controller’ are examples of bus masters.
• Other components, the bus slaves, will follow the directions of the bus master.
• Each bus segment can contain one or more bus masters.
• In case there are multiple masters, the identity of the bus master can be rotated among
bus‐master components at run time.
• In that case, a bus arbiter will be needed to decide which component can become a bus
master for a given bus transfer.
SoC Interfaces for Custom Hardware
• The shaded areas in the SoC block diagram correspond to places where a designer could
integrate custom hardware.
• “custom hardware module” means a dedicated digital machine described as an FSMD or
as a microprogrammed machine.
• Eventually, all custom hardware will be under control of the central processor in the SoC.
• The SoC architecture offers several possible hardware software interfaces to attach
custom hardware modules.
• Three approaches can be distinguished in the SoC block diagram .
SoC Interfaces for Custom Hardware
• The most general approach is to integrate a custom hardware module as a standard peripheral on
a system bus.
• The microprocessor communicates with the custom hardware module by means of read/write
memory accesses.
• The memory addresses occupied by the custom hardware module cannot be used for other
purposes (i.e., as addressable memory).
• For the memory addresses occupied by the custom hardware module, the microprocessor’s cache
has no meaning, and the caching effect is unwanted.
• Microcontroller chips with many different peripherals typically use this memory‐mapped strategy
to attach peripherals.
• The strong point of this approach is that a universal communication mechanism (memory
read/write operations) can be used for a wide range of custom hardware modules.
• The corresponding disadvantage, of course, is that such a bus‐based approach to integrate
hardware is not very scalable in terms of performance
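A minimal sketch of this memory‐mapped approach, assuming a hypothetical custom module
with a control/status register and a data register at fixed bus addresses:

#include <stdint.h>

#define HW_CTRL (*(volatile uint32_t *)0x80002000) /* assumed address */
#define HW_DATA (*(volatile uint32_t *)0x80002004) /* assumed address */

#define HW_START 0x1u  /* hypothetical control bit */
#define HW_DONE  0x2u  /* hypothetical status bit  */

uint32_t run_custom_hw(uint32_t operand)
{
    HW_DATA = operand;                 /* bus write: operand to module   */
    HW_CTRL = HW_START;                /* bus write: start computation   */
    while ((HW_CTRL & HW_DONE) == 0)
        ;                              /* bus reads: poll for completion */
    return HW_DATA;                    /* bus read: collect the result   */
}

The volatile qualifier forces every access onto the bus; consistent with the caching remark
above, this address range must also be mapped uncached.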
SoC Interfaces for Custom Hardware
• A second mechanism is to attach custom hardware through a local bus system or
coprocessor interface provided by the microprocessor.
• In this case, the communication between the hardware module and the microprocessor
will follow a dedicated protocol, defined by the local bus system or coprocessor
interface.
• In comparison to system‐bus interfaces, coprocessor interfaces have a high bandwidth
and a low latency.
• The microprocessor may also provide a dedicated set of instructions to communicate
over this interface.
• Typical coprocessor interfaces do not involve memory addresses.
• This type of coprocessor obviously requires a microprocessor with a coprocessor‐ or
local‐bus interface.
SoC Interfaces for Custom Hardware
• Microprocessors may also provide a means to integrate a custom‐hardware datapath
inside of the microprocessor.
• The instruction set of the microprocessor is then extended with additional, new
instructions to drive this custom hardware.
• The communication channel between the custom datapath and the processor is typically
through the processor register file, resulting in a very high communication bandwidth.
• However, the very tight integration of custom hardware with a microprocessor also
means that the traditional bottlenecks of the microprocessor are also a bottleneck for
the custom‐hardware modules.
• If the microprocessor is stalled because of external events (such as memory‐access
bandwidth), the custom datapath is stalled as well.
Four Design Principles in SoC Architecture
• A SoC is very specific to an application domain.
• Are there any guiding design principles that are relevant to the design of any SoC?
• The following are the four design principles that govern the majority of modern SoC architectures.
I. Heterogeneous and distributed communications
II. Heterogeneous and distributed data processing
III. Heterogeneous and distributed storage
IV. Hierarchical control.
Heterogeneous and Distributed Data Processing
• A first prominent characteristic of an SoC architecture is heterogeneous and distributed
data processing.
• An SoC may contain multiple independent (distributed) computational units.
• Moreover, these units can be heterogeneous and include FSMDs, microprogrammed
engines, or microprocessors.
• Three forms of data‐processing parallelism:
I. The first is word‐level parallelism, which enables the parallel processing of multiple bits in
a word.
II. The second is operation‐level parallelism, which allows multiple instructions to be
executed simultaneously.
III. The third is task‐level parallelism, which allows multiple independent threads of control to
be executed independently.
Heterogeneous and Distributed Data Processing
• Each of the computational units in an SoC can be specialized to a particular function.
• The overall SoC therefore includes a collection of heterogeneous computational units.
• For example, a digital signal processing chip in a camera may contain specialized units to
perform image processing.
• Computational specialization is the key to obtaining an efficient chip.
• In addition, the presence of all forms of parallelism (word‐level, operation‐level, task‐level)
ensures that an SoC can fully exploit the technology.
Heterogeneous and Distributed Data Processing
• In fact, integrated circuit technology is extremely effective at providing computational
parallelism.
• Consider the following numerical example.
• A 1‐bit full‐adder cell can be implemented in about 28 transistors.
• The Intel Core 2 processor contains 291 million transistors in 65 nm CMOS technology.
• This is sufficient to implement 325,000 32‐bit adders.
• Assuming a core clock frequency of 2.1 GHz, we thus find that the silicon used to create a
Core 2 can theoretically implement 682,000 Giga‐operations per second.
• We call this number the intrinsic computational efficiency of silicon:

E_intrinsic = (291 × 10^6) / (28 × 32) adders × 2.1 GHz ≈ 682,000 Gops
Heterogeneous and Distributed Data Processing
• The actual Core 2 architecture handles around 9.24 instructions per clock cycle, in a
single core and in the most optimal case.
• The actual efficiency of the 2.1 GHz Core 2 therefore is 19.4 Giga‐operations per second.
• We make the (strong) approximation that these 9.24 instructions each correspond to a
32‐bit addition, and call the resulting throughput the actual Core2 efficiency.
• The ratio of the intrinsic Core2 efficiency over the actual Core2 efficiency illustrates the
efficiency of silicon technology compared to the efficiency of a processor core
architecture.
Efficiency ratio = E_intrinsic / E_actual = 682,000 / 19.4 ≈ 35,150
• Therefore, bare silicon can implement computations 35,000 times more efficiently than a
Core2!
• It demonstrates why specialization of silicon using multiple, independent computational
units is so attractive.
Heterogeneous and Distributed Communications
• The central bus in a system‐on‐chip is a critical resource.
• It is shared by many components in an SoC.
• One approach to prevent this resource from becoming a bottleneck is to split the bus
into multiple bus segments using bus bridges.
• The bus bridge is therefore a mechanism to create distributed on‐chip communications.
• The on‐chip communication requirements typically show large variations over an SoC.
• Therefore, the SoC interconnection mechanisms should be heterogeneous as well.
• There may be shared busses, point‐to‐point connections, serial connections, and parallel
connections.
Heterogeneous and Distributed Communications
• Heterogeneous and distributed SoC communications enable a designer to exploit the on‐
chip communication bandwidth.
• In modern technology, this bandwidth is extremely high.
• Example:
• In a 90 nm 6‐layer metal process, we can reasonably assume that metal layers will be
used as follows:
• Two metal layers are used for power and ground, respectively,
• Two metal layers are used to route wires in the X direction,
• Two metal layers are used to route wires in the Y direction.
• The density of wires in a 90 nm process is 4 wires per micron, and the bit frequency is
500 MHz.
• Consequently, in a chip of 10 millimeters on the side, we will have 40,000 wires per metal
layer (4 wires/µm × 10,000 µm).
• With two routing layers per direction, such a chip can transport 80,000 bits in any
direction at a frequency of 500 MHz.
• This corresponds to 40 terabits per second.
Heterogeneous and Distributed Communications
• On‐chip bandwidth is thus cheap. The challenge is to efficiently organize it.
• Note that the same is not true for off‐chip bandwidth, i.e., off‐chip bandwidth is very
expensive.
• Paul Franzon describes an example of a chip that would produce 8 Tbps off‐chip bandwidth.
• The best links nowadays produce 20 Gbps by means of differential signaling.
• Here, each signal is produced in direct as well as complementary form, and thus consumes
two chip pins.
• At 20 Gbps, we would need 400 pairs to produce 8 Tbps, which requires 800 pins.
• In addition, you would need to add 800 pins to provide ground and power, just to provide
adequate drive capability and signal integrity.
• The 8 Tbps chip thus requires over 1600 pins, which indicates the package limits its
practicality.
• In addition, assuming about 30 mW per pair, the chip would consume 12 W for I/O
operations alone.
Heterogeneous and Distributed Storage
• A System‐on‐Chip contains multiple types of memories distributed across multiple
locations of a chip.
• For example,
• Microprocessors/FSMD/etc. contain one or more registers
• Microprocessors contain instruction‐caches and data‐caches (static RAM)
• Microprocessors and microprogrammed engines contain local instruction memories
• Dedicated buffer memories may be present to support a specific application, For
example, a video buffer in a camera, a scratchpad memory next to a processor
• Distributed storage can significantly complicate the concept of a centralized memory
address space which is common in classic computing architectures.
• Often, local memories are just local buffer storage, invisible for the other components of
the system.
Heterogeneous and Distributed Storage
• Therefore, local storage helps to increase parallelism in the overall system.
• However, very often, distributed memories need to maintain or communicate common
data sets.
(figure: a CPU and data memory on a system bus, with a vector‐multiplier coprocessor and
its local data buffer attached to the same bus)
• Consider the system above, that consists of a CPU with a vector multiplier coprocessor
(implemented as FSMD).
• The coprocessor can calculate very quickly the inner product of two vectors stored in a
local data buffer.
Heterogeneous and Distributed Storage
• Thus, for two arrays u and v stored in the data buffer, the vector multiplier evaluates:
result = Σ u[i] × v[i], the inner product of u and v
• Now consider how this system operates when the CPU performs a matrix multiplication
on matrices which are stored in the data memory.
• A matrix multiplication in software can be written as three nested loops:
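A minimal C sketch of those three nested loops, for N×N matrices a, b, and c:

for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        c[i][j] = 0;
        for (int k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];   /* row of a times column of b */
    }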
• In order to use the vector multiplier, we need to copy a row of a and a column of b to u
and v in the data buffer attached to the multiplier.
Heterogeneous and Distributed Storage
• We then perform the vector‐multiplication, and transmit the result to be stored in c[i][j].
• The C program on the CPU might look like:
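A hedged sketch of that program; u and v are the arrays in the coprocessor’s local data
buffer, and vector_multiply() is a hypothetical stand‐in for the actual coprocessor invocation:

for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            u[k] = a[i][k];   /* data movement: row of a over the bus    */
            v[k] = b[k][j];   /* data movement: column of b over the bus */
        }
        c[i][j] = vector_multiply(u, v, N);  /* hypothetical call */
    }

Note that the two inner copy loops run on the CPU before every coprocessor call.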
• A potential problem here is the excessive data movement in the system.
• In addition, this data movement is sequential to the execution of the coprocessor.
• The speedup of the system is determined by the combined coprocessor execution time
and data movement time.
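In a simple model (assuming, as stated above, no overlap between data transfers and
coprocessor computation), this can be written as:

Speedup = T_software / (T_data_movement + T_coprocessor)

Once T_data_movement dominates the denominator, a faster coprocessor no longer improves
the overall speedup.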
Heterogeneous and Distributed Storage
• The fundamental cause of the excessive data movement is the presence of distributed
storage in the SoC.
• When selecting coprocessor functions in a SoC, it is very important to keep the data
movement issue in mind.
Hierarchy of Control
• The final concept in the architecture of an SoC is the
hierarchy of control among components.
• A hierarchy of control means that the entire SoC
operates as a single logical entity.
• This implies that all components in an SoC will need
to synchronize at some point.
• In the vector‐multiplier example given above, the
vector‐multiplier hardware needs to be synchronized
to the software program.
• A typical exchange of commands between the CPU
and a Vector Multiplier is shown in the figure
• Time runs from top to bottom, and the activities in
each of the CPU and Vector Multiplier are annotated
on this time axis.
Hierarchy of Control
• In this organization, the CPU and the Vector Multiplier each work in turn (there is no
overlapped operation), but the CPU maintains centralized control.
• The vector multiplier waits for data and commands to be provided from the CPU.
• The design of a good control hierarchy is a challenging problem.
• On the one hand, it should exploit the distributed nature of the SoC as well as possible
• This implies doing many things in parallel.
• On the other hand, it should minimize the number of conflicts that arise as a result of
running things in parallel.
• Such conflicts can be the result of overloading the available bus‐system or memory
bandwidth, or overscheduling a coprocessor.
Portable Multimedia System
• Given the four properties mentioned above (distributed data processing, distributed
communications, distributed storage, hierarchy of control),
• let us look at an example of a hardware/software platform, and spot the interesting
elements as well as the potential bottlenecks.
• The block diagram below is a digital media processor by Texas Instruments.
• It is intended to be used for still images, video, and audio in portable (battery‐operated)
devices.
• This chip, the TMS320DM310, is manufactured in 130 nm CMOS.
• The entire chip consumes no more than 250 mW in default‐preview mode and 400 mW
when video encoding and decoding is operational
Portable Multimedia System
• The chip supports a number of device modes.
• Each mode corresponds to typical user activity.
• Live preview of images (coming from the CMOS imager) on the video display.
• Live‐video conversion to a compressed format (MPEG, MJPEG) and streaming of the result
into an external memory.
• Still‐image capturing of a high‐resolution image and conversion to JPEG.
• Live audio capturing and audio compression to MP3, WMA, or AAC.
• Video decode and playback of a recorded stream onto the video display.
• Still image decode and playback of a stored image onto the video display.
• Audio decode and playback.
• Photo printing of a stored image into a format suitable for a photo printer.
Portable Multimedia System (TMS320DM310 Functional Block Diagram)
Portable Multimedia System
• The chip contains four subsystems, roughly corresponding to each quadrant
in the figure.
Imaging/Video Subsystem
• The imaging/video subsystem contains a CCD interface, capable of sampling up to 40 MHz at 12
bits per pixel.
• The CCD sensor needs to provide high‐resolution still images (2 to 5 Mpixels) as well as moving
images (up to 30 frames/s at 640×480 pixels (SD resolution) or 1920×1080 pixels (HD
resolution)).
• Most CCD sensors record only a single color per pixel.
• Typically there are 25% red pixels, 25% blue pixels and 50% green pixels.
• This means that, before images can be processed, the missing pixels need to be filled in
(interpolated).
• This is done by the preview engine, and is a typical example of streaming and dedicated
processing.
Portable Multimedia System
Imaging/Video Subsystem
• The video subsystem also contains a video encoder, capable of merging two video
streams on screen, and providing picture‐in‐picture functionality.
• The video encoder also contains menu subsystem functionality.
• The output of the video encoder goes to an attached LCD or a TV.
• The video encoder provides approximately 100 operations per pixel, while the power
budget of the entire video subsystem is less than 100 mW.
• These numbers are clearly out of range for a software‐driven processor.
Portable Multimedia System
DSP Subsystem
• The DSP subsystem is created on top of a 72 MHz C54x processor with 128 Kbytes of RAM.
• The DSP processor performs the main processing and incorporates the control logic for the wide
range of signal processing algorithms.
• Signal processing algorithms include MPEG‐1, MPEG‐2, MPEG‐4, WMV, H.263, H.264, JPEG,
JPEG2K, M‐JPEG, MP3, AAC, WMA.
• A coprocessor subsystem delivers additional computing power for the cases where the DSP falls
short.
• There is a DMA engine that moves data back and forth between the memory attached to the
DSP and the coprocessors.
Portable Multimedia System
DSP Subsystem
• There are three coprocessors:
• A SIMD‐type of coprocessor to provide vector‐processing for image processing algorithms.
• A quantization coprocessor to perform quantization in various image encoding algorithms.
• A coprocessor that performs Huffman encoding for image encoding standards.
• The coprocessor subsystem increases the overall processing parallelism of the chip, as the
coprocessors can work concurrently with the DSP processor.
• This allows the system clock to be decreased.
Portable Multimedia System
The ARM Subsystem
• Serves as the overall system manager that synchronizes and controls the different
subcomponents of the system.
• It also provides interfaces for data I/O, and user interface support.
Portable Multimedia System
• Each of the four properties discussed in the previous section can be identified in this chip.
• The SoC contains heterogeneous and distributed processing. There is hardwired processing
(video subsystem), signal processing (DSP), and general‐purpose processing on an ARM
processor.
• All of this processing can have overlapped activity.
• The SoC contains heterogeneous and distributed interconnect.
• Instead of a single central bus, there is a central “switchbox” that multiplexes accesses to the off‐
chip memory.
• Where needed, additional dedicated interconnections are implemented.
• Some examples of dedicated interconnections include the bus between the DSP and its
instruction memory, the bus between the ARM and its instruction memory, and the bus between
the coprocessors and their image buffers.
Portable Multimedia System
• The SoC contains heterogeneous and distributed storage. The bulk of the memory is contained
within an off‐chip SDRAM module, but there are also dedicated instruction memories attached
to the TI DSP and the ARM, and there are dedicated data memories acting as small dedicated
buffers.
• Finally, there is a hierarchy of control that ensures the overall parallelism in the architecture is
optimal.
• The ARM will start/stop components and control data streams depending on the mode of the
device.
• The DM310 chip is an excellent example of the balancing effort required to support real‐time
video and audio in a portable device.